Video-to-Text Conversion Using FFmpeg and Whisper: A Two-Stage Approach

Introduction

Extracting meaningful textual content from video files has become a critical capability in modern AI applications. The approach described here uses FFmpeg for audio extraction followed by Whisper for speech recognition, forming a robust two-stage pipeline for video understanding.

FFmpeg Overview

FFmpeg is a powerful open-source multimedia framework capable of processing audio and video streams. It handles transcoding, format conversion, stream splitting, and merging operations across multiple platforms.

Core Functionality

  • Stream Parsing: FFmpeg parses various container and audio formats (MP4, MKV, AVI, MP3, OGG) into its internal unified representation.
  • Encoding/Decoding: The framework supports numerous codecs like H.264 for video compression and AAC for audio compression.
  • Filtering System: Built-in filters enable video cropping, rotation, scaling, and audio effects processing.
  • Muxing/Demuxing: FFmpeg can combine multiple audio/video streams into a single file or extract individual streams (see the ffprobe sketch after this list).
  • Parallel Processing: Multi-threading enables concurrent encoding tasks for improved throughput.
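
A quick way to see these streams before extracting anything is ffprobe, which ships with FFmpeg. A minimal sketch (input.mp4 is a placeholder):

ffprobe -v error -show_entries stream=index,codec_name,codec_type -of default=noprint_wrappers=1 input.mp4

This prints one block per stream, making it easy to confirm that a file actually contains an audio track before running the extraction below.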

Basic Audio Extraction Command

ffmpeg -i input.mp4 -vn -ar 44100 -ac 2 -ab 192k -f mp3 output.mp3

Parameters:

  • -i input.mp4: Specifies the input file
  • -vn: Disables video recording (audio-only output)
  • -ar 44100: Sets sample rate to 44.1kHz
  • -ac 2: Configures stereo output (2 channels)
  • -ab 192k: Sets audio bitrate to 192kbps
  • -f mp3: Forces MP3 output format

Implementation: Two-Stage Video-to-Text Pipeline

Environment Setup

FFmpeg installation via apt:

sudo apt-get update && sudo apt-get install ffmpeg

Create a dedicated conda environment:

conda create -n video2text python=3.11
conda activate video2text

Install the transformers library:

pip install transformers -i https://mirrors.cloud.tencent.com/pypi/simple
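
The ASR pipeline also needs a deep-learning backend. If PyTorch is not already in the environment, install it the same way (the mirror flag is optional):

pip install torch -i https://mirrors.cloud.tencent.com/pypi/simple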

Model Initialization

Configure HuggingFace mirror for faster downloads and initialize the Whisper pipeline:

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")
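
By default the pipeline runs on CPU. The transformers pipeline accepts a device argument for GPU placement and a chunk_length_s argument for chunked long-form inference; a sketch, assuming a CUDA device is visible:

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-medium",
    device=0,            # first visible CUDA device (an index into CUDA_VISIBLE_DEVICES)
    chunk_length_s=30,   # process long audio in 30-second chunks
)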

Whisper model variants:

Model     Parameters   Multilingual   VRAM     Speed
tiny      39M          Yes            ~1GB     Fastest
base      74M          Yes            ~1GB     Fast
small     244M         Yes            ~2GB     Medium
medium    769M         Yes            ~5GB     Slow
large     1550M        Yes            ~10GB    Slowest

Audio Extraction: Method 1 — Subprocess

import subprocess

def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.
    
    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    ffmpeg_cmd = [
        'ffmpeg', '-i', input_file,
        '-vn', '-acodec', 'libmp3lame', output_file
    ]
    
    try:
        subprocess.run(ffmpeg_cmd, check=True)
        print(f"Extracted audio from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing failed: {e}")

Audio Extraction: Method 2 — ffmpeg-python

Install the Python wrapper:

pip install ffmpeg-python -i https://mirrors.cloud.tencent.com/pypi/simple

The extraction then becomes:

import ffmpeg

def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.
    
    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    try:
        ffmpeg.input(input_file).output(
            output_file, 
            acodec="libmp3lame", 
            ac=2, 
            ar="44100"
        ).run()
        print(f"Extracted audio from {input_file} to {output_file}")
    except Exception as e:
        print(f"Processing failed: {e}")

Speech-to-Text Conversion

from transformers import pipeline

def speech2text(audio_file):
    """
    Transcribe an audio file to text with Whisper.
    
    Args:
        audio_file: Path to the audio file to transcribe
    
    Returns:
        Dict containing the transcription under the "text" key
    """
    transcriber = pipeline(
        task="automatic-speech-recognition", 
        model="openai/whisper-medium"
    )
    result = transcriber(audio_file)
    return result
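
Whisper processes audio in 30-second windows, so for longer recordings either build the pipeline with chunk_length_s (as shown earlier) or request timestamps, which also returns per-segment results under a "chunks" key. A sketch:

result = transcriber(audio_file, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])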

Complete Pipeline Implementation

import os
import argparse
import subprocess

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import pipeline

def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.
    
    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    ffmpeg_cmd = [
        'ffmpeg', '-i', input_file,
        '-vn', '-acodec', 'libmp3lame', output_file
    ]
    
    try:
        subprocess.run(ffmpeg_cmd, check=True)
        print(f"Extracted audio from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing failed: {e}")


def speech2text(audio_file):
    """
    Transcribe an audio file to text with Whisper.
    
    Args:
        audio_file: Path to the audio file to transcribe
    
    Returns:
        Dict containing the transcription under the "text" key
    """
    transcriber = pipeline(
        task="automatic-speech-recognition", 
        model="openai/whisper-medium"
    )
    result = transcriber(audio_file)
    return result


def main():
    parser = argparse.ArgumentParser(description="Video to Text Conversion")
    parser.add_argument("--video", "-v", type=str, help="Input video file path")
    parser.add_argument("--audio", "-a", type=str, help="Output audio file path")
    args = parser.parse_args()
    
    print(args)
    
    extract_audio(args.video, args.audio)
    result = speech2text(args.audio)
    print("Transcribed text:\n" + result["text"])


if __name__ == "__main__":
    main()

Usage

python video2text.py --video input.mp4 --audio output.mp3

The pipeline first extracts audio using FFmpeg, then passes the audio file to Whisper for transcription. The final output contains the textual representation of all speech detected in the video.

API Deployment

For production deployments, wrap the pipeline with FastAPI:

from fastapi import FastAPI

from video2text import extract_audio, speech2text  # the functions defined above

app = FastAPI()

@app.post("/transcribe")
async def transcribe_video(video_path: str):
    # Plain str arguments are read from the query string in FastAPI.
    # A fixed temp path means concurrent requests would collide; fine for a sketch.
    audio_path = "temp_audio.mp3"
    extract_audio(video_path, audio_path)
    result = speech2text(audio_path)
    return {"text": result["text"]}

Output Format

The Whisper model returns a dictionary:

{"text": "transcribed content here", ...}

The transcribed text can then be fed into large language models for summarization, question answering, or other downstream NLP tasks.
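
As a toy illustration of that downstream step, the transcript can be fed straight into another transformers pipeline; the summarization model named here is just an example, and long transcripts may need to be split to fit its input limit:

summarizer = pipeline(task="summarization", model="facebook/bart-large-cnn")
summary = summarizer(result["text"], max_length=130, min_length=30)[0]["summary_text"]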

Architecture Summary

Video File → FFmpeg Audio Extraction → MP3 File → Whisper ASR → Text Output → LLM Processing

This two-stage architecture separates concerns effectively: FFmpeg handles multimedia processing while Whisper focuses on speech recognition. Both components are well-established, production-ready tools with extensive community support.

Tags: ffmpeg Whisper Video Processing Speech Recognition ASR

Posted on Sun, 10 May 2026 05:25:05 +0000 by enterume