Video-to-Text Conversion Using FFmpeg and Whisper: A Two-Stage Approach

Introduction

Extracting meaningful textual content from video files has become a critical capability in modern AI applications. The approach described here uses FFmpeg for audio extraction followed by Whisper for speech recognition, forming a robust two-stage pipeline for video understanding.

FFmpeg Overview

FFmpeg is a powerful open-source multimedia framework capable of processing audio and video streams. It handles transcoding, format conversion, stream splitting, and merging operations across multiple platforms.

Core Functionality

  • Stream Parsing: FFmpeg parses various container and audio formats (MP4, MKV, AVI, MP3, OGG) into its internal unified representation.
  • Encoding/Decoding: The framework supports numerous codecs like H.264 for video compression and AAC for audio compression.
  • Filtering System: Built-in filters enable video cropping, rotation, scaling, and audio effects processing.
  • Muxing/Demuxing: FFmpeg can combine multiple audio/video streams into a single file or extract individual streams (see the ffprobe sketch after this list).
  • Parallel Processing: Multi-threading enables concurrent encoding tasks for improved throughput.
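
A quick way to see these streams before extracting anything is ffprobe, which ships with FFmpeg. A minimal sketch (input.mp4 is a placeholder):

ffprobe -v error -show_entries stream=index,codec_name,codec_type -of default=noprint_wrappers=1 input.mp4

This prints one block per stream, making it easy to confirm that a file actually contains an audio track before running the extraction below.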

Basic Audio Extraction Command

ffmpeg -i input.mp4 -vn -ar 44100 -ac 2 -ab 192k -f mp3 output.mp3

Parameters:

  • -i input.mp4: Specifies the input file
  • -vn: Disables video recording (audio-only output)
  • -ar 44100: Sets sample rate to 44.1kHz
  • -ac 2: Configures stereo output (2 channels)
  • -ab 192k: Sets audio bitrate to 192kbps
  • -f mp3: Forces MP3 output format

Implementation: Two-Stage Video-to-Text Pipeline

Environment Setup

FFmpeg installation via apt:

sudo apt-get update && sudo apt-get install ffmpeg

Create a dedicated conda environment:

conda create -n video2text python=3.11
conda activate video2text

Install the transformers library:

pip install transformers -i https://mirrors.cloud.tencent.com/pypi/simple
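
The ASR pipeline also needs a deep-learning backend. If PyTorch is not already in the environment, install it the same way (the mirror flag is optional):

pip install torch -i https://mirrors.cloud.tencent.com/pypi/simple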

Model Initialization

Configure HuggingFace mirror for faster downloads and initialize the Whisper pipeline:

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")
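
By default the pipeline runs on CPU. The transformers pipeline accepts a device argument for GPU placement and a chunk_length_s argument for chunked long-form inference; a sketch, assuming a CUDA device is visible:

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-medium",
    device=0,            # first visible CUDA device (an index into CUDA_VISIBLE_DEVICES)
    chunk_length_s=30,   # process long audio in 30-second chunks
)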

Whisper model variants:

Model     Parameters   Multilingual   VRAM     Speed
tiny      39M          Yes            ~1GB     Fastest
base      74M          Yes            ~1GB     Fast
small     244M         Yes            ~2GB     Medium
medium    769M         Yes            ~5GB     Slow
large     1550M        Yes            ~10GB    Slowest

Audio Extraction: Method 1 — Subprocess

import subprocess

def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.
    
    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    ffmpeg_cmd = [
        'ffmpeg', '-i', input_file,
        '-vn', '-acodec', 'libmp3lame', output_file
    ]
    
    try:
        subprocess.run(ffmpeg_cmd, check=True)
        print(f"Extracted audio from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing failed: {e}")

Audio Extraction: Method 2 — ffmpeg-python

Install the Python wrapper:

pip install ffmpeg-python -i https://mirrors.cloud.tencent.com/pypi/simple

The extraction then becomes:

import ffmpeg

def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.
    
    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    try:
        ffmpeg.input(input_file).output(
            output_file, 
            acodec="libmp3lame", 
            ac=2, 
            ar="44100"
        ).run()
        print(f"Extracted audio from {input_file} to {output_file}")
    except Exception as e:
        print(f"Processing failed: {e}")

Speech-to-Text Conversion

from transformers import pipeline

def speech2text(audio_file):
    """
    Transcribe an audio file to text with Whisper.
    
    Args:
        audio_file: Path to the audio file to transcribe
    
    Returns:
        Dict containing the transcription under the "text" key
    """
    transcriber = pipeline(
        task="automatic-speech-recognition", 
        model="openai/whisper-medium"
    )
    result = transcriber(audio_file)
    return result
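
Whisper processes audio in 30-second windows, so for longer recordings either build the pipeline with chunk_length_s (as shown earlier) or request timestamps, which also returns per-segment results under a "chunks" key. A sketch:

result = transcriber(audio_file, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])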

Complete Pipeline Implementation

import os
import argparse
import subprocess

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import pipeline

def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.
    
    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    ffmpeg_cmd = [
        'ffmpeg', '-i', input_file,
        '-vn', '-acodec', 'libmp3lame', output_file
    ]
    
    try:
        subprocess.run(ffmpeg_cmd, check=True)
        print(f"Extracted audio from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing failed: {e}")


def speech2text(audio_file):
    """
    Transcribe an audio file to text with Whisper.
    
    Args:
        audio_file: Path to the audio file to transcribe
    
    Returns:
        Dict containing the transcription under the "text" key
    """
    transcriber = pipeline(
        task="automatic-speech-recognition", 
        model="openai/whisper-medium"
    )
    result = transcriber(audio_file)
    return result


def main():
    parser = argparse.ArgumentParser(description="Video to Text Conversion")
    parser.add_argument("--video", "-v", type=str, help="Input video file path")
    parser.add_argument("--audio", "-a", type=str, help="Output audio file path")
    args = parser.parse_args()
    
    print(args)
    
    extract_audio(args.video, args.audio)
    result = speech2text(args.audio)
    print("Transcribed text:\n" + result["text"])


if __name__ == "__main__":
    main()

Usage

python video2text.py --video input.mp4 --audio output.mp3

The pipeline first extracts audio using FFmpeg, then passes the audio file to Whisper for transcription. The final output contains the textual representation of all speech detected in the video.

API Deployment

For production deployments, wrap the pipeline with FastAPI:

from fastapi import FastAPI

from video2text import extract_audio, speech2text  # the functions defined above

app = FastAPI()

@app.post("/transcribe")
async def transcribe_video(video_path: str):
    # Plain str arguments are read from the query string in FastAPI.
    # A fixed temp path means concurrent requests would collide; fine for a sketch.
    audio_path = "temp_audio.mp3"
    extract_audio(video_path, audio_path)
    result = speech2text(audio_path)
    return {"text": result["text"]}

Output Format

The Whisper model returns a dictionary:

{"text": "transcribed content here", ...}

The transcribed text can then be fed into large language models for summarization, question answering, or other downstream NLP tasks.
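
As a toy illustration of that downstream step, the transcript can be fed straight into another transformers pipeline; the summarization model named here is just an example, and long transcripts may need to be split to fit its input limit:

summarizer = pipeline(task="summarization", model="facebook/bart-large-cnn")
summary = summarizer(result["text"], max_length=130, min_length=30)[0]["summary_text"]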

Architecture Summary

Video File → FFmpeg Audio Extraction → MP3 File → Whisper ASR → Text Output → LLM Processing

This two-stage architecture separates concerns effectively: FFmpeg handles multimedia processing while Whisper focuses on speech recognition. Both components are well-established, production-ready tools with extensive community support.

Tags: ffmpeg Whisper Video Processing Speech Recognition ASR

Posted on Sun, 10 May 2026 05:25:05 +0000 by enterume