Introduction
Extracting meaningful textual content from video files has become a critical capability in modern AI applications. This approach leverages FFmpeg for audio extraction followed by Whisper for speech recognition, creating a robust two-stage pipeline for video understanding.
FFmpeg Overview
FFmpeg is a powerful open-source multimedia framework capable of processing audio and video streams. It handles transcoding, format conversion, stream splitting, and merging operations across multiple platforms.
Core Functionality
- Stream Parsing: FFmpeg decodes various container formats (MP4, MKV, AVI, MP3, OGG) into its internal unified representation (see the ffprobe sketch after this list).
- Encoding/Decoding: The framework supports numerous codecs like H.264 for video compression and AAC for audio compression.
- Filtering System: Built-in filters enable video cropping, rotation, scaling, and audio effects processing.
- Muxing/Demuxing: FFmpeg can combine multiple audio/video streams into a single file or extract individual streams.
- Parallel Processing: Multi-threading enables concurrent encoding tasks for improved throughput.
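To make the stream-parsing point concrete, the ffprobe tool that ships with FFmpeg can dump per-stream metadata as JSON. A minimal sketch (the input.mp4 filename and the probe_streams helper are illustrative):

import json
import subprocess

def probe_streams(path):
    # ffprobe ships with FFmpeg; -show_streams emits per-stream metadata as JSON
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["streams"]

for stream in probe_streams("input.mp4"):
    print(stream["index"], stream["codec_type"], stream.get("codec_name"))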
Basic Audio Extraction Command
ffmpeg -i input.mp4 -vn -ar 44100 -ac 2 -ab 192k -f mp3 output.mp3
Parameters:
- -i input.mp4: Specifies the input file
- -vn: Disables video recording (audio-only output)
- -ar 44100: Sets the sample rate to 44.1 kHz
- -ac 2: Configures stereo output (2 channels)
- -ab 192k: Sets the audio bitrate to 192 kbps
- -f mp3: Forces MP3 output format
Implementation: Two-Stage Video-to-Text Pipeline
Environment Setup
FFmpeg installation via apt:
sudo apt-get update && sudo apt-get install ffmpeg
Create a dedicated conda environment:
conda create -n video2text python=3.11
conda activate video2text
Install the transformers library:
pip install transformers -i https://mirrors.cloud.tencent.com/pypi/simple
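Note that the transformers ASR pipeline also needs a deep-learning backend such as PyTorch (pip install torch) if one is not already present, and it shells out to the ffmpeg binary to decode audio files. A quick sanity check before running the pipeline, as a sketch:

import shutil

# Confirm the ffmpeg binary is on PATH and that transformers imports cleanly.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"

import transformers
print(transformers.__version__)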
Model Initialization
Configure HuggingFace mirror for faster downloads and initialize the Whisper pipeline:
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
from transformers import pipeline
transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")
Whisper model variants:
| Model | Parameters | Multilingual | VRAM | Speed |
|---|---|---|---|---|
| tiny | 39M | Yes | ~1GB | Fastest |
| base | 74M | Yes | ~1GB | Fast |
| small | 244M | Yes | ~2GB | Medium |
| medium | 769M | Yes | ~5GB | Slow |
| large | 1550M | Yes | ~10GB | Slowest |
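For quick experiments, a smaller variant from the table trades accuracy for speed. A sketch (the optional device=0 argument places the model on the first visible GPU):

from transformers import pipeline

# whisper-tiny is the fastest variant listed above; swap in a larger
# model name once the pipeline works end to end.
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-tiny",
    device=0,
)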
Audio Extraction: Method 1 — Subprocess
import subprocess
def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.

    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    ffmpeg_cmd = [
        'ffmpeg', '-i', input_file,
        '-vn', '-acodec', 'libmp3lame', output_file
    ]
    try:
        subprocess.run(ffmpeg_cmd, check=True)
        print(f"Extracted audio from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing failed: {e}")
Audio Extraction: Method 2 — ffmpeg-python
Install the Python wrapper:
pip install ffmpeg-python -i https://mirrors.cloud.tencent.com/pypi/simple
import ffmpeg
def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.

    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    try:
        ffmpeg.input(input_file).output(
            output_file,
            acodec="libmp3lame",
            ac=2,
            ar="44100"
        ).run()
        print(f"Extracted audio from {input_file} to {output_file}")
    except ffmpeg.Error as e:
        print(f"Processing failed: {e}")
Speech-to-Text Conversion
from transformers import pipeline
def speech2text(audio_file):
    transcriber = pipeline(
        task="automatic-speech-recognition",
        model="openai/whisper-medium"
    )
    result = transcriber(audio_file)
    return result
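Whisper models operate on 30-second windows internally, so long recordings benefit from chunked inference. A sketch using the pipeline's chunk_length_s and return_timestamps parameters (lecture.mp3 is a placeholder):

from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-medium",
    chunk_length_s=30,  # split long audio into 30-second chunks
)
result = transcriber("lecture.mp3", return_timestamps=True)
print(result["text"])           # full transcript
for chunk in result["chunks"]:  # per-chunk text with (start, end) times
    print(chunk["timestamp"], chunk["text"])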
Complete Pipeline Implementation
import os
import argparse
import subprocess
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
from transformers import pipeline
def extract_audio(input_file, output_file):
    """
    Extract audio track from video file and save as MP3.

    Args:
        input_file: Path to input video file
        output_file: Path for output MP3 file
    """
    ffmpeg_cmd = [
        'ffmpeg', '-i', input_file,
        '-vn', '-acodec', 'libmp3lame', output_file
    ]
    try:
        subprocess.run(ffmpeg_cmd, check=True)
        print(f"Extracted audio from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing failed: {e}")

def speech2text(audio_file):
    transcriber = pipeline(
        task="automatic-speech-recognition",
        model="openai/whisper-medium"
    )
    result = transcriber(audio_file)
    return result

def main():
    parser = argparse.ArgumentParser(description="Video to Text Conversion")
    parser.add_argument("--video", "-v", type=str, required=True, help="Input video file path")
    parser.add_argument("--audio", "-a", type=str, required=True, help="Output audio file path")
    args = parser.parse_args()
    print(args)
    extract_audio(args.video, args.audio)
    result = speech2text(args.audio)
    print("Transcribed text:\n" + result["text"])

if __name__ == "__main__":
    main()
Usage
python video2text.py --video input.mp4 --audio output.mp3
The pipeline first extracts audio using FFmpeg, then passes the audio file to Whisper for transcription. The final output contains the textual representation of all speech detected in the video.
API Deployment
For production deployments, wrap the pipeline with FastAPI:
from fastapi import FastAPI

app = FastAPI()

@app.post("/transcribe")
async def transcribe_video(video_path: str):
    audio_path = "temp_audio.mp3"
    extract_audio(video_path, audio_path)
    result = speech2text(audio_path)
    return {"text": result["text"]}
Output Format
The Whisper pipeline returns a dictionary whose "text" key holds the full transcript (and, with return_timestamps=True, a "chunks" list of timestamped segments):
{"text": "transcribed content here", ...}
The transcribed text can then be fed into large language models for summarization, question answering, or other downstream NLP tasks.
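As an illustration of that downstream step, the transcript can be piped into a summarization pipeline (the model choice here is only an example, and long transcripts would need to be split to fit the model's input limit):

from transformers import pipeline

summarizer = pipeline(task="summarization", model="facebook/bart-large-cnn")
transcript = result["text"]  # output of speech2text above
summary = summarizer(transcript, max_length=130, min_length=30)
print(summary[0]["summary_text"])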
Architecture Summary
Video File → FFmpeg Audio Extraction → MP3 File → Whisper ASR → Text Output → LLM Processing
This two-stage architecture separates concerns effectively: FFmpeg handles multimedia processing while Whisper focuses on speech recognition. Both components are well-established, production-ready tools with extensive community support.