Faster Whisper is an optimized implementation of OpenAI's Whisper, a general-purpose speech recognition model. Whisper is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition as well as speech translation and language identification.

For longer audio files (>10 minutes) not in English, it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option.

Max audio file length: 1800 s

Whisper - Model (for audio)
Whisper - Language
M2M100 - Model (for translate)
M2M100 - Language
Task

Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.

VAD

Extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment.

If word_timestamps is True, underline each word as it is spoken in the SRT and VTT outputs.
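
Word-level timestamps can also be requested when calling the Whisper implementation directly. Below is a minimal sketch using the faster-whisper API; the model size and audio file name are placeholders:

```python
# Sketch: requesting word-level timestamps via faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("base")
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        # Each word carries its own start/end time, which can be used to
        # underline the word as it is spoken in the SRT/VTT output.
        print(f"{word.start:.2f}-{word.end:.2f}: {word.word}")
```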

Filter the Whisper transcription results using the conditions below. It is recommended to enable this feature when using the large-v3 model, to avoid hallucinations.

Whether to perform speaker diarization

Diarization Version

The pyannote.audio speaker diarization pipeline v3.1 is expected to be much better (and faster) than v2.x; see the pyannote.audio benchmark for details.

Read the documentation here.

# Standard Options

To transcribe or translate an audio file, you can either copy a URL from a website (all websites supported by YT-DLP will work, including YouTube), upload an audio file (choose "All Files (*.*)" in the file selector to select any file type, including video files), or use the microphone.

For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option, especially if you are using the large-v1 model. Note that large-v2 is a lot more forgiving, but you may still want to use a VAD with a slightly higher "VAD - Max Merge Size (s)" (60 seconds or more).

Model

Select the model that Whisper will use to transcribe the audio:

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|------|------------|--------------------|--------------------|---------------|----------------|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
| large-v2 | 1550 M | N/A | large-v2 | ~10 GB | 1x |
| large-v3 | 1550 M | N/A | large-v3 | ~10 GB | 1x |
| turbo | 809 M | N/A | turbo | ~6 GB | 8x |

Language

Select the language, or leave it empty for Whisper to automatically detect it.

Note that if the selected language and the language in the audio differ, Whisper may start to translate the audio into the selected language. For instance, if the audio is in English but you select Japanese, the model may translate the audio to Japanese.

Inputs

The options "URL (YouTube, etc.)", "Upload Files" or "Micriphone Input" allows you to send an audio input to the model.

Multiple Files

Note that the UI will only process either the given URL or the uploaded files (including microphone input) - not both.

You can, however, upload multiple files either through the "Upload files" option, or as a playlist on YouTube. Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section. When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.

Task

Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
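
For reference, this is roughly how the two tasks map onto a direct call to the faster-whisper API; the model size, audio path, and option values below are placeholders, not the exact code used by the UI:

```python
# Sketch: "transcribe" vs. "translate" with faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# task="transcribe" keeps the original language; task="translate" outputs English.
# language=None lets Whisper detect the spoken language automatically.
segments, info = model.transcribe("audio.mp3", task="transcribe", language=None)

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```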

VAD

Using a VAD will improve the timing accuracy of each transcribed line, and it prevents Whisper from getting stuck in an infinite loop detecting the same sentence over and over again. The downside is that this may come at a cost to text accuracy, especially with regard to unique words or names that appear in the audio. You can compensate for this by increasing the prompt window.

Note that English is very well handled by Whisper, and it's less susceptible to issues surrounding bad timings and infinite loops. So you may only need to use a VAD for other languages, such as Japanese, or when the audio is very long. A minimal sketch of how Silero VAD detects speech sections is shown after the list below.

  • none
    • Run whisper on the entire audio input
  • silero-vad
    • Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Whisper is also run on the gaps between each speech section, either by expanding the section up to the max merge size, or by running Whisper independently on the non-speech section.
  • silero-vad-expand-into-gaps
    • Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections 00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 01:00.
  • silero-vad-skip-gaps
    • As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but may cause dialogue to be skipped.
  • periodic-vad
    • Create sections of speech every 'VAD - Max Merge Size' seconds. This is very fast and simple, but will potentially break a sentence or word in two.
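
As mentioned above, here is a minimal sketch of how Silero VAD can be used to find speech sections before running Whisper on each of them. This is an illustration only; the audio path is a placeholder, and the actual implementation also handles merging, padding, and gaps:

```python
# Sketch: detecting speech sections with Silero VAD (loaded from torch.hub).
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

SAMPLING_RATE = 16000
wav = read_audio("audio.wav", sampling_rate=SAMPLING_RATE)

# Returns a list of {"start": sample_index, "end": sample_index} speech sections.
speech_sections = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)

for section in speech_sections:
    start_s = section["start"] / SAMPLING_RATE
    end_s = section["end"] / SAMPLING_RATE
    print(f"speech: {start_s:.1f}s - {end_s:.1f}s")  # each section is then transcribed separately
```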

VAD - Merge Window

If set, any adjacent speech sections that are at most this number of seconds apart will be automatically merged.

VAD - Max Merge Size (s)

Prevents adjacent speech sections from being merged if the combined section would be longer than this number of seconds.

VAD - Process Timeout (s)

This configures the number of seconds until a process is killed due to inactivity, freeing RAM and video memory. The default value is 30 minutes.

VAD - Padding (s)

The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number larger than zero ensures that Whisper is more likely to correctly transcribe a sentence in the beginning of a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp to each transcribed line. The default value is 1 second.

VAD - Prompt Window (s)

The text of a detected line will be included as a prompt to the next speech section, if the speech section starts at most this number of seconds after the line has finished. For instance, if a line ends at 10:00, and the next speech section starts at 10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).

Note that detected lines in gaps between speech sections will not be included in the prompt (if silero-vad or silero-vad-expand-into-gaps is used).

Diarization

If checked, Pyannote will be used to detect speakers in the audio, and label them as (SPEAKER 00), (SPEAKER 01), etc.

This requires a HuggingFace API key to function, which can be supplied with the --auth_token command line option for the CLI, set in the config.json5 file for the GUI, or provided via the HF_ACCESS_TOKEN environment variable.

Diarization - Speakers

The number of speakers to detect. If set to 0, Pyannote will attempt to detect the number of speakers automatically.
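
For reference, a minimal sketch of running the pyannote.audio 3.1 pipeline directly is shown below. The token source, audio path, and speaker count are placeholders; the UI wires this up for you:

```python
# Sketch: speaker diarization with pyannote.audio (requires a HuggingFace token).
import os
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ.get("HF_ACCESS_TOKEN"),
)

# Omit num_speakers to let Pyannote detect the number of speakers automatically.
diarization = pipeline("audio.wav", num_speakers=2)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")  # e.g. SPEAKER_00, SPEAKER_01
```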

Command Line Options

Both app.py and cli.py also accept command line options, such as the ability to enable parallel execution on multiple CPU/GPU cores, the default model name/VAD and so on. Consult the README in the root folder for more information.

Additional Options

In addition to the above, there's also a "Full" options interface that allows you to set all the options available in the Whisper model. The options are as follows:

Initial Prompt

Optional text to provide as a prompt for the first 30-second window. Whisper will attempt to use this as a starting point for the transcription, but you can also get creative and specify a style or format for the output of the transcription.

For instance, if you use the prompt "hello how is it going always use lowercase no punctuation goodbye one two three start stop i you me they", Whisper will be biased to output lowercase letters and no punctuation, and may also be biased to output the words in the prompt more often.
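
A minimal sketch of passing an initial prompt when calling Whisper directly (the prompt text and audio path are placeholders):

```python
# Sketch: biasing the output style with an initial prompt (openai-whisper API).
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "audio.mp3",
    initial_prompt="hello how is it going always use lowercase no punctuation goodbye",
)
print(result["text"])  # the output tends to follow the style of the prompt
```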

Temperature

The temperature to use when sampling. Default is 0 (zero). A higher temperature will result in more random output, while a lower temperature will be more deterministic.

Best Of - Non-zero temperature

The number of candidates to sample from when sampling with non-zero temperature. Default is 5.

Beam Size - Zero temperature

The number of beams to use in beam search when sampling with zero temperature. Default is 5.

Patience - Zero temperature

The patience value to use in beam search when sampling with zero temperature. As in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search.

Length Penalty - Any temperature

The token length penalty coefficient (alpha) to use when sampling with any temperature. As in https://arxiv.org/abs/1609.08144, uses simple length normalization by default.
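
A sketch of how the sampling options above map onto a direct Whisper call; the values shown are the documented defaults and the audio path is a placeholder:

```python
# Sketch: temperature, best-of, beam size, patience and length penalty in one call.
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,      # deterministic decoding
    beam_size=5,          # used when temperature == 0 (beam search)
    best_of=5,            # used when temperature > 0 (number of sampled candidates)
    patience=1.0,         # 1.0 is equivalent to conventional beam search
    length_penalty=None,  # None means simple length normalization
)
print(result["text"])
```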

Suppress Tokens - Comma-separated list of token IDs

A comma-separated list of token IDs to suppress during sampling. The default value of "-1" will suppress most special characters except common punctuation.

Condition on previous text

If True, provide the previous output of the model as a prompt for the next window. Disabling this may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.

FP16

Whether to perform inference in fp16. True by default.

Temperature increment on fallback

The amount to increase the temperature by when decoding fails to meet either of the thresholds below. Default is 0.2.

Compression ratio threshold

If the gzip compression ratio is higher than this value, treat the decoding as failed. Default is 2.4.

Logprob threshold

If the average log probability is lower than this value, treat the decoding as failed. Default is -1.0.

No speech threshold

If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to logprob_threshold, consider the segment as silence. Default is 0.6.
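
Together, these thresholds drive Whisper's fallback behavior: decoding is retried at increasing temperatures until the result passes the quality checks. A simplified sketch of that logic (the decode callable is a placeholder for a single decoding pass):

```python
# Simplified sketch of Whisper's decode-with-fallback loop.
def decode_with_fallback(decode, temperature_increment=0.2,
                         compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    temperature = 0.0
    result = None
    while temperature <= 1.0:
        result = decode(temperature)
        failed = (result["compression_ratio"] > compression_ratio_threshold  # too repetitive
                  or result["avg_logprob"] < logprob_threshold)              # too unlikely
        if not failed:
            return result
        temperature += temperature_increment  # fall back to a more random decoding
    return result  # give up and keep the last attempt
```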

Diarization - Min Speakers

The minimum number of speakers for Pyannote to detect.

Diarization - Max Speakers

The maximum number of speakers for Pyannote to detect.

Repetition Penalty

  • ctranslate2: repetition_penalty
    This parameter only takes effect in faster-whisper (ctranslate2). Penalty applied to the score of previously generated tokens (set > 1 to penalize).

No Repeat Ngram Size

  • ctranslate2: no_repeat_ngram_size
    This parameter only takes effect in faster-whisper (ctranslate2). Prevent repetitions of ngrams with this size (set 0 to disable).
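
Both options are passed straight through to CTranslate2 by faster-whisper. A minimal sketch, assuming a recent faster-whisper version that exposes these parameters (the audio path and values are placeholders):

```python
# Sketch: the ctranslate2-only repetition controls via faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")
segments, _ = model.transcribe(
    "audio.mp3",
    repetition_penalty=1.1,   # > 1 penalizes previously generated tokens
    no_repeat_ngram_size=3,   # 0 disables the n-gram repetition filter
)
for segment in segments:
    print(segment.text)
```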

Whisper Filter options

This is an experimental feature and may potentially filter out correct transcription results.

When enabled, it can effectively reduce Whisper hallucinations, especially with the large-v3 version of the Whisper model.

Observations for transcriptions:

  1. duration: calculated by subtracting start from end; a very short duration combined with a long text (i.e., duration inversely proportional to text length) may indicate a hallucinated result.
  2. segment_last: the last result of each segment during VAD transcription has a certain probability of being a hallucinated result.
  3. avg_logprob: average log probability, ranging from logprob_threshold (default: -1) to 0; larger values are better. A value lower than -0.9 might suggest a poor result.
  4. compression_ratio: gzip compression ratio, ranging from 0 to compression_ratio_threshold (default: 2.4); higher positive values are preferable. A value lower than 0.9 might indicate a suboptimal result.
  5. no_speech_prob: the probability of the <|nospeech|> token, ranging from 0 to no_speech_threshold (default: 0.6); smaller positive values are preferable. A value exceeding 0.1 might suggest a suboptimal result.

Four sets of filtering conditions have been established, utilizing text length, duration, and the avg_logprob, compression_ratio, and no_speech_prob values returned by Whisper (see the sketch after the list below).

  1. avg_logprob < -0.9
  2. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.5
  3. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.07, compression_ratio < 0.9
  4. (durationLen < 1.5 || segment_last), compression_ratio < 0.9, no_speech_prob > 0.1
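
The sketch below shows how these four conditions could be applied to a single transcribed segment. The field names follow the observations listed earlier; the exact thresholds and logic used by the web UI may differ slightly:

```python
# Sketch: flagging a segment as a probable hallucination using the four conditions.
def is_probable_hallucination(seg):
    duration = seg["end"] - seg["start"]
    text_len = len(seg["text"].strip())
    short_or_last = duration < 1.5 or seg.get("segment_last", False)

    if seg["avg_logprob"] < -0.9:                                        # condition 1
        return True
    if (short_or_last and text_len > 5 and seg["avg_logprob"] < -0.4
            and seg["no_speech_prob"] > 0.5):                            # condition 2
        return True
    if (short_or_last and text_len > 5 and seg["avg_logprob"] < -0.4
            and seg["no_speech_prob"] > 0.07
            and seg["compression_ratio"] < 0.9):                         # condition 3
        return True
    if (short_or_last and seg["compression_ratio"] < 0.9
            and seg["no_speech_prob"] > 0.1):                            # condition 4
        return True
    return False
```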

Translation - Batch Size

  • transformers: batch_size
    When the pipeline uses a DataLoader (i.e., when passing a dataset, on GPU for a PyTorch model), this is the size of the batch to use; for inference, a larger batch is not always beneficial.
  • ctranslate2: max_batch_size
    The maximum batch size.

Translation - No Repeat Ngram Size

  • transformers: no_repeat_ngram_size
    Value that will be used by default in the generate method of the model for no_repeat_ngram_size. If set to int > 0, all ngrams of that size can only occur once.
  • ctranslate2: no_repeat_ngram_size
    Prevent repetitions of ngrams with this size (set 0 to disable).

Translation - Num Beams

  • transformers: num_beams
    Number of beams for beam search that will be used by default in the generate method of the model. 1 means no beam search.
  • ctranslate2: beam_size
    Beam size (1 for greedy search).

Translation - Torch Dtype float16

  • transformers: torch_dtype=torch.float16
    Load the float32 translation model in float16 when the system has a supported GPU (reducing VRAM usage; not applicable to models that are already quantized, such as CTranslate2, GPTQ, or GGUF models).

Translation - Using Bitsandbytes

  • transformers: load_in_8bit, load_in_4bit
    Load the float32 translation model as a mixed 8-bit or 4-bit precision quantized model when the system has a supported GPU (reducing VRAM usage; not applicable to models that are already quantized, such as CTranslate2, GPTQ, or GGUF models).
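
As an illustration of the transformers side of these options, here is a minimal sketch using a translation pipeline. The model name, languages, and option values are placeholders and may not match what the web UI does internally:

```python
# Sketch: batch size, beam search, n-gram blocking and float16 with transformers.
import torch
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
    torch_dtype=torch.float16,  # "Torch Dtype float16" (GPU only; skip for already-quantized models)
    batch_size=2,               # "Batch Size": only matters when the pipeline batches inputs
    device=0,                   # GPU 0; use device=-1 for CPU (and drop torch_dtype)
)

results = translator(
    ["Hello world.", "How are you today?"],
    num_beams=5,              # "Num Beams" (1 = no beam search)
    no_repeat_ngram_size=3,   # "No Repeat Ngram Size" (0 disables the filter)
)
print([r["translation_text"] for r in results])
```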



Description

The translate task in Whisper only supports translating other languages into English; OpenAI does not guarantee translations between arbitrary language pairs. In such cases, you can opt to use a separate Translation Model for the translation task. However, note that the Translation Model runs very slowly on CPU, and the completion time may be twice as long as usual. It is recommended to run the Translation Model on devices with GPUs for better performance.

The larger the translation model, the better its translation capability is expected to be. However, larger models also require more computational resources and run more slowly.

The translation model is now compatible with the Word Timestamps - Highlight Words feature. (Previously, the Highlight Words option in the Whisper Word Timestamps settings could not be used together with the Translation Model, because Highlight Words splits the source text, and after translation the result is no longer a word-level string.)

Translation Model

The required VRAM is provided for reference and may not apply to everyone. If the model's VRAM requirement exceeds the available capacity of the system, the model will operate on the CPU, resulting in significantly longer execution times.

CTranslate2 is a C++ and Python library for efficient inference with Transformer models. Models converted to the CTranslate2 format can run with fewer resources and at higher speed. Encoder-decoder models currently supported: Transformer base/big, M2M-100, NLLB, BART, mBART, Pegasus, T5, Whisper.

M2M100

M2M100 is a multilingual translation model introduced by Facebook AI in October 2020. It supports arbitrary translation among 101 languages. The paper is titled "Beyond English-Centric Multilingual Machine Translation" (arXiv:2010.11125).

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| facebook/m2m100_418M | 418M | 1.94 GB | float32 | ≈2 GB |
| facebook/m2m100_1.2B | 1.2B | 4.96 GB | float32 | ≈5 GB |
| facebook/m2m100-12B-last-ckpt | 12B | 47.2 GB | float32 | ≈22.1 GB (torch dtype in float16) |
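
For reference, a minimal sketch of translating text with M2M100 through transformers (the model size, languages, and input text are placeholders):

```python
# Sketch: Japanese -> English translation with M2M100.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "ja"
encoded = tokenizer("こんにちは、世界", return_tensors="pt")

# M2M100 selects the target language by forcing its language token as the first generated token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```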

M2M100-CTranslate2

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| michaelfeil/ct2fast-m2m100_418M | 418M | 970 MB | float16 | ≈0.6 GB |
| michaelfeil/ct2fast-m2m100_1.2B | 1.2B | 2.48 GB | float16 | ≈1.3 GB |
| michaelfeil/ct2fast-m2m100-12B-last-ckpt | 12B | 23.6 GB | float16 | N/A |

NLLB-200

NLLB-200 is a multilingual translation model introduced by Meta AI in July 2022. It supports arbitrary translation among 202 languages. The paper is titled "No Language Left Behind: Scaling Human-Centered Machine Translation" (arXiv:2207.04672).

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| facebook/nllb-200-distilled-600M | 600M | 2.46 GB | float32 | ≈2.5 GB |
| facebook/nllb-200-distilled-1.3B | 1.3B | 5.48 GB | float32 | ≈5.9 GB |
| facebook/nllb-200-1.3B | 1.3B | 5.48 GB | float32 | ≈5.8 GB |
| facebook/nllb-200-3.3B | 3.3B | 17.58 GB | float32 | ≈13.4 GB |
| facebook/nllb-moe-54b | 54B | 220.2 GB | float32 | N/A |

NLLB-200-CTranslate2

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| michaelfeil/ct2fast-nllb-200-distilled-1.3B | 1.3B | 1.38 GB | int8_float16 | ≈1.3 GB |
| michaelfeil/ct2fast-nllb-200-3.3B | 3.3B | 3.36 GB | int8_float16 | ≈3.2 GB |
| JustFrederik/nllb-200-1.3B-ct2-int8 | 1.3B | 1.38 GB | int8 | ≈1.3 GB |
| JustFrederik/nllb-200-1.3B-ct2-float16 | 1.3B | 2.74 GB | float16 | ≈1.3 GB |
| JustFrederik/nllb-200-distilled-600M-ct2 | 600M | 2.46 GB | float32 | ≈0.6 GB |
| JustFrederik/nllb-200-distilled-600M-ct2-float16 | 600M | 1.23 GB | float16 | ≈0.6 GB |
| JustFrederik/nllb-200-distilled-600M-ct2-int8 | 600M | 623 MB | int8 | ≈0.6 GB |
| JustFrederik/nllb-200-distilled-1.3B-ct2-float16 | 1.3B | 2.74 GB | float16 | ≈1.3 GB |
| JustFrederik/nllb-200-distilled-1.3B-ct2-int8 | 1.3B | 1.38 GB | int8 | ≈1.3 GB |
| JustFrederik/nllb-200-distilled-1.3B-ct2 | 1.3B | 5.49 GB | float32 | ≈1.3 GB |
| JustFrederik/nllb-200-1.3B-ct2 | 1.3B | 5.49 GB | float32 | ≈1.3 GB |
| JustFrederik/nllb-200-3.3B-ct2-float16 | 3.3B | 6.69 GB | float16 | ≈3.2 GB |
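
A minimal sketch of running one of these CTranslate2 conversions directly; the local model path, languages, and option values are placeholders, and the tokenizer still comes from the original transformers checkpoint:

```python
# Sketch: translating with a CTranslate2-converted NLLB model.
import ctranslate2
import transformers

translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world."))
results = translator.translate_batch(
    [source],
    target_prefix=[["fra_Latn"]],  # NLLB selects the target language via a prefix token
    beam_size=5,                   # "Num Beams"
    no_repeat_ngram_size=3,        # "No Repeat Ngram Size"
    max_batch_size=8,              # "Batch Size"
)
target_tokens = results[0].hypotheses[0][1:]  # drop the target-language prefix token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))
```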

MT5

mT5 is a multilingual pre-trained Text-to-Text Transformer introduced by Google Research in October 2020. It is a multilingual variant of the T5 model, pre-trained on datasets in 101 languages. Further fine-tuning is required to transform it into a translation model. The paper is titled "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer" (arXiv:2010.11934).
The 'mt5-zh-ja-en-trimmed' model is finetuned from Google's 'mt5-base' model. This model has a relatively good translation speed, but it only supports three languages: Chinese, Japanese, and English.

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| mt5-base | N/A | 2.33 GB | float32 | N/A |
| K024/mt5-zh-ja-en-trimmed | N/A | 1.32 GB | float32 | ≈1.4 GB |
| engmatic-earth/mt5-zh-ja-en-trimmed-fine-tuned-v1 | N/A | 1.32 GB | float32 | ≈1.4 GB |

ALMA

ALMA is a many-to-many LLM-based translation model introduced by Haoran Xu and colleagues in September 2023. It is based on the fine-tuning of a large language model (LLaMA-2). The approach used for this model is referred to as Advanced Language Model-based trAnslator (ALMA). The paper is titled "A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models" (arXiv:2309.11674).
The official support for ALMA currently includes 10 language directions: English↔German, English↔Czech, English↔Icelandic, English↔Chinese, and English↔Russian. However, the author hints that there might be surprises in other directions, so there are currently no restrictions on the languages that ALMA can be chosen for in the web UI.

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| haoranxu/ALMA-7B | 7B | 26.95 GB | float32 | ≈13.2 GB (torch dtype in float16) |
| haoranxu/ALMA-13B | 13B | 52.07 GB | float32 | ≈25.4 GB (torch dtype in float16) |

ALMA-GPTQ

Due to poor GPTQ support on CPUs, the execution time per iteration can exceed a thousand seconds when running on a CPU, so running this model on a CPU is strongly discouraged.
GPTQ is a technique used to quantize the parameters of large language models into integer formats such as int8 or int4. Although the quantization process may lead to a loss in model performance, it significantly reduces both file size and the required VRAM.

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| TheBloke/ALMA-7B-GPTQ | 7B | 3.9 GB | 4 Bits | ≈4.3 GB |
| TheBloke/ALMA-13B-GPTQ | 13B | 7.26 GB | 4 Bits | ≈8.4 GB |

ALMA-GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
k-quants: a series of 2-6 bit quantization methods, along with quantization mixes

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| TheBloke/ALMA-7B-GGUF-Q4_K_M | 7B | 4.08 GB | Q4_K_M (4 Bits medium) | ≈5.3 GB |
| TheBloke/ALMA-13B-GGUF-Q4_K_M | 13B | 7.87 GB | Q4_K_M (4 Bits medium) | ≈9.3 GB |

ALMA-CTranslate2

CTranslate2 does not currently support 4-bit quantization. Currently, it can only use int8_float16 quantization, so the file size and required VRAM will be larger than the GPTQ model quantized with 4 bits. However, it runs much faster on the CPU than GPTQ. If you plan to run ALMA in an environment without a GPU, you may consider choosing the CTranslate2 version of the ALMA model.

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| avans06/ALMA-7B-ct2-int8_float16 | 7B | 6.74 GB | int8_float16 | ≈6.6 GB |
| avans06/ALMA-13B-ct2-int8_float16 | 13B | 13 GB | int8_float16 | ≈12.6 GB |

madlad400

madlad400 is a multilingual machine translation model based on the T5 architecture, introduced by Google DeepMind and Google Research in September 2023. It was trained on 250 billion tokens covering over 450 languages using publicly available data. The paper is titled "MADLAD-400: A Multilingual And Document-Level Large Audited Dataset" (arXiv:2309.04662).

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| jbochi/madlad400-3b-mt | 3B | 11.8 GB | float32 | ≈12 GB |
| jbochi/madlad400-7b-mt | 7.2B | 33.2 GB | float32 | ≈19.7 GB (torch dtype in float16) |
| jbochi/madlad400-7b-mt-bt | 7.2B | 33.2 GB | float32 (finetuned on backtranslated data) | ≈19.7 GB (torch dtype in float16) |
| jbochi/madlad400-8b-lm | 8B | 34.52 GB | float32 | N/A |
| jbochi/madlad400-10b-mt | 10.7B | 42.86 GB | float32 | ≈24.3 GB (torch dtype in float16) |

madlad400-CTranslate2

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| SoybeanMilk/madlad400-3b-mt-ct2-int8_float16 | 3B | 2.95 GB | int8_float16 | ≈2.7 GB |
| avans06/madlad400-7b-mt-bt-ct2-int8_float16 | 7.2B | 8.31 GB | int8_float16 (finetuned on backtranslated data) | ≈8.5 GB |
| SoybeanMilk/madlad400-10b-mt-ct2-int8_float16 | 10.7B | 10.7 GB | int8_float16 | ≈10 GB |

SeamlessM4T

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

It enables multiple tasks without relying on separate models:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)

SeamlessM4T-v1 was introduced by the Seamless Communication team at Meta AI in August 2023. The paper is titled "SeamlessM4T: Massively Multilingual & Multimodal Machine Translation" (arXiv:2308.11596).
SeamlessM4T-v2 was introduced by the Seamless Communication team at Meta AI in December 2023. The paper is titled "Seamless: Multilingual Expressive and Streaming Speech Translation" (arXiv:2312.05187).

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| facebook/hf-seamless-m4t-medium | 1.2B | 4.84 GB | float32 | N/A |
| facebook/seamless-m4t-large | 2.3B | 11.4 GB | float32 | N/A |
| facebook/seamless-m4t-v2-large | 2.3B | 11.4 GB (safetensors: 9.24 GB) | float32 | ≈9.2 GB |

Llama

Meta developed and released the Meta Llama 3 family of large language models (LLMs). This program uses prompting to make these instruction-tuned models function as translation models.

| Name | Parameters | Size | type/quantize | Required VRAM |
|------|------------|------|---------------|---------------|
| avans06/Meta-Llama-3.2-8B-Instruct-ct2-int8_float16 | 8B | 8.04 GB | int8_float16 | ≈7.9 GB |
| avans06/Meta-Llama-3.1-8B-Instruct-ct2-int8_float16 | 8B | 8.04 GB | int8_float16 | ≈7.9 GB |
| avans06/Meta-Llama-3-8B-Instruct-ct2-int8_float16 | 8B | 8.04 GB | int8_float16 | ≈7.9 GB |
| jncraton/Llama-3.2-3B-Instruct-ct2-int8 | 3B | 3.22 GB | int8 | ≈3.3 GB |


---
title: Faster Whisper Webui with translate
emoji: ✨
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: true
license: apache-2.0
duplicated_from: aadnk/whisper-webui
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Running Locally

To run this program locally, first install Python 3.9+ and Git. Then install PyTorch 1.10.1+ and all the other dependencies:

pip install -r requirements.txt

You can find detailed instructions for how to install this on Windows 10/11 here (PDF).

Finally, run the full version (no audio length restrictions) of the app with parallel CPU/GPU enabled:

python app.py --input_audio_max_duration -1 --server_name 127.0.0.1 --auto_parallel True

You can also run the CLI interface, which is similar to Whisper's own CLI but also supports the following additional arguments:

python cli.py \
[--vad {none,silero-vad,silero-vad-skip-gaps,silero-vad-expand-into-gaps,periodic-vad}] \
[--vad_merge_window VAD_MERGE_WINDOW] \
[--vad_max_merge_size VAD_MAX_MERGE_SIZE] \
[--vad_padding VAD_PADDING] \
[--vad_prompt_window VAD_PROMPT_WINDOW] \
[--vad_cpu_cores NUMBER_OF_CORES] \
[--vad_parallel_devices COMMA_DELIMITED_DEVICES] \
[--auto_parallel BOOLEAN]

You may also use URLs in addition to file paths as input.

python cli.py --model large --vad silero-vad --language Japanese "https://www.youtube.com/watch?v=4cICErqqRSM"

Rather than supplying arguments to app.py or cli.py, you can also use the configuration file config.json5. See that file for more information. If you want to use a different configuration file, you can use the WHISPER_WEBUI_CONFIG environment variable to specify the path to another file.

Multiple Files

You can upload multiple files either through the "Upload files" option, or as a playlist on YouTube. Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section. When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.

Diarization

To detect different speakers in the audio, you can use the whisper-diarization application.

Download the JSON file after running Whisper on an audio file, and then run app.py in the whisper-diarization repository with the audio file and the JSON file as arguments.

Whisper Implementation

You can choose between using whisper or faster-whisper. Faster Whisper is a drop-in replacement for the default Whisper implementation that achieves up to a 4x speedup and a 2x reduction in memory usage.

You can install the requirements for a specific Whisper implementation in requirements-fasterWhisper.txt or requirements-whisper.txt:

pip install -r requirements-fasterWhisper.txt

And then run the App or the CLI with the --whisper_implementation faster-whisper flag:

python app.py --whisper_implementation faster-whisper --input_audio_max_duration -1 --server_name 127.0.0.1 --server_port 7860 --auto_parallel True

You can also select the whisper implementation in config.json5:

{
    "whisper_implementation": "faster-whisper"
}

GPU Acceleration

In order to use GPU acceleration with Faster Whisper, both CUDA 11.2 and cuDNN 8 must be installed. You may want to install it in a virtual environment like Anaconda.

Google Colab

You can also run this Web UI directly on Google Colab, if you haven't got a GPU powerful enough to run the larger models.

See the colab documentation for more information.

Parallel Execution

You can also run both the Web-UI or the CLI on multiple GPUs in parallel, using the vad_parallel_devices option. This takes a comma-delimited list of device IDs (0, 1, etc.) that Whisper should be distributed to and run on concurrently:

python cli.py --model large --vad silero-vad --language Japanese \
--vad_parallel_devices 0,1 "https://www.youtube.com/watch?v=4cICErqqRSM"

Note that this requires a VAD to function properly, otherwise only the first GPU will be used. You could, however, use periodic-vad to avoid the overhead of running Silero VAD, at a slight cost to accuracy.

This is achieved by creating N child processes (where N is the number of selected devices), where Whisper is run concurrently. In app.py, you can also set the vad_process_timeout option. This configures the number of seconds until a process is killed due to inactivity, freeing RAM and video memory. The default value is 30 minutes.

python app.py --input_audio_max_duration -1 --vad_parallel_devices 0,1 --vad_process_timeout 3600

To execute the Silero VAD itself in parallel, use the vad_cpu_cores option:

python app.py --input_audio_max_duration -1 --vad_parallel_devices 0,1 --vad_process_timeout 3600 --vad_cpu_cores 4

You may also use vad_process_timeout with a single device (--vad_parallel_devices 0), if you prefer to always free video memory after a period of time.

Auto Parallel

You can also set auto_parallel to True. This will set vad_parallel_devices to use all the GPU devices on the system, and vad_cpu_cores to be equal to the number of cores (up to 8):

python app.py --input_audio_max_duration -1 --auto_parallel True

Docker

To run it in Docker, first install Docker and optionally the NVIDIA Container Toolkit in order to use the GPU. Then either use the GitLab hosted container below, or check out this repository and build an image:

sudo docker build -t whisper-webui:1 .

You can then start the WebUI with GPU support like so:

sudo docker run -d --gpus=all -p 7860:7860 whisper-webui:1

Leave out "--gpus=all" if you don't have access to a GPU with enough memory, and are fine with running it on the CPU only:

sudo docker run -d -p 7860:7860 whisper-webui:1

GitLab Docker Registry

This Docker container is also hosted on GitLab:

sudo docker run -d --gpus=all -p 7860:7860 registry.gitlab.com/aadnk/whisper-webui:latest

Custom Arguments

You can also pass custom arguments to app.py in the Docker container, for instance to be able to use all the GPUs in parallel (replace administrator with your user):

sudo docker run -d --gpus all -p 7860:7860 \
--mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
--mount type=bind,source=/home/administrator/.cache/huggingface,target=/root/.cache/huggingface \
--restart=on-failure:15 registry.gitlab.com/aadnk/whisper-webui:latest \
app.py --input_audio_max_duration -1 --server_name 0.0.0.0 --auto_parallel True \
--default_vad silero-vad --default_model_name large

You can also call cli.py the same way:

sudo docker run --gpus all \
--mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
--mount type=bind,source=/home/administrator/.cache/huggingface,target=/root/.cache/huggingface \
--mount type=bind,source=${PWD},target=/app/data \
registry.gitlab.com/aadnk/whisper-webui:latest \
cli.py --model large --auto_parallel True --vad silero-vad \
--output_dir /app/data /app/data/YOUR-FILE-HERE.mp4

Caching

Note that the models themselves are currently not included in the Docker images, and will be downloaded on demand. To avoid this, bind the directory /root/.cache/whisper to some directory on the host (for instance /home/administrator/.cache/whisper), where you can (optionally) prepopulate the directory with the different Whisper models.

sudo docker run -d --gpus=all -p 7860:7860 \
--mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
registry.gitlab.com/aadnk/whisper-webui:latest