AI Voiceover

2025-08-31

What is an AI Voiceover?

An AI voiceover, or AI VO, is an AI-generated voice that is added to a video to narrate it in place of a human voice. This doesn't mean that an AI analyzes the video and narrates it. Instead, it's just a type of text-to-speech (TTS) program.

In a TTS program, you give the program some text, either by typing it in a textbox or giving it a text file, and the program synthesizes audio (i.e. creates synthetic audio) that sounds like a person saying the words in the text. This type of program has existed for a long time. An AI voiceover simply uses "AI" to perform the synthesis instead.

In practice, this means that the program uses a neural network (NN) model that was trained on a dataset of audio clips paired with their text transcriptions. For example, someone would record themselves reading passages of a book, save the audio as a file, and pair it with a text file that contains the text of the passage that they read. After several hours of audio have been recorded this way, the data is fed into a training program that tries to figure out which text matches which sounds.
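As a rough illustration of what such a dataset might look like on disk, here is a minimal Python sketch. It assumes a hypothetical layout where each audio clip (e.g. clip_001.wav) sits next to a transcript with the same name (clip_001.txt). The function name build_manifest and the file layout are assumptions made for this example, not the format of any particular TTS dataset.

```python
from pathlib import Path


def build_manifest(data_dir):
    """Pair each .wav clip with the .txt transcript that shares its file stem.

    Hypothetical layout: clip_001.wav next to clip_001.txt, and so on.
    Returns a list of (audio_file_name, transcript_text) tuples.
    """
    pairs = []
    for audio_path in sorted(Path(data_dir).glob("*.wav")):
        text_path = audio_path.with_suffix(".txt")
        if not text_path.exists():
            continue  # a clip without a transcript can't be used as a training pair
        transcript = text_path.read_text(encoding="utf-8").strip()
        pairs.append((audio_path.name, transcript))
    return pairs
```

A training program would then read the audio for each pair and learn from the (text, audio) examples. That part is far more involved and is not shown here.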

The TTS AI doesn't necessarily convert English text directly into audio. There can be an intermediate step that converts English text into language-agnostic pronunciation tokens, which are then converted into audio. This is important because the same spelling in English can have multiple pronunciations, e.g. "lives" in "many lives" and in "he lives here."
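To make the idea of an intermediate pronunciation step concrete, here is a toy Python sketch. It is not how Kokoro or any real TTS system does it: the lookup table, the one-word context rule, and the function name to_pronunciation_tokens are all assumptions invented for this example. A real system would use a much larger pronunciation dictionary or a trained model.

```python
# Toy pronunciation table: (word, guessed part of speech) -> IPA-like token.
# Only the single word "lives" is covered; everything else is an assumption-free pass-through.
PRONUNCIATIONS = {
    ("lives", "noun"): "laɪvz",  # as in "many lives"
    ("lives", "verb"): "lɪvz",   # as in "he lives here"
}

# Crude context rule (an assumption for this sketch): if the previous word
# is a subject pronoun, treat "lives" as a verb, otherwise as a noun.
SUBJECT_PRONOUNS = {"he", "she", "it", "who"}


def to_pronunciation_tokens(text):
    """Convert text into a list of pronunciation tokens (toy version)."""
    words = text.lower().replace(".", "").replace(",", "").split()
    tokens = []
    for i, word in enumerate(words):
        if word == "lives":
            previous = words[i - 1] if i > 0 else ""
            pos = "verb" if previous in SUBJECT_PRONOUNS else "noun"
            tokens.append(PRONUNCIATIONS[(word, pos)])
        else:
            tokens.append(word)  # placeholder: unknown words pass through unchanged
    return tokens


print(to_pronunciation_tokens("many lives"))      # ['many', 'laɪvz']
print(to_pronunciation_tokens("He lives here."))  # ['he', 'lɪvz', 'here']
```

The second stage, turning these tokens into audio, is where the neural network does its work. Because the tokens already encode which pronunciation is meant, that stage no longer has to guess from the spelling.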

An example of an AI voiceover program is kokoro. This is a multi-language open-weight AI model with 82 million parameters [https://huggingface.co/hexgrad/Kokoro-82M] (accessed 2025-08-31). You can run it locally via a separate open source project called kokoro TTS [https://github.com/nazdridoy/kokoro-tts] (accessed 2025-08-31). Once installed, the terminal command to run it looks like this:

kokoro-tts input.txt output.wav --speed 1.2 --lang en-us --voice af_sarah
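If you want to call the tool from a script, for example to narrate several text files in a row, one option is to wrap the same command in Python. This is a minimal sketch that assumes kokoro-tts is installed and on your PATH. It uses only the arguments shown in the command above, and the helper names build_kokoro_command and synthesize are made up for this example.

```python
import subprocess


def build_kokoro_command(input_path, output_path, speed=1.2, lang="en-us", voice="af_sarah"):
    """Build the argument list for the kokoro-tts command shown above.

    Only the flags that appear in that command are used here.
    """
    return [
        "kokoro-tts", input_path, output_path,
        "--speed", str(speed),
        "--lang", lang,
        "--voice", voice,
    ]


def synthesize(input_path, output_path, **options):
    """Run kokoro-tts and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_kokoro_command(input_path, output_path, **options), check=True)


# Example usage (requires kokoro-tts to be installed):
# synthesize("input.txt", "output.wav")
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.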

It's worth noting that in many cases the people who created a dataset, the people who trained the model on that dataset, and the people who created the tools to use the model are separate people. Each of these three things can have a different license, which is very important. That's because if a model is trained on someone's voice, that is effectively cloning that person's voice, and if the model is released publicly, anyone on the Internet can make that person "say" something that they never said. This is similar to the situation with "model rights," where the "model" is a person who appears in a photo rather than an AI model. For example, the license on Pexels, a website where you can get royalty-free images, explicitly states: "Identifiable people may not appear in a bad light or in a way that is offensive." It wouldn't be surprising if AI voiceover software had the same limitations for the same reasons.
