Add an AI voice over to a video using an API

Are you looking to use text to speech to add an AI voice over to your video? Do you want an easy-to-use API to incorporate a background voice over into your video? Then this post has all that you need.

Some common use cases where you might want to generate a video with an AI voice over include:

  • Adding a realistic voice to news videos
  • Making tutorials and how-to videos more informative and accessible
  • Adding voice commentary to promotional videos
  • Creating informative voice overs for educational presentations

I will show you how to create an AI voice over and add it to a video using Shotstack's media APIs. It is a step-by-step, easy-to-follow guide, so even those with no prior programmatic video editing experience can automate the generation of videos with professional-sounding AI narration.

About Shotstack and the Create and Edit APIs

Shotstack is an API-driven, video automation platform for creating, editing, and distributing dynamic videos at scale. In this post, I will guide you through generating the AI voice over using the Shotstack Create API. Then, I will show you how to integrate the voice over with your video assets using the Shotstack Edit API.

Prerequisites

Before you start, register on the Shotstack website to get a free API key. You'll need this key to make authorized requests to the Create and Edit APIs. You should also be familiar with the cURL utility and running commands from the command line of your operating system.

Create a voice over using the Create API

The Create API creates video, images, audio and text using the latest generative AI services. In addition to built-in services, it also allows you to invoke third-party services. You can use ElevenLabs for hyper-realistic voice overs, HeyGen or D-ID for text-to-avatar creation, or Stability AI to generate images from text prompts.

Let's explore two ways to generate a voice over using the Create API: one using the built-in text to speech service and one using ElevenLabs. To make the example more realistic, we'll mock up a weather report style video using our own script and a background video from Pexels.

Using the Shotstack text to speech provider

We will send a POST request to the Create API using cURL to generate a voice over from text. Run the following command in your shell. Make sure to use your own stage/sandbox API key as the value of the x-api-key header instead of YOUR_API_KEY:

curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/ \
-d '
{
  "provider": "shotstack",
  "options": {
    "type": "text-to-speech",
    "text": "Moving down to the Central area we are seeing clear skies, a gentle breeze and mild temperatures, perfect for an evening stroll. Temperatures are hovering around a comfortable 55 degrees fahrenheit, the ideal weather for outdoor activities.",
    "voice": "Matthew",
    "language": "en-US"
  }
}'

The stage keyword in the URL identifies the environment you are working in, here the sandbox. The JSON body of the command (the value of the -d option) includes the text to convert to speech and the chosen voice and language combination. To get a list of all available languages and voices, check out the text-to-speech API docs.

Expect an output like the following:

{
  "data": {
    "type": "asset",
    "id": "01hmg-6n6yd-k3q2w-me4kg-3rgtn9",
    "attributes": {
      "owner": "c2jsl2d4xd",
      "provider": "shotstack",
      "type": "text-to-speech",
      "status": "queued",
      "created": "2024-01-19T06:31:14.425Z",
      "updated": "2024-01-19T06:31:14.425Z"
    }
  }
}

Copy the id from the response as we will be using it to check the status of the audio in the next step.
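
If you would rather script this step than paste the command into a shell, the same request can be sketched in Python using only the standard library. This is a minimal sketch rather than an official client: the function names build_tts_request and create_voice_over are made up for illustration, while the URL, headers and JSON body mirror the cURL command above.

```python
import json
import urllib.request

# Endpoint taken from the cURL command above ("stage" is the sandbox environment).
CREATE_URL = "https://api.shotstack.io/create/stage/assets/"


def build_tts_request(api_key, text, voice="Matthew", language="en-US"):
    """Build (but do not send) the text-to-speech POST request."""
    payload = {
        "provider": "shotstack",
        "options": {
            "type": "text-to-speech",
            "text": text,
            "voice": voice,
            "language": language,
        },
    }
    return urllib.request.Request(
        CREATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def create_voice_over(api_key, text):
    """Send the request and return the asset id from the JSON response."""
    with urllib.request.urlopen(build_tts_request(api_key, text)) as response:
        return json.load(response)["data"]["id"]
```

Calling create_voice_over("YOUR_API_KEY", "your script text") should return the same id that you would otherwise copy from the cURL output.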

Check the status of the Shotstack voice over generation

It can take time for the voice over generation to complete, so wait a few seconds and then run the following command. Make sure to replace the id in the URL with the one you received in the response to the previous API call.

curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/01hmg-6n6yd-k3q2w-me4kg-3rgtn9

Expect an output similar to this:

{
  "data": {
    "type": "asset",
    "id": "01hmg-6n6yd-k3q2w-me4kg-3rgtn9",
    "attributes": {
      "owner": "c2jsl2d4xd",
      "provider": "shotstack",
      "type": "text-to-speech",
      "url": "https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hmg-6n6yd-k3q2w-me4kg-3rgtn9.mp3",
      "status": "done",
      "created": "2024-01-19T06:31:14.425Z",
      "updated": "2024-01-19T06:31:20.181Z"
    }
  }
}

The status parameter in the response should show done if the audio has been generated. If the generation is still in progress, you might see statuses like rendering, saving, or queued. In that case, just wait for a few seconds and resend the same GET request.
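
The wait-and-retry step described above can be automated with a small polling loop. The following is a sketch under a few assumptions: the helper names fetch_asset and wait_for_asset, the 3 second interval and the attempt limit are arbitrary choices, and treating a "failed" status as an error is an assumption because that status does not appear in this guide.

```python
import json
import time
import urllib.request


def fetch_asset(api_key, asset_id):
    """GET the asset from the stage Create API and return its attributes."""
    request = urllib.request.Request(
        f"https://api.shotstack.io/create/stage/assets/{asset_id}",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"]["attributes"]


def wait_for_asset(fetch, interval=3.0, max_attempts=20, sleep=time.sleep):
    """Call fetch() until the status is "done", then return the audio URL."""
    for _ in range(max_attempts):
        attributes = fetch()
        status = attributes["status"]
        if status == "done":
            return attributes["url"]
        if status == "failed":  # assumed failure status, not shown in this guide
            raise RuntimeError("voice over generation failed")
        sleep(interval)  # still in progress, so wait before asking again
    raise TimeoutError(f"asset not ready after {max_attempts} attempts")
```

A call such as wait_for_asset(lambda: fetch_asset("YOUR_API_KEY", "01hmg-6n6yd-k3q2w-me4kg-3rgtn9")) would then block until the MP3 URL is available. Passing the fetch function in as an argument keeps the loop easy to try out without touching the network.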

Once the status says done, open or download the link given in the url attribute of the response. It should play an MP3 audio file.

Here is an example of the Shotstack generated audio:

Using the ElevenLabs text to speech provider

Another way to generate a realistic AI voice over is to use the third-party integration with ElevenLabs. Follow these steps:

  1. Sign up for a free account on the ElevenLabs website (if you haven't already).
  2. Retrieve the ElevenLabs API key from your ElevenLabs profile.
  3. In the Shotstack dashboard select Integrations from the user profile menu in the top right corner.
  4. Navigate to the ElevenLabs Text to Speech integration and select Configure.
  5. Enter your ElevenLabs API key under the relevant environment (Sandbox or Production) and hit Save. You can configure different API keys for each environment.

Now we are ready to use ElevenLabs. For the POST request, we need to change the provider attribute inside the JSON body, drop the language and pick an ElevenLabs voice. Just as with the other API requests, use your own stage/sandbox API key as the value of the x-api-key header instead of YOUR_API_KEY. The request should look like this:

curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/ \
-d '
{
  "provider": "elevenlabs",
  "options": {
    "type": "text-to-speech",
    "text": "Moving down to the Central area we are seeing clear skies, a gentle breeze and mild temperatures, perfect for an evening stroll. Temperatures are hovering around a comfortable 55 degrees fahrenheit, the ideal weather for outdoor activities.",
    "voice": "Adam"
  }
}'

The ElevenLabs request body is very similar to the Shotstack text to speech request, except that the provider value is elevenlabs, there is no language choice, and the list of voices is different. In this example we use the voice Adam. For a full list of voices, check out the ElevenLabs options.
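
To make the difference between the two request bodies concrete, here is a small Python sketch that builds either one. The function name tts_payload is hypothetical, and rejecting a language for providers other than shotstack is a design choice of this sketch based on the two examples in this guide, not documented API behaviour.

```python
import json


def tts_payload(provider, text, voice, language=None):
    """Return the Create API request body for the given provider."""
    options = {"type": "text-to-speech", "text": text, "voice": voice}
    if language is not None:
        # In this guide only the shotstack provider takes a language.
        if provider != "shotstack":
            raise ValueError("language is only used with the shotstack provider here")
        options["language"] = language
    return {"provider": provider, "options": options}


# Print a body that can be pasted after the -d option of the cURL command.
print(json.dumps(tts_payload("elevenlabs", "Clear skies ahead.", "Adam"), indent=2))
```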

The response has the same structure as the one from the Shotstack text to speech provider and includes the id of the asset being generated. Use the id to check the status of the asset using exactly the same approach as before, like this:

curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/01hmg-acrmr-3yd0c-q5c61-w5b7sh

As before, the response includes the status and url of the generated audio file. Here is an example of the audio generated:

Adding the AI voice over to a video using the Edit API

For the second part of this guide, we will use the Edit API to add our AI voice over to our weather video. First, create an empty file named video.json and paste the following JSON into it:

{
  "timeline": {
    "background": "#000000",
    "tracks": [
      {
        "clips": [
          {
            "asset": {
              "type": "audio",
              "src": "https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hmg-acrmr-3yd0c-q5c61-w5b7sh.mp3"
            },
            "start": 0,
            "length": 16
          }
        ]
      },
      {
        "clips": [
          {
            "asset": {
              "type": "video",
              "src": "https://player.vimeo.com/external/428974406.hd.mp4?s=8e75e82ef712ac04df173007f2e5f32ee00180fd&profile_id=174&oauth2_token_id=57447761"
            },
            "start": 0,
            "length": 17,
            "transition": {
              "in": "fade",
              "out": "fade"
            }
          }
        ]
      }
    ]
  },
  "output": {
    "format": "mp4",
    "resolution": "hd"
  }
}

This JSON file specifies the following properties for the video:

  • An audio clip using the URL of the AI voice over we just generated with settings including start time and duration. Make sure you replace the src parameter with the URL of your voice over mp3 file.
  • A background video clip, along with its settings for start time, duration, and transitions. The URL of the video is from Pexels stock footage.
  • The output settings for the video to be an MP4 file with HD (720p) resolution.

Note that the MP3 file generated using ElevenLabs is 16 seconds long, so we have set the length of the audio clip to 16 seconds. The video from Pexels is 22 seconds long but we have cut it short at 17 seconds.
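
If you plan to swap in different voice overs or background clips, it can help to generate video.json from code instead of editing it by hand. The sketch below rebuilds the same structure in Python; the function names build_edit and write_edit are hypothetical, and the check that the audio is not longer than the video is a design choice of this sketch, not an API requirement.

```python
import json


def build_edit(audio_url, video_url, audio_length, video_length):
    """Return the Edit API JSON as a dict: one audio track and one video track."""
    if audio_length > video_length:
        # Design choice for this sketch: narration should not outlast the footage.
        raise ValueError("audio clip is longer than the video clip")
    audio_clip = {
        "asset": {"type": "audio", "src": audio_url},
        "start": 0,
        "length": audio_length,
    }
    video_clip = {
        "asset": {"type": "video", "src": video_url},
        "start": 0,
        "length": video_length,
        "transition": {"in": "fade", "out": "fade"},
    }
    return {
        "timeline": {
            "background": "#000000",
            "tracks": [{"clips": [audio_clip]}, {"clips": [video_clip]}],
        },
        "output": {"format": "mp4", "resolution": "hd"},
    }


def write_edit(edit, path="video.json"):
    """Save the edit so it can be sent with: curl -d @video.json"""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(edit, handle, indent=2)
```

For the example in this guide you would call build_edit with your own MP3 URL, the Pexels video URL, 16 and 17, then pass the result to write_edit.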

Now, we will send a POST request to the Edit API to generate the video based on this JSON file. Run the following command from your shell with your own API key:

curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d @video.json \
https://api.shotstack.io/edit/stage/render

Expect a response like this:

{
  "success": true,
  "message": "Created",
  "response": {
    "message": "Render Successfully Queued",
    "id": "b609765f-ec2e-4727-8f22-ded38efd4f4d"
  }
}

Copy the id from the response as we will be using it to check the status of the video render in the next step.
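
Rather than copying the id by hand, a script can submit video.json and read the id from the response. This is a minimal standard-library sketch: submit_render and render_id_from are hypothetical names, and treating a false success flag as an error is an assumption inferred from the "success": true field shown above.

```python
import json
import urllib.request

# Endpoint taken from the cURL command above.
RENDER_URL = "https://api.shotstack.io/edit/stage/render"


def render_id_from(body):
    """Pull the render id out of a parsed Edit API response."""
    if not body.get("success"):
        # Assumption: a false or missing success flag means the request was rejected.
        raise RuntimeError(body.get("message", "render request was rejected"))
    return body["response"]["id"]


def submit_render(api_key, edit_path="video.json"):
    """POST the contents of edit_path to the Edit API and return the render id."""
    with open(edit_path, "rb") as handle:
        data = handle.read()
    request = urllib.request.Request(
        RENDER_URL,
        data=data,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return render_id_from(json.load(response))
```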

Check the status of the render

The Edit API takes time to render a video. After you send the render request to the API, wait a few seconds and run the command below. Make sure to replace the id in the URL with the one received in the response of the last API call.

curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/edit/stage/render/b609765f-ec2e-4727-8f22-ded38efd4f4d

A response similar to the one below will be returned:

{
  "success": true,
  "message": "OK",
  "response": {
    "id": "b609765f-ec2e-4727-8f22-ded38efd4f4d",
    "owner": "c2jsl2d4xd",
    "plan": "sandbox",
    "status": "done",
    "error": "",
    "duration": 15,
    "billable": 15,
    "renderTime": 9911.59,
    "url": "https://shotstack-api-stage-output.s3-ap-southeast-2.amazonaws.com/c2jsl2d4xd/b609765f-ec2e-4727-8f22-ded38efd4f4d.mp4",
    "poster": null,
    "thumbnail": null,
    "data": {
      ...
    }
  }
}

Just as with the Create API, the status parameter in the response shows done once the video has finished rendering. If the rendering is still in progress, you might see statuses like rendering, saving, or queued. In that case, just wait a few more seconds and resend the same GET request.

Once the status is done, download or visit the link at the url parameter in the response. The URL is the path to an mp4 video file which will play in your browser or a video player.
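
In a script, the render status response can be inspected and the finished file saved locally. The sketch below assumes the response shape shown above; finished_video_url and download_video are hypothetical helper names, the local file name weather.mp4 is arbitrary, and the "failed" status is an assumption because it does not appear in this guide.

```python
import urllib.request


def finished_video_url(body):
    """Return the video URL if the render is done, or None if it is still in progress."""
    render = body["response"]
    status = render["status"]
    if status == "done":
        return render["url"]
    if status == "failed":  # assumed failure status, not shown in this guide
        raise RuntimeError(render.get("error") or "render failed")
    return None  # for example queued, rendering or saving: check again shortly


def download_video(url, path="weather.mp4"):
    """Save the rendered MP4 locally and return the local path."""
    local_path, _headers = urllib.request.urlretrieve(url, path)
    return local_path
```

Pass the parsed JSON from the GET request to finished_video_url, and once it returns a URL, hand that URL to download_video.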

Here is the final weather video:

Summary

In this post, we explored easy ways to generate AI voice overs using an API and blend them into videos using simple JSON payloads and API requests.

From this simple example you can imagine creating a full-featured, personalised weather report video with a different voice over and background video for each location and set of weather conditions. You could also add lower-third titles, weather icons and a subtle background soundtrack using the Edit API to make the video even more engaging.
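
As a rough illustration of the personalisation idea, the sketch below turns a list of forecasts into one text-to-speech request body per location. Everything specific here is hypothetical: the function names, the wording of the script template and the sample forecast data are made up for illustration, and only the request body structure comes from the Create API example earlier in this post.

```python
def weather_script(area, conditions, temperature_f):
    """Fill a simple narration template for one location."""
    return (
        f"Moving down to the {area} area we are seeing {conditions}. "
        f"Temperatures are hovering around {temperature_f} degrees Fahrenheit."
    )


def voice_over_payloads(forecasts, voice="Matthew", language="en-US"):
    """Map each area name to a Create API text-to-speech request body."""
    return {
        forecast["area"]: {
            "provider": "shotstack",
            "options": {
                "type": "text-to-speech",
                "text": weather_script(**forecast),
                "voice": voice,
                "language": language,
            },
        }
        for forecast in forecasts
    }


# Hypothetical forecast data, for illustration only.
forecasts = [
    {"area": "Central", "conditions": "clear skies and a gentle breeze", "temperature_f": 55},
    {"area": "Northern", "conditions": "light showers", "temperature_f": 48},
]
payloads = voice_over_payloads(forecasts)
```

Each body in payloads could then be sent to the Create API in turn, and the resulting audio URLs placed into separate Edit API timelines.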

To learn more and get started, visit the official docs or check out more of our developer guides and tutorials.

Maab Saleem

BY MAAB SALEEM
1st February, 2024
