Convert audio to video with an AI generated image

Do you have an audio clip that you want to turn into a shareable video? Perhaps you want to add a visual element to your podcast, voice over, or lecture. Or you want to convert your audio into video content for social media.

This guide will walk you through how to create a video from an input audio file. We'll begin by generating a transcript of the audio file, and use that transcript to create an image prompt. Then, we'll use a text-to-image API to create an image based on the prompt. And finally, we'll put the audio and image together to generate a video.

About Shotstack and the Ingest, Create and Edit APIs

Shotstack is a cloud-based video editing platform. It's designed to make it convenient for developers to automate video editing at scale. Shotstack provides various APIs including the three we will use in this tutorial. These are the Ingest API, Create API, and Edit API.

We will be using the Ingest API to generate a transcript of the audio. We will also use the Create API to create the background image using AI. And for putting together the video, we will use the Edit API.

Requirements

To follow the steps outlined in the guide, you'll need the following:

  • A Shotstack API Key. Get one when you sign up for a free account.
  • Basic understanding of using the command line to run curl requests.

The following covers a step-by-step process of how to convert an audio file into a video with a background image.

Getting the transcript of the audio

For this guide, we will be using an audio excerpt from a financial podcast.

The audio needs to be available somewhere online, so we have uploaded it to the following URL: https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio/financial-podcast.mp3.

The next step is to send a POST request to the Ingest API to generate a transcript of the audio. The request should contain a JSON payload with details like the input audio's URL and the transcription format. In this case, we're using the SRT format.

Run the curl command below in your terminal to generate an SRT file from the audio. Replace SHOTSTACK_API_KEY with your API key.

curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: SHOTSTACK_API_KEY" \
https://api.shotstack.io/ingest/stage/sources \
-d '
{
"url": "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio/financial-podcast.mp3",
"outputs": {
"transcription": {
"format": "srt"
}
}
}'

If the request succeeds, you should see a response like the one below. Take note of the id. We will use it in the next step.

{
"data": {
"type": "source",
"id": "zzy885gw-1m3y-rv30-xfcw-4e2ykd4xloct"
}
}
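The same request can also be prepared from a script instead of the terminal. Below is a minimal Python sketch that uses only the standard library. The helper names build_ingest_request and send are our own for illustration and are not part of any Shotstack SDK; the endpoint, headers and payload simply mirror the curl command above.

```python
import json
import urllib.request

INGEST_URL = "https://api.shotstack.io/ingest/stage/sources"


def build_ingest_request(audio_url, api_key):
    """Build (but do not send) the POST request that asks for an SRT transcript."""
    payload = {
        "url": audio_url,
        "outputs": {"transcription": {"format": "srt"}},
    }
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Assuming the response has the shape shown above, send(build_ingest_request(audio_url, api_key))["data"]["id"] would give the id needed for the next step.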

Getting the status of the audio transcription

Wait a few seconds for the audio transcription process to complete. Then run the command below to send a GET request to the API. The response reports the transcription status and, once it is ready, the URL of the generated SRT file.

Again, make sure to replace SHOTSTACK_API_KEY with your API key. And replace ID with the id from the previous JSON response.

curl -X GET https://api.shotstack.io/ingest/stage/sources/ID \
-H 'Accept: application/json' \
-H 'x-api-key: SHOTSTACK_API_KEY'

You should see a response with relevant details about the generated file. It's going to look similar to this:

{
"data": {
"type": "source",
"id": "zzy885gw-1m3y-rv30-xfcw-4e2ykd4xloct",
"attributes": {
"id": "zzy885gw-1m3y-rv30-xfcw-4e2ykd4xloct",
"owner": "c2jsl2d4xd",
"input": "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio/financial-podcast.mp3",
"source": "https://shotstack-ingest-api-stage-sources.s3.ap-southeast-2.amazonaws.com/c2jsl2d4xd/zzy885gw-1m3y-rv30-xfcw-4e2ykd4xloct/source.mp3",
"status": "ready",
"outputs": {
"transcription": {
"status": "ready",
"url": "https://shotstack-ingest-api-stage-sources.s3.ap-southeast-2.amazonaws.com/c2jsl2d4xd/zzy885gw-1m3y-rv30-xfcw-4e2ykd4xloct/transcript.srt"
}
},
"duration": 67.38,
"created": "2024-04-03T07:08:27.673Z",
"updated": "2024-04-03T07:08:48.150Z"
}
}
}

Note: The status parameter under outputs.transcription may read waiting, processing or another value. If that's the case, retry the same GET request until the status reads ready.

When the status is ready, the response will include a URL to the output SRT file. For our example, the URL is:

https://shotstack-ingest-api-stage-sources.s3.ap-southeast-2.amazonaws.com/c2jsl2d4xd/zzy885gw-1m3y-rv30-xfcw-4e2ykd4xloct/transcript.srt
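Retrying the GET request by hand can be replaced with a small polling loop. The sketch below is one way to do it in Python; poll_until and its parameters are hypothetical names, and fetch can be any function that performs the GET request and returns the parsed JSON, so the loop itself makes no assumption about the HTTP client.

```python
import time


def poll_until(fetch, get_status, done_status, interval=3.0, max_attempts=20):
    """Call fetch() repeatedly until get_status(result) equals done_status.

    fetch: callable returning the parsed JSON of one GET request.
    get_status: callable that pulls the status string out of that JSON.
    Returns the final JSON, or raises TimeoutError after max_attempts tries.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if get_status(result) == done_status:
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval)  # wait before asking again
    raise TimeoutError(f"status never reached {done_status!r}")
```

For the transcription, get_status could be lambda body: body["data"]["attributes"]["outputs"]["transcription"]["status"] with done_status set to "ready". This sketch only stops on success or after the attempt limit; a production script would probably also want to stop early when the API reports a failure.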

The podcast audio transcription

Below is the transcript of the example podcast audio.

1
00:00:00,009 --> 00:00:02,589
Nutrition was launched just over a year ago

2
00:00:02,829 --> 00:00:05,980
um as part of Blackrock's Sustainable Thematic Suite

3
00:00:06,239 --> 00:00:09,260
and I've been named on it uh since the start of this year.

4
00:00:09,800 --> 00:00:14,989
And the fund's mandate is to invest in anything related to

5
00:00:15,140 --> 00:00:20,049
um food and beverage consumer trends. And our job is to

6
00:00:20,229 --> 00:00:25,309
uh one make sure that the fund invests in those fast and moving um rivers

7
00:00:25,415 --> 00:00:25,954
within that

8
00:00:26,094 --> 00:00:31,364
overall thematic, but also uh to abide by our sustainability mandate,

9
00:00:31,604 --> 00:00:35,244
which is to ensure that at least 70% of the fund is

10
00:00:35,255 --> 00:00:38,235
investing in companies which align with

11
00:00:38,244 --> 00:00:41,034
the United Nations Sustainability Development Goals.

12
00:00:41,689 --> 00:00:48,029
Um So it's important um and critical that the companies that we invest in broadly um

13
00:00:48,349 --> 00:00:52,470
are helping the world move towards a more sustainable food chain

14
00:00:52,680 --> 00:00:54,770
and of course, plant-based.

15
00:00:54,779 --> 00:00:58,830
And um some of the other topics that we're gonna talk about today are really

16
00:00:58,840 --> 00:01:04,500
big piece to it because the food chain is possibly one of the single most polluting

17
00:01:04,680 --> 00:01:06,589
pieces of humankind.

Using ChatGPT and the transcription to generate an image prompt

You've now successfully generated a transcript of the audio. Copy and paste the content of the transcript into ChatGPT. Ask ChatGPT to use the transcript to create a prompt to generate an image via an API call to a text-to-image service.

Here is an example prompt you can use:

Use the content of an SRT transcription file, pasted below, to write a prompt to generate
an image using an AI text-to-image service:

CONTENT_OF_SRT_FILE

Replace CONTENT_OF_SRT_FILE with the transcript we generated previously.

Below is the response we got from ChatGPT using the prompt above:

"A world map with highlights on areas with sustainable food production practices. Emphasize
plant-based agriculture and companies that align with the UN Sustainable Development
Goals. Include a flowing blue river representing fast-moving trends within the sustainable
food chain."

Let's use the prompt we got from ChatGPT to generate the image in the next step.

Note: If you want to automate this step, you could use the OpenAI API to generate the prompt. Our guide on converting articles to videos with ChatGPT does something very similar.
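If you do automate this step, the copy and paste part is easy to script. The Python sketch below fills the example prompt with the contents of an SRT file. The function names are illustrative only, and the strip_timing option, which removes cue numbers and timestamps before building the prompt, is our own optional variation; the guide itself pastes the SRT file as is. Sending the prompt to a language model is left out.

```python
import re

PROMPT_TEMPLATE = (
    "Use the content of an SRT transcription file, pasted below, to write a prompt "
    "to generate an image using an AI text-to-image service:\n\n{transcript}"
)

# Matches SRT timing lines such as "00:00:00,009 --> 00:00:02,589".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt):
    """Drop cue numbers and timestamp lines, keeping only the spoken text."""
    spoken = []
    blocks = re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = [line.strip() for line in block.split("\n")]
        # A cue is: index line, timestamp line, then one or more text lines.
        if len(lines) >= 3 and lines[0].isdigit() and TIMESTAMP.match(lines[1]):
            spoken.extend(lines[2:])
    return " ".join(spoken)


def build_prompt(srt, strip_timing=False):
    """Insert the transcript (raw SRT or plain text) into the example prompt."""
    transcript = srt_to_text(srt) if strip_timing else srt.strip()
    return PROMPT_TEMPLATE.format(transcript=transcript)
```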

Generating an image from the prompt using text-to-image AI

The next step is to send a POST request to the Create API to generate an image based on the prompt. We will use the built-in Shotstack text-to-image service, which uses generative AI to create the image. The payload sets the provider and, under options, the type, the prompt, and the image width and height.

Run the curl command below in your terminal. Make sure to replace SHOTSTACK_API_KEY with your API key.

curl -X POST \
-H 'Content-Type: application/json' \
-H 'x-api-key: SHOTSTACK_API_KEY' \
https://api.shotstack.io/create/stage/assets \
-d '
{
"provider": "shotstack",
"options": {
"type": "text-to-image",
"prompt": "A world map with highlights on areas with sustainable food production practices. Emphasize plant-based agriculture and companies that align with the UN Sustainable Development Goals. Include a flowing blue river representing fast-moving trends within the sustainable food chain.",
"width": 1024,
"height": 512
}
}'

A successful request should yield a response that looks like the one below. Take note of the value of id.

{
"data": {
"type": "asset",
"id": "01hth-fjy7w-jd184-znpdm-r1hrv6",
"attributes": {
"owner": "c2jsl2d4xd",
"provider": "shotstack",
"type": "text-to-image",
"status": "queued",
"created": "2024-04-03T08:00:42.295Z",
"updated": "2024-04-03T08:00:42.295Z"
}
}
}
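One practical catch with the curl command above is that the JSON body sits inside single quotes, so a prompt containing an apostrophe would end the shell string early. Building the payload in code and letting a JSON library handle the escaping avoids that. The following Python sketch shows the idea; build_text_to_image_payload is a hypothetical helper name, and the fields mirror the payload used above.

```python
import json


def build_text_to_image_payload(prompt, width=1024, height=512):
    """Return the Create API request body as a JSON string."""
    payload = {
        "provider": "shotstack",
        "options": {
            "type": "text-to-image",
            "prompt": prompt,
            "width": width,
            "height": height,
        },
    }
    return json.dumps(payload)


# Quotes and apostrophes in the prompt are escaped by json.dumps.
body = build_text_to_image_payload('A farmer\'s market labelled "plant-based"')
print(body)
```

The resulting string can be saved to a file and passed to curl with -d @filename, which keeps the prompt out of the shell's quoting rules altogether.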

Getting the text-to-image status

Wait a few seconds for the AI text-to-image generation to complete. Then run the command below to send a GET request that returns the status of the asset and, once it is done, the URL of the generated image. Replace ID with the id from the previous response, and SHOTSTACK_API_KEY with your API key.

curl -X GET https://api.shotstack.io/create/stage/assets/ID \
-H 'Accept: application/json' \
-H 'x-api-key: SHOTSTACK_API_KEY'

You should expect a response similar to this:

{
"data": {
"type": "asset",
"id": "01hth-fjy7w-jd184-znpdm-r1hrv6",
"attributes": {
"owner": "c2jsl2d4xd",
"provider": "shotstack",
"type": "text-to-image",
"url": "https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hth-fjy7w-jd184-znkdm-r1hrv7.png",
"status": "done",
"created": "2024-04-03T08:00:42.295Z",
"updated": "2024-04-03T08:00:52.312Z"
}
}
}

Note: If the status parameter shows anything other than done, simply retry the same GET request until its value is "done".

The response will contain a url parameter whose value is the URL to the generated image.

A sample image generated using our text prompt

Creating the video with the audio and image

Almost set! What's left to do is to put the audio and image together to make a video. We can do that using the Edit API.

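Create a new file named edit.json and paste the JSON below into it. Make sure to replace the value of src for the image asset with the URL from the previous response. The audio asset uses the URL of our podcast MP3 file. Both clips start at 0 and have a length of 67.38 seconds, matching the audio duration reported by the Ingest API, so the image stays on screen for the whole recording.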

{
"timeline": {
"background": "#000000",
"tracks": [
{
"clips": [
{
"asset": {
"type": "image",
"src": "https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hth-fjy7w-jd184-znkdm-r1hrv7.png"
},
"start": 0,
"length": 67.38,
"effect": "zoomIn"
}
]
},
{
"clips": [
{
"asset": {
"type": "audio",
"src": "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio/financial-podcast.mp3",
"volume": 1
},
"start": 0,
"length": 67.38
}
]
}
]
},
"output": {
"format": "mp4",
"resolution": "sd"
}
}
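If you plan to repeat this for other audio files, the edit can be generated instead of written by hand. Here is a minimal Python sketch that produces the same structure from an image URL, an audio URL and a duration. The function names build_edit and write_edit are our own; the field names and values are taken from the JSON above.

```python
import json


def build_edit(image_url, audio_url, duration):
    """Return the Edit API payload: one image track and one audio track."""
    image_clip = {
        "asset": {"type": "image", "src": image_url},
        "start": 0,
        "length": duration,  # keep the image on screen for the whole audio
        "effect": "zoomIn",
    }
    audio_clip = {
        "asset": {"type": "audio", "src": audio_url, "volume": 1},
        "start": 0,
        "length": duration,
    }
    return {
        "timeline": {
            "background": "#000000",
            "tracks": [{"clips": [image_clip]}, {"clips": [audio_clip]}],
        },
        "output": {"format": "mp4", "resolution": "sd"},
    }


def write_edit(path, edit):
    """Save the payload to a file that curl can read with -d @path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(edit, handle, indent=2)
```

Passing the duration value from the Ingest API response as the duration argument keeps both clips the same length as the audio.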

Then run the following command to send a POST request to the Edit API. The -d @edit.json argument tells curl to read the payload from the edit.json file.

curl -X POST \
-H 'Content-Type: application/json' \
-H 'x-api-key: SHOTSTACK_API_KEY' \
-d @edit.json \
https://api.shotstack.io/edit/stage/render

Here is an example of the response you will get back from the API.

{
"success": true,
"message": "Created",
"response": {
"message": "Render Successfully Queued",
"id": "bf65d0a2-3c78-453e-851a-05565fe0ab23"
}
}

Getting the video render status

Wait for a few seconds for the video to finish rendering. Then, run the following command to get the video URL.

Replace SHOTSTACK_API_KEY with your API key and ID with the id received in the previous JSON response.

curl -X GET \
-H 'Accept: application/json' \
-H 'x-api-key: SHOTSTACK_API_KEY' \
https://api.shotstack.io/edit/stage/render/ID

If successful, you will receive a response similar to this:

{
"success": true,
"message": "OK",
"response": {
"id": "bf65d0a2-3c78-453e-851a-05565fe0ab23",
"owner": "c2jsl2d4xd",
"plan": "sandbox",
"status": "done",
"error": "",
"duration": 67.38,
"billable": 67.38,
"renderTime": 16690.03,
"url": "https://shotstack-api-stage-output.s3-ap-southeast-2.amazonaws.com/c2jsl2d4xd/bf65d0a2-3c78-453e-851a-05565fe0ab23.mp4",
"poster": null,
"thumbnail": null,
"created": "2024-04-03T08:07:17.382Z",
"updated": "2024-04-03T08:07:35.482Z"
}
}

You can access the final video via the provided URL in the response. Copy and paste the URL to view it in your browser or to download it.
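When scripting the whole workflow, it helps to notice that the three APIs nest the status and URL differently in the sample responses in this guide: the Ingest API puts them under data.attributes.outputs.transcription and reports ready, the Create API under data.attributes and reports done, and the Edit API under response and reports done. The Python sketch below collects those lookups in one place. The function names are hypothetical, and the paths are based only on the example responses shown here.

```python
def transcription_result(body):
    """(status, url) of the transcription in an Ingest API response."""
    output = body["data"]["attributes"]["outputs"]["transcription"]
    return output["status"], output.get("url")


def asset_result(body):
    """(status, url) of a generated asset in a Create API response."""
    attributes = body["data"]["attributes"]
    return attributes["status"], attributes.get("url")


def render_result(body):
    """(status, url) of a render in an Edit API response."""
    response = body["response"]
    return response["status"], response.get("url")
```

Each function returns None for the URL while it is still missing from the response, as in the queued Create API response shown earlier.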

The final video

The final video is our audio file converted into a video with an AI-generated background image.

Conclusion

Now you know how you can convert audio to a video with a background image using the Shotstack APIs and generative AI. This guide only covered the very basics. You could expand on this tutorial and start adding text, multiple images, transitions and effects. There's so much more that you can achieve with our APIs. Check out our developer guides to learn how to convert YouTube videos to MP3s, generate videos from images, add AI voice overs to your videos, and more.

BY MAAB SALEEM
5th April, 2024
