Use Amazon Polly to create a video voice over

An introduction to Amazon Polly

Text-to-speech can greatly improve the user experience of your application. It offers your users more than one way to consume and interact with the content on your app. Several services let you add this functionality to your app, including Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Text-to-Speech. In this article, we will look at Amazon Polly.

Amazon Polly is part of Amazon Web Services (AWS) and can be used to turn text into lifelike speech. It uses deep learning to synthesize natural-sounding human speech and offers different lifelike male and female voices across a broad set of languages. We'll use the service to convert text into speech and then use it as a voice over for our video.

Perform text-to-speech with Amazon Polly

You can convert text to speech using Amazon Polly in three ways: using the web console, with the AWS CLI, or by using an SDK in your preferred language. We'll cover the last two, but if you prefer using the console, you can find the instructions in the Amazon Polly documentation.

Convert text to speech with the AWS Command Line Interface (AWS CLI)

All operations that can be performed on the Amazon Polly console can be done using the AWS Command Line Interface, except listening to the synthesized audio; you'll have to open the downloaded audio file in a media player. Let's see how you can use the CLI to create a voice over.

Before you can use the AWS CLI, you'll first need to install it and set up credentials for your user account. Note that there are two supported versions of the CLI: version 1 and version 2. In this article we'll use version 2, which includes the latest features that AWS has to offer. After installing the CLI, configure it with your AWS credentials, which will give the tool the necessary permissions to interact with AWS.

To convert text to speech with the CLI, use the synthesize-speech command. You can directly pass it a string of text as shown below:

$ aws polly synthesize-speech \
--output-format mp3 \
--voice-id Joanna \
--text 'Hello. Here is some sample synthesized speech.' \
test.mp3

Or use a text file containing the text to be synthesized:

$ aws polly synthesize-speech \
--output-format mp3 \
--voice-id Joanna \
--text file://test.txt \
test.mp3

The commands above are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) at the end of each line with a caret (^) and use double quotation marks (") around the text instead of single quotes (').

We specify the output format of the audio file, as well as the voice ID that should be used. There are several voices available in various languages. You can get a list of available voices with the describe-voices command:

# List all available voices
$ aws polly describe-voices

# List the available voices for a specific language
$ aws polly describe-voices \
--language-code ru-RU

After running the synthesize-speech command, a file containing the synthesized speech will be saved with the name we specified in the command (test.mp3, in our case).

Convert text to speech in Node.js with the AWS SDK for JavaScript

Next, let's see how we can convert text to speech using code. We'll use the @aws-sdk/client-polly library in a Node.js application to convert text to speech using the Amazon Polly API.

First, create a new Node.js project:

$ npm init -y

Then, install the @aws-sdk/client-polly package:

$ npm install @aws-sdk/client-polly

Create a file named index.js at the root of your project and add the following code to it.

const {
  PollyClient,
  SynthesizeSpeechCommand,
} = require("@aws-sdk/client-polly");
const fs = require("fs");

// Set the AWS Region
const REGION = "us-east-2";

// Create an Amazon Polly service client object
const client = new PollyClient({ region: REGION });

// Parameters for the synthesis request
const params = {
  Text: "Hello, this Node.js script will send this text to the AWS API where it will be converted to audio by AWS Polly.",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
};

const synthesizeText = async () => {
  try {
    const data = await client.send(new SynthesizeSpeechCommand(params));

    // Write the returned audio stream to a local file
    data.AudioStream.pipe(fs.createWriteStream("audio.mp3"));
  } catch (err) {
    console.error("Error:", err);
  }
};

synthesizeText();

To use the SDK, you have to first set AWS account credentials that will determine the resources that the SDK can access. If you have previously installed the AWS CLI and set up credentials, then you are good to go. They are stored at ~/.aws/credentials (on Linux, Unix and macOS) and C:\Users\USER_NAME\.aws\credentials (on Windows).

We specify the same parameters as we did with the CLI (Text, OutputFormat and VoiceId) and use them to create an instance of SynthesizeSpeechCommand. This command can synthesize text or SSML (Speech Synthesis Markup Language) into a stream of bytes. We pass it to the send() method of an Amazon Polly client object, which makes the call to the API. If everything goes well, the response will contain an AudioStream, which we save to a file called audio.mp3.

Run the app with node index.js and a file named audio.mp3 will be saved in the root directory of the project. The file will contain the audio version of the text we specified.
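One caveat with pipe() is that it neither waits for the file to finish writing nor reports stream errors. The following is a minimal sketch of an alternative, assuming Node.js 15 or later. The helper name saveAudioStream is hypothetical and not part of the AWS SDK. It relies on pipeline() from stream/promises, which resolves once the file is fully written and rejects if either stream fails.

```javascript
const fs = require("fs");
const { pipeline } = require("stream/promises");

// Hypothetical helper (not part of the AWS SDK): writes any readable stream
// to disk and resolves only after the file has been fully written.
// pipeline() also rejects if either stream emits an error.
const saveAudioStream = async (audioStream, filePath) => {
  await pipeline(audioStream, fs.createWriteStream(filePath));
  return filePath;
};
```

Inside synthesizeText, the pipe() line could then become await saveAudioStream(data.AudioStream, "audio.mp3"), so that any write error is caught by the existing try/catch block.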

Saving the audio file to an S3 Bucket

You can also save the audio file to an S3 bucket by using StartSpeechSynthesisTaskCommand instead of SynthesizeSpeechCommand and including the OutputS3BucketName parameter with the name of the bucket you want to save to.

const {
  PollyClient,
  StartSpeechSynthesisTaskCommand,
} = require("@aws-sdk/client-polly");

// Set the AWS Region
const REGION = "us-east-2";

// Create an Amazon Polly service client object
const client = new PollyClient({ region: REGION });

// Parameters for the synthesis task, including the destination bucket
const params = {
  Text: "Hello, this Node.js script will send this text to the AWS API where it will be converted to audio by AWS Polly.",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  OutputS3BucketName: "shotstack-audio",
};

const synthesizeText = async () => {
  try {
    const data = await client.send(new StartSpeechSynthesisTaskCommand(params));
    // The task runs asynchronously, so the file may not exist in the bucket yet
    console.log(`Synthesis task started; output will be saved to the ${params.OutputS3BucketName} bucket`);
    console.log(data);
  } catch (err) {
    console.error("Error:", err);
  }
};

synthesizeText();

When you run the above code, you will get a response similar to the one below:

SynthesisTask: {
  CreationTime: 2021-05-26T13:23:17.408Z,
  Engine: undefined,
  LanguageCode: undefined,
  LexiconNames: undefined,
  OutputFormat: 'mp3',
  OutputUri: 'https://s3.us-east-2.amazonaws.com/shotstack-audio/299558ee-fac5-448f-b7b5-9726ba1d5cf3.mp3',
  RequestCharacters: 111,
  SampleRate: undefined,
  SnsTopicArn: undefined,
  SpeechMarkTypes: undefined,
  TaskId: '299558ee-fac5-448f-b7b5-9726ba1d5cf3',
  TaskStatus: 'scheduled',
  TaskStatusReason: undefined,
  TextType: 'text',
  VoiceId: 'Joanna'
}

StartSpeechSynthesisTaskCommand creates an asynchronous synthesis task. When a task is created, a SynthesisTask object is returned in the response. It includes a link to the output file (OutputUri) and the status of the task (TaskStatus), among other things. TaskId holds the task identifier, which is also the name of the output mp3 file stored in the S3 bucket.
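Because the task is asynchronous, the file at OutputUri may not exist yet when the response arrives. One way to wait for it is to poll the task with GetSpeechSynthesisTaskCommand until TaskStatus becomes completed or failed. The sketch below is only illustrative: the helper names waitForSynthesisTask and getTaskFromPolly, the polling interval and the attempt limit are all assumptions rather than part of the SDK.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls a task until Polly reports a final status.
// `getTask` is any async function that takes a task ID and returns the
// SynthesisTask object; the interval and attempt limit are arbitrary defaults.
const waitForSynthesisTask = async (
  getTask,
  taskId,
  { intervalMs = 2000, maxAttempts = 30 } = {}
) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const task = await getTask(taskId);
    if (task.TaskStatus === "completed") return task;
    if (task.TaskStatus === "failed") {
      throw new Error(`Synthesis task failed: ${task.TaskStatusReason}`);
    }
    // Still "scheduled" or "inProgress": wait before asking again
    await sleep(intervalMs);
  }
  throw new Error(`Task ${taskId} did not finish after ${maxAttempts} attempts`);
};

// Hypothetical adapter that looks a task up through the SDK. The require is
// placed inside the function so the SDK is only loaded on a real lookup.
const getTaskFromPolly = (client) => async (taskId) => {
  const { GetSpeechSynthesisTaskCommand } = require("@aws-sdk/client-polly");
  const data = await client.send(
    new GetSpeechSynthesisTaskCommand({ TaskId: taskId })
  );
  return data.SynthesisTask;
};
```

With the client from the example above, a call such as await waitForSynthesisTask(getTaskFromPolly(client), data.SynthesisTask.TaskId) would resolve with the finished task, whose OutputUri then points at the stored file.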

Using Speech Synthesis Markup Language (SSML)

Besides the plain text we've used so far, Amazon Polly also supports Speech Synthesis Markup Language (SSML), which lets you add effects such as pauses, emphasis, and changes in speaking rate or volume to the output speech. Below is an example of SSML. Check the documentation for a list of supported tags.

<speak> 
Shotstack provides the media generation infrastructure that customers can use to build video applications and automate their workflow.
<prosody rate='120%'> <prosody volume='loud'>
Let's do it faster! <amazon:breath duration='long' volume='loud'/> </prosody> Shotstack provides the media generation infrastructure that customers can use to build video applications and automate their workflow.
</prosody>
</speak>
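To send SSML through the SDK, the request needs TextType set to "ssml" alongside the other parameters. If the spoken text comes from user input or a file, characters that are reserved in XML (such as & and <) also need to be escaped first. The sketch below shows one way to do both. The helper names escapeXml and buildSsml and the default rate and volume are assumptions made for illustration.

```javascript
// Escapes the five characters that are reserved in XML so that arbitrary
// text can be placed safely inside SSML tags.
const escapeXml = (text) =>
  text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");

// Hypothetical helper: wraps text in <speak> and a single <prosody> tag.
// The default rate and volume are arbitrary illustrative values.
const buildSsml = (text, { rate = "100%", volume = "medium" } = {}) =>
  `<speak><prosody rate="${rate}" volume="${volume}">${escapeXml(text)}</prosody></speak>`;

// Parameters for SynthesizeSpeechCommand; TextType tells Polly that
// Text contains SSML rather than plain text.
const params = {
  Text: buildSsml("Let's do it faster & louder!", { rate: "120%", volume: "loud" }),
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
};

console.log(params.Text);
```

These params can be passed to SynthesizeSpeechCommand in the same way as in the earlier Node.js example.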

Add audio to a video with the Shotstack API

Text-to-speech services have many use cases, one of which is creating a voice over from text and overlaying it onto a video. This lets you create a video with a voice over without having to employ a voice over artist. You can, for instance, create a video by putting together various stock images or videos and then adding the voice over to it.

Most video editing tools can do this, but they usually take some skill to use properly. You are also usually limited to processing one video at a time, and the processing uses considerable computing power. To get around this, you can use the Shotstack API.

Shotstack is a tool that you can use to create and edit video in the cloud. To use it, you define a configuration in JSON format that determines how assets such as images, videos, audio, and fonts will be arranged and used when rendering a video. You then post this to the API, either directly (via curl or Wget, for instance) or by using one of the available SDKs. Shotstack will then render your video according to the specified instructions.

You can use it to process multiple videos in parallel, and since the rendering happens in the cloud, you don't have to worry about it hogging your computer's resources. For a brief introduction, follow this Hello World quick-start guide.

Below is a sample of audio generated by Amazon Polly, overlaid on a video by Shotstack, with some soft background music added:

Here is the JSON that created the video:

{
  "timeline": {
    "soundtrack": {
      "src": "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/music/unminus/berlin.mp3",
      "volume": 0.15
    },
    "tracks": [
      {
        "clips": [
          {
            "asset": {
              "type": "audio",
              "src": "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio/polly-voiceover.mp3"
            },
            "start": 0,
            "length": 21.6
          },
          {
            "asset": {
              "type": "video",
              "src": "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio-waveforms/circle-spectrum.mp4"
            },
            "start": 0,
            "length": 21.6
          }
        ]
      }
    ]
  },
  "output": {
    "format": "mp4",
    "resolution": "sd",
    "aspectRatio": "1:1"
  }
}
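If you want to produce several voice over videos, the same JSON can be generated from a few inputs instead of being edited by hand. The function below is a rough sketch of that idea. The name buildVoiceoverEdit and its parameter names are hypothetical, and the fixed values (soundtrack volume, output settings) are simply copied from the JSON above.

```javascript
// Hypothetical helper: builds a Shotstack edit with the same shape as the
// JSON above from a handful of inputs.
const buildVoiceoverEdit = ({ voiceoverUrl, videoUrl, musicUrl, length }) => ({
  timeline: {
    soundtrack: { src: musicUrl, volume: 0.15 },
    tracks: [
      {
        clips: [
          { asset: { type: "audio", src: voiceoverUrl }, start: 0, length },
          { asset: { type: "video", src: videoUrl }, start: 0, length },
        ],
      },
    ],
  },
  output: { format: "mp4", resolution: "sd", aspectRatio: "1:1" },
});

const edit = buildVoiceoverEdit({
  voiceoverUrl:
    "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio/polly-voiceover.mp3",
  videoUrl:
    "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/audio-waveforms/circle-spectrum.mp4",
  musicUrl:
    "https://shotstack-assets.s3-ap-southeast-2.amazonaws.com/music/unminus/berlin.mp3",
  length: 21.6,
});

// Print the edit as JSON so it can be redirected to a file
console.log(JSON.stringify(edit, null, 2));
```

Swapping voiceoverUrl for the OutputUri of a Polly synthesis task would produce an edit for a newly generated voice over, provided the file is publicly readable.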

To send the JSON to the Shotstack API for rendering, save it to a file named waveform.json and run the following command in your terminal from the same directory. Replace YOUR_API_KEY with your own API key.

$ curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d @waveform.json \
https://api.shotstack.io/stage/render

You should get a similar response to the one below:

{
"success": true,
"message": "Created",
"response": {
"message": "Render Successfully Queued",
"id": "ec8d1eb3-5a0f-4210-b8f1-e2a8a941870b"
}
}

To get a link to the rendered video, copy the id from the output of the previous step and insert it in the command below as part of the URL (again, replace YOUR_API_KEY with your API key).

$ curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/stage/render/ec8d1eb3-5a0f-4210-b8f1-e2a8a941870b

The response reports the status of the render. Once the render is done, the response includes a url parameter whose value is the link to your rendered video, similar to https://shotstack-api-stage-output.s3-ap-southeast-2.amazonaws.com/h1m3kfhjai/ec8d1eb3-5a0f-4210-b8f1-e2a8a941870b.mp4.
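The two requests above can also be made from Node.js. The sketch below queues a render and then polls until it finishes. It assumes Node.js 18 or later (for the built-in fetch) and assumes that the status response exposes response.status, with the values "done" and "failed" marking the end of a render, next to response.url. Check these field names against the Shotstack API reference. The function name renderVideo, the polling interval and the attempt limit are hypothetical.

```javascript
const SHOTSTACK_URL = "https://api.shotstack.io/stage/render";

// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: queues a render, then polls until it finishes.
// Assumes the status response contains response.status and response.url.
// `fetchFn` defaults to the global fetch and can be swapped out for testing.
const renderVideo = async (
  edit,
  apiKey,
  { fetchFn = fetch, intervalMs = 3000, maxAttempts = 60 } = {}
) => {
  const headers = { "Content-Type": "application/json", "x-api-key": apiKey };

  // Equivalent to the POST request above
  const queued = await fetchFn(SHOTSTACK_URL, {
    method: "POST",
    headers,
    body: JSON.stringify(edit),
  });
  if (!queued.ok) throw new Error(`Render request failed: ${queued.status}`);
  const { response: { id } } = await queued.json();

  // Equivalent to the GET request above, repeated until the render finishes
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchFn(`${SHOTSTACK_URL}/${id}`, { headers });
    if (!res.ok) throw new Error(`Status request failed: ${res.status}`);
    const { response } = await res.json();
    if (response.status === "done") return response.url;
    if (response.status === "failed") throw new Error(`Render ${id} failed`);
    await sleep(intervalMs);
  }
  throw new Error(`Render ${id} did not finish after ${maxAttempts} attempts`);
};
```

Calling await renderVideo(edit, "YOUR_API_KEY"), where edit is the parsed contents of waveform.json, would resolve with the URL of the finished video.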

Next steps

There is a whole range of ways to use synthesized speech to spruce up your videos. In addition to using synthesized audio, you can also use a service such as Amazon Transcribe to generate subtitles, which will make your videos accessible to a wider audience.

Jeff Shillitto

BY JOYCE ECHESSA
28th June, 2021