
Rich Captions (beta)

info

The rich-caption asset is currently in beta. Because this asset is subject to change, we do not recommend using it in production workflows yet.

The rich-caption asset provides advanced captioning with word-level animations, active word highlighting, and rich-text styling. It supports karaoke-style effects, bounce, pop, slide, and more, all synchronized to your audio. Use it with SRT/VTT subtitle files, or auto-generate captions from audio and video clips using aliases.

Common rich-caption patterns

Before diving into individual properties, here are solutions to common captioning needs:

Automated styled word-by-word captions

Create custom-styled captions in which each word is highlighted as it's spoken, auto-generated from a video clip. Transcription is driven by the alias property: set alias on the clip you want transcribed, then reference that clip in the src property of the rich-caption asset using the alias:// prefix.

{
  "timeline": {
    "tracks": [
      {
        "clips": [
          {
            "asset": {
              "type": "rich-caption",
              "src": "alias://scott",
              "font": {
                "family": "_gP_1RrxsjcxVyin9l9n_j2RStR3qDpraA",
                "size": 52,
                "color": "#ffffff",
                "opacity": 1,
                "weight": 700
              },
              "wordAnimation": {
                "style": "highlight"
              },
              "border": {
                "width": 0,
                "color": "#000000",
                "opacity": 1,
                "radius": 18
              },
              "style": {
                "textTransform": "uppercase"
              },
              "padding": {
                "top": 0,
                "right": 0,
                "bottom": 0,
                "left": 0
              },
              "stroke": {
                "width": 3,
                "color": "#000000",
                "opacity": 1
              },
              "active": {
                "stroke": {
                  "width": 3,
                  "color": "#000000",
                  "opacity": 1
                },
                "font": {
                  "background": "#690be9"
                }
              }
            },
            "start": 0,
            "length": "end",
            "width": 522,
            "height": 187,
            "offset": {
              "x": 0,
              "y": 0
            },
            "transform": {
              "rotate": {
                "angle": -7.5
              }
            }
          }
        ]
      },
      {
        "clips": [
          {
            "asset": {
              "type": "video",
              "src": "https://shotstack-assets.s3.amazonaws.com/footage/scott-ko.mp4"
            },
            "start": 0,
            "length": "auto",
            "alias": "scott"
          }
        ]
      }
    ],
    "fonts": [
      {
        "src": "https://fonts.gstatic.com/s/luckiestguy/v25/_gP_1RrxsjcxVyin9l9n_j2RStR3qDpraA.ttf"
      }
    ]
  },
  "output": {
    "size": {
      "width": 1024,
      "height": 576
    },
    "format": "mp4"
  }
}

Simple SRT/VTT captions

Create captions from an SRT or VTT file by setting src to the file's URL:

{
  "timeline": {
    "tracks": [
      {
        "clips": [
          {
            "asset": {
              "type": "rich-caption",
              "src": "https://shotstack-assets.s3.amazonaws.com/captions/transcript.srt",
              "font": {
                "family": "Roboto",
                "size": 28,
                "color": "#ffffff"
              },
              "align": {
                "vertical": "bottom"
              },
              "stroke": {
                "width": 2
              }
            },
            "start": 0,
            "length": "end"
          }
        ]
      },
      {
        "clips": [
          {
            "asset": {
              "type": "video",
              "src": "https://shotstack-assets.s3.amazonaws.com/footage/scott-ko.mp4"
            },
            "start": 0,
            "length": "auto"
          }
        ]
      }
    ]
  },
  "output": {
    "format": "mp4",
    "size": {
      "width": 1280,
      "height": 720
    }
  }
}

Basic usage

Using an SRT or VTT file

The simplest rich-caption uses an external subtitle file:

{
  "asset": {
    "type": "rich-caption",
    "src": "https://shotstack-assets.s3.amazonaws.com/captions/transcript.srt"
  },
  "start": 0,
  "length": "end"
}

The src property accepts a URL to a publicly accessible SRT or VTT file. The captions are timed and displayed automatically based on the subtitle file's timestamps.
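To make "timed based on the subtitle file's timestamps" concrete, here is a minimal Python sketch that converts one SRT or VTT timing line into start and end times in seconds. It only illustrates the file format; it is not Shotstack's parser, and the function name parse_cue_timing is hypothetical. It assumes timings that include an hours field.

```python
import re

# SRT timing lines look like "00:00:01,000 --> 00:00:04,500".
# WebVTT uses "." instead of "," before the milliseconds, so accept both.
# Assumption: the hours field is present (WebVTT also allows omitting it).
_TIMING = re.compile(
    r"(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})\s+-->\s+"
    r"(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_cue_timing(line: str) -> tuple[float, float]:
    """Return (start, end) in seconds for one SRT/VTT timing line."""
    match = _TIMING.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end


start, end = parse_cue_timing("00:00:01,000 --> 00:00:04,500")
print(start, end)
```

A cue with this timing line would be shown from 1 second to 4.5 seconds into the clip.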

Auto-generated captions

You can automatically generate captions from audio or video clips using aliases. Assign an alias to your media clip, then reference it with the alias:// prefix. Shotstack automatically detects the language and generates word-level captions from the audio. We support 99 languages.

{
  "timeline": {
    "tracks": [
      {
        "clips": [
          {
            "asset": {
              "type": "rich-caption",
              "src": "alias://speech"
            },
            "start": 0,
            "length": "end"
          }
        ]
      },
      {
        "clips": [
          {
            "alias": "speech",
            "asset": {
              "type": "video",
              "src": "https://shotstack-assets.s3.amazonaws.com/footage/scott-ko.mp4"
            },
            "start": 0,
            "length": "auto"
          }
        ]
      }
    ]
  },
  "output": {
    "format": "mp4",
    "size": {
      "width": 1280,
      "height": 720
    }
  }
}
info

Auto-captioning works with video, audio, and text-to-speech clips. See Aliases for more details on declaring and referencing aliases.
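Because an alias:// reference only works when some clip declares the matching alias, it can help to check a payload before submitting it. The following Python sketch is a local pre-flight check under that assumption; the helper name unresolved_aliases is hypothetical and is not part of any Shotstack SDK.

```python
ALIAS_PREFIX = "alias://"


def unresolved_aliases(edit: dict) -> list[str]:
    """Return alias names referenced via alias:// that no clip declares."""
    # Flatten every clip on every track into one list.
    clips = [
        clip
        for track in edit.get("timeline", {}).get("tracks", [])
        for clip in track.get("clips", [])
    ]
    # Aliases are declared on the clip, next to "start" and "length".
    declared = {clip["alias"] for clip in clips if "alias" in clip}
    # References live in the asset's "src" with the alias:// prefix.
    referenced = [
        clip["asset"]["src"][len(ALIAS_PREFIX):]
        for clip in clips
        if str(clip.get("asset", {}).get("src", "")).startswith(ALIAS_PREFIX)
    ]
    return [name for name in referenced if name not in declared]


edit = {
    "timeline": {
        "tracks": [
            {"clips": [{"asset": {"type": "rich-caption", "src": "alias://speech"},
                        "start": 0, "length": "end"}]},
            {"clips": [{"alias": "speech",
                        "asset": {"type": "video", "src": "https://example.com/clip.mp4"},
                        "start": 0, "length": "auto"}]},
        ]
    }
}
print(unresolved_aliases(edit))  # an empty list means every reference resolves
```

The example.com URL is a placeholder. A non-empty result lists the alias names that are referenced but never declared.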

Fonts

Rich captions use the same font system as Rich Text, including the same built-in fonts, custom font loading, and international font support.

Active word styling

What are active words? The active property isolates the exact word currently being spoken in the audio and applies distinct styling to it (such as a color change, size increase, or background highlight). It temporarily overrides the base font style to create dynamic, real-time text tracking.

The active property supports font, stroke, and shadow. Each works the same way as its base counterpart: only the values you set override the base styling, and everything else is inherited.

{
  "timeline": {
    "tracks": [
      {
        "clips": [
          {
            "asset": {
              "type": "rich-caption",
              "src": "alias://source_310e0028",
              "font": {
                "family": "V8mDoQDjQSkFtoMM3T6r8E7mDbZyCts0DqQ",
                "size": 152,
                "color": "#ffffff",
                "weight": 700
              },
              "align": {
                "vertical": "middle"
              },
              "stroke": {
                "width": 2,
                "color": "#000000",
                "opacity": 1
              },
              "wordAnimation": {
                "style": "slide"
              },
              "active": {
                "font": {
                  "color": "#efbf04",
                  "opacity": 1
                }
              }
            },
            "start": 0,
            "length": "end"
          }
        ]
      },
      {
        "clips": [
          {
            "asset": {
              "type": "audio",
              "src": "https://shotstack-video-hosting.s3.amazonaws.com/documentation/rich-captions/the-great-dictator-trimmed.m4a",
              "effect": "fadeOut"
            },
            "start": 0,
            "length": "auto",
            "alias": "source_310e0028"
          }
        ]
      }
    ],
    "fonts": [
      {
        "src": "https://fonts.gstatic.com/s/spacegrotesk/v22/V8mDoQDjQSkFtoMM3T6r8E7mDbZyCts0DqQ.ttf"
      }
    ]
  },
  "output": {
    "format": "mp4",
    "size": {
      "width": 1080,
      "height": 1920
    }
  }
}

Here the base styling sets white text with a black stroke. The active override changes only the color to gold (#efbf04); the font family, size, weight, and stroke are all inherited from the base.
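The inheritance rule can be modelled as a shallow merge of each active group over its base group. The Python sketch below illustrates that reading of the rule using the values from the example; it is not the renderer's actual code, and effective_active_style is a hypothetical helper name.

```python
def effective_active_style(asset: dict) -> dict:
    """Overlay asset["active"] onto the base font/stroke/shadow settings."""
    active = asset.get("active", {})
    resolved = {}
    for group in ("font", "stroke", "shadow"):
        base = asset.get(group, {})
        override = active.get(group, {})
        if base or override:
            # Keys set in the override win; all other keys come from the base.
            resolved[group] = {**base, **override}
    return resolved


asset = {
    "font": {
        "family": "V8mDoQDjQSkFtoMM3T6r8E7mDbZyCts0DqQ",
        "size": 152,
        "color": "#ffffff",
        "weight": 700,
    },
    "stroke": {"width": 2, "color": "#000000", "opacity": 1},
    "active": {"font": {"color": "#efbf04", "opacity": 1}},
}

style = effective_active_style(asset)
print(style["font"]["color"])   # "#efbf04", taken from the active override
print(style["font"]["size"])    # 152, inherited from the base font
print(style["stroke"])          # unchanged base stroke
```

Under this model, the active word keeps the base family, size, weight, and stroke, and only the color and opacity come from the override.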

Animations

The wordAnimation property controls how words are highlighted or revealed as they're spoken. Choose from 8 animation styles.

karaoke

Word-by-word color fill as spoken. All words are visible from the start; the active word changes to the active color.

"wordAnimation": {
  "style": "karaoke"
}

highlight

Similar to karaoke: the active word changes to the active color when spoken, but with a more immediate color transition.

"wordAnimation": {
  "style": "highlight"
}

pop

Each word scales up when active, creating an energetic, punchy effect.

"wordAnimation": {
  "style": "pop"
}

fade

Gradual opacity transition per word. Words fade in as they become active.

"wordAnimation": {
  "style": "fade"
}

slide

Words slide in from a direction. Use the direction property to control where words slide from.

"wordAnimation": {
  "style": "slide",
  "direction": "up"
}

Direction options: left, right, up (default), down

bounce

Spring animation on word appearance. Words bounce into place with a natural spring effect.

"wordAnimation": {
  "style": "bounce"
}

typewriter

Words appear one by one and stay visible. Each word is revealed in sequence as it's spoken, building up the full caption.

"wordAnimation": {
  "style": "typewriter"
}

none

No animation. All words are visible immediately with no highlighting or transitions.

"wordAnimation": {
  "style": "none"
}
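
The eight styles and four slide directions listed above can be checked locally before a payload is sent. This Python sketch validates a wordAnimation object against exactly those documented values; check_word_animation is a hypothetical helper, and how the API itself reacts to unknown values is not covered here.

```python
# Values copied from the animation styles described in this section.
WORD_ANIMATION_STYLES = {
    "karaoke", "highlight", "pop", "fade",
    "slide", "bounce", "typewriter", "none",
}
SLIDE_DIRECTIONS = {"left", "right", "up", "down"}


def check_word_animation(word_animation: dict) -> list[str]:
    """Return a list of problems found; an empty list means it looks valid."""
    problems = []
    style = word_animation.get("style")
    if style not in WORD_ANIMATION_STYLES:
        problems.append(f"unknown style: {style!r}")
    if style == "slide":
        # "up" is the documented default when direction is omitted.
        direction = word_animation.get("direction", "up")
        if direction not in SLIDE_DIRECTIONS:
            problems.append(f"unknown direction: {direction!r}")
    return problems


print(check_word_animation({"style": "slide", "direction": "left"}))
print(check_word_animation({"style": "spin"}))
```

The first call reports no problems; the second reports the unrecognised style.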

Migrating from legacy captions

The rich-caption asset is the successor to the caption asset. Your existing SRT/VTT files and alias:// references work identically. Change "type": "caption" to "type": "rich-caption" and you can start using the new styling options.
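
For payloads stored as JSON, the type change can be applied in bulk. The Python sketch below rewrites only the asset type and leaves every other property untouched; migrate_captions is a hypothetical helper, and whether legacy styling properties on a caption asset carry over unchanged is not stated here, so review those by hand.

```python
import copy


def migrate_captions(edit: dict) -> dict:
    """Return a copy of the edit with caption assets retyped as rich-caption."""
    migrated = copy.deepcopy(edit)  # leave the caller's payload unmodified
    for track in migrated.get("timeline", {}).get("tracks", []):
        for clip in track.get("clips", []):
            asset = clip.get("asset", {})
            if asset.get("type") == "caption":
                asset["type"] = "rich-caption"
    return migrated


legacy = {
    "timeline": {
        "tracks": [
            {"clips": [{"asset": {"type": "caption", "src": "alias://speech"},
                        "start": 0, "length": "end"}]}
        ]
    }
}
updated = migrate_captions(legacy)
print(updated["timeline"]["tracks"][0]["clips"][0]["asset"]["type"])
```

The src value, including any alias:// reference, is carried over as-is.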

  • Rich Text - Static text with rich styling
  • Aliases - Auto-generate captions from audio/video clips
  • Positioning - Position and scale caption containers