
How to Add Closed Captioning to a Video: Methods, Formats, and Code


Roughly 85% of videos on social platforms are watched with the sound off, and about 1 in 5 people experience some degree of hearing loss. Captions cover both groups at once — they make your content readable on a muted phone in a noisy room and accessible to viewers who are deaf or hard of hearing. They also give search engines text to index, which helps your videos rank.

This guide shows you how to add closed captioning to a video three ways: by hand, with automatic speech recognition, and through an API for apps that serve video at scale. You’ll see the caption file formats that matter, the exact steps for popular platforms, and the code needed to attach captions to an HLS stream so they play correctly across devices.

What Is Closed Captioning?

Closed captioning is a text version of the spoken audio and relevant sounds in a video that viewers can turn on or off. The word “closed” means the captions live in a separate track or file and stay hidden until the viewer enables them — unlike open captions, which are burned into the video frames and always visible.

Closed captions include more than dialogue. A proper caption track also describes speaker changes, music cues, and sound effects like [APPLAUSE] or [DOOR SLAMS], because the viewer may not hear them at all. That distinction matters for accessibility compliance and for the experience of deaf and hard-of-hearing audiences.

Closed Captions vs Subtitles vs Open Captions

People use these terms interchangeably, but they describe different things. The table below breaks down the differences.

Type | Toggle on/off? | Includes sound effects? | Primary purpose
Closed captions | Yes | Yes | Accessibility for deaf/hard-of-hearing viewers
Open captions | No (burned in) | Yes | Guaranteed display where toggling isn’t supported
Subtitles | Yes | No (dialogue only) | Translation for viewers who can hear but don’t speak the language

In short: captions assume the viewer can’t hear the audio, while subtitles assume they can hear but need a translation. The line blurs with SDH (Subtitles for the Deaf and Hard of Hearing), a label used by Apple’s players, among others, for subtitle tracks that also carry sound effects and speaker labels. If you want a deeper breakdown, see our guide on closed captioning vs subtitles.

Caption File Formats You Need to Know

Before you add closed captioning to a video, you need the captions in a file format your player or platform accepts. Four formats cover almost every workflow.

Format | Extension | Styling support | Best for
SubRip | .srt | Minimal | Uploads to YouTube, Facebook, LinkedIn
WebVTT | .vtt | Positioning, color, fonts | HTML5 video and HLS streaming
TTML / IMSC | .ttml, .xml | Rich layout | Broadcast and DASH workflows
CEA-608/708 | Embedded | Limited (608), rich (708) | Broadcast TV and in-band streams

SRT is the most widely accepted upload format. It’s plain text with sequential numbers, start/end timecodes, and the caption line. Most social and hosting platforms accept it directly.

WebVTT (Web Video Text Tracks) is the format for the modern web. It’s the only caption format the HTML5 <track> element supports natively, and it’s the standard for captions delivered inside HLS streams. WebVTT files must be UTF-8 encoded and start with a WEBVTT header line. Our VTT vs SRT comparison covers when to pick one over the other.

CEA-608 and CEA-708 are the legacy broadcast standards, embedded directly in the video signal rather than carried as a sidecar file. You’ll mostly encounter them when ingesting broadcast feeds or converting them to WebVTT for web delivery.

Here’s what a minimal WebVTT file looks like:

WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the live stream.

00:00:04.500 --> 00:00:07.200
[upbeat music]

00:00:07.500 --> 00:00:10.000
Today we're shipping captions.

The SRT equivalent is nearly identical, but it drops the WEBVTT header, numbers each cue, and uses a comma instead of a period before the milliseconds. Here are the first two cues:

1
00:00:01,000 --> 00:00:04,000
Welcome to the live stream.

2
00:00:04,500 --> 00:00:07,200
[upbeat music]
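
Because the two formats differ only in those details, converting between them is mostly mechanical. The sketch below shows one way to turn SRT text into WebVTT text in Python. The function name srt_to_vtt is made up for this example, and it assumes well-formed input where cues are separated by blank lines; it does not handle styling tags or other SRT extensions.

```python
import re

# Matches an SRT timecode such as 00:00:01,000 so the comma can become a period.
SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SubRip text to WebVTT text (illustrative helper, not a library API)."""
    # Normalize line endings and drop a byte-order mark if one is present.
    text = srt_text.replace("\r\n", "\n").replace("\r", "\n").lstrip("\ufeff")
    cues = []
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        # SRT cues start with a sequence number; WebVTT does not need it.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines or "-->" not in lines[0]:
            continue  # skip anything that is not a timed cue
        # Only the timing line is rewritten; caption text is left untouched.
        lines[0] = SRT_TIME.sub(r"\1.\2", lines[0])
        cues.append("\n".join(lines))
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Write the result to disk as UTF-8 with a .vtt extension and it can be used anywhere a WebVTT file is expected.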

Three Ways to Add Closed Captioning to a Video

There are three practical methods, and the right one depends on how many videos you handle and where they’re delivered.

Method | Speed | Accuracy | Best for
Manual | Slow | Highest | One-off videos, exact compliance
Automatic (ASR) | Fast | 85–95% | Bulk libraries, social clips
API / programmatic | Fast at scale | Depends on source | Apps and streaming platforms

The next three sections walk through each in order.

How to Add Closed Captioning to a Video Manually

Manual captioning gives you the most accurate result because a human writes and times every line. The trade-off is time — expect roughly 4 to 6 minutes of work per minute of video. Here are the steps.

  1. Transcribe the audio. Play the video and type out every spoken line, plus sound effects and speaker labels. Keep each caption to one or two lines so it fits on screen.
  2. Add timecodes. Note the start and end time for each caption in HH:MM:SS,mmm format for SRT, or HH:MM:SS.mmm (with a period) for WebVTT. Captions should appear as the words are spoken and stay on screen long enough to read, usually 1 to 6 seconds.
  3. Save as an SRT or VTT file. Use a plain-text editor — TextEdit on Mac (in plain-text mode) or Notepad on Windows. Save with the .srt or .vtt extension and UTF-8 encoding.
  4. Upload the file alongside your video. Most platforms have a “subtitles/CC” option where you attach the file. The platform syncs it to the video using the timecodes.

This approach works well for a handful of videos where accuracy is non-negotiable, such as legal, medical, or training content.

How to Add Closed Captions Automatically

Automatic speech recognition (ASR) generates a caption track from the audio in minutes. Modern engines hit 85–95% accuracy on clear audio, which is good enough for a first draft you then correct.

The workflow is consistent across tools:

  1. Upload the video to a captioning service or platform with built-in ASR.
  2. The engine transcribes the audio and generates timed captions.
  3. Review and edit the output — fix names, technical terms, and any spots where multiple people talk at once.
  4. Publish the corrected track or export it as an SRT/VTT file.

Cloud speech APIs power most of these tools under the hood. Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech all return word-level timestamps you can convert into WebVTT cues. If you’re building this into your own product, you’d extract the audio with a tool like FFmpeg, send it to the speech API, and format the response as a caption file.
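
The last step, turning word-level timestamps into caption cues, can be sketched in a few lines of Python. This example assumes you have already normalized the provider's response into (word, start seconds, end seconds) tuples, since each speech API returns a different shape. The names words_to_vtt, max_chars, and max_gap are invented for the sketch, and the grouping rule (start a new cue when the line would pass 32 characters or the speaker pauses for more than a second) is just one reasonable choice.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start seconds, end seconds)


def format_ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def words_to_vtt(words: List[Word], max_chars: int = 32, max_gap: float = 1.0) -> str:
    """Group timed words into single-line cues and return WebVTT text."""
    cues = []  # each cue is [start, end, text]
    for text, start, end in words:
        if cues:
            cur = cues[-1]
            fits = len(cur[2]) + 1 + len(text) <= max_chars
            close = start - cur[1] <= max_gap
            if fits and close:
                # Extend the current cue with this word.
                cur[1] = end
                cur[2] += " " + text
                continue
        # Otherwise the word opens a new cue.
        cues.append([start, end, text])
    body = "\n\n".join(f"{format_ts(s)} --> {format_ts(e)}\n{t}" for s, e, t in cues)
    return "WEBVTT\n\n" + body + "\n"
```

A production version would likely also break on sentence punctuation and speaker changes, which is where the human review pass earns its keep.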

Always review automatic captions before publishing. ASR struggles with proper nouns, accents, overlapping speakers, and background noise — exactly the details that matter for accessibility.

How to Add Captions on Popular Platforms

If you’re publishing to a hosted platform rather than your own app, each one has its own caption upload flow. Here are the most common.

  • YouTube: In YouTube Studio, open Subtitles from the left menu, pick a language, then choose Upload file to add an SRT, or type captions manually. YouTube can also auto-generate a draft you edit. Note that YouTube’s caption auto-sync was changed in 2024, so uploading a properly timed file is the reliable path.
  • Vimeo: Open a video’s settings, go to the Advanced or Distribution tab, find Subtitles, and upload an SRT or VTT file per language.
  • Facebook & LinkedIn: Both accept SRT files on upload. Facebook requires a specific filename format that includes the language locale (for example, video.en_US.srt).
  • Zoom & video conferencing: Live closed captioning is enabled in account settings, then turned on during a meeting, either through automatic captions or a captioner typing in real time.

These flows are fine for marketing clips and webinars. But if you’re building a product that hosts or streams video — an OTT service, a course platform, or a social app — you need captions to work programmatically across thousands of videos. That’s where an API approach takes over.

How to Add Closed Captioning to a Video via API

For developers, manual uploads don’t scale. When your application ingests user uploads or runs live events, you want captions attached through code as part of your pipeline. A video API handles this by accepting a caption file and associating it with a video asset.

The general pattern across video platforms looks like this:

  1. Authenticate and get an access token.
  2. Identify the video asset you want to caption by its ID.
  3. Upload the caption file (usually WebVTT) and tag it with a language code.
  4. Optionally set a default track so captions appear automatically.

A typical caption upload request looks like this (the endpoint, the form field name file, and the file name captions-en.vtt are placeholders):

curl -X POST "https://api.example.com/videos/VIDEO_ID/captions/en" \
  -H "Authorization: Bearer ACCESS_TOKEN" \
  -F "file=@captions-en.vtt"

You can repeat the call for each language, attaching fr, es, or de tracks to the same asset. The player then exposes a caption menu so viewers choose their language.
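
If your pipeline is in Python rather than shell, the same request can be built with the standard library. The sketch below only constructs the request objects; nothing is sent until you pass one to urllib.request.urlopen. The host, the URL path, and the multipart field name file are placeholders carried over from the generic example, not any particular vendor's API, so check your provider's documentation for the real values.

```python
import urllib.request
import uuid

API_BASE = "https://api.example.com"  # placeholder host, not a real service


def build_caption_upload(video_id: str, lang: str, vtt_bytes: bytes, token: str,
                         field: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart POST that attaches one caption track."""
    boundary = uuid.uuid4().hex
    # Hand-rolled multipart body with a single file part.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="captions-{lang}.vtt"\r\n'.encode(),
        b"Content-Type: text/vtt\r\n\r\n",
        vtt_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{API_BASE}/videos/{video_id}/captions/{lang}",  # hypothetical path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def build_all(video_id: str, tracks: dict, token: str) -> list:
    """One request per language; tracks maps a language code to WebVTT bytes."""
    return [build_caption_upload(video_id, lang, data, token)
            for lang, data in tracks.items()]
```

Calling build_all("VIDEO_ID", {"en": en_bytes, "fr": fr_bytes}, token) yields one request per language, which mirrors repeating the curl call for each track.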

When you build on a platform like LiveAPI, captions ride along with the same video hosting API and embeddable player you already use for playback. Because the platform handles video encoding and HLS packaging for you, your WebVTT track is delivered through the same global CDN as the video segments — no separate hosting or sync setup required. That matters when you’re shipping a streaming feature fast rather than building caption infrastructure from scratch.

Adding Captions in HTML5 Video

If you’re serving a single progressive MP4 in the browser, the HTML5 <track> element is the simplest path. Point it at a WebVTT file:

<video controls>
  <source src="lecture.mp4" type="video/mp4" />
  <track
    src="captions-en.vtt"
    kind="captions"
    srclang="en"
    label="English"
    default
  />
</video>

The kind="captions" attribute tells the browser this track is for accessibility (versus kind="subtitles" for translation), and default turns it on automatically. This works for short clips, but for adaptive streaming you need captions inside the HLS manifest.

How to Add Closed Captions to an HLS Stream

Most production video uses adaptive bitrate streaming over HLS so playback adjusts to each viewer’s connection. Captions in HLS are delivered “out of band” — as a separate WebVTT playlist referenced from the master manifest — so they can be toggled independently of the video.

You declare a subtitle track in the master .m3u8 playlist with an EXT-X-MEDIA tag. Playlist tags have no line-continuation syntax, so the whole tag must sit on one line, even though it is long:

#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en-US",AUTOSELECT=YES,DEFAULT=NO,FORCED=NO,CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog",URI="subtitles/en/playlist.m3u8"

Then each video variant references that subtitle group:

#EXT-X-STREAM-INF:BANDWIDTH=2000000,SUBTITLES="subs"
video/720p/playlist.m3u8

Three flags control caption behavior, and getting them right prevents the most common bugs:

  • AUTOSELECT=YES: the player may pick this track automatically based on the viewer’s system language or accessibility settings.
  • DEFAULT=NO: the track does not play unless the viewer (or AUTOSELECT) chooses it. Mark at most one track per group as DEFAULT=YES.
  • FORCED=NO: reserve YES for text that must always show, such as translations of foreign-language dialogue or on-screen signs.
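
If you generate the master playlist yourself, a small helper keeps these flags consistent across languages. The Python sketch below is one way to do it. The function names, the track tuple layout, and the default_lang parameter are all invented for illustration, and the code assumes every track is a regular (non-forced) caption track. Each tag is written on a single line, and at most one track can end up as DEFAULT=YES.

```python
from typing import List, Tuple

Track = Tuple[str, str, str]    # (language tag, display name, subtitle playlist URI)
Variant = Tuple[int, str]       # (bandwidth in bits per second, video playlist URI)

CAPTION_TRAITS = "public.accessibility.transcribes-spoken-dialog"


def subtitle_media_tags(tracks: List[Track], group_id: str = "subs",
                        default_lang: str = "") -> List[str]:
    """Return one single-line EXT-X-MEDIA tag per caption track."""
    tags = []
    for lang, name, uri in tracks:
        # Only the track matching default_lang is marked as the default.
        is_default = "YES" if lang == default_lang else "NO"
        tags.append(
            f'#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="{group_id}",NAME="{name}",'
            f'LANGUAGE="{lang}",AUTOSELECT=YES,DEFAULT={is_default},FORCED=NO,'
            f'CHARACTERISTICS="{CAPTION_TRAITS}",URI="{uri}"'
        )
    return tags


def master_playlist(tracks: List[Track], variants: List[Variant],
                    group_id: str = "subs", default_lang: str = "") -> str:
    """Build a master playlist whose variants all reference the caption group."""
    lines = ["#EXTM3U"]
    lines += subtitle_media_tags(tracks, group_id, default_lang)
    for bandwidth, uri in variants:
        lines.append(f'#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},SUBTITLES="{group_id}"')
        lines.append(uri)
    return "\n".join(lines) + "\n"
```

Leaving default_lang empty produces DEFAULT=NO on every track, which leaves the choice to the viewer and to AUTOSELECT.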

One detail trips up many developers: the subtitle playlist should be segmented in step with your video. If your media segments are 6 seconds, your subtitle playlist should reference 6-second WebVTT chunks, and a cue that crosses a segment boundary should appear in every segment it overlaps, so captions stay in sync as the player buffers. A managed live streaming API handles this segmentation automatically when you supply a caption file, which removes the trickiest part of the job.
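
To make the segmentation concrete, here is a rough Python sketch that splits a list of cues into fixed-length WebVTT segments and writes the matching subtitle playlist. The function name, the seg_00000.vtt naming scheme, and the mpegts_offset parameter are assumptions for this example. The X-TIMESTAMP-MAP header tells the player how cue times map onto the video's MPEG-TS timestamps, and the right MPEGTS value depends on how your encoder timestamps its first video segment, so treat the default of 0 as a placeholder.

```python
import math
from typing import Dict, List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def format_ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def segment_vtt(cues: List[Cue], total_duration: float,
                segment_duration: float = 6.0,
                mpegts_offset: int = 0) -> Tuple[str, Dict[str, str]]:
    """Return (subtitle playlist text, {file name: WebVTT segment text})."""
    count = max(1, math.ceil(total_duration / segment_duration))
    files: Dict[str, str] = {}
    playlist = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{math.ceil(segment_duration)}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for i in range(count):
        seg_start = i * segment_duration
        seg_end = min(seg_start + segment_duration, total_duration)
        # A cue goes into every segment it overlaps, so one that crosses a
        # boundary is repeated rather than cut off. Cue times stay absolute.
        active = [c for c in cues if c[0] < seg_end and c[1] > seg_start]
        blocks = [f"WEBVTT\nX-TIMESTAMP-MAP=MPEGTS:{mpegts_offset},LOCAL:00:00:00.000"]
        blocks += [f"{format_ts(s)} --> {format_ts(e)}\n{t}" for s, e, t in active]
        name = f"seg_{i:05d}.vtt"
        files[name] = "\n\n".join(blocks) + "\n"
        playlist.append(f"#EXTINF:{seg_end - seg_start:.3f},")
        playlist.append(name)
    playlist.append("#EXT-X-ENDLIST")
    return "\n".join(playlist) + "\n", files
```

A segment with no speech still gets a file containing only the header, so the subtitle playlist keeps the same number of entries as the video playlist.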

How to Add Live Closed Captioning

Live closed captioning adds text to a stream in real time, which is required for live events in many jurisdictions. There are two approaches.

Human captioners (a service often called CART, short for Communication Access Realtime Translation) type or re-speak the audio into captioning software that injects CEA-608/708 data or pushes WebVTT cues into the live manifest. This delivers the highest accuracy and is common for broadcast and regulated events.

Automatic live captions route the stream’s audio through a real-time speech API that returns text with a short delay — typically 2 to 5 seconds behind the audio. Accuracy is lower than human captioning but the cost is a fraction.

For app builders, the practical setup is to capture the audio from your RTMP or SRT ingest (here SRT means the Secure Reliable Transport protocol, not the SubRip caption format), send it to a streaming speech-to-text service, format the results as timed WebVTT cues, and push them into the live HLS playlist. Because captions and the recording flow through the same pipeline, live-to-VOD recordings keep the captions automatically when the event ends.
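
The "push them into the live HLS playlist" step usually means maintaining a sliding-window subtitle playlist next to the video one. The Python sketch below shows the bookkeeping only: cues arrive from the speech service, and each time a media segment is finalized you close the matching WebVTT segment and republish the playlist. The class and method names are hypothetical. It also assumes cues arrive before their segment closes, which real-time speech APIs do not guarantee, so a real pipeline would need to delay the subtitle playlist or otherwise absorb late results.

```python
import math
from collections import deque
from typing import Deque, List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text) in stream time


def format_ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


class LiveSubtitlePlaylist:
    """Sliding-window subtitle playlist for a live stream (illustrative only)."""

    def __init__(self, segment_duration: float = 6.0, window: int = 5) -> None:
        self.segment_duration = segment_duration
        self.next_index = 0
        self.pending: List[Cue] = []
        # Keeps only the newest `window` segments as (index, WebVTT text).
        self.segments: Deque[Tuple[int, str]] = deque(maxlen=window)

    def add_cue(self, start: float, end: float, text: str) -> None:
        self.pending.append((start, end, text))

    def close_segment(self) -> str:
        """Call once per finalized media segment; returns the new playlist."""
        seg_start = self.next_index * self.segment_duration
        seg_end = seg_start + self.segment_duration
        active = [c for c in self.pending if c[0] < seg_end and c[1] > seg_start]
        blocks = ["WEBVTT"] + [
            f"{format_ts(s)} --> {format_ts(e)}\n{t}" for s, e, t in active
        ]
        self.segments.append((self.next_index, "\n\n".join(blocks) + "\n"))
        # Keep only cues that are still on screen in the next segment.
        self.pending = [c for c in self.pending if c[1] > seg_end]
        self.next_index += 1
        return self.playlist()

    def playlist(self) -> str:
        first = self.segments[0][0] if self.segments else 0
        lines = [
            "#EXTM3U",
            "#EXT-X-VERSION:3",
            f"#EXT-X-TARGETDURATION:{math.ceil(self.segment_duration)}",
            f"#EXT-X-MEDIA-SEQUENCE:{first}",
        ]
        for index, _ in self.segments:
            lines.append(f"#EXTINF:{self.segment_duration:.3f},")
            lines.append(f"seg_{index:05d}.vtt")
        # No EXT-X-ENDLIST: the stream is still live.
        return "\n".join(lines) + "\n"
```

When the event ends, appending EXT-X-ENDLIST and keeping every segment instead of a window gives you the caption track for the recording.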

Best Practices for Accessible Captions

Adding a caption track is step one. These practices make captions genuinely useful and keep you compliant.

  • Match reading speed to comprehension. Keep captions to about 32 characters per line and 1 to 2 lines at a time. Display each cue for at least 1 second.
  • Identify speakers and sounds. Use labels like >> JOHN: for speaker changes and brackets for non-speech audio: [phone rings].
  • Sync tightly. Captions should appear within a fraction of a second of the spoken words. Drift over a long video is the most common complaint.
  • Use WebVTT for the web. It’s the format with the broadest player and HLS support and it allows positioning so captions don’t cover on-screen text.
  • Provide multiple languages where it helps. Attach separate tracks per language rather than burning one language into the frames.
  • Meet the standards that apply to you. In the US, the FCC closed captioning rules govern TV programming and online video that previously aired on TV with captions, and the WCAG guidelines set the accessibility bar for the web.
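
Some of these rules are easy to check automatically before you publish. As a rough illustration, the Python sketch below flags cues that break the line-length, line-count, and minimum-duration guidance from the first bullet. The function name and default thresholds are taken from this article's numbers rather than any formal standard, and the cues are assumed to be already parsed into (start, end, text) tuples.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def lint_cues(cues: List[Cue], max_chars: int = 32, max_lines: int = 2,
              min_seconds: float = 1.0) -> List[str]:
    """Return a human-readable list of readability problems, empty if none."""
    problems = []
    for n, (start, end, text) in enumerate(cues, start=1):
        lines = text.split("\n")
        duration = end - start
        if duration < min_seconds:
            problems.append(f"cue {n}: on screen for only {duration:.2f}s")
        if len(lines) > max_lines:
            problems.append(f"cue {n}: {len(lines)} lines, limit is {max_lines}")
        for line in lines:
            if len(line) > max_chars:
                problems.append(
                    f"cue {n}: line has {len(line)} characters, limit is {max_chars}"
                )
    return problems
```

Running a check like this in your upload pipeline catches formatting slips early, though it cannot judge accuracy or sync, which still need a human pass.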

Following these makes captions readable for everyone and reduces the risk of accessibility complaints. They also improve video SEO, since search engines read the caption text to understand and rank your content.

Is an API Approach Right for Your Project?

Use this quick checklist to decide whether to handle captions manually or through a video API:

  • You publish more than a few videos a month → an API removes the per-video upload work.
  • You run live events → you need programmatic live caption injection into the HLS manifest.
  • You serve a global audience → you’ll attach multiple language tracks per asset.
  • You’re building an app, not posting to a platform → you need captions delivered through your own player and CDN.

If two or more of these apply, building captions into your video streaming app through an API will save far more time than manual uploads ever could.

How to Add Closed Captioning to a Video: FAQ

What’s the difference between closed captions and subtitles?

Closed captions assume the viewer can’t hear the audio, so they include sound effects and speaker labels along with dialogue. Subtitles assume the viewer can hear but needs a translation, so they only carry dialogue. Both can be toggled on and off.

What file format should I use for captions?

Use SRT for uploads to platforms like YouTube, Facebook, and LinkedIn. Use WebVTT (.vtt) for HTML5 video and for captions inside HLS streams, since it’s the format with the broadest web and player support.

How do I add closed captioning to a video automatically?

Upload the video to a tool or platform with automatic speech recognition, let it generate a timed caption track, then review and correct the text before publishing. Cloud APIs like Google Speech-to-Text, AWS Transcribe, and Azure Speech power most of these tools.

Why is it called “closed” captioning?

“Closed” means the captions are stored separately from the video and stay hidden until the viewer turns them on. “Open” captions are burned into the video frames and can’t be turned off.

Can I add captions to a live stream?

Yes. Live closed captioning works either through a human captioner typing in real time or through an automatic speech-to-text service that injects WebVTT cues into the live HLS manifest with a few seconds of delay.

Do captions help with SEO?

Yes. Search engines can’t watch a video, but they can read its caption text. A caption track gives crawlers indexable content about the video, which can improve how it ranks and surfaces in search.

How accurate is automatic captioning?

Automatic speech recognition reaches roughly 85–95% accuracy on clear audio with a single speaker. Accuracy drops with accents, background noise, overlapping speakers, and technical jargon, so always review auto-generated captions before publishing.

Will captions stay synced if the video uses adaptive bitrate?

They will if the WebVTT track is segmented to match your video segments. HLS expects subtitle segments aligned to the media segments (for example, 6-second chunks), which a managed streaming platform handles automatically when you supply the caption file.

Add Captions to Your Video Without Building the Pipeline

Adding closed captioning to a video comes down to three choices: write captions by hand for maximum accuracy, generate them automatically with speech recognition, or attach them through an API when you’re serving video at scale. For apps and streaming products, the API route wins — it delivers WebVTT through the same HLS pipeline as your video, segments captions to stay in sync, and carries them into your live-to-VOD recordings without extra work.

LiveAPI gives you live streaming, video hosting, encoding, and an embeddable player through one developer-friendly API, so captions ship as part of the same delivery you already use. Get started with LiveAPI and add accessible, searchable captions to your videos in days, not months.
