Transcribe speech in audio or video files

🤖/speech/transcribe transcribes speech in audio or video files.

You can use the returned text in your application, or pass it on to other Robots, for example to filter audio or video files that contain (or do not contain) certain content, or to burn the text into images or video.

Other common use cases are automatically subtitling videos and making audio searchable.

Usage example

Transcribe speech in French from uploaded audio or video, and save it to a text file:

{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "aws",
      "source_language": "fr-FR",
      "format": "text"
    }
  }
}

Parameters

output_meta
Record<string, boolean> | boolean | Array<string>
Allows you to specify a set of metadata that is more expensive on CPU power to calculate, and thus is disabled by default to keep your Assemblies processing fast.

For images, you can add "has_transparency": true in this object to extract whether the image contains transparent parts, and "dominant_colors": true to extract an array of hexadecimal color codes from the image.

For videos, you can add "colorspace": true to extract the colorspace of the output video.

For audio, you can add "mean_volume": true to get a single value representing the mean average volume of the audio file.

You can also set this to false to skip metadata extraction and speed up transcoding.
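As a minimal sketch, assuming the audio option described above applies to files passing through this Robot, a Step that requests the mean volume could look like this (the Step name "transcribed" and the other values are only illustrative):

```json
{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "aws",
      "output_meta": {
        "mean_volume": true
      }
    }
  }
}
```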
result
boolean (default: false)
Whether the results of this Step should be present in the Assembly Status JSON
queue
batch
Setting the queue to "batch" manually downgrades the priority of jobs for this Step, so that jobs which do not need zero queue waiting times do not consume Priority job slots.
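For illustration, a transcription Step that can tolerate queue waiting times might be downgraded as follows. This is a sketch that reuses the values from the usage example; only the "queue" line is the point here:

```json
{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "aws",
      "queue": "batch"
    }
  }
}
```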
force_accept
boolean (default: false)
Force a Robot to accept a file type it would have ignored.

By default, Robots ignore files they are not familiar with. 🤖/video/encode, for example, will happily ignore input images.

With the force_accept parameter set to true, you can force Robots to accept all files thrown at them. This will typically lead to errors and should only be used for debugging or combatting edge cases.
ignore_errors
boolean | Array<meta | execute> (default: [])
Ignore errors during specific phases of processing.

Setting this to ["meta"] will cause the Robot to ignore errors during metadata extraction.

Setting this to ["execute"] will cause the Robot to ignore errors during the main execution phase.

Setting this to true is equivalent to ["meta", "execute"] and will ignore errors in both phases.
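A sketch of the array form, assuming you want transcription to proceed even when metadata extraction fails, while errors in the main execution phase are still reported:

```json
{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "aws",
      "ignore_errors": ["meta"]
    }
  }
}
```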
use
string | Array<string> | Array<object> | object
Specifies which Step(s) to use as input.
- You can pick any names for Steps except ":original" (reserved for user uploads handled by Transloadit)
- You can provide several Steps as input with arrays:
```
{
  "use": [
    ":original",
    "encoded",
    "resized"
  ]
}
```
Tip

That's likely all you need to know about use, but you can view Advanced use cases.
provider — required
aws | gcp | replicate | fal | transloadit
Which AI provider to leverage.

Transloadit outsources this task and abstracts the interface, so you can expect the same data structures, but different latencies and different information being returned. Cloud vendors each have areas they shine in, and we recommend trying them out to see which yields the best results for your use case.
granularity
full | list (default: "full")
Whether to return a full response ("full"), or a flat list of descriptions ("list").
format
json | meta | srt | text | webvtt (default: "json")
Output format for the transcription.
- "text" outputs a plain text file that you can store and process.
- "json" outputs a JSON file containing timestamped words.
- "srt" and "webvtt" output subtitle files of those respective file types, which can be stored separately or used in other encoding Steps.
- "meta" does not return a file, but stores the data inside Transloadit's file object (under ${file.meta.transcription.text}) that's passed around between encoding Steps, so that you can use the values to burn the data into videos, filter on them, etc.
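As an example of the subtitle formats listed above, the following sketch asks for an SRT file and sets "result": true so the file shows up in the Assembly Status JSON. The Step name "subtitles" is a made-up name for illustration:

```json
{
  "steps": {
    "subtitles": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "aws",
      "format": "srt",
      "result": true
    }
  }
}
```

Swapping "srt" for "webvtt" should yield a WebVTT file instead, and "meta" would store the text under ${file.meta.transcription.text} rather than producing a file.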
source_language
string (default: "en-US")
The spoken language of the audio or video. This will also be the language of the transcribed text.

The language should be specified in the BCP-47 format, such as "en-GB", "de-DE" or "fr-FR". Please also consult the list of supported languages for the gcp provider and the aws provider.
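A minimal sketch of setting the spoken language, here German with the gcp provider. Whether "de-DE" is available for a given provider is an assumption you should check against that provider's list of supported languages:

```json
{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "gcp",
      "source_language": "de-DE",
      "format": "text"
    }
  }
}
```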

target_language
string (default: "en-US")
This will also be the language of the written text.

The language should be specified in the BCP-47 (https://www.rfc-editor.org/rfc/bcp/bcp47.txt) format, such as "en-GB", "de-DE" or "fr-FR". Please consult the list of supported languages and voices.

Related blog posts

- Tech preview: new AI Robots for enhanced media processing (February 17, 2020)
- New feature: auto-transcribe videos with subtitles (March 8, 2021)
- Building a screen reader plugin with /text/speak Robot (June 3, 2021)
- Building an AI-powered video dubber with Transloadit (July 9, 2021)
- Celebrating Transloadit’s 2021 milestones and progress (January 31, 2022)
- Build a Reddit video subtitling bot with Transloadit (February 10, 2022)
- Creating engaging audio visualizations with Transloadit (April 2, 2023)
