# ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.

Setup: See Installation Guide. For JavaScript, use `@elevenlabs/*` packages only.

## Quick Start

### Python

```python
from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
    )

print(result.text)
```

### JavaScript

```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});

console.log(result.text);
```

### cURL

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2"
```

## Models

| Model ID | Description | Best For |
|---|---|---|
| `scribe_v2` | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| `scribe_v2_realtime` | Low latency (~150ms) | Live transcription, voice agents |

## Transcription with Timestamps

Word-level timestamps include type classification and speaker identification:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    timestamps_granularity="word",
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")
```

## Speaker Diarization

Identify who said what - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True,
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
```

## Keyterm Prompting

Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"],
)
```

## Language Detection

Detection is automatic; you can optionally pass a language hint:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng",  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")
```
## Supported Formats

- **Audio:** MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus
- **Video:** MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP
- **Limits:** up to 3GB file size, 10 hours duration
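Rejecting out-of-range files locally avoids a wasted upload. Here is a minimal sketch of such a pre-check, using the extension list and 3GB cap from the limits above; the helper name and the extension-based check are our own, not part of the SDK:

```python
import os

# Extensions and size cap taken from the supported-formats list above
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus",
    ".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".mpeg", ".3gp",
}
MAX_FILE_SIZE = 3 * 1024**3  # 3 GB

def validate_audio_file(path: str) -> None:
    """Raise ValueError if the file is clearly outside the API limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 3GB size limit")
```

Note this checks the extension only; the API itself inspects the actual container, and duration cannot be checked this way.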
## Response Format

```json
{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {
      "text": "The",
      "start": 0.0,
      "end": 0.15,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.15,
      "end": 0.16,
      "type": "spacing",
      "speaker_id": "speaker_0"
    }
  ]
}
```

Word types:

- `word` - an actual spoken word
- `spacing` - whitespace between words (useful for precise timing)
- `audio_event` - non-speech sounds the model detected (laughter, applause, music, etc.)
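The `words` array carries everything needed for subtitle generation. As an illustration, this sketch folds `word` and `spacing` entries into SRT cues; the cue-grouping logic and function names are our own, not an SDK feature:

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_cue_secs=4.0):
    """Group word/spacing entries into SRT cues of at most max_cue_secs each."""
    cues, current, cue_start = [], [], None
    for w in words:
        if w["type"] == "audio_event":
            continue  # skip non-speech sounds
        if cue_start is None:
            cue_start = w["start"]
        current.append(w["text"])
        if w["end"] - cue_start >= max_cue_secs:
            cues.append((cue_start, w["end"], "".join(current).strip()))
            current, cue_start = [], None
    if current:
        cues.append((cue_start, words[-1]["end"], "".join(current).strip()))
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        blocks.append(
            f"{i}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

The `spacing` entries are concatenated as-is, which is why the joined cue text keeps natural word separation.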
## Error Handling

```python
try:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
    )
except Exception as e:
    print(f"Transcription failed: {e}")
```

Common errors:

- `401` - Invalid API key
- `422` - Invalid parameters
- `429` - Rate limit exceeded
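Rate-limit (`429`) errors are often transient, so a retry with exponential backoff is a common pattern. A minimal sketch; the wrapper name and retry policy are ours, not part of the SDK, and in practice you would retry only on rate-limit errors and re-raise everything else immediately:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * 2**attempt)

# Hypothetical usage with the client from the examples above:
# result = with_retries(
#     lambda: client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
# )
```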
## Tracking Costs

Monitor usage via the `request-id` response header:

```python
response = client.speech_to_text.convert.with_raw_response(
    file=audio_file,
    model_id="scribe_v2",
)
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")
```
## Real-Time Streaming

For live transcription with ultra-low latency (~150ms), use the real-time API. It produces two types of transcripts:

- **Partial transcripts** - interim results that update frequently as audio is processed; use these for live feedback (e.g., showing text as the user speaks)
- **Committed transcripts** - final, stable results after you "commit"; use these as the source of truth for your application

A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.
### Python (Server-Side)

```python
import asyncio

from elevenlabs import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")

        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())
```
### JavaScript (Client-Side with React)

```jsx
import { useState } from "react";
import { useScribe, CommitStrategy } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");

  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    commitStrategy: CommitStrategy.VAD, // Auto-commit on silence for mic input
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });

  const start = async () => {
    // Get token from your backend (never expose API key to client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());

    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };

  return <button onClick={start}>Start Recording</button>;
}
```

## Commit Strategies

| Strategy | Description |
|---|---|
| Manual | You call `commit()` when ready - use for file processing or when you control the audio segments |
| VAD | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |

```jsx
// React: set commitStrategy on the hook (recommended for mic input)
import { useScribe, CommitStrategy } from "@elevenlabs/react";

const scribe = useScribe({
  modelId: "scribe_v2_realtime",
  commitStrategy: CommitStrategy.VAD,
  // Optional VAD tuning:
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
});
```

```javascript
// JavaScript client: pass vad config on connect
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});
```

## Event Types

| Event | Description |
|---|---|
| `partial_transcript` | Live interim results |
| `committed_transcript` | Final results after commit |
| `committed_transcript_with_timestamps` | Final results with word timing |
| `error` | Error occurred |

See the real-time references for complete documentation.

## References

- Installation Guide
- Transcription Options
- Real-Time Client-Side Streaming
- Real-Time Server-Side Streaming
- Commit Strategies
- Real-Time Event Reference