# ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.

Setup: See Installation Guide. For JavaScript, use `@elevenlabs/*` packages only.

## Quick Start

### Python

```python
from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
    )

print(result.text)
```

### JavaScript

```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});

console.log(result.text);
```

### cURL

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2"
```

## Models

| Model ID | Description | Best For |
|---|---|---|
| `scribe_v2` | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| `scribe_v2_realtime` | Low latency (~150ms) | Live transcription, voice agents |

## Transcription with Timestamps

Word-level timestamps include type classification and speaker identification:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    timestamps_granularity="word",
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")
```

## Speaker Diarization

Identify who said what - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True,
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
```

## Keyterm Prompting

Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"],
)
```

## Language Detection

Detection is automatic; you can optionally pass a language hint:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng",  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")
```
## Supported Formats

- **Audio:** MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus
- **Video:** MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP
- **Limits:** up to 3GB file size, 10 hours duration
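Rejecting out-of-range files locally avoids a wasted upload. Here is a minimal sketch of such a pre-check, using the extension list and 3GB cap from the limits above; the helper name and the extension-based check are our own, not part of the SDK:

```python
import os

# Extensions and size cap taken from the supported-formats list above
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus",
    ".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".mpeg", ".3gp",
}
MAX_FILE_SIZE = 3 * 1024**3  # 3 GB

def validate_audio_file(path: str) -> None:
    """Raise ValueError if the file is clearly outside the API limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 3GB size limit")
```

Note this checks the extension only; the API itself inspects the actual container, and duration cannot be checked this way.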
## Response Format

```json
{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {
      "text": "The",
      "start": 0.0,
      "end": 0.15,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.15,
      "end": 0.16,
      "type": "spacing",
      "speaker_id": "speaker_0"
    }
  ]
}
```

Word types:

- `word` - an actual spoken word
- `spacing` - whitespace between words (useful for precise timing)
- `audio_event` - non-speech sounds the model detected (laughter, applause, music, etc.)
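The `words` array carries everything needed for subtitle generation. As an illustration, this sketch folds `word` and `spacing` entries into SRT cues; the cue-grouping logic and function names are our own, not an SDK feature:

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_cue_secs=4.0):
    """Group word/spacing entries into SRT cues of at most max_cue_secs each."""
    cues, current, cue_start = [], [], None
    for w in words:
        if w["type"] == "audio_event":
            continue  # skip non-speech sounds
        if cue_start is None:
            cue_start = w["start"]
        current.append(w["text"])
        if w["end"] - cue_start >= max_cue_secs:
            cues.append((cue_start, w["end"], "".join(current).strip()))
            current, cue_start = [], None
    if current:
        cues.append((cue_start, words[-1]["end"], "".join(current).strip()))
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        blocks.append(
            f"{i}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

The `spacing` entries are concatenated as-is, which is why the joined cue text keeps natural word separation.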
## Error Handling

```python
try:
    result = client.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
    )
except Exception as e:
    print(f"Transcription failed: {e}")
```

Common errors:

- `401` - Invalid API key
- `422` - Invalid parameters
- `429` - Rate limit exceeded
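Rate-limit (`429`) errors are often transient, so a retry with exponential backoff is a common pattern. A minimal sketch; the wrapper name and retry policy are ours, not part of the SDK, and in practice you would retry only on rate-limit errors and re-raise everything else immediately:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * 2**attempt)

# Hypothetical usage with the client from the examples above:
# result = with_retries(
#     lambda: client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
# )
```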
## Tracking Costs

Monitor usage via the `request-id` response header:

```python
response = client.speech_to_text.convert.with_raw_response(
    file=audio_file,
    model_id="scribe_v2",
)
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")
```
## Real-Time Streaming

For live transcription with ultra-low latency (~150ms), use the real-time API. It produces two types of transcripts:

- **Partial transcripts** - interim results that update frequently as audio is processed; use these for live feedback (e.g., showing text as the user speaks)
- **Committed transcripts** - final, stable results after you "commit"; use these as the source of truth for your application

A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.
### Python (Server-Side)

```python
import asyncio

from elevenlabs import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")

        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())
```
### JavaScript (Client-Side with React)

```jsx
import { useState } from "react";
import { useScribe, CommitStrategy } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");

  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    commitStrategy: CommitStrategy.VAD, // Auto-commit on silence for mic input
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });

  const start = async () => {
    // Get token from your backend (never expose API key to client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());

    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };

  return <button onClick={start}>Start Recording</button>;
}
```

## Commit Strategies

| Strategy | Description |
|---|---|
| Manual | You call `commit()` when ready - use for file processing or when you control the audio segments |
| VAD | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |

```jsx
// React: set commitStrategy on the hook (recommended for mic input)
import { useScribe, CommitStrategy } from "@elevenlabs/react";

const scribe = useScribe({
  modelId: "scribe_v2_realtime",
  commitStrategy: CommitStrategy.VAD,
  // Optional VAD tuning:
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
});
```

```javascript
// JavaScript client: pass vad config on connect
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});
```

## Event Types

| Event | Description |
|---|---|
| `partial_transcript` | Live interim results |
| `committed_transcript` | Final results after commit |
| `committed_transcript_with_timestamps` | Final results with word timing |
| `error` | Error occurred |

See the real-time references for complete documentation.

## References

- Installation Guide
- Transcription Options
- Real-Time Client-Side Streaming
- Real-Time Server-Side Streaming
- Commit Strategies
- Real-Time Event Reference