# speech-to-text

Installs: 1.7K
Rank: #952

## Install

```bash
npx skills add https://github.com/elevenlabs/skills --skill speech-to-text
```

## ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.

**Setup:** See the Installation Guide. For JavaScript, use `@elevenlabs/*` packages only.

## Quick Start

### Python

```python
from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

print(result.text)
```

### JavaScript

```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});

console.log(result.text);
```

### cURL

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2"
```

## Models

| Model ID | Description | Best For |
| --- | --- | --- |
| `scribe_v2` | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| `scribe_v2_realtime` | Low latency (~150ms) | Live transcription, voice agents |

## Transcription with Timestamps

Word-level timestamps include type classification and speaker identification:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    timestamps_granularity="word",
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")
```

## Speaker Diarization

Identify WHO said WHAT - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True,
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
```

## Keyterm Prompting

Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"],
)
```

## Language Detection

Detection is automatic, with an optional language hint:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng",  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")
```
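Building on the Speaker Diarization example above, a minimal sketch (not an SDK function) that collapses the per-word output into speaker turns; it assumes the `type` and `speaker_id` fields shown in the Response Format below:

```python
def speaker_turns(words):
    # Hypothetical helper: group consecutive words by speaker into turns.
    turns = []  # list of (speaker_id, [word texts])
    for w in words:
        if w.type != "word":
            continue  # skip spacing and audio_event entries
        if turns and turns[-1][0] == w.speaker_id:
            turns[-1][1].append(w.text)
        else:
            turns.append((w.speaker_id, [w.text]))
    return [(speaker, " ".join(texts)) for speaker, texts in turns]

for speaker, text in speaker_turns(result.words):
    print(f"[{speaker}] {text}")
```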
## Supported Formats

- **Audio:** MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus
- **Video:** MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP
- **Limits:** up to 3 GB file size, 10 hours duration
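Given these limits, a small pre-flight check lets you fail fast before uploading. A local sketch using only the documented 3 GB cap (checking the 10-hour duration cap would require probing the audio):

```python
import os

MAX_BYTES = 3 * 1024**3  # documented 3 GB file-size limit

def assert_uploadable(path: str) -> None:
    # Local sanity check only; no API call is made here.
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path} is {size / 1024**3:.2f} GiB, over the 3 GB limit")
```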
## Response Format

```json
{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {
      "text": "The",
      "start": 0.0,
      "end": 0.15,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.15,
      "end": 0.16,
      "type": "spacing",
      "speaker_id": "speaker_0"
    }
  ]
}
```
Word types:

- `word` - an actual spoken word
- `spacing` - whitespace between words (useful for precise timing)
- `audio_event` - non-speech sounds the model detected (laughter, applause, music, etc.)
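Because `word` entries carry start and end times, subtitle generation is straightforward. A minimal sketch assuming the response shape above; the ten-words-per-cue grouping is an arbitrary choice, not an API convention:

```python
def to_srt(words, per_cue=10):
    # Hypothetical helper: emit SRT cues from word-level timestamps.
    def ts(sec):
        h, rem = divmod(sec, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02}:{int(m):02}:{s:06.3f}".replace(".", ",")

    spoken = [w for w in words if w.type == "word"]  # drop spacing/audio_event
    lines = []
    for i in range(0, len(spoken), per_cue):
        cue = spoken[i : i + per_cue]
        lines += [
            str(i // per_cue + 1),
            f"{ts(cue[0].start)} --> {ts(cue[-1].end)}",
            " ".join(w.text for w in cue),
            "",
        ]
    return "\n".join(lines)
```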
## Error Handling

```python
try:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
except Exception as e:
    print(f"Transcription failed: {e}")
```
Common errors:

| Code | Meaning |
| --- | --- |
| 401 | Invalid API key |
| 422 | Invalid parameters |
| 429 | Rate limit exceeded |
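For 429s in particular, retrying with exponential backoff is the usual remedy. A sketch for batch use; matching on the stringified error is a shortcut for illustration, not a documented SDK contract:

```python
import time

def transcribe_with_retry(client, path, attempts=3):
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                return client.speech_to_text.convert(file=f, model_id="scribe_v2")
        except Exception as e:
            if "429" in str(e) and attempt < attempts - 1:
                time.sleep(2 ** attempt)  # back off 1s, then 2s, before retrying
            else:
                raise
```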
## Tracking Costs

Monitor usage via the `request-id` response header:

```python
response = client.speech_to_text.convert.with_raw_response(
    file=audio_file,
    model_id="scribe_v2",
)
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")
```
## Real-Time Streaming

For live transcription with ultra-low latency (~150ms), use the real-time API. The real-time API produces two types of transcripts:

- **Partial transcripts** - interim results that update frequently as audio is processed; use these for live feedback (e.g., showing text as the user speaks)
- **Committed transcripts** - final, stable results after you "commit"; use these as the source of truth for your application

A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.

### Python (Server-Side)

```python
import asyncio

from elevenlabs import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")
        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())
```

### JavaScript (Client-Side with React)

```jsx
import { useState } from "react";
import { useScribe, CommitStrategy } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");

  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    commitStrategy: CommitStrategy.VAD, // Auto-commit on silence for mic input
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });

  const start = async () => {
    // Get token from your backend (never expose API key to client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());
    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };

  return <button onClick={start}>Start Recording</button>;
}
```

### Commit Strategies

| Strategy | Description |
| --- | --- |
| Manual | You call `commit()` when ready - use for file processing or when you control the audio segments |
| VAD | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |

```javascript
// React: set commitStrategy on the hook (recommended for mic input)
import { useScribe, CommitStrategy } from "@elevenlabs/react";

const scribe = useScribe({
  modelId: "scribe_v2_realtime",
  commitStrategy: CommitStrategy.VAD,
  // Optional VAD tuning:
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
});

// JavaScript client: pass vad config on connect
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});
```

### Event Types

| Event | Description |
| --- | --- |
| `partial_transcript` | Live interim results |
| `committed_transcript` | Final results after commit |
| `committed_transcript_with_timestamps` | Final results with word timing |
| `error` | Error occurred |

See the real-time references for complete documentation.

## References

- Installation Guide
- Transcription Options
- Real-Time Client-Side Streaming
- Real-Time Server-Side Streaming
- Commit Strategies
- Real-Time Event Reference
