Azure Speech to Text REST API for Short Audio

Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required, just HTTP requests.

Prerequisites

1. Azure subscription: create one for free.
2. Speech resource: create one in the Azure Portal.
3. Credentials: after deployment, go to the resource > Keys and Endpoint.

Environment Variables

    # Required
    AZURE_SPEECH_KEY=<your-speech-resource-key>
    AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope

    # Alternative: Use endpoint directly
    AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com

Installation

    pip install requests

Quick Start

    import os
    import requests


    def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
        """Transcribe short audio file (max 60 seconds) using REST API."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]

        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

        headers = {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
        }

        params = {
            "language": language,
            "format": "detailed",  # or "simple"
        }

        with open(audio_file_path, "rb") as audio_file:
            response = requests.post(url, headers=headers, params=params, data=audio_file)

        response.raise_for_status()
        return response.json()

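
The function above assumes a 16 kHz WAV file of at most 60 seconds. A pre-flight check along the following lines may catch unsuitable files before they are uploaded. This is a minimal sketch using only the standard-library `wave` module; the names `check_wav` and `make_silent_wav` are hypothetical helpers introduced here for illustration, not part of the Azure API, and the mono requirement is taken from the Audio Requirements table in this document.

```python
import io
import wave

MAX_SECONDS = 60        # short-audio limit stated in this document
EXPECTED_RATE = 16000   # matches samplerate=16000 in the Content-Type header
EXPECTED_CHANNELS = 1   # mono, per the Audio Requirements table


def check_wav(source) -> list:
    """Return a list of problems; an empty list means the file looks suitable.

    `source` may be a file path (str) or a binary file-like object.
    """
    try:
        with wave.open(source, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM WAV data.
        return [f"not a readable PCM WAV file: {exc}"]

    seconds = frames / rate if rate else 0.0
    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels found, expected mono")
    if seconds > MAX_SECONDS:
        problems.append(f"duration is {seconds:.1f} s, limit is {MAX_SECONDS} s")
    return problems


def make_silent_wav(seconds: float, rate: int = 16000, channels: int = 1) -> io.BytesIO:
    """Build an in-memory 16-bit PCM WAV of silence (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(seconds * rate) * channels * 2))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav(make_silent_wav(1.0)))
    print(check_wav(make_silent_wav(1.0, rate=44100, channels=2)))
```

Returning a list of problems rather than raising on the first one lets a caller report everything that is wrong with a file in a single pass.
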

    # Usage
    result = transcribe_audio("audio.wav", "en-US")
    print(result["DisplayText"])

Audio Requirements

| Format | Codec | Sample Rate  | Notes             |
|--------|-------|--------------|-------------------|
| WAV    | PCM   | 16 kHz, mono | Recommended       |
| OGG    | OPUS  | 16 kHz, mono | Smaller file size |

Limitations:

- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)

Content-Type Headers

    # WAV PCM 16kHz
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"

    # OGG OPUS
    "Content-Type": "audio/ogg; codecs=opus"

Response Formats

Simple Format (default)

    params = {"language": "en-US", "format": "simple"}

    {
      "RecognitionStatus": "Success",
      "DisplayText": "Remind me to buy 5 pencils.",
      "Offset": "1236645672289",
      "Duration": "1236645672289"
    }

Detailed Format

    params = {"language": "en-US", "format": "detailed"}

    {
      "RecognitionStatus": "Success",
      "Offset": "1236645672289",
      "Duration": "1236645672289",
      "NBest": [
        {
          "Confidence": 0.9052885,
          "Display": "What's the weather like?",
          "ITN": "what's the weather like",
          "Lexical": "what's the weather like",
          "MaskedITN": "what's the weather like"
        }
      ]
    }

Chunked Transfer (Recommended)

For lower latency, stream audio in chunks:

    import os
    import requests


    def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
        """Stream audio in chunks for lower latency."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]

        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

        headers = {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
            "Transfer-Encoding": "chunked",
            "Expect": "100-continue",
        }

        params = {"language": language, "format": "detailed"}

        def generate_chunks(file_path: str, chunk_size: int = 1024):
            # Yield the file piece by piece; requests sends a generator body
            # using chunked transfer encoding.
            with open(file_path, "rb") as f:
                while chunk := f.read(chunk_size):
                    yield chunk

        response = requests.post(
            url,
            headers=headers,
            params=params,
            data=generate_chunks(audio_file_path),
        )

        response.raise_for_status()
        return response.json()

Authentication Options

Option 1: Subscription Key (Simple)

    headers = {
        "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
    }

Option 2: Bearer Token

    import os
    import requests


    def get_access_token() -> str:
        """Get access token from the token endpoint."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]

        token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

        response = requests.post(
            token_url,
            headers={
                "Ocp-Apim-Subscription-Key": api_key,
                "Content-Type": "application/x-www-form-urlencoded",
                "Content-Length": "0",
            },
        )
        response.raise_for_status()
        return response.text

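
Because a token is valid for 10 minutes, requesting a fresh one for every transcription is unnecessary. The following is a minimal caching sketch, assuming a 9-minute reuse window as suggested under Best Practices. `TokenCache` is a hypothetical helper introduced here for illustration; it accepts any zero-argument function that returns a token string, such as `get_access_token` above, and an injectable clock so that the expiry logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Reuse a bearer token until its reuse window has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 9 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch            # e.g. get_access_token
        self._ttl = ttl_seconds        # reuse window, shorter than token validity
        self._clock = clock            # injectable for testing
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it has expired."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token


if __name__ == "__main__":
    # Demonstration with a stand-in fetch function instead of a network call.
    counter = {"calls": 0}

    def fake_fetch() -> str:
        counter["calls"] += 1
        return f"token-{counter['calls']}"

    cache = TokenCache(fake_fetch)
    print(cache.get(), cache.get(), counter["calls"])
```

In real use the cache would be created once, for example `token_cache = TokenCache(get_access_token)`, and `token_cache.get()` called wherever a token is needed. This sketch is not thread-safe; concurrent callers would need a lock around `get`.
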

    # Use token in requests (valid for 10 minutes)
    token = get_access_token()
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }

Query Parameters

| Parameter | Required | Values               | Description                          |
|-----------|----------|----------------------|--------------------------------------|
| language  | Yes      | en-US, de-DE, etc.   | Language of speech                   |
| format    | No       | simple, detailed     | Result format (default: simple)      |
| profanity | No       | masked, removed, raw | Profanity handling (default: masked) |

Recognition Status Values

| Status                | Description                          |
|-----------------------|--------------------------------------|
| Success               | Recognition succeeded                |
| NoMatch               | Speech detected but no words matched |
| InitialSilenceTimeout | Only silence detected                |
| BabbleTimeout         | Only noise detected                  |
| Error                 | Internal service error               |

Profanity Handling

    # Mask profanity with asterisks (default)
    params = {"language": "en-US", "profanity": "masked"}

    # Remove profanity entirely
    params = {"language": "en-US", "profanity": "removed"}

    # Include profanity as-is
    params = {"language": "en-US", "profanity": "raw"}

Error Handling

    import os
    import requests


    def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
        """Transcribe with proper error handling."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]

        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

        try:
            with open(audio_path, "rb") as audio_file:
                response = requests.post(
                    url,
                    headers={
                        "Ocp-Apim-Subscription-Key": api_key,
                        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                        "Accept": "application/json",
                    },
                    params={"language": language, "format": "detailed"},
                    data=audio_file,
                )

            if response.status_code == 200:
                result = response.json()
                # HTTP 200 does not guarantee a transcript; check the status field too.
                if result.get("RecognitionStatus") == "Success":
                    return result
                else:
                    print(f"Recognition failed: {result.get('RecognitionStatus')}")
                    return None
            elif response.status_code == 400:
                print("Bad request: Check language code or audio format")
            elif response.status_code == 401:
                print("Unauthorized: Check API key or token")
            elif response.status_code == 403:
                print("Forbidden: Missing authorization header")
            else:
                print(f"Error {response.status_code}: {response.text}")

            return None

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            return None

Async Version

    import os
    import aiohttp
    import asyncio


    async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
        """Async version using aiohttp."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]

        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

        headers = {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
        }

        params = {"language": language, "format": "detailed"}

        async with aiohttp.ClientSession() as session:
            with open(audio_file_path, "rb") as f:
                audio_data = f.read()

            async with session.post(url, headers=headers, params=params, data=audio_data) as response:
                response.raise_for_status()
                return await response.json()

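
The function above requests the detailed format, whose response carries one or more hypotheses in `NBest`, each with a `Confidence` score. The sketch below shows one way to pick the display text of the most confident hypothesis. It assumes the response shapes shown under Response Formats; `best_transcript` is a hypothetical helper introduced for illustration, and the fallback to `DisplayText` covers simple-format responses.

```python
from typing import Optional


def best_transcript(result: dict) -> Optional[str]:
    """Return the display text of the highest-confidence hypothesis, if any."""
    candidates = result.get("NBest") or []
    if candidates:
        # Entries without a Confidence value are treated as least confident.
        best = max(candidates, key=lambda item: item.get("Confidence", 0.0))
        return best.get("Display")
    # Simple-format responses carry the text in DisplayText instead of NBest.
    return result.get("DisplayText")


if __name__ == "__main__":
    detailed = {
        "RecognitionStatus": "Success",
        "NBest": [
            {"Confidence": 0.61, "Display": "What's the whether like?"},
            {"Confidence": 0.9052885, "Display": "What's the weather like?"},
        ],
    }
    print(best_transcript(detailed))
```

Returning `None` when neither field is present leaves it to the caller to decide whether a missing transcript is an error.
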

    # Usage
    result = asyncio.run(transcribe_async("audio.wav", "en-US"))
    print(result["DisplayText"])

Supported Languages

Common language codes (see the full list in the Azure Speech language support documentation):

| Code  | Language            |
|-------|---------------------|
| en-US | English (US)        |
| en-GB | English (UK)        |
| de-DE | German              |
| fr-FR | French              |
| es-ES | Spanish (Spain)     |
| es-MX | Spanish (Mexico)    |
| zh-CN | Chinese (Mandarin)  |
| ja-JP | Japanese            |
| ko-KR | Korean              |
| pt-BR | Portuguese (Brazil) |

Best Practices

1. Use WAV PCM 16 kHz mono for best compatibility
2. Enable chunked transfer for lower latency
3. Cache access tokens for 9 minutes (valid for 10)
4. Specify the correct language for accurate recognition
5. Use detailed format when you need confidence scores
6. Handle all RecognitionStatus values in production code

When NOT to Use This API

Use the Speech SDK or Batch Transcription API instead when you need:

- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files

Reference Files

| File                                   | Contents                                        |
|----------------------------------------|-------------------------------------------------|
| references/pronunciation-assessment.md | Pronunciation assessment parameters and scoring |
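
The Best Practices advice to handle every RecognitionStatus value could be sketched as follows. The status names and descriptions are taken from the Recognition Status Values table in this document; `RecognitionError` and `require_success` are hypothetical names introduced for illustration, and raising an exception is only one possible policy (a caller might prefer to return an empty transcript for `NoMatch`, for instance).

```python
class RecognitionError(Exception):
    """Raised when the service reports a RecognitionStatus other than Success."""


# Descriptions copied from the Recognition Status Values table.
STATUS_MESSAGES = {
    "NoMatch": "Speech detected but no words matched",
    "InitialSilenceTimeout": "Only silence detected",
    "BabbleTimeout": "Only noise detected",
    "Error": "Internal service error",
}


def require_success(result: dict) -> dict:
    """Return the result unchanged on Success, otherwise raise RecognitionError."""
    status = result.get("RecognitionStatus")
    if status == "Success":
        return result
    # Statuses not listed in the table still raise, with a generic message.
    message = STATUS_MESSAGES.get(status, "Unrecognized status")
    raise RecognitionError(f"{status}: {message}")


if __name__ == "__main__":
    try:
        require_success({"RecognitionStatus": "InitialSilenceTimeout"})
    except RecognitionError as exc:
        print(f"Recognition failed ({exc})")
```

Keeping the status table in one dictionary means a newly documented status only needs a single added entry.
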