Azure OpenAI Service - 2025 Models and Features
Complete knowledge base for Azure OpenAI Service with latest 2025 models including GPT-5, GPT-4.1, reasoning models, and Azure AI Foundry integration.
Overview
Azure OpenAI Service provides REST API access to OpenAI's most powerful models with enterprise-grade security, compliance, and regional availability.
Latest Models (2025)

GPT-5 Series (GA August 2025)
Registration Required Models:
- gpt-5-pro: Highest capability, complex reasoning
- gpt-5: Balanced performance and cost
- gpt-5-codex: Optimized for code generation
No Registration Required:
- gpt-5-mini: Faster, more affordable
- gpt-5-nano: Ultra-fast for simple tasks
- gpt-5-chat: Optimized for conversational use

GPT-4.1 Series
- gpt-4.1: 1 million token context window
- gpt-4.1-mini: Efficient version with 1M context
- gpt-4.1-nano: Fastest variant
Key Improvements:
- 1,000,000 token context (vs. 128K in GPT-4 Turbo)
- Better instruction following
- Reduced hallucinations
- Improved multilingual support

Reasoning Models
o4-mini: Lightweight reasoning model
- Faster inference
- Lower cost
- Suitable for structured reasoning tasks
o3: Advanced reasoning model
- Complex problem solving
- Mathematical reasoning
- Scientific analysis
o1: Original reasoning model
- General-purpose reasoning
- Step-by-step explanations
o1-mini: Efficient reasoning
- Balanced cost and performance

Image Generation
GPT-image-1 (2025-04-15)
- DALL-E 3 successor
- Higher quality images
- Better prompt understanding
- Improved safety filters

Video Generation
Sora (2025-05-02)
- Text-to-video generation
- Realistic and imaginative scenes
- Up to 60 seconds of video
- Multiple camera angles and styles

Audio Models
gpt-4o-transcribe: Speech-to-text powered by GPT-4o
- High-accuracy transcription
- Multiple languages
- Speaker diarization
gpt-4o-mini-transcribe: Faster, more affordable transcription
- Good accuracy
- Lower latency
- Cost-effective

Deploying Azure OpenAI

Create Azure OpenAI Resource
# Create the Azure OpenAI resource
az cognitiveservices account create \
  --name myopenai \
  --resource-group MyRG \
  --kind OpenAI \
  --sku S0 \
  --location eastus \
  --custom-domain myopenai \
  --public-network-access Disabled \
  --identity-type SystemAssigned

# Get the endpoint and key
az cognitiveservices account show \
  --name myopenai \
  --resource-group MyRG \
  --query "properties.endpoint" \
  --output tsv

az cognitiveservices account keys list \
  --name myopenai \
  --resource-group MyRG \
  --query "key1" \
  --output tsv
Deploy GPT-5 Model
# Deploy gpt-5
az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name gpt-5 \
  --model-name gpt-5 \
  --model-version latest \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 100 \
  --scale-type Standard

# Deploy gpt-5-pro (requires registration)
az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name gpt-5-pro \
  --model-name gpt-5-pro \
  --model-version latest \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 50
Deploy Reasoning Models
# Deploy the o3 reasoning model
az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name o3-reasoning \
  --model-name o3 \
  --model-version latest \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 50

# Deploy o4-mini
az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name o4-mini \
  --model-name o4-mini \
  --model-version latest \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 100
Deploy GPT-4.1 with 1M Context

az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name gpt-4-1 \
  --model-name gpt-4.1 \
  --model-version latest \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 100
Deploy Image Generation Model

az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name image-gen \
  --model-name gpt-image-1 \
  --model-version 2025-04-15 \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 10
Deploy Sora Video Generation

az cognitiveservices account deployment create \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name sora \
  --model-name sora \
  --model-version 2025-05-02 \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 5
Using Azure OpenAI Models

Python SDK (GPT-5)

from openai import AzureOpenAI
import os
# Initialize the client
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2025-02-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

# GPT-5 completion
response = client.chat.completions.create(
    model="gpt-5",  # deployment name
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=1000,
    temperature=0.7,
    top_p=0.95
)

print(response.choices[0].message.content)
Python SDK (o3 Reasoning Model)
# o3 reasoning with chain-of-thought
response = client.chat.completions.create(
    model="o3-reasoning",
    messages=[
        {"role": "system", "content": "You are an expert problem solver. Show your reasoning step-by-step."},
        {"role": "user", "content": "If a train travels 120 km in 2 hours, then speeds up to travel 180 km in the next 2 hours, what is the average speed for the entire journey?"}
    ],
    max_tokens=2000,
    temperature=0.2  # Lower temperature for reasoning tasks
)

print(response.choices[0].message.content)
Python SDK (GPT-4.1 with 1M Context)
# Read a large document
with open('large_document.txt', 'r') as f:
    document = f.read()

# GPT-4.1 can handle up to 1M tokens
response = client.chat.completions.create(
    model="gpt-4-1",
    messages=[
        {"role": "system", "content": "You are a document analysis expert."},
        {"role": "user", "content": f"Analyze this document and provide key insights:\n\n{document}"}
    ],
    max_tokens=4000
)

print(response.choices[0].message.content)
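Even with a 1M-token window, very large inputs can exceed the limit. A minimal chunking fallback can split the document first; this sketch uses a rough chars-per-token ratio (`chunk_text` is an illustrative helper, not an SDK function, and a real tokenizer such as tiktoken would give exact counts):

```python
def chunk_text(text, max_tokens=900_000, chars_per_token=4):
    """Split text into pieces that each fit an estimated token budget.

    Uses a rough characters-per-token ratio; swap in a real tokenizer
    for exact counts.
    """
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Tiny demo budget: 2 "tokens" at 5 chars each -> 10-char chunks
pieces = chunk_text("abcdefghijKLMNO", max_tokens=2, chars_per_token=5)
print(pieces)
```

Each chunk can then be summarized separately and the summaries combined in a final call.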
Image Generation (GPT-image-1)
# Generate an image with the DALL-E 3 successor
response = client.images.generate(
    model="image-gen",  # deployment name
    prompt="A futuristic city with flying cars and vertical gardens, cyberpunk style, highly detailed, 4K",
    size="1024x1024",
    quality="hd",
    n=1
)

image_url = response.data[0].url
print(f"Generated image: {image_url}")
Video Generation (Sora)
# Generate a video with Sora
response = client.videos.generate(
    model="sora",
    prompt="A serene lakeside at sunset with birds flying overhead and gentle waves on the shore",
    duration=10,  # seconds
    resolution="1080p",
    fps=30
)

video_url = response.data[0].url
print(f"Generated video: {video_url}")
Audio Transcription
# Transcribe an audio file (with-statement ensures the file is closed)
with open("meeting_recording.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="en",
        response_format="verbose_json"
    )

print(f"Transcription: {response.text}")
print(f"Duration: {response.duration}s")

# Speaker diarization
for segment in response.segments:
    print(f"[{segment.start}s - {segment.end}s] {segment.text}")
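The segment loop assumes verbose_json segments with start, end, and text fields. A stand-alone formatter shows the same rendering; the `Segment` dataclass here is just a mock of that shape:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    text: str

def format_transcript(segments):
    """Render verbose_json-style segments as timestamped lines."""
    return "\n".join(f"[{s.start:.1f}s - {s.end:.1f}s] {s.text}" for s in segments)

demo = [Segment(0.0, 2.5, "Welcome, everyone."), Segment(2.5, 5.0, "Let's begin.")]
print(format_transcript(demo))
```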
Azure AI Foundry Integration

Model Router (Automatic Model Selection)

from azure.ai.foundry import ModelRouter
# Initialize the model router
router = ModelRouter(
    endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    credential=os.getenv("AZURE_OPENAI_API_KEY")
)

# Automatically select the optimal model
response = router.complete(
    prompt="Analyze this complex scientific paper...",
    optimization_goals=["quality", "cost"],
    available_models=["gpt-5", "gpt-5-mini", "gpt-4-1"]
)

print(f"Selected model: {response.model_used}")
print(f"Response: {response.content}")
print(f"Cost: ${response.cost}")
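The router selects based on prompt complexity; a toy length-and-keyword heuristic illustrates the underlying idea (the thresholds, marker words, and model names below are assumptions for illustration, not actual Foundry routing behavior):

```python
def pick_model(prompt: str) -> str:
    """Route cheap/simple prompts to a small model, harder ones to a larger one."""
    reasoning_markers = ("prove", "derive", "step-by-step", "analyze")
    if any(m in prompt.lower() for m in reasoning_markers) or len(prompt) > 2000:
        return "gpt-5"
    return "gpt-5-mini"

print(pick_model("What's the capital of France?"))        # short, simple
print(pick_model("Analyze this complex scientific paper..."))
```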
Benefits:
- Automatic model selection based on prompt complexity
- Balance quality vs. cost
- Reduce costs by up to 40% while maintaining quality

Agentic Retrieval (Azure AI Search Integration)

import json

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
# Initialize the search client
search_client = SearchClient(
    endpoint=os.getenv("SEARCH_ENDPOINT"),
    index_name="documents",
    credential=AzureKeyCredential(os.getenv("SEARCH_KEY"))
)

# Agentic retrieval with Azure OpenAI
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You have access to a document search system."},
        {"role": "user", "content": "What are the company's revenue projections for Q3?"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search company documents",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }],
    tool_choice="auto"
)

# Process tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        if tool_call.function.name == "search_documents":
            query = json.loads(tool_call.function.arguments)["query"]
            results = search_client.search(query)
            # Feed results back to the model for the final answer
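Feeding results back requires parsing each call's JSON arguments and dispatching to the right handler. A self-contained sketch of that dispatch step (the `handlers` dict and `dispatch_tool_call` are illustrative helpers, not SDK APIs; the lambda stands in for `search_client.search`):

```python
import json

def dispatch_tool_call(name, arguments_json, handlers):
    """Parse a tool call's JSON arguments and invoke the matching handler."""
    args = json.loads(arguments_json)
    handler = handlers.get(name)
    if handler is None:
        raise ValueError(f"Unknown tool: {name}")
    return handler(**args)

# Hypothetical handler standing in for search_client.search(query)
handlers = {"search_documents": lambda query: [f"doc matching {query!r}"]}

results = dispatch_tool_call("search_documents", '{"query": "Q3 revenue"}', handlers)
print(results)
```

The handler's return value would then be appended to `messages` as a `tool` role message for the follow-up completion.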
Improvements:
- 40% better on complex, multi-part questions
- Automatic query decomposition
- Relevance ranking
- Citation generation

Foundry Observability (Preview)

from azure.ai.foundry import FoundryObservability
# Enable observability
observability = FoundryObservability(
    workspace_id=os.getenv("AI_FOUNDRY_WORKSPACE_ID"),
    enable_tracing=True,
    enable_metrics=True
)

# Monitor agent execution
with observability.trace_agent("customer_support_agent") as trace:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=messages
    )
    trace.log_tool_call("search_kb", {"query": "refund policy"})
    trace.log_reasoning_step("Retrieved refund policy document")
    trace.log_token_usage(response.usage.total_tokens)
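FoundryObservability is in preview; the pattern it implements can be sketched with a stdlib context manager. The `trace_agent` below is a mock that just collects events and wall-clock time, not the Foundry API:

```python
import time
from contextlib import contextmanager

@contextmanager
def trace_agent(name):
    """Minimal tracing stand-in: collect (kind, detail) events plus duration."""
    events = []
    start = time.monotonic()
    try:
        yield events
    finally:
        events.append(("duration_s", round(time.monotonic() - start, 3)))

with trace_agent("customer_support_agent") as trace:
    trace.append(("tool_call", "search_kb"))
    trace.append(("reasoning", "Retrieved refund policy document"))

print(trace)
```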
View in Azure AI Foundry portal:
- End-to-end trace logs
- Reasoning steps and tool calls
- Performance metrics
- Cost analysis
Capacity and Quota Management

Check Quota
# List deployments with usage
az cognitiveservices account deployment list \
  --resource-group MyRG \
  --name myopenai \
  --output table

# Check usage metrics
az monitor metrics list \
  --resource $(az cognitiveservices account show -g MyRG -n myopenai --query id -o tsv) \
  --metric "TokenTransaction" \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-31T23:59:59Z \
  --interval PT1H \
  --aggregation Total
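The hourly TokenTransaction rows returned by that query can be rolled up locally. A sketch assuming the data has been reduced to (ISO-timestamp, total-tokens) pairs:

```python
from collections import defaultdict

def daily_token_totals(samples):
    """Sum hourly (timestamp, total) metric samples into per-day totals."""
    totals = defaultdict(int)
    for timestamp, total in samples:
        day = timestamp[:10]  # "2025-01-01T13:00:00Z" -> "2025-01-01"
        totals[day] += total
    return dict(totals)

samples = [
    ("2025-01-01T00:00:00Z", 1200),
    ("2025-01-01T01:00:00Z", 800),
    ("2025-01-02T00:00:00Z", 500),
]
print(daily_token_totals(samples))
```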
Update Capacity
# Scale up deployment capacity
az cognitiveservices account deployment update \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name gpt-5 \
  --sku-capacity 200

# Scale down during off-peak hours
az cognitiveservices account deployment update \
  --resource-group MyRG \
  --name myopenai \
  --deployment-name gpt-5 \
  --sku-capacity 50
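Choosing a --sku-capacity value is load arithmetic. A sizing sketch, assuming one capacity unit corresponds to roughly 1,000 tokens per minute for Standard deployments (verify the unit size against current Azure documentation before relying on it):

```python
import math

def required_capacity(requests_per_min, avg_tokens_per_request,
                      tpm_per_unit=1000, headroom=1.25):
    """Estimate a --sku-capacity value from expected load.

    Assumes one capacity unit ~= 1,000 tokens per minute (check current
    docs) and adds 25% headroom for bursts.
    """
    tpm_needed = requests_per_min * avg_tokens_per_request * headroom
    return math.ceil(tpm_needed / tpm_per_unit)

# 60 requests/min at ~1,500 tokens each
print(required_capacity(60, 1500))
```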
Request Quota Increase

1. Navigate to Azure Portal → Azure OpenAI resource
2. Go to the "Quotas" blade
3. Select model and region
4. Click "Request quota increase"
5. Provide justification and target capacity

Security and Networking

Private Endpoint
# Create the private endpoint
az network private-endpoint create \
  --name openai-private-endpoint \
  --resource-group MyRG \
  --vnet-name MyVNet \
  --subnet PrivateEndpointSubnet \
  --private-connection-resource-id $(az cognitiveservices account show -g MyRG -n myopenai --query id -o tsv) \
  --group-id account \
  --connection-name openai-connection

# Create the private DNS zone
az network private-dns zone create \
  --resource-group MyRG \
  --name privatelink.openai.azure.com

# Link the zone to the VNet
az network private-dns link vnet create \
  --resource-group MyRG \
  --zone-name privatelink.openai.azure.com \
  --name openai-dns-link \
  --virtual-network MyVNet \
  --registration-enabled false

# Create the DNS zone group
az network private-endpoint dns-zone-group create \
  --resource-group MyRG \
  --endpoint-name openai-private-endpoint \
  --name default \
  --private-dns-zone privatelink.openai.azure.com \
  --zone-name privatelink.openai.azure.com
Managed Identity Access
# Enable a system-assigned identity
az cognitiveservices account identity assign \
  --name myopenai \
  --resource-group MyRG
# Grant a role to the managed identity
PRINCIPAL_ID=$(az cognitiveservices account show -g MyRG -n myopenai --query identity.principalId -o tsv)

az role assignment create \
  --assignee $PRINCIPAL_ID \
  --role "Cognitive Services OpenAI User" \
  --scope /subscriptions/
Content Filtering
# Configure content filtering
az cognitiveservices account update \
  --name myopenai \
  --resource-group MyRG \
  --set properties.customContentFilter='{
    "hate": {"severity": "medium", "enabled": true},
    "violence": {"severity": "medium", "enabled": true},
    "sexual": {"severity": "medium", "enabled": true},
    "selfHarm": {"severity": "high", "enabled": true}
  }'
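When the filter policy is generated programmatically, a small builder can validate the payload before handing it to the CLI. The category names and severity levels below mirror the example above; confirm the exact schema against current Azure documentation (`build_content_filter` is an illustrative helper):

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high"}

def build_content_filter(**categories):
    """Build the custom content-filter JSON, validating severity values."""
    policy = {}
    for category, severity in categories.items():
        if severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"Invalid severity {severity!r} for {category}")
        policy[category] = {"severity": severity, "enabled": True}
    return json.dumps(policy)

payload = build_content_filter(hate="medium", violence="medium", selfHarm="high")
print(payload)
```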
Cost Optimization

Model Selection Strategy
Use GPT-5-mini or GPT-5-nano for:
- Simple questions
- Classification tasks
- Content moderation
- Summarization

Use GPT-5 or GPT-4.1 for:
- Complex reasoning
- Long-form content generation
- Document analysis
- Code generation

Use reasoning models (o3, o4-mini) for:
- Mathematical problems
- Scientific analysis
- Step-by-step reasoning
- Logic puzzles

Implement Caching
# Use a semantic cache to reduce duplicate requests
from azure.ai.cache import SemanticCache

cache = SemanticCache(
    similarity_threshold=0.95,
    ttl_seconds=3600
)

# Check the cache before the API call
cached_response = cache.get(user_query)
if cached_response:
    return cached_response

response = client.chat.completions.create(
    model="gpt-5",
    messages=messages
)

cache.set(user_query, response)
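If the SemanticCache package above is not available in your environment, the same get/set pattern can be approximated with an exact-match TTL cache from the stdlib (semantic matching would additionally require an embedding model to compare queries by similarity):

```python
import time

class TTLCache:
    """Exact-match response cache with per-entry expiry."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=3600)
cache.set("What is RAG?", "Retrieval-augmented generation ...")
print(cache.get("What is RAG?"))
```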
Token Management

import tiktoken
# Count tokens before the API call
encoding = tiktoken.get_encoding("cl100k_base")
tokens = len(encoding.encode(prompt))

if tokens > 100000:
    print(f"Warning: prompt has {tokens} tokens; this will be expensive!")

# Use a shorter max_tokens when appropriate
response = client.chat.completions.create(
    model="gpt-5",
    messages=messages,
    max_tokens=500  # Limit output tokens
)
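When a conversation approaches the budget, trimming the oldest turns is a common fallback. A sketch using a rough chars-per-token estimate rather than tiktoken (`truncate_to_budget` is an illustrative helper, and the 4-chars-per-token ratio is an assumption):

```python
def truncate_to_budget(messages, max_tokens=100_000, chars_per_token=4):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    def estimate(msgs):
        return sum(len(m["content"]) for m in msgs) // chars_per_token

    kept = list(messages)
    while estimate(kept) > max_tokens and len(kept) > 1:
        # Preserve the system prompt at index 0; drop the oldest turn after it.
        kept.pop(1)
    return kept

msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "x" * 400},
    {"role": "user", "content": "y" * 40},
]
trimmed = truncate_to_budget(msgs, max_tokens=20)
```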
Monitoring and Alerts

Set Up Cost Alerts
# Create a budget alert
az consumption budget create \
  --budget-name openai-monthly-budget \
  --resource-group MyRG \
  --amount 1000 \
  --category Cost \
  --time-grain Monthly \
  --start-date 2025-01-01 \
  --end-date 2025-12-31 \
  --notifications '{
    "actual_GreaterThan_80_Percent": {
      "enabled": true,
      "operator": "GreaterThan",
      "threshold": 80,
      "contactEmails": ["billing@example.com"]
    }
  }'
Application Insights Integration

from opencensus.ext.azure.log_exporter import AzureLogHandler
import logging
# Configure logging
logger = logging.getLogger(__name__)
logger.addHandler(AzureLogHandler(
    connection_string=os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")
))
# Log API calls
logger.info("OpenAI API call", extra={
    "custom_dimensions": {
        "model": "gpt-5",
        "tokens": response.usage.total_tokens,
        "cost": calculate_cost(response.usage.total_tokens),
        "latency_ms": response.response_ms
    }
})
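The `calculate_cost` helper used above is not defined in the snippet. A sketch with placeholder per-1,000-token rates (these prices are invented for illustration; real pricing varies by model, region, and input vs. output tokens, so check the official pricing page):

```python
# Hypothetical per-1,000-token rates in USD; replace with real pricing.
RATES_PER_1K = {"gpt-5": 0.01, "gpt-5-mini": 0.002, "gpt-4-1": 0.008}

def calculate_cost(total_tokens, model="gpt-5"):
    """Estimate request cost from total tokens at a flat per-1K rate."""
    rate = RATES_PER_1K[model]
    return round(total_tokens / 1000 * rate, 6)

print(calculate_cost(4500, "gpt-5"))
```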
Best Practices
✓ Use Model Router for automatic cost optimization
✓ Implement caching to reduce duplicate requests
✓ Monitor token usage and set budgets
✓ Use private endpoints for production workloads
✓ Enable managed identity instead of API keys
✓ Configure content filtering for safety
✓ Right-size capacity based on actual demand
✓ Use Foundry Observability for monitoring
✓ Implement retry logic with exponential backoff
✓ Choose appropriate models for task complexity
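The retry recommendation above can be implemented with a small wrapper. A sketch with exponential backoff and jitter (`with_retries` and the flaky demo function are illustrative, not SDK APIs; in practice you would catch the SDK's rate-limit exception rather than bare `Exception`):

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn, retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

# Demo: a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated 429/timeout")
    return "ok"

print(with_retries(flaky, base_delay=0.01))
```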
References

- Azure OpenAI Documentation
- What's New in Azure OpenAI
- GPT-5 Announcement
- Azure AI Foundry
- Model Pricing
Azure OpenAI Service with GPT-5 and reasoning models brings enterprise-grade AI to your applications!