groq-inference

安装量: 36
排名: #19507

安装

npx skills add https://github.com/scientiacapital/skills --skill groq-inference

Basic chat with GROQ: from groq import Groq client = Groq ( api_key = os . environ . get ( "GROQ_API_KEY" ) ) response = client . chat . completions . create ( model = "llama-3.3-70b-versatile" ,

Best all-around

messages

[ { "role" : "user" , "content" : prompt } ] , ) Model selection: Use Case Model General chat llama-3.3-70b-versatile Vision/OCR meta-llama/llama-4-scout-17b-16e-instruct STT whisper-large-v3 (GROQ-hosted, NOT OpenAI) TTS playai-tts GROQ integration is successful when: Correct model selected for use case (see model table) API key in environment variable ( GROQ_API_KEY ) Retry logic with tenacity for rate limits Streaming enabled for real-time applications Async patterns used for parallel queries NOT using OpenAI (constraint: NO OPENAI) Ultra-fast LLM inference for real-time applications. GROQ delivers 10-100x faster inference than standard providers. Quick Reference: Model Selection Use Case Model ID Context Notes General Chat llama-3.3-70b-versatile 128K Best all-around Fast Chat llama-3.1-8b-instant 128K Simple tasks, fastest Vision/OCR meta-llama/llama-4-scout-17b-16e-instruct 128K Up to 5 images STT whisper-large-v3 448 GROQ-hosted (NOT OpenAI API) TTS playai-tts - Fritz-PlayAI voice Reasoning meta-llama/llama-4-maverick-17b-128e-instruct 128K Thinking models Tool Use compound-beta - Built-in web search, code exec Core Patterns 1. Chat Completion (Basic + Streaming) import os from groq import Groq , AsyncGroq client = Groq ( api_key = os . environ . get ( "GROQ_API_KEY" ) ) def chat ( prompt : str , system : str = "You are helpful." ) -

str : response = client . chat . completions . create ( model = "llama-3.3-70b-versatile" , messages = [ { "role" : "system" , "content" : system } , { "role" : "user" , "content" : prompt } ] , temperature = 0.7 , max_completion_tokens = 1024 , ) return response . choices [ 0 ] . message . content

Streaming

def stream_chat ( prompt : str ) : stream = client . chat . completions . create ( model = "llama-3.3-70b-versatile" , messages = [ { "role" : "user" , "content" : prompt } ] , stream = True , ) for chunk in stream : if chunk . choices [ 0 ] . delta . content : yield chunk . choices [ 0 ] . delta . content 2. Vision / Multimodal import base64 def analyze_image ( image_path : str , prompt : str ) -

str : with open ( image_path , "rb" ) as f : image_b64 = base64 . standard_b64encode ( f . read ( ) ) . decode ( "utf-8" ) response = client . chat . completions . create ( model = "meta-llama/llama-4-scout-17b-16e-instruct" , messages = [ { "role" : "user" , "content" : [ { "type" : "text" , "text" : prompt } , { "type" : "image_url" , "image_url" : { "url" : f"data:image/jpeg;base64, { image_b64 } " } } ] } ] , ) return response . choices [ 0 ] . message . content

URL-based: just pass {"url": "https://..."} instead of base64

  1. Audio: Speech-to-Text (GROQ-Hosted Whisper) Note: Whisper on GROQ runs on GROQ hardware
  2. NOT calling OpenAI's API. Whisper is an open-source model that GROQ hosts for fast inference. def transcribe ( audio_path : str , language : str = "en" ) -

    str : with open ( audio_path , "rb" ) as f : result = client . audio . transcriptions . create ( file = f , model = "whisper-large-v3" ,

GROQ-hosted, not OpenAI API

language

language , response_format = "verbose_json" ,

Includes timestamps

) return result . text def translate_to_english ( audio_path : str ) -

str : with open ( audio_path , "rb" ) as f : result = client . audio . translations . create ( file = f , model = "whisper-large-v3" ) return result . text Alternative STT Providers (if you prefer non-Whisper options): Deepgram - Real-time streaming, lowest latency ( pip install deepgram-sdk ) AssemblyAI - High accuracy, speaker diarization ( pip install assemblyai ) See voice-ai-skill for Deepgram/AssemblyAI integration patterns 4. Audio: Text-to-Speech (PlayAI) def text_to_speech ( text : str , output_path : str = "output.wav" ) : response = client . audio . speech . create ( model = "playai-tts" , voice = "Fritz-PlayAI" ,

Also: Arista-PlayAI

input

text , response_format = "wav" , ) response . write_to_file ( output_path )

Streaming TTS

def stream_tts ( text : str ) : with client . audio . speech . with_streaming_response . create ( model = "playai-tts" , voice = "Fritz-PlayAI" , input = text , response_format = "wav" ) as response : for chunk in response . iter_bytes ( 1024 ) : yield chunk Alternative TTS Providers (beyond GROQ's PlayAI): Cartesia - Ultra-low latency, emotional control ( pip install cartesia ) ElevenLabs - Most natural voices, voice cloning ( pip install elevenlabs ) Deepgram - Fast, cost-effective ( pip install deepgram-sdk ) See voice-ai-skill for Cartesia/ElevenLabs/Deepgram TTS integration patterns 5. Tool Use / Function Calling import json tools = [ { "type" : "function" , "function" : { "name" : "get_weather" , "description" : "Get weather for a location" , "parameters" : { "type" : "object" , "properties" : { "location" : { "type" : "string" } } , "required" : [ "location" ] } } } ] def chat_with_tools ( prompt : str ) : messages = [ { "role" : "user" , "content" : prompt } ] response = client . chat . completions . create ( model = "llama-3.3-70b-versatile" , messages = messages , tools = tools , tool_choice = "auto" ) msg = response . choices [ 0 ] . message if msg . tool_calls : for tc in msg . tool_calls : result = execute_function ( tc . function . name , json . loads ( tc . function . arguments ) ) messages . extend ( [ msg , { "role" : "tool" , "tool_call_id" : tc . id , "content" : json . dumps ( result ) } ] ) return client . chat . completions . create ( model = "llama-3.3-70b-versatile" , messages = messages , tools = tools ) . choices [ 0 ] . message . content return msg . content 6. Compound Beta (Built-in Web Search + Code Exec) def compound_query ( prompt : str ) : """Built-in tools: web_search, code_execution.""" response = client . chat . completions . create ( model = "compound-beta" , messages = [ { "role" : "user" , "content" : prompt } ] , ) msg = response . choices [ 0 ] . message

Access msg.executed_tools for tool results

return
msg
.
content
7. Reasoning Models
def
reasoning_query
(
prompt
:
str
,
format
:
str
=
"parsed"
)
:
"""format: 'parsed' (structured), 'raw' (visible), 'hidden' (no thinking)"""
response
=
client
.
chat
.
completions
.
create
(
model
=
"meta-llama/llama-4-maverick-17b-128e-instruct"
,
messages
=
[
{
"role"
:
"user"
,
"content"
:
prompt
}
]
,
reasoning_format
=
format
,
)
msg
=
response
.
choices
[
0
]
.
message
if
format
==
"parsed"
and
hasattr
(
msg
,
'reasoning'
)
:
return
{
"thinking"
:
msg
.
reasoning
,
"answer"
:
msg
.
content
}
return
msg
.
content
8. Async Patterns
async_client
=
AsyncGroq
(
api_key
=
os
.
environ
.
get
(
"GROQ_API_KEY"
)
)
async
def
async_chat
(
prompt
:
str
)
-
>
str
:
response
=
await
async_client
.
chat
.
completions
.
create
(
model
=
"llama-3.3-70b-versatile"
,
messages
=
[
{
"role"
:
"user"
,
"content"
:
prompt
}
]
,
)
return
response
.
choices
[
0
]
.
message
.
content
async
def
parallel_queries
(
prompts
:
list
[
str
]
)
-
>
list
[
str
]
:
import
asyncio
return
await
asyncio
.
gather
(
*
[
async_chat
(
p
)
for
p
in
prompts
]
)
Rate Limits
Tier
Requests/min
Tokens/min
Tokens/day
Free
30
15,000
500,000
Paid
100+
100,000+
Unlimited
from
tenacity
import
retry
,
stop_after_attempt
,
wait_exponential
@retry
(
stop
=
stop_after_attempt
(
3
)
,
wait
=
wait_exponential
(
min
=
1
,
max
=
10
)
)
def
reliable_chat
(
prompt
:
str
)
-
>
str
:
return
chat
(
prompt
)
Integration Notes
Pairs with
voice-ai-skill (Whisper STT + PlayAI TTS), langgraph-agents-skill
Complements
trading-signals-skill (fast analysis), data-analysis-skill
Projects
VozLux (voice agents), FieldVault-AI (document processing)
Constraint
NO OPENAI - GROQ is the fast inference layer Environment Variables GROQ_API_KEY = gsk_ .. .

Required - get from console.groq.com

Optional multi-provider

ANTHROPIC_API_KEY

Claude for complex reasoning

GOOGLE_API_KEY

Gemini fallback

Reference Files reference/models-catalog.md - Complete model catalog with specs reference/audio-speech.md - Whisper STT and PlayAI TTS deep dive reference/vision-multimodal.md - Multimodal and image processing reference/tool-use-patterns.md - Function calling and Compound Beta reference/reasoning-models.md - Thinking models and reasoning_format reference/cost-optimization.md - Batch API, caching, provider routing

返回排行榜