# Groq Provider
The Groq provider offers ultra-fast inference on open-source models through GroqCloud.
## Setup

### 1. Get API Key

- Create an account at console.groq.com
- Navigate to API Keys
- Click "Create API Key"
- Copy the key (it starts with `gsk_...`)
### 2. Configure
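The client reads the key from an environment variable. Assuming the conventional `GROQ_API_KEY` name (the exact variable name is not stated on this page), export it in your shell:

```shell
# Make the key available to the client
# (variable name GROQ_API_KEY is assumed, not confirmed by this page)
export GROQ_API_KEY="gsk_your_key_here"
```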
## Usage

### Basic Usage
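A minimal sketch, assuming the `llm_client` package (the import path is inferred from the `llm_client.exceptions` import used later on this page) and the `LLMClient` API shown in the examples below:

```python
from llm_client import LLMClient  # import path assumed

# Create a client backed by the Groq provider (uses the default model)
client = LLMClient(api_choice="groq")

messages = [{"role": "user", "content": "Hello!"}]
response = client.chat_completion(messages)
print(response)
```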
## Available Models

| Model | Description | Context Window | Speed |
|---|---|---|---|
| `llama-3.3-70b-versatile` | Llama 3.3 70B | 128K | Very Fast |
| `meta-llama/llama-4-maverick-17b-128e-instruct` | Llama 4 17B | 128K | Ultra Fast |
| `openai/gpt-oss-120b` | GPT OSS 120B | 128K | Fast |
| `moonshotai/kimi-k2-instruct-0905` | KIMI K2 (default) | 128K | Fast |
### Model Selection

```python
# Use default model
client = LLMClient(api_choice="groq")

# Specify model
client = LLMClient(
    api_choice="groq",
    llm="llama-3.3-70b-versatile"
)
```
## Features

### Ultra-Fast Inference

```python
import time

messages = [{"role": "user", "content": "Explain quantum computing"}]

start = time.time()
response = client.chat_completion(messages)
elapsed = time.time() - start

print(f"Response in {elapsed:.2f}s")
# Typical: 0.3-1.0 seconds
```
### Streaming

```python
messages = [{"role": "user", "content": "Count from 1 to 10"}]

for chunk in client.chat_completion_stream(messages):
    print(chunk, end="", flush=True)
print()
```
### Function Calling

```python
tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            }
        }
    }
}]

messages = [{"role": "user", "content": "Search the web for Groq"}]
result = client.chat_completion_with_tools(messages, tools)
```
## Configuration

### Via Config File
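The config-file schema is not documented on this page. As a purely hypothetical sketch, a YAML file mirroring the runtime parameters shown below might look like:

```yaml
# Hypothetical config-file sketch; the actual schema is not
# specified here. Field names mirror the LLMClient constructor
# arguments used on this page.
groq:
  llm: llama-3.3-70b-versatile
  temperature: 0.5
  max_tokens: 1024
```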
### Runtime Parameters

```python
client = LLMClient(
    api_choice="groq",
    llm="llama-3.3-70b-versatile",
    temperature=0.5,
    max_tokens=1024
)
```
## Best Practices

### 1. Leverage Speed

```python
# Groq is ideal for high-throughput applications
client = LLMClient(api_choice="groq")

queries = ["Question 1", "Question 2", "Question 3"]
for query in queries:
    response = client.chat_completion([
        {"role": "user", "content": query}
    ])
    print(f"Q: {query}\nA: {response}\n")
```
### 2. Cost-Effective for Volume

```python
# Use Groq for cost-sensitive applications
client = LLMClient(
    api_choice="groq",
    llm="llama-3.3-70b-versatile"
)

# Process large batches efficiently
for item in large_dataset:
    response = client.chat_completion(
        [{"role": "user", "content": item}]
    )
```
### 3. Fallback Strategy

```python
from llm_client.exceptions import ChatCompletionError

# Try Groq first (fast, cheap), fall back to OpenAI if needed
client = LLMClient(api_choice="groq")

try:
    response = client.chat_completion(messages)
except ChatCompletionError:
    # Fall back to OpenAI
    client.switch_provider("openai")
    response = client.chat_completion(messages)
```
## Async Support

```python
import asyncio

async def process_queries():
    client = LLMClient(api_choice="groq", use_async=True)
    queries = ["Q1", "Q2", "Q3"]
    tasks = [
        client.achat_completion([{"role": "user", "content": q}])
        for q in queries
    ]
    responses = await asyncio.gather(*tasks)
    return responses

# Process multiple queries concurrently
results = asyncio.run(process_queries())
```
## Pricing

Groq offers very competitive pricing for open-source models. Check groq.com/pricing for current rates.

**Free Tier:**

- Generous free tier for testing and development
- Rate limits apply
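Because the free tier is rate limited, requests can fail transiently under load. One way to handle this (a generic sketch, not part of the `LLMClient` API — the `with_backoff` helper below is hypothetical) is to retry with exponential backoff:

```python
import time

def with_backoff(call, max_retries=3, base_delay=1.0, is_rate_limited=None):
    """Retry `call` with exponential backoff.

    `call` is a zero-argument callable; `is_rate_limited` is an optional
    predicate on the raised exception (retry everything if omitted).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if is_rate_limited is not None and not is_rate_limited(exc):
                raise  # not a rate-limit error; don't retry
            if attempt == max_retries - 1:
                raise  # out of retries
            # Sleep base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * (2 ** attempt))
```

For example, wrap a chat call as `with_backoff(lambda: client.chat_completion(messages))` so transient 429 responses are retried instead of surfacing immediately.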