Google Gemini Provider¶

The Gemini provider enables access to Google's Gemini models through the OpenAI compatibility API.

Setup¶

1. Get API Key¶

Visit Google AI Studio
Sign in with your Google account
Click "Get API Key" or "Create API Key"
Select or create a Google Cloud project
Copy the generated key (starts with AIzaSy...)

2. Configure¶

# Recommendation: Use generic API_KEY for automatic provider detection
API_KEY=AIzaSy-your-api-key-here

# OR use provider-specific key
# GEMINI_API_KEY=AIzaSy-your-api-key-here

Usage¶

Basic Usage¶

from llm_client import LLMClient

# Explicit selection
client = LLMClient(api_choice="gemini")

Available Models¶

Based on Google Gemini API documentation (June 2026):

Stable Production Models:

Model	Description	Context Window	Best For
`gemini-3.5-flash`	Intelligent performance	1M tokens	Agentic and coding tasks
`gemini-3.1-flash-lite`	High performance (default)	1M tokens	Low-cost high-throughput
`gemini-2.5-pro`	Highest reasoning	2M tokens	Complex reasoning, programming
`gemini-2.5-flash`	Optimal balance	1M tokens	General-purpose tasks

Experimental/Preview Models:

Model	Status	Features
`gemini-3.1-pro-preview`	Preview	Advanced intelligence, agentic capabilities
`gemini-3-flash-preview`	Preview	High performance at fraction of cost
`gemini-2.5-pro-preview-tts`	Preview	High-quality speech synthesis

Model Selection¶

# Use default model (gemini-3.1-flash-lite)
client = LLMClient(api_choice="gemini")

# Specify model
client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-pro"
)

# With custom parameters
client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-flash",
    temperature=0.8,
    max_tokens=2048
)

Features¶

Chat Completion¶

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum entanglement."}
]

response = client.chat_completion(messages)
print(response)

Streaming¶

messages = [
    {"role": "user", "content": "Write a poem about artificial intelligence"}
]

print("Response: ", end="")
for chunk in client.chat_completion_stream(messages):
    print(chunk, end="", flush=True)
print()
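
When the full text is also needed after streaming (for logging, or to append to a conversation history), the chunks can be accumulated while they are printed. This is a minimal sketch that assumes each chunk is a plain string, as the loop above suggests; `collect_stream` is a hypothetical helper, not part of llm_client, and the fake stream stands in for `client.chat_completion_stream(messages)`.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[str], echo: bool = True) -> str:
    """Optionally print chunks as they arrive and return the joined text."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in for client.chat_completion_stream(messages) (assumption: yields str)
fake_stream = iter(["Silicon ", "dreams ", "awake."])
full_text = collect_stream(fake_stream, echo=False)
```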

Function Calling¶

Gemini supports OpenAI-compatible function calling:

tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search internal knowledge base",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query"
                },
                "category": {
                    "type": "string",
                    "enum": ["technical", "business", "research"]
                }
            },
            "required": ["query"]
        }
    }
}]

messages = [
    {"role": "user", "content": "Find technical docs about RAG"}
]

result = client.chat_completion_with_tools(messages, tools)

if result['tool_calls']:
    for call in result['tool_calls']:
        print(f"Calling: {call['function']['name']}")
        print(f"Arguments: {call['function']['arguments']}")
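
Once a tool call is returned, the application still has to run the matching function itself. The sketch below shows one way to dispatch a call, assuming the `call['function']` shape printed above and assuming `arguments` arrives as a JSON string, as in OpenAI-style responses (a dict is accepted too). `TOOL_REGISTRY`, `dispatch_tool_call`, and the local `search_knowledge_base` body are hypothetical names for illustration.

```python
import json


def search_knowledge_base(query, category="technical"):
    """Hypothetical local implementation of the tool declared above."""
    return f"results for {query!r} in {category}"


# Maps tool names to local callables (hypothetical registry)
TOOL_REGISTRY = {"search_knowledge_base": search_knowledge_base}


def dispatch_tool_call(call):
    """Look up the named tool, decode its arguments, and run it."""
    name = call["function"]["name"]
    raw_args = call["function"]["arguments"]
    # Assumption: arguments is a JSON string; pass a dict through unchanged
    args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    return TOOL_REGISTRY[name](**args)


# Stand-in for one entry of result['tool_calls']
example_call = {
    "function": {
        "name": "search_knowledge_base",
        "arguments": '{"query": "RAG", "category": "technical"}',
    }
}
output = dispatch_tool_call(example_call)
```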

Long Context Processing¶

With context windows of 1M to 2M tokens (see Available Models), Gemini can take a very long document in a single request:

# Load large document
with open("long_document.txt", "r", encoding="utf-8") as f:
    document = f.read()

# Gemini can handle up to 2M tokens
client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-pro",
    max_tokens=4096
)

messages = [
    {"role": "system", "content": "You are a document analyzer."},
    {"role": "user", "content": f"Summarize this document:\n\n{document}"}
]

summary = client.chat_completion(messages)
print(summary)

Configuration¶

Via Config File¶

# llm_config.yaml
providers:
  gemini:
    model: gemini-2.5-flash
    temperature: 0.8
    max_tokens: 2048

client = LLMClient.from_config("llm_config.yaml", provider="gemini")

Runtime Parameters¶

client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-pro",
    temperature=0.7,      # 0.0 = focused, 2.0 = creative
    max_tokens=2048       # Maximum response length
)
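
Values that come from user input or a config file can be checked before they reach the client. The sketch below assumes the 0.0 to 2.0 temperature range given in the comment above and assumes `max_tokens` must be a positive integer; `check_sampling_params` is a hypothetical helper, and llm_client may apply its own validation.

```python
def check_sampling_params(temperature, max_tokens):
    """Return the parameters as a dict, or raise ValueError if out of range."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be between 0.0 and 2.0, got {temperature}")
    if not isinstance(max_tokens, int) or max_tokens <= 0:
        raise ValueError(f"max_tokens must be a positive integer, got {max_tokens}")
    return {"temperature": temperature, "max_tokens": max_tokens}


params = check_sampling_params(temperature=0.7, max_tokens=2048)
# The checked values can then be passed on, for example:
# client = LLMClient(api_choice="gemini", llm="gemini-2.5-pro", **params)
```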

Best Practices¶

1. Choose the Right Model¶

# For complex reasoning - use gemini-2.5-pro
client = LLMClient(api_choice="gemini", llm="gemini-2.5-pro")
complex_response = client.chat_completion([
    {"role": "user", "content": "Analyze the geopolitical implications..."}
])

# For general tasks - use gemini-2.5-flash (faster, cheaper)
client.switch_provider("gemini", llm="gemini-2.5-flash")
quick_response = client.chat_completion([
    {"role": "user", "content": "Translate this text..."}
])

# For high throughput - use gemini-2.5-flash-lite
client.switch_provider("gemini", llm="gemini-2.5-flash-lite")

2. Leverage Long Context¶

from llm_client import LLMClient, TokenCounter

# Gemini handles very long contexts efficiently
client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-pro"
)

messages = [
    {"role": "user", "content": "Summarize this document: ..."}
]

# Count tokens before sending
token_count = TokenCounter.count_tokens(messages)
print(f"Tokens: {token_count}")

# Gemini 2.5 Pro supports up to 2M tokens
if token_count < 2_000_000:
    response = client.chat_completion(messages)
else:
    print("Input exceeds the 2M-token window; split it before sending")

3. Multimodal Capabilities¶

Gemini natively supports image and video input, but llm_client does not expose it: the client talks to Gemini through the OpenAI compatibility API with text messages only. Use the Google AI SDK directly for multimodal requests.

4. Streaming for Long Responses¶

# Use streaming for better UX with long outputs
messages = [
    {"role": "user", "content": "Write a detailed analysis of..."}
]

for chunk in client.chat_completion_stream(messages):
    print(chunk, end="", flush=True)

5. Temperature Control¶

# Low temperature for factual responses
factual_client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-flash",
    temperature=0.2
)

# High temperature for creative content
creative_client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-flash",
    temperature=1.5
)

Async Support¶

import asyncio
from llm_client import LLMClient

async def main():
    client = LLMClient(
        api_choice="gemini",
        use_async=True
    )

    messages = [{"role": "user", "content": "Hello"}]

    # Async completion
    response = await client.achat_completion(messages)
    print(response)

    # Async streaming
    async for chunk in client.achat_completion_stream(messages):
        print(chunk, end="", flush=True)

asyncio.run(main())
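
The async client is most useful when several prompts are sent at once. The sketch below shows one way to run them concurrently while capping how many are in flight, which helps stay under rate limits. `run_limited` is a hypothetical helper built only on the standard library, and `fake_completion` stands in for `await client.achat_completion(messages)`.

```python
import asyncio


async def run_limited(factories, limit=5):
    """Await coroutine factories with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with semaphore:
            return await factory()

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(f) for f in factories))


async def fake_completion(prompt):
    """Stand-in for `await client.achat_completion(messages)`."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


prompts = ["Hello", "Bonjour", "Hola"]
results = asyncio.run(
    run_limited([lambda p=p: fake_completion(p) for p in prompts], limit=2)
)
```

Passing factories (the lambdas) instead of already-created coroutines means a request is only started once the semaphore lets it through.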

Error Handling¶

from llm_client import LLMClient
from llm_client.exceptions import (
    APIKeyNotFoundError,
    ChatCompletionError
)

messages = [{"role": "user", "content": "Hello"}]

try:
    client = LLMClient(api_choice="gemini")
    response = client.chat_completion(messages)
except APIKeyNotFoundError:
    print("Gemini API key not found!")
    print("Set the API_KEY or GEMINI_API_KEY environment variable")
except ChatCompletionError as e:
    print(f"API call failed: {e}")
    print(f"Original error: {e.original_error}")

Pricing¶

Google Gemini offers competitive pricing with a generous free tier:

Free Tier:
- 15 requests per minute
- 1 million tokens per minute
- 1,500 requests per day
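
A client-side throttle can keep a script under the 15 requests per minute listed above by spacing calls at least 4 seconds apart. This is a minimal sketch using only the standard library; `MinIntervalThrottle` is a hypothetical helper, and it covers the per-minute request limit only, not the daily or token limits.

```python
import time


class MinIntervalThrottle:
    """Spaces calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next call is allowed, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


throttle = MinIntervalThrottle(max_per_minute=15)  # one call every 4 seconds
# Call throttle.wait() immediately before each client.chat_completion(...)
```

The clock and sleep functions are injectable so the spacing logic can be tested without real delays.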

Paid Tier (Pay-as-you-go):

Model	Input (per 1M tokens)	Output (per 1M tokens)
Gemini 2.5 Pro	$1.25	$5.00
Gemini 2.5 Flash	$0.075	$0.30
Gemini 2.5 Flash Lite	$0.0375	$0.15
Gemini 2.0 Flash	$0.075	$0.30

Check Google AI Pricing for current rates.

Cost Estimation¶

from llm_client import TokenCounter

messages = [
    {"role": "user", "content": "Analyze this data..."}
]

token_count = TokenCounter.count_tokens(messages)
estimated_response = 500

# For gemini-2.5-flash
input_cost = (token_count / 1_000_000) * 0.075
output_cost = (estimated_response / 1_000_000) * 0.30
total = input_cost + output_cost

print(f"Estimated cost: ${total:.4f}")
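
The same arithmetic can be wrapped in a small function that covers every model in the pricing table. The sketch below copies the rates from that table, so it is only as current as the table itself; the model IDs are inferred from the display names, and `PRICES` and `estimate_cost` are hypothetical names.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
# Model IDs are inferred from the table's display names (assumption).
PRICES = {
    "gemini-2.5-pro": (1.25, 5.00),
    "gemini-2.5-flash": (0.075, 0.30),
    "gemini-2.5-flash-lite": (0.0375, 0.15),
    "gemini-2.0-flash": (0.075, 0.30),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for one request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


cost = estimate_cost("gemini-2.5-flash", input_tokens=12_000, output_tokens=500)
print(f"Estimated cost: ${cost:.6f}")
```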

Comparison with Other Providers¶

Advantages:
- ✅ Very long context windows (up to 2M tokens)
- ✅ Competitive pricing
- ✅ Strong multilingual capabilities
- ✅ Excellent at structured data extraction
- ✅ Native multimodal support (via Google AI SDK)

Considerations:
- ⚠️ Newer than OpenAI's GPT-4 line, so third-party tooling and integrations are less mature
- ⚠️ Some features require Google AI SDK (not OpenAI compatibility)
- ⚠️ Regional availability may vary

Troubleshooting¶

API Key Issues¶

# Verify key is set
echo $GEMINI_API_KEY

# Or in Python
import os
print(os.getenv("GEMINI_API_KEY"))
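
Printing the whole key can leak it into terminal history or logs. A safer check reports whether each variable is set and shows only a short prefix. This sketch checks both variable names from the Configure section; `describe_key` is a hypothetical helper, and it does not tell you which variable llm_client would prefer when both are set.

```python
import os


def describe_key(env, names=("API_KEY", "GEMINI_API_KEY")):
    """Report which key variables are set, masking all but a short prefix."""
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "not set"
        else:
            # Show only the first six characters plus the total length
            report[name] = f"set ({value[:6]}..., {len(value)} chars)"
    return report


for name, status in describe_key(os.environ).items():
    print(f"{name}: {status}")
```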

Rate Limit Errors¶

import time
from llm_client.exceptions import ChatCompletionError

for attempt in range(3):
    try:
        response = client.chat_completion(messages)
        break
    except ChatCompletionError as e:
        if "rate_limit" in str(e).lower():
            wait_time = (attempt + 1) * 10
            print(f"Rate limit hit, waiting {wait_time}s...")
            time.sleep(wait_time)
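        else:
            raise
else:
    # Every attempt was rate limited; fail loudly instead of continuing
    raise RuntimeError("Rate limit persisted after 3 attempts")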
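
The fixed waits above (10s, 20s, 30s) can be swapped for exponential backoff with jitter, which backs off faster under sustained throttling and avoids many clients retrying at the same moment. This is a generic standard-library sketch; `retry_with_backoff` is a hypothetical helper, and the `"rate_limit"` string check in the usage comment is carried over from the loop above, not a confirmed error format.

```python
import random
import time


def retry_with_backoff(fn, is_retryable, max_attempts=4, base_delay=2.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Upper bound doubles each attempt (2s, 4s, 8s), capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time between 0 and the upper bound
            sleep(random.uniform(0, delay))


# Usage with the client from this page:
# response = retry_with_backoff(
#     lambda: client.chat_completion(messages),
#     is_retryable=lambda e: "rate_limit" in str(e).lower(),
# )
```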

Context Length Errors¶

from llm_client import TokenCounter

token_count = TokenCounter.count_tokens(messages)
model_limit = 1_000_000  # gemini-2.5-flash limit

if token_count > model_limit:
    print(f"Message too long: {token_count} > {model_limit}")
    # Consider using gemini-2.5-pro (2M limit)
    client.switch_provider("gemini", llm="gemini-2.5-pro")
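
The model choice can also be made up front from the token count. The sketch below uses the context windows listed in the model table on this page (1M for gemini-2.5-flash, 2M for gemini-2.5-pro) and returns the smallest model that fits; `CONTEXT_WINDOWS` and `pick_model` are hypothetical names, and the limits should be rechecked against Google's current documentation.

```python
# Context windows in tokens, taken from the model table on this page
CONTEXT_WINDOWS = {
    "gemini-2.5-flash": 1_000_000,
    "gemini-2.5-pro": 2_000_000,
}


def pick_model(token_count):
    """Return the smallest-window model whose limit fits token_count."""
    for model, limit in sorted(CONTEXT_WINDOWS.items(), key=lambda kv: kv[1]):
        if token_count <= limit:
            return model
    raise ValueError(f"{token_count} tokens exceeds every configured context window")


model = pick_model(1_500_000)
# client.switch_provider("gemini", llm=model)
```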

Example: Complete Workflow¶

from llm_client import LLMClient
from llm_client.exceptions import ChatCompletionError

# Initialize client
client = LLMClient(
    api_choice="gemini",
    llm="gemini-2.5-flash",
    temperature=0.7,
    max_tokens=1024
)

# Multi-turn conversation
conversation = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "What are the latest trends in AI?"}
]

try:
    # Get initial response
    response = client.chat_completion(conversation)
    print(f"Assistant: {response}\n")

    # Continue conversation
    conversation.append({"role": "assistant", "content": response})
    conversation.append({"role": "user", "content": "Can you elaborate on transformers?"})

    # Stream the follow-up response
    print("Assistant: ", end="")
    for chunk in client.chat_completion_stream(conversation):
        print(chunk, end="", flush=True)
    print("\n")

except ChatCompletionError as e:
    print(f"Error: {e}")