
You build an application on GPT-4o. Then the pricing changes, or you need Claude for a specific task, or a new open-source model outperforms both on your benchmark. Switching providers means rewriting your API calls, updating your response parsing, handling different error formats, and adjusting your retry logic — because every major LLM provider has a slightly different interface. That is the problem litellm exists to solve.

litellm is a unified Python interface for calling 100+ LLM providers (OpenAI, Anthropic, Google Gemini, Mistral, Cohere, local Ollama models, and more) using the same function signature. Switching from GPT-4o to Claude 3.5 Sonnet is a one-line change to the model parameter. You need Python 3.8+, API keys for the providers you use, and the package itself (pip install litellm).

This article covers the core litellm API, model routing and fallbacks, cost tracking, load balancing across multiple providers, streaming responses, async calls, and using litellm as a local proxy server. By the end you will have a flexible LLM integration layer that lets you swap providers without touching your application code.

Multi-Model LLM Calls: Quick Example

Install litellm and try calling two different providers with identical code:

# quick_litellm.py
import litellm
import os

# litellm reads API keys from environment variables automatically:
# OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.

def ask(model: str, question: str) -> str:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

question = "In one sentence, what is a Python decorator?"

# Same function, different providers
print("GPT-4o-mini:", ask("gpt-4o-mini", question))
print("Claude Haiku:", ask("claude-3-haiku-20240307", question))

Output:

GPT-4o-mini: A Python decorator is a function that wraps another function to extend or modify its behavior without changing its source code.
Claude Haiku: A decorator is a callable that takes a function and returns a modified version of it, adding behavior before or after the original function runs.

The response object follows the OpenAI format for every provider — response.choices[0].message.content works the same whether you are calling GPT, Claude, or Gemini. litellm translates each provider’s native response format into the OpenAI schema internally. This means your parsing code never changes, only the model string does.

Figure: LiteLLM connecting multiple LLM providers. One interface, a hundred models. The provider is just a string.

What Is litellm and Why Use It?

Different LLM providers expose different APIs: OpenAI uses client.chat.completions.create(), Anthropic uses client.messages.create() with a different message format, Google uses generativeai.GenerativeModel(). When you write directly against one provider’s SDK, switching providers requires changing not just the API call but also message formatting, response parsing, error handling, and token counting.

litellm acts as an adapter layer. You write once against the OpenAI-style interface, and litellm translates the call to whatever the chosen provider expects. It also normalizes error types across providers — a rate limit error from Anthropic and one from OpenAI both become the same exception class in litellm.

Feature          | Direct Provider SDKs    | litellm
Switch providers | Rewrite API calls       | Change model string
Error handling   | Per-provider exceptions | Unified exceptions
Cost tracking    | Manual calculation      | Built-in per-call
Fallbacks        | Manual try/except       | Built-in routing
Local models     | Custom client setup     | ollama/model-name

The tradeoff is a small abstraction overhead and occasional lag when a new provider model is released (litellm needs to add support for it). For most production use cases, the flexibility far outweighs these limitations.
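
To illustrate what the unified exceptions buy you, here is a minimal sketch of a retry helper that backs off on a single exception class regardless of provider. The names call_with_backoff and ask_with_backoff and their parameters are hypothetical, chosen for this example; litellm.RateLimitError is the exception litellm raises for provider rate limits. The litellm import is deferred so the helper itself has no dependencies.

```python
# retry_sketch.py -- illustrative helper, not part of litellm
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")


def ask_with_backoff(model: str, question: str) -> str:
    """Same retry policy for any provider, thanks to the shared exception class."""
    import litellm  # imported lazily; requires `pip install litellm`

    response = call_with_backoff(
        lambda: litellm.completion(
            model=model,
            messages=[{"role": "user", "content": question}],
        ),
        retry_on=(litellm.RateLimitError,),
    )
    return response.choices[0].message.content
```

Because the exception class is the same for every provider, ask_with_backoff("gpt-4o-mini", ...) and ask_with_backoff("claude-3-haiku-20240307", ...) share one retry path. In practice the Router's num_retries setting covers the common case; a helper like this is only useful when you want custom control over the delay.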

Installation and Configuration

# Terminal
pip install litellm

litellm reads API keys from standard environment variable names. Set them once in your shell profile or .env file:

# .env (use python-dotenv to load in your app)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
MISTRAL_API_KEY=...
COHERE_API_KEY=...

# config_check.py
import litellm
import os
from dotenv import load_dotenv

load_dotenv()

# litellm.check_valid_key verifies a key works for a given model
# (makes a tiny test call -- costs a fraction of a cent)
valid = litellm.utils.check_valid_key(
    model="gpt-4o-mini",
    api_key=os.environ.get("OPENAI_API_KEY", "")
)
print(f"OpenAI key valid: {valid}")

Output:

OpenAI key valid: True

Built-In Cost Tracking

Every litellm response includes token usage and cost metadata. This is invaluable for monitoring spend across providers in multi-model applications:

# cost_tracking.py
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain Python generators in 3 sentences."}]
)

# Token usage is on every response
usage = response.usage
print(f"Prompt tokens:     {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens:      {usage.total_tokens}")

# Cost calculation (litellm knows pricing for 100+ models)
cost = litellm.completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

Output:

Prompt tokens:     18
Completion tokens: 67
Total tokens:      85
Cost: $0.000043

You can also estimate costs without making a real API call. If you already have token counts, litellm.cost_per_token() returns the prompt and completion cost for a given model; if you have the raw text, litellm.completion_cost() accepts model, prompt, and completion arguments and counts the tokens for you. This lets you estimate costs before a batch job or build a budget-enforcing wrapper. For a multi-provider application where you want to track spend by provider, aggregate the cost values per call and store them in your database alongside the model name.
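
As a rough sketch of the budget-enforcing idea, the class below accumulates per-model spend and refuses further calls once a cap would be exceeded. SpendTracker, BudgetExceeded, and their method names are hypothetical and use only the standard library; the cost values you feed in would come from litellm.completion_cost().

```python
# budget_sketch.py -- illustrative, standard library only
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a call would push total spend over the cap."""


class SpendTracker:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.by_model = defaultdict(float)  # model name -> dollars spent

    @property
    def total(self) -> float:
        return sum(self.by_model.values())

    def ensure_room(self, estimated_cost: float = 0.0) -> None:
        """Call before a request; raises if the estimate does not fit the cap."""
        if self.total + estimated_cost > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.total:.6f} of ${self.cap_usd:.6f}, "
                f"next call estimated at ${estimated_cost:.6f}"
            )

    def record(self, model: str, cost_usd: float) -> None:
        """Call after a request with the value from completion_cost()."""
        self.by_model[model] += cost_usd
```

A typical loop would call tracker.ensure_room() before each request and tracker.record(response.model, cost) afterwards. Persisting by_model to a database gives you the per-provider breakdown described above.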

Figure: LiteLLM cost tracking per provider. completion_cost() per call. Budget leaks found before the invoice arrives.

Automatic Fallbacks and Routing

One of litellm’s most powerful features is automatic fallbacks: if the primary model fails (rate limit, downtime, API error), it automatically tries the next model in your fallback list. This makes your application resilient to provider outages without any custom retry logic:

# fallback_routing.py
import litellm

litellm.set_verbose = False  # suppress debug logs

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python's GIL in one sentence?"}],
    # Priority list -- litellm tries each model in order if the primary call fails
    fallbacks=[
        "claude-3-5-sonnet-20241022",   # first fallback
        "gemini/gemini-1.5-flash",       # second fallback
        "gpt-4o-mini"                    # final fallback
    ],
    # Optional: for context-window-exceeded errors specifically, map a model
    # to one with a larger context window. litellm.completion() takes this as
    # context_window_fallback_dict; the list-of-dicts form named
    # context_window_fallbacks is the Router / proxy config equivalent.
    context_window_fallback_dict={
        "gpt-4o": "gemini/gemini-1.5-flash",
        "claude-3-5-sonnet-20241022": "gemini/gemini-1.5-flash",
    },
)

print(f"Model used: {response.model}")
print(f"Response: {response.choices[0].message.content}")

Output:

Model used: gpt-4o
Response: Python's GIL (Global Interpreter Lock) is a mutex that prevents multiple threads from executing Python bytecode simultaneously in CPython, limiting true parallelism.

The context-window fallback handles a common real-world scenario: a long conversation exceeds the context limit of the current model, and the request needs to be retried on a model with a larger context window. litellm detects context-window-exceeded errors and reroutes the call, so your code does not need to catch and handle those exceptions manually. For the fallback to help, the target model must actually accept more tokens than the model it replaces.
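
Conceptually, a fallback list behaves like the loop below. This is a hand-written sketch for intuition only, not litellm's implementation; first_successful and its call parameter are hypothetical names, and the real library also distinguishes between error types.

```python
# fallback_sketch.py -- conceptual illustration, not litellm's internals
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_successful(models: Sequence[str], call: Callable[[str], T]) -> Tuple[str, T]:
    """Try each model in order and return (model, result) for the first success."""
    errors: List[str] = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # a real router would filter by error type
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

With litellm, call would be something like lambda m: litellm.completion(model=m, messages=messages). Returning the model name alongside the result mirrors why the example above prints response.model: after a fallback, the model that answered may not be the one you asked for.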

Load Balancing Across Multiple API Keys

When you have multiple API keys for the same provider (common in organizations with per-team keys), litellm can distribute calls across them to stay under rate limits:

# load_balancing.py
from litellm import Router

# Define your pool of models/keys
model_list = [
    {
        "model_name": "gpt4-pool",       # your internal name for this pool
        "litellm_params": {
            "model": "gpt-4o-mini",
            "api_key": "sk-team-a-key",  # Team A's key
        },
    },
    {
        "model_name": "gpt4-pool",
        "litellm_params": {
            "model": "gpt-4o-mini",
            "api_key": "sk-team-b-key",  # Team B's key
        },
    },
    {
        "model_name": "gpt4-pool",
        "litellm_params": {
            "model": "claude-3-haiku-20240307",  # cheaper fallback
            "api_key": "sk-ant-key",
        },
    },
]

router = Router(
    model_list=model_list,
    routing_strategy="least-busy",  # or "simple-shuffle", "usage-based-routing"
    num_retries=3,
    timeout=30,
)

response = router.completion(
    model="gpt4-pool",
    messages=[{"role": "user", "content": "Define 'duck typing' in Python."}]
)
print(response.choices[0].message.content)

Output:

Duck typing in Python means an object's suitability is determined by the presence of certain methods and attributes rather than its actual type -- if it walks like a duck and quacks like a duck, Python treats it like one.

The Router class is designed for production deployments. The "least-busy" strategy routes each request to whichever deployment currently has the fewest in-flight requests. Combined with num_retries, this gives you automatic load balancing and retry on failure without an external proxy service.
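
To make the least-busy idea concrete, here is a toy model of the selection rule: keep a counter of in-flight requests per deployment and pick the smallest. LeastBusyPool is a hypothetical class written for this illustration; the Router's real bookkeeping is more involved.

```python
# least_busy_sketch.py -- toy model of the selection rule, not the Router itself
from contextlib import contextmanager
from typing import Dict, Iterable, Iterator


class LeastBusyPool:
    def __init__(self, deployments: Iterable[str]) -> None:
        # in-flight request count per deployment
        self.in_flight: Dict[str, int] = {name: 0 for name in deployments}

    def pick(self) -> str:
        """Deployment with the fewest in-flight requests (ties: first listed)."""
        return min(self.in_flight, key=self.in_flight.get)

    @contextmanager
    def lease(self) -> Iterator[str]:
        """Reserve a deployment for the duration of one request."""
        name = self.pick()
        self.in_flight[name] += 1
        try:
            yield name
        finally:
            self.in_flight[name] -= 1  # released even if the request fails
```

While one request holds a lease on the first key, the next request is sent to the second, which is how the strategy keeps any single key from absorbing all the traffic.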

Figure: LiteLLM load balancing across providers. routing_strategy="least-busy". Your rate limits are now someone else’s problem.

Streaming Responses

For chat applications and real-time displays, streaming lets you show the response as it generates instead of waiting for completion. litellm streaming works the same across all providers:

# streaming_example.py
import litellm
import sys

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a 3-step guide to Python list comprehensions."}],
    stream=True
)

print("Streaming response:")
full_text = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        full_text += delta

print()  # newline after stream
print(f"\nTotal chars received: {len(full_text)}")

Output:

Streaming response:
1. **Basic syntax**: [expression for item in iterable]
2. **With condition**: [x*2 for x in range(10) if x % 2 == 0]
3. **Nested**: [cell for row in matrix for cell in row]

Total chars received: 112
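
If you consume streams in more than one place, the loop above can be wrapped in a small helper. The sketch below is one way to do it; collect_stream and fake_chunk are hypothetical names, and skipping chunks with an empty choices list is a defensive assumption rather than documented behavior.

```python
# stream_helper_sketch.py -- illustrative helper around the OpenAI-style chunk shape
from types import SimpleNamespace
from typing import Callable, Iterable, Optional


def collect_stream(chunks: Iterable, on_text: Optional[Callable[[str], None]] = None) -> str:
    """Join the text deltas of a stream, optionally reporting each piece as it arrives."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # content can be None or empty on role and finish chunks
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute path as a real chunk, for offline testing."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])
```

Passing on_text=lambda t: print(t, end="", flush=True) reproduces the live display from the example, and the return value replaces full_text. Collecting pieces in a list and joining once avoids repeated string concatenation on long responses.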

Async Support

For high-throughput applications, use litellm.acompletion() to make concurrent API calls. This is far faster than sequential calls when processing multiple prompts:

# async_litellm.py
import asyncio
import litellm

async def classify(text: str, index: int) -> dict:
    response = await litellm.acompletion(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Classify as 'technical', 'business', or 'other': '{text}'. Reply with just the category."
        }]
    )
    return {
        "index": index,
        "text": text[:40],
        "category": response.choices[0].message.content.strip().lower()
    }

async def main():
    texts = [
        "How do I implement a binary search tree in Python?",
        "What is our Q3 revenue forecast?",
        "Please schedule a team lunch for Friday.",
        "Explain async/await concurrency model.",
        "Update the marketing deck for the board meeting.",
    ]

    # Run all 5 calls concurrently -- takes ~1s instead of ~5s
    tasks = [classify(text, i) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)

    for r in results:
        print(f"[{r['index']}] {r['category']:10s} | {r['text']}...")

asyncio.run(main())

Output:

[0] technical  | How do I implement a binary search tree ...
[1] business   | What is our Q3 revenue forecast?...
[2] other      | Please schedule a team lunch for Friday....
[3] technical  | Explain async/await concurrency model....
[4] business   | Update the marketing deck for the board ...

By using asyncio.gather(), all 5 API calls run concurrently. The total time is roughly equal to the slowest single call rather than the sum of all calls. For batch classification, summarization, or embedding tasks with many inputs, this pattern reduces wall-clock time dramatically: a 50-item batch that takes 50 seconds sequentially often completes in 5-8 seconds with async, provided your provider’s rate limits allow that many requests in flight.
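
One caveat with a plain asyncio.gather() is that the first exception propagates and the results of the other calls are lost to the caller. A minimal sketch of a more forgiving variant, using the standard return_exceptions=True flag, is shown below; gather_results is a hypothetical helper name.

```python
# gather_sketch.py -- keep successful results even when some calls fail
import asyncio
from typing import Awaitable, Iterable, List, Tuple


async def gather_results(coros: Iterable[Awaitable]) -> Tuple[List, List[BaseException]]:
    """Run everything concurrently and split the outcomes into (ok, failed)."""
    outcomes = await asyncio.gather(*coros, return_exceptions=True)
    ok = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [o for o in outcomes if isinstance(o, BaseException)]
    return ok, failed
```

In the classification example, results = await asyncio.gather(*tasks) could become ok, failed = await gather_results(tasks), after which you can log or retry the failed items without discarding the rest of the batch.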

Running litellm as a Local Proxy Server

litellm can run as an OpenAI-compatible proxy server, so any tool that supports OpenAI’s API (like VS Code extensions, LangChain, or custom apps) can use it to route to any backend. The proxy needs the extra dependencies installed with pip install 'litellm[proxy]'. Start it from the terminal:

# Terminal -- start the proxy server
litellm --model gpt-4o-mini --port 8000

# Or with a config file for multiple models:
# litellm --config litellm_config.yaml --port 8000

# litellm_config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: sk-...
  - model_name: claude-sonnet
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: sk-ant-...

router_settings:
  routing_strategy: least-busy

# proxy_client.py -- connect to the local proxy using standard OpenAI SDK
from openai import OpenAI

# Point to your local litellm proxy
client = OpenAI(base_url="http://localhost:8000", api_key="any-string")

response = client.chat.completions.create(
    model="claude-sonnet",    # name from your config
    messages=[{"role": "user", "content": "What is Python's walrus operator?"}]
)
print(response.choices[0].message.content)

Output:

Python's walrus operator (:=) is an assignment expression introduced in Python 3.8 that lets you assign and return a value in the same expression -- useful in while loops and comprehensions to avoid evaluating the same expression twice.

Real-Life Example: Multi-Provider Summarization Service

Figure: LiteLLM parallel provider routing. One service, three providers. Load is spread across them on each call.

# summarization_service.py
"""
Multi-provider summarization service with cost tracking and fallbacks.
Uses litellm Router to distribute load and fall back automatically.
"""
import os

import litellm
from litellm import Router
from dataclasses import dataclass
from typing import Optional

@dataclass
class SummaryResult:
    text: str
    model_used: str
    tokens: int
    cost_usd: float

# Router with three low-cost models in one pool; API keys are read from the environment
router = Router(
    model_list=[
        {"model_name": "summarizer", "litellm_params": {"model": "gpt-4o-mini", "api_key": os.getenv("OPENAI_API_KEY")}},
        {"model_name": "summarizer", "litellm_params": {"model": "claude-3-haiku-20240307", "api_key": os.getenv("ANTHROPIC_API_KEY")}},
        {"model_name": "summarizer", "litellm_params": {"model": "gemini/gemini-1.5-flash", "api_key": os.getenv("GEMINI_API_KEY")}},
    ],
    routing_strategy="simple-shuffle",  # spread load across providers
    num_retries=2,
)

def summarize(text: str, max_words: int = 50) -> Optional[SummaryResult]:
    """Summarize text using the next available provider."""
    prompt = f"Summarize the following in {max_words} words or fewer:\n\n{text}"
    try:
        response = router.completion(
            model="summarizer",
            messages=[{"role": "user", "content": prompt}]
        )
        cost = litellm.completion_cost(completion_response=response)
        return SummaryResult(
            text=response.choices[0].message.content,
            model_used=response.model,
            tokens=response.usage.total_tokens,
            cost_usd=cost,
        )
    except Exception as e:
        print(f"All providers failed: {e}")
        return None

if __name__ == "__main__":
    articles = [
        "Python 3.14 introduces t-strings (template strings), a new string prefix that gives developers programmatic access to string interpolation before it happens. Unlike f-strings which evaluate immediately, t-strings return a Template object that can be inspected, modified, or sanitized before rendering -- making them ideal for use cases like SQL queries, HTML templates, and logging where untrusted data could cause injection attacks.",
        "The Python Software Foundation announced that Python 3.12 will receive security-only fixes starting in January 2025, with end-of-life in October 2028. Users of Python 3.11 and earlier are encouraged to upgrade, as each version receives bug fixes for 18 months after release and security fixes for 5 years total.",
    ]

    total_cost = 0.0
    for i, article in enumerate(articles, 1):
        result = summarize(article, max_words=30)
        if result:
            print(f"\n--- Article {i} ---")
            print(f"Summary: {result.text}")
            print(f"Model: {result.model_used} | Tokens: {result.tokens} | Cost: ${result.cost_usd:.6f}")
            total_cost += result.cost_usd

    print(f"\nTotal cost for {len(articles)} summaries: ${total_cost:.6f}")

Output:

--- Article 1 ---
Summary: Python 3.14 t-strings return a Template object instead of evaluating immediately, enabling inspection and sanitization before rendering -- ideal for preventing injection attacks.
Model: gpt-4o-mini | Tokens: 103 | Cost: $0.000052

--- Article 2 ---
Summary: Python 3.12 enters security-only support in Jan 2025, ending Oct 2028. Users on 3.11 or earlier should upgrade.
Model: claude-3-haiku-20240307 | Tokens: 88 | Cost: $0.000018

Total cost for 2 summaries: $0.000070

This service pattern is immediately deployable as a FastAPI endpoint. Add authentication, a database to log each call’s cost and model, and a monthly budget cap per user — and you have a production-ready multi-provider LLM service. The router handles provider selection and failover transparently, and the SummaryResult dataclass gives your callers full visibility into which model was used and what it cost.

Frequently Asked Questions

Which models does litellm support?

litellm supports 100+ models across providers including OpenAI, Anthropic, Google (Gemini and Vertex AI), AWS Bedrock, Azure OpenAI, Mistral, Cohere, Hugging Face inference endpoints, and local models via Ollama. The full list is at docs.litellm.ai/docs/providers. Model names generally follow the convention provider/model-name (e.g., gemini/gemini-1.5-flash, ollama/llama3.2); OpenAI models and other well-known names such as claude-3-haiku-20240307 are also recognized without a prefix.

How accurate is the cost tracking?

litellm maintains a pricing database that is updated with each library release. For providers with stable public pricing (OpenAI, Anthropic, Google), the cost estimates are highly accurate. For providers that change pricing frequently or offer custom contracts, verify against your provider’s billing dashboard. The behavior for a model that is missing from the pricing database has varied between releases: completion_cost() may return 0.0 or raise an exception, so treat either one as a sign that pricing data is unavailable rather than as a free call.

Does litellm add latency?

The overhead is typically 1-5 milliseconds per call — the time to translate the request and response format. For most LLM use cases where the model call takes 500ms-5000ms, this is negligible. If you are benchmarking latency-sensitive applications, measure with and without litellm and confirm the overhead is acceptable for your use case.

Can I use litellm with local models?

Yes. For Ollama, set model="ollama/llama3.2" and ensure Ollama is running locally. For other local inference servers (vLLM, LM Studio), point api_base at your local endpoint and use the openai/ prefix. No API key is needed for local models — pass api_key="local" as a placeholder.
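
The two cases above can be captured in a small helper that builds the keyword arguments for litellm.completion(). This is a sketch under assumptions: local_model_kwargs, ask_local, and the DEFAULT_PORTS table are hypothetical names, the ports are the defaults these servers use out of the box, and the /v1 suffix assumes the server exposes the usual OpenAI-compatible path.

```python
# local_models_sketch.py -- illustrative helper, not part of litellm
from typing import Optional

# Default ports these servers listen on out of the box
DEFAULT_PORTS = {"ollama": 11434, "vllm": 8000, "lmstudio": 1234}


def local_model_kwargs(backend: str, model: str, port: Optional[int] = None) -> dict:
    """Build model/api_base arguments for a locally hosted model."""
    if backend not in DEFAULT_PORTS:
        raise ValueError(f"unknown backend: {backend!r}")
    base = f"http://localhost:{port or DEFAULT_PORTS[backend]}"
    if backend == "ollama":
        return {"model": f"ollama/{model}", "api_base": base}
    # OpenAI-compatible servers: openai/ prefix plus a placeholder key
    return {"model": f"openai/{model}", "api_base": f"{base}/v1", "api_key": "local"}


def ask_local(backend: str, model: str, question: str) -> str:
    import litellm  # imported lazily; requires `pip install litellm` and a running server

    response = litellm.completion(
        messages=[{"role": "user", "content": question}],
        **local_model_kwargs(backend, model),
    )
    return response.choices[0].message.content
```

Calling ask_local("ollama", "llama3.2", "...") then looks the same as any hosted call, which is what lets a local model take part in the fallback and routing setups described earlier.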

How many concurrent calls can I make with acompletion?

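This is limited by your provider’s rate limits, not by litellm. The exact numbers depend on the provider, the model, and your account tier, so check the limits page in your provider dashboard rather than relying on a fixed figure. If you are hitting rate limits, cap concurrency with a semaphore: create one shared asyncio.Semaphore(20) before launching your tasks and wrap each call in async with semaphore: so that at most 20 calls are in flight. Creating a new semaphore inside each call limits nothing, because every call would get its own counter.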
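
A minimal sketch of the shared-semaphore pattern follows. run_limited is a hypothetical helper name; it takes zero-argument callables (so that no coroutine is created before its turn) and uses only asyncio.

```python
# semaphore_sketch.py -- bounded concurrency with one shared semaphore
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_limited(factories: Iterable[Callable[[], Awaitable]], limit: int = 20) -> List:
    """Run all jobs concurrently, with at most `limit` in flight at any moment."""
    semaphore = asyncio.Semaphore(limit)  # created once, shared by every job

    async def guarded(factory: Callable[[], Awaitable]):
        async with semaphore:  # waits here while `limit` jobs are already running
            return await factory()

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(f) for f in factories))
```

With litellm, each factory would be something like lambda t=text: litellm.acompletion(model="gpt-4o-mini", messages=[{"role": "user", "content": t}]), and the whole batch runs as await run_limited(factories, limit=20).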

How does litellm compare to LangChain’s model abstractions?

LangChain provides a complete pipeline framework (prompts, chains, agents, memory) and uses its own model wrapper classes. litellm is focused solely on the API call layer: it is a drop-in replacement for the raw API call, not a pipeline framework. They complement each other: LangChain can use litellm’s proxy endpoint as its OpenAI backend, giving you LangChain’s pipeline features plus litellm’s multi-provider routing. For projects that do not need LangChain’s pipeline abstractions, litellm alone is simpler.

Conclusion

The litellm library removes the vendor lock-in problem from LLM applications. We covered the unified completion interface, built-in cost tracking, automatic fallbacks, the Router class for load balancing, streaming, async concurrency, and the local proxy server. The summarization service example showed how these features combine into a production-ready multi-provider service.

The natural next step is to extend the real-life example into a FastAPI service with per-user cost caps. Once your application is behind a litellm abstraction layer, you can experiment with new models freely — the day a better or cheaper model is released, switching is a one-line config change.

Full documentation and provider setup guides are at docs.litellm.ai. The GitHub repository has an active community and frequent updates as new providers are added.