Introduction: Unlocking AI with Python

The OpenAI API brings powerful language models directly into your Python applications. Whether you’re building chatbots, automating content creation, analyzing text, or generating embeddings, the official OpenAI Python SDK makes integration straightforward. In this guide, we’ll work from basic chat completions up to more advanced features like function calling and embeddings, with runnable examples throughout.

The modern AI landscape has democratized access to sophisticated language models. What once required significant ML expertise now takes just a few lines of Python. The OpenAI API currently powers applications used by millions of developers worldwide, and with the latest Python SDK (v1.0+), the experience is more elegant and Pythonic than ever. You’ll gain the skills to harness models like GPT-4o, GPT-4o-mini, and GPT-3.5-turbo in your projects.

By the end of this tutorial, you’ll understand how to initialize the client, handle authentication, construct effective prompts, stream responses for real-time interaction, invoke external tools through function calling, generate embeddings, and implement robust error handling. We’ll also examine a complete CLI chatbot implementation that demonstrates conversation history management.

Quick Example: Your First API Call

Let’s get straight to it. Here’s a minimal example that demonstrates the power of the OpenAI API. This script creates a single chat completion request and displays the model’s response. It assumes your OPENAI_API_KEY environment variable is set:

# quick_chat.py
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain recursion in 20 words."}]
)
print(response.choices[0].message.content)

Output:

Recursion is a function calling itself to solve smaller versions of the same problem until reaching a base case.

The OpenAI() client reads your API key from the environment. The create() call sends your messages to the model and returns a structured response: choices is a list of completions (one by default), and message.content holds the generated text.

What Is the OpenAI API?

The OpenAI API is a REST interface that gives you programmatic access to OpenAI’s language models. Rather than using the web interface, you call the API from your application. The official Python SDK wraps this REST API, handling authentication, request formatting, and response parsing automatically.

OpenAI offers multiple models optimized for different use cases:

Model          | Best For                                  | Context Window | Relative Cost | Speed
gpt-4o         | Complex reasoning, multimodal, production | 128K tokens    | Higher        | Moderate
gpt-4o-mini    | Fast, cost-effective, high volume         | 128K tokens    | Low           | Fast
gpt-3.5-turbo  | Legacy applications                       | 16K tokens     | Very Low      | Fastest

For most new projects, we recommend gpt-4o-mini as your starting point. The API also supports embeddings, audio transcription, image generation, and fine-tuning.

Installing the OpenAI Python SDK

The official OpenAI Python SDK is available on PyPI. We recommend installing within a virtual environment:

# install_openai.sh
$ python3 -m venv openai_env
$ source openai_env/bin/activate
$ pip install openai

Output:

Successfully installed openai-1.30.0

Verify the installation:

# verify_openai.py
import openai
print(f"OpenAI SDK version: {openai.__version__}")

Output:

OpenAI SDK version: 1.30.0

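SDK 1.30.0 requires Python 3.7.1 or higher. Newer SDK releases have raised the minimum Python version, so check the package page on PyPI for the release you install.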

Setting Up Your API Key

Every request requires authentication via an API key. Create one at platform.openai.com/api-keys. Never hardcode your key in source code. Use environment variables instead:

# setup_env.sh
$ export OPENAI_API_KEY="sk-proj-your-actual-key-here"

The OpenAI() client automatically reads this environment variable:

# client_init.py
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from environment
print("Client initialized successfully")

Output:

Client initialized successfully

Chat Completions: The Core API

Chat completions are the foundation of most OpenAI applications. You send a list of messages and the model generates a completion:

# chat_basic.py
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What are three benefits of Python for data science?"}
    ],
    max_tokens=200,
    temperature=0.7
)
print(response.choices[0].message.content)

Output:

1. Rich Ecosystem: Libraries like pandas, NumPy, and scikit-learn provide comprehensive tools.
2. Ease of Learning: Python's readable syntax lets data scientists focus on algorithms.
3. Community and Integration: Strong community support and seamless production integration.

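Key parameters: model selects the model, messages holds the conversation, max_tokens caps the number of tokens the model may generate (a reply that hits the cap is cut off), and temperature controls randomness (the API default is 1; 0.7 is a reasonable starting point for general chat).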

System Messages and Conversation Roles

System messages set the assistant’s behavior and personality. Every conversation should begin with one:

# system_messages.py
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful Python tutor. Keep responses under 150 words."},
        {"role": "user", "content": "What is a list comprehension?"}
    ],
    temperature=0.5
)
print(response.choices[0].message.content)

Output:

A list comprehension is a concise way to create lists in Python:
squares = [x ** 2 for x in range(5)]  # [0, 1, 4, 9, 16]

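Messages use three main roles: system (instructions), user (human input), and assistant (model responses). A fourth role, tool, carries function results back to the model. To maintain context across turns, keep every message in a list and send the whole list with each request.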

Streaming Responses

Streaming sends tokens as they’re generated, creating a real-time effect:

# streaming_response.py
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Output:

Code flows like rivers
Functions call within themselves
Logic pure and clean

With stream=True, the call returns an iterator that yields chunks as they arrive instead of one complete response, which suits chat-style web UIs.
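
In a real application you usually want the complete reply as well as the live tokens, for example to store it in conversation history. The sketch below shows one way to accumulate the streamed pieces. The helper names collect_stream and fake_chunk are hypothetical, and the simulated chunks only mimic the shape of the SDK's chunk objects so the example runs without an API key.

```python
from types import SimpleNamespace


def collect_stream(stream, on_token=None):
    """Consume a chat-completions stream and return the full reply text."""
    parts = []
    for chunk in stream:
        # Some chunks carry no choices (e.g. a trailing usage-only chunk).
        if not chunk.choices:
            continue
        token = chunk.choices[0].delta.content
        if token:  # role-only and final chunks have content=None
            parts.append(token)
            if on_token is not None:
                on_token(token)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in that mimics the shape of an SDK chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Simulated stream so the sketch runs without an API key or network access.
demo_stream = [fake_chunk(None), fake_chunk("Code "), fake_chunk("flows"),
               SimpleNamespace(choices=[])]
full_text = collect_stream(demo_stream)
print(full_text)  # → Code flows
```

With a real stream, passing on_token=lambda t: print(t, end="", flush=True) would reproduce the live output of the script above while still returning the full text.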

Function Calling and Tool Use

Function calling lets the model request your application invoke specific functions:

# function_calling.py
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
    tools=tools,
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Arguments: {call.function.arguments}")

Output:

Function: get_weather
Arguments: {"city": "New York", "unit": "fahrenheit"}

The model decides which function to invoke and structures arguments automatically. Your code executes the logic and sends results back.
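
The "execute and send results back" step can be sketched as follows. This is a minimal illustration, not the only way to structure it: get_weather returns canned data, and TOOL_REGISTRY and run_tool_call are hypothetical names. The returned dictionary follows the message shape the API expects for tool results (role "tool" plus the matching tool_call_id).

```python
import json


def get_weather(city, unit="celsius"):
    """Hypothetical local implementation; a real one would query a weather service."""
    return {"city": city, "unit": unit, "temperature": 21}


# Map tool names from the schema to the Python callables that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id, name, arguments_json):
    """Execute one tool call and build the follow-up message for the model."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            # Arguments arrive as a JSON string and may be malformed or incomplete.
            result = func(**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


tool_message = run_tool_call("call_123", "get_weather", '{"city": "New York"}')
print(tool_message)
```

In the script above you would call run_tool_call(call.id, call.function.name, call.function.arguments), append the assistant message that contained the tool call and then this tool message to messages, and call client.chat.completions.create again so the model can phrase the final answer.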

Generating Embeddings

Embeddings are numerical representations of text for semantic search and similarity:

# embeddings_example.py
from openai import OpenAI

client = OpenAI()
texts = ["The cat sat on the mat.", "A feline rests on the rug.", "The dog ran through the park."]

response = client.embeddings.create(model="text-embedding-3-small", input=texts)
for i, item in enumerate(response.data):
    print(f"Text {i}: {len(item.embedding)} dimensions, first 3: {item.embedding[:3]}")

Output:

Text 0: 1536 dimensions, first 3: [-0.0234, 0.0891, -0.0123]
Text 1: 1536 dimensions, first 3: [-0.0245, 0.0885, -0.0115]
Text 2: 1536 dimensions, first 3: [0.0123, 0.0342, 0.0789]

Semantically similar texts produce similar embeddings. Store them in vector databases like ChromaDB for powerful search.
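
To make "similar embeddings" concrete, the sketch below ranks vectors by cosine similarity using only the standard library. The three-dimensional vectors are made-up toy values standing in for real embeddings, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real 1536-dimensional embeddings.
cat = [0.9, 0.1, 0.0]
feline = [0.8, 0.2, 0.1]
dog_park = [0.1, 0.2, 0.9]

ranking = rank_by_similarity(cat, [feline, dog_park])
print(ranking[0][0])  # → 0 (the "feline" vector is the closest match)
```

With real data you would pass item.embedding values from response.data instead of the toy vectors.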

Error Handling and Rate Limits

Production applications must handle errors gracefully:

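# error_handling.py
from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

client = OpenAI()
try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=10
    )
    print(response.choices[0].message.content)
except RateLimitError:
    # HTTP 429; a subclass of APIStatusError, so it must be caught first
    print("Rate limit exceeded. Wait before retrying.")
except APIStatusError as e:
    # Any other non-2xx response; status_code is only defined on this class
    print(f"API error: {e.status_code} - {e.message}")
except APIConnectionError as e:
    # Network failure or timeout: there is no HTTP status to report
    print(f"Connection error: {e}")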

Output:

Hi there! How can I help you today?

Implement exponential backoff for rate limits — wait progressively longer between retries.
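
A minimal sketch of that retry pattern follows. The helper call_with_backoff and its parameters are hypothetical, and FakeRateLimit stands in for openai.RateLimitError so the demo runs offline; in real code you would pass retryable=(RateLimitError,).

```python
import random
import time


def call_with_backoff(func, retryable=(Exception,), max_attempts=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on the given exception types with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the original error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay,
            # plus random jitter so parallel clients do not retry in lockstep.
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, delay / 2))


class FakeRateLimit(Exception):
    """Stand-in for openai.RateLimitError so the demo runs offline."""


attempts = {"count": 0}


def flaky_request():
    """Fails twice, then succeeds, to exercise the retry loop."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimit("429")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_request, retryable=(FakeRateLimit,), sleep=delays.append)
print(result, len(delays))  # → ok 2
```

The SDK client also retries some failures on its own (controlled by its max_retries option), so a wrapper like this is mainly useful when you want explicit control over the schedule.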

Real-Life Example: Interactive CLI Chatbot

Here’s a complete chatbot with conversation history:

# chatbot.py
from openai import OpenAI

class Chatbot:
    def __init__(self, system_prompt="You are a helpful assistant."):
        self.client = OpenAI()
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=self.messages,
                temperature=0.7,
                max_tokens=500
            )
            reply = response.choices[0].message.content
            self.messages.append({"role": "assistant", "content": reply})
            return reply
        except Exception as e:
            self.messages.pop()  # drop the unanswered user turn so history stays consistent
            return f"Error: {e}"

    def save_history(self, filename="chat_history.txt"):
        with open(filename, "w", encoding="utf-8") as f:
            for msg in self.messages:
                f.write(f"{msg['role'].upper()}:\n{msg['content']}\n\n")

    def run(self):
        print("Chatbot ready. Type 'quit' to exit, 'save' to save history.\n")
        while True:
            user_input = input("You: ").strip()
            if not user_input:
                continue
            if user_input.lower() == "quit":
                break
            if user_input.lower() == "save":
                self.save_history()
                print("History saved.")
                continue
            print(f"Assistant: {self.chat(user_input)}\n")

if __name__ == "__main__":
    Chatbot("You are a knowledgeable Python expert.").run()

Usage:

$ python chatbot.py
Chatbot ready. Type 'quit' to exit, 'save' to save history.

You: What's the difference between lists and tuples?
Assistant: Lists are mutable, tuples are immutable...

You: save
History saved.

This demonstrates conversation history management, error handling, persistent storage, and an interactive loop.

Frequently Asked Questions

How much does the OpenAI API cost?

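OpenAI uses pay-per-token pricing, with separate rates for input and output tokens. At the time of writing, gpt-4o-mini costs roughly $0.15 per million input tokens. Prices change, so check the pricing page, and set hard spending limits in your account settings.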
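
A rough per-request cost estimate can be computed from token counts, as sketched below. The input price is the figure quoted above; the output price is an assumption for illustration and should be replaced with the current value from the pricing page. The names PRICE_PER_MILLION and estimate_cost are hypothetical.

```python
# Prices in USD per million tokens. The input price is the figure quoted in
# the text; the output price is an ASSUMPTION and must be checked against
# the current pricing page.
PRICE_PER_MILLION = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Return the estimated cost in USD for one request."""
    price = PRICE_PER_MILLION[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000


cost = estimate_cost("gpt-4o-mini", prompt_tokens=1_000_000, completion_tokens=0)
print(f"${cost:.2f}")  # → $0.15
```

After a real call, response.usage.prompt_tokens and response.usage.completion_tokens supply the two counts.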

What’s the difference between temperature and top_p?

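temperature rescales the token probabilities before sampling: values near 0 make output nearly deterministic, while values toward the maximum of 2 make it increasingly random. top_p uses nucleus sampling, which restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p. For most apps, adjust temperature and leave top_p at 1.0.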
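
The toy example below illustrates the two ideas on three made-up token scores. It is a conceptual sketch of temperature scaling and nucleus filtering in general, not a description of how OpenAI implements sampling internally; the function names and logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized over the kept tokens


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
cold = apply_temperature(logits, 0.5)
hot = apply_temperature(logits, 2.0)
print(cold[0] > hot[0])  # → True: lower temperature concentrates probability
print(sorted(nucleus_filter(apply_temperature(logits, 1.0), 0.9)))  # → [0, 1]
```

At top_p=0.9 the least likely token is dropped entirely, which is why lowering top_p trims the unlikely tail while temperature reshapes the whole distribution.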

How long can a conversation be?

Conversation length is limited by the model’s context window: 128K tokens for gpt-4o and gpt-4o-mini, shared between the messages you send and the reply. Monitor response.usage to track consumption.

Can I fine-tune the models?

Yes, OpenAI supports fine-tuning for specific models. Start with prompt engineering first — it’s usually sufficient and cheaper.

How do I handle sensitive data?

Never send PII (SSNs, credit cards) to the API. Use data scrubbing and anonymization. Review OpenAI’s privacy policy for compliance.

Conclusion

You now have a solid foundation for building with the OpenAI API: chat completions, system messages, streaming, function calling, embeddings, and error handling. The Python SDK keeps the integration code short. Start with a simple chatbot and extend from there.

Visit the official documentation at platform.openai.com/docs for advanced features like fine-tuning and batch processing.

Setting Up the OpenAI Client

The official Python SDK is openai. Authentication via the OPENAI_API_KEY environment variable is the simplest and safest path:

# pip install openai

import os
from openai import OpenAI

# Reads from OPENAI_API_KEY env var
client = OpenAI()

# Or pass explicitly (never hardcode in production)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Test
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one sentence"}],
)
print(resp.choices[0].message.content)

For local development, put your key in a .env file and load it with python-dotenv — never commit keys to git. In production, use your platform’s secrets manager (AWS Secrets Manager, Vault, GCP Secret Manager).
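
Whichever way the key reaches the environment, it helps to fail fast with a clear message when it is missing. The helper below is a hypothetical sketch of such a check using only the standard library; the demo passes a plain dictionary with a placeholder value instead of reading a real key.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or load it from your secrets manager")
    return key


# Demo with a plain dict standing in for os.environ (placeholder, not a real key).
print(require_api_key({"OPENAI_API_KEY": "sk-proj-placeholder"}))  # → sk-proj-placeholder
```

Calling require_api_key() with no arguments at startup checks the real environment before the first request is made.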

Chat Completions: The Workhorse Endpoint

Chat completions cover the large majority of real-world use cases. The model takes a list of messages with roles (system, user, assistant) and returns the next assistant message:

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a Python expert. Be concise."},
        {"role": "user", "content": "How do I read a CSV?"},
    ],
    temperature=0.2,   # range 0 to 2: lower is more deterministic, higher more varied
    max_tokens=200,    # cap response length
)

print(resp.choices[0].message.content)
print(resp.usage.total_tokens)   # tokens you'll be billed for

The system message shapes the model’s behavior across the conversation. temperature is usually the most impactful sampling parameter: drop it to 0 for code generation or structured outputs, and raise it for creative writing.

Streaming Responses

For chat UIs and long completions, streaming the response gives users immediate feedback. Iterate over chunks as they arrive:

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about Python"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()

Streaming reduces perceived latency from “5 seconds of waiting” to “instant first word”. Build it in from the start for any user-facing feature.

Function Calling / Tools

For agents that need to call code (lookup data, run calculations, fetch URLs), use the tools/function-calling feature. You describe the functions; the model decides when to call them and with what arguments:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

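message = resp.choices[0].message
if message.tool_calls:  # None when the model answers directly instead of calling a tool
    tool_call = message.tool_calls[0]
    print(tool_call.function.name)        # 'get_weather'
    print(tool_call.function.arguments)   # JSON string, e.g. '{"city":"Tokyo","unit":"celsius"}'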

# Your code calls the actual function, then sends the result back as a message
# with role="tool" and tool_call_id matching the call

Structured Outputs (JSON Mode)

When you need the model to return parseable JSON, use the response_format parameter or structured outputs:

from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    interests: list[str]

# In older SDK releases this helper lives at client.beta.chat.completions.parse
resp = client.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me a sample user profile"}],
    response_format=UserProfile,
)

profile = resp.choices[0].message.parsed
print(profile.name, profile.age, profile.interests)

This is the modern way to do structured extraction — the SDK validates the output against your pydantic schema, raising if the model returns anything malformed.

Embeddings for Search and Similarity

Embeddings turn text into vectors. Cosine similarity between vectors approximates semantic similarity — the foundation of semantic search, RAG, and clustering:

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "Python is a programming language",
        "JavaScript runs in browsers",
        "I love coding in Python every day",
    ],
)

for item in resp.data:
    print(len(item.embedding))    # 1536 dimensions

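# OpenAI embeddings are normalized to unit length, so the dot product
# of two vectors equals their cosine similarity
import numpy as np
vecs = np.array([d.embedding for d in resp.data])
sim = vecs @ vecs.T
print(sim)   # 3x3 matrix: sim[i][j] compares text i with text j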

Common Pitfalls

  • Hardcoding keys. Hardcoded API keys end up on GitHub, get scraped within hours, and get revoked. Always use environment variables.
  • Not setting max_tokens. Unbounded responses can rack up costs fast. Set max_tokens on every call.
  • Building chat history forever. Sending the whole conversation every turn means quadratic token growth. Truncate or summarize old messages once the context approaches the model’s window.
  • Ignoring rate limits. OpenAI returns 429 errors when you hit RPM / TPM limits. Wrap calls with exponential backoff (tenacity library) or use the async client with concurrency limits.
  • Treating model output as code. Never exec() or eval() generated code. Treat outputs as untrusted user input — validate, sanitize, sandbox.
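
The history-growth pitfall above can be addressed with a simple trimming step before each request. The sketch below keeps the system message and the newest turns that fit a token budget. It is a rough illustration under two assumptions: every message has string content, and token counts are approximated at about four characters per token (an exact count would need a tokenizer such as tiktoken). The names estimate_tokens and trim_history are hypothetical.

```python
def estimate_tokens(text):
    """Very rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens=3000):
    """Keep system messages, then the newest turns that fit in the budget."""
    system = [m for m in messages if m["role"] == "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest message and stop at the first that does not fit.
    for msg in reversed([m for m in messages if m["role"] != "system"]):
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + kept[::-1]  # restore chronological order


history = [{"role": "system", "content": "Be concise."}]
for i in range(50):
    history.append({"role": "user", "content": f"question {i} " * 20})
    history.append({"role": "assistant", "content": f"answer {i} " * 20})

trimmed = trim_history(history, max_tokens=500)
print(len(history), len(trimmed))
```

Conversations that use tools need extra care, because a tool message must stay together with the assistant message that requested it.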

FAQ

Q: Which model should I use?
A: gpt-4o-mini for cost-effective everyday work — fast and cheap. gpt-4o for harder tasks. o1 / o3 for deep reasoning. Use the smallest model that solves your problem.

Q: How do I control cost?
A: Three levers: pick a smaller model, set max_tokens, truncate conversation history. Monitor with the OpenAI dashboard — set alerts at 50% and 80% of your monthly cap.

Q: How do I handle long documents?
A: Split into chunks (1500-2000 tokens each), embed each chunk, retrieve relevant chunks for each query (RAG), and only send those to the model. LangChain and LlamaIndex automate this pattern.
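
The splitting step can be sketched without any framework. The example below chunks by words with a small overlap so that sentences cut at a boundary still appear whole in one chunk. Words are only a rough proxy for tokens, and the function name and default sizes are assumptions for illustration.

```python
def chunk_words(text, chunk_size=1500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(25))
chunks = chunk_words(document, chunk_size=10, overlap=2)
print(len(chunks))  # → 3
```

Each chunk would then be embedded and stored, and only the best-matching chunks sent to the model with the user's question.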

Q: Is there an async client?
A: Yes — from openai import AsyncOpenAI. Same API, all methods are coroutines. Use it inside FastAPI handlers and async scrapers.
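
When many requests run concurrently, a semaphore keeps the number in flight below your rate limits. The sketch below shows the pattern with a stand-in coroutine, fake_completion, in place of a real awaited client.chat.completions.create(...) call; gather_limited is a hypothetical helper name.

```python
import asyncio


async def gather_limited(worker, items, max_concurrency=5):
    """Run worker(item) for every item with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_completion(prompt):
    """Stand-in for an awaited API call so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(gather_limited(fake_completion, ["a", "b", "c"], max_concurrency=2))
print(results)  # → ['A', 'B', 'C']
```

With the real async client, the worker would be a small coroutine that awaits the API call and returns the message content.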

Q: What about local LLMs?
A: Run open-weight models via Ollama, llama.cpp, or LM Studio. They expose an OpenAI-compatible API — change base_url in the client and the rest of your code keeps working.

Wrapping Up

The OpenAI Python SDK is one of those rare ones where the surface area maps cleanly onto real-world tasks: chat completions, streaming, tool/function calling, structured outputs, and embeddings cover almost everything. Pick the smallest model that does the job, set max_tokens, use environment variables for keys, and validate model output before acting on it. Those four habits prevent a large share of production incidents.