Every chatbot tutorial eventually reaches the same uncomfortable sentence: “You’ll need an OpenAI API key and be comfortable with usage costs.” For development, experimentation, and production systems that process sensitive data, that sentence is a genuine problem. Ollama solves it. It’s a tool that runs large language models locally on your machine — no API keys, no cloud costs, no data leaving your computer, and no rate limits at 2am when you’re debugging.

Ollama supports dozens of models — Llama 3, Mistral, Phi-3, Gemma, Qwen, and more — with a dead-simple CLI for downloading and running them. Once a model is running, it exposes an OpenAI-compatible REST API on localhost:11434, which means any Python code that works with OpenAI can work with Ollama by changing one URL. You get local AI with basically zero friction.

In this tutorial you’ll build a complete conversational chatbot: streaming responses, conversation memory, a system prompt that defines personality, and a web interface using FastAPI. All running locally, all free, all private.

Quick Answer
Install Ollama, run ollama pull llama3.2 to download a model, then use the ollama Python library (pip install ollama) or hit http://localhost:11434/api/chat directly. For conversational memory, maintain a messages list and append each turn. For streaming responses, use ollama.chat(stream=True) and iterate over the chunks.

What Is Ollama?

Ollama is an open-source tool that packages LLM serving into a simple desktop application and CLI. It handles model downloading, quantization management, GPU acceleration (NVIDIA and Apple Silicon), and running a local HTTP server that speaks the OpenAI API format. The mental model is: Docker, but for LLMs instead of containers.

The models available through Ollama are generally quantized versions of popular open-source models — quantization reduces precision from 32-bit or 16-bit floats to 4-bit or 8-bit integers, shrinking the model size by 4-8x with modest quality loss. A 7B parameter model that would need 14GB of VRAM at full precision runs in about 4GB quantized. This makes capable models accessible on consumer hardware.
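A quick back-of-envelope check of those numbers (bits-per-weight times parameter count; real model files add some overhead for embeddings and metadata):

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of a model's weights in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at different precisions:
print(model_size_gb(7, 32))  # 32-bit floats    -> 28.0 GB
print(model_size_gb(7, 16))  # 16-bit floats    -> 14.0 GB
print(model_size_gb(7, 4))   # 4-bit quantized  -> 3.5 GB
```

That 14GB-to-3.5GB drop is exactly why a quantized 7B model fits on a laptop.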

| Model | Size | RAM Required | Best For |
| --- | --- | --- | --- |
| llama3.2:1b | 1.3GB | ~4GB | Fast responses, simple tasks |
| llama3.2:3b | 2.0GB | ~6GB | Good balance of speed and quality |
| llama3.1:8b | 4.7GB | ~8GB | High quality, general purpose |
| mistral:7b | 4.1GB | ~8GB | Strong coding and reasoning |
| phi3:mini | 2.3GB | ~6GB | Microsoft’s efficient small model |
| gemma2:9b | 5.5GB | ~10GB | Google’s instruction-tuned model |

Installing Ollama and Pulling Models

Installation is a one-line command on macOS and Linux and a standard installer on Windows. The installer handles everything, including the background server process that your Python code will talk to. Once Ollama is running, you pull models the same way you’d pull a Docker image: they download once and live on disk.

# macOS or Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from https://ollama.com
# Or via winget:
winget install Ollama.Ollama

After installation, Ollama runs as a background service automatically. Pull your first model:

# Download Llama 3.2 3B (good starting point -- 2GB, fast, decent quality)
ollama pull llama3.2:3b

# Check what you have installed
ollama list

# Quick test from the command line
ollama run llama3.2:3b "What is the capital of Australia?"

The first pull takes a few minutes (downloading the model file). Subsequent runs use the cached model. Once pulled, the model is available for API use immediately — the Ollama service starts automatically on port 11434.
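Before writing any chatbot code, it can be handy to confirm the server is reachable from Python. A small stdlib-only check (the root endpoint returns a plain “Ollama is running” page when the service is up):

```python
import urllib.request
import urllib.error

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if ollama_is_running():
    print("Ollama is up -- ready to chat")
else:
    print("Start Ollama first (it listens on port 11434)")
```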


The Ollama Python Library

The official ollama Python package is a thin wrapper around Ollama’s HTTP API. It gives you a clean ollama.chat() interface that mirrors the OpenAI SDK’s structure — making it easy to swap providers if you need to. Install it with a single pip command:

pip install ollama

The simplest possible chatbot — just text in, text out:

# chatbot.py
import ollama

# Single-turn completion
response = ollama.chat(
    model="llama3.2:3b",
    messages=[
        {"role": "user", "content": "Explain Python decorators in one paragraph."}
    ]
)

print(response["message"]["content"])

The message format uses the same roles as OpenAI: system (sets context/personality), user (human messages), and assistant (model responses). This intentional compatibility means code written for one API needs minimal changes for the other.
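For illustration, here is what a short multi-turn conversation looks like in that format — a plain list of dicts, oldest message first, no special classes required:

```python
messages = [
    {"role": "system", "content": "You are a terse assistant."},    # context/personality
    {"role": "user", "content": "Name one Python web framework."},  # human turn
    {"role": "assistant", "content": "FastAPI."},                   # model's earlier reply
    {"role": "user", "content": "Is it async-friendly?"},           # follow-up that relies on context
]

# Every entry uses one of exactly three roles:
assert {m["role"] for m in messages} <= {"system", "user", "assistant"}
```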

Building Conversational Memory

A chatbot that can’t remember the previous message is just a slightly fancier search engine. Conversation memory in Ollama (and LLMs generally) is simple: keep the entire conversation history as a list of messages and send it all with each new request. The model reads the full history to maintain context.

# chatbot.py
import ollama

class Chatbot:
    def __init__(self, model: str = "llama3.2:3b", system_prompt: str | None = None):
        self.model = model
        self.conversation_history = []

        if system_prompt:
            self.conversation_history.append({
                "role": "system",
                "content": system_prompt
            })

    def chat(self, user_message: str) -> str:
        """Send a message and get a response."""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })

        # Send full conversation history to model
        response = ollama.chat(
            model=self.model,
            messages=self.conversation_history
        )

        assistant_message = response["message"]["content"]

        # Add assistant response to history
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })

        return assistant_message

    def reset(self):
        """Clear conversation history (keep system prompt)."""
        system_messages = [m for m in self.conversation_history if m["role"] == "system"]
        self.conversation_history = system_messages

# Create a chatbot with a custom personality
bot = Chatbot(
    model="llama3.2:3b",
    system_prompt="""You are a friendly Python tutor who teaches with clear examples.
    When you show code, always explain what each part does.
    Keep answers concise — three paragraphs maximum unless the user asks for more detail."""
)

# Multi-turn conversation
questions = [
    "What's the difference between a list and a tuple?",
    "Can you show me an example that uses both?",
    "When would I actually use a tuple in real code?"
]

for question in questions:
    print(f"\nYou: {question}")
    response = bot.chat(question)
    print(f"Bot: {response}")

The key insight: self.conversation_history grows with each turn. The model sees the entire conversation on every request, which is why it can reference “the example you showed earlier” — it literally reads the earlier messages. Models like Llama 3.1 support context windows up to 128K tokens, but note that Ollama loads models with a much smaller default context window (a few thousand tokens; raise it with the num_ctx option), so very long conversations can be silently truncated unless you increase it.
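If a conversation does run long, a common pattern is to keep the system prompt and only the most recent messages. A minimal sketch (the max_turns cutoff is arbitrary):

```python
def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the system prompt plus the last max_turns user/assistant messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

# Build a long fake conversation to demonstrate:
history = [{"role": "system", "content": "Be brief."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=6)
print(len(trimmed))           # 7: system prompt + last 6 messages
print(trimmed[1]["content"])  # question 17
```

The tradeoff: the model forgets anything that falls outside the window, so trim only when you have to.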

Streaming Responses

Nothing makes a chatbot feel slower than waiting for the full response before showing anything. Streaming sends tokens as they’re generated, so the user sees the response building in real time — exactly like ChatGPT.

# chatbot.py
import ollama

class StreamingChatbot:
    def __init__(self, model: str = "llama3.2:3b", system_prompt: str | None = None):
        self.model = model
        self.history = []
        if system_prompt:
            self.history.append({"role": "system", "content": system_prompt})

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})

        print("Bot: ", end="", flush=True)
        full_response = ""

        stream = ollama.chat(model=self.model, messages=self.history, stream=True)
        for chunk in stream:
            token = chunk["message"]["content"]
            print(token, end="", flush=True)
            full_response += token

        print()  # Newline after response
        self.history.append({"role": "assistant", "content": full_response})
        return full_response


# Interactive streaming chat session
def run_interactive_chat():
    bot = StreamingChatbot(
        model="llama3.2:3b",
        system_prompt="You are a helpful assistant. Be concise and direct."
    )

    print("Chat started. Type 'quit' to exit, 'reset' to start over.\n")

    while True:
        try:
            user_input = input("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nGoodbye!")
            break

        if not user_input:
            continue
        if user_input.lower() == "quit":
            break
        if user_input.lower() == "reset":
            bot.history = [m for m in bot.history if m["role"] == "system"]
            print("Conversation reset.\n")
            continue

        bot.chat(user_input)
        print()

if __name__ == "__main__":
    run_interactive_chat()
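The accumulate-and-print loop above couples streaming to the terminal. Pulling token extraction into a small generator makes the same logic reusable from a web handler — and testable without a running model, since it accepts anything that yields chunk-shaped dicts:

```python
from typing import Iterable, Iterator

def stream_tokens(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield the text of each streamed chat chunk
    (the same shape ollama.chat(..., stream=True) produces)."""
    for chunk in chunks:
        yield chunk["message"]["content"]

# Works with a real stream -- or a fake one in tests:
fake_stream = [{"message": {"content": t}} for t in ("Local ", "models ", "stream.")]
print("".join(stream_tokens(fake_stream)))  # Local models stream.
```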

Building a Web Interface with FastAPI

A terminal chatbot is great for development. A web interface is what you actually deploy. Here’s a FastAPI backend with session management:

# project.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import HTMLResponse
from pydantic import BaseModel
from typing import Optional
import ollama
import uuid

app = FastAPI(title="Ollama Chatbot API")

# In-memory session storage (use Redis in production)
sessions: dict[str, list] = {}

class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    model: str = "llama3.2:3b"

@app.post("/chat")
def chat(request: ChatRequest):
    """Send a message and get a complete response."""
    session_id = request.session_id or str(uuid.uuid4())
    if session_id not in sessions:
        sessions[session_id] = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

    history = sessions[session_id]
    history.append({"role": "user", "content": request.message})

    try:
        response = ollama.chat(model=request.model, messages=history)
        assistant_message = response["message"]["content"]
        history.append({"role": "assistant", "content": assistant_message})

        return {
            "session_id": session_id,
            "response": assistant_message,
            "message_count": len([m for m in history if m["role"] != "system"])
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Ollama error: {str(e)}")

@app.get("/sessions/{session_id}")
def get_session(session_id: str):
    """Get conversation history for a session."""
    if session_id not in sessions:
        raise HTTPException(status_code=404, detail="Session not found")
    history = [m for m in sessions[session_id] if m["role"] != "system"]
    return {"session_id": session_id, "messages": history}

@app.delete("/sessions/{session_id}")
def delete_session(session_id: str):
    """Delete a chat session."""
    sessions.pop(session_id, None)
    return {"status": "deleted"}

@app.get("/models")
def list_models():
    """List available Ollama models."""
    models = ollama.list()
    return {"models": [m["model"] for m in models.get("models", [])]}

@app.get("/")
def serve_ui():
    """Serve a minimal chat UI."""
    html = """<!DOCTYPE html>
<html>
<head><title>Local AI Chatbot</title>
<style>
  body { font-family: Arial, sans-serif; max-width: 800px; margin: 50px auto; padding: 20px; }
  #chat { border: 1px solid #ddd; height: 400px; overflow-y: auto; padding: 15px; margin-bottom: 10px; }
  .user { text-align: right; margin: 10px 0; }
  .bot { text-align: left; margin: 10px 0; }
  .user span { background: #007bff; color: white; padding: 8px 12px; border-radius: 12px; display: inline-block; }
  .bot span { background: #f0f0f0; padding: 8px 12px; border-radius: 12px; display: inline-block; }
  #input-area { display: flex; gap: 10px; }
  #message { flex: 1; padding: 10px; border: 1px solid #ddd; border-radius: 6px; }
  button { padding: 10px 20px; background: #007bff; color: white; border: none; border-radius: 6px; cursor: pointer; }
</style>
</head>
<body>
<h2>Local AI Chatbot (Powered by Ollama)</h2>
<div id="chat"></div>
<div id="input-area">
  <input id="message" type="text" placeholder="Type a message..." onkeypress="if(event.key==='Enter')sendMessage()">
  <button onclick="sendMessage()">Send</button>
</div>
<script>
let sessionId = null;
function addBubble(role, text) {
  const chat = document.getElementById('chat');
  const div = document.createElement('div');
  const span = document.createElement('span');
  div.className = role; span.textContent = text;
  div.appendChild(span); chat.appendChild(div);
  chat.scrollTop = chat.scrollHeight;
}
async function sendMessage() {
  const input = document.getElementById('message');
  const text = input.value.trim();
  if (!text) return;
  input.value = ''; addBubble('user', text);
  const res = await fetch('/chat', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({message: text, session_id: sessionId})
  });
  const data = await res.json();
  sessionId = data.session_id;
  addBubble('bot', data.response);
}
</script>
</body>
</html>"""
    return HTMLResponse(html)

Run it with uvicorn project:app --reload and visit http://localhost:8000 for the web interface. The session management keeps conversations separate — each user can have an independent conversation identified by their session ID.

The OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API, which means any code using the openai Python library works with Ollama by changing the base URL:

# chatbot.py
from openai import OpenAI

# Point OpenAI client at local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the client but ignored by Ollama
)

# This looks exactly like OpenAI code
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the Zen of Python?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)

This compatibility is enormously useful. A codebase built for OpenAI can switch to local Ollama by changing two lines — the base URL and the model name. Teams can develop and test against a local model (free, fast, private) and deploy against OpenAI’s API (better quality, scalable) with zero code changes.
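One way to exploit that is a small config switch — the env var name and hosted model here are illustrative, not part of either API — so the same client code targets either backend:

```python
import os

def llm_config() -> dict:
    """Pick endpoint settings from the (hypothetical) USE_LOCAL_LLM env var."""
    if os.environ.get("USE_LOCAL_LLM", "1") == "1":
        return {
            "base_url": "http://localhost:11434/v1",  # local Ollama
            "api_key": "ollama",                      # ignored by Ollama
            "model": "llama3.2:3b",
        }
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o-mini",                       # illustrative hosted model
    }

cfg = llm_config()
# client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
print(cfg["base_url"])
```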

Real-Life Example: A Python Coding Assistant

Here’s a complete coding assistant specialized in Python with streaming and code review:

# real_life_project.py
import ollama

SYSTEM_PROMPT = """You are an expert Python programming assistant with 15 years of experience.

Your behavior:
- Provide working, tested code examples for every concept explained
- Always explain the "why" behind best practices, not just the "what"
- Point out potential pitfalls and edge cases proactively
- Use type hints in all code examples
- Keep explanations to 2-3 paragraphs unless the topic requires more"""

class PythonAssistant:
    def __init__(self):
        self.model = "llama3.2:3b"
        self.history = [{"role": "system", "content": SYSTEM_PROMPT}]
        self.turn_count = 0

    def ask(self, question: str) -> str:
        self.history.append({"role": "user", "content": question})
        self.turn_count += 1

        print(f"\n[Turn {self.turn_count}] Assistant: ", end="", flush=True)

        full_response = ""
        stream = ollama.chat(model=self.model, messages=self.history, stream=True)
        for chunk in stream:
            token = chunk["message"]["content"]
            print(token, end="", flush=True)
            full_response += token
        print()

        self.history.append({"role": "assistant", "content": full_response})
        return full_response

    def review_code(self, code: str) -> str:
        prompt = f"""Please review this Python code. Cover:
1. Correctness: any bugs or logical errors
2. Style: PEP 8 compliance and Pythonic patterns
3. Performance: any obvious inefficiencies
4. Safety: any potential exceptions or edge cases

Code to review:
```python
{code}
```"""
        return self.ask(prompt)

# Use the assistant
assistant = PythonAssistant()

print("Python Assistant ready. Type 'quit' to exit.\n")

while True:
    try:
        command = input("You: ").strip()
    except (KeyboardInterrupt, EOFError):
        break

    if not command or command.lower() == "quit":
        break

    assistant.ask(command)
    print()

Frequently Asked Questions

Does Ollama use my GPU?
Yes, automatically. If you have an NVIDIA GPU with CUDA, Ollama detects it and offloads layers to the GPU. Apple Silicon Macs use Metal for GPU acceleration. CPU-only inference works but is 5-10x slower. Check GPU usage with ollama ps while a model is running.

How is Ollama different from running Hugging Face models directly?
Ollama abstracts model management, quantization, and serving into a simple tool. Running a Hugging Face model directly requires more setup (transformers library, manual quantization, serving code). Ollama’s tradeoff: less flexibility, much less friction. For production custom fine-tuning, Hugging Face is more appropriate.

Can I use Ollama in production?
For personal tools and small teams, yes. For high-traffic production systems, you’d typically use a managed API (OpenAI, Anthropic) or self-hosted serving infrastructure (vLLM) that’s designed for horizontal scaling. Ollama is designed for local development and single-machine serving.

How do I make the chatbot remember things across sessions?
The conversation history in this tutorial lives in memory and is lost when the process restarts. For persistence, save the history to a database (SQLite, PostgreSQL) keyed by session ID. Load the history at session start and save it after each turn.
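A minimal sketch of that pattern with SQLite from the standard library (the table name and schema are illustrative):

```python
import json
import sqlite3

def init_db(path: str = "chat.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, history TEXT)"
    )
    return db

def save_history(db: sqlite3.Connection, session_id: str, history: list) -> None:
    """Upsert the full message list for a session as JSON."""
    db.execute(
        "INSERT INTO sessions (id, history) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET history = excluded.history",
        (session_id, json.dumps(history)),
    )
    db.commit()

def load_history(db: sqlite3.Connection, session_id: str) -> list:
    row = db.execute(
        "SELECT history FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []

# In-memory demo; pass a file path for real persistence
db = init_db(":memory:")
save_history(db, "abc", [{"role": "user", "content": "hi"}])
print(load_history(db, "abc"))  # [{'role': 'user', 'content': 'hi'}]
```

Call load_history at session start and save_history after each turn, keyed by the same session ID the FastAPI example already generates.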

Can I run Ollama on a server and access it remotely?
Yes. By default Ollama only listens on localhost. Set OLLAMA_HOST=0.0.0.0:11434 to expose it on all interfaces, then access it from other machines. Add proper authentication (nginx with Basic Auth, or a VPN) before exposing to the internet.

Which model should I start with?
For most use cases: llama3.2:3b. It’s 2GB, responds quickly, and handles general conversation, Q&A, and simple coding well. If you have a machine with 8+ GB RAM and want better quality, try llama3.1:8b or mistral:7b.

Summary

You’ve built a complete local chatbot system: conversation memory, streaming responses, a FastAPI web backend, and a domain-specific Python coding assistant. All running on your machine, all free, all private. The OpenAI-compatible API means the same code works against hosted models when you need better quality or scale.

Local LLMs with Ollama are the right starting point for experimentation, internal tools, and privacy-sensitive applications. When you need more context (for a RAG system to query your documents), see How To Build a RAG System with LangChain. When you want to fine-tune a model on your specific domain, check out How To Fine-Tune a Hugging Face Model.