You have built a Python script that calls a third-party API, fetches data from a remote server, or writes to a cloud database. Everything works perfectly in testing — until it doesn’t. The network blips for half a second, the API returns a 503, or the database connection times out. Without retry logic, your script crashes and you lose the entire run. Writing retry loops by hand is tedious, error-prone, and looks different every time.
Python’s stamina library solves this cleanly. It wraps functions with production-grade retry logic using a single decorator, handling exponential backoff, jitter, and configurable limits out of the box. You get resilience without boilerplate. Install it with pip install stamina and you are ready to add retries to any function.
In this article we cover how to install stamina and apply the @stamina.retry decorator, how to configure attempt counts and wait times, how to retry only on specific exception types, how to monitor retry events with structured logging, and how to build a resilient API poller as a real-world project. By the end you will have a reusable pattern for making any Python function retry-safe.
Retrying a Flaky Function: Quick Example
Here is the fastest way to add retry logic to a function that might fail due to network issues. The @stamina.retry decorator handles everything automatically.
```python
# quick_stamina.py
import stamina
import httpx


@stamina.retry(on=httpx.HTTPError, attempts=3)
def fetch_user(user_id: int) -> dict:
    response = httpx.get(f"https://jsonplaceholder.typicode.com/users/{user_id}")
    response.raise_for_status()
    return response.json()


user = fetch_user(1)
print(user["name"])
print(user["email"])
```

Output:

```text
Leanne Graham
Sinclair@april.biz
```
The @stamina.retry decorator wraps fetch_user so that if it raises an httpx.HTTPError, stamina automatically waits and calls it again, up to 3 attempts in total, before re-raising the exception. The first call that succeeds returns normally — the caller never sees the retries. If all 3 attempts fail, the original exception propagates as if the decorator were not there.
The sections below go deeper: configuring wait times, scoping retries to specific errors, and integrating stamina’s built-in instrumentation into your observability stack.
What Is stamina and Why Use It?
stamina is a Python library built on top of tenacity that provides a simple, opinionated retry decorator for production code. It was designed by Hynek Schlawack (author of attrs and structlog) to be safe by default and integrated with modern observability tools like structlog and Prometheus.
The core idea is that many failure modes in distributed systems are transient — a brief network partition, a rate limit that clears in a second, a momentary database overload. A function that retries with exponential backoff and jitter handles these automatically, making your code resilient without human intervention.
| Approach | Code Volume | Backoff | Jitter | Observability |
|---|---|---|---|---|
| Manual retry loop | 15-30 lines | Manual | Manual | None |
| tenacity | 5-10 lines | Built-in | Built-in | None |
| stamina | 1-2 lines | Built-in | Built-in | structlog + Prometheus |
stamina is the right tool when you want retry logic that is safe by default, readable at a glance, and observable in production — not just in development.
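To see what the table's first row refers to, here is roughly the hand-rolled retry loop that stamina replaces — a stdlib-only sketch with illustrative names, implementing exponential backoff, a cap, and jitter manually:

```python
import random
import time


def fetch_with_manual_retries(fetch, retryable=(ConnectionError,),
                              attempts=3, wait_initial=0.1,
                              wait_max=10.0, wait_jitter=1.0):
    """Hand-rolled retry loop: exponential backoff, cap, and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retryable:
            if attempt == attempts:
                raise  # Out of attempts: let the last error propagate.
            # Double the wait each time, cap it, then add random jitter.
            backoff = min(wait_max, wait_initial * 2 ** (attempt - 1))
            time.sleep(backoff + random.uniform(0, wait_jitter))


# Demo: a function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient blip")
    return "ok"

print(fetch_with_manual_retries(flaky, attempts=5, wait_initial=0.01,
                                wait_jitter=0.01))  # prints "ok"
```

Every project that writes this by hand writes it slightly differently — which is exactly the problem a single, well-tested decorator removes.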
Installing stamina
stamina is available on PyPI. Install it with pip:
```shell
# Install stamina and httpx for the examples
pip install stamina httpx
```
stamina requires Python 3.10 or later and has no mandatory dependencies beyond tenacity. The structlog and prometheus-client integrations are optional — stamina detects them automatically if they are installed.
Basic Retry: Attempts and Wait
The simplest configuration is specifying how many times to try and which exceptions to retry on. stamina uses exponential backoff by default, starting at 0.1 seconds and doubling on each failure, with added jitter to prevent thundering herd problems when many callers retry simultaneously.
```python
# basic_retry.py
import stamina
import httpx


@stamina.retry(
    on=httpx.HTTPStatusError,
    attempts=5,
    wait_initial=0.1,  # Start with a 100 ms wait
    wait_max=10.0,     # Cap at 10 seconds
    wait_jitter=1.0,   # Add up to 1 s of random jitter
)
def fetch_post(post_id: int) -> dict:
    response = httpx.get(
        f"https://jsonplaceholder.typicode.com/posts/{post_id}",
        timeout=5.0,
    )
    response.raise_for_status()
    return response.json()


post = fetch_post(1)
print(post["title"])
```

Output:

```text
sunt aut facere repellat provident occaecati excepturi optio reprehenderit
```
The wait_initial, wait_max, and wait_jitter parameters give you precise control over the retry schedule. The exponential formula means the wait sequence is roughly: 0.1s, 0.2s, 0.4s, 0.8s — each doubled from the last, plus a random jitter up to wait_jitter seconds. Setting wait_max prevents indefinite growth in long retry chains.
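The pre-jitter schedule can be computed directly. This stdlib sketch is a simplified model of the capped exponential policy described above (the real, jittered waits will vary around these values):

```python
def backoff_schedule(attempts, wait_initial=0.1, wait_max=10.0, base=2):
    """Waits between attempts: exponential growth capped at wait_max (jitter excluded)."""
    return [min(wait_max, wait_initial * base ** n) for n in range(attempts - 1)]


print(backoff_schedule(5))                # [0.1, 0.2, 0.4, 0.8]
print(backoff_schedule(8, wait_max=1.0))  # [0.1, 0.2, 0.4, 0.8, 1.0, 1.0, 1.0]
```

The second call shows why wait_max matters: without the cap, attempt 8 would already be waiting 6.4 seconds, and long retry chains grow without bound.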
Timeout-Based Retry
Instead of counting attempts, you can tell stamina to keep retrying for a fixed duration using the timeout parameter. This is useful for background jobs where you want to retry for up to N seconds regardless of how many individual attempts it takes.
```python
# timeout_retry.py
import stamina
import httpx


@stamina.retry(
    on=httpx.HTTPError,
    timeout=30.0,  # Keep retrying for up to 30 seconds total
)
def fetch_comments(post_id: int) -> list:
    response = httpx.get(
        f"https://jsonplaceholder.typicode.com/comments?postId={post_id}"
    )
    response.raise_for_status()
    return response.json()


comments = fetch_comments(1)
print(f"Fetched {len(comments)} comments")
```

Output:

```text
Fetched 5 comments
```
You can combine attempts and timeout — stamina stops retrying as soon as either limit is reached. This is the safest production configuration: cap both the number of attempts and the total wall-clock time.
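The dual-limit idea can be sketched in plain Python — stop on whichever limit is hit first (a simplified, stdlib-only model with illustrative names; stamina implements this internally):

```python
import time


def run_with_limits(func, retryable=(OSError,), attempts=4,
                    timeout=20.0, wait=0.05):
    """Retry func until success, the attempt cap, or the time budget,
    whichever comes first."""
    deadline = time.monotonic() + timeout
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retryable:
            # Give up if this was the last attempt, or if waiting again
            # would exceed the wall-clock budget.
            if attempt == attempts or time.monotonic() + wait > deadline:
                raise
            time.sleep(wait)
```

Either exit path re-raises the last exception, so the caller sees the same error it would have seen without retries — just later.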
Retrying on Multiple Exception Types
Real applications fail in multiple ways: connection refused, timeout, HTTP 5xx, rate limit 429. Pass a tuple of exception types to on to retry on any of them.
```python
# multi_exception_retry.py
import stamina
import httpx

RETRYABLE = (
    httpx.ConnectError,
    httpx.TimeoutException,
    httpx.HTTPStatusError,
)


@stamina.retry(on=RETRYABLE, attempts=4, timeout=20.0)
def resilient_get(url: str) -> dict:
    response = httpx.get(url, timeout=5.0)
    # Raise for 4xx/5xx so stamina can catch HTTPStatusError
    response.raise_for_status()
    return response.json()


data = resilient_get("https://jsonplaceholder.typicode.com/todos/1")
print(data["title"], "-- done:", data["completed"])
```

Output:

```text
delectus aut autem -- done: False
```
Notice that we define the retryable exceptions as a module-level tuple constant at the top of the file. This is good practice — it documents exactly which errors are considered transient and makes it easy to add new ones. Non-retryable errors (such as ValueError or KeyError) are not listed and therefore propagate immediately, without retries.
Monitoring Retries with Logging
stamina emits structured log events on every retry attempt when structlog is installed. These events include the function name, attempt number, wait time, and exception details — everything you need to spot patterns in a production log aggregator.
```python
# monitored_retry.py
import logging

import stamina
import httpx

# Set up basic logging so we can see retry events
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s -- %(message)s",
)


@stamina.retry(on=httpx.HTTPError, attempts=3, wait_initial=0.5)
def fetch_album(album_id: int) -> dict:
    response = httpx.get(
        f"https://jsonplaceholder.typicode.com/albums/{album_id}"
    )
    response.raise_for_status()
    return response.json()


album = fetch_album(1)
print(album["title"])
```

Output (no errors):

```text
quidem molestiae enim
```
If the first call fails, stamina logs a stamina.retry event at WARNING level with the exception class, message, and wait time before the next attempt. On the final attempt, if it also fails, stamina re-raises the exception normally. This two-level signaling — warnings during retries, exception on exhaustion — integrates cleanly with any log monitoring system.
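If you want to verify in your own environment that retries are reaching your logs, a plain stdlib handler can count WARNING-level records from any logger, regardless of which library emits them. A small, library-agnostic sketch (the logger name here is illustrative):

```python
import logging


class RetryCounter(logging.Handler):
    """Counts WARNING-or-above records, keyed by logger name."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.counts = {}

    def emit(self, record):
        self.counts[record.name] = self.counts.get(record.name, 0) + 1


counter = RetryCounter()
logging.getLogger().addHandler(counter)  # attach to the root logger

# Simulate a retry warning from some library's logger:
logging.getLogger("myapp.retries").warning("retry scheduled")

print(counter.counts)  # {'myapp.retries': 1}
```

The same handler works in a test suite: attach it in setUp, exercise the code, and assert on counter.counts afterwards.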
Disabling Retries in Tests
In unit tests you usually want functions to fail fast — not wait for exponential backoff. stamina provides stamina.set_testing() to globally disable all retries in test environments.
```python
# test_with_stamina.py
import unittest

import stamina
import httpx


@stamina.retry(on=httpx.HTTPError, attempts=5, timeout=30.0)
def fetch_todo(todo_id: int) -> dict:
    response = httpx.get(
        f"https://jsonplaceholder.typicode.com/todos/{todo_id}"
    )
    response.raise_for_status()
    return response.json()


class TestFetchTodo(unittest.TestCase):
    def setUp(self):
        # Disable retries so tests run fast
        stamina.set_testing(True)

    def tearDown(self):
        stamina.set_testing(False)

    def test_fetch_success(self):
        todo = fetch_todo(1)
        self.assertIn("title", todo)
        self.assertIn("completed", todo)


if __name__ == "__main__":
    unittest.main()
```

Output:

```text
.
----------------------------------------------------------------------
Ran 1 test in 0.312s

OK
```
With stamina.set_testing(True) active, any decorated function that fails raises the exception immediately on the first attempt, bypassing all retry logic and wait times. This is exactly what you want in tests — deterministic, fast failures that don’t slow down your test suite.
Real-Life Example: Resilient API Poller
Here is a complete poller that fetches paginated data from a REST API with full retry logic. It retries on network errors, logs every failure, and stops cleanly if it exhausts all retries.
```python
# resilient_poller.py
import time
import logging

import stamina
import httpx

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s -- %(message)s",
)

RETRYABLE = (httpx.ConnectError, httpx.TimeoutException, httpx.HTTPStatusError)


@stamina.retry(on=RETRYABLE, attempts=5, wait_initial=0.5, wait_max=8.0, timeout=60.0)
def fetch_page(url: str, page: int, page_size: int = 10) -> list:
    params = {"_page": page, "_limit": page_size}
    response = httpx.get(url, params=params, timeout=10.0)
    response.raise_for_status()
    items = response.json()
    return items


def poll_all_posts(base_url: str, page_size: int = 10) -> list:
    all_posts = []
    page = 1
    while True:
        logging.info(f"Fetching page {page}...")
        items = fetch_page(base_url, page=page, page_size=page_size)
        if not items:
            logging.info("No more pages. Done.")
            break
        all_posts.extend(items)
        logging.info(f"  Got {len(items)} posts (total: {len(all_posts)})")
        page += 1
        if page > 3:  # Limit to first 3 pages for demo
            break
        time.sleep(0.2)  # Polite delay between pages
    return all_posts


if __name__ == "__main__":
    posts = poll_all_posts("https://jsonplaceholder.typicode.com/posts")
    print(f"\nFetched {len(posts)} posts total")
    print("First post title:", posts[0]["title"])
    print("Last post title:", posts[-1]["title"])
```

Output:

```text
2026-05-13 10:00:01 INFO -- Fetching page 1...
2026-05-13 10:00:01 INFO --   Got 10 posts (total: 10)
2026-05-13 10:00:01 INFO -- Fetching page 2...
2026-05-13 10:00:02 INFO --   Got 10 posts (total: 20)
2026-05-13 10:00:02 INFO -- Fetching page 3...
2026-05-13 10:00:02 INFO --   Got 10 posts (total: 30)

Fetched 30 posts total
First post title: sunt aut facere repellat provident occaecati...
Last post title: at nam consequatur ea labore ea harum
```
The poller retries any individual page fetch up to 5 times with exponential backoff, but the outer pagination loop continues normally after each success. This architecture means a brief network hiccup on page 7 does not abort the entire 100-page crawl — it just adds a few extra seconds of retry delay before resuming. You can extend this to write results to a database, send to a queue, or write to a file at each page boundary for incremental progress tracking.
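One way to add that incremental progress tracking is to checkpoint each page to disk as it arrives, so a crash loses at most the page in flight. A stdlib-only sketch (the helper names and file layout are illustrative, not part of the poller above):

```python
import json
from pathlib import Path


def save_page(page_num, items, out_dir="pages"):
    """Checkpoint one page of results as its own JSON file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Zero-padded names keep pages in order when sorted lexically.
    path = out / f"page_{page_num:04d}.json"
    path.write_text(json.dumps(items))
    return path


def load_pages(out_dir="pages"):
    """Reassemble all checkpointed pages, in page order."""
    items = []
    for path in sorted(Path(out_dir).glob("page_*.json")):
        items.extend(json.loads(path.read_text()))
    return items


# Demo in a throwaway directory:
import tempfile

with tempfile.TemporaryDirectory() as d:
    save_page(1, [{"id": 1}, {"id": 2}], out_dir=d)
    save_page(2, [{"id": 3}], out_dir=d)
    print(load_pages(d))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Calling save_page(page, items) inside the pagination loop, right after a successful fetch, turns the poller into a resumable crawl.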
Frequently Asked Questions
What is the difference between stamina and tenacity?
stamina is built on top of tenacity and exposes a simpler, more opinionated API. tenacity gives you full control over every retry strategy through a rich configuration DSL. stamina trades some of that flexibility for safer defaults — specifically, it adds jitter by default (tenacity does not), integrates with structlog and Prometheus automatically, and provides set_testing() for test isolation. If you are starting a new project, stamina is the easier choice. For complex custom retry strategies, tenacity may be more appropriate.
Does stamina work with async functions?
Yes. The @stamina.retry decorator works transparently with both sync and async functions. Simply apply the same decorator to an async def function and it will use asyncio.sleep internally instead of time.sleep, ensuring your async event loop is never blocked during retry waits.
How do I make some errors non-retryable?
Only list exceptions in the on parameter that represent transient failures. Any exception not listed propagates immediately. If you want to retry on a broad exception like Exception but exclude certain subtypes, you can raise a custom non-retryable wrapper exception inside the decorated function before calling response.raise_for_status() — for example, converting a 404 to a NotFoundError that is not in your retryable list.
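That conversion pattern can be sketched without any HTTP library — the status handling and exception names below are illustrative stand-ins, not stamina or httpx API:

```python
class NotFoundError(Exception):
    """Permanent failure: deliberately NOT in the retryable tuple."""


RETRYABLE = (ConnectionError, TimeoutError)  # what you would pass to on=


def classify(status_code, body):
    """Turn a 404 into a non-retryable error before any generic raise."""
    if status_code == 404:
        raise NotFoundError(body)  # propagates immediately, no retries
    if status_code >= 500:
        # Server-side errors are treated as transient and retried.
        raise ConnectionError(f"server error {status_code}")
    return body
```

Because NotFoundError is not a subclass of anything in RETRYABLE, a decorated function that raises it fails on the first attempt, while 5xx-style errors go through the full retry schedule.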
How does the Prometheus integration work?
If prometheus-client is installed, stamina automatically increments a counter named stamina_retries_total on each retry, labelled by callable name and exception type. This gives you a dashboard metric showing which functions are retrying most and what errors they are seeing, without any extra instrumentation code on your part.
Should I always call set_testing in tests?
Yes, if your test suite runs decorated functions. Without stamina.set_testing(True), a test that intentionally triggers an error will wait for the full retry sequence — potentially several seconds of time.sleep calls — before failing. This is especially painful in CI environments with many tests. Use setUp/tearDown or a pytest fixture to enable and disable testing mode around each test that exercises retry-decorated code.
Conclusion
stamina brings production-grade retry logic to Python with a single decorator and zero boilerplate. We covered the core @stamina.retry decorator, configuring attempt counts and wait strategies, scoping retries to specific exception types, combining attempts and timeout for belt-and-suspenders limits, and using set_testing() to keep tests fast. The real-life poller example shows how retry logic at the page-fetch level makes an entire multi-page crawl resilient to transient failures.
Try extending the poller to write each page to a JSON file before moving to the next — that way a crash mid-crawl loses at most one page, not everything. Or add the Prometheus integration and build a Grafana panel showing which functions retry most in your production environment.
The official documentation is at stamina.hynek.me. The companion tenacity docs at tenacity.readthedocs.io are also worth reading if you need to understand what is happening under the hood.