Intermediate

NumPy handles arrays. Pandas handles tables. But when you need to solve a system of equations, find the minimum of a complex function, integrate a curve, or run a statistical test, you need SciPy. It is the Swiss Army knife of scientific computing in Python, built on top of NumPy and packed with algorithms that scientists, engineers, and data professionals use every day.

SciPy is organised into subpackages, each covering a different domain: optimization, linear algebra, statistics, signal processing, interpolation, and more. You rarely import all of SciPy at once — instead, you pull in just the subpackage you need, keeping your code clean and your imports explicit.

In this tutorial, you will learn how to install SciPy, solve optimization problems, perform statistical tests, work with linear algebra, integrate functions numerically, interpolate data points, and process signals. By the end, you will have a practical toolkit for tackling real scientific and engineering problems in Python.

SciPy for Scientific Computing: Quick Example

Let us start with a common task: finding the minimum of a mathematical function. This comes up everywhere from machine learning (gradient descent) to engineering (minimising material cost).

# quick_scipy.py
from scipy.optimize import minimize
import numpy as np

# Define a function: the Rosenbrock function (a classic test problem)
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Find the minimum starting from an initial guess
result = minimize(rosenbrock, x0=[0, 0], method='Nelder-Mead')

print(f"Minimum found at: x={result.x[0]:.4f}, y={result.x[1]:.4f}")
print(f"Minimum value: {result.fun:.6f}")
print(f"Converged: {result.success}")
print(f"Iterations: {result.nit}")

Output:

Minimum found at: x=1.0000, y=1.0000
Minimum value: 0.000000
Converged: True
Iterations: 141

What this does step by step: The Rosenbrock function is a famous test problem with a global minimum at (1, 1). We pass it to scipy.optimize.minimize with an initial guess of (0, 0) and the Nelder-Mead algorithm (a gradient-free method). SciPy iterates 141 times and converges on the minimum to within the default tolerance. The same minimize call works with any callable Python function, making it invaluable for fitting models, calibrating parameters, and solving engineering problems.
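Nelder-Mead never evaluates derivatives. When you can supply an analytic gradient, a quasi-Newton method such as BFGS typically converges in far fewer iterations. A sketch using the same Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    # Analytic partial derivatives of the Rosenbrock function
    dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

# jac= supplies the gradient, so BFGS can take Newton-like steps
result = minimize(rosenbrock, x0=[0, 0], method='BFGS', jac=rosenbrock_grad)
print(f"Minimum at: {result.x.round(4)}, iterations: {result.nit}")
```

The trade-off: you must derive the gradient by hand (or use automatic differentiation), but each iteration carries far more information about the function's shape.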

Installing SciPy

SciPy depends on NumPy and ships with precompiled binaries for all major platforms, so installation is straightforward.

# Install SciPy
pip install scipy

# Verify installation and check version
python -c "import scipy; print(scipy.__version__)"

# SciPy subpackages are imported individually:
# from scipy import optimize, stats, linalg, integrate, interpolate, signal

Unlike some scientific libraries, SciPy installs cleanly on Windows, macOS, and Linux without needing a C compiler. The pip package includes optimised BLAS and LAPACK routines, so you get near-Fortran performance out of the box. If you are using Anaconda, SciPy comes preinstalled.

Optimization: Finding Minima and Solving Equations

The scipy.optimize module is one of the most-used parts of SciPy. It handles curve fitting, root finding, and general optimization.

Curve Fitting with curve_fit

When you have experimental data and want to find the best parameters for a model, curve_fit is your go-to tool.

import numpy as np
from scipy.optimize import curve_fit

# Generate noisy exponential decay data
np.random.seed(42)
x_data = np.linspace(0, 10, 50)
y_data = 3.5 * np.exp(-0.7 * x_data) + np.random.normal(0, 0.2, 50)

# Define the model function
def exponential_decay(x, amplitude, decay_rate):
    return amplitude * np.exp(-decay_rate * x)

# Fit the model to data
params, covariance = curve_fit(exponential_decay, x_data, y_data)
amplitude, decay_rate = params

print(f"Fitted amplitude: {amplitude:.3f} (true: 3.500)")
print(f"Fitted decay rate: {decay_rate:.3f} (true: 0.700)")

# Calculate uncertainties from covariance matrix
uncertainties = np.sqrt(np.diag(covariance))
print(f"Amplitude uncertainty: +/- {uncertainties[0]:.3f}")
print(f"Decay rate uncertainty: +/- {uncertainties[1]:.3f}")

curve_fit returns two things: the optimal parameters and a covariance matrix. The diagonal of the covariance matrix gives you the variance of each parameter, and taking the square root gives you the standard error. This tells you not just the best fit, but how confident you should be in each parameter.
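Those standard errors can be turned into approximate confidence intervals. As a sketch (reusing the data generated above), a t-based 95% interval uses n - p degrees of freedom, where n is the number of data points and p the number of fitted parameters:

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Same noisy exponential decay data as above
np.random.seed(42)
x_data = np.linspace(0, 10, 50)
y_data = 3.5 * np.exp(-0.7 * x_data) + np.random.normal(0, 0.2, 50)

def exponential_decay(x, amplitude, decay_rate):
    return amplitude * np.exp(-decay_rate * x)

params, covariance = curve_fit(exponential_decay, x_data, y_data)
std_errors = np.sqrt(np.diag(covariance))

# t-based 95% confidence intervals: dof = data points - parameters
dof = len(x_data) - len(params)
t_crit = stats.t.ppf(0.975, dof)
for name, p, se in zip(["amplitude", "decay_rate"], params, std_errors):
    print(f"{name}: {p:.3f}, 95% CI [{p - t_crit * se:.3f}, {p + t_crit * se:.3f}]")
```

These intervals assume the residuals are roughly normal and independent; for small samples or strongly correlated parameters, bootstrapping is a more robust alternative.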

Root Finding

Finding where a function equals zero is fundamental to solving equations. SciPy offers several root-finding algorithms.

from scipy.optimize import brentq, fsolve
import numpy as np

# Find where x^3 - 2x - 5 = 0 using Brent's method (bracketed)
def cubic(x):
    return x**3 - 2*x - 5

root = brentq(cubic, 1, 3)  # Root must be between 1 and 3
print(f"Root of x^3 - 2x - 5: {root:.6f}")
print(f"Verification: f({root:.4f}) = {cubic(root):.2e}")

# Solve a system of nonlinear equations with fsolve
def system(variables):
    x, y = variables
    eq1 = x**2 + y**2 - 4    # Circle of radius 2
    eq2 = x - y**2 + 1        # Parabola
    return [eq1, eq2]

solution = fsolve(system, x0=[1, 1])
print(f"\nSystem solution: x={solution[0]:.4f}, y={solution[1]:.4f}")
print(f"Check circle: {solution[0]**2 + solution[1]**2:.4f} (should be 4)")

brentq is generally the best choice among the bracketed root finders — you give it an interval where the function changes sign, and it is guaranteed to find a root inside it. fsolve handles systems of nonlinear equations, taking an initial guess and using Newton-type iterations to converge on a solution.

Statistical Analysis with scipy.stats

The scipy.stats module contains over 100 probability distributions plus a comprehensive collection of statistical tests.

from scipy import stats
import numpy as np

# Generate two samples
np.random.seed(42)
group_a = np.random.normal(loc=75, scale=10, size=100)  # Mean 75, std 10
group_b = np.random.normal(loc=78, scale=10, size=100)  # Mean 78, std 10

# Descriptive statistics
desc_a = stats.describe(group_a)
print(f"Group A: mean={desc_a.mean:.2f}, variance={desc_a.variance:.2f}")
print(f"  skewness={desc_a.skewness:.3f}, kurtosis={desc_a.kurtosis:.3f}")

# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"\nT-test: t={t_stat:.3f}, p-value={p_value:.4f}")
print(f"Significant at 0.05? {'Yes' if p_value < 0.05 else 'No'}")

# Mann-Whitney U test (non-parametric alternative)
u_stat, p_mann = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"\nMann-Whitney U: U={u_stat:.1f}, p-value={p_mann:.4f}")

# Shapiro-Wilk normality test
w_stat, p_normal = stats.shapiro(group_a)
print(f"\nShapiro-Wilk (Group A): W={w_stat:.4f}, p-value={p_normal:.4f}")
print(f"Normal distribution? {'Yes' if p_normal > 0.05 else 'No'}")

# Pearson correlation
correlation, p_corr = stats.pearsonr(group_a[:50], group_b[:50])
print(f"\nPearson correlation: r={correlation:.3f}, p-value={p_corr:.4f}")

The t-test tells you whether two groups have significantly different means. The Mann-Whitney U test does the same job without assuming normal distributions. The Shapiro-Wilk test checks for departures from normality: a small p-value is evidence against normality, while a large p-value only means the test found no evidence of a departure. Always check your assumptions before picking a statistical test.

Working with Probability Distributions

from scipy import stats
import numpy as np

# Normal distribution
normal = stats.norm(loc=100, scale=15)  # IQ distribution: mean=100, std=15
print(f"P(IQ > 130): {1 - normal.cdf(130):.4f}")
print(f"P(85 < IQ < 115): {normal.cdf(115) - normal.cdf(85):.4f}")
print(f"IQ at 95th percentile: {normal.ppf(0.95):.1f}")

# Generate random samples
samples = normal.rvs(size=1000, random_state=42)
print(f"\nSample mean: {np.mean(samples):.1f}, std: {np.std(samples):.1f}")

# Fit a distribution to data
params = stats.norm.fit(samples)
print(f"Fitted params: mean={params[0]:.1f}, std={params[1]:.1f}")

# Chi-squared goodness of fit
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
chi2, p_chi = stats.chisquare(observed, expected)
print(f"\nChi-squared test: chi2={chi2:.2f}, p-value={p_chi:.4f}")

Every distribution in SciPy shares the same interface: .pdf() for probability density (.pmf() for discrete distributions), .cdf() for cumulative probability, .ppf() for percentiles (inverse CDF), .rvs() for random samples, and .fit() for parameter estimation on continuous distributions. Once you learn one distribution, you know them all.
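As a quick sketch of that shared interface, the same methods work unchanged on other distributions; note that discrete distributions such as the binomial expose .pmf() instead of .pdf():

```python
from scipy import stats
import numpy as np

# Two very different distributions, one common interface
expon = stats.expon(scale=2.0)      # Exponential with mean 2
binom = stats.binom(n=10, p=0.3)    # Binomial: 10 trials, success prob 0.3

print(f"Exponential median (ppf at 0.5): {expon.ppf(0.5):.4f}")  # = 2*ln(2)
print(f"P(X <= 4) for exponential: {expon.cdf(4):.4f}")
print(f"P(X = 3) for binomial (pmf, discrete): {binom.pmf(3):.4f}")
print(f"Binomial mean, variance: {binom.mean():.1f}, {binom.var():.2f}")
```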

Linear Algebra with scipy.linalg

While NumPy has basic linear algebra, scipy.linalg adds decompositions, matrix functions, and specialised solvers that go well beyond the basics.

from scipy import linalg
import numpy as np

# Solve a system of linear equations: Ax = b
A = np.array([[3, 1, -1],
              [1, 4, 2],
              [2, 1, 3]])
b = np.array([4, 17, 13])

x = linalg.solve(A, b)
print(f"Solution: {x}")
print(f"Verification Ax: {A @ x}")

# LU decomposition
P, L, U = linalg.lu(A)
print(f"\nLU decomposition:")
print(f"L (lower triangular):\n{np.round(L, 3)}")
print(f"U (upper triangular):\n{np.round(U, 3)}")

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = linalg.eig(A)
print(f"\nEigenvalues: {eigenvalues.real.round(3)}")

# Matrix determinant and inverse
det = linalg.det(A)
inv = linalg.inv(A)
print(f"\nDeterminant: {det:.1f}")
print(f"A @ inv(A) (should be identity):\n{np.round(A @ inv, 1)}")

# Singular Value Decomposition
U_svd, s, Vt = linalg.svd(A)
print(f"\nSingular values: {s.round(3)}")

linalg.solve is faster and more numerically stable than computing the inverse and multiplying. LU decomposition breaks a matrix into lower and upper triangular parts, which is useful when you need to solve the same system with different right-hand sides. SVD is the workhorse behind dimensionality reduction, recommendation systems, and image compression.
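When you need that repeated-solve pattern, lu_factor and lu_solve let you pay the factorisation cost once. A minimal sketch with the same matrix as above:

```python
import numpy as np
from scipy import linalg

A = np.array([[3, 1, -1],
              [1, 4, 2],
              [2, 1, 3]], dtype=float)

# Factor once (the O(n^3) step)...
lu, piv = linalg.lu_factor(A)

# ...then each solve against a new right-hand side is only O(n^2)
b1 = np.array([4, 17, 13], dtype=float)
b2 = np.array([1, 0, 0], dtype=float)

x1 = linalg.lu_solve((lu, piv), b1)
x2 = linalg.lu_solve((lu, piv), b2)

print(f"x1 = {x1.round(4)}")
print(f"Max residual for b2: {np.abs(A @ x2 - b2).max():.2e}")
```

For many right-hand sides at once, you can also stack them as columns and pass a matrix to linalg.solve directly.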

Numerical Integration

When you cannot find an analytical solution to an integral, numerical integration gives you an accurate approximation.

from scipy import integrate
import numpy as np

# Integrate a simple function: integral of sin(x) from 0 to pi
result, error = integrate.quad(lambda x: np.sin(x), 0, np.pi)
print(f"Integral of sin(x) from 0 to pi: {result:.6f} (exact: 2.0)")
print(f"Estimated error: {error:.2e}")

# Integrate a more complex function: Gaussian integral
result, error = integrate.quad(
    lambda x: np.exp(-x**2),
    -np.inf, np.inf  # Infinite limits are supported!
)
print(f"\nGaussian integral: {result:.6f} (exact: sqrt(pi) = {np.sqrt(np.pi):.6f})")

# Double integral: integral of x*y over the unit circle
def integrand(y, x):  # Note: y first, then x
    return x * y

def y_lower(x):
    return -np.sqrt(1 - x**2)

def y_upper(x):
    return np.sqrt(1 - x**2)

result, error = integrate.dblquad(integrand, -1, 1, y_lower, y_upper)
print(f"\nDouble integral of xy over unit circle: {result:.6f} (exact: 0)")

# Solve an ODE: dy/dt = -2y, y(0) = 1 (exponential decay)
def decay(t, y):
    return -2 * y

t_span = (0, 5)
t_eval = np.linspace(0, 5, 100)
solution = integrate.solve_ivp(decay, t_span, [1.0], t_eval=t_eval)

print(f"\ny(0) = {solution.y[0][0]:.4f}")
print(f"y(5) = {solution.y[0][-1]:.6f} (exact: {np.exp(-10):.6f})")

quad handles single integrals and even supports infinite limits. dblquad handles double integrals. solve_ivp solves initial value problems for ordinary differential equations, which is essential for modelling anything that changes over time: population growth, chemical reactions, mechanical systems, or circuit dynamics.
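solve_ivp handles systems of ODEs just as easily: pack the state into a vector and return one derivative per component. A minimal sketch of a damped harmonic oscillator, with illustrative coefficients:

```python
import numpy as np
from scipy import integrate

# Damped oscillator: x'' + 2*zeta*omega*x' + omega^2 * x = 0,
# rewritten as a first-order system with state y = [position, velocity]
omega, zeta = 2 * np.pi, 0.1  # illustrative frequency and damping ratio

def oscillator(t, y):
    x, v = y
    return [v, -2 * zeta * omega * v - omega**2 * x]

sol = integrate.solve_ivp(oscillator, (0, 5), [1.0, 0.0],
                          t_eval=np.linspace(0, 5, 500))

print(f"Position at t=5: {sol.y[0][-1]:.4f}")
print(f"Late-time amplitude: {np.abs(sol.y[0][-100:]).max():.3f}")
```

Each row of sol.y is one state variable evaluated at the t_eval points, so sol.y[0] is position and sol.y[1] is velocity.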

Interpolation: Filling in the Gaps

When you have data at discrete points and need values between them, interpolation creates a smooth function that passes through your data.

from scipy.interpolate import interp1d, CubicSpline
import numpy as np

# Known data points (e.g., temperature readings every 3 hours)
hours = np.array([0, 3, 6, 9, 12, 15, 18, 21, 24])
temps = np.array([15, 13, 14, 18, 24, 26, 23, 19, 16])

# Linear interpolation
linear_interp = interp1d(hours, temps, kind='linear')

# Cubic spline interpolation (smoother)
cubic_interp = CubicSpline(hours, temps)

# Evaluate at every half hour
fine_hours = np.linspace(0, 24, 49)
linear_temps = linear_interp(fine_hours)
cubic_temps = cubic_interp(fine_hours)

# Compare at hour 7.5
print(f"Temperature at 7:30 AM:")
print(f"  Linear interpolation: {linear_interp(7.5):.1f} C")
print(f"  Cubic spline: {cubic_interp(7.5):.1f} C")

# Cubic spline also gives derivatives
print(f"\nRate of change at noon: {cubic_interp(12, 1):.2f} C/hour")
print(f"Rate of change at 6 PM: {cubic_interp(18, 1):.2f} C/hour")

Linear interpolation draws straight lines between points -- simple but produces sharp corners. Cubic splines create smooth curves that pass through every point and have continuous first and second derivatives. The CubicSpline object can also compute derivatives at any point, which is useful for finding rates of change.

Signal Processing Basics

The scipy.signal module handles filtering, spectral analysis, and signal manipulation -- essential for audio processing, sensor data, and time series analysis.

from scipy import signal
import numpy as np

# Create a noisy signal: clean sine wave + noise
np.random.seed(42)
fs = 1000  # Sampling frequency (Hz)
t = np.linspace(0, 1, fs, endpoint=False)
clean_signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
noisy_signal = clean_signal + 0.5 * np.random.randn(len(t))

# Design a low-pass Butterworth filter (remove frequencies above 20 Hz)
nyquist = fs / 2
cutoff = 20 / nyquist  # Normalize to Nyquist frequency
b, a = signal.butter(N=4, Wn=cutoff, btype='low')

# Apply the filter
filtered = signal.filtfilt(b, a, noisy_signal)

def rms(x):
    return np.sqrt(np.mean(x**2))

print(f"Original signal RMS: {rms(noisy_signal):.3f}")
print(f"Filtered signal RMS: {rms(filtered):.3f}")

noise_before = rms(noisy_signal - clean_signal)
noise_after = rms(filtered - clean_signal)
print(f"Noise reduced by: {(1 - noise_after / noise_before) * 100:.1f}%")

# Find peaks in the filtered signal
peaks, properties = signal.find_peaks(filtered, height=0.5, distance=50)
print(f"\nPeaks found: {len(peaks)}")
print(f"Peak heights: {properties['peak_heights'][:5].round(3)}")

signal.butter designs a Butterworth filter with a smooth frequency response. signal.filtfilt applies it forward and backward to eliminate phase distortion. find_peaks locates local maxima in a signal, with parameters to control minimum height and distance between peaks. This pipeline -- design filter, apply filter, find features -- is the backbone of most signal processing workflows.
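Spectral analysis is the other half of this toolkit. As a sketch, Welch's method estimates the power spectral density of the noisy signal built above, making both the 5 Hz and 50 Hz components stand out against the noise floor:

```python
import numpy as np
from scipy import signal

# Rebuild the noisy signal from above: 5 Hz + 50 Hz sines plus noise
np.random.seed(42)
fs = 1000
t = np.linspace(0, 1, fs, endpoint=False)
noisy = (np.sin(2 * np.pi * 5 * t)
         + 0.5 * np.sin(2 * np.pi * 50 * t)
         + 0.5 * np.random.randn(fs))

# Welch's method; nperseg=1000 gives 1 Hz frequency resolution here
freqs, psd = signal.welch(noisy, fs=fs, nperseg=1000)

dominant = freqs[np.argmax(psd)]
ratio_50 = psd[freqs == 50.0][0] / np.median(psd)
print(f"Dominant frequency: {dominant:.0f} Hz")
print(f"PSD at 50 Hz is {ratio_50:.0f}x the median noise floor")
```

For longer recordings, a smaller nperseg with overlapping segments (the usual Welch setup) trades frequency resolution for a smoother, lower-variance estimate.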

Real-World Example: Analysing Experimental Data

Let us combine multiple SciPy tools to analyse a realistic dataset: fitting a model to noisy measurements, testing hypotheses, and quantifying uncertainty.

import numpy as np
from scipy import stats, optimize, integrate

# Simulate experimental data: drug dosage vs response
np.random.seed(42)
doses = np.array([0, 0.5, 1, 2, 5, 10, 20, 50, 100])
true_response = 100 * doses / (10 + doses)  # Hill equation (pharmacology)
measured_response = true_response + np.random.normal(0, 5, len(doses))
measured_response = np.clip(measured_response, 0, 100)

# Fit the Hill equation to data
def hill_equation(dose, v_max, k_half):
    return v_max * dose / (k_half + dose)

params, cov = optimize.curve_fit(
    hill_equation, doses, measured_response,
    p0=[100, 10],  # Initial guesses
    bounds=([0, 0], [200, 100])  # Parameter bounds
)
v_max, k_half = params
errors = np.sqrt(np.diag(cov))

print("Hill Equation Fit Results:")
print(f"  V_max = {v_max:.1f} +/- {errors[0]:.1f} (true: 100)")
print(f"  K_half = {k_half:.1f} +/- {errors[1]:.1f} (true: 10)")

# Calculate R-squared
predicted = hill_equation(doses, *params)
ss_res = np.sum((measured_response - predicted)**2)
ss_tot = np.sum((measured_response - np.mean(measured_response))**2)
r_squared = 1 - ss_res / ss_tot
print(f"  R-squared = {r_squared:.4f}")

# Calculate Area Under the Curve (AUC) using integration
auc, auc_error = integrate.quad(
    lambda d: hill_equation(d, *params), 0, 100
)
print(f"\nArea Under Curve (0-100): {auc:.1f} +/- {auc_error:.2e}")

# Statistical test: is the response at dose=50 significantly different from dose=5?
np.random.seed(42)
samples_dose5 = hill_equation(5, *params) + np.random.normal(0, 5, 30)
samples_dose50 = hill_equation(50, *params) + np.random.normal(0, 5, 30)

t_stat, p_value = stats.ttest_ind(samples_dose5, samples_dose50)
print(f"\nDose 5 vs Dose 50 comparison:")
print(f"  Mean at dose 5: {np.mean(samples_dose5):.1f}")
print(f"  Mean at dose 50: {np.mean(samples_dose50):.1f}")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.2e}")
print(f"  Significant difference: {'Yes' if p_value < 0.05 else 'No'}")

This example mirrors a real pharmacology workflow: fit a dose-response curve, quantify parameter uncertainty, calculate the area under the curve (a standard efficacy metric), and test whether two dosage levels produce significantly different responses. The same pattern applies to any field where you need to model data and draw statistical conclusions.

Frequently Asked Questions

What is the difference between NumPy and SciPy?

NumPy provides the fundamental array data structure and basic operations like element-wise math, reshaping, and basic linear algebra. SciPy builds on NumPy and adds higher-level scientific algorithms: optimization, statistics, signal processing, integration, interpolation, and advanced linear algebra. Think of NumPy as the foundation and SciPy as the specialised toolkit.

Should I use scipy.linalg or numpy.linalg?

scipy.linalg is a superset of numpy.linalg with additional decompositions and solvers. It is also always compiled against optimised BLAS and LAPACK, which is optional for NumPy builds, so it can be faster. For basic operations like dot or norm, either works fine. For decompositions (LU, Cholesky, SVD) or specialised solvers, prefer scipy.linalg.

How do I choose between different optimization methods?

If your function is smooth and differentiable, use L-BFGS-B or BFGS (fast, gradient-based). If you cannot compute gradients, use Nelder-Mead or Powell. If your function has many local minima, consider differential_evolution (global optimizer). For bounded problems, L-BFGS-B handles box constraints natively.
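To see why a global optimizer matters, here is a sketch on the Rastrigin function, a standard multimodal test problem with a global minimum of 0 at the origin:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# Rastrigin function: a grid of local minima, global minimum at the origin
def rastrigin(x):
    x = np.asarray(x)
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

bounds = [(-5.12, 5.12), (-5.12, 5.12)]
result = differential_evolution(rastrigin, bounds, seed=42)
print(f"Global optimizer: x = {result.x.round(4)}, f = {result.fun:.6f}")

# A local gradient-based method started far from the origin
# typically gets trapped in a nearby local minimum instead
local = minimize(rastrigin, x0=[3.0, 3.0], method='BFGS')
print(f"BFGS from (3, 3): x = {local.x.round(4)}, f = {local.fun:.4f}")
```

The price of the global search is many more function evaluations, so reserve differential_evolution for genuinely multimodal problems.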

Can SciPy handle large datasets?

SciPy works with NumPy arrays, so it handles arrays that fit in memory efficiently. For very large sparse matrices, use scipy.sparse, which stores only non-zero elements. For datasets larger than memory, consider chunked processing or libraries like Dask that parallelise SciPy operations.
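As a sketch of the sparse approach, here is a tridiagonal system with 10,000 unknowns; a dense matrix of this size would need 100 million entries, while the sparse version stores only about 30,000:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# A 10,000 x 10,000 tridiagonal matrix: 2 on the diagonal, -1 off-diagonal
n = 10_000
diagonals = [np.full(n, 2.0), np.full(n - 1, -1.0), np.full(n - 1, -1.0)]
A = sparse.diags(diagonals, offsets=[0, 1, -1], format='csc')

b = np.ones(n)
x = spsolve(A, b)  # Sparse direct solver

print(f"Stored non-zeros: {A.nnz}")
print(f"Max residual: {np.abs(A @ x - b).max():.2e}")
```

The 'csc' format is what spsolve prefers; 'csr' is better for fast matrix-vector products, and converting between them is cheap.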

How do I choose the right statistical test?

For comparing two group means with normal data, use the t-test (ttest_ind). For non-normal data, use Mann-Whitney U (mannwhitneyu). For more than two groups, use one-way ANOVA (f_oneway) or Kruskal-Wallis (kruskal). Always check normality with shapiro first and check equal variances with levene before choosing a parametric test.
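A sketch of that workflow on three synthetic groups: check the equal-variance assumption with levene, then run both the parametric and non-parametric tests:

```python
import numpy as np
from scipy import stats

# Three groups with different means but equal spread
np.random.seed(42)
g1 = np.random.normal(70, 10, 40)
g2 = np.random.normal(75, 10, 40)
g3 = np.random.normal(80, 10, 40)

# Assumption check: Levene's test for equal variances
_, p_levene = stats.levene(g1, g2, g3)
print(f"Levene (equal variances): p = {p_levene:.3f}")

# Parametric: one-way ANOVA across the three groups
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.2e}")

# Non-parametric alternative: Kruskal-Wallis
h_stat, p_kw = stats.kruskal(g1, g2, g3)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.2e}")
```

A significant ANOVA only says that some group differs; to find which pairs differ, follow up with a post-hoc test such as stats.tukey_hsd.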

Wrapping Up

SciPy gives you access to decades of scientific computing algorithms through a clean, consistent Python interface. You have learned how to optimise functions and fit models with scipy.optimize, run statistical tests and work with distributions using scipy.stats, solve linear algebra problems with scipy.linalg, integrate functions numerically with scipy.integrate, interpolate data with scipy.interpolate, and process signals with scipy.signal. The key to using SciPy effectively is knowing which subpackage to reach for and understanding the assumptions behind each algorithm. Start with the examples in this tutorial, adapt them to your own data, and you will find that SciPy handles the mathematical heavy lifting while you focus on the science.
