Markov chains are one of the most widely useful mathematical models for random systems that evolve step-by-step with no memory except the present state. They appear in probability theory, statistics, physics, computer science, genetics, finance, queueing theory, machine learning (HMMs, MCMC), and many other fields. This guide covers theory, equations, classifications, convergence, algorithms, worked examples, continuous-time variants, applications, and pointers for further study.
What is a Markov chain?
A (discrete-time) Markov chain is a stochastic process $(X_n)_{n \ge 0}$ on a state space $S$ (finite or countable, sometimes continuous) that satisfies the Markov property:

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i).$$

The future depends only on the present, not the full past.
We usually describe a Markov chain by its one-step transition probabilities. For a discrete state space $S$, define the transition matrix $P$ with entries $P_{ij} = P(X_{n+1} = j \mid X_n = i)$. By construction, every row of $P$ sums to 1: $\sum_j P_{ij} = 1$ for all $i$. If $S$ is finite with size $N$, $P$ is an $N \times N$ row-stochastic matrix.
Multi-step transitions and Chapman–Kolmogorov
The $n$-step transition probabilities are entries of the matrix power $P^n$:

$$P^{(n)}_{ij} = P(X_{m+n} = j \mid X_m = i).$$

They obey the Chapman–Kolmogorov equations: $P^{(m+n)} = P^{(m)} P^{(n)}$, or in entries $P^{(m+n)}_{ij} = \sum_k P^{(m)}_{ik} P^{(n)}_{kj}$. Thus the $n$-step probabilities are just matrix powers: $P^{(n)} = P^n$.
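A minimal numerical sketch of these identities, assuming NumPy; the 3-state matrix is purely illustrative:

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# n-step transition probabilities are entries of the matrix power P^n.
n = 5
Pn = np.linalg.matrix_power(P, n)
print(Pn[0, 2])  # P(X_{m+5} = 2 | X_m = 0)

# Chapman-Kolmogorov check: P^(m+n) = P^m P^n.
m = 3
assert np.allclose(np.linalg.matrix_power(P, m + n),
                   np.linalg.matrix_power(P, m) @ Pn)
```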
Examples (simple and illuminating)
1. Two-state chain (worked example)
State space $S = \{1, 2\}$. Let
$$P = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}, \qquad 0 < a, b < 1.$$
The stationary distribution $\pi$ satisfies $\pi P = \pi$ and $\pi_1 + \pi_2 = 1$. Write $\pi = (\pi_1, \pi_2)$.
From $\pi P = \pi$ we get (component equations) $\pi_1 (1-a) + \pi_2 b = \pi_1$ and $\pi_1 a + \pi_2 (1-b) = \pi_2$. Rearranging either equation gives $a \pi_1 = b \pi_2$. Dividing both sides by $b$: $\pi_2 = (a/b)\,\pi_1$.
Using the normalization $\pi_1 + \pi_2 = 1$ gives $\pi_1 (1 + a/b) = 1$, so $\pi_1 = \frac{b}{a+b}$. Then $\pi_2 = \frac{a}{a+b}$.
So the stationary distribution is
$$\pi = \left( \frac{b}{a+b},\ \frac{a}{a+b} \right).$$
(You can check $\pi P = \pi$; e.g. the first component is $\frac{b}{a+b}(1-a) + \frac{a}{a+b}\,b = \frac{b - ab + ab}{a+b} = \frac{b}{a+b}$.)
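A quick numerical check of this result; a minimal sketch, assuming NumPy, with arbitrary parameter values:

```python
import numpy as np

# Two-state chain with illustrative parameters a, b.
a, b = 0.3, 0.1
P = np.array([[1 - a, a],
              [b, 1 - b]])

# Closed-form stationary distribution derived above.
pi = np.array([b / (a + b), a / (a + b)])

# Check pi P = pi.
assert np.allclose(pi @ P, pi)
print(pi)  # [0.25 0.75]
```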
2. Simple random walk on a finite cycle
On states $\{0, 1, \dots, n-1\}$ with $P_{i,\, i+1 \bmod n} = p$ and $P_{i,\, i-1 \bmod n} = 1-p$. If $p = 1/2$ the stationary distribution is uniform: $\pi_i = 1/n$ for all $i$.
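A short check, assuming NumPy; the cycle length is arbitrary:

```python
import numpy as np

# Simple random walk on a cycle of n states with p = 1/2.
n, p = 6, 0.5
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = p
    P[i, (i - 1) % n] = 1 - p

pi = np.full(n, 1 / n)          # uniform distribution
assert np.allclose(pi @ P, pi)  # uniform is stationary
```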
Classification of states
For a Markov chain on a countable state space $S$, states are classified by accessibility and recurrence.
- Accessible: $i \to j$ if $P^{(n)}_{ij} > 0$ for some $n \ge 0$.
- Communicate: $i \leftrightarrow j$ if both $i \to j$ and $j \to i$. Communication partitions $S$ into classes.

For a state $i$:
- Transient: with probability < 1 you ever return to $i$.
- Recurrent (persistent): with probability 1 you eventually return to $i$.
  - Positive recurrent: expected return time $\mathbb{E}_i[T_i] < \infty$.
  - Null recurrent: expected return time infinite.
- Periodic: the period is $d(i) = \gcd\{n \ge 1 : P^{(n)}_{ii} > 0\}$. If $d(i) = 1$ the state is aperiodic.
Important facts:
- Communication classes are either all transient or all recurrent (a class-finding sketch follows this list).
- In a finite irreducible chain, all states are positive recurrent, and there exists a unique stationary distribution.
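Communication classes of a finite chain are the strongly connected components of the directed graph with an edge $i \to j$ whenever $P_{ij} > 0$. A minimal sketch, assuming SciPy is available; the matrix is illustrative:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# Chain with a transient state: state 0 leaks into the class {1, 2}.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.4, 0.6],
    [0.0, 0.7, 0.3],
])

# Communication classes = strongly connected components of the
# directed graph with an edge i -> j whenever P[i, j] > 0.
n_classes, labels = connected_components(P > 0, directed=True,
                                         connection='strong')
print(n_classes, labels)  # 2 classes; state 0 forms its own (transient) class
```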
Stationary distributions and invariant measures
A probability vector $\pi$ (a row vector) is stationary if $\pi P = \pi$.
If the chain starts in $\pi$ then it stays in $\pi$ (the marginal distribution at every time is $\pi$).
For irreducible, positive recurrent chains, a unique stationary distribution exists. For finite irreducible chains this is automatic.
Detailed balance and reversibility
A stronger condition is detailed balance: $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$.
If detailed balance holds, the chain is reversible (the time-reversal has the same law). Many constructions (e.g., Metropolis–Hastings) enforce detailed balance to guarantee that $\pi$ is stationary.
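A quick numerical check of detailed balance for the two-state chain above; a minimal sketch, assuming NumPy:

```python
import numpy as np

# Two-state chain and its stationary distribution, as derived earlier.
a, b = 0.3, 0.1
P = np.array([[1 - a, a], [b, 1 - b]])
pi = np.array([b / (a + b), a / (a + b)])

# pi_i P_ij == pi_j P_ji for all i, j  <=>  the flow matrix is symmetric.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)  # the chain is reversible
```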
Convergence, ergodicity, and mixing
Ergodicity
An irreducible, aperiodic, positive recurrent Markov chain is ergodic: for any initial distribution $\mu$,
$$\lim_{n \to \infty} \mu P^n = \pi,$$
i.e., the chain converges to the stationary distribution.
Total variation distance
Define the total variation distance between two distributions $\mu, \nu$ on $S$:
$$\|\mu - \nu\|_{TV} = \tfrac{1}{2} \sum_{x \in S} |\mu(x) - \nu(x)|.$$
The mixing time $t_{\mathrm{mix}}(\varepsilon)$ is the smallest $t$ such that $\max_x \|P^t(x, \cdot) - \pi\|_{TV} \le \varepsilon$.
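A brute-force sketch of these definitions, assuming NumPy; dense matrix powers, so only sensible for small chains:

```python
import numpy as np

def tv_distance(mu, nu):
    # Total variation distance: half the L1 distance.
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(P, pi, eps=0.25):
    # Smallest t with max_x || P^t(x, .) - pi ||_TV <= eps.
    Pt = np.eye(len(pi))
    for t in range(1, 10_000):
        Pt = Pt @ P
        if max(tv_distance(row, pi) for row in Pt) <= eps:
            return t
    raise RuntimeError("did not mix within the iteration budget")

# Example with the two-state chain from above:
a, b = 0.3, 0.1
P = np.array([[1 - a, a], [b, 1 - b]])
pi = np.array([b / (a + b), a / (a + b)])
print(mixing_time(P, pi))  # 3 for these parameters
```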
Spectral gap and relaxation time (finite-state reversible chains)
For a reversible finite chain, the transition matrix $P$ has real eigenvalues $1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_N \ge -1$. Roughly,
- The time to approach stationarity scales like the relaxation time $t_{\mathrm{rel}} = 1/(1 - \lambda_2)$.
- Larger spectral gap $1 - \lambda_2$ → faster mixing.
(There are precise inequalities; the spectral approach is fundamental.)
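For intuition, a small sketch computing the gap for the two-state chain above, assuming NumPy:

```python
import numpy as np

# Eigenvalues of a reversible chain are real; the relaxation time is
# 1 / (1 - lambda_2). Illustrative dense computation.
a, b = 0.3, 0.1
P = np.array([[1 - a, a], [b, 1 - b]])
eig = np.sort(np.linalg.eigvals(P).real)[::-1]  # 1 = lambda_1 >= lambda_2 >= ...
gap = 1 - eig[1]   # spectral gap; here lambda_2 = 1 - a - b = 0.6
print(1 / gap)     # relaxation time, 2.5 in this example
```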
Hitting times, commute times, and potential theory
Let $\tau_A = \min\{n \ge 0 : X_n \in A\}$ be the hitting time of a set $A$. The expected hitting times $h(i) = \mathbb{E}_i[\tau_A]$ solve the linear equations:
$$h(i) = 0 \ \text{for } i \in A, \qquad h(i) = 1 + \sum_j P_{ij}\, h(j) \ \text{for } i \notin A.$$
These linear systems are effective in computing mean times to absorption, cover times, etc. In reversible chains there are intimate connections between hitting times, electrical networks, and effective resistance.
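A sketch of that linear-system computation, assuming NumPy; the example chain and function name are illustrative:

```python
import numpy as np

def expected_hitting_times(P, A):
    """Solve h = 1 + P h off A, h = 0 on A (finite chain, A reachable)."""
    n = len(P)
    A = set(A)
    rest = [i for i in range(n) if i not in A]
    # Restrict the system to states outside A: (I - P_rest) h_rest = 1.
    Q = P[np.ix_(rest, rest)]
    h_rest = np.linalg.solve(np.eye(len(rest)) - Q, np.ones(len(rest)))
    h = np.zeros(n)
    h[rest] = h_rest
    return h

# Example: a small walk on {0, 1, 2, 3} absorbed at state 3.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])
print(expected_hitting_times(P, A={3}))
```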
Continuous-time Markov chains (CTMC)
Discrete-time Markov chains jump at integer times. In continuous time we have a Markov process with a generator matrix $Q$ satisfying $q_{ij} \ge 0$ for $i \ne j$, and $q_{ii} = -\sum_{j \ne i} q_{ij}$ (rows sum to 0).
For a CTMC the transition function is $P(t) = e^{tQ}$, and the Kolmogorov forward/backward equations hold:
- Forward (Kolmogorov): $P'(t) = P(t)\, Q$.
- Backward: $P'(t) = Q\, P(t)$.
Poisson processes and birth–death processes are prototypical CTMCs. For a birth–death process with birth rates $\lambda_n$ and death rates $\mu_n$, the stationary distribution (if it exists) has product form:
$$\pi_n = \pi_0 \prod_{k=1}^{n} \frac{\lambda_{k-1}}{\mu_k}.$$
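A sketch of the product form, assuming NumPy; the truncated M/M/1 queue and function name are illustrative, with a sanity check against the generator ($\pi Q = 0$):

```python
import numpy as np

def birth_death_stationary(lam, mu):
    """Product-form stationary distribution of a finite birth-death CTMC.

    lam[k] = birth rate in state k (k = 0..N-1); mu[k-1] = death rate in
    state k (k = 1..N). Names and the truncation at N are illustrative.
    """
    N = len(lam)
    w = np.ones(N + 1)
    for k in range(1, N + 1):
        w[k] = w[k - 1] * lam[k - 1] / mu[k - 1]
    return w / w.sum()

# M/M/1-like queue truncated at N = 10, arrival rate 1, service rate 2:
lam, mu = np.ones(10), 2 * np.ones(10)
pi = birth_death_stationary(lam, mu)

# Sanity check against the generator: pi Q = 0.
N = len(lam)
Q = np.zeros((N + 1, N + 1))
for k in range(N):
    Q[k, k + 1] = lam[k]        # birth k -> k+1
    Q[k + 1, k] = mu[k]         # death k+1 -> k
Q -= np.diag(Q.sum(axis=1))     # rows of a generator sum to 0
assert np.allclose(pi @ Q, 0)
print(pi[0])  # probability the queue is empty
```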
Examples of important chains
- Random walk on graphs: $P_{uv} = 1/\deg(u)$ for each edge $\{u, v\}$. Stationary distribution $\pi_u = \deg(u)/2|E|$.
- Birth–death chains: 1D nearest-neighbour transitions with closed-form stationary formulas.
- Glauber dynamics (Ising model): Markov chain on spin configurations used in statistical physics and MCMC.
- PageRank: random surfer with teleportation; the stationary vector solves $\pi G = \pi$ for the Google matrix $G$.
- Markov chain Monte Carlo (MCMC): design $P$ with target stationary distribution $\pi$ (Metropolis–Hastings, Gibbs).
Markov Chain Monte Carlo (MCMC)
Goal: sample from a complicated target distribution $\pi$ on a large state space. Strategy: construct an ergodic chain with stationary distribution $\pi$.
Metropolis–Hastings
Given a proposal kernel $q(x, y)$, the acceptance probability is
$$\alpha(x, y) = \min\left(1,\ \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}\right).$$
Algorithm:
- At state $x$, propose $y \sim q(x, \cdot)$.
- With probability $\alpha(x, y)$ move to $y$; otherwise stay at $x$.
This enforces detailed balance and hence stationarity.
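A compact random-walk Metropolis sketch, assuming NumPy; the standard-normal target, the step size, and the function names are illustrative. With a symmetric proposal the $q$ ratio cancels in $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log-density; a standard normal here, purely illustrative.
    return -0.5 * x**2

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    """Random-walk Metropolis: symmetric proposal, so q cancels in alpha."""
    x, out = x0, np.empty(n_samples)
    for i in range(n_samples):
        y = x + step * rng.normal()    # propose y ~ q(x, .)
        if np.log(rng.random()) < log_target(y) - log_target(x):
            x = y                      # accept with probability alpha(x, y)
        out[i] = x                     # otherwise stay at x
    return out

samples = metropolis_hastings(10_000)
print(samples.mean(), samples.std())   # roughly 0 and 1
```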
Gibbs sampling
A special case where the proposal is the conditional distribution of one coordinate given others; always accepted.
MCMC performance is measured by mixing time and autocorrelation; diagnostics include effective sample size, trace plots, and Gelman–Rubin statistics.
Limits & limit theorems
- Ergodic theorem for Markov chains: for an ergodic chain and a function $f$ with $\mathbb{E}_\pi|f| < \infty$,
$$\frac{1}{n} \sum_{k=1}^{n} f(X_k) \to \mathbb{E}_\pi[f] \quad \text{a.s.},$$
i.e. time averages converge to ensemble averages (simulated in the sketch after this list).
- Central limit theorem (CLT): under mixing conditions, $\sqrt{n}\,\big(\frac{1}{n}\sum_{k=1}^{n} f(X_k) - \mathbb{E}_\pi[f]\big)$ converges in distribution to a normal with asymptotic variance expressible via the Green–Kubo formula (a sum of autocovariances).
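A simulation sketch of the ergodic theorem on the two-state chain from earlier, assuming NumPy; with $f(x) = x$ the time average should approach $\pi_1 = a/(a+b)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the two-state chain and compare the time average of
# f(x) = x with its stationary expectation.
a, b = 0.3, 0.1
P = np.array([[1 - a, a], [b, 1 - b]])
pi = np.array([b / (a + b), a / (a + b)])

x, n, total = 0, 100_000, 0
for _ in range(n):
    x = rng.choice(2, p=P[x])   # one step of the chain
    total += x
print(total / n, pi[1])         # time average vs. ensemble average (0.75)
```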
Tools for bounding mixing times
- Coupling: Construct two copies of the chain started from different initial states; if they couple (meet) quickly, that yields bounds on mixing.
- Conductance (Cheeger-type inequality): define, for stationary distribution $\pi$,
$$\Phi = \min_{S :\, \pi(S) \le 1/2} \frac{\sum_{x \in S,\, y \notin S} \pi(x) P_{xy}}{\pi(S)}.$$
Small conductance implies slow mixing. Cheeger inequalities relate $\Phi$ to the spectral gap.
- Canonical paths / comparison methods for complex chains.
Hidden Markov Models (HMMs)
An HMM combines a Markov chain on hidden states with an observation model. Important algorithms:
- Forward algorithm: computes likelihood efficiently.
- Viterbi algorithm: finds most probable hidden state path.
- Baum–Welch (EM): learns HMM parameters from observed sequences.
HMMs are used in speech recognition, bioinformatics (gene prediction), and time-series modeling.
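As a concrete illustration of the forward algorithm, here is a minimal sketch, assuming NumPy; the model matrices and names are invented for the example:

```python
import numpy as np

def forward_likelihood(A, B, init, obs):
    """HMM forward algorithm: P(observations) in O(T * n^2) time.

    A[i, j] = hidden-state transition probability, B[i, k] = P(obs k | state i),
    init = initial state distribution.
    """
    alpha = init * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then weight by emission
    return alpha.sum()

# Toy 2-state, 2-symbol model:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood(A, B, init=np.array([0.5, 0.5]), obs=[0, 1, 1]))
```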
Practical computations & linear algebraic viewpoint
- The stationary distribution $\pi$ solves the linear system $\pi P = \pi$ with normalization $\sum_i \pi_i = 1$.
- For large sparse $P$, compute $\pi$ by power iteration: repeatedly multiply an initial vector by $P$ until convergence (this is the approach used by PageRank with damping; a sketch follows this list).
- For reversible chains, solving weighted eigenproblems is numerically better.
Common pitfalls & intuition checks
- Not every stochastic matrix converges to a unique stationary distribution. Need irreducibility and aperiodicity (or consider periodic limiting behavior).
- Infinite state spaces can be subtle: e.g., the simple symmetric random walk on $\mathbb{Z}^d$ is recurrent in 1D and 2D (it returns with probability 1) but only null recurrent there (no finite stationary distribution); in 3D it is transient.
- Ergodicity vs. speed: the existence of $\pi$ does not imply rapid mixing; chains can be ergodic yet mix extremely slowly (metastability).
Applications (selective)
- Search & ranking: PageRank.
- Statistical physics: Monte Carlo sampling, Glauber dynamics, Ising/Potts models.
- Machine learning: MCMC for Bayesian inference, HMMs.
- Genetics & population models: Wright–Fisher and Moran models (Markov chains on counts).
- Queueing theory: Birth–death processes, M/M/1 queues modeled by CTMCs.
- Finance: Regime-switching models, credit rating transitions.
- Robotics & control: Markov decision processes (MDPs) extend Markov chains with rewards and control.
Conceptual diagrams (you can draw these)
- State graph: nodes = states; directed edges $i \to j$ labeled by $P_{ij}$.
- Transition matrix heatmap: show the $P_{ij}$ as colors; power-iteration evolution of a distribution vector.
- Mixing illustration: plot the total-variation distance $\|\mu P^t - \pi\|_{TV}$ vs $t$.
- Coupling picture: two walkers from different starts that merge, then move together.
Further reading and resources
- Introductory
- J. R. Norris, Markov Chains — clear, readable.
- Levin, Peres & Wilmer, Markov Chains and Mixing Times — excellent for mixing time theory and applications.
- Applied / Algorithms
- Brooks et al., Handbook of Markov Chain Monte Carlo — practical MCMC methods.
- Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.
- Advanced / Theory
- Aldous & Fill, Reversible Markov Chains and Random Walks on Graphs (available online).
- Meyn & Tweedie, Markov Chains and Stochastic Stability — ergodicity for general state spaces.
Quick reference of key formulas (summary)
- Chapman–Kolmogorov: $P^{(m+n)} = P^{(m)} P^{(n)}$.
- Stationary distribution: $\pi P = \pi$, $\sum_i \pi_i = 1$.
- Detailed balance (reversible): $\pi_i P_{ij} = \pi_j P_{ji}$.
- Expected hitting times: $h(i) = 0$ for $i \in A$; $h(i) = 1 + \sum_j P_{ij} h(j)$ for $i \notin A$.
- CTMC generator relation: $P(t) = e^{tQ}$, $P'(t) = Q P(t) = P(t) Q$.
Final thoughts
Markov chains are deceptively simple to define yet enormously rich. The central tension is between local simplicity (memoryless one-step dynamics) and global complexity (long-term behavior, hitting times, mixing). Whether you need to analyze a queue, design a sampler, or reason about random walks on networks, Markov chain theory supplies powerful tools — algebraic (eigenvalues), probabilistic (hitting/return times), and algorithmic (coupling, MCMC).