Markov Chain Routing Models for Synthetic Trajectories

A Markov-chain routing model turns a network and a set of transition probabilities into fresh, statistically grounded movement traces — without replaying a single real journey. Part of Trajectory & Movement Simulation, this page covers the specific sub-problem of synthesizing sequential, network-constrained paths (vehicle routes, transit legs, pedestrian walks) that reproduce the route-choice structure of a source distribution while remaining reproducible, topologically valid, and privacy-auditable. Where the parent area frames the fidelity-versus-privacy trade-off across every movement primitive, here we make it concrete for routing: the row-stochastic transition matrix $\mathbf{P}$ is the contract, and every downstream gate tests how faithfully a sampled walk honours it.

Problem Framing

The failure that Markov routing exists to prevent is the plausible-but-wrong synthetic journey: a path that looks reasonable on a basemap but silently misrepresents how traffic actually distributes across a network. Two engineering failures dominate.

The first is deterministic monoculture from shortest-path generation. Routing every synthetic agent along the single least-cost path collapses behavioral variance: real populations spread across parallel corridors, rat-runs, and habit-driven detours. A model trained on shortest-path-only synthetic data learns a route prior that does not exist, and it underestimates edge-load variance the moment it meets real demand. Sampling next-hops from a calibrated transition distribution restores that spread, and the macroscopic origin-destination flow is validated against the same statistics that describe the source.

The second is silent non-reproducibility and absorbing-state collapse. A Markov walk is stochastic by construction, so without a propagated seed, a pinned toolchain, and a fixed node ordering, two runs of the “same” pipeline produce different traces and different flow statistics. Worse, an un-smoothed matrix built from sparse telemetry contains rows that sum to zero — dead-end states that trap the walk and silently truncate every trajectory passing through them. Treating the seed, the node index, and the transition matrix as versioned, first-class artifacts is the discipline that closes both gaps, governed by the same CRS contract enforcement that fixes the spatial envelope before any walk is drawn.

Prerequisites & Toolchain

Markov routing sits on the standard geospatial Python stack plus a graph library for topology and a sparse-matrix backend for the transition operator. Pin to the major versions used across this site so that node ordering, RNG streams, and sparse arithmetic behave identically across environments.

python
# requirements.txt — pin majors; let patch releases float
numpy==1.26.*
scipy==1.11.*           # CSR/CSC sparse matrices, sparse linear algebra
networkx==3.2.*         # directed graph construction + SCC analysis
shapely==2.0.*          # GEOS-backed geometry for node snapping
pyproj==3.6.*           # CRS transforms (PROJ 9.x under the hood)
geopandas==0.14.*       # GeoDataFrame I/O for the road / network layer

Two settings must be explicit, not inherited, or you will hit non-deterministic behaviour that masquerades as a calibration bug: the PROJ data directory and the Python hash seed.

bash
# Resolve PROJ data deterministically; never rely on the ambient default.
# PROJ 9.1+ reads PROJ_DATA; PROJ_LIB is the legacy name that older releases expect.
export PROJ_DATA="$(python -c 'import pyproj; print(pyproj.datadir.get_data_dir())')"
export PROJ_LIB="$PROJ_DATA"
export PYTHONHASHSEED=0   # set before Python starts; stabilizes str hashing and thus set iteration order

All topology construction happens in a projected, metric CRS, never in geographic degrees. Edge lengths, turn penalties, and node-snapping tolerances are distance-sensitive, so deriving them in EPSG:4326 distorts every weight by latitude. Reproject the network at ingestion to a local UTM zone or an equal-area CRS, build and sample there, and reproject the assembled trace back to EPSG:4326 only at serialization. When kinematic realism matters, this topological layer is paired with physics-based path generation, which constrains velocity, turning radius, and acceleration so that a stochastically selected hop sequence stays physically traversable.

Core Concept: The Transition Operator and Ergodicity

A Markov-chain routing model is a directed graph $G = (V, E)$ plus a transition operator $\mathbf{P}$ whose entry $P_{ij}$ is the probability of stepping from node $i$ to node $j$ in one move. Nodes are discrete spatial entities — intersections, transit stops, grid centroids, POI clusters — and edges encode feasible transitions. The model rests on the first-order Markov property: the next state depends only on the current state, not the path taken to reach it. That memorylessness is what makes the matrix compact and the sampling cheap, and it is also the assumption you validate against later.

For $\mathbf{P}$ to be a valid one-step operator it must be row-stochastic — every row is a probability distribution over destinations:

$$\sum_{j} P_{ij} = 1 \quad \forall i \in V.$$

Two structural properties decide whether the chain produces usable trajectories. Irreducibility means every state is reachable from every other state, so the walk cannot become marooned in a disconnected sub-network; you verify it by confirming the graph has a single strongly connected component. Aperiodicity rules out forced cycles that would make flow oscillate deterministically. On a finite chain, irreducibility alone guarantees a unique stationary distribution $\boldsymbol{\pi}$ satisfying $\boldsymbol{\pi}\mathbf{P} = \boldsymbol{\pi}$, the long-run fraction of time the chain spends at each node; aperiodicity additionally guarantees that the walk’s state distribution converges to $\boldsymbol{\pi}$ from any starting node. For the synthetic flow to be faithful, $\boldsymbol{\pi}$ must match the source’s node-occupancy frequencies.
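Both structural properties can be checked directly on the support graph before any probabilities are assigned. The sketch below is a minimal illustration using networkx; the helper name `chain_structure` and the two toy graphs are hypothetical and not part of the pipeline above.

```python
import networkx as nx

def chain_structure(graph: nx.DiGraph) -> dict[str, bool]:
    """Report irreducibility and aperiodicity of the chain's support graph."""
    irreducible = nx.is_strongly_connected(graph)
    # Only report aperiodicity for an irreducible graph, where one period applies to all states.
    aperiodic = irreducible and nx.is_aperiodic(graph)
    return {"irreducible": irreducible, "aperiodic": aperiodic}

# A directed 3-cycle is irreducible, but every return takes a multiple of 3 steps.
ring = nx.DiGraph([(0, 1), (1, 2), (2, 0)])
# Adding the edge 1 -> 0 creates a 2-cycle; cycle lengths 2 and 3 have gcd 1.
mixed = nx.DiGraph([(0, 1), (1, 2), (2, 0), (1, 0)])

print(chain_structure(ring))
print(chain_structure(mixed))
```

Real road networks with two-way streets usually contain cycles of coprime length, so periodicity tends to appear only in small or artificially regular test graphs.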

The practical threat is the dead-end state: a row that sums to zero because no telemetry was ever observed leaving that node. Once entered, the walk cannot leave, so the node behaves like an absorbing state and truncates the trace. The fix is calibration smoothing: adding a small Laplace (or Dirichlet) pseudo-count to every feasible transition so no reachable edge has exactly zero probability while the empirical signal still dominates. The spectral gap $1 - |\lambda_2|$, the distance between the leading eigenvalue (always 1) and the second-largest eigenvalue modulus, governs mixing speed: a wide gap means short walks already sample close to $\boldsymbol{\pi}$, while a narrow gap warns that trajectories must run long before their statistics stabilize.
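For a small network, the stationary distribution and the spectral gap can be read off a dense eigendecomposition. The sketch below assumes a dense NumPy matrix and a hypothetical three-node corridor; at city scale a sparse eigensolver or power iteration would be the more realistic choice.

```python
import numpy as np

def stationary_and_gap(P: np.ndarray) -> tuple[np.ndarray, float]:
    """Return (pi, spectral gap) for a small dense row-stochastic matrix."""
    eigvals, eigvecs = np.linalg.eig(P.T)      # columns are left eigenvectors of P
    order = np.argsort(-np.abs(eigvals))       # sort by modulus, largest first
    lead = np.real(eigvecs[:, order[0]])       # eigenvector for eigenvalue 1
    pi = lead / lead.sum()                     # normalize into a distribution
    gap = 1.0 - float(np.abs(eigvals[order[1]]))
    return pi, gap

# Hypothetical 3-node corridor: the middle node links the two ends.
P = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])
pi, gap = stationary_and_gap(P)
print(pi, gap)
```

Here the eigenvalues are 1, 0.5, and 0, so the gap is 0.5 and the chain spends half its time at the middle node.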

Step-by-Step Implementation

The sequence below builds a calibrated, row-stochastic operator and samples reproducible walks from it, in a projected CRS under a fixed seed. Each block is copy-pasteable and builds on the previous one.

Step 1 — Build a spatially masked directed graph

Construct the network as a networkx.DiGraph with nodes indexed deterministically and infeasible edges removed before any probability is assigned. One-way restrictions, physical barriers, and disconnected fragments are masked here, not patched later.

python
import networkx as nx
import geopandas as gpd

def build_network(edges: gpd.GeoDataFrame, metric_crs: str = "EPSG:32633") -> nx.DiGraph:
    """Build a directed routing graph in a projected CRS, dropping infeasible edges."""
    edges = edges.to_crs(metric_crs)            # distances must be metric, not degrees
    g = nx.DiGraph()
    for row in edges.itertuples():
        if row.access == "forbidden" or row.length_m <= 0:
            continue                            # spatial mask: never enters the matrix
        g.add_edge(row.u, row.v, flow_weight=float(row.flow_weight))
        if not row.oneway:
            g.add_edge(row.v, row.u, flow_weight=float(row.flow_weight))
    # Keep only the largest strongly connected component => irreducibility.
    scc = max(nx.strongly_connected_components(g), key=len)
    return g.subgraph(scc).copy()

Step 2 — Construct a row-stochastic transition matrix

Map empirical edge flows into a sparse CSR matrix, apply Laplace smoothing to eliminate absorbing states, then normalize each row to sum to one. Sorting the node list is what makes indexing deterministic across hosts.

python
import numpy as np
from scipy.sparse import csr_matrix, diags

def build_transition_matrix(graph: nx.DiGraph, smoothing_alpha: float = 0.01) -> csr_matrix:
    """Row-stochastic transition matrix with spatial masking and Laplace smoothing."""
    nodes = sorted(graph.nodes())               # sorted => deterministic indexing
    node_idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)

    rows, cols, data = [], [], []
    for u, v, d in graph.edges(data=True):
        # Every feasible edge receives the pseudo-count, including zero-flow edges,
        # so no node with an outgoing edge is left with an empty (dead-end) row.
        # Adding alpha to the diagonal instead would create self-loops that trap the walk.
        weight = max(float(d.get("flow_weight", 1.0)), 0.0) + smoothing_alpha
        rows.append(node_idx[u]); cols.append(node_idx[v]); data.append(weight)

    smoothed = csr_matrix((data, (rows, cols)), shape=(n, n))

    row_sums = np.asarray(smoothed.sum(axis=1)).flatten()
    inv = 1.0 / np.where(row_sums > 0, row_sums, 1.0)   # guard nodes with no out-edges
    return (diags(inv) @ smoothed).tocsr()               # scale each row to sum to one

Step 3 — Sample reproducible trajectories

At each timestep, draw the next state from the current row’s categorical distribution using a single seeded generator. The one np.random.default_rng(seed) instance is the only entropy source, so identical seeds yield byte-identical walks.

python
def sample_trajectory(
    P: csr_matrix, start_idx: int, max_steps: int, seed: int,
) -> list[int]:
    """Sample one walk of node indices from a row-stochastic matrix under a fixed seed."""
    rng = np.random.default_rng(seed)
    path, state = [start_idx], start_idx
    for _ in range(max_steps):
        row = P.getrow(state)
        dests, probs = row.indices, row.data
        if dests.size == 0:                      # should not happen post-smoothing
            break
        state = int(rng.choice(dests, p=probs / probs.sum()))
        path.append(state)
    return path

Step 4 — Assemble traces and hand off to noise modelling

Concatenate sampled states into geometry until a terminal condition (max length, destination state, or dwell threshold) is met, reproject to EPSG:4326, and assign per-edge dwell and velocity. Sensor realism is added downstream by noise injection and stochastic drift, which decouples topological routing from GNSS error so each can be validated independently, while temporal synchronization for moving objects assigns the realistic timestamps that downstream streaming analytics expect.

python
import geopandas as gpd
from shapely.geometry import LineString

def assemble_trace(path_idx, nodes, coords, metric_crs="EPSG:32633"):
    """Turn a node-index walk into a WGS84 LineString trace."""
    pts = [coords[nodes[i]] for i in path_idx]
    line = gpd.GeoSeries([LineString(pts)], crs=metric_crs)
    return line.to_crs("EPSG:4326")              # serialize only at the end

Validation & Testing

A synthetic trajectory set is only usable once it provably matches the source’s routing statistics. Wire these checks into CI so a drifting operator fails the build instead of shipping silently. The same fidelity philosophy underpins the broader realism metrics evaluation used across the architecture.

Row-sum conformance is the cheapest gate: every row of $\mathbf{P}$ must sum to one within $|1 - \sum_j P_{ij}| < 10^{-9}$, or the operator is not a valid probability kernel.
Irreducibility is confirmed by a single strongly connected component over the node set; more than one component means some origins can never reach some destinations.
Origin-destination parity compares the synthetic OD matrix and per-edge flow distribution against the source using KL divergence for transition alignment and Earth Mover’s Distance for spatial flow matching.
Stationary-distribution match checks that the chain’s $\boldsymbol{\pi}$ tracks observed node-occupancy frequencies, and Dynamic Time Warping scores individual trajectory shape similarity.
Deterministic seeding is itself a test: identical seed and inputs must yield identical traces on every host.

python
import numpy as np
from scipy.sparse import csr_matrix

def assert_row_stochastic(P: csr_matrix, tol: float = 1e-9) -> None:
    """CI gate: P must be a valid row-stochastic operator."""
    sums = np.asarray(P.sum(axis=1)).flatten()
    assert np.all(np.abs(sums - 1.0) < tol), "Non-stochastic rows: smoothing or normalization bug"

def assert_irreducible(graph) -> None:
    import networkx as nx
    assert nx.number_strongly_connected_components(graph) == 1, \
        "Reducible chain: unreachable states will truncate trajectories"

def assert_reproducible(make_path) -> None:
    assert make_path(seed=7) == make_path(seed=7), \
        "Non-deterministic output: unseeded RNG or set-order leak"

Expected shapes are part of the contract too: trajectory lengths should follow the same distribution as the source rather than collapsing to a single modal length, which is the tell-tale signature of an over-smoothed or near-deterministic matrix.
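The origin-destination parity gate listed above can be sketched with SciPy. This is a minimal illustration: the function names are hypothetical, and the EMD is computed over one-dimensional trip lengths as a stand-in, because a true spatial EMD over an OD matrix would need a ground-distance matrix that this page does not define.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

def row_kl(p_source, p_synth, eps: float = 1e-12) -> float:
    """KL(source || synthetic) for one transition row; eps avoids log(0)."""
    p = np.asarray(p_source, dtype=float) + eps
    q = np.asarray(p_synth, dtype=float) + eps
    return float(entropy(p, q))          # scipy renormalizes p and q internally

def trip_length_emd(source_lengths, synth_lengths) -> float:
    """1-D Earth Mover's Distance between two samples of trip lengths."""
    return float(wasserstein_distance(source_lengths, synth_lengths))

# Identical rows score zero; a flattened synthetic row scores above zero.
print(row_kl([0.9, 0.1], [0.9, 0.1]))
print(row_kl([0.9, 0.1], [0.5, 0.5]))
print(trip_length_emd([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

In CI, these scores would be compared against thresholds calibrated on the source data rather than against fixed constants.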

Performance & Scale Considerations

City- and continental-scale networks break the naive single-process model in two predictable ways, both solvable with the right data structures and seed discipline.

Memory from dense matrices. A dense $|V| \times |V|$ matrix is quadratic in node count and infeasible past a few thousand nodes; a real road network is overwhelmingly sparse. Store $\mathbf{P}$ in CSR/CSC format so memory scales with edge count, not node count squared, and matrix-vector products during stationary-distribution checks stay fast.
Throughput from $O(1)$ sampling and batching. Per-step rng.choice over a row is fine for thousands of agents but dominates runtime at millions. Precompute Walker alias tables per row to make each next-hop draw $O(1)$, and generate agents in vectorized batches rather than a Python-level loop. For independent agent fleets, derive each worker’s seed deterministically from a base seed (seed = base_seed * 1_000_003 + agent_id, masked to 31 bits) so the whole population stays reproducible while agents run in parallel.

python
def agent_seed(base_seed: int, agent_id: int) -> int:
    """Stable per-agent seed: independent agents, globally reproducible fleet."""
    return (base_seed * 1_000_003 + agent_id) & 0x7FFF_FFFF
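An alternative to arithmetic seed derivation is NumPy’s SeedSequence, which NumPy documents as its mechanism for producing independent streams for parallel workers. The sketch below is an assumption about how it could be wired in here; the helper name `agent_rng` is hypothetical.

```python
import numpy as np

def agent_rng(base_seed: int, agent_id: int) -> np.random.Generator:
    """Per-agent generator keyed on (base_seed, agent_id) via SeedSequence."""
    # spawn_key makes this equivalent to the agent_id-th child of the base sequence.
    seq = np.random.SeedSequence(entropy=base_seed, spawn_key=(agent_id,))
    return np.random.default_rng(seq)

# Same (base_seed, agent_id) gives the same stream; a different agent gets a different one.
a = agent_rng(42, 7).integers(0, 2**32, size=4)
b = agent_rng(42, 7).integers(0, 2**32, size=4)
c = agent_rng(42, 8).integers(0, 2**32, size=4)
print(a, b, c)
```

This avoids the collisions that a linear formula can produce when two (base_seed, agent_id) pairs map to the same integer.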

Checkpoint completed agent batches and keep worker functions idempotent so a partial failure re-runs only the affected agents, never the whole fleet. Cache the alias tables alongside the matrix hash so a re-run reuses them instead of rebuilding the sampling structures from scratch.
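The alias tables mentioned above can be built per row with Vose’s variant of Walker’s method. The sketch below is a minimal, unvectorized illustration; the function names `build_alias` and `alias_draw` are hypothetical.

```python
import numpy as np

def build_alias(probs) -> tuple[np.ndarray, np.ndarray]:
    """Build (accept, alias) tables for one categorical row using Vose's method."""
    k = len(probs)
    scaled = np.asarray(probs, dtype=float) * k / np.sum(probs)
    accept = np.ones(k)
    alias = np.arange(k)
    small = [i for i in range(k) if scaled[i] < 1.0]
    large = [i for i in range(k) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s] = scaled[s]            # keep s with this probability
        alias[s] = l                     # otherwise redirect the draw to l
        scaled[l] -= 1.0 - scaled[s]     # l donated mass to fill slot s
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias                 # leftover slots keep accept = 1.0

def alias_draw(accept: np.ndarray, alias: np.ndarray, rng: np.random.Generator) -> int:
    """O(1) draw: pick a slot uniformly, then keep it or take its alias."""
    i = int(rng.integers(len(accept)))
    return i if rng.random() < accept[i] else int(alias[i])

accept, alias = build_alias([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)
draws = [alias_draw(accept, alias, rng) for _ in range(5)]
```

Building the tables costs time linear in the row length, so the method pays off only when each row is sampled many times.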

Failure Modes & Troubleshooting

Absorbing-state truncation

Symptom: Trajectories end abruptly and cluster at a handful of dead-end nodes; length distribution spikes at short values. Root cause: rows that sum to zero because no telemetry was observed leaving those nodes, trapping every walk that enters them. Fix: apply Laplace or Dirichlet smoothing before normalization (Step 2) so every feasible edge carries a small floor probability, and assert irreducibility in CI.

python
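import networkx as nx
# 0/1 mask of feasible edges, in the same sorted node order used to build adj (graph is the Step 1 DiGraph)
feasible = nx.to_scipy_sparse_array(graph, nodelist=sorted(graph.nodes()), weight=None, format="csr")
smoothed = adj + 0.01 * feasible   # floor every feasible transition, not the diagonal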

Topology built in geographic degrees

Symptom: Snapping tolerances and edge weights behave inconsistently across latitudes, and turn penalties skew with location. Root cause: constructing the graph or computing distances in EPSG:4326, where a degree is not a constant ground distance. Fix: reproject the network to a local UTM zone or equal-area CRS at ingestion, build and sample there, and convert traces back to EPSG:4326 only at export.

Deterministic monoculture from over-weighting

Symptom: Nearly every agent follows the same corridor and synthetic edge-load variance is far below the source. Root cause: flow weights so peaked that one destination dominates each row, collapsing the categorical draw toward a single shortest-path. Fix: calibrate against observed route-choice entropy, temper extreme weights, and validate the synthetic OD spread with EMD rather than eyeballing a basemap.
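One way to make “route-choice entropy” and “tempering” concrete is to measure the Shannon entropy of each row and flatten over-peaked rows with a temperature exponent. This is an illustrative sketch under that assumption; the helper names and the temperature parameter are hypothetical, not part of the pipeline above.

```python
import numpy as np

def row_entropy(probs) -> float:
    """Shannon entropy (nats) of one transition row; 0 means a deterministic hop."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def temper(probs, temperature: float) -> np.ndarray:
    """Flatten (T > 1) or sharpen (T < 1) a row by raising it to 1/T and renormalizing."""
    w = np.asarray(probs, dtype=float) ** (1.0 / temperature)
    return w / w.sum()

peaked = np.array([0.98, 0.01, 0.01])
flattened = temper(peaked, temperature=2.0)
print(row_entropy(peaked), row_entropy(flattened))
```

The temperature would be tuned until the synthetic rows’ entropy matches the entropy observed in the source, not chosen by eye.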

Non-reproducible output across hosts

Symptom: Identical configs yield different traces on different machines. Root cause: an unseeded RNG, set/dict iteration order feeding node indexing, or an un-sorted node list. Fix: sort nodes before indexing, propagate one seed into every stochastic call, pin the toolchain, set PYTHONHASHSEED=0, and record a hash of the matrix plus node index in a seed registry.
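Recording a hash of the matrix plus node index could look like the sketch below. The function name `operator_fingerprint` and the idea of storing its output in a seed registry entry are assumptions for illustration; the hash is exact, so any last-bit floating-point difference between hosts changes it.

```python
import hashlib
import json

import numpy as np
from scipy.sparse import csr_matrix

def operator_fingerprint(P: csr_matrix, nodes: list) -> str:
    """SHA-256 over the node order and the canonical CSR arrays of P."""
    P = P.tocsr().copy()
    P.sum_duplicates()
    P.sort_indices()                       # canonical CSR layout
    h = hashlib.sha256()
    h.update(json.dumps([str(n) for n in nodes]).encode("utf-8"))
    # Fix dtype and byte order so the digest does not depend on platform defaults.
    for arr, dtype in ((P.indptr, "<i8"), (P.indices, "<i8"), (P.data, "<f8")):
        h.update(np.ascontiguousarray(arr, dtype=dtype).tobytes())
    return h.hexdigest()

P = csr_matrix(np.array([[0.0, 1.0], [0.5, 0.5]]))
h1 = operator_fingerprint(P, ["a", "b"])
```

Comparing this digest across hosts before sampling separates “the operator differs” from “the sampler differs” when traces diverge.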

Privacy leakage through over-faithful transitions

Symptom: Adversarial linkage tests reconstruct rare real journeys from synthetic traces. Root cause: transition probabilities estimated so tightly from sparse telemetry that a single real trip dominates a row, re-exposing an identifiable path. Fix: apply differential privacy mechanisms by adding calibrated Laplace or Gaussian noise to the empirical transition counts before normalization, track the composed $(\varepsilon, \delta)$ budget, and enforce spatial $k$-anonymity so each synthetic path is indistinguishable from at least $k-1$ alternatives. Confirm utility survives with an EMD check on the OD matrix.
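The noise-before-normalization step can be sketched as follows. This is a minimal illustration, not a complete privacy mechanism: it assumes a dense count matrix, a boolean feasibility mask, and that each real trip has already been clipped so it changes the counts by at most `sensitivity` in total. The function name and parameters are hypothetical.

```python
import numpy as np

def dp_transition_matrix(counts, feasible, epsilon: float, seed: int,
                         sensitivity: float = 1.0) -> np.ndarray:
    """Laplace-perturb transition counts, clip at zero, and row-normalize."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)
    # Noise goes on every feasible edge, including zero-count ones, so the
    # released support does not reveal which edges were actually observed.
    noisy = np.where(feasible, np.clip(counts + noise, 0.0, None), 0.0)
    row_sums = noisy.sum(axis=1, keepdims=True)
    # Rows whose noisy mass clipped to zero fall back to uniform over feasible edges.
    uniform = feasible / np.maximum(feasible.sum(axis=1, keepdims=True), 1)
    return np.where(row_sums > 0, noisy / np.where(row_sums > 0, row_sums, 1.0), uniform)

counts = np.array([[0.0, 5.0, 1.0], [2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
feasible = ~np.eye(3, dtype=bool)          # hypothetical: all non-self transitions allowed
P_dp = dp_transition_matrix(counts, feasible, epsilon=1.0, seed=3)
```

Clipping and normalization are post-processing of the noisy counts, so they do not consume additional privacy budget.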

Frequently Asked Questions

When is a first-order Markov chain too simple for realistic routing?

When route choice depends on where the agent came from — through-traffic continuing straight versus local traffic turning — a memoryless first-order chain cannot reproduce it, and you will see it as turn-ratio statistics that miss the source. Move to a second-order (state = current plus previous node) or variable-order chain, which encodes that history at the cost of a larger, sparser state space. The state-granularity trade-offs for human-scale movement are worked through in simulating pedestrian movement with first-order Markov models.
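A second-order chain can be estimated by counting transitions out of (previous, current) pairs instead of single nodes. The sketch below uses hypothetical toy paths through a shared intersection X to show the history dependence a first-order chain would average away.

```python
from collections import Counter, defaultdict

def second_order_probs(paths: list[list]) -> dict:
    """Estimate P(next | previous, current) from observed node sequences."""
    counts: dict = defaultdict(Counter)
    for path in paths:
        # Slide a window of three consecutive nodes along each path.
        for prev, cur, nxt in zip(path, path[1:], path[2:]):
            counts[(prev, cur)][nxt] += 1
    return {
        state: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for state, ctr in counts.items()
    }

# Traffic arriving at X from A always continues to B; traffic from C splits.
paths = [["A", "X", "B"], ["A", "X", "B"], ["C", "X", "D"], ["C", "X", "B"]]
probs = second_order_probs(paths)
print(probs[("A", "X")], probs[("C", "X")])
```

A first-order model fitted to the same paths would assign X a single row (B with 0.75, D with 0.25) regardless of approach direction.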

How much Laplace smoothing should I add to the transition counts?

Enough to remove absorbing states without drowning the empirical signal. Start with a small additive constant (around 0.01 of a transition) so well-observed rows are barely perturbed while zero-count rows gain a non-absorbing floor. Validate by confirming the stationary distribution still tracks observed node occupancy; if smoothing flattens $\boldsymbol{\pi}$ noticeably, it is too large.

Why store the transition matrix in CSR format instead of a dense array?

A real routing graph is sparse — each intersection connects to a handful of neighbours, not to every node. A dense matrix costs $O(|V|^2)$ memory and is infeasible past a few thousand nodes, while CSR scales with the number of edges and keeps the per-step row lookup and stationary-distribution products fast.

How do I keep a fleet of synthetic agents reproducible across workers?

Derive each agent’s seed deterministically from one base seed, draw from a single seeded generator per agent, sort nodes before indexing, pin library versions, and set PYTHONHASHSEED=0. Independent agents then run in parallel while the merged fleet stays byte-identical for a given base seed.

Physics-Based Path Generation: constrain stochastically routed hops with velocity, acceleration, and turning-radius limits so paths stay physically traversable.
Noise Injection & Stochastic Drift: add realistic GNSS error to clean Markov traces, decoupled from the routing layer.
Temporal Synchronization for Moving Objects: assign dwell times, timestamps, and replay alignment to sampled edge traversals.
Simulating Pedestrian Movement with First-Order Markov Models: state granularity and transition-prior tuning for human-scale mobility.
Privacy-Preserving Generation Frameworks: the noise budgets and $k$-anonymity gates that bound leakage from generated trajectories.