Evaluating Spatial Realism with Wasserstein Distance

A synthetic point set can match the real distribution on every histogram and still place its clusters in the wrong part of the map; Wasserstein (Earth Mover’s) distance is the realism metric that catches that spatial shift because it scores the ground distance mass has to travel, not bin overlap.

Part of Realism Metrics & Evaluation: this page is the geometry-aware distributional check in that cluster’s three-lane evaluation — the lane that sits between topology predicates and semantic land-use plausibility, and the one most often implemented wrong.

Root Cause: Why Histogram Divergences Pass Spatially-Shifted Data

Kullback-Leibler divergence and Jensen-Shannon distance compare probability mass bin by bin, and the Kolmogorov-Smirnov statistic compares cumulative distributions one axis at a time. None of them has any notion of how far apart two bins are on the ground. Two point patterns can share identical marginal histograms on longitude and on latitude yet occupy entirely disjoint regions, because the marginals throw away the joint coupling between the two axes. Translation-invariant summaries fail the same way: a generator that emits the right number of points in the right density bands, but shifts an entire metropolitan cluster 40 km east into farmland, scores a near-perfect KL value on those density-band histograms while producing a dataset that breaks every downstream spatial join.
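
The marginal-versus-joint point can be checked on a toy layout. The sketch below is illustrative only: the coordinates are abstract planar units, and the cluster positions, noise scale, and bin grid are assumptions chosen for the example, not values from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(scale=0.02, size=(500, 2))

# Real: clusters at (0, 0) and (1, 1). Synthetic: clusters at (0, 1) and (1, 0).
real = np.vstack([noise + [0.0, 0.0], noise + [1.0, 1.0]])
synth = np.vstack([noise + [0.0, 1.0], noise + [1.0, 0.0]])

edges = np.linspace(-0.5, 1.5, 41)

# Per-axis (marginal) histograms: compare the bin counts directly.
same_x = np.array_equal(np.histogram(real[:, 0], edges)[0],
                        np.histogram(synth[:, 0], edges)[0])
same_y = np.array_equal(np.histogram(real[:, 1], edges)[0],
                        np.histogram(synth[:, 1], edges)[0])

# Joint 2-D histograms: how much mass sits in cells occupied by both sets?
h_real, _, _ = np.histogram2d(real[:, 0], real[:, 1], bins=[edges, edges])
h_synth, _, _ = np.histogram2d(synth[:, 0], synth[:, 1], bins=[edges, edges])
shared_mass = np.minimum(h_real, h_synth).sum()

print(same_x, same_y, shared_mass)
```

Both marginal comparisons agree bin for bin while the two point sets share no occupied cell, so any metric built only on per-axis histograms scores this pair as identical.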

Wasserstein distance fixes this by operating on the metric space itself. The $p$-Wasserstein distance between two discrete spatial distributions $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ is the cost of the cheapest transport plan that morphs one into the other:

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \sum_{i,j} \gamma_{ij}\, d(x_i, y_j)^p \right)^{1/p}$$

where $\gamma$ is a joint transport plan with non-negative entries satisfying the marginal constraints $\sum_j \gamma_{ij} = a_i$ and $\sum_i \gamma_{ij} = b_j$, and $d(x_i, y_j)$ is the ground distance between coordinate pairs. Because every unit of displaced mass is charged according to how far it moves, $W_p$ is directly sensitive to spatial drift, cluster fragmentation, and administrative-boundary leakage: the failures that matter but are invisible to bin-counting metrics. For spatial realism, $p = 2$ is the working default: it penalizes large geographic displacements quadratically, exposing generators that scatter outlier points far from any valid urban, ecological, or infrastructural zone. This complements rather than replaces the spatial-autocorrelation checks (Moran’s I and friends) the parent cluster also enforces; autocorrelation scores local spacing, $W_2$ scores global placement.
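
For two equal-size point sets with uniform weights, the infimum in the definition is attained by a one-to-one matching, so the exact $W_2$ can be computed as an assignment problem. The sketch below assumes that special case; the helper name exact_w2_uniform is hypothetical, and the coordinates are abstract planar units rather than projected metres.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def exact_w2_uniform(x: np.ndarray, y: np.ndarray) -> float:
    """Exact W2 between two equal-size point sets, each point weighted 1/n."""
    if len(x) != len(y):
        raise ValueError("this shortcut only holds for equal-size sets")
    cost = cdist(x, y, metric="sqeuclidean")      # d(x_i, y_j)^2
    rows, cols = linear_sum_assignment(cost)      # cheapest one-to-one matching
    return float(np.sqrt(cost[rows, cols].mean()))

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
w2_shift = exact_w2_uniform(x, x + np.array([3.0, 4.0]))   # rigid shift of length 5
print(w2_shift)
```

A rigid translation is the cleanest sanity check: the cheapest plan moves every point by the same vector, so the score equals the length of the shift, which is exactly the quantity a bin-overlap metric cannot report.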

The single most common mistake is feeding $d(\cdot,\cdot)$ raw Euclidean distance on unprojected latitude/longitude pairs. A degree of longitude is ~111 km at the equator but ~79 km at 45° latitude, while a degree of latitude stays close to 111 km everywhere, so a ground metric in degrees overweights east-west displacement at mid-to-high latitudes and corrupts the score. This is the same class of coordinate reference system drift the data-contract layer is meant to forbid: the metric must be computed in a locally accurate projected CRS, not in degrees.
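
The latitude dependence is easy to put numbers on. This sketch assumes a spherical Earth with the mean radius below, which is accurate enough to show the effect but is not a substitute for a real projection; the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation, an assumption)

def km_per_degree_lon(lat_deg: float) -> float:
    """Ground length of one degree of longitude at the given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))

for lat in (0.0, 45.0, 52.52, 70.0):
    print(f"lat {lat:5.2f}: {km_per_degree_lon(lat):6.1f} km per degree of longitude")
```

At Berlin's latitude a 0.5° eastward shift is therefore about 34 km on the ground, not the roughly 56 km that a naive degrees-to-kilometres conversion at the equator would suggest.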

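Minimal Reproducer: KL Saturates, W2 Measures

This snippet builds a real cluster and two synthetic copies, one shifted 0.5° east and one shifted 0.3° east. A 2-D histogram KL divergence registers that each copy differs from the original, but the value it reports is fixed by the smoothing constant rather than by the size of the shift, so it cannot tell a near miss from a gross displacement. $W_2$ reports the displacement itself.

python
# reproduce-wasserstein-vs-kl.py  (Python 3.10+, NumPy 1.26, SciPy 1.11+)
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(42)
real = rng.normal(loc=(13.40, 52.52), scale=0.02, size=(2000, 2))   # Berlin-ish
synth = real + np.array([0.50, 0.0])                                  # shifted 0.5deg east
near = real + np.array([0.30, 0.0])                                   # shifted 0.3deg east

# --- Histogram KL on identical bin grids (the misleading metric) ---
edges = [np.linspace(13.0, 14.2, 60), np.linspace(52.3, 52.8, 60)]
eps = 1e-12

def hist_kl(p_pts, q_pts):
    """KL(p || q) between eps-smoothed 2-D histograms on the shared grid."""
    h_p, _, _ = np.histogram2d(*p_pts.T, bins=edges, density=True)
    h_q, _, _ = np.histogram2d(*q_pts.T, bins=edges, density=True)
    return entropy((h_p + eps).ravel(), (h_q + eps).ravel())

print(f"KL, 0.5deg shift: {hist_kl(real, synth):.4f}")   # large, and set by eps
print(f"KL, 0.3deg shift: {hist_kl(real, near):.4f}")    # same to printed precision

The KL value is large (tens of nats), but it carries no distance information: the clouds land in disjoint bins, so every term compares real mass against the eps floor and the total is dictated by eps. Moving the synthetic cluster closer or farther changes nothing until the clouds start to share bins, and on a grid that does not cover both clouds KL is undefined. Either way it never reports the size of the 0.5° gap. $W_2$ will.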

Fix: A Projection-Aware, Sinkhorn-Approximated W2 Gate

Exact optimal transport scales roughly cubically with sample size ($O(N^3)$), so an exact solver becomes impractical past a few thousand points. The production recipe is: project to metres, normalise weights to the probability simplex, then solve with entropic-regularised Sinkhorn iterations, each of which costs $O(N^2)$, that is, linear in the size of the cost matrix. The Python Optimal Transport (POT) library provides the regularised solver; SciPy’s wasserstein_distance is a 1-D-only fallback for univariate spatial attributes, and it returns $W_1$ rather than $W_2$.
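
To make the iteration concrete, here is a minimal log-domain Sinkhorn loop in NumPy and SciPy. It is a teaching sketch under stated assumptions, not POT's implementation: the function name sinkhorn_plan, the fixed iteration count, and the choice to leave the cost matrix unnormalised (so reg is in the same squared units as the cost) are all illustrative.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.spatial.distance import cdist

def sinkhorn_plan(a, b, M, reg, n_iter=500):
    """Entropic-regularised transport plan via alternating dual updates."""
    log_a, log_b = np.log(a), np.log(b)
    f = np.zeros(len(a))            # dual potential for the rows
    g = np.zeros(len(b))            # dual potential for the columns
    for _ in range(n_iter):
        # Rescale rows so they sum to a, then columns so they sum to b.
        f = reg * (log_a - logsumexp((g[None, :] - M) / reg, axis=1))
        g = reg * (log_b - logsumexp((f[:, None] - M) / reg, axis=0))
    return np.exp((f[:, None] + g[None, :] - M) / reg)

rng = np.random.default_rng(2)
x = rng.normal(size=(50, 2))
y = x + np.array([3.0, 4.0])        # rigid shift of length 5
a = np.full(len(x), 1.0 / len(x))   # uniform weights on the simplex
b = np.full(len(y), 1.0 / len(y))
M = cdist(x, y, metric="sqeuclidean")

plan = sinkhorn_plan(a, b, M, reg=0.5)
w2_entropic = float(np.sqrt((plan * M).sum()))
print(w2_entropic)
```

The score lands slightly above the true shift of 5 because the entropic term spreads each point's mass over its neighbours; shrinking reg tightens it at the price of slower convergence. The production version below swaps this loop for POT's solver and adds the projection step.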

python
# spatial_w2.py  (POT 0.9.x, pyproj 3.x, NumPy 1.26)
import numpy as np
import ot                      # POT
from pyproj import Transformer

def to_metric(lonlat: np.ndarray, epsg: int = 3857) -> np.ndarray:
    """Project lon/lat (EPSG:4326) into a metric CRS before any distance math."""
    tf = Transformer.from_crs("EPSG:4326", f"EPSG:{epsg}", always_xy=True)
    x, y = tf.transform(lonlat[:, 0], lonlat[:, 1])
    return np.column_stack([x, y]).astype(np.float64)   # float64: squared metres get large

def spatial_w2(real_lonlat: np.ndarray,
               synth_lonlat: np.ndarray,
               epsg: int = 3857,
               reg: float = 0.05) -> float:
    """Entropic-regularised 2-Wasserstein distance, computed in projected metres.

    reg is relative to the normalised cost (max 1). The 0.05 default is a
    starting point to tune and pin; at 1.0 the plan blurs towards the
    independent coupling and the score loses sensitivity.
    """
    R, S = to_metric(real_lonlat, epsg), to_metric(synth_lonlat, epsg)
    a = np.full(len(R), 1.0 / len(R))      # uniform weights on the simplex
    b = np.full(len(S), 1.0 / len(S))
    M = ot.dist(R, S, metric="sqeuclidean")  # squared ground distance => p=2
    scale = float(M.max())                   # computed once, reused for the rescale
    if scale == 0.0:                         # every point coincides
        return 0.0
    cost = ot.sinkhorn2(a, b, M / scale, reg)   # normalised cost stabilises Sinkhorn
    w2_sq = float(np.squeeze(cost)) * scale     # undo the normalisation: back to m^2
    return float(np.sqrt(max(w2_sq, 0.0)))

For a locally accurate result, swap EPSG:3857 for the relevant UTM zone: 3857 (Web Mercator) is convenient but stretches distances by roughly 1/cos(latitude), about 1.6× at Berlin's latitude. For dense urban work where mass must respect the street network rather than fly over rivers and rail lines, replace ot.dist(...) with a precomputed shortest-path cost matrix from an OpenStreetMap routing graph; pairs separated by an impassable barrier get a very large finite cost so the transport plan cannot route through them (a literal infinity would poison the M.max() normalisation). The same projected-metre discipline applies whether the points came from a Poisson point-process generator or from a Markov-chain routing model emitting trajectory endpoints.
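
Picking the UTM zone can be automated from the data's centroid. The sketch below uses the standard 6° zone arithmetic and the WGS 84 UTM code ranges (326xx north, 327xx south); the function name is hypothetical, and it deliberately ignores the special-case zones around Norway and Svalbard, so treat it as a default rather than an authority.

```python
import math

def utm_epsg_for(lon: float, lat: float) -> int:
    """EPSG code of the WGS 84 / UTM zone containing (lon, lat)."""
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1   # zones 1..60, 6 degrees wide
    base = 32600 if lat >= 0 else 32700                   # northern vs southern hemisphere
    return base + zone

berlin_epsg = utm_epsg_for(13.40, 52.52)
print(berlin_epsg)
```

The result can be passed as the epsg argument of spatial_w2. A dataset that spans several zones needs one shared CRS for both point sets, so choose the zone from the combined centroid rather than per point.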

Thresholds are not theoretical; calibrate them empirically. Compute intra-real $W_2$ across several pairs of disjoint holdout subsets of the source dataset, ideally drawn at the same sample size the gate will score, because finite-sample $W_2$ shrinks as samples grow. The 95th percentile of those intra-real distances is the upper bound for acceptable synthetic drift. A synthetic batch scoring inside that band is, by this metric, indistinguishable from one more sample of reality.

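python
# calibrate_threshold.py
import numpy as np
from spatial_w2 import spatial_w2   # the gate metric defined in spatial_w2.py

def intra_real_threshold(real_lonlat, n_splits=20, frac=0.5,
                         pct=95, seed=0) -> float:
    """pct-th percentile of W2 between disjoint random subsets of the real data."""
    if not 0.0 < frac <= 0.5:
        raise ValueError("frac must be in (0, 0.5] so the two subsets stay disjoint")
    rng = np.random.default_rng(seed)
    n = len(real_lonlat)
    k = int(n * frac)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        a, b = real_lonlat[idx[:k]], real_lonlat[idx[k:2*k]]   # disjoint by construction
        scores.append(spatial_w2(a, b))
    return float(np.percentile(scores, pct))   # versioned, stored with the contract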

Verification: A CI Gate That Blocks Spatial Regressions

Wire the calibrated threshold into the CI/CD pipeline gating as a hard assertion, the same way topology and contract checks are enforced. Score a stratified sample (5,000–10,000 points per generation run is a sensible budget), fail the build if $W_2$ exceeds the threshold, and persist the transport plan for triage when it breaches.

python
# test_spatial_realism.py  (pytest)
import numpy as np
from spatial_w2 import spatial_w2
from calibrate_threshold import intra_real_threshold

def test_w2_within_intra_real_band():
    real = np.load("fixtures/real_holdout.npy")     # lon/lat, EPSG:4326
    synth = np.load("artifacts/generated_batch.npy")
    threshold = intra_real_threshold(real, seed=20250304)   # pinned seed
    score = spatial_w2(real, synth)
    assert score <= threshold, (
        f"Spatial realism regression: W2={score:.1f} m > "
        f"intra-real 95th pct={threshold:.1f} m"
    )

Treat the seed and split count as versioned inputs stored next to the data contract, so two CI runs on the same artifact are byte-comparable. A useful secondary guard is run-to-run variance: fail if $W_2$ swings more than ~15% across consecutive generation runs, which usually signals an unstable generator rather than a single bad batch.

$W_2$ doubles as a privacy signal. An excessively low score, lower than two independent real samples achieve against each other, often means memorisation: the generator has re-emitted real training coordinates, which can leak sensitive residential or critical-infrastructure locations and undermine the differential privacy budget the pipeline is supposed to honour. So the gate is two-sided: reject batches that are too far (poor utility) and batches that are suspiciously close (privacy risk).
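
A two-sided gate needs a lower bound as well as an upper one. The text above does not say how to pick it, so the sketch below makes an explicit assumption: it takes the 5th percentile of the same intra-real scores used for the upper bound. The function names and the label strings are hypothetical.

```python
import numpy as np

def two_sided_band(intra_real_scores, low_pct=5.0, high_pct=95.0):
    """Lower and upper W2 bounds from intra-real calibration scores.

    The 5th-percentile lower bound is an assumption for illustration,
    not a value prescribed by the calibration procedure.
    """
    lo, hi = np.percentile(np.asarray(intra_real_scores, dtype=float),
                           [low_pct, high_pct])
    return float(lo), float(hi)

def classify_batch(score: float, lo: float, hi: float) -> str:
    """Label a synthetic batch against the calibrated band."""
    if score > hi:
        return "too_far"      # poor utility: spatial drift
    if score < lo:
        return "too_close"    # privacy risk: possible memorisation
    return "ok"

lo, hi = two_sided_band(np.arange(1, 101))   # stand-in scores, not real measurements
print(lo, hi, classify_batch(50.0, lo, hi))
```

In a test, the two failure labels would map to separate assertion messages so that triage can tell a utility regression from a leakage suspicion at a glance.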

Edge Cases & Gotchas

Antimeridian geometries. Points straddling ±180° longitude (the Pacific, Fiji, Chukotka) wrap, so a naive ground metric treats two neighbours as half a world apart and the transport plan routes mass the long way round. Project into a CRS centred on the data’s own region, or shift longitudes into a continuous [0, 360) frame before projecting, so the antimeridian is not a seam in the cost matrix.
Null Island and silent (0, 0) sentinels. Missing or failed-geocode coordinates frequently collapse to (0.0, 0.0) off the Gulf of Guinea. Even a handful of these drags $W_2$ up enormously because they sit thousands of kilometres from any real cluster, masquerading as catastrophic drift. Filter exact-zero pairs (and other known sentinel values) at ingestion before scoring, and assert the count you dropped.
Floating-point precision at datum boundaries and after rescaling. Two effects compound here: sqeuclidean distances in projected metres exceed the roughly seven significant digits float32 can hold over continental extents (use float64), and the M /= M.max() normalisation that stabilises Sinkhorn must be undone by the same scale factor when you return the metres value; a mismatched rescale produces a dimensionless number that silently passes or fails the gate. Also pin the Sinkhorn regularisation reg: too small and the kernel underflows so the iterations return NaN, too large and the blur hides real displacement.
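
The first two gotchas can be handled by a small pre-scoring step. The sketch below is one possible shape for it, with hypothetical function names; it treats exact (0.0, 0.0) pairs and non-finite rows as sentinels, which is an assumption to adapt to whatever sentinel values your ingestion layer actually emits.

```python
import numpy as np

def drop_sentinels(lonlat, sentinels=((0.0, 0.0),)):
    """Remove known sentinel coordinates and non-finite rows; report how many."""
    pts = np.asarray(lonlat, dtype=np.float64)
    bad = ~np.isfinite(pts).all(axis=1)
    for s_lon, s_lat in sentinels:
        bad |= (pts[:, 0] == s_lon) & (pts[:, 1] == s_lat)
    return pts[~bad], int(bad.sum())

def unwrap_longitudes(lonlat):
    """Shift longitudes into [0, 360) so the antimeridian is no longer a seam."""
    pts = np.array(lonlat, dtype=np.float64, copy=True)
    pts[:, 0] = np.mod(pts[:, 0], 360.0)
    return pts

raw = np.array([[179.5, -17.0], [-179.5, -17.0], [0.0, 0.0], [np.nan, 1.0]])
clean, n_dropped = drop_sentinels(raw)
unwrapped = unwrap_longitudes(clean)
lon_gap = abs(unwrapped[0, 0] - unwrapped[1, 0])
print(n_dropped, lon_gap)
```

After unwrapping, the two Fiji-side points sit one degree apart instead of 359. Whether a given projection accepts longitudes above 180 depends on the CRS, so check that before feeding unwrapped values to the projection step; a CRS centred on the data's own region avoids the question.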

Up to the parent cluster: Realism Metrics & Evaluation — the geometric, statistical, and semantic lanes this metric plugs into.
CI/CD Integration for Spatial Data — where the $W_2$ assertion runs as a build-blocking gate.
Scoping Rules & Data Contracts — the CRS contract that guarantees the projected ground metric is correct.
Privacy-Preserving Generation Frameworks — reading a suspiciously low $W_2$ as a memorisation/leakage signal.