Realism Metrics & Evaluation

Quantifying realism in synthetic spatial data means scoring three independent things at once — whether the geometry is well-formed, whether the numbers are distributed like the reference, and whether the result is plausible as a place — and refusing to promote any artifact that fails on any axis. This page extends Synthetic Spatial Data Architecture & Fundamentals, turning its deterministic-generation principle into a measurable, gate-enforced definition of “realistic enough to ship.” For GIS developers, ML engineers, QA teams, and compliance engineers, the goal is the same: a realism score that is computed automatically, compared against a versioned threshold, and blocks a release when it regresses — never a number eyeballed once and forgotten.

Problem Framing: A Single Score Hides the Failure That Matters

The tempting mistake is to collapse realism into one scalar — a “fidelity percentage” on a dashboard. It fails because the three ways synthetic spatial data goes wrong have nothing in common. A generator can emit geometry with perfect marginal distributions and still produce self-intersecting rings that crash a downstream ST_Union. It can produce flawless topology whose nearest-neighbour spacing is far too regular, destroying the spatial autocorrelation an ML model was meant to learn. It can match every distribution and still place a residential land-use polygon in the middle of a runway, because the attribute model has no spatial conditioning.

A single averaged score lets a strong dimension mask a catastrophic one — the metric masking problem. If geometric validity is 100% and semantic plausibility is 40%, an average of 70% looks acceptable while the dataset is unusable. The evaluation strategy this page builds keeps the three dimensions strictly separate, computes each against its own tolerance, and routes the first failure to a hard stop. Crucially, each dimension’s failure points at a different root cause, so distinguishing them at evaluation time is what lets a team fix the right component instead of tuning hyperparameters at random.
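
The masking effect is easy to reproduce with a few lines of arithmetic. The sketch below uses hypothetical scores and thresholds (the dimension names, values, and both function names are illustrative, not part of any contract) to show an averaged gate passing the same artifact that a per-dimension gate rejects.

```python
# Hypothetical per-dimension scores and thresholds, chosen to mirror the 100% / 40% example.
scores = {"geometric": 1.00, "statistical": 0.70, "semantic": 0.40}
thresholds = {"geometric": 0.99, "statistical": 0.60, "semantic": 0.80}


def averaged_gate(scores: dict, cutoff: float) -> bool:
    """Anti-pattern: one scalar lets a strong dimension hide a weak one."""
    return sum(scores.values()) / len(scores) >= cutoff


def per_dimension_gate(scores: dict, thresholds: dict) -> list[str]:
    """Return the dimensions that breach their own threshold (empty list means pass)."""
    return [name for name, value in scores.items() if value < thresholds[name]]


print(averaged_gate(scores, cutoff=0.65))      # True: the average hides the failure
print(per_dimension_gate(scores, thresholds))  # ['semantic']
```

The per-dimension gate also names the failing axis, which is what routes the fix to the right component.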

Prerequisites & Toolchain

Realism scores are only comparable across runs when the stack that computes them is pinned. A minor bump in a spatial-statistics library can change a Moran’s I permutation result; a PROJ database skew shifts coordinates before a single metric is computed. Pin the evaluation environment to the same versions the scoping rules and data contracts require of the generator, and run both in the same container.

Component	Version	Role in evaluation
GeoPandas	0.14.x	Vectorized geometry I/O, CRS handling, validity predicates
Shapely	2.x	`is_valid_reason`, `make_valid`, topology predicates
pyproj	3.x	Explicit CRS objects; rejects implicit `EPSG:4326` fallback
libpysal	4.11.x	Spatial weights (Queen, KNN) for autocorrelation
esda	2.5.x	Global and Local Moran’s I, permutation inference
SciPy	1.11+	`wasserstein_distance`, `ks_2samp` for distributional tests
POT	0.9.x	Multivariate optimal transport for embedded feature spaces


# requirements.txt — exact pins so two runs are byte-comparable
geopandas==0.14.4
shapely==2.0.4
pyproj==3.6.1
libpysal==4.11.0
esda==2.5.1
scipy==1.11.4
pot==0.9.3

Set PYTHONHASHSEED=0 and a fixed permutation seed for every metric that uses Monte-Carlo inference, or autocorrelation p-values will drift run-to-run and your tolerance bands will flap. Reference implementation details for spatial weights and autocorrelation are in the official PySAL ESDA documentation.
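
One way to make the pinning enforceable is to check the installed versions at the start of every evaluation job. The following is a minimal sketch using only the standard library; `pin_mismatches` is a hypothetical helper, and it assumes the distribution names in `PINS` resolve the same way they do in `requirements.txt`.

```python
from importlib import metadata

# Pins copied from the requirements.txt above.
PINS = {
    "geopandas": "0.14.4",
    "shapely": "2.0.4",
    "pyproj": "3.6.1",
    "libpysal": "4.11.0",
    "esda": "2.5.1",
    "scipy": "1.11.4",
    "pot": "0.9.3",
}


def pin_mismatches(pins: dict, installed_version=metadata.version) -> dict:
    """Map each drifted package to (expected, found); an empty dict means the stack matches."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = installed_version(name)
        except (metadata.PackageNotFoundError, KeyError):
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `pin_mismatches(PINS)` in the container and failing the job on a non-empty result keeps two runs comparable; the `installed_version` parameter exists only so the check can be exercised without the real packages.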

Core Concept: Three Orthogonal Dimensions of Realism

Production evaluation decomposes realism into three dimensions that are scored, thresholded, and failure-routed independently. Treating them as orthogonal is what prevents metric masking.

Geometric & topological fidelity asks whether the geometry is well-formed at all: coordinate precision, CRS identity, polygon validity, ring orientation, and indexing efficiency. Its metrics are blunt and binary-leaning — valid-geometry ratio, self-intersection count, sliver-polygon count, topology rule violations. This dimension is a precondition: there is no point computing a distribution over invalid geometry.

Statistical & distributional alignment asks whether the numbers behave like the reference: marginal and conditional distributions, spatial autocorrelation, and feature covariance. Its core indicators are Global Moran’s I for clustering structure, the Wasserstein distance for distribution shape, Kullback–Leibler divergence for categorical mixes, and empirical variogram matching for the correlation range.

Global Moran’s I summarizes whether like values cluster in space:

$$
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}
$$

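where $n$ is the number of features, $x_i$ is the attribute value at feature $i$, $\bar{x}$ is its mean, and $w_{ij}$ is the spatial weight between features $i$ and $j$. A synthetic dataset whose $I$ diverges from the reference has the wrong clustering structure even when every marginal histogram matches.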
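
To make the formula concrete, here is a small pure-Python sketch that evaluates it directly on a four-cell chain with binary weights. The helper `morans_i` and the toy data are illustrative assumptions; production scoring should use the esda implementation. The two inputs share the same histogram (two ones, two zeros) but have opposite spatial structure.

```python
def morans_i(values, weights):
    """Global Moran's I for a dense weights matrix (list of lists) with a zero diagonal."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    w_sum = sum(sum(row) for row in weights)            # sum of all w_ij
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))    # weighted cross-products
    return (n / w_sum) * (cross / sum(d * d for d in z))


# Four cells in a row; each cell is a neighbour of the cell(s) next to it.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 0, 0], chain)    # like values sit together: I = 1/3
alternating = morans_i([1, 0, 1, 0], chain)  # like values avoid each other: I = -1
```

Identical marginals, opposite sign of $I$: a histogram comparison alone could not tell these two layouts apart.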

Semantic & contextual plausibility asks whether the result is believable as a place: land-use realism, network connectivity, attribute realism, and spatial relationships. It is scored with categorical entropy, adjacency-matrix similarity against a reference land-use mix, and rule-based constraint satisfaction (no residential polygon inside an airport, no road segment terminating in water).

The three are evaluated in cost order — cheap geometric checks first, expensive transport-based distributional checks next, rule-based semantic checks last — so a malformed artifact is rejected before any expensive metric runs.
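
The cost ordering can be expressed as a short-circuiting runner. This is a sketch under stated assumptions: `evaluate_in_cost_order` is a hypothetical helper, and each stage is assumed to be a zero-argument callable returning `(passed, details)`.

```python
def evaluate_in_cost_order(stages):
    """Run (name, check) stages in order and stop at the first failure.

    Later, more expensive stages are never invoked once a stage fails.
    """
    report = {}
    for name, check in stages:
        passed, details = check()
        report[name] = {"passed": passed, "details": details}
        if not passed:
            report["stopped_at"] = name
            break
    return report


# Stub checks that record which stages actually ran.
calls = []


def make_stage(name, passed):
    def check():
        calls.append(name)
        return passed, {}
    return name, check


result = evaluate_in_cost_order([
    make_stage("geometric", False),      # cheap check fails
    make_stage("distributional", True),  # never runs
    make_stage("semantic", True),        # never runs
])
```

In a real pipeline the stubs would wrap the scoring functions from the steps below together with their contract thresholds.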

Step-by-Step Implementation

The following steps assemble a complete, ordered evaluation. Each block is copy-pasteable into the pinned container and is introduced by the role it plays.

Step 1: Score geometric and topological fidelity

Run this first; it is cheap, deterministic, and a precondition for the rest. Capture the reason for each invalid geometry rather than a binary pass/fail, so a diagnostic points at the generator bug.

python
import geopandas as gpd
from shapely.validation import explain_validity


def geometric_fidelity(gdf: gpd.GeoDataFrame, expected_crs: str = "EPSG:4326") -> dict:
    """Return per-dimension geometric realism scores; never average them away."""
    if gdf.crs is None or gdf.crs.to_string() != expected_crs:
        raise ValueError(f"CRS contract violation: expected {expected_crs}, got {gdf.crs}")

    valid_mask = gdf.geometry.is_valid
    invalid = gdf.loc[~valid_mask]
    reasons = invalid.geometry.apply(explain_validity).value_counts().to_dict()

    # Slivers: tiny-area polygons that serialize cleanly but corrupt aggregations.
    areas = gdf.geometry.area
    sliver_count = int((areas < areas.median() * 1e-4).sum())

    return {
        "valid_geometry_ratio": round(float(valid_mask.mean()), 6),
        "invalid_count": int((~valid_mask).sum()),
        "invalid_reasons": reasons,
        "sliver_count": sliver_count,
    }

Step 2: Score spatial autocorrelation

Spatial point processes and polygon attributes carry clustering and dispersion structure that must be preserved to keep analytical utility. Compute Global Moran’s I on a row-standardized weights matrix for both datasets and report the delta — the absolute gap is the realism signal, not either value alone.

python
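import geopandas as gpd
import libpysal
import numpy as np
from esda import Moran


def autocorrelation_delta(synthetic: gpd.GeoDataFrame, reference: gpd.GeoDataFrame,
                          attribute_col: str, seed: int = 0) -> dict:
    """Compare Global Moran's I between synthetic and reference on aligned CRS."""
    if synthetic.crs != reference.crs:
        synthetic = synthetic.to_crs(reference.crs)

    def morans(gdf):
        # Drop missing rows before building weights so the value vector and
        # the weights graph stay aligned row for row.
        gdf = gdf.dropna(subset=[attribute_col]).reset_index(drop=True)
        w = libpysal.weights.Queen.from_dataframe(gdf, silence_warnings=True)
        w.transform = "R"  # row-standardize for comparability
        # Global Moran in the pinned esda 2.5.x takes no seed argument; it draws
        # permutations from NumPy's global RNG, so seed that for a stable p_sim.
        np.random.seed(seed)
        return Moran(gdf[attribute_col].to_numpy(dtype=float), w, permutations=999)

    m_ref, m_syn = morans(reference), morans(synthetic)
    return {
        "reference_I": round(float(m_ref.I), 4),
        "synthetic_I": round(float(m_syn.I), 4),
        "delta_I": round(float(abs(m_ref.I - m_syn.I)), 4),
        "p_value_ref": float(m_ref.p_sim),
        "p_value_syn": float(m_syn.p_sim),
    }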

Step 3: Score distributional fidelity with optimal transport

Marginal histograms are insufficient; joint and conditional distributions must align too. The Wasserstein distance (Earth Mover’s Distance) is geometry-aware — it respects the topology of the feature space — which makes it more robust than KL divergence for bounded or multimodal spatial attributes. For univariate features it has a closed form over the inverse CDFs:

$$
W_1(\mu, \nu) = \int_{0}^{1} \left| F_\mu^{-1}(t) - F_\nu^{-1}(t) \right| \, dt
$$
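
For two empirical samples of equal size, the inverse CDFs are step functions over the sorted values, so the integral reduces to the mean absolute gap between order statistics. The sketch below checks that reduction in pure Python; `wasserstein_1d` is an illustrative helper that assumes equal sample sizes and is not a replacement for `scipy.stats.wasserstein_distance`.

```python
def wasserstein_1d(a, b):
    """W1 between two equal-size samples: mean |a_(i) - b_(i)| over sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


reference = [i / 100 for i in range(101)]   # evenly spread over [0, 1]
shifted = [v + 0.5 for v in reference]      # same shape, moved right by 0.5

distance = wasserstein_1d(shifted, reference)  # a pure shift costs exactly the shift: 0.5
```

A rigid shift of 0.5 yields a transport cost of 0.5, which matches the intuition that every unit of mass moves by the same amount.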

The deeper mathematical treatment and pipeline-integration patterns live in Evaluating Spatial Realism with Wasserstein Distance; the block below is the operational score. Always normalize features to a common scale before computing transport cost, or a large-magnitude unit (metres) will dominate a small one (a density ratio).

python
import geopandas as gpd
import numpy as np
from scipy.stats import wasserstein_distance


def distributional_fidelity(synthetic: gpd.GeoDataFrame, reference: gpd.GeoDataFrame,
                            cols: list[str]) -> dict:
    """Per-feature normalized Wasserstein distance; small is realistic."""
    scores = {}
    for col in cols:
        s = synthetic[col].dropna().to_numpy(dtype=float)
        r = reference[col].dropna().to_numpy(dtype=float)
        # Normalize to the reference range so distances are unit-free and comparable.
        lo, hi = r.min(), r.max()
        scale = (hi - lo) or 1.0
        scores[col] = round(wasserstein_distance((s - lo) / scale, (r - lo) / scale), 5)
    return scores

Step 4: Score semantic plausibility against a land-use adjacency reference

The semantic score compares the synthetic land-use adjacency structure against a reference mix and applies hard rule constraints. A normalized adjacency-matrix distance plus a count of constraint violations gives a score that catches “statistically fine but geographically impossible” output.

python
import geopandas as gpd
import numpy as np


def semantic_plausibility(synthetic: gpd.GeoDataFrame, reference_adjacency: np.ndarray,
                          class_col: str = "land_use") -> dict:
    """Adjacency-similarity score against a reference land-use mix.

    reference_adjacency must be indexed by the same sorted class list as the
    synthetic data. Hard-rule violations are counted separately.
    """
    classes = sorted(synthetic[class_col].unique())
    if reference_adjacency.shape != (len(classes), len(classes)):
        raise ValueError("reference adjacency does not match the synthetic class set")
    idx = {c: i for i, c in enumerate(classes)}
    codes = synthetic[class_col].map(idx).to_numpy()  # class code per row position
    adj = np.zeros((len(classes), len(classes)))

    # sindex.query returns row positions, so index the positional codes array.
    sindex = synthetic.sindex
    for a, geom in zip(codes, synthetic.geometry):
        for nb in sindex.query(geom, predicate="touches"):
            adj[a, codes[nb]] += 1
    adj = adj / (adj.sum() or 1.0)

    # Frobenius distance to the reference adjacency mix (lower is more plausible).
    ref = reference_adjacency / (reference_adjacency.sum() or 1.0)
    adjacency_distance = float(np.linalg.norm(adj - ref))
    return {"classes": classes, "adjacency_distance": round(adjacency_distance, 5)}
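
The hard-rule half of the semantic score can be sketched as a lookup against a table of forbidden relations. This example assumes an upstream spatial join has already produced `(subject_class, relation, object_class)` triples; the `FORBIDDEN` table, the class names, and `rule_violations` are hypothetical and only mirror the two rules mentioned earlier.

```python
from collections import Counter

# Hypothetical rule table: relations that must never occur in a plausible dataset.
FORBIDDEN = {
    ("residential", "within", "airport"),
    ("road_endpoint", "within", "water"),
}


def rule_violations(relations, forbidden=FORBIDDEN) -> dict:
    """Count observed relations that match a forbidden rule, grouped by rule."""
    hits = Counter(rel for rel in relations if rel in forbidden)
    return {
        "violation_count": sum(hits.values()),
        "by_rule": {" ".join(rule): count for rule, count in hits.items()},
    }


# Triples as an upstream spatial join might emit them (illustrative data).
observed = [
    ("residential", "within", "airport"),
    ("residential", "within", "suburb"),
    ("road_endpoint", "within", "water"),
    ("residential", "within", "airport"),
]
violations = rule_violations(observed)
```

Reporting the count per rule, rather than a single total, keeps the diagnostic pointed at the specific constraint the attribute model is breaking.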

Step 5: Assemble a versioned realism report

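Serialize the three dimensions side by side, never averaged, alongside the artifact. Stamp the report with the contract version and a truncated SHA-256 of its serialized body (stored as config_hash) so that any later change to the scores or the contract version is detectable and the evaluation stays reproducible and auditable.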

python
import hashlib
import json


def realism_report(geometric: dict, autocorr: dict, distributional: dict,
                   semantic: dict, contract_version: str) -> dict:
    body = {
        "contract_version": contract_version,
        "geometric": geometric,
        "statistical": {"autocorrelation": autocorr, "distributional": distributional},
        "semantic": semantic,
    }
    body["config_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()[:16]
    return body

Validation & Testing

The evaluator is itself code and must be tested with deliberately broken fixtures — a metric that has never been seen to fail is one you cannot trust. The most valuable tests prove that each dimension fires on its own kind of corruption and only its own kind.

python
import geopandas as gpd
import numpy as np
from shapely.geometry import Polygon


def test_geometric_dimension_flags_self_intersection():
    bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])  # serializes cleanly
    gdf = gpd.GeoDataFrame(geometry=[bowtie], crs="EPSG:4326")
    score = geometric_fidelity(gdf)
    assert score["valid_geometry_ratio"] < 1.0
    assert score["invalid_count"] == 1


def test_distributional_dimension_separates_shifted_distributions():
    ref = gpd.GeoDataFrame({"v": np.linspace(0, 1, 500)})
    syn = gpd.GeoDataFrame({"v": np.linspace(0.5, 1.5, 500)})  # shifted right
    score = distributional_fidelity(syn, ref, ["v"])
    assert score["v"] > 0.4  # large transport cost => low realism

Wire these into the same pinned container as the generator. Assert on the shape of each result, not a single number: the geometric score returns a dict of named sub-scores, the distributional score returns one value per feature, and the autocorrelation score returns a delta plus two p-values. Keep every tolerance in the versioned contract — morans_i_tolerance, wasserstein_max, min_valid_ratio — never hard-coded in the evaluator, so QA can tighten a threshold without a code change. This is the same enforcement point the CI/CD integration fidelity gate calls.
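
A gate that reads its tolerances from the contract might look like the sketch below. It assumes the report layout produced in Step 5 and a contract loaded as a plain dict; the three key names come from the text above, while the numeric values and the `gate` function itself are hypothetical.

```python
def gate(report: dict, contract: dict) -> list[str]:
    """Return a list of threshold breaches; an empty list means the artifact may be promoted."""
    breaches = []

    ratio = report["geometric"]["valid_geometry_ratio"]
    if ratio < contract["min_valid_ratio"]:
        breaches.append(f"geometric: valid_geometry_ratio {ratio} < min_valid_ratio")

    delta = report["statistical"]["autocorrelation"]["delta_I"]
    if delta > contract["morans_i_tolerance"]:
        breaches.append(f"statistical: delta_I {delta} > morans_i_tolerance")

    # One Wasserstein score per feature, each checked against the same ceiling.
    for col, dist in report["statistical"]["distributional"].items():
        if dist > contract["wasserstein_max"]:
            breaches.append(f"statistical: {col} distance {dist} > wasserstein_max")

    return breaches


# Illustrative contract values only; real thresholds live in the versioned contract.
contract = {"min_valid_ratio": 0.999, "morans_i_tolerance": 0.05, "wasserstein_max": 0.10}
report = {
    "geometric": {"valid_geometry_ratio": 1.0},
    "statistical": {
        "autocorrelation": {"delta_I": 0.02},
        "distributional": {"population": 0.04, "height_m": 0.31},
    },
}
breaches = gate(report, contract)
```

Because the evaluator only compares against values it is handed, tightening `wasserstein_max` is a contract edit and the code stays untouched.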

Performance & Scale Considerations

Realism scoring is the most compute-heavy stage in the pipeline because optimal transport and permutation inference both scale poorly. Three constraints dominate on large grids:

Permutation cost. Moran’s I with 999 permutations on a continental polygon set can take minutes. Compute it per spatial tile and aggregate, keeping a halo of neighbouring features around each tile so the weights graph does not see false edges at partition seams. The non-blocking patterns in async execution for large grids apply directly.
Transport memory. Multivariate Wasserstein with POT builds an $n \times m$ cost matrix that exhausts runner memory above a few tens of thousands of points. Subsample to a fixed, seeded reference size, or switch to the Sinkhorn (entropy-regularized) approximation for large feature spaces.
Weights construction. Building Queen weights is expensive because it scans every geometry for shared edges and vertices; cache the weights object keyed by a geometry hash so repeated metric runs in the same job do not rebuild it. Read only the columns each dimension needs (columns=["geometry", attribute_col]) to keep the memory footprint of the GeoDataFrame small.
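
The transport-memory constraint is simple arithmetic, and the seeded subsample is a few lines of standard library code. The sketch below assumes a dense float64 cost matrix and uses two hypothetical helpers, `cost_matrix_gib` and `seeded_subsample`; it does not call POT.

```python
import random


def cost_matrix_gib(n: int, m: int, bytes_per_cell: int = 8) -> float:
    """Memory for a dense n x m cost matrix in GiB (float64 cells by default)."""
    return n * m * bytes_per_cell / 2**30


def seeded_subsample(rows, size: int, seed: int = 0) -> list:
    """Return a reproducible random subset of at most `size` rows."""
    rows = list(rows)
    if len(rows) <= size:
        return rows
    # A private Random instance keeps the draw independent of global RNG state.
    return random.Random(seed).sample(rows, size)


full = cost_matrix_gib(50_000, 50_000)     # roughly 18.6 GiB for 50k x 50k points
reduced = cost_matrix_gib(5_000, 5_000)    # 100x fewer cells after subsampling
subset = seeded_subsample(range(50_000), 5_000, seed=42)
```

Keeping the seed in the contract alongside the sample size means the subsample, and therefore the score, is the same on every run.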

The highest-leverage optimization is ordering: run the cheap geometric dimension first and short-circuit on failure, so the expensive transport and permutation work never runs on an artifact that was already going to be rejected.

Failure Modes & Troubleshooting

Metric masking by averaging

Symptom: a dataset with a healthy aggregate “realism %” turns out to be unusable downstream. Root cause: the three dimensions were averaged into one number, letting a perfect geometric score hide a collapsed semantic score. Remediation: never average across dimensions; gate each one against its own threshold and fail on the first breach. Report the three sub-scores separately in the manifest.

Autocorrelation p-values flapping between runs

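Symptom: the autocorrelation gate passes and fails intermittently with no parameter change. Root cause: for a fixed dataset the observed Moran’s I is deterministic, but the pseudo p-value comes from random permutations, so an unseeded run draws a different reference distribution each time; if the generator is also unseeded, the delta itself moves, and small extents amplify the sampling variance. Remediation: fix the generator and permutation seeds, raise the minimum feature count in the contract, and set the tolerance band to the empirically observed run-to-run variance rather than an aspirational value.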

Unit dominance in multivariate transport

Symptom: the distributional score is driven entirely by one feature and ignores the rest. Root cause: features were fed to the transport computation on their native scales, so a metres-valued column dwarfs a unit-interval density. Remediation: normalize every feature to the reference range (as in Step 3) before computing transport cost, or standardize to zero mean and unit variance.

Privacy noise inflating distributional distance

Symptom: a privacy-compliant dataset fails the Wasserstein threshold even though it is analytically fit for purpose. Root cause: the differential privacy mechanisms injected calibrated noise that legitimately widens the distribution. Remediation: widen the distributional tolerance band by an amount derived from the privacy budget so the gate accounts for the noise the budget mandates, rather than treating it as a regression.
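
As a worked illustration of deriving the widening from the budget, suppose the mechanism is additive Laplace noise with scale `sensitivity / epsilon` applied to a single attribute. That is an assumption; the text above does not name a mechanism. Adding independent noise $N$ moves the distribution by at most $E|N|$ in Wasserstein-1 terms, and for Laplace noise $E|N|$ equals the scale. Dividing by the reference range puts the allowance on the same normalized scale as Step 3. The function name and parameters are hypothetical.

```python
def widened_wasserstein_max(base_max: float, sensitivity: float,
                            epsilon: float, ref_range: float) -> float:
    """Add the expected Laplace-noise displacement to the base tolerance.

    Assumes additive Laplace noise with scale sensitivity / epsilon, and a
    distance normalized by the reference range.
    """
    if epsilon <= 0 or ref_range <= 0:
        raise ValueError("epsilon and ref_range must be positive")
    laplace_scale = sensitivity / epsilon        # E|N| for Laplace(0, scale)
    return base_max + laplace_scale / ref_range  # upper bound on the added W1


# Illustrative numbers: sensitivity 1.0, epsilon 2.0, attribute range 10.0.
tolerance = widened_wasserstein_max(base_max=0.05, sensitivity=1.0,
                                    epsilon=2.0, ref_range=10.0)  # 0.05 + 0.05
```

A smaller epsilon (stronger privacy) widens the band automatically, so the gate tracks the budget instead of being retuned by hand.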

Semantic rules with no spatial conditioning

Symptom: geometry and statistics pass but land-use polygons land in impossible places. Root cause: the attribute-synthesis model assigns classes without conditioning on location, so the adjacency structure is random. Remediation: this is a generator fix, not a threshold fix — add spatial conditioning to the attribute model; the semantic score exists precisely to make this class of bug visible at gate time.

Frequently Asked Questions

Why not reduce realism to a single score?

Because the three failure modes are independent and have different root causes. A topology failure is a correctness bug in the generation algorithm; a Moran’s I gap means the stochastic parameters are miscalibrated; a semantic failure means the attribute model lacks spatial conditioning. Averaging them lets a strong dimension mask a fatal one and points the team at the wrong fix.

When should I prefer Wasserstein distance over KL divergence?

For bounded, multimodal, or continuous spatial attributes where the geometry of the distribution matters. Wasserstein respects the metric of the feature space and stays finite when supports do not overlap, whereas KL divergence diverges to infinity on disjoint supports and ignores how far apart the masses are. KL remains convenient for purely categorical mixes such as land-use class proportions.
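
The difference shows up on a toy histogram. In the sketch below, all mass sits in one bin and the comparison histogram puts it either one bin away or four bins away; both helpers are illustrative pure-Python implementations that assume unit-spaced bins.

```python
import math


def kl_divergence(p, q) -> float:
    """KL(p || q) for discrete distributions; infinite when q is 0 where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def wasserstein_hist(p, q) -> float:
    """W1 between histograms on unit-spaced bins: sum of |CDF_p - CDF_q|."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


p = [1, 0, 0, 0, 0]
near = [0, 1, 0, 0, 0]  # mass one bin away
far = [0, 0, 0, 0, 1]   # mass four bins away

kl_near, kl_far = kl_divergence(p, near), kl_divergence(p, far)    # both infinite
w_near, w_far = wasserstein_hist(p, near), wasserstein_hist(p, far)  # 1.0 and 4.0
```

KL cannot rank the two cases because both are infinite, while the transport cost grows with how far the mass has to move.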

How do I keep thresholds from going stale?

Calibrate them against a baseline reference dataset, store them in the versioned contract, and track metric drift across releases with a rolling window. Use statistical process control on the per-dimension scores to catch gradual degradation, and tighten or loosen a threshold by editing the contract — never by changing evaluator code.
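
A rolling-window control check can be as small as the sketch below. It assumes a list of past per-release scores for one metric and applies the common mean plus or minus three standard deviations rule; the window size, the sigma multiplier, and the name `out_of_control` are illustrative choices rather than contract values.

```python
from statistics import mean, stdev


def out_of_control(history: list[float], latest: float,
                   window: int = 20, sigmas: float = 3.0) -> bool:
    """Flag `latest` if it falls outside the control limits of the recent window."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough releases to estimate spread
    centre, spread = mean(recent), stdev(recent)
    return abs(latest - centre) > sigmas * spread


# Illustrative Wasserstein scores from eight earlier releases.
history = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.09]

steady = out_of_control(history, latest=0.11)   # within normal variation
drifted = out_of_control(history, latest=0.30)  # well outside the limits
```

Running this per dimension keeps the drift signal as separated as the scores themselves.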

Do topology checks need to follow a standard?

Yes. Self-intersections and other invalid rings must be caught with the validity predicates defined by the OGC Simple Features Access standard. Sliver polygons and inverted ring orientations can pass those predicates, so check them separately (Step 1 counts slivers by area). Use shapely.is_valid_reason or PostGIS ST_IsValidReason rather than a bare ST_IsValid so you capture the exact failure mode instead of a binary pass/fail, which is what turns a geometric failure into an actionable generator bug report.

How does realism evaluation support a compliance audit?

Every score must be reproducible: deterministic seeds, pinned library versions, and an explicit configuration manifest. Serialize the three-dimension report with a content hash, and when sharing data across organizational boundaries attach a signed evaluation manifest certifying the scores. That satisfies audit requirements while staying transparent about the synthetic data’s limitations.
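
A minimal sketch of such a manifest, using only the standard library, is shown below. It uses an HMAC with a shared secret purely for illustration; that is an assumption, and a cross-organization exchange would more likely use an asymmetric signature so the verifier never holds the signing key. The function names and manifest layout are hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(report: dict) -> bytes:
    """Serialize deterministically so the same report always hashes the same way."""
    return json.dumps(report, sort_keys=True, separators=(",", ":")).encode()


def sign_manifest(report: dict, key: bytes) -> dict:
    """Wrap a report with its content hash and an HMAC-SHA256 signature."""
    payload = _canonical(report)
    return {
        "report": report,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "signature": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }


def verify_manifest(manifest: dict, key: bytes) -> bool:
    """Recompute the signature over the embedded report and compare in constant time."""
    expected = hmac.new(key, _canonical(manifest["report"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, manifest["signature"])


key = b"demo-key-not-for-production"
manifest = sign_manifest({"contract_version": "1.4.0",
                          "geometric": {"valid_geometry_ratio": 1.0}}, key)
```

Any edit to a score after signing changes the canonical payload, so `verify_manifest` returns `False` and the tampering is visible to the auditor.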

Evaluating Spatial Realism with Wasserstein Distance — the optimal-transport mathematics and pipeline integration behind the distributional dimension.
Scoping Rules & Data Contracts — the versioned thresholds and CRS contract every metric reads from.
Privacy-Preserving Generation Frameworks — how privacy noise interacts with distributional realism budgets.
CI/CD Integration for Spatial Data — the gate that enforces these scores at promotion time.
Point Process Simulation Models — generation methods whose autocorrelation these metrics validate.
Synthetic Spatial Data Architecture & Fundamentals — the parent area that frames generation, validation, and promotion as one control plane.