CI/CD Integration for Spatial Data

Automating the generation, validation, and promotion of synthetic spatial datasets demands a disciplined continuous-integration and continuous-delivery strategy that treats geometry as a first-class, machine-enforceable contract. This page extends Synthetic Spatial Data Architecture & Fundamentals, applying its deterministic-generation and spatial-contract principles to the pipeline layer where commits become published artifacts. Unlike tabular pipelines, spatial workflows must enforce coordinate reference system consistency, topological integrity, and statistical fidelity at every commit — and they must do so in containers that pin the exact geospatial stack, because a single unpinned library can silently rewrite every coordinate that flows through it.

Problem Framing: Failures That Pass Schema But Corrupt Geometry

The defining hazard of spatial CI/CD is the silent failure. A tabular pipeline that breaks usually throws: a type mismatch, a null where a value was required, a row count that collapses to zero. Spatial pipelines fail differently. A subtle coordinate reference system drift (for example, a runner that assumes EPSG:4326 for data with no declared CRS, or one that resolves transforms against the wrong PROJ database because PROJ_LIB was never set) produces a dataset that is perfectly well-formed, passes every column-level schema assertion, and is geometrically wrong by hundreds of metres. The error only surfaces three stages downstream, when a spatial join returns zero matches or an ML feature store ingests misaligned rasters.

The second class of silent failure is topological: generators emit self-intersecting rings, inverted winding order, or sliver polygons a few microns wide. These geometries serialize cleanly and may even render, but area calculations come back wrong (the lobes of a self-intersecting ring can cancel each other out), buffering explodes the vertex count, and tessellation aggregations leak across boundaries. A spatial CI gate exists to convert these silent corruptions into loud, blocking failures before an artifact is promoted. The three failures this page is built to stop are CRS drift, topology violation, and statistical degradation. The first two are caught by the topology & CRS gate and the third by the fidelity gate, and every gate is a hard stop.

Prerequisites & Toolchain

Spatial CI is reproducible only when the geospatial stack is pinned to exact versions and executed inside a container image that is itself version-tagged. The “works on my machine” failure mode is not a developer inconvenience here — it is a correctness bug, because GDAL and PROJ carry the projection database that defines what a coordinate means.

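Pin the following in the runner image and in requirements.txt. The table gives the release series for each component; the files below narrow the Python packages to exact versions: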

Component	Version	Why it must be pinned
GDAL	3.x	Carries the driver and the PROJ binding; a minor bump can change WKT output
PROJ	9.x	Ships the projection database; version skew shifts transformed coordinates
GeoPandas	0.14.x	Vectorized geometry I/O and CRS handling used by every gate
Shapely	2.x	Topology predicates and `make_valid` auto-repair
pyproj	3.x	Explicit CRS objects and transformer construction

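Set PROJ_LIB and GDAL_DATA explicitly in the workflow environment rather than trusting the base image. PROJ 9.1 and later prefer the variable name PROJ_DATA and fall back to PROJ_LIB when it is unset, so either name works with the PROJ 9.x series pinned here. A minimal pinned environment looks like this: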

dockerfile
# Dockerfile.spatial-ci — the single source of geospatial truth for every runner
FROM python:3.11-slim

ENV PROJ_LIB=/usr/share/proj \
    GDAL_DATA=/usr/share/gdal \
    PYTHONHASHSEED=0

RUN apt-get update && apt-get install -y --no-install-recommends \
        gdal-bin=3.* libgdal-dev=3.* proj-bin=9.* && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt


# requirements.txt — exact pins, no ranges
geopandas==0.14.4
shapely==2.0.4
pyproj==3.6.1
pyarrow==16.1.0
esda==2.5.1
libpysal==4.11.0

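Setting PYTHONHASHSEED=0 and pinning every transitive dependency is what makes two runs on two runners byte-identical, the same determinism guarantee that the scoping rules and data contracts require of the generation stage itself. The requirements.txt above pins only direct dependencies; transitive ones such as NumPy and SciPy (which the fidelity gate imports) need a lock file, for example one produced by pip freeze or pip-compile, to be held to exact versions as well.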

Core Concept: A Three-Gate Promotion Lifecycle

Spatial CI/CD separates the lifecycle into three discrete stages (generation, validation, promotion) and implements validation as three sequential, independently failing gates that stand between generation and promotion. The ordering is deliberate and reflects the cost and dependency structure of each check:

Topology & CRS gate runs first because it is cheap, deterministic, and a precondition for every later metric. There is no point computing spatial autocorrelation on geometries that are invalid or in the wrong projection.
Statistical fidelity gate runs second. It is more expensive and only meaningful on geometrically valid input, so it consumes the output of gate one.
Privacy & compliance gate runs last because disclosure risk depends on the final, validated spatial distribution, and because a privacy failure must block promotion even when geometry and statistics are flawless.

Each gate is a hard stop: the first violation halts the run, quarantines the artifact, and surfaces a structured diagnostic rather than a binary pass/fail. Gates never auto-promote on warning. This is the same separation-of-concerns discipline that governs the whole architecture — generation proposes, the gates dispose, and only validated artifacts reach a consumer.
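
The hard-stop behaviour can be sketched as a small runner. This is a minimal illustration under stated assumptions, not the pipeline's actual orchestrator: the name `run_gates`, the diagnostic fields, and the quarantine file layout are all hypothetical, and each gate is assumed to be a zero-argument callable that raises on violation.

```python
import json
import time
import traceback
from pathlib import Path
from typing import Callable, Sequence


def run_gates(artifact: str, gates: Sequence[tuple[str, Callable[[], None]]],
              quarantine_dir: str) -> int:
    """Run gates in order; stop at the first failure and record a diagnostic.

    Returns a process exit code: 0 when every gate passes, 1 otherwise.
    """
    for name, gate in gates:
        try:
            gate()
        except Exception as exc:  # any gate exception is a hard stop
            diagnostic = {
                "artifact": artifact,
                "failed_gate": name,
                "error_type": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
                "timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            }
            out = Path(quarantine_dir)
            out.mkdir(parents=True, exist_ok=True)
            # Hypothetical layout: one diagnostic file per quarantined artifact.
            (out / f"{Path(artifact).stem}.diagnostic.json").write_text(
                json.dumps(diagnostic, indent=2))
            return 1  # later gates never run
    return 0
```

A CI entry point would pass the return value to `sys.exit`, so a failed gate produces both a diagnostic file and a non-zero job status.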

For the concrete GitHub Actions wiring of this lifecycle — matrix builds across multiple CRS targets and parallelized validation runners — see Syncing Synthetic Data Generation with GitHub Actions, which implements the workflow file this page describes conceptually.

Step-by-Step Implementation

The following steps assemble a complete gate sequence. Every block is copy-pasteable into the pinned container above and is introduced by the role it plays in the pipeline.

Step 1: Declare the spatial data contract

The contract is the versioned specification every gate reads from. It is committed alongside the generation manifest so that a parameter change and its acceptance criteria move together. Keep it in YAML for reviewability.

yaml
# spatial_contract.yaml — versioned acceptance criteria, reviewed in PRs
schema_version: "2.1"
crs: "EPSG:4326"
extent_bbox: [-122.5200, 37.7000, -122.3500, 37.8200]
geometry:
  allowed_types: ["Polygon", "MultiPolygon"]
  min_feature_count: 500
  max_feature_count: 50000
  min_vertex_spacing_m: 1.0
statistics:
  morans_i_tolerance: 0.05      # max |I_synthetic - I_reference|
  nn_distance_ks_pvalue_min: 0.01
privacy:
  epsilon_max: 1.0
  k_anonymity_min: 5
  restricted_zones: "data/restricted_zones.parquet"
retention:
  format: "geoparquet"
  keep_versions: 12
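
Because every gate reads from this file, it can be worth checking the contract itself before any gate runs. The sketch below assumes the YAML has already been loaded into a plain dict (for instance with a YAML parser's safe loader); `validate_contract` and its specific rules are illustrative assumptions, not part of the documented pipeline.

```python
def validate_contract(contract: dict) -> list[str]:
    """Return human-readable problems; an empty list means the contract is usable."""
    problems = []

    for section in ("schema_version", "crs", "extent_bbox", "geometry",
                    "statistics", "privacy", "retention"):
        if section not in contract:
            problems.append(f"missing section: {section}")
    if problems:
        return problems  # value checks below need every section present

    bbox = contract["extent_bbox"]
    if len(bbox) != 4 or not (bbox[0] < bbox[2] and bbox[1] < bbox[3]):
        problems.append("extent_bbox must be [minx, miny, maxx, maxy] with min < max")

    geom = contract["geometry"]
    if geom["min_feature_count"] > geom["max_feature_count"]:
        problems.append("min_feature_count exceeds max_feature_count")

    stats = contract["statistics"]
    if stats["morans_i_tolerance"] <= 0:
        problems.append("morans_i_tolerance must be positive")
    if not 0 < stats["nn_distance_ks_pvalue_min"] < 1:
        problems.append("nn_distance_ks_pvalue_min must lie in (0, 1)")

    priv = contract["privacy"]
    if priv["epsilon_max"] <= 0:
        problems.append("epsilon_max must be positive")
    if priv["k_anonymity_min"] < 2:
        problems.append("k_anonymity_min below 2 gives no anonymity")

    return problems
```

Returning a list rather than raising on the first problem lets a pull-request check report every contract error in one pass.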

Step 2: Run the topology & CRS gate

This gate enforces geometry validity, CRS identity, and bound compliance before any downstream consumption. It attempts a bounded auto-repair with Shapely’s make_valid and fails hard if repair leaves residual invalid geometries — never promoting silently-broken output.

python
import logging
import geopandas as gpd
from shapely.validation import make_valid

logging.basicConfig(level=logging.INFO)


def topology_crs_gate(artifact_path: str, contract: dict) -> gpd.GeoDataFrame:
    gdf = gpd.read_parquet(artifact_path)
    expected_crs = contract["crs"]

    # CRS identity: reject a missing CRS outright, and compare CRS objects
    # rather than strings so equivalent spellings of the same CRS still match.
    if gdf.crs is None or not gdf.crs.equals(expected_crs):
        raise ValueError(f"CRS contract violation: expected {expected_crs}, got {gdf.crs}")

    # Topology — attempt bounded auto-repair, then re-assert.
    invalid = ~gdf.geometry.is_valid
    if invalid.any():
        logging.warning("Repairing %d invalid geometries in %s", int(invalid.sum()), artifact_path)
        gdf.loc[invalid, "geometry"] = gdf.loc[invalid, "geometry"].apply(make_valid)
        if (~gdf.geometry.is_valid).any():
            raise ValueError("Topology violations persist after make_valid — gate failed")

    # Extent — coordinates must stay inside the declared envelope.
    minx, miny, maxx, maxy = gdf.total_bounds
    ext = contract["extent_bbox"]
    if minx < ext[0] or miny < ext[1] or maxx > ext[2] or maxy > ext[3]:
        raise ValueError(f"Extent violation: {gdf.total_bounds.tolist()} exceeds {ext}")

    logging.info("Topology & CRS gate passed for %s (%d features)", artifact_path, len(gdf))
    return gdf

Consult the GeoPandas documentation for the vectorized validity predicates and CRS transformation semantics this gate relies on.
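
The contract also declares allowed geometry types and feature-count bounds, which the gate above does not read. That matters after repair, because make_valid can return a GeometryCollection or a lower-dimensional geometry in place of a polygon. A minimal sketch of the extra check follows; `check_geometry_contract` is a hypothetical helper that takes any iterable of type names, such as the values of `gdf.geom_type`.

```python
from collections import Counter
from typing import Iterable


def check_geometry_contract(geom_types: Iterable[str], geometry_contract: dict) -> None:
    """Raise ValueError when types or feature count fall outside the contract."""
    counts = Counter(geom_types)
    total = sum(counts.values())

    # Anything not explicitly allowed is reported with its count.
    unexpected = {t: n for t, n in counts.items()
                  if t not in geometry_contract["allowed_types"]}
    if unexpected:
        raise ValueError(f"Geometry type violation: {unexpected}")

    lo = geometry_contract["min_feature_count"]
    hi = geometry_contract["max_feature_count"]
    if not lo <= total <= hi:
        raise ValueError(f"Feature count {total} outside [{lo}, {hi}]")
```

Inside the gate this could be called as `check_geometry_contract(gdf.geom_type, contract["geometry"])` after the repair step, assuming the contract dict carries the `geometry` section shown in Step 1.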

Step 3: Run the statistical fidelity gate

Generated output must preserve the spatial autocorrelation, density gradients, and feature distributions of the reference domain. This gate computes Global Moran’s I and a nearest-neighbour distance comparison against the baseline, failing when either exceeds the contract tolerance. The detailed methodology and threshold calibration live in Realism Metrics & Evaluation; the gate below is the operational enforcement point.

python
import geopandas as gpd
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import ks_2samp
from libpysal.weights import KNN
from esda.moran import Moran


def fidelity_gate(synthetic: gpd.GeoDataFrame, reference: gpd.GeoDataFrame,
                  value_col: str, contract: dict) -> None:
    # Global Moran's I on a fixed k-nearest-neighbour graph for stable comparison.
    def morans_i(gdf):
        w = KNN.from_dataframe(gdf, k=8)
        w.transform = "r"
        return Moran(gdf[value_col].to_numpy(), w, permutations=999).I

    delta_i = abs(morans_i(synthetic) - morans_i(reference))
    if delta_i > contract["statistics"]["morans_i_tolerance"]:
        raise ValueError(f"Autocorrelation drift: |delta I| = {delta_i:.4f} over tolerance")

    # Nearest-neighbour distance distribution: KS test against the reference.
    def nn_dist(gdf):
        # Use the points themselves, or centroids for any non-point geometry.
        pts = gdf.geometry if (gdf.geom_type == "Point").all() else gdf.geometry.centroid
        coords = np.c_[pts.x, pts.y]
        # KNN weights are binary, so they carry no distances; query a KD-tree instead.
        # k=2 because each point's closest hit in its own tree is itself at distance 0.
        dist, _ = cKDTree(coords).query(coords, k=2)
        return dist[:, 1]

    p = ks_2samp(nn_dist(synthetic), nn_dist(reference)).pvalue
    if p < contract["statistics"]["nn_distance_ks_pvalue_min"]:
        raise ValueError(f"Point-pattern divergence: KS p={p:.4f} below floor")

Step 4: Run the privacy & compliance gate

Even with direct identifiers removed, coordinate precision and clustering can enable re-identification by linkage or proximity inference. This gate verifies the differential privacy budget, a k-anonymity floor, and that no geometry intersects a restricted zone. It must stay synchronized with the generation-side privacy parameters so gate and generator never disagree on the budget.

The budget itself is the standard $(\varepsilon, \delta)$ guarantee: a mechanism $M$ satisfies it when, for all adjacent datasets $D, D'$ and all sets of outputs $S$,

$$
\Pr[M(D) \in S] \le e^{\varepsilon}\,\Pr[M(D') \in S] + \delta.
$$

python
import geopandas as gpd


def privacy_gate(gdf: gpd.GeoDataFrame, contract: dict, epsilon_spent: float) -> None:
    p = contract["privacy"]

    # Budget: the generator's spent epsilon must not exceed the declared ceiling.
    if epsilon_spent > p["epsilon_max"]:
        raise ValueError(f"Privacy budget exceeded: epsilon={epsilon_spent} > {p['epsilon_max']}")

    # Restricted zones: no synthetic geometry may intersect a protected area.
    restricted = gpd.read_parquet(p["restricted_zones"]).to_crs(gdf.crs)
    hits = gpd.sjoin(gdf, restricted, predicate="intersects", how="inner")
    if not hits.empty:
        # A geometry that touches several zones appears once per zone, so count unique rows.
        raise ValueError(f"{hits.index.nunique()} geometries intersect restricted zones")

    # k-anonymity on a quantized grid cell — every occupied cell needs >= k features.
    cell = (gdf.geometry.centroid.x.round(3).astype(str) + "_" +
            gdf.geometry.centroid.y.round(3).astype(str))
    if cell.value_counts().min() < p["k_anonymity_min"]:
        raise ValueError("k-anonymity floor violated in at least one grid cell")
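
The grid above is defined by `round(3)`, which in EPSG:4326 means cells of 0.001 degrees. How large that is on the ground depends on latitude. The sketch below is a rough conversion only: it assumes a spherical Earth with about 111.32 km per degree, and `cell_size_m` is an illustrative helper rather than part of the gate.

```python
import math

METRES_PER_DEGREE = 111_320.0  # approximate length of one degree on a sphere


def cell_size_m(decimals: int, latitude_deg: float) -> tuple[float, float]:
    """Approximate (east-west, north-south) size in metres of a round(decimals) cell."""
    step_deg = 10.0 ** -decimals
    north_south = step_deg * METRES_PER_DEGREE
    # Meridians converge toward the poles, so east-west extent shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south
```

At the latitude of the contract's bounding box (about 37.76 degrees north) this gives cells of roughly 88 m by 111 m, which is the spatial resolution at which the k-anonymity floor is being enforced.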

Step 5: Promote to versioned storage with provenance

Only after all three gates pass does the artifact reach versioned storage. Promotion writes GeoParquet — which carries spatial metadata and supports predicate pushdown for downstream queries — stamped with a content hash and the contract version, giving every release a verifiable chain of custody.

python
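import hashlib
import io
import json
from pathlib import Path

import geopandas as gpd


def promote(gdf: gpd.GeoDataFrame, contract: dict, epsilon_spent: float, out_dir: str) -> str:
    # GeoDataFrame.to_parquet needs a target, so serialize into an in-memory
    # buffer first and hash exactly the bytes that will be written.
    buffer = io.BytesIO()
    gdf.to_parquet(buffer, index=False)
    payload = buffer.getvalue()
    digest = hashlib.sha256(payload).hexdigest()[:16]
    dest = Path(out_dir) / f"synthetic_{contract['schema_version']}_{digest}.parquet"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)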

    manifest = {
        "artifact": dest.name, "sha256_16": digest,
        "schema_version": contract["schema_version"], "crs": contract["crs"],
        "feature_count": len(gdf), "epsilon_spent": epsilon_spent,
    }
    dest.with_suffix(".manifest.json").write_text(json.dumps(manifest, indent=2))
    return str(dest)
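
A chain of custody is only verifiable if a consumer can re-check it. The sketch below recomputes the hash from the stored bytes and compares it with the sidecar manifest. It assumes the manifest keys written by promote above ("artifact" and "sha256_16") and that the manifest sits next to the artifact; `verify_artifact` itself is a hypothetical helper.

```python
import hashlib
import json
from pathlib import Path


def verify_artifact(manifest_path: str) -> bool:
    """Recompute the artifact hash and compare it with the manifest's record."""
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text())
    artifact = manifest_file.parent / manifest["artifact"]
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()[:16]
    # The digest is recorded in the manifest and embedded in the file name; check both.
    return digest == manifest["sha256_16"] and digest in artifact.name
```

A downstream consumer could call this before reading the file and refuse any artifact for which it returns False.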

Validation & Testing

Gates are themselves code and must be tested. The most valuable tests are negative: deliberately malformed fixtures that prove each gate blocks what it is supposed to block. A gate that has never been seen to fail is a gate you cannot trust.

python
import geopandas as gpd
import pytest
from shapely.geometry import Polygon

# Hypothetical module name; import the gate from wherever it lives in your repo.
from gates import topology_crs_gate


def test_topology_gate_repairs_self_intersection(tmp_path):
    # Bowtie polygon: a classic self-intersection that serializes cleanly.
    bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
    gdf = gpd.GeoDataFrame(geometry=[bowtie], crs="EPSG:4326")
    path = tmp_path / "bad.parquet"
    gdf.to_parquet(path)
    # make_valid splits the bowtie into a valid MultiPolygon, so the gate should pass.
    # Assert the repair actually produced valid output rather than crashing.
    out = topology_crs_gate(str(path), {"crs": "EPSG:4326",
                                        "extent_bbox": [-180, -90, 180, 90]})
    assert out.geometry.is_valid.all()


def test_crs_gate_rejects_wrong_projection(tmp_path):
    gdf = gpd.GeoDataFrame(geometry=[Polygon([(0, 0), (0, 1), (1, 1)])], crs="EPSG:3857")
    path = tmp_path / "wrong_crs.parquet"
    gdf.to_parquet(path)
    with pytest.raises(ValueError, match="CRS contract violation"):
        topology_crs_gate(str(path), {"crs": "EPSG:4326",
                                      "extent_bbox": [-180, -90, 180, 90]})

Wire these into the pipeline so the test suite runs in the same pinned container as the gates. Expected output shapes matter: the fidelity gate returns None on success and raises on failure, so CI asserts on absence of exception plus a green log line, not on a return value. Keep tolerance thresholds in the contract, never hard-coded in the gate, so QA can tighten them without a code change.

Performance & Scale Considerations

Spatial artifacts are large and expensive to regenerate, so the pipeline must avoid recomputing what it can stream. Three constraints dominate at scale:

Memory. read_parquet loads the full GeoDataFrame; for continental grids this exceeds runner memory. Read only the columns each gate needs (columns=["geometry", value_col]), and use GeoParquet row-group filtering to process one spatial partition at a time rather than the whole frame.
Parallelism. The three gates are sequential by dependency, but within the fidelity gate the per-partition Moran’s I computations are embarrassingly parallel. Shard by a spatial tiling key and reduce, mirroring the non-blocking patterns described in async execution for large grids. Keep a halo of neighbouring features around each tile so the k-nearest-neighbour graph does not see false edges at partition seams.
Build matrix cost. Testing every CRS target on every commit is wasteful. Restrict the full matrix to changes touching the generation manifest or contract; for unrelated commits, run a single canonical CRS and defer the matrix to a nightly scheduled run.
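
The halo idea from the parallelism point can be sketched with a plain square grid. This is an illustration under assumptions, not the tiling scheme the pipeline uses: `assign_tiles`, the square tiles, and the choice of halo width are all hypothetical, and coordinates are assumed to be in the same units as `tile_size` and `halo`.

```python
import math
from collections import defaultdict


def assign_tiles(points, tile_size: float, halo: float):
    """Map each tile key to {"core": [...], "halo": [...]} lists of point indices.

    A point is "core" in the tile that contains it and "halo" in any other
    tile that the square of half-width `halo` around the point reaches into.
    """
    tiles = defaultdict(lambda: {"core": [], "halo": []})
    for idx, (x, y) in enumerate(points):
        home = (math.floor(x / tile_size), math.floor(y / tile_size))
        # Visit every tile overlapped by the halo square around this point.
        for tx in range(math.floor((x - halo) / tile_size),
                        math.floor((x + halo) / tile_size) + 1):
            for ty in range(math.floor((y - halo) / tile_size),
                            math.floor((y + halo) / tile_size) + 1):
                role = "core" if (tx, ty) == home else "halo"
                tiles[(tx, ty)][role].append(idx)
    return dict(tiles)
```

Each worker would build its neighbour graph from core plus halo features but report statistics for core features only, so no feature is counted twice in the reduce step. The halo needs to be at least as wide as the largest neighbour distance the graph can use, which has to be estimated from the data.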

The single highest-leverage optimization remains the pinned container: it eliminates the entire class of environment-dependent reruns before the first gate executes.

Failure Modes & Troubleshooting

Silent CRS fallback on the runner

Symptom: geometries are valid and inside bounds locally but shifted hundreds of metres in CI. Root cause: PROJ_LIB is unset on the runner, so PROJ falls back to a partial database and transforms resolve incorrectly. Remediation: set PROJ_LIB and GDAL_DATA explicitly in the workflow env (as in the Dockerfile above) and add a smoke assertion that pyproj.datadir.get_data_dir() resolves to the pinned path before generation runs.
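
A fail-fast version of that smoke assertion can be written with the standard library alone. The sketch below checks the two environment variables and the presence of proj.db, the SQLite database PROJ uses to resolve CRS definitions. `check_proj_env` is a hypothetical helper, and the expected paths are whatever the pinned image declares.

```python
import os
from pathlib import Path


def check_proj_env(expected_proj_dir: str, expected_gdal_dir: str) -> None:
    """Fail fast when the geospatial data directories are not the pinned ones."""
    for var, expected in (("PROJ_LIB", expected_proj_dir),
                          ("GDAL_DATA", expected_gdal_dir)):
        actual = os.environ.get(var)
        if actual is None:
            raise RuntimeError(f"{var} is not set on this runner")
        if Path(actual).resolve() != Path(expected).resolve():
            raise RuntimeError(f"{var}={actual!r}, expected {expected!r}")

    # Without proj.db, PROJ cannot look up CRS definitions at all.
    if not (Path(expected_proj_dir) / "proj.db").is_file():
        raise RuntimeError(f"proj.db not found under {expected_proj_dir}")
```

With the Dockerfile above, the call would be `check_proj_env("/usr/share/proj", "/usr/share/gdal")`, run before generation. The same comparison can be made against the value returned by pyproj.datadir.get_data_dir() to confirm that pyproj resolves to the same directory.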

Auto-repair masking a generator bug

Symptom: the topology gate passes but the repaired feature count creeps upward every release. Root cause: make_valid is silently fixing systematically broken output from the generator instead of the generator producing valid geometry. Remediation: emit a metric for the repaired-geometry ratio and fail the gate when it exceeds a small threshold (e.g. 1%) — repair is a safety net, not a substitute for a correct generator.
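
The ratio check is small enough to show in full. This is a sketch with a hypothetical helper name; the 1% default mirrors the example threshold above and would normally come from the contract rather than the code.

```python
def check_repair_ratio(n_repaired: int, n_total: int, max_ratio: float = 0.01) -> float:
    """Return the repaired-geometry ratio, raising when it exceeds max_ratio."""
    if n_total <= 0:
        raise ValueError("Cannot compute a repair ratio for an empty artifact")
    ratio = n_repaired / n_total
    if ratio > max_ratio:
        raise ValueError(
            f"Repair ratio {ratio:.2%} exceeds {max_ratio:.2%}: fix the generator")
    return ratio
```

In the topology gate, `n_repaired` would be `int(invalid.sum())` and `n_total` would be `len(gdf)`; logging the returned ratio on every run makes the upward creep visible before it reaches the threshold.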

Statistical gate flapping on small extents

Symptom: the fidelity gate passes and fails intermittently with no parameter change. Root cause: Moran’s I and KS tests are noisy on small samples, so a tight tolerance trips on sampling variance. Remediation: seed both the generator and the permutation test so reruns see identical inputs, raise the minimum feature count in the contract, and widen the tolerance band to the empirically observed run-to-run variance rather than an aspirational value.
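
One way to turn observed variance into a tolerance is to collect the absolute Moran's I differences from a batch of known-good runs and set the band a few standard deviations above their mean. The mean-plus-k-sigma rule and the default k of 3 are assumptions made for this sketch, not values from the contract; `calibrate_tolerance` is a hypothetical helper.

```python
import statistics


def calibrate_tolerance(observed_deltas: list[float], k: float = 3.0) -> float:
    """Suggest a tolerance as mean + k sample standard deviations of observed |delta I|."""
    if len(observed_deltas) < 2:
        raise ValueError("Need at least two runs to estimate variance")
    return statistics.mean(observed_deltas) + k * statistics.stdev(observed_deltas)
```

The suggested value would then be reviewed and written into morans_i_tolerance in the contract, so the threshold still changes through a reviewed commit rather than at runtime.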

Privacy budget desynchronization

Symptom: the privacy gate rejects an artifact whose generator reported a compliant budget. Root cause: the generator and the gate read different epsilon values because the budget was duplicated rather than sourced from one place. Remediation: make the contract the single source of epsilon_max, have the generator write epsilon_spent into the manifest, and have the gate read that manifest — never re-derive the budget independently.
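
The single-source rule can be sketched as follows, assuming the generator writes a JSON manifest containing an epsilon_spent key and the contract has been loaded into a dict. `read_epsilon_spent` and `check_budget` are hypothetical helpers for illustration.

```python
import json
from pathlib import Path


def read_epsilon_spent(manifest_path: str) -> float:
    """Read the generator-reported epsilon from the manifest, never a default."""
    manifest = json.loads(Path(manifest_path).read_text())
    if "epsilon_spent" not in manifest:
        raise KeyError("manifest has no epsilon_spent: refusing to assume a budget")
    return float(manifest["epsilon_spent"])


def check_budget(manifest_path: str, contract: dict) -> None:
    """Compare the manifest's epsilon_spent with the contract's epsilon_max."""
    spent = read_epsilon_spent(manifest_path)
    ceiling = contract["privacy"]["epsilon_max"]
    if spent > ceiling:
        raise ValueError(f"Privacy budget exceeded: epsilon={spent} > {ceiling}")
```

Raising on a missing key, rather than falling back to zero, keeps a manifest that lacks the field from passing the gate by accident.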

Quarantine without notification

Symptom: artifacts stop reaching consumers but no one is alerted. Root cause: the quarantine path swallows the failure without routing it. Remediation: on any gate exception, write the structured diagnostic to the quarantine location and fail the CI job with a non-zero exit so the platform’s alerting fires; never except: pass around a gate.

Frequently Asked Questions

Why run topology validation before statistical checks?

Because statistical metrics are only meaningful on geometrically valid input and they are far more expensive to compute. An invalid ring or wrong CRS makes Moran’s I and nearest-neighbour distances meaningless, so spending compute on them before the cheap, deterministic topology gate has passed is wasted work. Order the gates cheapest-and-most-fundamental first.

Should make_valid auto-repair run in CI or should the gate just fail?

Run a bounded auto-repair and then re-assert validity, failing hard if anything remains invalid. Repair absorbs incidental, one-off artifacts without a manual loop, but you must also track the repaired-geometry ratio — a rising ratio means the generator has a real bug that repair is hiding, and that should itself fail the gate.

How do I stop CRS drift between local development and cloud runners?

Pin GDAL, PROJ, GeoPandas, Shapely, and pyproj to exact versions inside a version-tagged container, and set PROJ_LIB and GDAL_DATA explicitly in the workflow environment. Never trust the base image’s defaults. Add a pre-generation smoke check that the PROJ data directory resolves to the pinned path, so a misconfigured runner fails immediately rather than emitting shifted coordinates.

Where in the pipeline does the privacy budget get enforced?

In a dedicated gate that runs after topology and statistics, reading epsilon_max from the shared contract and epsilon_spent from the generator’s manifest. Enforcing it last ensures disclosure risk is measured against the final validated distribution, and sourcing both values from one place prevents the generator and gate from disagreeing on the budget.

What format should promoted artifacts use?

GeoParquet. It carries CRS and geometry metadata, supports predicate pushdown and row-group filtering for memory-bounded downstream reads, and compresses well for large feature counts. Stamp each promoted file with a content hash and the contract version in a sidecar manifest to preserve a verifiable chain of custody.

Scoping Rules & Data Contracts — the deterministic boundaries and CRS contracts the gates enforce.
Realism Metrics & Evaluation — the autocorrelation and distributional methods behind the fidelity gate.
Privacy-Preserving Generation Frameworks — the $(\varepsilon, \delta)$ budgeting the privacy gate verifies.
Syncing Synthetic Data Generation with GitHub Actions — the concrete workflow file that wires these gates into matrix builds.
Async Execution for Large Grids — partitioned, non-blocking patterns for validating continental-scale output.
Synthetic Spatial Data Architecture & Fundamentals — the parent area that frames generation, validation, and promotion as one control plane.