Why are coordinates subtly wrong but nothing throws an error?

Silent CRS drift: two stages disagree on the projection and treat coordinates as bare numbers. Make the authoritative CRS a required scope field, assert it at every ingestion point, refuse to infer a CRS, and ship PROJ datum grids in-image with PROJ_NETWORK off.

Scoping Rules & Data Contracts

Scoping rules and data contracts are the part of a synthetic spatial pipeline that turns “the data should look like San Francisco at 10 m resolution and stay inside the city limits” into an executable, machine-enforceable specification that every downstream stage can trust. This page extends the Synthetic Spatial Data Architecture & Fundamentals reference and focuses on one sub-problem: how to declare the spatial, temporal, and semantic boundaries of generated data, and how to formalise the structural, statistical, and topological agreement passed between generation, validation, and promotion so that a violation fails loudly at the source instead of silently corrupting a downstream consumer.

A scoping rule fixes what may be generated; a data contract fixes what shape the output must take at every interface. Together they are the control plane that lets four disciplines work against one shared truth: GIS developers configure spatial indexes and tiling from the declared extent, ML engineers pre-allocate coordinate tensors from the resolution and bounds, QA teams assert geometric and statistical validity against versioned thresholds, and privacy and compliance engineers bind disclosure limits to the same manifest that drives generation.

Problem Framing: Implicit Assumptions Are Silent Failures

The failure this layer prevents is the implicit assumption that survives until a consumer trips over it hours later. Generative models do not respect intent; they respect inputs. Without an explicit envelope, a generator produces unbounded coordinate drift, mismatched projections, and feature densities that break spatial joins or tensor batching downstream. Three concrete, repeatable engineering failures recur.

First, silent coordinate reference system drift: a stage emits geometries in EPSG:3857 while a consumer assumes EPSG:4326, and because both are “just numbers” nothing throws — the data is simply wrong by hundreds of metres, and the error surfaces as a mysterious offset in a downstream spatial join. Second, envelope escape: an unbounded sampler places features outside the declared study area, inflating tile counts and poisoning area-normalised statistics. Third, schema drift between stages: a generation module renames a field or changes a geometry type from Polygon to MultiPolygon, and the simulation engine that consumes it either crashes or, worse, coerces the type and continues with corrupted topology.

Each of these is invisible at the point of origin and expensive at the point of discovery. A scoping rule turns the first two into a hard boundary check; a data contract turns the third into a schema assertion that fails the run before promotion. The contract is also the document that the privacy-preserving generation frameworks layer inherits its CRS and correlation limits from, and the document the realism metrics and evaluation layer reads its statistical tolerance bands from. Getting it right here is what makes those layers tractable.

Prerequisites & Toolchain

This layer targets Python 3.10+ and pins the geospatial stack to the versions used across this site:

GeoPandas 0.14.x and Shapely 2.x for geometry handling, validity checks, and bounds computation.
pyproj 3.x (PROJ 9.x) for deterministic CRS resolution and transformation; coordinate transforms should be managed through PROJ so conversions are reproducible to millimetre precision.
GDAL 3.x for format I/O (GeoJSON, FlatGeobuf, GeoParquet).
A schema validator — jsonschema for JSON Schema payloads, or protobuf where a binary contract is preferred.
NumPy with an explicitly seeded Generator, never the legacy global RNG, so deterministic thinning and sampling replay byte-identically under audit.

Pin these in a lockfile and run generation inside a container so CRS transforms and topology checks yield identical results across local and CI runners. Set PROJ_NETWORK=OFF and ship the PROJ datum grids in-image so a missing grid never triggers an implicit, version-dependent transformation fallback that quietly shifts coordinates. Treat the contract files themselves as code: version them, review them, and tag each release so a generated artifact can always be traced back to the exact scope that produced it.

Core Concept: Scope as Boundary, Contract as Interface

The mechanism has two halves that are easy to conflate but must stay separate.

A scoping rule is a declarative boundary on the generation domain — the spatial extent, the authoritative CRS, the target resolution, the temporal window, and the density envelope. It is consumed before and during generation to constrain the sampler. A data contract is a structural and statistical agreement on the artifact that crosses an interface — field names and types, geometry-type whitelists, nullability, distributional tolerances, and topology rules. It is consumed after generation to validate the candidate before it is allowed to move downstream. Scope shapes the output; the contract certifies it.

The spatial scope envelope

Every generation job declares a bounding envelope, a target resolution, an authoritative CRS, and a temporal window. Implicit coordinate assumptions are rejected outright; the scope is the only source of projection truth.

yaml
# spatial_scope.yaml — declarative generation boundary
extent:
  crs: "EPSG:4326"
  bbox: [-122.4194, 37.7749, -122.3500, 37.8200]
  clip_to_boundary: true
resolution:
  grid_size_meters: 10.0
  min_feature_spacing_meters: 15.0
temporal_window:
  start: "2025-01-01T00:00:00Z"
  end: "2025-12-31T23:59:59Z"
  sampling_frequency: "P1D"
density:
  max_features_per_km2: 1200
  process: "poisson"

GIS developers read this to configure spatial indexes (R-tree, Quadtree) and tiling strategies; ML engineers read the extent and resolution to normalise coordinate tensors and pre-allocate batch buffers; QA teams assert that no generated geometry escapes the envelope and that temporal sampling respects the declared frequency, preventing simulation time-step desynchronisation.

The contract layers four guarantees on top of the scope:

Schema and type guarantees — mandatory fields, data types, nullability, and enum restrictions, expressed in JSON Schema or Protocol Buffers. For spatial payloads this includes strict typing of the geometry field against a whitelist of OGC geometry types, so an ML engineer can build fixed-shape tensors without runtime coercion.
Statistical distribution bounds — expected ranges, quantile thresholds, and divergence tolerances (Jensen–Shannon divergence, Kolmogorov–Smirnov p-values) that keep synthetic marginals and joints aligned with a reference baseline without leaking ground-truth correlations.
Topological integrity — OGC Simple Features compliance: ring closure, no self-intersection, consistent outer-ring orientation, and multipolygon disjointness.
Privacy and compliance boundaries — differential-privacy budgets, k-anonymity thresholds, or spatial suppression rules for regulated domains, enforced before serialisation.

Step-by-Step Implementation

Step 1 — Resolve and assert the CRS before anything is generated

Refuse to guess a projection. Load the scope, resolve the declared CRS, and fail fast if an input layer disagrees, so coordinate drift is caught at ingestion rather than discovered downstream.

python
import geopandas as gpd
from pyproj import CRS

def enforce_crs(gdf: gpd.GeoDataFrame, scope: dict) -> gpd.GeoDataFrame:
    """Bind the layer to the scope's authoritative CRS; never infer silently."""
    declared = CRS.from_user_input(scope["extent"]["crs"])
    if gdf.crs is None:
        raise ValueError("Input geometry has no CRS; refuse to assume one.")
    if CRS.from_user_input(gdf.crs) != declared:
        # Explicit, logged transformation — not an implicit fallback.
        gdf = gdf.to_crs(declared)
    return gdf

Step 2 — Clip generated features to the declared envelope

After generation, clip to the bounding box so no feature escapes the study area. Envelope escape is one of the most common sources of poisoned area-normalised statistics.

python
from shapely.geometry import box

def clip_to_extent(gdf: gpd.GeoDataFrame, scope: dict) -> gpd.GeoDataFrame:
    """Drop or clip any geometry outside the declared bbox."""
    minx, miny, maxx, maxy = scope["extent"]["bbox"]
    envelope = box(minx, miny, maxx, maxy)
    if scope["extent"].get("clip_to_boundary", True):
        return gpd.clip(gdf, envelope)
    inside = gdf.geometry.within(envelope)
    return gdf[inside]

Step 3 — Cap feature density deterministically

Scoping rules cap entity counts per unit area to prevent unrealistic clustering or computational exhaustion. Thinning is deterministic so repeated runs with the same seed produce identical spatial distributions — a hard requirement for regression testing and model benchmarking.

python
import geopandas as gpd

def apply_density_scoping(features: gpd.GeoDataFrame, scope: dict) -> gpd.GeoDataFrame:
    """Enforce max_features_per_km2 with reproducible, hash-based thinning."""
    max_density = scope["density"]["max_features_per_km2"]
    # Approximate bbox area in km^2 (degrees -> km via per-axis scale).
    # Constants are km per degree near the equator; adequate for a scoping cap.
    minx, miny, maxx, maxy = features.total_bounds
    width_km = (maxx - minx) * 111.32
    height_km = (maxy - miny) * 110.574
    area_km2 = abs(width_km * height_km)
    allowed_count = int(max_density * area_km2)

    if len(features) > allowed_count:
        # Deterministic thinning on a spatial hash preserves reproducibility.
        features = features.copy()
        features["spatial_hash"] = features.geometry.apply(
            lambda g: f"{g.centroid.x:.4f}_{g.centroid.y:.4f}"
        )
        features = features.drop_duplicates(subset="spatial_hash", keep="first")

    return features.iloc[:allowed_count]

Step 4 — Validate the artifact against the schema contract

Before an artifact crosses an interface, validate its non-spatial structure against a JSON Schema and its geometry types against an OGC whitelist. This is the assertion that turns schema drift into a hard stop.

python
from jsonschema import Draft202012Validator

ALLOWED_GEOMS = {"Point", "Polygon", "MultiPolygon", "LineString"}

def validate_contract(gdf: gpd.GeoDataFrame, schema: dict) -> None:
    """Assert attribute schema and geometry-type whitelist; raise on the first breach."""
    validator = Draft202012Validator(schema)
    for record in gdf.drop(columns="geometry").to_dict(orient="records"):
        errors = sorted(validator.iter_errors(record), key=lambda e: e.path)
        if errors:
            raise ValueError(f"schema breach: {errors[0].message}")
    bad = set(gdf.geometry.geom_type.unique()) - ALLOWED_GEOMS
    if bad:
        raise ValueError(f"geometry types outside contract whitelist: {bad}")

Step 5 — Bind statistical and privacy tolerances to the manifest

The contract also carries the divergence thresholds the realism layer enforces and the privacy budget the generation layer spends. Keeping them in one versioned manifest means a single document drives every gate.

python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArtifactContract:
    schema_id: str
    crs: str                  # authoritative projection
    js_divergence_max: float  # statistical fidelity tolerance
    topology_error_rate_max: float = 1e-4
    epsilon_budget: float | None = None  # None outside regulated domains

    def manifest_hash(self) -> str:
        import hashlib, json
        payload = json.dumps(self.__dict__, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

Validation & Testing

Validation is a gate, not a report. Each transition asserts, in a fixed order, that the artifact is geometrically valid, inside scope, schema-conformant, and within statistical tolerance — halting on the first failure so a topology bug never masquerades as a statistics problem.

python
def gate(gdf, scope, schema, js_divergence, contract: "ArtifactContract"):
    # 1. Geometry first — a correctness bug, not a tuning issue.
    assert gdf.geometry.is_valid.all(), "invalid geometry reached the gate"
    # 2. Scope: every feature inside the declared envelope.
    minx, miny, maxx, maxy = scope["extent"]["bbox"]
    assert gdf.cx[minx:maxx, miny:maxy].shape[0] == gdf.shape[0], "feature escaped extent"
    # 3. Schema and geometry-type whitelist.
    validate_contract(gdf, schema)
    # 4. Statistical tolerance against the reference baseline.
    assert js_divergence <= contract.js_divergence_max, (
        f"JS divergence {js_divergence:.3f} over {contract.js_divergence_max}"
    )

Embed this gate into the CI/CD integration for spatial data workflow so contract validation runs on every pipeline transition rather than on demand. Pre-commit hooks should run topology and schema checks before a generation script is even committed; CI gates should fail fast when coordinate bounds exceed the declared extent, statistical divergence surpasses the configured threshold, the topology error rate breaches roughly 0.01%, or a privacy budget is exhausted. Expected output shapes — column set, geometry dtype, CRS — should be asserted explicitly so a downstream consumer can rely on them without defensive coercion.

Performance & Scale Considerations

Contract validation is cheap relative to generation, but two operations dominate at scale and reward care. Per-record JSON Schema validation in a Python loop is the obvious bottleneck on multi-million-feature layers; compile the validator once (as in Step 4) and, where the schema is simple, prefer a vectorised check on the GeoDataFrame columns over per-row iteration. The iter_errors path should be reserved for the records that fail a fast bulk predicate, not run over every row.

Bounds and density checks should operate on the NumPy block, not per-Point Python objects — total_bounds and cx indexing are vectorised and stay in C. For very large grids, partition the artifact by a coarse spatial grid or R-tree and validate tiles in parallel, but compute area-normalised density on the global extent rather than per tile, or a sparse tile will spuriously pass a cap that the whole layer violates. Serialise validated intermediate artifacts to GeoParquet with the contract manifest hash in the file metadata, so a failed run resumes from the last gate-passing stage and every artifact carries a verifiable link to the scope that produced it.

Failure Modes & Troubleshooting

Coordinates are subtly wrong but nothing throws an error

This is silent CRS drift: two stages disagree on the projection and both treat coordinates as bare numbers. Make the authoritative CRS a required field in the scope, assert it at every ingestion point (Step 1), and refuse to infer a CRS from data with none. Ship PROJ datum grids in-image and set PROJ_NETWORK=OFF so a missing grid cannot trigger an implicit transformation fallback.

Area-normalised statistics look inflated or features appear outside the study area

An unbounded sampler let features escape the envelope. Clip to the declared bbox after generation (Step 2) and assert containment in the gate. Compute density against the global extent, never per tile, so a sparse partition cannot mask a layer-wide cap violation.

A downstream consumer crashes or coerces a changed field type

This is schema drift across an interface — a renamed field or a geometry type that flipped from Polygon to MultiPolygon. Validate every artifact against a versioned JSON Schema and an explicit geometry-type whitelist (Step 4) before promotion, and fail the run on the first breach instead of letting the consumer coerce silently.

Repeated runs with the same seed produce different spatial distributions

A stochastic stage is using the global RNG or an unstable iteration order. Seed every NumPy Generator from the config registry, use hash-based deterministic thinning (Step 3) rather than random sampling for density caps, and fix iteration order before any drop so audits replay byte-identically.

A generation gap aborts the whole pipeline instead of degrading gracefully

Missing attributes, CRS mismatches, or empty geometries should route to a deterministic fallback, not fail silently or crash. Define acceptable missingness and degradation boundaries in the contract and wire them to automated fallbacks for missing spatial attributes, which preserve continuity while keeping a logged, audit-ready trail of every degraded record.

Setting Up Automated Fallbacks for Missing Spatial Attributes — deterministic recovery when a contract field is missing, governed by the boundaries this page defines.
Privacy-Preserving Generation Frameworks — the layer that inherits its CRS, extent, and correlation limits from this contract.
Realism Metrics & Evaluation — the statistical tolerance bands this contract declares are enforced here.
CI/CD Integration for Spatial Data — wiring contract validation into automated promotion and gating.
Synthetic Spatial Data Architecture & Fundamentals — the reference architecture this page sits within.