Privacy-Preserving Generation Frameworks

A privacy-preserving generation framework is the part of a synthetic spatial pipeline that turns “we should protect locations” into a provable, composable guarantee enforced at generation time. This page extends the Synthetic Spatial Data Architecture & Fundamentals reference and focuses on one sub-problem: how to inject a calibrated differential privacy mechanism into a generator without destroying planar topology, drifting outside valid coordinate reference system (CRS) bounds, or silently exhausting the privacy budget across pipeline stages.

The framework replaces heuristic masking and naive coordinate jittering with bounded, accountable mechanisms. By formalizing privacy budgets and embedding them directly into the generation directed acyclic graph (DAG), teams ensure coordinate perturbation, attribute synthesis, and topological transformation do not leak ground-truth information. That requires tight coordination: GIS developers maintain spatial integrity and CRS compliance, ML engineers optimize downstream feature distributions, QA teams enforce statistical and geometric constraints, and privacy engineers track cumulative compliance budgets against a single ledger.

Problem Framing: Why Spatial Data Breaks Scalar Privacy

The core difficulty is that differential privacy mechanisms are designed for independent scalar values, while spatial data is inherently correlated. Adjacent polygons share boundaries, points cluster by generating process, and attributes co-vary with location. A framework that ignores this produces three concrete, repeatable engineering failures.

First, metric distortion: applying noise directly in geographic degrees (EPSG:4326) treats a degree of longitude as a fixed distance, when its ground length shrinks from roughly 111 km at the equator to zero at the poles. Noise calibrated in degrees therefore delivers the intended metric protection only at the latitude it was tuned for; everywhere else the east-west displacement in metres differs from what the sensitivity analysis assumed. Second, topological collapse: independent noise realizations on shared vertices open gaps, overlaps, and self-intersections that invalidate spatial indexes and break routing. Third, silent budget exhaustion: when each stage spends ε without a shared ledger, the composed guarantee is far weaker than what any single stage advertises, and the release ships with a privacy claim that no longer holds.
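To make the metric-distortion point concrete, the sketch below converts a noise scale given in degrees of longitude into metres at a few latitudes. It assumes a spherical Earth with the mean radius, which is an approximation; the function names are illustrative and not part of the framework.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def metres_per_degree_lon(lat_deg: float) -> float:
    """Ground length of one degree of longitude at the given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))


def east_west_noise_m(sigma_deg: float, lat_deg: float) -> float:
    """Metric size of an east-west noise scale that was specified in degrees."""
    return sigma_deg * metres_per_degree_lon(lat_deg)


if __name__ == "__main__":
    for lat in (0.0, 30.0, 60.0, 80.0):
        metres = east_west_noise_m(0.001, lat)
        print(f"lat {lat:4.0f}: 0.001 degrees of longitude is about {metres:.1f} m")
```

The same 0.001 degree scale is about half as wide in metres at 60° latitude as at the equator, which is why the rest of the page calibrates in a projected CRS.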

The framework on this page addresses all three by perturbing in a projected metric CRS, repairing topology after perturbation, and tracking composition across every stage. Each decision has a correctness implication; none is optional. The detailed mechanism calibration lives in Implementing Differential Privacy for Coordinate Generation, which derives the sensitivity bounds this page assumes.

Prerequisites & Toolchain

The framework targets Python 3.10+ and pins the geospatial stack to the versions used across this site:

GeoPandas 0.14.x and Shapely 2.x for geometry handling, validity checks, and topology repair.
pyproj 3.x (PROJ 9.x) for deterministic CRS transformation into a distance-preserving projection.
GDAL 3.x for format I/O (GeoJSON, GeoParquet, ESRI Shapefile).
A differential-privacy accountant — Opacus (Rényi DP) or google-differential-privacy — for composition tracking.
NumPy with an explicitly seeded Generator, never the legacy global RNG, so audits can replay byte-identically.

Pin these in a lockfile and execute generation inside a container so coordinate transformations and topology operations yield identical results across local and CI runners. Set PROJ_NETWORK=OFF and ship the PROJ datum grids in-image to avoid implicit, version-dependent transformation fallbacks. The CRS contract — extent, target projection, resolution — should be inherited from the scoping rules and data contracts layer rather than hardcoded in the generator.

Core Concept: The (ε, δ) Budget and Spatial Sensitivity

A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if, for all adjacent datasets $D, D'$ differing in one record and all output sets $S$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta.$$

The smaller the budget $\varepsilon$, the more noise is injected and the tighter disclosure is bounded, at a cost in utility. For a coordinate-valued query, the noise scale is set by the $\ell_2$ sensitivity $\Delta_2 f$, the maximum change in the output (in projected metres) caused by adding or removing one record. The Gaussian mechanism then draws noise with standard deviation

$$\sigma = \frac{\Delta_2 f \,\sqrt{2 \ln(1.25/\delta)}}{\varepsilon},$$

a calibration that is proven only for $0 < \varepsilon < 1$; a larger per-stage budget needs a different calibration, such as the analytic Gaussian mechanism. This is also why projecting to metres before calibration is mandatory: $\Delta_2 f$ must be expressed in the same units the noise will be added in. The framework allocates a slice of the global budget to each stage and composes them. Naive sequential composition sums the $\varepsilon_i$, but that is wasteful; Rényi differential privacy (RDP) tracks the moments of the privacy loss and converts to a tight $(\varepsilon, \delta)$ at the end, recovering substantial budget over many stages.
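As a worked illustration of the calibration and of why RDP accounting recovers budget, the sketch below computes σ for one stage and then compares naive composition with RDP composition over several identical Gaussian stages. It relies on two standard results: the Gaussian mechanism has RDP $\alpha \Delta_2^2 / (2\sigma^2)$ at order $\alpha$, and an RDP value $r$ at order $\alpha$ converts to $\varepsilon = r + \ln(1/\delta)/(\alpha - 1)$. The function names, the grid of orders, and the example numbers (50 m sensitivity, eight stages) are assumptions chosen for illustration; a production pipeline would use the accountant library rather than this hand-rolled version.

```python
import math

ALPHAS = tuple(range(2, 129))  # RDP orders to search over (illustrative grid)


def gaussian_sigma(sensitivity_m: float, eps: float, delta: float) -> float:
    """Classical Gaussian-mechanism calibration; valid for 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("classical calibration requires 0 < eps < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity_m * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def rdp_gaussian(alpha: float, sensitivity_m: float, sigma: float) -> float:
    """Renyi DP of one Gaussian release at order alpha."""
    return alpha * sensitivity_m**2 / (2.0 * sigma**2)


def compare_composition(
    sensitivity_m: float, eps: float, delta: float, n_stages: int
) -> tuple[float, float]:
    """Return (naive summed eps, RDP-composed eps) for n identical stages."""
    sigma = gaussian_sigma(sensitivity_m, eps, delta)
    naive = n_stages * eps
    # RDP composes by addition at each order; convert once at the end.
    composed = min(
        n_stages * rdp_gaussian(a, sensitivity_m, sigma) + math.log(1.0 / delta) / (a - 1)
        for a in ALPHAS
    )
    return naive, composed


if __name__ == "__main__":
    print(f"sigma = {gaussian_sigma(50.0, 0.5, 1e-5):.1f} m")
    naive, composed = compare_composition(50.0, 0.5, 1e-5, n_stages=8)
    print(f"naive eps = {naive:.2f}, RDP-composed eps = {composed:.2f}")
```

With these example inputs the noise scale is several hundred metres per stage, and the RDP-composed ε is well under half of the naive sum.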

The canonical execution sequence is a transformation DAG where each stage applies a bounded mechanism, charges the shared ledger, and emits an intermediate artifact for validation:

1. Ingestion & Schema Resolution: parse source formats, resolve the CRS to a distance-preserving projection, validate geometry types, enforce typing contracts, and strip direct identifiers before any privacy mechanism runs.
2. Privacy Budget Allocation: split $\varepsilon$ and $\delta$ across coordinate perturbation, attribute synthesis, and topology preservation, using RDP or zero-concentrated DP accounting to compose across stages.
3. Spatial Perturbation: add calibrated noise in the projected CRS while enforcing boundary clipping, with sensitivity that accounts for spatial autocorrelation and cluster density.
4. Attribute Synthesis: regenerate non-spatial features with a DP generator conditioned on the perturbed geometries, preserving spatial-attribute joint distributions.
5. Topology Repair & Validation: resolve self-intersections, sliver polygons, and broken connectivity introduced by noise, restoring OGC Simple Features compliance.
6. Utility & Privacy Scoring: compute distributional fidelity, spatial realism, and the composed formal guarantee before promotion.
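The execution sequence above can be sketched as an ordered list of stages that each charge a shared ledger before running. This is a minimal illustration under simplifying assumptions: the `Ledger`, `Stage`, and `run_pipeline` names are hypothetical, the ledger uses plain sequential (summed) accounting rather than RDP, and each stage is reduced to a function from one artifact dictionary to the next.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Ledger:
    """Shared record of epsilon spent per stage (sequential accounting only)."""
    global_eps: float
    entries: list[tuple[str, float]] = field(default_factory=list)

    @property
    def spent(self) -> float:
        return sum(eps for _, eps in self.entries)

    def charge(self, stage: str, eps: float) -> None:
        if eps < 0:
            raise ValueError("a stage cannot spend negative budget")
        if self.spent + eps > self.global_eps + 1e-12:
            raise RuntimeError(
                f"stage {stage!r} would raise spend to {self.spent + eps:.3f}, "
                f"over the global budget {self.global_eps:.3f}"
            )
        self.entries.append((stage, eps))


@dataclass(frozen=True)
class Stage:
    name: str
    eps: float  # 0.0 for stages that only post-process already-private output
    run: Callable[[dict], dict]


def run_pipeline(stages: list[Stage], artifact: dict, ledger: Ledger) -> dict:
    """Run stages in order, charging the ledger before each stage executes."""
    for stage in stages:
        ledger.charge(stage.name, stage.eps)  # overspend halts before any work is done
        artifact = stage.run(artifact)
    return artifact
```

Charging before execution means a stage that would overspend never touches the data, and the ledger's entries double as the audit trail for the release.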

Step-by-Step Implementation

Step 1 — Resolve the CRS to a metric projection

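Reproject to a projected, metre-based CRS before any noise is added, so sensitivity and noise share units. No projection preserves distance everywhere, so use a local UTM zone or an equal-area projection whose distortion is small over the declared extent, rather than working in degrees. The EPSG:32610 default below is WGS 84 / UTM zone 10N and serves only as a placeholder for the value the data contract supplies.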

python
import geopandas as gpd
from pyproj import CRS

def to_metric_crs(gdf: gpd.GeoDataFrame, target: str = "EPSG:32610") -> gpd.GeoDataFrame:
    """Project to a metre-based CRS so DP sensitivity is calibrated in metres."""
    if gdf.crs is None:
        raise ValueError("Input geometry has no CRS; refuse to guess.")
    metric = CRS.from_user_input(target)
    if not metric.is_projected:
        raise ValueError(f"{target} is not a projected CRS; noise units would be degrees.")
    return gdf.to_crs(metric)
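When the contract does not yet name a target projection, the standard UTM zone for a reference point can be derived from its longitude and hemisphere. The sketch below is one way to propose a value for the contract; `utm_epsg` is a hypothetical helper, it ignores the special zone boundaries around Norway and Svalbard, and the reference point should be a public property of the study area rather than something computed from the sensitive records.

```python
import math


def utm_epsg(lon_deg: float, lat_deg: float) -> int:
    """EPSG code of the standard WGS 84 / UTM zone containing a point.

    Simplification: the Norway and Svalbard zone exceptions are not handled.
    """
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be in [-180, 180]")
    if not -80.0 <= lat_deg <= 84.0:
        raise ValueError("UTM is defined only between 80S and 84N")
    # Zones are 6 degrees wide, numbered 1..60 eastward from 180W.
    zone = min(int(math.floor((lon_deg + 180.0) / 6.0)) + 1, 60)
    base = 32600 if lat_deg >= 0 else 32700  # 326xx north, 327xx south
    return base + zone
```

For a point near San Francisco (longitude -122.4, latitude 37.8) this yields 32610, the same zone used as the default in `to_metric_crs`.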

Step 2 — Allocate and account for the privacy budget

Split the global budget across stages and track composition with an RDP accountant. The allocation, not the mechanism, is what most teams get wrong.

python
from dataclasses import dataclass

@dataclass(frozen=True)
class BudgetPlan:
    coord_eps: float       # spatial perturbation
    attr_eps: float        # attribute synthesis
    delta: float           # shared failure probability

    @property
    def total_eps(self) -> float:
        # Sequential upper bound; the RDP accountant returns a tighter value.
        return self.coord_eps + self.attr_eps

def assert_within_global(plan: BudgetPlan, global_eps: float) -> None:
    if plan.total_eps > global_eps:
        raise ValueError(
            f"Allocated ε={plan.total_eps:.3f} exceeds global budget {global_eps:.3f}"
        )
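Since the allocation is where most mistakes happen, it can help to derive per-stage budgets from explicit weights instead of typing each ε by hand. The following is a minimal sketch; `split_budget` and the stage names are hypothetical, and the 3:1 weighting in the usage note is an arbitrary example, not a recommendation.

```python
import math


def split_budget(global_eps: float, weights: dict[str, float]) -> dict[str, float]:
    """Split a global epsilon across stages in proportion to positive weights."""
    if global_eps <= 0:
        raise ValueError("global_eps must be positive")
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("every stage needs a positive weight")
    total = math.fsum(weights.values())
    return {name: global_eps * w / total for name, w in weights.items()}
```

For example, `split_budget(1.0, {"coord": 3.0, "attr": 1.0})` gives the coordinate stage 0.75 and the attribute stage 0.25, values that can then be passed to `BudgetPlan` as `coord_eps` and `attr_eps`.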

Step 3 — Perturb coordinates with bounded sensitivity

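Clip per-record influence to a fixed radius (this is the sensitivity, passed in as clip_radius_m), then add Gaussian noise scaled to the allocated coord_eps. Clamp the result back inside the declared extent; the extent must come from the data contract, because bounds computed from the data itself would leak its true extremes.

python
import geopandas as gpd
import numpy as np

def perturb_points(
    gdf: gpd.GeoDataFrame, eps: float, delta: float, clip_radius_m: float, seed: int,
    extent: tuple[float, float, float, float],
) -> gpd.GeoDataFrame:
    """Gaussian mechanism on projected coordinates; clip_radius_m is the l2 sensitivity.

    extent is (minx, miny, maxx, maxy) in the projected CRS, taken from the data contract.
    """
    rng = np.random.default_rng(seed)  # seeded for audit replay
    sigma = clip_radius_m * np.sqrt(2 * np.log(1.25 / delta)) / eps
    coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
    noisy = coords + rng.normal(scale=sigma, size=coords.shape)  # one vectorized draw
    minx, miny, maxx, maxy = extent
    # Clamping is post-processing of the noisy output, so it spends no extra budget.
    noisy[:, 0] = np.clip(noisy[:, 0], minx, maxx)
    noisy[:, 1] = np.clip(noisy[:, 1], miny, maxy)
    out = gdf.copy()
    out["geometry"] = gpd.points_from_xy(noisy[:, 0], noisy[:, 1], crs=gdf.crs)
    return out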

Step 4 — Synthesize attributes conditioned on geometry

Regenerate non-spatial features with a DP mechanism conditioned on the perturbed layer, so joint distributions (e.g. land use given proximity to transit) survive. Histogram-based DP preserves marginals; private covariance or Bayesian-network sampling preserves the joints. Define which transformations are allowed in the scoping rules and data contracts so the synthesizer cannot invent correlations the contract forbids.

python
import numpy as np

def dp_histogram(
    values: np.ndarray, bins: int, value_range: tuple[float, float], eps: float, seed: int
) -> tuple[np.ndarray, np.ndarray]:
    """Differentially private marginal via the Laplace mechanism (sensitivity = 1)."""
    rng = np.random.default_rng(seed)
    # value_range is a public bound from the contract; data-derived edges would leak min/max.
    counts, edges = np.histogram(values, bins=bins, range=value_range)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / eps, size=counts.shape)
    return np.clip(noisy, 0, None), edges  # negative counts are non-physical

Step 5 — Repair topology after perturbation

Independent noise breaks shared boundaries. Snap to a tolerance grid, fix ring orientation, and validate against OGC Simple Features before scoring.

python
from shapely import set_precision, make_valid

def repair(gdf: gpd.GeoDataFrame, grid_size_m: float = 0.01) -> gpd.GeoDataFrame:
    """Snap to a precision grid then make geometries OGC-valid."""
    out = gdf.copy()
    out["geometry"] = out.geometry.apply(
        lambda g: make_valid(set_precision(g, grid_size_m))
    )
    invalid = (~out.geometry.is_valid).sum()
    if invalid:
        raise ValueError(f"{invalid} geometries still invalid after repair")
    return out

Validation & Testing

Validation is a gate, not a report. Every run asserts, in order, that the geometry is OGC-valid, that the composed budget is within bound, and that utility clears its floor — halting on the first failure so a topology bug never masquerades as a statistics problem.

python
def gate(gdf, accountant, global_eps, delta, ks_threshold, ks_stat):
    """Halt on the first failed check; explicit raises survive `python -O`, asserts do not."""
    # 1. Geometry first: a correctness bug, not a tuning issue.
    if not gdf.geometry.is_valid.all():
        raise ValueError("invalid geometry reached the gate")
    # 2. Composed privacy budget (RDP -> (eps, delta)).
    eps = accountant.get_epsilon(delta=delta)
    if eps > global_eps:
        raise ValueError(f"composed ε={eps:.3f} exceeds {global_eps}")
    # 3. Utility floor: distributional fidelity vs. the reference.
    if ks_stat > ks_threshold:
        raise ValueError(f"KS={ks_stat:.3f} over {ks_threshold}")

For attribute synthesis, flag any feature whose Kolmogorov–Smirnov or Wasserstein distance from the source exceeds a versioned threshold — but widen the band by the noise the privacy mechanism is expected to add, or you will reject correctly-private output. The full scoring methodology, including spatial-autocorrelation and connectivity checks, is owned by the realism metrics and evaluation layer, and these gates wire into the broader CI/CD integration for spatial data workflow so a budget miscalculation or a geometry failure produces a hard stop before promotion.
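One rough way to widen the band for a Laplace-noised histogram is to estimate how much the noise alone can move the empirical CDF. Each bin receives Laplace noise of scale 1/ε, whose variance is 2/ε², so the cumulative count over all k bins carries noise with standard deviation √(2k)/ε. The sketch below turns that into an additive allowance on a KS threshold. It is a heuristic, not a bound: `widened_ks_threshold` is a hypothetical helper, and the multiplier `z` is an assumption to be tuned and versioned alongside the threshold.

```python
import math


def widened_ks_threshold(
    base_threshold: float, n_records: int, n_bins: int, eps: float, z: float = 3.0
) -> float:
    """Base KS threshold plus a heuristic allowance for Laplace histogram noise."""
    if n_records <= 0 or n_bins <= 0 or eps <= 0:
        raise ValueError("n_records, n_bins and eps must be positive")
    # Std of the summed Laplace(1/eps) noise over n_bins bins, in counts.
    noise_std_counts = math.sqrt(2.0 * n_bins) / eps
    # Express the noise as a fraction of the records, i.e. on the CDF scale.
    allowance = z * noise_std_counts / n_records
    return base_threshold + allowance
```

With 10,000 records, 20 bins, and ε = 0.5, a base threshold of 0.05 widens by less than 0.004; the allowance grows as ε shrinks or the dataset gets smaller, which is the behaviour the gate needs.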

Performance & Scale Considerations

Privacy accounting and topology repair are the two cost centres at scale. The accountant must be thread-safe: when perturbation runs across a distributed cluster, every worker charges the same ledger, so the composition step has to serialize budget updates or accumulate RDP moments locally and reduce once at the end. The reduce pattern is preferable — it avoids a lock on the hot path and is associative, so partial sums combine correctly regardless of worker order.
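The reduce pattern can be sketched with plain tuples: each worker keeps one RDP value per order on a shared grid, the partials are summed element-wise, and the conversion to (ε, δ) happens once. The names and the grid of orders below are illustrative assumptions, not an accountant library's API. Summing is the sequential-composition rule and is always a valid upper bound; workers that process strictly disjoint records may be entitled to a tighter parallel-composition argument, which this sketch does not attempt.

```python
import math
from functools import reduce

ALPHAS = tuple(range(2, 65))  # shared grid of RDP orders (assumption)

RdpVector = tuple[float, ...]


def merge_rdp(a: RdpVector, b: RdpVector) -> RdpVector:
    """Combine two partial RDP vectors; addition is associative and commutative."""
    if len(a) != len(b):
        raise ValueError("RDP vectors must share the same grid of orders")
    return tuple(x + y for x, y in zip(a, b))


def reduce_workers(partials: list[RdpVector]) -> RdpVector:
    """Fold per-worker RDP vectors into one, starting from zero spend."""
    zero = tuple(0.0 for _ in ALPHAS)
    return reduce(merge_rdp, partials, zero)


def to_epsilon(rdp: RdpVector, delta: float) -> float:
    """Convert a composed RDP vector to epsilon at the given delta."""
    return min(r + math.log(1.0 / delta) / (a - 1) for a, r in zip(ALPHAS, rdp))
```

Because the merge is a plain sum, the result is the same up to floating-point rounding whatever order the workers report in, so no lock is needed while they run.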

Topology repair dominates wall-clock time on dense polygon layers because make_valid is superlinear in vertex count. Partition by a spatial index (R-tree or a coarse grid) and repair tiles independently, but overlap tile boundaries by at least the clip radius so shared edges are reconciled rather than re-broken at the seam. Serialize intermediate noise-injected states to GeoParquet with predicate pushdown so a failed run resumes from the last valid stage instead of regenerating from ingestion. Pre-allocate coordinate arrays and operate on the NumPy block rather than per-Point Python objects; the vectorized noise draw in Step 3 is the difference between minutes and hours on a multi-million-feature grid.
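The overlapping-tile partition can be sketched with bounding boxes alone. The helper below is hypothetical: it lays a regular grid over the declared extent and expands every tile by an overlap margin, clamped to the extent. In line with the text, `overlap_m` would be set to at least the clip radius.

```python
import math

Bounds = tuple[float, float, float, float]  # (minx, miny, maxx, maxy) in metres


def overlapping_tiles(extent: Bounds, tile_size_m: float, overlap_m: float) -> list[Bounds]:
    """Regular grid of tiles over extent, each grown by overlap_m on every side."""
    minx, miny, maxx, maxy = extent
    if maxx <= minx or maxy <= miny:
        raise ValueError("extent must have positive width and height")
    if tile_size_m <= 0 or overlap_m < 0:
        raise ValueError("tile_size_m must be positive and overlap_m non-negative")
    nx = math.ceil((maxx - minx) / tile_size_m)
    ny = math.ceil((maxy - miny) / tile_size_m)
    tiles = []
    for j in range(ny):
        for i in range(nx):
            x0 = minx + i * tile_size_m
            y0 = miny + j * tile_size_m
            tiles.append((
                max(minx, x0 - overlap_m),  # grow left, but stay inside the extent
                max(miny, y0 - overlap_m),
                min(maxx, x0 + tile_size_m + overlap_m),
                min(maxy, y0 + tile_size_m + overlap_m),
            ))
    return tiles
```

Each tile's bounds can then drive a spatial-index query, with features in the overlap band repaired in both neighbouring tiles and reconciled afterwards.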

Failure Modes & Troubleshooting

Utility collapses after perturbation in dense urban regions

A single clip radius applied globally over-suppresses high-density areas where features are tightly packed. The fix is density-aware sensitivity: scale the clip radius to local feature spacing (e.g. derive it from the contract's minimum-separation distance) so urban cores and sparse rural networks are not held to the same metre budget.
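A density-aware radius can be as simple as a clamped multiple of the contract's minimum-separation distance for each zone. The sketch below is an assumption-heavy illustration: `clip_radius_for`, the multiplier, and the floor and ceiling values are all hypothetical. What matters is that the spacing is read from the contract, a public input, and never measured from the sensitive records, since a data-derived sensitivity would itself leak information.

```python
def clip_radius_for(
    min_separation_m: float,
    multiplier: float = 2.0,
    floor_m: float = 5.0,
    ceiling_m: float = 500.0,
) -> float:
    """Clip radius in metres, scaled to contract-declared feature spacing."""
    if min_separation_m <= 0 or multiplier <= 0:
        raise ValueError("min_separation_m and multiplier must be positive")
    if floor_m > ceiling_m:
        raise ValueError("floor_m cannot exceed ceiling_m")
    # Clamp so that no zone gets a vanishing or an unbounded sensitivity.
    return min(max(multiplier * min_separation_m, floor_m), ceiling_m)
```

With these example defaults, a zone declared at 10 m spacing gets a 20 m radius, while a zone declared at 400 m spacing is capped at 500 m.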

Geometries land outside valid projection bounds or wrap the antimeridian

Unbounded Gaussian noise pushes points past the CRS extent. Clip every perturbed coordinate back inside the declared envelope after the noise draw, and run perturbation in a local projected CRS — never in EPSG:4326 — so the extent check is a simple metric bounding box rather than a wraparound special case.

The composed ε is far larger than any single stage's budget

This is sequential composition charging you the sum of per-stage budgets. Switch to a Rényi or zero-concentrated DP accountant that tracks privacy-loss moments and converts to a tight (ε, δ) once at the end. Verify that every stochastic stage actually reports to the shared accountant — an un-instrumented stage spends budget invisibly.

Topology repair changes between runs, breaking reproducibility

Non-deterministic iteration order in geometry operations and unseeded RNGs produce different repairs each run. Seed every NumPy Generator from the config registry, pin Shapely/GEOS versions, and fix iteration order before repair so audits replay byte-identically.
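A minimal sketch of both fixes, assuming the root seed is read from the config registry and that each feature carries a stable identifier: `stage_rngs` spawns one independent NumPy Generator per stage from a single `SeedSequence`, and `stable_order` returns a deterministic processing order. Both helper names are hypothetical.

```python
import numpy as np


def stage_rngs(root_seed: int, stage_names: list[str]) -> dict[str, np.random.Generator]:
    """One independent, reproducible Generator per stage from a single root seed."""
    if len(set(stage_names)) != len(stage_names):
        raise ValueError("stage names must be unique")
    names = sorted(stage_names)  # sort so the caller's list order cannot change the streams
    children = np.random.SeedSequence(root_seed).spawn(len(names))
    return {name: np.random.default_rng(child) for name, child in zip(names, children)}


def stable_order(feature_ids: np.ndarray) -> np.ndarray:
    """Indices that visit features in a fixed order, independent of load order."""
    return np.argsort(feature_ids, kind="stable")
```

Sorting the stage names before spawning means the stream a stage receives depends only on the root seed and the set of stage names. Adding or renaming a stage does change the assignment, so the stage list belongs in the versioned config as well.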

Attribute validation rejects output that is actually correct

Static KS/Wasserstein thresholds do not account for the distortion the privacy mechanism is designed to add. Calibrate tolerance bands as a function of the allocated attribute budget, so the gate distinguishes a miscalibrated generator from the expected, bounded effect of differential privacy.

Implementing Differential Privacy for Coordinate Generation — mechanism calibration, sensitivity derivation, and projection-aware noise.
Scoping Rules & Data Contracts — the CRS, extent, and correlation contract this framework inherits.
Realism Metrics & Evaluation — the utility scoring that gates a privacy release.
CI/CD Integration for Spatial Data — wiring budget and topology gates into automated promotion.
Synthetic Spatial Data Architecture & Fundamentals — the reference architecture this page sits within.