Generating Urban Point Patterns Using Poisson Processes

Synthetic urban points generated by clipping a uniform draw to a city polygon exhibit edge-depressed density and complete spatial randomness. This page shows how to replace that approach with a calibrated, edge-corrected inhomogeneous Poisson process that survives second-order validation.

Part of Point Process Simulation Models. Where the parent area covers the full family of point processes (Poisson, Cox, Gibbs), this page resolves the specific task of synthesizing urban event patterns (store locations, incident reports, sensor hits) across a non-convex municipal boundary, along with the two failure modes that ruin them: an uncalibrated intensity surface and uncorrected boundary artifacts.

Root Cause: Why Naive Sampling Fails on Urban Extents

Urban environments are non-stationary. A homogeneous Poisson process, which scatters points at a constant rate, cannot capture the density gradients that follow zoning boundaries, transit corridors, and commercial cores. Two distinct failures follow from ignoring that.

The first is distributional mismatch. A constant-rate process produces complete spatial randomness; a model trained on it learns a flat spatial prior that does not exist and degrades on real queries. The fix is to sample from an inhomogeneous Poisson process whose intensity function $\lambda(x, y)$ varies continuously across the domain $W$:

$$\lambda(x, y) = \exp\big(\beta_0 + \beta_1 f_1(x, y) + \beta_2 f_2(x, y)\big)$$

The log-link keeps $\lambda$ non-negative under arbitrary covariate scaling, and the surface is normalized so that $\iint_W \lambda(x, y)\, dx\, dy = \mu$, the target point count. The covariates $f_i$ are derived from census, points-of-interest, or mobility telemetry via kernel density estimation, which is the same machinery that the sibling density mapping and heat generation workflow uses to turn raw points into continuous surfaces.
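As a minimal sketch of the log-linear surface and its normalization, the snippet below evaluates $\lambda$ on a raster and rescales it so a Riemann sum over the cells equals $\mu$. The covariate rasters are random placeholders and the coefficients, grid size, and function name are hypothetical; real covariates would come from the KDE step described above.

```python
import numpy as np


def normalized_log_linear_intensity(f1, f2, betas, cell_area, mu):
    """Evaluate exp(b0 + b1*f1 + b2*f2) per cell, rescaled to integrate to mu."""
    b0, b1, b2 = betas
    lam = np.exp(b0 + b1 * f1 + b2 * f2)   # log-link: every cell is strictly positive
    integral = lam.sum() * cell_area       # Riemann-sum estimate of the integral over W
    return lam * (mu / integral)           # same as shifting b0 by log(mu / integral)


# Hypothetical covariates on a 50 x 50 grid of 20 m cells (placeholders, not real data).
rng = np.random.default_rng(0)
f1 = rng.normal(size=(50, 50))
f2 = rng.normal(size=(50, 50))
cell_area = 20.0 * 20.0

lam = normalized_log_linear_intensity(f1, f2, (-8.0, 0.6, -0.3), cell_area, mu=5000.0)
total = lam.sum() * cell_area              # recovers the target count mu
```

Because the rescaling is multiplicative, it only changes the intercept $\beta_0$; the fitted covariate effects $\beta_1$ and $\beta_2$ are untouched.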

The second failure is edge bias from box-then-clip. Generating candidates inside an axis-aligned bounding box and clipping to the municipal polygon systematically depresses density near the boundary and distorts the nearest-neighbour distribution, because points just outside the polygon that should have neighbours inside it are never generated. Urban administrative boundaries are rarely convex, so this artifact is severe along every concave inlet of the city edge.

A third, quieter failure underpins both: silent non-reproducibility. A Poisson process is stochastic, so without a propagated seed, a pinned toolchain, and a fixed coordinate reference system, two runs diverge — breaking regression tests and making privacy attestations unverifiable. The CRS contract enforcement that governs the rest of the pipeline must be honoured here before a single point is drawn: validate in a projected, equal-area metric CRS and reproject to EPSG:4326 only for storage.

Prerequisite Check: Detecting the Edge Artifact

Before fixing anything, confirm the boundary bias exists in the naive approach. Pin the toolchain to the versions used across this site so RNG streams and geometry predicates behave identically.

python
# requirements.txt: pin major.minor, let patch releases float
numpy==1.26.*
scipy==1.11.*
shapely==2.0.*          # GEOS-backed predicates
pyproj==3.6.*           # PROJ 9.x transforms
geopandas==0.14.*       # clip, spatial joins, GeoDataFrame I/O
pointpats==2.4.*        # Ripley's K/L, G-function, MC envelopes
scikit-learn            # KernelDensity for the intensity surface; pin to your tested version too

python
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

# A non-convex municipal polygon in a projected, equal-area metric CRS.
city = gpd.read_file("city_boundary.gpkg").to_crs("EPSG:3035")
poly = city.unary_union  # union_all() only exists from GeoPandas 1.0; the 0.14 pin exposes unary_union
rng = np.random.default_rng(20260226)

# NAIVE: draw uniformly inside the bounding box, then clip.
minx, miny, maxx, maxy = poly.bounds
n = 5000
xs = rng.uniform(minx, maxx, n)
ys = rng.uniform(miny, maxy, n)
pts = gpd.GeoSeries([Point(x, y) for x, y in zip(xs, ys)], crs=city.crs)
inside = pts[pts.within(poly)]

# Symptom: mean distance from retained points to the boundary is biased high,
# because the concave edge band is under-populated.
edge_gap = inside.distance(poly.boundary).mean()
print(f"retained={len(inside)}  mean dist-to-edge={edge_gap:.1f} m")

If you compare edge_gap against the same statistic computed on real points clipped to the same polygon, the naive draw reports a noticeably larger mean distance-to-edge: the boundary band is starved.

Fix: Calibrated Intensity Plus Buffer-and-Clip Realization

The complete solution builds a normalized intensity surface, realizes it by rejection-thinning a homogeneous process, and corrects the edge by generating across a buffered domain before clipping strictly to the polygon.

python
import numpy as np
import geopandas as gpd
import shapely
from sklearn.neighbors import KernelDensity


def build_intensity(source_xy: np.ndarray, poly, bbox, bandwidth: float, mu: float):
    """Return (lam, lam_max) where lam(xy) integrates to mu over poly.

    source_xy : real coordinates used only to shape the surface, never copied out.
    poly      : target polygon; the surface is normalized over it, not over bbox.
    bbox      : bounds of the buffered domain on which candidates are drawn.
    bandwidth : KDE bandwidth in CRS units; tune so KL(empirical || model) < 0.15.
    """
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(source_xy)
    minx, miny, maxx, maxy = bbox

    # Evaluate the KDE on a fixed-origin grid (10-25 m for dense cores).
    step = 20.0
    gx, gy = np.meshgrid(
        np.arange(minx, maxx, step), np.arange(miny, maxy, step)
    )
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    dens = np.exp(kde.score_samples(grid))

    # Integrate over the polygon only. Points outside poly are clipped away
    # later, so normalizing over the whole box would leave the count short of mu.
    in_poly = shapely.contains_xy(poly, grid[:, 0], grid[:, 1])
    integral = dens[in_poly].sum() * step * step
    scale = mu / integral  # rescale so the surface integrates to the target count

    def lam(xy: np.ndarray) -> np.ndarray:
        return np.exp(kde.score_samples(xy)) * scale

    # Grid maximum over the buffered box; it bounds lam only up to grid resolution.
    lam_max = float(dens.max() * scale)
    return lam, lam_max


def generate_urban_points(poly, source_xy, mu, bandwidth, seed=20260226,
                          nn_radius=400.0):
    """Inhomogeneous Poisson realization with buffer-and-clip edge correction."""
    rng = np.random.default_rng(seed)

    # 1. Extend the domain by the max expected nearest-neighbour distance so the
    #    boundary band is fully populated before clipping (no edge-depressed density).
    buffered = poly.buffer(nn_radius)
    minx, miny, maxx, maxy = buffered.bounds
    area = (maxx - minx) * (maxy - miny)

    lam, lam_max = build_intensity(
        source_xy, poly, (minx, miny, maxx, maxy), bandwidth, mu
    )

    # 2. Homogeneous candidates at lam_max over the buffered box (Poisson count).
    n_cand = rng.poisson(lam_max * area)
    cand = np.column_stack([
        rng.uniform(minx, maxx, n_cand),
        rng.uniform(miny, maxy, n_cand),
    ])

    # 3. Rejection-thinning: keep each candidate with prob lambda(x,y) / lam_max.
    keep = rng.uniform(size=n_cand) < (lam(cand) / lam_max)
    thinned = cand[keep]

    # 4. Clip STRICTLY to the target polygon (not the buffer) for clean edges.
    gs = gpd.GeoSeries(gpd.points_from_xy(thinned[:, 0], thinned[:, 1]), crs="EPSG:3035")
    return gs[gs.within(poly)].reset_index(drop=True)

The buffer radius nn_radius should equal the maximum expected nearest-neighbour distance (typically 200–500 m for dense urban cores). Clipping to poly rather than buffered discards the helper band but keeps the neighbours it contributed, preserving local density right up to the boundary. Topologically robust clipping for heterogeneous geometries can also be delegated to GeoPandas clip; bandwidth selection follows scikit-learn’s Kernel Density Estimation.

Verification: Count Tolerance and an L-Function Gate

Two assertions belong in CI. First, the realized count is itself Poisson with mean and variance $\mu$, so demand only that it falls within roughly $3\sqrt{\mu}$ of the target; an exact-count assertion is statistically wrong. Second, the pattern must clear a second-order check: Ripley’s $L$-function must stay inside a Monte Carlo envelope, confirming no unintended clustering or regularity was introduced.

python
import numpy as np
from pointpats import l_test


def test_urban_point_pattern(poly, source_xy):
    mu = 5000.0
    pts = generate_urban_points(poly, source_xy, mu=mu, bandwidth=150.0, seed=20260226)

    # 1. First-order: count within ~3 sigma of the target (sigma = sqrt(mu)).
    n = len(pts)
    assert abs(n - mu) <= 3.0 * np.sqrt(mu), f"count {n} drifted from mu={mu:.0f}"

    # 2. Reproducibility: identical seed -> byte-identical coordinates.
    again = generate_urban_points(poly, source_xy, mu=mu, bandwidth=150.0, seed=20260226)
    assert np.array_equal(pts.get_coordinates().to_numpy(),
                          again.get_coordinates().to_numpy())

    # 3. Second-order: observed L stays inside a 999-sim CSR envelope for r > 500 m.
    #    l_test (not k_test) matches the L-function named in the assertion, and the
    #    check is two-sided so both clustering and regularity are caught.
    coords = pts.get_coordinates().to_numpy()
    result = l_test(coords, keep_simulations=True, n_simulations=999)
    sims = result.simulations
    breach = (result.statistic > sims.max(axis=0)) | (result.statistic < sims.min(axis=0))
    assert not breach[result.support > 500].any(), "L-function escaped the envelope"

Log every hyperparameter — bandwidth, buffer radius, grid resolution, seed — to a versioned manifest so a failing run is reproducible. The same Monte Carlo envelope discipline and the Wasserstein-distance realism check used elsewhere confirm utility is preserved without preserving identity.
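A manifest of this kind can be sketched with the standard library alone. The field names, the hash field, and the `build_manifest` helper are assumptions for illustration, not part of any pipeline described here; the idea is only that a canonical JSON serialization makes the parameter set hashable and diffable.

```python
import hashlib
import json


def build_manifest(bandwidth, nn_radius, grid_step, seed, crs):
    """Collect run hyperparameters and fingerprint them with a stable hash."""
    params = {
        "bandwidth_m": bandwidth,
        "nn_radius_m": nn_radius,
        "grid_step_m": grid_step,
        "seed": seed,
        "crs": crs,
    }
    # Sorted keys and fixed separators give a canonical byte string, so the
    # same parameters always produce the same digest.
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"params": params, "params_sha256": digest}


manifest = build_manifest(
    bandwidth=150.0, nn_radius=400.0, grid_step=20.0, seed=20260226, crs="EPSG:3035"
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)  # write next to the output
```

Storing the digest alongside each generated dataset lets a failing CI run be matched to the exact parameter set that produced it.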

Edge Cases and Gotchas

Non-convex boundaries with narrow inlets. When a city polygon has a slot narrower than nn_radius, the buffers of its opposite walls overlap and the helper band double-counts density into the slot. Detect it by checking whether poly.buffer(nn_radius) merges across the inlet (a valid buffer fills the gap rather than self-intersecting), and reduce the radius locally or split the inlet into its own tile before generating.
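One way to detect such inlets is a morphological closing: buffer outward, then inward by the same distance, and measure how much area gets filled in. This is a sketch under assumptions: the helper name, the choice of `nn_radius / 2` as the closing distance (so that gaps narrower than `nn_radius` are filled), and the U-shaped test polygon are all illustrative.

```python
from shapely.geometry import Polygon


def narrow_inlet_area(poly, nn_radius):
    """Area of inlets narrower than nn_radius, found by a morphological closing."""
    half = nn_radius / 2.0
    # Buffering out then back in by `half` fills any gap narrower than 2 * half.
    closed = poly.buffer(half).buffer(-half)
    return closed.difference(poly).area


# Hypothetical 100 x 100 m block with a 10 m wide slot cut in from the top edge.
u_shape = Polygon([
    (0, 0), (100, 0), (100, 100), (55, 100),
    (55, 20), (45, 20), (45, 100), (0, 100),
])

wide_radius_fill = narrow_inlet_area(u_shape, 30.0) / u_shape.area    # slot is closed
narrow_radius_fill = narrow_inlet_area(u_shape, 4.0) / u_shape.area   # slot stays open
```

A filled fraction well above the rounding introduced at concave corners indicates that the radius should be reduced locally or the inlet tiled separately; the threshold is a judgment call for the dataset at hand.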

Reprojection to EPSG:4326 for storage. Intensity, Ripley’s $K$, and the $G$-function are all distance- and area-sensitive. In geographic degrees, a degree of longitude spans a far shorter ground distance near the poles than near the equator, so distances and areas computed directly on degrees corrupt every statistic. Always calibrate and validate in a projected equal-area CRS such as EPSG:3035, and reproject to EPSG:4326 only at serialization.

Memory at $\mu > 10^6$. Do not materialize the full intensity raster in RAM. Partition the extent with an H3 or quadtree index, seed each tile deterministically from one base seed, and stream tile-level intensity with memory-mapped arrays, the orchestration pattern covered in async execution for large grids. Independent tiles then run in parallel while the merged field stays byte-identical for a given base seed.
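Deterministic per-tile seeding can be sketched with NumPy's sequence-of-integers seeding, which derives independent streams from a tile identifier plus one base seed. The integer tile ids and helper names below are assumptions; an H3 or quadtree cell index would need to be mapped to a non-negative integer first.

```python
import numpy as np


def tile_rng(base_seed: int, tile_id: int) -> np.random.Generator:
    """Independent, reproducible stream for one tile."""
    # Seeding with [tile_id, base_seed] mixes both integers into the stream,
    # so a tile's draws do not depend on the order in which tiles are processed.
    return np.random.default_rng([tile_id, base_seed])


def draw_tile(base_seed: int, tile_id: int, n: int) -> np.ndarray:
    """Draw n placeholder coordinates in the unit square for one tile."""
    return tile_rng(base_seed, tile_id).uniform(size=(n, 2))


# Processing tiles in a different order yields the same per-tile output.
forward = [draw_tile(20260226, t, 3) for t in (0, 1, 2)]
backward = [draw_tile(20260226, t, 3) for t in (2, 1, 0)][::-1]
```

Sorting the merged output by tile id before writing keeps the combined file stable regardless of which worker finishes first.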

Frequently Asked Questions

Why generate across a buffered domain instead of just clipping a bounding-box draw?

Because box-then-clip starves the boundary band: points outside the polygon that would have been a clipped point’s nearest neighbours are never generated, so local density and the nearest-neighbour distribution are biased near every edge. Extending the domain by the maximum nearest-neighbour distance, generating there, then clipping strictly to the polygon restores unbiased local density right up to a non-convex boundary.

How do I size the buffer radius?

Set it to the maximum expected nearest-neighbour distance — typically 200–500 m for dense urban cores. Too small and the edge correction is incomplete; too large and you waste computation generating candidates that are always clipped. If you have real reference points, take the 99th percentile of their nearest-neighbour distances as the radius.
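The percentile rule can be sketched as follows, assuming reference coordinates in a projected metric CRS. The function name is hypothetical, and the brute-force distance matrix is only suitable for small reference sets; a KD-tree such as `scipy.spatial.cKDTree` is the usual choice at scale.

```python
import numpy as np


def nn_radius_from_reference(xy: np.ndarray, q: float = 99.0) -> float:
    """q-th percentile of nearest-neighbour distances among reference points."""
    diff = xy[:, None, :] - xy[None, :, :]        # pairwise offsets, shape (n, n, 2)
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbour
    nearest = dist.min(axis=1)
    return float(np.percentile(nearest, q))


# Hypothetical reference set: a regular lattice with 100 m spacing.
lattice = np.array(
    [(x, y) for x in range(0, 500, 100) for y in range(0, 500, 100)], dtype=float
)
radius = nn_radius_from_reference(lattice)
```

On real data the percentile guards against a handful of isolated points inflating the buffer, which is the wasted-computation case described above.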

Why not assert an exact synthetic point count?

A Poisson process count is random with mean and variance both equal to the target $\mu$, so an exact-count assertion fails on correct output. Assert that the realized $n$ lies within roughly $3\sqrt{\mu}$ of $\mu$; a consistent breach signals an un-normalized intensity surface or a clipping bias rather than ordinary stochastic variation.
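As a worked illustration of the tolerance, the sketch below computes the band for $\mu = 5000$, where $\sqrt{\mu} \approx 70.7$ and the band is roughly $\pm 212$ points. The helper name and the default multiplier are assumptions.

```python
import math


def count_within_tolerance(n: int, mu: float, k: float = 3.0) -> bool:
    """True if the realized count lies within k standard deviations of mu."""
    return abs(n - mu) <= k * math.sqrt(mu)   # Poisson: variance equals the mean


mu = 5000.0
half_width = 3.0 * math.sqrt(mu)              # about 212 points either side of mu

ok_run = count_within_tolerance(5200, mu)     # 200 away from the target: accepted
bad_run = count_within_tolerance(5300, mu)    # 300 away from the target: rejected
```

Under the normal approximation a correct generator still falls outside a three-sigma band in roughly 0.3% of independent runs, so with a fixed seed the check is deterministic, while across random seeds an occasional failure is expected.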

Do urban point patterns still need privacy filtering?

Yes. If the intensity surface is bound too tightly to the real data, individual source records dominate cells and become recoverable by nearest-neighbour attack. Inject a calibrated noise budget into the surface before realization and bound leakage with a declared $(\varepsilon, \delta)$ budget — the differential privacy mechanisms applied across the rest of the pipeline — plus spatial jitter of about $\pm 15$ m inside high-sensitivity zones.
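The jitter step might look like the sketch below. Reading "about $\pm 15$ m" as an independent uniform offset per axis is an assumption, as are the function name and the boolean zone mask; jitter by itself carries no formal $(\varepsilon, \delta)$ guarantee and only complements the noise injected into the surface.

```python
import numpy as np


def jitter_sensitive(xy, sensitive, max_offset=15.0, seed=20260226):
    """Displace only the points flagged as sensitive by up to max_offset per axis."""
    rng = np.random.default_rng(seed)
    out = np.array(xy, dtype=float)               # copy; the input is left untouched
    n_sensitive = int(np.count_nonzero(sensitive))
    # Uniform offsets in [-max_offset, max_offset) metres on x and y independently.
    out[sensitive] += rng.uniform(-max_offset, max_offset, size=(n_sensitive, 2))
    return out


# Hypothetical points in a projected metric CRS; every other one is in a sensitive zone.
points = np.zeros((6, 2))
in_zone = np.array([True, False, True, False, True, False])
jittered = jitter_sensitive(points, in_zone)
```

The offsets are applied in the projected CRS so that 15 means metres; applying the same numbers to EPSG:4326 coordinates would displace points by degrees.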

Point Process Simulation Models — the parent area: Poisson, Cox, and Gibbs processes, intensity contracts, and statistical validation.
Optimizing Voronoi Tessellation for Synthetic Zoning Maps — turn generated urban points into gap-free, sliver-free zones.
Scaling Density-Based Spatial Generation with Dask — distribute KDE and intensity rasterization across continental extents without task-graph blowup.
Evaluating Spatial Realism with Wasserstein Distance — confirm utility is preserved without preserving identity.