Point Process Simulation Models for Synthetic Spatial Data

Point process simulation models are the mechanism that turns a target intensity into a fresh, statistically grounded set of coordinates — without copying a single real location. Part of Spatial Distribution & Pattern Generation, this page covers the specific sub-problem of synthesizing discrete event patterns (store locations, incident reports, sensor hits) that reproduce the first- and second-order structure of a source distribution while remaining reproducible, topologically clean, and privacy-auditable. Where the parent area frames the fidelity-versus-privacy trade-off across all spatial primitives, here we make it concrete for points: the intensity surface $\lambda(x, y)$ is the contract, and every downstream gate tests how faithfully a realization honours it.

Problem Framing

The failure that point process models exist to prevent is the plausible-but-wrong synthetic dataset: a point cloud that looks reasonable on a basemap but silently violates the spatial statistics a downstream model depends on. Two engineering failures dominate.

The first is distributional drift from naive sampling. Drawing points uniformly inside a bounding box and clipping to a polygon produces complete spatial randomness, but real phenomena cluster around corridors, cores, and catchments. A model trained on uniform synthetic points learns a spatial prior that does not exist, and its accuracy collapses on real queries. The fix is to sample from a process whose intensity varies across space, validated against the same summary statistics that describe the source: Ripley’s $K$, the nearest-neighbour distribution, and the pair-correlation function.

The second is silent non-reproducibility. A point process is stochastic by definition, so without a propagated seed, a pinned toolchain, and a fixed coordinate reference system, two runs of the “same” pipeline produce different coordinates and different statistics. That breaks regression testing, makes privacy attestations unverifiable, and turns every audit into a forensic exercise. Treating the seed and the intensity raster as versioned, first-class artifacts is the discipline that closes this gap — the same CRS contract enforcement that governs the rest of the pipeline applies here before a single point is drawn.

Prerequisites & Toolchain

Point process work sits on the standard geospatial Python stack plus a spatial-statistics library for the validation step. Pin the major.minor series used across this site so that thinning probabilities, RNG streams, and geometry predicates behave identically across environments.

python
# requirements.txt: pin major.minor; let patch releases float
numpy==1.26.*
scipy==1.11.*
shapely==2.0.*          # GEOS-backed predicates, vectorized ops
pyproj==3.6.*           # CRS transforms (PROJ 9.x under the hood)
geopandas==0.14.*       # GeoDataFrame I/O, clip, spatial joins
pointpats==2.4.*        # Ripley's K/L, G-function, MC envelopes
gdal==3.8.*             # raster I/O for intensity surfaces

Two environment variables must be explicit, not inherited, or you will hit non-deterministic projection behaviour that masquerades as a statistics bug:

bash
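# Resolve PROJ data deterministically; never rely on the ambient default.
# PROJ 9.1+ reads PROJ_DATA; PROJ_LIB is the older name, kept here for pre-9.1 builds.
export PROJ_DATA="$(python -c 'import pyproj; print(pyproj.datadir.get_data_dir())')"
export PROJ_LIB="$PROJ_DATA"
# Fix str/bytes hashing so set iteration that feeds geometry order is stable.
# Must be set before the interpreter starts.
export PYTHONHASHSEED=0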

All sampling happens in a projected, metric CRS, never in geographic degrees. Intensities, distances, and Ripley’s $K$ are area- and distance-sensitive, so computing them in EPSG:4326 distorts every result by latitude. Reproject at ingestion and reproject back to EPSG:4326 only at serialization. An equal-area CRS such as EPSG:6933 keeps areas, and therefore counts per unit area, honest over large extents, but it stretches distances away from its standard parallels; for distance-based statistics over a city or region, a local UTM zone distorts less. This mirrors the projection discipline used in density mapping and heat generation, where the same areal distortion corrupts kernel bandwidths.
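
Picking the local UTM zone can be automated from a representative longitude and latitude. The helper below is a minimal sketch: `utm_epsg` is a hypothetical name, and it assumes the standard 6-degree zones on WGS 84 (EPSG 326xx north, 327xx south), ignoring the Norway and Svalbard exceptions.

```python
import math

def utm_epsg(lon_deg: float, lat_deg: float) -> int:
    """EPSG code of the WGS 84 / UTM zone containing a lon/lat point.

    Assumption: standard 6-degree zones only; the Norway and Svalbard
    exceptions and the polar regions outside UTM coverage are not handled.
    """
    if not (-180.0 <= lon_deg <= 180.0 and -80.0 <= lat_deg <= 84.0):
        raise ValueError("point lies outside the UTM coverage band")
    # Zones are numbered 1..60 eastward from 180 W; lon == 180 folds into zone 60.
    zone = min(int(math.floor((lon_deg + 180.0) / 6.0)) + 1, 60)
    return (32600 if lat_deg >= 0 else 32700) + zone
```

The returned integer can be formatted as `f"EPSG:{code}"` and passed to `to_crs`. For extents that straddle several zones, an equal-area CRS remains the safer default.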

Core Concept: Process Families and Rejection-Thinning

A point process is a probabilistic rule for where events land and how they relate. Three families cover almost every production need, ordered by how much spatial interaction they encode.

Homogeneous Poisson process (HPP). A constant intensity $\lambda$ across the domain $\Omega$; events are independent and the count in any region of area $A$ is Poisson-distributed with mean $\lambda A$. HPP is the null model: useful as a baseline and as the proposal distribution for thinning, but it rarely matches reality.
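
An HPP on a rectangle can be realized in two draws: a Poisson count with mean $\lambda A$, then that many uniform coordinates. The sketch below assumes a rectangular window in a metric CRS; `simulate_hpp` is an illustrative name, not part of any library.

```python
import numpy as np

def simulate_hpp(
    lam: float,                                   # events per m^2
    bounds: tuple[float, float, float, float],    # (minx, miny, maxx, maxy)
    seed: int,
) -> np.ndarray:
    """Draw a homogeneous Poisson pattern. Returns (n, 2) metric coordinates."""
    rng = np.random.default_rng(seed)
    minx, miny, maxx, maxy = bounds
    area = (maxx - minx) * (maxy - miny)
    n = rng.poisson(lam * area)                   # the count is itself random
    x = rng.uniform(minx, maxx, n)
    y = rng.uniform(miny, maxy, n)
    return np.column_stack([x, y])
```

Conditional on the count, the points are independent and uniform, which is what makes this pattern a convenient proposal for thinning.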

Inhomogeneous Poisson process (IPP). Intensity varies in space as $\lambda(x, y)$, so density follows covariates (road centrality, zoning, transit access) while events remain mutually independent. This is the workhorse for most synthetic point generation. The standard realization technique is rejection-thinning (Lewis-Shedler):

Compute $\lambda_{\max} = \max_{(x,y)\in\Omega} \lambda(x,y)$ over the discretized domain.
Generate a candidate set via HPP at intensity $\lambda_{\max}$.
Retain each candidate $(x_i, y_i)$ independently with probability $p_i = \lambda(x_i, y_i) / \lambda_{\max}$.
The retained points are an exact draw from the IPP with intensity $\lambda(x, y)$ .

Thinning cost scales linearly with $\lambda_{\max}$: the proposal draws $\lambda_{\max}\,|\Omega|$ candidates in expectation no matter how many survive, so a few sharp intensity spikes inflate the candidate count for the whole domain. Capping or smoothing the surface before generation is a throughput prerequisite, not a cosmetic step.
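
The cost of a spike can be estimated before generating anything. The sketch below computes the expected candidate count and acceptance rate for a raster, and shows one possible capping strategy: clip at a percentile and rescale so the integral is unchanged. Both function names and the 99th-percentile default are assumptions for illustration; capping redistributes mass away from the spike, so the realized pattern no longer follows the original surface there.

```python
import numpy as np

def thinning_stats(lam: np.ndarray, res_m: float) -> tuple[float, float]:
    """Return (expected candidates, expected acceptance rate) for rejection-thinning."""
    cell_area = res_m ** 2
    expected_kept = float(lam.sum() * cell_area)
    expected_candidates = float(lam.max() * lam.size * cell_area)
    return expected_candidates, expected_kept / expected_candidates

def cap_intensity(lam: np.ndarray, percentile: float = 99.0) -> np.ndarray:
    """Clip the surface at a percentile, then rescale to preserve its integral."""
    cap = np.percentile(lam, percentile)
    capped = np.minimum(lam, cap)
    return capped * (lam.sum() / capped.sum())

# A flat surface with one extreme cell: almost every candidate is rejected.
lam = np.ones((10, 10))
lam[0, 0] = 1000.0
before, acceptance = thinning_stats(lam, res_m=1.0)
after, _ = thinning_stats(cap_intensity(lam), res_m=1.0)
```

Comparing `before` and `after` gives a quick read on whether capping is worth the loss of fidelity at the spike.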

Clustered and inhibition processes. When events are not independent, two families apply. Cox processes (doubly stochastic Poisson processes, which include Neyman-Scott cluster models with Poisson offspring counts, such as the Thomas process) modulate $\lambda$ through a latent random field, producing realistic clustering around parent locations. Gibbs processes use an energy function to reward or penalize proximity, modelling inhibition such as a minimum spacing between cell towers. Cox and cluster processes can be simulated directly in two stages: draw the latent field or the parents, then draw a Poisson pattern conditional on it. Gibbs processes typically require Markov chain Monte Carlo (for example Metropolis-Hastings birth-death moves), which adds burn-in and convergence-diagnostic obligations that a plain Poisson draw does not have. Cluster processes are also what makes synthetic seeds realistic when they feed polygon tessellation algorithms, since uniform seeds produce artificially regular zones.
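
When a hard minimum spacing is all that is needed, a full Gibbs sampler may be more machinery than the job requires. The sketch below swaps in a different, simpler model: a Matérn type II hard-core process, obtained by dependent thinning of an HPP. It is not a Gibbs process and will not reproduce softer interaction energies. The function name and the brute-force pairwise distance matrix are assumptions suited to small patterns only.

```python
import numpy as np

def matern_ii(
    proposal_intensity: float,                   # HPP proposals per m^2
    min_dist_m: float,                           # hard-core distance, in meters
    bounds: tuple[float, float, float, float],   # (minx, miny, maxx, maxy)
    seed: int,
) -> np.ndarray:
    """Draw a Matérn type II hard-core pattern. Returns (n, 2) metric coordinates."""
    rng = np.random.default_rng(seed)
    minx, miny, maxx, maxy = bounds
    n = rng.poisson(proposal_intensity * (maxx - minx) * (maxy - miny))
    pts = np.column_stack([rng.uniform(minx, maxx, n), rng.uniform(miny, maxy, n)])
    marks = rng.uniform(0.0, 1.0, n)             # random "arrival times"

    # Brute-force squared distances: O(n^2) memory, fine for a few thousand points.
    diff = pts[:, None, :] - pts[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    close = (d2 < min_dist_m ** 2) & ~np.eye(n, dtype=bool)
    earlier = marks[None, :] < marks[:, None]    # earlier[i, j]: j arrived before i

    # Delete a point if any proposal within min_dist_m arrived before it.
    keep = ~(close & earlier).any(axis=1)
    return pts[keep]
```

Because every deleted point still blocks its neighbours, the retained intensity saturates as the proposal intensity grows, so dense inhibited patterns need a different model.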

Step-by-Step Implementation

The sequence below realizes an IPP by thinning, in a projected CRS, under a fixed seed. Each block is copy-pasteable and builds on the previous one.

Step 1 — Build a normalized intensity surface

Construct $\lambda(x, y)$ as a raster aligned to a fixed origin and resolution, then rescale it so its integral equals the target point count. Anchoring the grid guarantees tile reproducibility across distributed workers.

python
import numpy as np

def build_intensity(
    width_m: float,
    height_m: float,
    res_m: float,
    target_count: int,
    covariate_fn,
) -> tuple[np.ndarray, float]:
    """Return a (rows, cols) intensity raster whose integral == target_count."""
    cols = int(np.ceil(width_m / res_m))
    rows = int(np.ceil(height_m / res_m))
    # Cell-center coordinates in the projected (metric) CRS.
    xs = (np.arange(cols) + 0.5) * res_m
    ys = (np.arange(rows) + 0.5) * res_m
    gx, gy = np.meshgrid(xs, ys)
    raw = np.clip(covariate_fn(gx, gy), 0.0, None)  # intensities are non-negative

    cell_area = res_m ** 2
    mass = raw.sum() * cell_area
    if mass <= 0:
        raise ValueError("Intensity surface integrates to zero; check covariate_fn")
    lam = raw * (target_count / mass)               # E[count] == target_count
    return lam, res_m

Step 2 — Realize the IPP by rejection-thinning

Generate an HPP at $\lambda_{\max}$, then keep each candidate with probability $\lambda/\lambda_{\max}$. The single np.random.default_rng(seed) instance is the only entropy source: pass the same seed on the same pinned NumPy build and the output is byte-identical.

python
import numpy as np

def thin_ipp(
    lam: np.ndarray,
    res_m: float,
    seed: int,
) -> np.ndarray:
    """Draw an inhomogeneous Poisson pattern. Returns (n, 2) metric coordinates."""
    rng = np.random.default_rng(seed)
    rows, cols = lam.shape
    width_m, height_m = cols * res_m, rows * res_m

    lam_max = float(lam.max())
    area = width_m * height_m
    n_candidates = rng.poisson(lam_max * area)          # HPP at lambda_max

    cx = rng.uniform(0.0, width_m, n_candidates)
    cy = rng.uniform(0.0, height_m, n_candidates)

    # Look up local intensity at each candidate, then thin.
    col = np.clip((cx / res_m).astype(int), 0, cols - 1)
    row = np.clip((cy / res_m).astype(int), 0, rows - 1)
    keep_prob = lam[row, col] / lam_max
    keep = rng.uniform(0.0, 1.0, n_candidates) < keep_prob
    return np.column_stack([cx[keep], cy[keep]])

Step 3 — Clip to the real boundary with an edge buffer

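Restricting a Poisson realization to a polygon is exact as long as the generation rectangle covers the polygon: events are independent, so clipping removes points without disturbing the rest. Edge depression appears when the generation window is too small, or when events interact, as with cluster offspring whose parents fall just outside the window. In those cases extend the domain by the maximum interaction radius, generate, then clip; this preserves local statistics instead of carving artificial voids. The detailed treatment for municipal extents lives in generating urban point patterns using Poisson processes.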

python
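import geopandas as gpd
import numpy as np

def to_geodataframe(
    coords: np.ndarray,
    boundary: gpd.GeoSeries,        # polygon(s) in the SAME metric CRS
    metric_crs: str = "EPSG:6933",
) -> gpd.GeoDataFrame:
    """Wrap (n, 2) metric coordinates as points and keep those inside the boundary."""
    pts = gpd.GeoSeries(gpd.points_from_xy(coords[:, 0], coords[:, 1]), crs=metric_crs)
    # geopandas 0.14 (the pinned series) exposes the dissolved geometry as
    # `unary_union`; `union_all()` only exists from geopandas 1.0 onward.
    region = boundary.unary_union
    clipped = pts[pts.within(region)]
    return gpd.GeoDataFrame(geometry=clipped.reset_index(drop=True), crs=metric_crs)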

Step 4 — Add clustering with a Thomas process (optional)

When the source pattern clusters, replace the plain IPP with a Thomas cluster process: draw parent locations from an HPP, then scatter Gaussian-distributed offspring around each parent.

python
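import numpy as np

def thomas_process(
    boundary_bounds: tuple[float, float, float, float],
    parent_intensity: float,   # parents per m^2
    mean_children: float,      # expected offspring per parent
    sigma_m: float,            # offspring dispersal std-dev, in meters
    seed: int,
) -> np.ndarray:
    """Draw a Thomas cluster pattern. Returns (n, 2) metric coordinates;
    offspring may land outside the bounds, so clip afterwards (Step 3)."""
    rng = np.random.default_rng(seed)
    minx, miny, maxx, maxy = boundary_bounds
    # Pad the parent window so clusters centred just outside still contribute
    # offspring near the edge; 4 sigma covers nearly all of the dispersal kernel.
    pad = 4.0 * sigma_m
    minx, miny, maxx, maxy = minx - pad, miny - pad, maxx + pad, maxy + pad
    area = (maxx - minx) * (maxy - miny)
    n_parents = rng.poisson(parent_intensity * area)
    parents = np.column_stack([
        rng.uniform(minx, maxx, n_parents),
        rng.uniform(miny, maxy, n_parents),
    ])

    # Poisson offspring count per parent, then isotropic Gaussian displacement.
    counts = rng.poisson(mean_children, n_parents)
    centres = np.repeat(parents, counts, axis=0)
    return centres + rng.normal(0.0, sigma_m, size=centres.shape)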
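
For calibrating the three parameters it helps to know what a stationary Thomas process produces in theory. Writing $\kappa$ for parent_intensity, $m$ for mean_children, and $\sigma$ for sigma_m (symbols chosen here for illustration), the standard results are:

$$
\mathbb{E}[N] = \kappa \, m \, |\Omega|,
\qquad
K(r) = \pi r^2 + \frac{1}{\kappa}\left(1 - e^{-r^2 / (4\sigma^2)}\right)
$$

The second term of $K(r)$ is the excess over the Poisson baseline $\pi r^2$. It grows on the scale of $\sigma$ and saturates at $1/\kappa$, so fewer parents mean stronger apparent clustering. The expected count ignores offspring that disperse outside the window, which is a reasonable approximation only when $\sigma$ is small relative to the domain.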

Validation & Testing

A synthetic point pattern is only usable once it provably matches the source’s spatial statistics. Wire these checks into CI so a drifting generator fails the build instead of shipping silently. The same fidelity philosophy underpins the broader realism metrics evaluation used across the architecture.

Ripley’s $K$ and the $L$-transform quantify clustering or dispersion across distance scales. For an HPP the expectation is $K(r) = \pi r^2$; the variance-stabilized $L(r) = \sqrt{K(r)/\pi}$ should track $r$ under randomness, rise above it under clustering, and fall below under inhibition.
The nearest-neighbour ($G$) function validates local spacing, which is critical for infrastructure or retail footprints where minimum separation matters.
Monte Carlo envelope testing draws 99–999 simulations under the null to bound statistical significance; the observed function must stay inside the envelope to pass.
Deterministic seeding is itself a test: identical seed and inputs must yield identical output on every host.

python
import numpy as np
from pointpats import k_test

def assert_clustering_within_envelope(coords: np.ndarray, n_sims: int = 199) -> None:
    """CI gate: observed K must fall inside the Monte Carlo envelope at the
    target scales. k_test simulates its null as complete spatial randomness,
    so this gate suits patterns whose target is CSR-like."""
    # k_test takes no seed argument; its simulations draw from NumPy's global
    # RNG, so seed that to keep the envelope reproducible.
    np.random.seed(42)
    result = k_test(coords, keep_simulations=True, n_simulations=n_sims)
    lo = result.simulations.min(axis=0)
    hi = result.simulations.max(axis=0)
    inside = (result.statistic >= lo) & (result.statistic <= hi)
    assert inside.mean() >= 0.95, "K-function escapes the MC envelope: intensity drift"

def assert_reproducible(make_coords) -> None:
    a = make_coords(seed=7)
    b = make_coords(seed=7)
    assert np.array_equal(a, b), "Non-deterministic output: unseeded RNG or set-order leak"
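
To make clear what the $K$ gate measures, here is a naive estimator written out in NumPy. It is a sketch for intuition, not a replacement for pointpats: it applies no edge correction, so it reads low at distances comparable to the window size, and the function names are illustrative.

```python
import numpy as np

def ripley_k_naive(coords: np.ndarray, radii: np.ndarray, area: float) -> np.ndarray:
    """Uncorrected Ripley's K: area-scaled fraction of ordered pairs within r."""
    n = len(coords)
    if n < 2:
        raise ValueError("need at least two points")
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    d = d[~np.eye(n, dtype=bool)]                  # ordered pairs with i != j
    counts = np.array([(d <= r).sum() for r in radii])
    return area * counts / (n * (n - 1))

def l_transform(k: np.ndarray) -> np.ndarray:
    """L(r) = sqrt(K(r) / pi); tracks r under complete spatial randomness."""
    return np.sqrt(np.asarray(k, dtype=float) / np.pi)

# Two points one meter apart in a 4 m^2 window: no pairs at r = 0.5, both at r = 1.
k = ripley_k_naive(np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([0.5, 1.0]), area=4.0)
```

Running the same estimator on the source pattern and on several seeded realizations gives a like-for-like comparison even when the target is deliberately inhomogeneous.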

Expected shapes are part of the contract too: the realized count should land within a Poisson tolerance of the target. For a target $\mu$, assert that the observed $n$ satisfies $|n - \mu| \le 3\sqrt{\mu}$ rather than demanding exact equality, since the count is itself a random variable.
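
That tolerance is a one-line check. The helper below is a minimal sketch with a hypothetical name; the multiplier `z` defaults to the 3 used above and relies on the Poisson variance equalling the mean.

```python
import math

def count_within_tolerance(n_observed: int, mu: float, z: float = 3.0) -> bool:
    """True when |n - mu| <= z * sqrt(mu), the Poisson standard deviation band."""
    if mu <= 0:
        raise ValueError("target count must be positive")
    return abs(n_observed - mu) <= z * math.sqrt(mu)
```

For a target of 1000 the band is roughly plus or minus 95 points, so 1050 passes and 1100 does not. When the pattern is clipped to a polygon, `mu` should be the integral of the intensity over that polygon, not over the bounding rectangle.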

Performance & Scale Considerations

Continental-extent generation breaks the single-process model in two predictable ways, both solvable with spatial chunking.

Memory from oversized candidate sets. Thinning at a high $\lambda_{\max}$ allocates a candidate array proportional to $\lambda_{\max} \cdot \text{area}$. Smooth or cap the intensity surface, and for very large domains stream candidates through memory-mapped arrays rather than holding the full set on the heap.
Compute through tiled, asynchronous execution. Partition the domain into tiles with overlap buffers (halos) sized to the maximum interaction radius, generate tiles in parallel, and merge with the halo de-duplicated. Each tile must derive its seed deterministically from a base seed (e.g. seed = base_seed * 1_000_003 + tile_id) so the whole field stays reproducible while tiles run independently. The non-blocking orchestration patterns for this live in async execution for large grids.

python
def tile_seed(base_seed: int, tile_id: int) -> int:
    """Stable per-tile seed: independent tiles, globally reproducible field."""
    return (base_seed * 1_000_003 + tile_id) & 0x7FFF_FFFF
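
Arithmetic seed formulas can collide once tile ids grow large or the result is masked to 31 bits. One alternative, sketched here as an assumption about how the pipeline could be wired rather than how it is, is to let NumPy mix the base seed and tile id through `SeedSequence`. The name `tile_rng` is hypothetical.

```python
import numpy as np

def tile_rng(base_seed: int, tile_id: int) -> np.random.Generator:
    """Per-tile generator derived from (base_seed, tile_id) via SeedSequence.

    Both integers must be non-negative. SeedSequence hashes the pair into a
    well-mixed state, so nearby tile ids do not yield correlated streams.
    """
    return np.random.default_rng(np.random.SeedSequence([base_seed, tile_id]))

# Same (base_seed, tile_id) reproduces the stream; a different tile does not.
a = tile_rng(7, 3).uniform(size=5)
b = tile_rng(7, 3).uniform(size=5)
c = tile_rng(7, 4).uniform(size=5)
```

A worker then needs only the base seed and its own tile id, with no coordination between tiles.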

Route compute toward high-intensity regions instead of spreading workers uniformly: precompute the per-tile $\lambda_{\max}$ so that dense tiles, which dominate thinning cost, get proportionally more workers. Checkpoint completed tiles and keep worker functions idempotent so a partial failure re-runs only the affected tiles.

Failure Modes & Troubleshooting

Generation in geographic degrees

Symptom: Ripley’s $K$ and nearest-neighbour distances are nonsensical, and density skews with latitude. Root cause: sampling or computing statistics in EPSG:4326, where a degree is not a constant distance. Fix: reproject to an equal-area metric CRS at ingestion, do all generation and validation there, and convert back only at export.

python
gdf_metric = gdf_wgs84.to_crs("EPSG:6933")   # generate + validate here
gdf_export = gdf_metric.to_crs("EPSG:4326")  # serialize only at the end

Edge-depressed density after clipping

Symptom: Point density falls off near the boundary and nearest-neighbour distributions distort at the margin. Root cause: the generation window stops at the boundary, so interacting processes lose the parents and neighbours that would have sat just outside the polygon; a pure Poisson draw is unaffected by clipping alone. Fix: buffer the generation domain by the maximum interaction radius before generating, then clip strictly to the target polygon (Step 3).

Thinning throughput collapse

Symptom: Generation time explodes and memory spikes on an otherwise small target count. Root cause: a sharp spike in $\lambda(x, y)$ forces a huge $\lambda_{\max}$, so the HPP proposal generates orders of magnitude more candidates than needed. Fix: smooth or clip the top percentile of the intensity surface, or switch to a piecewise-constant per-tile $\lambda_{\max}$ so a local spike no longer penalizes the whole domain.

Non-reproducible output across hosts

Symptom: Identical configs yield different coordinates on different machines. Root cause: an unseeded RNG, set/dict iteration order feeding geometry construction, or a late/mismatched reprojection. Fix: propagate one seed into every stochastic call, pin the toolchain, set PYTHONHASHSEED=0, and record a hash of the config plus lockfile in a seed registry.

Privacy leakage through over-faithful intensity

Symptom: Adversarial nearest-neighbour tests show synthetic points sitting implausibly close to real source locations. Root cause: the intensity surface is bound so tightly to the real data that each source record dominates a cell. Fix: inject calibrated noise into the surface before realization and track the budget. Spatial $k$-anonymity and a declared $(\varepsilon, \delta)$ budget (the same differential privacy mechanisms used elsewhere) bound how much any single record can influence the output, and a Wasserstein-distance check confirms utility is preserved without preserving identity.
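
The symptom can be measured with a nearest-source distance check. The sketch below is a brute-force version for small audits; the function names and the idea of reporting a single rate at a chosen radius are assumptions, and the radius itself has to come from the privacy policy, not from this code.

```python
import numpy as np

def nearest_source_distance(synthetic: np.ndarray, source: np.ndarray) -> np.ndarray:
    """Distance from each synthetic point to its closest source record (metric CRS)."""
    diff = synthetic[:, None, :] - source[None, :, :]   # O(n * m) memory
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def leakage_rate(synthetic: np.ndarray, source: np.ndarray, radius_m: float) -> float:
    """Fraction of synthetic points lying within radius_m of any source record."""
    return float((nearest_source_distance(synthetic, source) <= radius_m).mean())

source = np.array([[0.0, 0.0], [100.0, 0.0]])
synthetic = np.array([[1.0, 0.0], [50.0, 0.0], [100.0, 3.0]])
rate = leakage_rate(synthetic, source, radius_m=5.0)    # two of three points are close
```

A meaningful gate compares this rate with the one obtained from an independent realization of the same surface, since dense areas produce close pairs by chance alone.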

Frequently Asked Questions

When should I use a Cox or Gibbs process instead of an inhomogeneous Poisson process?

Use an IPP when events are mutually independent and only their density varies in space, the common case for most synthetic point work. Switch to a Cox or Thomas cluster process when the source pattern shows attraction (events bunch around parents) and to a Gibbs process when it shows inhibition (a minimum spacing). Those interactions are second-order structure that a Poisson process cannot reproduce, and you will see it as an $L$-function that escapes the Monte Carlo envelope.

How do I keep a thinned IPP reproducible across distributed workers?

Derive each tile’s seed deterministically from one base seed, draw from a single np.random.default_rng(tile_seed) per tile, pin the library versions, and set PYTHONHASHSEED=0. Independent tiles then run in parallel while the merged field remains byte-identical for a given base seed.

What tolerance should I allow on the realized point count?

The count from a Poisson process is itself random with mean and variance equal to the target $\mu$, so demanding an exact count is wrong. Assert that the observed $n$ falls within roughly $3\sqrt{\mu}$ of $\mu$; a count that is consistently off by more than that signals an un-normalized intensity surface or a clipping bias.

Why validate in a projected CRS rather than the storage CRS?

Ripley’s $K$, the $G$-function, and intensity are all distance- and area-sensitive. In EPSG:4326 a degree of longitude spans a far shorter ground distance near the poles than near the equator, while a degree of latitude stays roughly constant, so distances and areas computed in degrees are distorted differently at every latitude. Validate in a projected metric CRS and reproject to EPSG:4326 only for storage and exchange.

Polygon Tessellation Algorithms — turn generated points into gap-free, sliver-free zones using cluster-seeded Voronoi and Delaunay partitions.
Density Mapping & Heat Generation — convert point patterns into continuous intensity surfaces with deterministic KDE and binning.
Async Execution for Large Grids — non-blocking, checkpointed orchestration for continental-scale tiled generation.
Generating Urban Point Patterns Using Poisson Processes — intensity calibration and edge correction for municipal extents.
Privacy-Preserving Generation Frameworks — the noise budgets and $k$ -anonymity gates that bound leakage from realized points.