Density Mapping & Heat Generation in Synthetic Spatial Pipelines

Density mapping and heat generation turn discrete synthetic events — points, event centroids, trajectory samples — into continuous intensity surfaces that downstream models, dashboards, and compliance attestations depend on. This page is part of Spatial Distribution & Pattern Generation: where point process simulation models produce the discrete events, density mapping is the stage that aggregates them into the smoothed surfaces an ML feature store or risk map actually consumes. Done as exploratory visualization, density estimation is a throwaway plot; done inside a generation pipeline, it has to be deterministic, area-true, and reproducible byte-for-byte across runs.

Problem Framing: When a Heatmap Stops Being Reproducible

The engineering failure this page solves is silent intensity drift — two runs of the same pipeline, on the same input batch and the same seed, producing visibly different heat surfaces. The drift almost never comes from the events themselves; it comes from four under-specified decisions in the aggregation path:

Areal distortion from the projection. Computing density in geographic degrees (EPSG:4326) treats every degree cell as equal, but a degree cell covers less ground toward the poles because its east-west width shrinks with the cosine of latitude. Identical point counts therefore correspond to different ground intensities depending on latitude. A surface generated in degrees is not comparable across regions and cannot be normalized consistently.
Data-adaptive bandwidth. Letting scipy.stats.gaussian_kde pick its own bandwidth means the smoothing radius is a function of the sample, not the configuration. Re-run with one extra point and every cell value shifts.
Unanchored grids. A grid whose origin is derived from the data extent re-tiles itself whenever the extent changes by a pixel, so tile boundaries — and therefore halo merges — move between runs.
Edge discontinuities at tile seams. Density is non-local: a point near a tile edge contributes to neighbours across the seam. Tile the grid without overlap buffers and you get visible ridges or troughs exactly on the seam lines.

Each of these is individually small and individually invisible in a one-off notebook. Composed inside a continuous pipeline they produce a surface that “looks about right” but fails bitwise-identity checks, breaks statistical-parity gates, and quietly corrupts any feature derived from cell intensities. The fix is to make every one of those four decisions explicit, seeded, and version-pinned — the same separation-of-concerns discipline that governs CRS contract enforcement elsewhere in the stack.
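To make the areal-distortion point concrete, the sketch below computes the ground area of a one-degree cell at a few latitudes. It assumes a spherical Earth with a mean radius of 6,371 km (an approximation, not the WGS 84 ellipsoid), and `degree_cell_area_km2` is a hypothetical helper name used only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius

def degree_cell_area_km2(lat_south_deg: float, size_deg: float = 1.0) -> float:
    """Ground area of a size_deg x size_deg cell whose southern edge is at lat_south_deg."""
    phi1 = math.radians(lat_south_deg)
    phi2 = math.radians(lat_south_deg + size_deg)
    dlon = math.radians(size_deg)
    # Area of a latitude-longitude rectangle on a sphere:
    # R^2 * dlon * (sin(phi2) - sin(phi1))
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(phi2) - math.sin(phi1))

for lat in (0, 30, 60, 80):
    print(lat, round(degree_cell_area_km2(lat)))
```

Under this approximation the cell at 60° covers roughly half the ground area of the cell at the equator, so the same count per degree cell implies roughly twice the ground density.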

Prerequisites & Toolchain

Density mapping sits on the standard synthetic-spatial toolchain. Pin versions explicitly so bandwidth heuristics and rasterization geotransforms do not shift under you between environments:


python >= 3.10
numpy == 1.26.*
scipy == 1.11.*
geopandas == 0.14.*
shapely == 2.*
pyproj == 3.*
rasterio == 1.3.*      # GDAL 3.x bindings
h3 == 3.7.*            # hexagonal binning (optional but recommended)
dask[array] == 2024.*  # only for distributed grids

Two environment variables decide whether reprojection is even reproducible:

bash
export PROJ_NETWORK=OFF          # never fetch datum grids over the network mid-run
export PROJ_LIB=/opt/conda/share/proj   # pin the PROJ data dir explicitly

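With PROJ_NETWORK=ON, PROJ may fetch a datum-shift grid on a machine that has network access while an offline machine falls back to a coarser transformation, so the same input reprojects to slightly different coordinates. Even a sub-metre difference is enough to move a point across a cell boundary. Pin it off and ship the grids in the image. PROJ 9.1 and later read PROJ_DATA in preference to PROJ_LIB, so set both if the image may carry a newer PROJ.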

Core Concept: Why Determinism Lives in the Bandwidth and the Grid Origin

A heat surface is the convolution of a point pattern with a kernel, sampled onto a grid. Two parameters dominate both fidelity and reproducibility: the kernel bandwidth $h$ and the grid anchor $(x_0, y_0, r)$.

Kernel density estimation evaluates, at every grid cell centre $\mathbf{s}$,

$$\hat{\lambda}(\mathbf{s}) = \frac{1}{n h^2} \sum_{i=1}^{n} K\!\left(\frac{\lVert \mathbf{s} - \mathbf{x}_i \rVert}{h}\right)$$

where $K$ is the kernel (Gaussian by default) and $h$ the bandwidth in projected metres. With the $1/n$ factor and a normalized kernel the estimate integrates to one over the plane; multiply by $n$ to obtain an intensity in events per square metre. The reproducibility trap is that Silverman’s and Scott’s rules both make $h$ a function of the sample covariance and $n$:

$$h_{\text{Scott}} = n^{-1/(d+4)} \, \hat{\sigma}, \qquad d = 2$$

so any change to the input set perturbs $h$, and a perturbed $h$ perturbs every cell. The production move is to compute the heuristic once against a frozen reference batch, record the resulting metre value in config, and pass that fixed number on every run; bw_method="scott" is never evaluated live. The same logic applies to privacy floors: if you need to suppress sub-metre spikes, enforce a floor on $h$ rather than letting the data choose.
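The freeze-once workflow might look like the following sketch. It assumes $\hat{\sigma}$ is the pooled per-axis standard deviation (the text does not pin this down), uses random points as a stand-in for the frozen reference batch, and introduces a hypothetical `H_FLOOR_M` privacy floor; none of these names or values come from the pipeline itself.

```python
import numpy as np

H_FLOOR_M = 250.0  # hypothetical privacy floor on the bandwidth, metres

def scott_bandwidth_m(xy: np.ndarray) -> float:
    """Scott's rule for an (n, d) array of projected coordinates in metres."""
    n, d = xy.shape
    # Assumption: sigma-hat is the pooled per-axis standard deviation.
    sigma = float(np.sqrt(xy.var(axis=0, ddof=1).mean()))
    return float(n ** (-1.0 / (d + 4)) * sigma)

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1_000.0, size=(1_000, 2))  # stand-in reference batch
h_ref = scott_bandwidth_m(reference)

# One extra event changes the live heuristic, which is why it is not used at run time.
h_live = scott_bandwidth_m(np.vstack([reference, [[5_000.0, 5_000.0]]]))

# The value that goes into config: rounded, floored, then treated as a constant.
frozen_bandwidth_m = max(round(h_ref, 1), H_FLOOR_M)
print(h_ref, h_live, frozen_bandwidth_m)
```

Rounding before storing is a design choice here: a short decimal in config is easier to audit than a full-precision float, and the run-time code only ever reads the stored number.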

The grid anchor matters for the same reason tile seams do. If the grid origin, resolution $r$, and extent are derived from the data, they move when the data moves, and the cell that a point falls into changes. Anchoring the grid to a fixed origin and a fixed resolution makes cell membership a pure function of coordinates, which is the precondition for bitwise-identical tiles across distributed workers.
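One way to keep a data-driven extent from moving the grid is to snap it outward to whole anchored cells. The sketch below assumes an upper-left anchor with rows counting downward, reuses the origin and resolution values shown on this page, and uses a hypothetical helper name, `aligned_window`.

```python
import math

ORIGIN = (-17_367_530.0, 7_314_540.0)  # upper-left grid anchor, metres
RES_M = 1_000.0                        # cell size, metres

def aligned_window(minx: float, miny: float,
                   maxx: float, maxy: float) -> tuple[int, int, int, int]:
    """Return (col_off, row_off, n_cols, n_rows) of the anchored cells covering an extent."""
    col_off = math.floor((minx - ORIGIN[0]) / RES_M)
    row_off = math.floor((ORIGIN[1] - maxy) / RES_M)  # rows count down from the top edge
    col_end = math.ceil((maxx - ORIGIN[0]) / RES_M)
    row_end = math.ceil((ORIGIN[1] - miny) / RES_M)
    return col_off, row_off, col_end - col_off, row_end - row_off

# Two slightly different data extents resolve to the same block of anchored cells.
a = aligned_window(-100_000.0, 4_000_000.0, -90_000.0, 4_010_000.0)
b = aligned_window(-99_900.0, 4_000_100.0, -90_100.0, 4_009_900.0)
print(a, b, a == b)
```

Because the offsets are integers measured from the fixed origin, two workers given overlapping extents agree on which cells they share.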

For irregular domains, hexagonal binning (via h3) or quadtree cells replace rectangular grids: hexagons have uniform adjacency (six equidistant neighbours) and so introduce no directional bias along rows and columns, which matters when the surface feeds a graph neural network whose edges follow cell adjacency.

Step-by-Step Implementation

Step 1 — Reproject to an equal-area CRS and index

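Density must be computed in a CRS where equal cell areas on the grid mean equal areas on the ground. Reproject geographic input to an equal-area CRS (EPSG:6933 globally, or a local equal-area projection such as a Lambert azimuthal equal-area CRS for regional work) before anything else, then build a deterministic spatial index for neighbourhood queries and clipping. A local UTM zone is conformal rather than strictly equal-area, but its areal distortion inside the zone is well under one percent, so it is an acceptable regional substitute when that error is within tolerance.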

python
import geopandas as gpd
from shapely import STRtree

EQUAL_AREA = "EPSG:6933"  # WGS 84 / NSIDC EASE-Grid 2.0 Global

def index_events(events: gpd.GeoDataFrame) -> tuple[gpd.GeoDataFrame, STRtree]:
    if events.crs is None:
        raise ValueError("events have no CRS; refuse to guess")
    proj = events.to_crs(EQUAL_AREA).reset_index(drop=True)  # stable row order
    tree = STRtree(proj.geometry.values)  # deterministic given fixed input order
    return proj, tree

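Resetting the index gives a contiguous positional index, so the integer positions returned by STRtree queries map straight back to rows. It does not sort anything: determinism still depends on the input arriving in a stable order, so sort by a stable key upstream if the source does not guarantee one.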

Step 2 — Estimate intensity with a fixed bandwidth onto an anchored grid

Compute the bandwidth heuristic offline once, store the metre value, and feed it as a constant. Anchor the grid to a fixed origin and resolution.

python
import numpy as np

FIXED_BANDWIDTH_M = 750.0   # derived once from a frozen reference batch, then frozen
ORIGIN = (-17_367_530.0, 7_314_540.0)  # fixed EASE-Grid anchor (upper-left corner), metres
RES_M = 1_000.0

def density_surface(xy: np.ndarray, n_cols: int, n_rows: int) -> np.ndarray:
    # xy has shape (n, 2) in projected metres. The isotropic Gaussian kernel is
    # evaluated directly: gaussian_kde's scalar bw_method multiplies the *sample*
    # covariance, so its kernel width would still depend on the data.
    xs = ORIGIN[0] + (np.arange(n_cols) + 0.5) * RES_M   # cell-centre eastings
    ys = ORIGIN[1] - (np.arange(n_rows) + 0.5) * RES_M   # cell-centre northings, top row first
    gx, gy = np.meshgrid(xs, ys)
    h2 = FIXED_BANDWIDTH_M ** 2
    out = np.zeros((n_rows, n_cols))
    for x, y in xy:                                      # fixed summation order
        out += np.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2.0 * h2))
    # Normalized to integrate to one; multiply by len(xy) for events per square metre.
    return out / (2.0 * np.pi * h2 * len(xy))

For large extents, prefer histogram or hex binning over KDE — it is bounded-memory and trivially deterministic. Couple the validation back to the generating process: when events come from a Poisson, Thomas, or Neyman–Scott model, the surface should reproduce that parent intensity. See point process simulation models for the matching generators, and align cell boundaries with polygon tessellation algorithms so density cells snap to administrative or land-use zones without edge fragmentation.
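A minimal sketch of the binning alternative on the same anchored grid follows. It assumes an upper-left anchor with rows counting downward and silently drops events outside the grid; `bin_counts` is a hypothetical name, and a real pipeline might prefer to raise on out-of-grid events instead.

```python
import numpy as np

ORIGIN = (-17_367_530.0, 7_314_540.0)  # upper-left grid anchor, metres
RES_M = 1_000.0                        # cell size, metres

def bin_counts(xy: np.ndarray, n_cols: int, n_rows: int) -> np.ndarray:
    """Count events per anchored cell; xy has shape (n, 2) in projected metres."""
    cols = np.floor((xy[:, 0] - ORIGIN[0]) / RES_M).astype(np.int64)
    rows = np.floor((ORIGIN[1] - xy[:, 1]) / RES_M).astype(np.int64)
    inside = (cols >= 0) & (cols < n_cols) & (rows >= 0) & (rows < n_rows)
    counts = np.zeros((n_rows, n_cols), dtype=np.int64)
    # np.add.at accumulates repeated indices correctly (plain fancy-index += does not).
    np.add.at(counts, (rows[inside], cols[inside]), 1)
    return counts

xy = np.array([
    [ORIGIN[0] + 500.0, ORIGIN[1] - 500.0],    # row 0, col 0
    [ORIGIN[0] + 1_500.0, ORIGIN[1] - 500.0],  # row 0, col 1
    [ORIGIN[0] + 1_500.0, ORIGIN[1] - 999.0],  # row 0, col 1
    [ORIGIN[0] - 10.0, ORIGIN[1] - 10.0],      # west of the grid: dropped
])
counts = bin_counts(xy, n_cols=3, n_rows=2)
intensity = counts / (RES_M * RES_M)           # events per square metre
print(counts)
```

Integer counts are independent of the order in which events arrive, which is what makes this path deterministic without any care about summation order.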

Step 3 — Apply the privacy budget before normalization

High-intensity zones can pinpoint rare event locations, so perturb the density matrix under a declared budget before it is scaled. This is the spatial application of the differential privacy mechanisms used across the architecture layer.

python
import numpy as np

def laplace_privatize(density: np.ndarray, sensitivity: float,
                      epsilon: float, rng: np.random.Generator) -> np.ndarray:
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon          # Laplace scale b = Δf / ε
    noise = rng.laplace(0.0, scale, size=density.shape)  # rng is seeded by the caller
    return np.clip(density + noise, 0.0, None)  # density cannot go negative

The privacy budget $\varepsilon$ must be tracked and logged per generation run; lower $\varepsilon$ means more noise and stronger protection. A separate floor on the bandwidth prevents any single synthetic point from dominating a cell, and a percentile threshold zeroes sparse cells that would otherwise expose outlier trajectories.
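The percentile threshold mentioned above could be sketched as follows. The helper name `suppress_sparse_cells`, the 10th-percentile default, and the example budget values are all assumptions for illustration; the threshold is taken over strictly positive cells so that an empty background does not drag the cutoff to zero.

```python
import numpy as np

def suppress_sparse_cells(density: np.ndarray, percentile: float = 10.0) -> np.ndarray:
    """Zero every cell below the given percentile of the strictly positive cells."""
    positive = density[density > 0]
    if positive.size == 0:
        return np.zeros_like(density, dtype=np.float64)
    cutoff = np.percentile(positive, percentile)
    return np.where(density < cutoff, 0.0, density)

rng = np.random.default_rng(7)              # seed recorded with the run
counts = rng.poisson(3.0, size=(4, 5)).astype(np.float64)
epsilon, sensitivity = 1.0, 1.0             # hypothetical budget and per-cell sensitivity
noisy = np.clip(counts + rng.laplace(0.0, sensitivity / epsilon, size=counts.shape),
                0.0, None)
released = suppress_sparse_cells(noisy, percentile=10.0)
print(released)
```

Thresholding after the noise is added is post-processing of an already privatized output, so it does not consume additional budget.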

Step 4 — Rasterize, normalize, and stamp provenance

Scale to a fixed bit-depth and write the geotransform, then embed the parameters that make the surface auditable.

python
import numpy as np
import rasterio
from rasterio.transform import from_origin

def write_heat_raster(density: np.ndarray, path: str, *,
                      bandwidth_m: float, epsilon: float,
                      pipeline_hash: str) -> None:
    # ORIGIN, RES_M, and EQUAL_AREA are the constants defined in Steps 1 and 2.
    p99 = np.percentile(density, 99)               # clamp to upper percentile, not max
    if p99 <= 0:                                   # mostly-empty surface: avoid dividing by zero
        raise ValueError("99th percentile is not positive; cannot normalize")
    scaled = np.clip(density / p99, 0.0, 1.0)
    raster = np.rint(scaled * 65535).astype("uint16")  # 16-bit, round to nearest
    transform = from_origin(ORIGIN[0], ORIGIN[1], RES_M, RES_M)
    with rasterio.open(
        path, "w", driver="GTiff", height=raster.shape[0], width=raster.shape[1],
        count=1, dtype="uint16", crs=EQUAL_AREA, transform=transform,
        compress="ZSTD", predictor=2,
    ) as dst:
        dst.write(raster, 1)
        dst.update_tags(crs=EQUAL_AREA, bandwidth_m=str(bandwidth_m),
                        resolution_m=str(RES_M), epsilon=str(epsilon),
                        normalization="p99-linear", pipeline_hash=pipeline_hash)

Normalizing to the 99th percentile rather than the raw maximum keeps a single injected outlier from collapsing the whole dynamic range, and the embedded tags let an auditor reconstruct exactly how the surface was produced.
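The page does not say how pipeline_hash is derived. One plausible approach, shown here purely as an assumption, is a SHA-256 digest over a canonical JSON encoding of the generation parameters; `compute_pipeline_hash` and the example seed and budget values are hypothetical.

```python
import hashlib
import json

def compute_pipeline_hash(params: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding of the parameters."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

params = {
    "crs": "EPSG:6933",
    "bandwidth_m": 750.0,
    "resolution_m": 1000.0,
    "origin": [-17367530.0, 7314540.0],
    "epsilon": 1.0,            # hypothetical budget value
    "seed": 20240101,          # hypothetical RNG seed
    "normalization": "p99-linear",
}
digest = compute_pipeline_hash(params)
print(digest)
```

Sorting keys before hashing means the digest depends on the parameter values only, not on the order in which a config loader happened to populate the dictionary.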

Validation & Testing

Validate every stage; serialize the results as pipeline artifacts so the chain from raw coordinates to final raster is traceable.

python
import numpy as np

def assert_reproducible(run_a: np.ndarray, run_b: np.ndarray) -> None:
    # identical batch + fixed seed => bitwise-identical raster
    assert np.array_equal(run_a, run_b), "density surface drifted between runs"

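def assert_mass_conserved(density: np.ndarray, n_events: int,
                          cell_area_m2: float, tol: float = 1e-3) -> None:
    # density must be an intensity in events per square metre (n times a
    # unit-mass KDE), checked before privacy noise and clipping are applied.
    # The cell sum is a Riemann approximation of the integral, so the default
    # tolerance is looser than floating-point precision.
    integrated = density.sum() * cell_area_m2
    assert abs(integrated - n_events) / n_events < tol, "density mass not conserved"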

The gate set that should run in CI:

Deterministic seed verification — identical inputs and a fixed seed produce bitwise-identical rasters across runs (np.array_equal).
Distribution fidelity — compare the surface against the theoretical intensity with a Kolmogorov–Smirnov test or, preferably, Wasserstein distance; the realism metrics evaluation layer defines the tolerance thresholds.
Boundary integrity — tessellation-aligned grids conserve area; no density mass is lost during clipping or halo merging.
Spatial autocorrelation — the surface passes a Moran’s $I$ check, confirming it is smooth rather than salt-and-pepper noise after privatization.
Bit-depth audit — the normalization curve introduces no quantization banding that would degrade downstream training.

Wire these into CI/CD integration for spatial data so a drifted surface fails the build instead of shipping.
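For the spatial autocorrelation gate, a minimal Moran's $I$ on a regular grid can be written with NumPy alone. This sketch assumes binary rook (edge-sharing) weights, which is one common choice rather than something the pipeline prescribes; `morans_i_rook` is a hypothetical name, and the pass threshold would come from the realism metrics layer.

```python
import numpy as np

def morans_i_rook(grid: np.ndarray) -> float:
    """Moran's I with binary rook (edge-sharing) weights on a regular grid."""
    z = grid.astype(np.float64) - grid.mean()
    denom = float((z ** 2).sum())
    if denom == 0.0:
        return float("nan")  # constant surface: autocorrelation is undefined
    # Products over each horizontally or vertically adjacent pair, counted once.
    pair_sum = float((z[:, :-1] * z[:, 1:]).sum() + (z[:-1, :] * z[1:, :]).sum())
    n_pairs = z.shape[0] * (z.shape[1] - 1) + (z.shape[0] - 1) * z.shape[1]
    # I = (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2, with W = 2 * n_pairs;
    # the factor 2 from symmetric weights cancels against W.
    return z.size * pair_sum / (n_pairs * denom)

ramp = np.add.outer(np.arange(10.0), np.arange(10.0))       # smooth linear ramp
checker = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0    # alternating -1 / +1
print(morans_i_rook(ramp), morans_i_rook(checker))
```

A smooth surface scores strongly positive and an alternating pattern scores strongly negative, so a privatized surface drifting toward zero is the signal that noise has overwhelmed the structure.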

Performance & Scale Considerations

Naive KDE is $O(n \cdot m)$ over $n$ points and $m$ cells, which becomes memory-bound on continental grids. Three levers keep it tractable:

Chunked, halo-buffered tiling. Process spatial tiles in parallel, but pad each tile with an overlap buffer of at least ceil(3h / RES_M) cells (three bandwidths) so cross-seam contributions are captured, then trim the halo before merging. Without the buffer you get ridges on the seams; with too small a buffer you get a faint discontinuity.
Cache-aligned chunks and memory-mapped intermediates. Size chunks to L3-cache boundaries and back intermediate density matrices with memory-mapped arrays so a large grid spills to disk instead of OOM-killing the worker.
Distributed lazy evaluation. For grids that exceed a single host, scaling density-based spatial generation with Dask shows the spatially-aware partitioning, automatic spilling, and fault-tolerant retries that stop the task graph from exploding. The non-blocking export patterns in async execution for large grids remove the I/O stall when writing tiles to cloud object storage.

The recurring rule: parallelize the compute, but keep boundary handling deterministic — halo width, merge order, and grid anchor must be identical regardless of how many workers ran.
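The halo rule can be checked on a small grid. The sketch below smooths binned counts with a Gaussian truncated at three bandwidths (a simplification of full KDE, chosen so the example stays short), then compares a single-pass result with one assembled from four halo-buffered tiles. The function names and the 12 × 10 test grid are hypothetical.

```python
import math
import numpy as np

H_M = 750.0       # fixed bandwidth, metres
RES_M = 1_000.0   # cell size, metres
HALO = math.ceil(3 * H_M / RES_M)   # halo width in cells

def smooth(counts: np.ndarray) -> np.ndarray:
    """Gaussian-smooth a count grid with the kernel truncated at HALO cells."""
    offsets = np.arange(-HALO, HALO + 1)
    w = np.exp(-0.5 * (offsets * RES_M / H_M) ** 2)
    w /= w.sum()
    padded = np.pad(counts.astype(np.float64), HALO)   # zeros outside the array
    rows, cols = counts.shape
    out = np.zeros((rows, cols))
    for i, wi in enumerate(w):                         # fixed accumulation order
        for j, wj in enumerate(w):
            out += (wi * wj) * padded[i:i + rows, j:j + cols]
    return out

def smooth_tile(counts: np.ndarray, r0: int, r1: int, c0: int, c1: int) -> np.ndarray:
    """Smooth rows r0:r1 and columns c0:c1 using only the tile plus its halo."""
    rr0, rr1 = max(r0 - HALO, 0), min(r1 + HALO, counts.shape[0])
    cc0, cc1 = max(c0 - HALO, 0), min(c1 + HALO, counts.shape[1])
    padded_tile = smooth(counts[rr0:rr1, cc0:cc1])
    return padded_tile[r0 - rr0:r1 - rr0, c0 - cc0:c1 - cc0]   # trim the halo

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(12, 10))
full = smooth(counts)
merged = np.empty_like(full)
for r0, r1 in ((0, 6), (6, 12)):
    for c0, c1 in ((0, 5), (5, 10)):
        merged[r0:r1, c0:c1] = smooth_tile(counts, r0, r1, c0, c1)
print(np.array_equal(full, merged))
```

The tiles match the single-pass result only because the halo is at least as wide as the kernel's truncation radius and every tile accumulates terms in the same order; shrinking HALO below that radius would reintroduce the seam.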

Failure Modes & Troubleshooting

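Latitude-dependent intensity (areal distortion). Symptom: the same ground density reads differently at different latitudes. Root cause: KDE or binning computed in EPSG:4326 degrees, where a cell's ground area shrinks toward the poles. Remediation: reproject to EPSG:6933 or a local equal-area CRS in Step 1 (a local UTM zone is an acceptable approximation at regional scale) and never compute bandwidth in degrees.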

Surface changes when one point is added (adaptive bandwidth). Symptom: cell values shift across an entire region after a one-event change. Root cause: bw_method="scott"/"silverman" recomputed live. Remediation: freeze the bandwidth to a metre value in config and pass it as a scalar, as in Step 2.

Visible ridges or troughs on tile seams. Symptom: straight-line artifacts aligned with tile boundaries. Root cause: tiling without overlap buffers, so cross-seam kernel contributions are dropped. Remediation: pad tiles by ≥ 3 bandwidths, compute on the padded tile, then trim before merge.

Non-deterministic tiling (unanchored grid). Symptom: tiles fail bitwise-identity even with a fixed seed. Root cause: grid origin/resolution derived from the data extent. Remediation: pin ORIGIN and RES_M to fixed constants so cell membership is a pure function of coordinates.

Privacy noise erases real structure. Symptom: Moran’s $I$ collapses and the surface looks like static after privatization. Root cause: $\varepsilon$ set too low for the chosen sensitivity, or sensitivity over-estimated. Remediation: calibrate sensitivity to the per-cell contribution cap, raise $\varepsilon$ within the declared budget, and gate on the autocorrelation check; the coordinate-level differential privacy walkthrough covers sensitivity calibration in depth.

Frequently Asked Questions

Should I use KDE or grid binning for a production density surface?

Use binning (histogram or hexagonal) whenever the grid is large or the surface must be strictly deterministic and bounded-memory — it has no bandwidth to drift and scales linearly. Reserve KDE for smaller extents where a smooth surface matters and you can afford to freeze the bandwidth and evaluate the kernel against every cell.

Which CRS should density mapping default to?

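A projected, equal-area CRS such as EPSG:6933 globally, or a local equal-area projection for regional work. A UTM zone is conformal rather than equal-area, though its areal distortion within the zone is well under one percent, so it is a workable regional substitute. Equal-area is non-negotiable: density means events per unit area, so computing it in EPSG:4326 degrees produces latitude-dependent intensities that cannot be normalized or compared across regions.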

How do I keep two runs of the same pipeline byte-identical?

Freeze four things: the equal-area projection, the bandwidth in metres, the grid origin and resolution, and the RNG seed feeding privacy noise. With all four pinned and a stable input row order, np.array_equal on the two output rasters must pass; if it fails, one of the four is still a function of the data.

How wide should the halo buffer around each tile be?

At least three bandwidths — ceil(3h / RES_M) cells — because a Gaussian kernel’s contribution is negligible beyond roughly three standard deviations. Narrower buffers leave a faint seam discontinuity; wider buffers only cost compute. Trim the halo before merging adjacent tiles.

Where does the privacy budget get applied in the pipeline?

To the density matrix, after estimation and before normalization, using Laplace or Gaussian noise calibrated to the per-cell sensitivity and the declared $\varepsilon$. Applying it before normalization keeps the budget interpretable in count space; applying it after would couple the noise scale to the normalization curve and make $\varepsilon$ accounting unreliable.

Point Process Simulation Models — the Poisson, Thomas, and Neyman–Scott generators whose intensity these surfaces must reproduce.
Polygon Tessellation Algorithms — align density cells to gap-free, sliver-free zones so aggregation respects boundaries.
Async Execution for Large Grids — non-blocking export and distributed synthesis for continental-scale surfaces.
Scaling Density-Based Spatial Generation with Dask — spatially-aware partitioning that prevents task-graph explosion.
Privacy-Preserving Generation Frameworks — the $(\varepsilon, \delta)$ budgeting that the noise-injection step draws on.
Spatial Distribution & Pattern Generation — the parent area that ties indexing, synthesis, tessellation, and density together.