Scaling Density-Based Spatial Generation with Dask

A density-based synthetic generation pipeline that runs fine on a city tile detonates the moment it spans a continent: the Dask scheduler emits a multi-million-node task graph, workers cross their memory threshold, and the run dies with MemoryError or a serialization deadlock instead of producing a heat surface.

Part of Density Mapping & Heat Generation within Spatial Distribution & Pattern Generation: the parent page makes a single density surface deterministic and area-true; this page resolves the specific failure of making that same estimation scale across millions of points without the task graph blowing up. It is the distributed companion to Async Execution for Large Grids, which handles non-blocking export of the tiles this page produces.

Root Cause: Task Graph Explosion from Spatially-Agnostic Chunking

Density-based algorithms are inherently non-local. A single point’s contribution to a rasterized intensity surface — or its DBSCAN cluster assignment — depends on every neighbor within a fixed bandwidth or epsilon radius. The estimation step that Density Mapping & Heat Generation describes as a fixed-bandwidth kernel density estimation (KDE) is therefore not embarrassingly parallel: correctness requires that each point can see its neighborhood, and that neighborhood does not respect array boundaries.

Default Dask partitioning splits arrays by row index or fixed byte size, ignoring geographic locality. When a kernel radius or clustering epsilon spans two partitions, Dask’s lazy execution engine cannot evaluate the partition in isolation. It must materialize cross-partition joins, duplicate boundary geometries, and construct intermediate Cartesian products to resolve the overlapping neighborhoods. At continental scale with millions of points, this transforms an O(N log N) spatial operation into an O(N²) memory-bound one. The scheduler tries to resolve every overlap before computing, so the task graph grows with the square of the point count and forces premature materialization of dense intermediate grids. Workers then breach their memory limit, spill to disk, and either thrash or get OOM-killed — which also breaks the bitwise reproducibility the pipeline owes to QA and to any differential privacy audit downstream.
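
A small NumPy experiment makes the locality argument concrete. The sketch below is illustrative only: `cross_partition_fraction`, the 4x4 tiling, the point count, and the radius are assumptions chosen for the demo, not part of the pipeline. It counts what share of neighbor pairs within a radius end up in different partitions under row-index chunking versus a simple spatial tiling.

```python
import numpy as np

def cross_partition_fraction(points: np.ndarray, labels: np.ndarray, radius: float) -> float:
    """Share of point pairs within `radius` whose members sit in different partitions."""
    i, j = np.triu_indices(len(points), k=1)          # every unordered pair once
    d2 = ((points[i] - points[j]) ** 2).sum(axis=1)   # squared pair distances
    close = d2 <= radius ** 2
    return float((labels[i[close]] != labels[j[close]]).mean())

rng = np.random.default_rng(0)
n, n_parts, radius = 1600, 16, 0.02
points = rng.uniform(0.0, 1.0, size=(n, 2))

# Row-index chunking: 16 equal slices in arrival order, blind to geography
index_labels = np.arange(n) // (n // n_parts)

# Spatial tiling: a 4x4 grid over the unit square, one partition per tile
tiles = np.minimum((points * 4).astype(int), 3)
spatial_labels = tiles[:, 1] * 4 + tiles[:, 0]

index_frac = cross_partition_fraction(points, index_labels, radius)
spatial_frac = cross_partition_fraction(points, spatial_labels, radius)
print(f"index chunking: {index_frac:.1%} of neighbor pairs cross a partition seam")
print(f"spatial tiling: {spatial_frac:.1%} of neighbor pairs cross a partition seam")
```

With uniformly scattered points, chunk membership under index chunking is unrelated to position, so the large majority of neighbor pairs straddle two partitions. Under tiling, only pairs that sit near a tile edge do, and those are the pairs a halo buffer has to cover.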

The failure announces itself with three deterministic signals in the scheduler logs:

distributed.scheduler.Scheduler: Task graph exceeds 10^6 nodes during dask.array or dask_geopandas initialization.
Worker memory limit exceeded (95% threshold) with repeated spill-to-disk operations that collapse I/O throughput.
SerializationError: cannot serialize numpy.ndarray when Dask tries to ship a dense distance matrix or adjacency list between workers.

All three trace back to the same cause: the chunking scheme knows nothing about where points are in space.

Minimal Reproducer: Confirming the Chunking Is the Problem

Before changing anything, confirm that the graph size scales with point count rather than partition count — the signature of spatially-agnostic chunking. This snippet builds the graph lazily and inspects it without triggering the expensive compute():

python
import numpy as np
import dask_geopandas as dgpd
import geopandas as gpd

rng = np.random.default_rng(42)

def make_points(n: int) -> gpd.GeoDataFrame:
    """Uniform random points over the CONUS bounding box (EPSG:4326, degrees)."""
    lon = rng.uniform(-124.0, -66.0, n)   # CONUS longitude span
    lat = rng.uniform(24.0, 49.0, n)      # CONUS latitude span
    return gpd.GeoDataFrame(
        {"id": np.arange(n)},
        geometry=gpd.points_from_xy(lon, lat),
        crs="EPSG:4326",
    )

RADIUS_DEG = 0.05  # neighborhood radius, in degrees

for n in (50_000, 200_000, 800_000):
    gdf = make_points(n)
    # Default, index-based partitioning: no spatial awareness
    ddf = dgpd.from_geopandas(gdf, npartitions=16)
    # A neighborhood-style self-join is what KDE/DBSCAN implicitly trigger.
    # Buffering one side and joining on "intersects" approximates a
    # within-distance join without relying on predicate="dwithin" support.
    buffered = dgpd.from_geopandas(
        gdf.assign(geometry=gdf.geometry.buffer(RADIUS_DEG)), npartitions=16
    )
    joined = buffered.sjoin(ddf, predicate="intersects")
    print(f"n={n:>7}  graph_nodes={len(joined.__dask_graph__()):>9}")

If the printed graph_nodes count climbs with n instead of staying flat with the partition count, and especially if it more than quadruples when n quadruples, the pipeline is reconstructing spatial relationships at query time. That is the condition the fix below eliminates.

Fix: Lazy Chunking with Spatially-Aware Partitioning

The remedy is to replace index-based partitioning with spatially contiguous tiling, so that each partition already contains a geographically coherent neighborhood and the kernel never has to reach across a seam. Align partition boundaries to a hierarchical spatial index — an H3 cell, an S2 cell, or an R-tree bound — and defer neighborhood resolution until partition-local execution. Boundary overlaps are handled with deterministic buffer zones that are stripped after computation to prevent double-counting.

python
import dask
import dask.array as da
import dask_geopandas as dgpd
import geopandas as gpd
import h3
import numpy as np
from sklearn.neighbors import KernelDensity

def _h3_partition(gdf: gpd.GeoDataFrame, resolution: int = 7) -> "dgpd.GeoDataFrame":
    """Assign H3 indices and group into spatially contiguous partitions."""
    gdf = gdf.copy()
    # h3 >= 4 API; latlng_to_cell takes (lat, lng), so pass y before x
    gdf["h3_index"] = [
        h3.latlng_to_cell(lat, lng, resolution)
        for lat, lng in zip(gdf.geometry.y, gdf.geometry.x)
    ]
    n_cells = max(gdf["h3_index"].nunique(), 1)
    # Index by cell id: from_geopandas sorts on the index and keeps duplicate
    # index values together, so no H3 cell is split across two partitions.
    return dgpd.from_geopandas(gdf.set_index("h3_index"), npartitions=n_cells)

def _compute_partition_kde(partition_gdf, bandwidth=0.005, grid_res=0.01):
    """Fit and evaluate a KDE within a single spatial partition; no cross-seam reads."""
    coords = np.column_stack([partition_gdf.geometry.x, partition_gdf.geometry.y])
    if len(coords) < 2:
        return None  # empty / singleton partitions contribute nothing

    kde = KernelDensity(bandwidth=bandwidth, metric="euclidean", kernel="gaussian")
    kde.fit(coords)

    # Local evaluation grid for this partition only, snapped to multiples of
    # grid_res so the grid origin depends on configuration, not on the data
    x_min, y_min, x_max, y_max = partition_gdf.total_bounds
    x = np.arange(np.floor(x_min / grid_res) * grid_res, x_max, grid_res)
    y = np.arange(np.floor(y_min / grid_res) * grid_res, y_max, grid_res)
    if x.size == 0 or y.size == 0:
        return None  # extent narrower than one grid step
    xx, yy = np.meshgrid(x, y)
    grid_points = np.column_stack([xx.ravel(), yy.ravel()])

    log_density = kde.score_samples(grid_points)
    # Return a plain NumPy array: this already runs inside a worker task, so
    # wrapping the result in a Dask array here would nest collections.
    return np.exp(log_density).reshape(xx.shape)

def generate_synthetic_density(points_gdf, bandwidth=0.005, h3_res=7, grid_res=0.01):
    """
    Scale density-based spatial generation by partitioning on geography, not index.
    Returns a Dask-backed 1-D array of rasterized density values.
    """
    # 1. Spatially partition with H3 so every partition is a coherent neighborhood
    partitioned = _h3_partition(points_gdf, resolution=h3_res)

    # 2. One delayed KDE task per partition: the graph stays shallow
    tasks = [
        dask.delayed(_compute_partition_kde)(part, bandwidth=bandwidth, grid_res=grid_res)
        for part in partitioned.to_delayed()
    ]
    density_chunks = dask.compute(*tasks)

    # 3. Reassemble in partition order, dropping empty partitions
    valid_chunks = [c for c in density_chunks if c is not None]
    if not valid_chunks:
        raise ValueError("No valid spatial partitions generated.")

    flat = np.concatenate([c.ravel() for c in valid_chunks])
    return da.from_array(flat, chunks=256 * 256)

Because the kernel is evaluated once per partition, the scheduler keeps a graph whose size scales with partition count rather than point count, and no dense distance matrix is ever serialized between workers.
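
The fix code above evaluates each partition from its own points only and does not include the buffer zones the section describes. The following is a minimal sketch of that halo step, assuming a plain NumPy Gaussian kernel sum in place of scikit-learn and a rectangular tile in place of an H3 cell. `kernel_sum`, `tile_grid`, `tile_density`, and the `halo_bandwidths` parameter are hypothetical names introduced for illustration.

```python
import numpy as np

def kernel_sum(points: np.ndarray, grid_xy: np.ndarray, bandwidth: float) -> np.ndarray:
    """Unnormalized fixed-bandwidth Gaussian kernel sum at each grid node."""
    d2 = ((grid_xy[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / bandwidth ** 2).sum(axis=1)

def tile_grid(bounds, grid_res: float) -> np.ndarray:
    """Grid nodes covering the strict tile extent (no halo)."""
    x_min, y_min, x_max, y_max = bounds
    xx, yy = np.meshgrid(np.arange(x_min, x_max, grid_res),
                         np.arange(y_min, y_max, grid_res))
    return np.column_stack([xx.ravel(), yy.ravel()])

def tile_density(points, bounds, bandwidth, grid_res, halo_bandwidths=3.0):
    """Read points from the tile plus a halo, but evaluate only on the strict tile grid."""
    x_min, y_min, x_max, y_max = bounds
    halo = halo_bandwidths * bandwidth
    in_reach = (
        (points[:, 0] >= x_min - halo) & (points[:, 0] <= x_max + halo)
        & (points[:, 1] >= y_min - halo) & (points[:, 1] <= y_max + halo)
    )
    return kernel_sum(points[in_reach], tile_grid(bounds, grid_res), bandwidth)

rng = np.random.default_rng(7)
points = rng.uniform(0.0, 1.0, size=(3000, 2))
bounds = (0.25, 0.25, 0.5, 0.5)          # one interior tile of the unit square
bandwidth, grid_res = 0.02, 0.01

reference = kernel_sum(points, tile_grid(bounds, grid_res), bandwidth)  # sees every point
with_halo = tile_density(points, bounds, bandwidth, grid_res, halo_bandwidths=3.0)
no_halo = tile_density(points, bounds, bandwidth, grid_res, halo_bandwidths=0.0)

scale = reference.max()
halo_error = np.abs(with_halo - reference).max() / scale
seam_error = np.abs(no_halo - reference).max() / scale
print(f"max error with 3-bandwidth halo: {halo_error:.4f}")
print(f"max error without halo:          {seam_error:.4f}")
```

Evaluating only the strict-tile grid while reading halo points gives the same values as computing on the buffered extent and cropping afterwards, without allocating the larger grid. Because halo points are only read and never written into a neighbor's output, no density is double-counted when tiles are merged.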

Task Graph and Memory Guardrails

Spatial partitioning removes the quadratic blow-up, but Dask still needs explicit configuration to stop graph bloat and worker thrashing during Spatial Distribution & Pattern Generation workflows:

Chunk-size alignment. Size array chunks so a working tile stays cache-friendly: 256–512 elements per axis is roughly 0.5–2 MB of float64 per 2-D chunk. Oversized chunks trigger worker OOM; undersized chunks inflate scheduler overhead.
Graph fusion. Call dask.optimize() on the collection before compute() to fuse adjacent operations and drop redundant intermediates, and set dask.config.set({"array.slicing.split_large_chunks": True}) so slicing does not produce oversized chunks. Pass dotted configuration keys as a dict, since they cannot be written as Python keyword arguments.
Serialization control. Stage intermediate state through zarr or parquet rather than the default pickle serializer, which fails on dense adjacency matrices above 2 GB. Set distributed.worker.memory.spill to 0.75 and distributed.comm.timeouts.connect to 30s to keep high-concurrency spatial joins out of deadlock.
Deterministic execution. Run QA validation under dask.config.set(scheduler="synchronous") so partition ordering and seed propagation are reproducible; switch to the distributed scheduler only after memory profiles stabilize.
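
As a rough sketch, the settings in the list above can be grouped into one scoped configuration block. The keys and values are the ones named in the list; the `guardrails` dict and the `active` snapshot are illustrative names, and the worker-side keys are assumed to also be present in the environment where the workers are started, since setting them in the client process alone may not reach workers that are already running.

```python
import dask

# Values taken from the guardrail list above; tune them per cluster.
guardrails = {
    "array.slicing.split_large_chunks": True,
    "distributed.worker.memory.spill": 0.75,
    "distributed.comm.timeouts.connect": "30s",
}

# Both settings are scoped: they revert when the with-block exits.
with dask.config.set(guardrails), dask.config.set(scheduler="synchronous"):
    # Any compute() issued here runs on the single-threaded scheduler,
    # which keeps partition ordering reproducible for QA runs.
    active = {key: dask.config.get(key) for key in guardrails}
    active["scheduler"] = dask.config.get("scheduler")

print(active)
```

Scoping the synchronous scheduler to a `with` block keeps the QA path and the distributed path in the same codebase without a global switch.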

See Dask Best Practices for Large Datasets for scheduler tuning and the H3 Hexagonal Hierarchical Spatial Index docs for resolution-to-area mapping tables.

Verification Step

The fix has two acceptance criteria: the task graph must scale with partition count, and the output must be bitwise-reproducible across runs. Gate both in CI:

python
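import numpy as np

# Assumes make_points, _h3_partition and generate_synthetic_density from the
# snippets above are importable in the test module.

# Coarse cells so uniformly scattered test points share partitions; at
# resolution 7 most random CONUS points land in a cell of their own, which
# would make cell count track point count and leave the KDE nothing to fit.
TEST_H3_RES = 3

def test_graph_scales_with_partitions():
    """Graph size tracks H3 cell count, not point count."""
    small = _h3_partition(make_points(50_000), resolution=TEST_H3_RES)
    large = _h3_partition(make_points(800_000), resolution=TEST_H3_RES)
    small_nodes = len(small.__dask_graph__())
    large_nodes = len(large.__dask_graph__())
    # 16x more points must NOT mean ~16x more graph nodes
    assert large_nodes < 4 * small_nodes, "task graph still scaling with N"

def test_density_is_reproducible():
    """Same input + same configuration -> byte-identical density."""
    pts = make_points(100_000)
    a = generate_synthetic_density(pts, bandwidth=0.005, h3_res=TEST_H3_RES).compute()
    b = generate_synthetic_density(pts, bandwidth=0.005, h3_res=TEST_H3_RES).compute()
    assert np.array_equal(a, b), "non-deterministic density output"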

If test_graph_scales_with_partitions fails, a partition is still being formed by row index somewhere upstream. If test_density_is_reproducible fails, the bandwidth, grid origin, or RNG seed feeding the privacy noise is still a function of the data rather than the configuration — the same determinism contract that CRS contract enforcement imposes elsewhere in the stack.

Edge Cases & Gotchas

Antimeridian-spanning H3 cells. Near ±180° longitude, a single hexagon’s vertices straddle the date line, and total_bounds reports a width of nearly 360°, producing an enormous evaluation grid that re-creates the OOM you just fixed. Detect cells whose bound width exceeds 180°, split them into east/west sub-extents, and evaluate each separately before merging.
Null Island and unprojected coordinates. Points that defaulted to (0, 0) after a failed geocode all collapse into one equatorial H3 cell, creating a single oversized partition that starves every other worker of work. Filter Point(0, 0) before partitioning, and assert the input CRS is EPSG:4326 so latlng_to_cell receives degrees and not projected metres.
Seam artifacts from missing halo buffers. Density is non-local, so cropping strictly to the H3 boundary leaves a faint trough exactly on each cell seam. Buffer every partition by at least three bandwidths, compute on the buffered extent, then crop back to the strict boundary with rasterio.mask.mask before merging. This is the same halo discipline the parent Density Mapping & Heat Generation page applies to single-machine tiling.
Single-resolution mismatch under skew. A fixed h3_res tuned for one point density leaves rural cells nearly empty (wasted workers) and metro cells overloaded. If spill-to-disk exceeds ~15% of run time, refine the overloaded cells by one H3 resolution tier (smaller cells, smaller partitions), or switch those cells to a cheaper approximate estimator (for example, scikit-learn's KernelDensity accepts a nonzero rtol that trades accuracy for speed).
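
The first two gotchas above can be guarded with a few lines of NumPy. This is a sketch under stated assumptions: `drop_null_island` and `partition_extents` are hypothetical helpers, inputs are EPSG:4326 degrees, and a partition is assumed to be small enough that a bound width over 180° can only mean it straddles the antimeridian.

```python
import numpy as np

def drop_null_island(lon, lat):
    """Remove points sitting exactly at (0, 0), the usual failed-geocode default."""
    lon, lat = np.asarray(lon, dtype=float), np.asarray(lat, dtype=float)
    keep = ~((lon == 0.0) & (lat == 0.0))
    return lon[keep], lat[keep]

def partition_extents(lon, lat, max_width_deg=180.0):
    """Return one (x_min, y_min, x_max, y_max) box, or east/west boxes for a date-line straddle."""
    lon, lat = np.asarray(lon, dtype=float), np.asarray(lat, dtype=float)
    if lon.max() - lon.min() <= max_width_deg:
        return [(lon.min(), lat.min(), lon.max(), lat.max())]
    boxes = []
    for side in (lon >= 0.0, lon < 0.0):      # east of Greenwich, then west
        if side.any():
            boxes.append((lon[side].min(), lat[side].min(),
                          lon[side].max(), lat[side].max()))
    return boxes

# Four points around the antimeridian plus one Null Island artifact
lon = np.array([179.6, 179.9, -179.8, -179.7, 0.0])
lat = np.array([10.00, 10.10, 10.20, 10.05, 0.0])

lon, lat = drop_null_island(lon, lat)
boxes = partition_extents(lon, lat)
for box in boxes:
    print(box)
```

Each returned box can then be passed to the per-partition evaluation separately, so the evaluation grid stays a fraction of a degree wide instead of spanning the whole longitude range.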

Apply privacy noise after estimation and before normalization. Seed np.random.Generator from a cryptographic hash of each partition’s H3 index so the noise is reproducible per partition, and calibrate its scale to per-cell sensitivity so the differential privacy budget tracks local partition density rather than global point count. Validate the assembled surface against ground-truth Kolmogorov–Smirnov and Ripley’s K statistics to confirm spatial autocorrelation survived partitioning; consult scikit-learn Kernel Density Estimation for bandwidth-selection heuristics.
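
One way the per-partition seeding could look is sketched below. Several pieces are assumptions rather than the pipeline's actual method: the Laplace mechanism, a sensitivity of 1 per cell, the `salt` argument (a hypothetical run-level secret, added so the noise is not derivable from the public cell id alone), and the helper names `partition_rng` and `add_partition_noise`. The H3 strings are sample identifiers used only as hash inputs.

```python
import hashlib
import numpy as np

def partition_rng(h3_index: str, salt: str = "") -> np.random.Generator:
    """Derive a reproducible generator from a SHA-256 hash of the partition's cell id."""
    digest = hashlib.sha256(f"{salt}:{h3_index}".encode("utf-8")).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "big"))

def add_partition_noise(values, h3_index, epsilon, sensitivity=1.0, salt=""):
    """Add Laplace noise with scale sensitivity / epsilon, before any normalization."""
    values = np.asarray(values, dtype=float)
    rng = partition_rng(h3_index, salt)
    return values + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=values.shape)

raw = np.array([[4.0, 9.0], [1.0, 0.0]])   # un-normalized per-node estimates for one partition

run_a = add_partition_noise(raw, "872830828ffffff", epsilon=1.0, salt="run-1")
run_b = add_partition_noise(raw, "872830828ffffff", epsilon=1.0, salt="run-1")
other = add_partition_noise(raw, "87283082bffffff", epsilon=1.0, salt="run-1")

print(np.array_equal(run_a, run_b))   # same cell and salt: identical noise
print(np.array_equal(run_a, other))   # different cell: different noise
```

Hashing the cell id instead of using a partition's position in the task list keeps the noise stable when the scheduler reorders or re-chunks partitions, which is what the reproducibility test relies on.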

Frequently Asked Questions

Why not just raise the worker memory limit instead of repartitioning?

More RAM only delays the failure. The task graph itself grows with the square of the point count under index-based chunking, so the scheduler exhausts memory while building the graph, before any worker runs. Spatial partitioning changes the graph’s shape, not just its footprint, which is the only fix that scales.

What H3 resolution should I start with?

Resolution 7 (average cell edge ≈ 1.2 km) is a reasonable default for regional density at sub-kilometre grids. Choose the coarsest resolution whose cells still hold enough points for stable estimation; finer resolutions multiply partitions and graph size, coarser ones reintroduce cross-cell neighborhoods that the kernel must reach across.

Can I use S2 or a quadtree instead of H3?

Yes — the requirement is geographic contiguity per partition, not H3 specifically. S2 cells, an R-tree leaf bound, or a fixed quadkey grid all satisfy it. H3 is convenient because its hexagons have uniform neighbor distances, which makes the three-bandwidth halo width identical in every direction.

Why does the output have to be byte-identical between runs?

Because synthetic density surfaces feed ML feature stores and privacy attestations that are audited. If two runs of the same seed produce different rasters, you cannot prove the noise injection respected the declared epsilon, and statistical-parity gates downstream become meaningless. Determinism is what makes the privacy budget accountable.

How do I keep the privacy budget correct after partitioning?

Calibrate the noise to per-cell sensitivity inside each partition and apply it before normalization, seeding the generator from the partition’s H3 index. This keeps epsilon interpretable in count space and prevents one dense cell from quietly consuming budget that a sparse cell never spent.

Density Mapping & Heat Generation — the parent page that defines the deterministic, area-true density surface this page scales out.
Async Execution for Large Grids — non-blocking export and back-pressure for the tiles this pipeline emits.
Point Process Simulation Models — the generators whose intensity these density surfaces must reproduce.
Privacy-Preserving Generation Frameworks — the $(\varepsilon, \delta)$ budgeting the per-partition noise step draws on.