Geospatial ML for New Store Site Selection & Sales Forecasting

Databricks · Spark · XGBoost · Geospatial Intelligence · MLflow · Unity Catalog

Client: $10B+ convenience & prepared foods retailer (2,500+ locations)
Role: Lead Architect

Executive Summary

Metric	Value
Client scale	$10B+ retailer · 2,500+ locations across the US
CAPEX allocation decisions informed (monthly)	$50M+
Site-selection approval cycle reduction	70% (3 weeks → 5 days)
Query latency improvement	22× (45s → <2s)
Geospatial features engineered	200M+
Accuracy vs 3rd-party benchmarking tool	+15% lift
Data volume processed	>1TB

As the lead architect, I worked directly with the client’s Real Estate and Finance leadership to diagnose why existing decisions were failing at scale - then designed and delivered a system that replaced intuition with a defensible, explainable ML engine. The system generates 3-year category-wise sales forecasts from a latitude/longitude input, reducing evaluation time from weeks to seconds and directly informing $50M+ in monthly capital allocation decisions.

1. Business Problem

The Core Issue

Real estate expansion decisions at this client were:

Decentralised across regional teams with inconsistent criteria
Qualitative and subjective - dependent on individual real estate managers’ instincts
High-risk due to long-term CAPEX commitments and irreversible lease obligations

The organisation lacked:

Standardised evaluation criteria across geographies
Quantitative risk assessment for new site viability
Scalability for evaluating multiple sites simultaneously

Objective

Build a data-driven site selection engine to:

Eliminate regional bias and standardise decision-making nationally
Quantify upside and risk for any candidate site from geographic coordinates
Scale across the entire US footprint
Produce explainable forecasts that Real Estate and Finance leadership could act on and defend to their CFO

2. Why Machine Learning (Not Rules or BI)

Why BI Failed

Retrospective only - no inferential capability for unseen geographies
Manual heuristics didn’t scale across 2,500+ existing locations and new candidates
Reinforced existing human bias in site selection

Why ML Was Required

>1,700 heterogeneous features: demographics, mobility, infrastructure, competitor data
Strong non-linear interactions between spatial drivers (e.g., highway access, school proximity)
High-dimensional spatial relationships not capturable by rules or regression
Cold-start problem: new-to-industry (NTI) sites have no historical sales to anchor on

Impact

10× throughput for site evaluation
Standardised investment scoring across all geographies
Seconds-level inference latency from lat/lon to 3-year forecast
Bias reduction in multi-million dollar CAPEX decisions

3. Platform & Architectural Choice: Why Databricks

Managed Spark removed infrastructure overhead for large-scale geospatial joins
Native support for distributed feature engineering across 200M+ spatial features
MLflow-based governance for model lineage, versioning, and audit readiness
Migration from schema-on-read Hive to star-schema Delta + Unity Catalog delivered 22× query latency improvement (45s → <2s)
Faster time-to-market than raw AWS primitives given team’s existing proficiency

4. Data Landscape

Domain	Source	Purpose
Human Mobility	Placer.ai	Footfall & traffic behaviour
Demographics	US Census	Socio-economic context per trade area
Infrastructure	AADT + OpenStreetMap	Traffic volumes & road accessibility
Store History	PostgreSQL	Sales ground truth for training

Data Hygiene Decisions

Excluded stores with <1.5 years of history (grand-opening bias distorts early-period sales)
Excluded stores with >5 years of history in some categories (legacy market conditions diverge from current expansion context)
Final training dataset: 10 years of curated store history

5. Spatial Feature Engineering

Key Design Decision: Dual Trade-Area Framework

Rather than a single buffer radius, each candidate site was evaluated through two spatial lenses:

1. Drive-Time Isochrones (3 / 5 / 10 minutes)

Captures real accessibility based on road network topology
Models friction from one-way roads, dividers, and traffic patterns

2. Ring Radii (3 / 5 / 10 miles)

Captures broader trade dynamics and highway-corridor behaviour
Critical for diesel and long-haul truck categories

This dual framework standardised feature extraction consistently across all data sources and was a key differentiator over the 3rd-party tool’s single-ring approach.

6. Feature Engineering at Scale (1,700+ Features)

Positive Demand Drivers

Schools, stadiums, and high-footfall destinations (Placer.ai)
Highway access nodes and major road intersections (AADT)
Gas pumps, kitchen layout variables, store format features

Negative / Risk Drivers

Competitor proximity (distance set to 9,999 when absent - treated as signal, not null)
Turn complexity and wrong-side access friction
Road network constraints modelled via graph traversal

Handling High Dimensionality

Store Archetype Clustering (Rural / Suburban / Urban) to reduce within-group variance
SHAP-based feature pruning - preserved explainability and reduced noise without black-box selection

7. Spatial & Data Integrity Challenges

Graph-Based Proximity Logic

Euclidean distance was insufficient for real-world accessibility modelling. Road-network graphs were built to calculate true drive-time distances, capturing one-way roads, dividers, and access friction.

Shapefile Processing

GeoPandas + Spark for point-in-polygon joins against US Census blocks - enabling accurate socio-economic attribution per trade area for each candidate site.

Extreme Value Imputation

Rural isolation handled explicitly: missing competitor distances set to 9,999 (signal of low competitive density)
Missing mobility data imputed using archetype-cluster medians, not global statistics

8. Model Architecture

Multi-Vertical Forecasting

Separate XGBoost regressors trained for each revenue vertical:

Diesel fuel
Gasoline
Prepared food
Grocery

Each model was trained on vertical-specific spatial and demographic drivers - fuel models prioritised highway features; prepared food models prioritised footfall and school proximity.

Cold-Start Strategy for NTI Sites (Dual Clustering)

New-to-industry sites have no historical sales. The cold-start problem was solved through a statistical twin methodology:

Cluster existing stores by historical performance profile
Cluster US geography by NTI-available spatial features
Map candidate NTI site to its nearest statistical twin cluster
Use twin cluster median as contextual grounding before regression

9. Training Strategy

Data Splitting

Stratified split (not time-based) - stratified by store tier and geography to maintain representation across archetypes
70 / 15 / 15 with a locked holdout test set

Key Design Decision: Stratified Over Time-Based Split

Time-based splits are appropriate for forecasting models where temporal leakage is a risk. For this cross-sectional spatial model - where the objective is predicting performance at a new location, not a future time - stratified splitting was more appropriate. It ensured coverage of rural, suburban, and urban archetypes across train/val/test.

Validation

5-fold cross-validation within training set
GridSearchCV for: tree depth, learning rate, subsampling ratios

10. Compute Optimisation

Hybrid Compute Strategy

Spark clusters → distributed feature engineering over 200M geospatial features
Single-node multi-GPU → XGBoost model training (5× faster grid search vs distributed CPU)
Serverless model serving → scale-to-zero for inference cost efficiency

Feature Engineering Cost Controls

Sliding-window spatial caching to avoid recomputing trade-area overlaps for nearby candidates
Haversine pre-filtering reduced candidate pairs by ~60% before exact polygon joins

11. Evaluation Metrics

Primary Metric: WMAPE (Weighted Mean Absolute Percentage Error)

Selected to reflect business cost asymmetry - penalises errors on high-volume categories more heavily
Aligns evaluation to capital allocation risk, not statistical convenience

Benchmarking Results

Benchmark	Accuracy
3rd-party industry tool (baseline)	~50%
Acceptance threshold (client requirement)	65%
Achieved	65%+ consistently

+15% accuracy lift over the industry-standard 3rd-party tool the client was previously paying for.

12. Explainability & Stakeholder Trust

SHAP-Based Transparency

SHAP values were computed at two levels:

Global drivers - overall feature importance for leadership-level understanding
Local reason codes - site-specific drivers explaining individual forecasts

Example output delivered to real estate leadership:

Highway traffic (+20%), school proximity (+15%) outweigh competitor proximity (–5%) at this site. Forecast reflects strong fuel and grocery upside.

Bias & Audit Readiness

Full MLflow lineage: every forecast traceable to the model version, feature snapshot, and training data vintage
Feature-level bias inspection available for regulatory and compliance review
Reproducible forecasts: same inputs always produce same output (no stochastic inference)

13. Deployment Architecture

Feature Consistency

Delta Lake Silver layer shared between training pipeline and serving pipeline
Eliminates training-serving skew - the exact same feature logic used in training applies at inference

Serving

Databricks Model Serving (serverless) for low-latency inference
Notebook-based UAT with widgets for real estate team self-service validation
Millisecond inference latency from lat/lon input to full 3-year category forecast

14. Monitoring & Drift Detection

±8% WMAPE guardrails: automated alerts when model accuracy degrades beyond threshold
Spatial trade-area drift detection: flags when the geographic distribution of scored sites shifts materially from the training distribution
Annual census-driven retraining cadence - incorporates updated demographic and mobility baselines
Human-in-the-loop override logging: real estate managers can flag model recommendations for supervisory review; overrides captured for future training signal

15. Rollout Strategy

Shadow mode backtesting - model ran in parallel with existing process for 8 weeks; predictions compared against actual outcomes without affecting decisions
Regional pilots - Des Moines and Little Rock markets selected as representative urban and rural test cases
Champion/Challenger - model predictions formally compared against 3rd-party tool scores; model outperformed on WMAPE across all verticals
Full self-service production rollout - real estate teams given direct access via Databricks-served endpoint

16. Technical Alternatives Evaluated and Rejected

Alternative	Why Rejected
Time-Series Models (Prophet, SARIMA)	Failed on spatial shocks; high-dimensionality collapse at 1,700+ features
Bayesian MCMC	No convergence at scale; prohibitively expensive for feature volume
Neural Networks	Insufficient training data per vertical for deep models; XGBoost generalised better on tabular spatial data
Single-buffer trade area	Over-simplified real-world accessibility; dual framework significantly improved WMAPE

17. Business Impact

$50M+ monthly CAPEX allocation decisions informed by the model
70% reduction in site approval cycle (3 weeks → 5 days)
15% accuracy lift over industry-standard 3rd-party tool - which the client subsequently decommissioned following a formal CFO and Real Estate leadership review
Standardised national expansion strategy - consistent, defensible criteria applied across all geographies
Millions in avoided CAPEX risk from sites that would have received approval without quantitative scoring

18. Lessons Learned

Dual trade-area frameworks are worth the engineering complexity: The added geospatial computation cost was significant, but the WMAPE improvement justified it - particularly for highway-corridor diesel sites where radial buffers significantly outperformed isochrone-only approaches.
Cold-start via clustering requires careful archetype definition: Statistical twins only work if the cluster structure reflects real business distinctions (rural/suburban/urban). Initial clustering attempts using pure geographic features produced economically meaningless archetypes. Adding performance-based clustering as the primary layer improved twin assignment quality substantially.
Human-in-the-loop overrides are data, not exceptions: Real estate managers’ override decisions captured valuable local knowledge (planned road changes, new competitor openings) that the model couldn’t see. Logging and incorporating overrides into future training cycles is a high-value, low-effort improvement.
What I’d approach differently today: A Huff gravity model for continuous market influence (rather than discrete trade-area boundaries), and real-time mobility ingestion to capture footfall pattern shifts without waiting for annual Placer.ai refreshes.

High-Level Architecture

Architecture Diagram