Geospatial ML for New Store Site Selection & Sales Forecasting

Databricks | Spark | XGBoost | Geospatial Intelligence | MLflow

Executive Summary

This project delivers a production-grade geospatial ML system for New-to-Industry (NTI) retail site selection at a $10B+ convenience & prepared foods retailer (2,500+ locations).

The system replaces intuition-driven real estate decisions with a low-latency ML inference engine that generates 3-year category-wise sales forecasts from a simple latitude/longitude input, reducing evaluation time from weeks to seconds and delivering a 15% accuracy lift over industry-standard 3rd-party tools.

1. Business Problem

The Core Issue

Real estate expansion decisions were:

Decentralized
Qualitative
Inconsistent across regions
High risk due to long-term CAPEX and lease commitments

The organization lacked:

Standardized evaluation criteria
Quantitative risk assessment
Scalability across geographies

Objective

Build a data-driven site selection engine that:

Eliminates regional bias
Quantifies upside and risk
Scales across the entire US footprint
Produces explainable forecasts for capital allocation decisions

2. Why Machine Learning (and Not Rules or BI)

Why BI Failed

Retrospective only
No inferential capability for unseen geographies
Manual heuristics didn’t scale
Reinforced human bias

Why ML Was Required

1,700 heterogeneous features (demographics, mobility, infrastructure)
Strong non-linear interactions
High-dimensional spatial relationships
Cold-start problem for NTI locations

Impact

10× throughput for site evaluation
Standardized investment scoring
Seconds-level inference latency
Bias reduction in multi-million dollar decisions

3. Platform & Architectural Choices

Why Databricks

Managed Spark removed infrastructure overhead
Native support for:
- Large-scale geospatial joins
- Distributed feature engineering
- MLflow-based governance
Faster time-to-market than raw AWS primitives

4. Data Landscape

Data Sources

Domain	Source	Purpose
Human Mobility	Placer.ai	Footfall & traffic behavior
Demographics	US Census	Socio-economic context
Infrastructure	AADT + OSM	Traffic & accessibility
Store History	PostgreSQL	Sales ground truth

Data Hygiene Decisions

Excluded stores:
- < 1.5 years (grand opening bias)
- > 5 years (legacy market conditions)
Final training set: 10 years curated history

5. Spatial Feature Engineering

Trade Area Definition (Critical Design Decision)

Instead of a single radius, each site is evaluated using dual spatial lenses:

1. Drive-Time Isochrones (3 / 5 / 10 mins)

Captures real accessibility
Models friction from road topology

2. Ring Radii (3 / 5 / 10 miles)

Captures broader trade dynamics
Important for diesel & highway-driven behavior

This framework standardized feature extraction across all data sources.

6. Feature Engineering at Scale (1,700+ Features)

Positive Drivers

Schools, stadiums, highways
Placer.ai footfall density
Gas pumps, kitchen layout, store design

Negative Drivers

Competitor proximity
Turn complexity (wrong-side access)
Road network constraints

Handling High Dimensionality

Store Archetype Clustering
- Rural / Suburban / Urban
SHAP-based Feature Pruning
- Preserved explainability
- Reduced noise

7. Spatial & Data Integrity Challenges

Graph-Based Proximity Logic

Euclidean distance was insufficient
Built road-network graphs
Calculated true drive-time distances
Captured one-way roads, dividers, access friction

Shapefile Processing

GeoPandas + Spark
Point-in-Polygon joins against census blocks
Accurate socio-economic attribution per trade area

Extreme Value Imputation

Rural isolation handled explicitly
Missing competitors → distance set to 9999
Treated as a signal, not a null

8. Model Architecture

Multi-Vertical Forecasting

Separate XGBoost regressors for:

Diesel
Gasoline
Prepared Food
Grocery

Each model prioritized different spatial and demographic drivers.

Cold-Start Strategy (Dual Clustering)

Cluster historical stores by performance
Cluster US geography by NTI-available features
Map NTI site → statistical twin

This provided contextual grounding before regression.

9. Training Strategy

Data Splitting

Stratified (not time-based)
Stratified by:
- Store tier
- Geography
70 / 15 / 15 with locked test set

Validation

5-fold cross-validation
GridSearchCV for:
- Depth
- Learning rate
- Subsampling

Imbalance Handling

scale_pos_weight
Synthetic weighting for top 5% performers

10. Compute Optimization

Hybrid Compute

Spark clusters → feature engineering
Single-node multi-GPU → XGBoost training
5× faster grid search vs distributed CPU

Cost Controls

Sliding-window spatial caching
Haversine pre-filtering (60% pruning)
Serverless model serving (scale-to-zero)

11. Evaluation Metrics

Primary Metric: WMAPE

Reflects business cost asymmetry
Penalizes high-volume errors more heavily

Benchmarking

3rd-party baseline: 50% accuracy
Acceptance threshold: 65%
Achieved: 65%+ consistently

12. Explainability & Trust

SHAP-Based Transparency

Global drivers for leadership
Local reason codes per site

Example:

Highway traffic (+20%), schools (+15%) outweigh competitor proximity (-5%)

Bias & Audit Readiness

Full MLflow lineage
Feature-level bias inspection
Reproducible forecasts

13. Deployment Architecture

Feature Consistency

Delta Lake Silver → shared by training & serving
Eliminated training-serving skew

Serving

Databricks Model Serving
Notebook-based UAT with widgets
Millisecond inference latency

14. Monitoring & Drift Detection

±8% WMAPE guardrails
Spatial trade-area drift detection
Annual census-driven retraining
Human-in-the-loop overrides logged for supervision

15. Rollout Strategy

Shadow mode backtesting
Regional pilots (Des Moines, Little Rock)
Champion / Challenger vs 3rd-party
Full self-service production rollout

16. Technical Alternatives Evaluated (and Rejected)

Time-Series Models

Prophet / SARIMA failed on spatial shocks
High dimensionality collapse

Bayesian MCMC

No convergence at scale
Prohibitively expensive

Why XGBoost Won

Non-linear feature handling
Efficient at scale
Strong spatial generalization

17. Business Impact

15% accuracy lift
Millions in avoided CAPEX risk
Decommissioned expensive 3rd-party tooling
Standardized national expansion strategy

18. What I’d Do Differently Today

Huff gravity model for continuous trade influence
Graph Neural Networks for topology-aware learning
Real-time mobility ingestion for near-live adaptation

Key Skills Demonstrated

Databricks & Spark at scale
Advanced geospatial feature engineering
Production ML governance (MLflow, Delta)
Cost-aware cloud architecture
Explainable ML for executive decisions

High-Level Architecture

Architecture Diagram