Geospatial ML for New Store Site Selection & Sales Forecasting

Databricks · Spark · XGBoost · Geospatial Intelligence · MLflow · Unity Catalog

Client: $10B+ convenience & prepared foods retailer (2,500+ locations)
Role: Lead Architect


Executive Summary

Metric | Value
Client scale | $10B+ retailer · 2,500+ locations across the US
CAPEX allocation decisions informed (monthly) | $50M+
Site-selection approval cycle reduction | 70% (3 weeks → 5 days)
Query latency improvement | 22× (45s → <2s)
Geospatial features engineered | 200M+
Accuracy vs 3rd-party benchmarking tool | +15 percentage points (~50% → 65%+)
Data volume processed | >1TB

As lead architect, I worked directly with the client’s Real Estate and Finance leadership to diagnose why the existing decision process was failing at scale, then designed and delivered a system that replaced intuition with a defensible, explainable ML engine. Given a latitude/longitude input, the system generates 3-year category-wise sales forecasts in under two seconds. It helped cut the site-selection approval cycle from three weeks to five days and directly informs $50M+ in monthly capital allocation decisions.


1. Business Problem

The Core Issue

Real estate expansion decisions at this client were:

The organisation lacked:

Objective

Build a data-driven site selection engine to:


2. Why Machine Learning (Not Rules or BI)

Why BI Failed

Why ML Was Required

Impact


3. Platform & Architectural Choice: Why Databricks


4. Data Landscape

Domain | Source | Purpose
Human Mobility | Placer.ai | Footfall & traffic behaviour
Demographics | US Census | Socio-economic context per trade area
Infrastructure | AADT + OpenStreetMap | Traffic volumes & road accessibility
Store History | PostgreSQL | Sales ground truth for training

Data Hygiene Decisions


5. Spatial Feature Engineering

Key Design Decision: Dual Trade-Area Framework

Rather than a single buffer radius, each candidate site was evaluated through two spatial lenses:

1. Drive-Time Isochrones (3 / 5 / 10 minutes)

2. Ring Radii (3 / 5 / 10 miles)

This dual framework standardised feature extraction consistently across all data sources and was a key differentiator over the 3rd-party tool’s single-ring approach.
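The ring-radius lens can be illustrated with a minimal sketch. It assumes points of interest arrive as (latitude, longitude) pairs and uses great-circle (haversine) distance. The function names, the sample site, and the sample points are hypothetical and not taken from the delivered system. Drive-time isochrones need a road network and are not covered here.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def ring_counts(site, points, radii_miles=(3, 5, 10)):
    """Count points inside each ring radius around the site (rings are cumulative)."""
    distances = [haversine_miles(site[0], site[1], lat, lon) for lat, lon in points]
    return {f"count_within_{r}mi": sum(d <= r for d in distances) for r in radii_miles}


# Hypothetical candidate site and made-up points of interest (e.g. schools).
site = (41.5868, -93.6250)
schools = [(41.60, -93.63), (41.65, -93.70), (41.80, -93.40)]
features = ring_counts(site, schools)
```

The same function can be applied to any point dataset (competitors, traffic counters, census centroids), which is one way the "same radii for every source" idea could be kept consistent.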


6. Feature Engineering at Scale (1,700+ Features)

Positive Demand Drivers

Negative / Risk Drivers

Handling High Dimensionality


7. Spatial & Data Integrity Challenges

Graph-Based Proximity Logic

Euclidean distance was insufficient for real-world accessibility modelling. Road-network graphs were built to calculate true drive-time distances, capturing one-way roads, dividers, and access friction.
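To make the contrast with straight-line distance concrete, the sketch below runs Dijkstra's algorithm over a tiny directed graph whose edge weights are drive times in minutes. The graph, node names, and `drive_times` helper are illustrative assumptions. The source does not say which routing library or algorithm was actually used.

```python
import heapq


def drive_times(graph, origin):
    """Shortest drive time (minutes) from origin to every reachable node (Dijkstra)."""
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        minutes, node = heapq.heappop(queue)
        if minutes > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, edge_minutes in graph.get(node, {}).items():
            candidate = minutes + edge_minutes
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best


# Toy network: edges are directed, so a one-way street exists in one direction only.
roads = {
    "site": {"a": 2.0},
    "a": {"b": 3.0},               # one-way: a -> b, no b -> a edge
    "b": {"site": 1.0, "c": 4.0},
    "c": {"b": 4.0},
}
from_site = drive_times(roads, "site")
from_b = drive_times(roads, "b")
```

In this toy network the trip from the site to `b` takes longer than the trip back, an asymmetry that a Euclidean buffer cannot represent.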

Shapefile Processing

GeoPandas and Spark were combined to run point-in-polygon joins against US Census blocks, enabling accurate socio-economic attribution per trade area for each candidate site.
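The core test behind a point-in-polygon join can be shown without any geospatial dependency. The sketch below is a plain ray-casting check, not the GeoPandas or Spark API. The block polygons use made-up planar coordinates, and `assign_block` is a hypothetical helper name.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_block(point, blocks):
    """Return the id of the first block polygon containing the point, else None."""
    for block_id, polygon in blocks.items():
        if point_in_polygon(point[0], point[1], polygon):
            return block_id
    return None


# Two adjacent square "census blocks" in made-up planar coordinates.
blocks = {
    "block_A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "block_B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
site_block = assign_block((3, 1), blocks)
```

At production scale the same predicate is normally delegated to a spatial index and a distributed join rather than a Python loop over every polygon.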

Extreme Value Imputation


8. Model Architecture

Multi-Vertical Forecasting

Separate XGBoost regressors trained for each revenue vertical:

Each model was trained on vertical-specific spatial and demographic drivers: fuel models prioritised highway features, while prepared food models prioritised footfall and school proximity.

Cold-Start Strategy for NTI Sites (Dual Clustering)

New-to-industry sites have no historical sales. The cold-start problem was solved through a statistical twin methodology:

  1. Cluster existing stores by historical performance profile
  2. Cluster US geography by NTI-available spatial features
  3. Map candidate NTI site to its nearest statistical twin cluster
  4. Use twin cluster median as contextual grounding before regression
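Steps 3 and 4 above can be sketched as a nearest-centroid lookup followed by a cluster median. This assumes the clustering in steps 1 and 2 has already produced centroids in a shared spatial feature space, and that each cluster carries the sales of its member stores. The source does not describe how the two clusterings are linked, so that pairing, the cluster names, and all numbers here are illustrative assumptions.

```python
import math
from statistics import median


def nearest_cluster(features, centroids):
    """Return the id of the centroid closest to the feature vector (Euclidean)."""
    return min(centroids, key=lambda cid: math.dist(features, centroids[cid]))


def twin_baseline(features, centroids, cluster_sales):
    """Map a candidate site to its twin cluster and return that cluster's median sales."""
    cluster_id = nearest_cluster(features, centroids)
    return cluster_id, median(cluster_sales[cluster_id])


# Hypothetical centroids over two scaled spatial features, plus made-up member-store sales.
centroids = {"highway": (0.9, 0.2), "suburban": (0.4, 0.7), "rural": (0.1, 0.1)}
cluster_sales = {
    "highway": [2.1, 2.6, 3.0],
    "suburban": [1.2, 1.5, 1.9, 2.0],
    "rural": [0.6, 0.8],
}
candidate = (0.8, 0.3)
twin, baseline = twin_baseline(candidate, centroids, cluster_sales)
```

The returned baseline could then be passed to the regressor as an additional input feature, which is one reading of "contextual grounding before regression".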

9. Training Strategy

Data Splitting

Key Design Decision: Stratified Over Time-Based Split

Time-based splits are appropriate for forecasting models where temporal leakage is a risk. For this cross-sectional spatial model, where the objective is predicting performance at a new location rather than at a future time, stratified splitting was more appropriate. It ensured coverage of rural, suburban, and urban archetypes across the train, validation, and test sets.
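A stratified split of this kind can be sketched in a few lines. The 70/15/15 fractions, the `archetype` field name, and the synthetic store records are assumptions made for illustration. The source does not state the actual proportions or the stratification key.

```python
import random
from collections import defaultdict


def stratified_split(records, key, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split records into train/val/test so every stratum appears in each split."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[key]].append(record)

    train, val, test = [], [], []
    for group in by_stratum.values():
        rng.shuffle(group)  # shuffle within the stratum only
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Synthetic stores: 20 per archetype.
stores = [
    {"store_id": i, "archetype": archetype}
    for i, archetype in enumerate(["rural", "suburban", "urban"] * 20)
]
train, val, test = stratified_split(stores, key="archetype")
```

Because shuffling happens inside each stratum, a rare archetype cannot end up entirely in one split, which is the failure mode a purely random split risks.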

Validation


10. Compute Optimisation

Hybrid Compute Strategy

Feature Engineering Cost Controls


11. Evaluation Metrics

Primary Metric: WMAPE (Weighted Mean Absolute Percentage Error)
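For reference, WMAPE is the sum of absolute errors divided by the sum of absolute actuals, so high-volume stores weigh more than low-volume ones. The sketch below uses made-up numbers. Reading "accuracy" as 1 − WMAPE is an assumption, since the source does not define how the accuracy figures were derived.

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total absolute actuals."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    total_actual = sum(abs(a) for a in actuals)
    if total_actual == 0:
        raise ValueError("WMAPE is undefined when all actuals are zero")
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return total_error / total_actual


# Made-up sales for three stores.
actual_sales = [100.0, 200.0, 700.0]
forecast_sales = [150.0, 150.0, 600.0]
error = wmape(actual_sales, forecast_sales)  # (50 + 50 + 100) / 1000
accuracy = 1.0 - error                       # assumed definition of "accuracy"
```

Unlike plain MAPE, the 50% miss on the smallest store here contributes only 50 of the 200 error units, so it does not dominate the score.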

Benchmarking Results

Benchmark | Accuracy
3rd-party industry tool (baseline) | ~50%
Acceptance threshold (client requirement) | 65%
Achieved | 65%+ consistently

+15 percentage-point accuracy lift (from ~50% to 65%+) over the industry-standard 3rd-party tool the client was previously paying for.


12. Explainability & Stakeholder Trust

SHAP-Based Transparency

SHAP values were computed at two levels:

Example output delivered to real estate leadership:

Highway traffic (+20%), school proximity (+15%) outweigh competitor proximity (–5%) at this site. Forecast reflects strong fuel and grocery upside.
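A narrative like the one above can be assembled by ranking signed feature contributions by magnitude. The sketch below is a hypothetical formatter, not the SHAP library. It assumes contributions have already been computed and expressed as a share of the forecast, and it reuses the figures from the example output plus one invented minor driver.

```python
def summarise_drivers(contributions, top_n=3):
    """Rank signed contributions by absolute size and format the largest ones."""
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return [f"{name} ({value:+.0%})" for name, value in ranked[:top_n]]


# Contributions as fractions of the forecast (assumed scaling); "parking" is invented.
contributions = {
    "highway traffic": 0.20,
    "school proximity": 0.15,
    "competitor proximity": -0.05,
    "parking": 0.01,
}
summary = summarise_drivers(contributions)
```

Sorting by absolute value keeps large negative drivers visible next to the positive ones, which matters when the summary is used to justify rejecting a site.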

Bias & Audit Readiness


13. Deployment Architecture

Feature Consistency

Serving


14. Monitoring & Drift Detection


15. Rollout Strategy

  1. Shadow mode backtesting - model ran in parallel with existing process for 8 weeks; predictions compared against actual outcomes without affecting decisions
  2. Regional pilots - Des Moines and Little Rock markets selected as representative urban and rural test cases
  3. Champion/Challenger - model predictions formally compared against 3rd-party tool scores; model outperformed on WMAPE across all verticals
  4. Full self-service production rollout - real estate teams given direct access via Databricks-served endpoint

16. Technical Alternatives Evaluated and Rejected

Alternative | Why Rejected
Time-Series Models (Prophet, SARIMA) | Failed on spatial shocks; high-dimensionality collapse at 1,700+ features
Bayesian MCMC | No convergence at scale; prohibitively expensive for feature volume
Neural Networks | Insufficient training data per vertical for deep models; XGBoost generalised better on tabular spatial data
Single-buffer trade area | Over-simplified real-world accessibility; dual framework significantly improved WMAPE

17. Business Impact


18. Lessons Learned


High-Level Architecture

[Figure: Architecture Diagram]