
Geospatial ML for New Store Site Selection & Sales Forecasting

Databricks | Spark | XGBoost | Geospatial Intelligence | MLflow


Executive Summary

This project delivers a production-grade geospatial ML system for New-to-Industry (NTI) retail site selection at a $10B+ convenience & prepared foods retailer (2,500+ locations).

The system replaces intuition-driven real estate decisions with a low-latency ML inference engine that generates 3-year, category-level sales forecasts from a single latitude/longitude input. It cuts site evaluation time from weeks to seconds and delivers a 15% accuracy lift over industry-standard 3rd-party tools.


1. Business Problem

The Core Issue

Real estate expansion decisions were:

The organization lacked:

Objective

Build a data-driven site selection engine that:


2. Why Machine Learning (and Not Rules or BI)

Why BI Failed

Why ML Was Required

Impact


3. Platform & Architectural Choices

Why Databricks


4. Data Landscape

Data Sources

| Domain | Source | Purpose |
|---|---|---|
| Human Mobility | Placer.ai | Footfall & traffic behavior |
| Demographics | US Census | Socio-economic context |
| Infrastructure | AADT + OSM | Traffic & accessibility |
| Store History | PostgreSQL | Sales ground truth |

Data Hygiene Decisions


5. Spatial Feature Engineering

Trade Area Definition (Critical Design Decision)

Instead of a single radius, each site is evaluated using dual spatial lenses:

1. Drive-Time Isochrones (3 / 5 / 10 mins)

2. Ring Radii (3 / 5 / 10 miles)

This framework standardized feature extraction across all data sources.
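As a rough illustration of the ring-radius lens, the sketch below counts points of interest within 3, 5, and 10 miles of a candidate site using great-circle (haversine) distance. The function names, the feature naming scheme, and the sample coordinates are hypothetical and not taken from the project. Drive-time isochrones are not covered here because they need a routing engine rather than straight-line distance.

```python
import math

EARTH_RADIUS_MILES = 3958.8      # mean Earth radius
RING_RADII_MILES = (3, 5, 10)    # ring radii named in the text


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def ring_counts(site, points, radii=RING_RADII_MILES):
    """Count points falling inside each ring around the site.

    site   -- (lat, lon) of the candidate location
    points -- iterable of (lat, lon) pairs, e.g. competitor stores (hypothetical)
    """
    lat, lon = site
    distances = [haversine_miles(lat, lon, p_lat, p_lon) for p_lat, p_lon in points]
    # Rings are cumulative: a point inside 3 miles is also inside 5 and 10.
    return {f"count_within_{r}mi": sum(d <= r for d in distances) for r in radii}


# Illustrative coordinates only (not real store locations).
site = (41.6, -93.6)
pois = [(41.6, -93.6), (41.65, -93.6), (41.7, -93.6), (42.6, -93.6)]
features = ring_counts(site, pois)
```

The same pattern would apply to any point-based source: swap the list of points and the resulting counts become one feature per ring.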


6. Feature Engineering at Scale (1,700+ Features)

Positive Drivers

Negative Drivers

Handling High Dimensionality


7. Spatial & Data Integrity Challenges

Graph-Based Proximity Logic

Shapefile Processing

Extreme Value Imputation


8. Model Architecture

Multi-Vertical Forecasting

Separate XGBoost regressors for:

Each model prioritized different spatial and demographic drivers.

Cold-Start Strategy (Dual Clustering)

  1. Cluster historical stores by performance
  2. Cluster US geography by NTI-available features
  3. Map NTI site → statistical twin

This provided contextual grounding before regression.
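The three steps above can be sketched as a nearest-centroid lookup followed by a vote. This is a minimal sketch under several assumptions: the source does not name the clustering algorithm, so the code assumes the geography clusters are already fitted and summarized as centroids over scaled features, and it picks the "twin" performance cluster by majority vote among historical stores in the same geography cluster. All names and values are hypothetical.

```python
import math
from collections import Counter


def nearest_centroid(features, centroids):
    """Return the id of the centroid closest (Euclidean) to the feature vector."""
    return min(centroids, key=lambda cid: math.dist(features, centroids[cid]))


def statistical_twin(site_features, geo_centroids, stores):
    """Map an NTI site to a geography cluster and its dominant performance cluster.

    site_features -- scaled feature vector available for a brand-new site
    geo_centroids -- {geo_cluster_id: centroid vector} from the geography clustering
    stores        -- list of (geo_cluster_id, performance_cluster_id) per historical store
    """
    geo_cluster = nearest_centroid(site_features, geo_centroids)
    votes = Counter(perf for geo, perf in stores if geo == geo_cluster)
    # Assumption: majority vote; returns None if no historical store shares the cluster.
    twin = votes.most_common(1)[0][0] if votes else None
    return geo_cluster, twin


# Illustrative inputs only.
geo_centroids = {"urban": (0.9, 0.8), "rural": (0.1, 0.2)}
stores = [("urban", "high"), ("urban", "high"), ("urban", "mid"), ("rural", "low")]
result = statistical_twin((0.8, 0.7), geo_centroids, stores)
```

The returned pair could then be passed to the regressors as categorical context, which is one way to read "contextual grounding before regression".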


9. Training Strategy

Data Splitting

Validation

Imbalance Handling


10. Compute Optimization

Hybrid Compute

Cost Controls


11. Evaluation Metrics

Primary Metric: WMAPE
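For reference, WMAPE (weighted mean absolute percentage error) is commonly defined as the sum of absolute errors divided by the sum of absolute actuals, so high-volume stores weigh more than low-volume ones. The sketch below uses that standard definition. The source does not state which variant the project used, and the sample numbers are illustrative.

```python
def wmape(actual, forecast):
    """Weighted MAPE: sum(|actual - forecast|) / sum(|actual|)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("WMAPE is undefined when all actuals are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / total


# Illustrative values: absolute errors 10 + 20 + 50 = 80 over actuals totaling 1000.
score = wmape([100, 200, 700], [110, 180, 650])
```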

Benchmarking


12. Explainability & Trust

SHAP-Based Transparency

Example:

Highway traffic (+20%) and schools (+15%) outweigh competitor proximity (-5%).
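SHAP values are additive: a prediction equals a base value plus the sum of per-feature contributions. If the percentages in the example are read as additive contributions relative to a baseline forecast (an assumption; the source does not say how they are scaled), the net effect and the driver ranking can be sketched as follows. The feature names are hypothetical labels for the three drivers mentioned.

```python
def net_effect(contributions):
    """Sum of per-feature contributions (additivity of SHAP-style attributions)."""
    return sum(contributions.values())


def rank_drivers(contributions):
    """Features ordered by absolute contribution, largest first."""
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Contributions in percentage points, taken from the example in the text.
contributions = {
    "highway_traffic": 20,
    "schools": 15,
    "competitor_proximity": -5,
}
net = net_effect(contributions)
ranked = rank_drivers(contributions)
```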

Bias & Audit Readiness


13. Deployment Architecture

Feature Consistency

Serving


14. Monitoring & Drift Detection


15. Rollout Strategy

  1. Shadow mode backtesting
  2. Regional pilots (Des Moines, Little Rock)
  3. Champion/challenger comparison against the 3rd-party tool
  4. Full self-service production rollout

16. Technical Alternatives Evaluated (and Rejected)

Time-Series Models

Bayesian MCMC

Why XGBoost Won


17. Business Impact


18. What I’d Do Differently Today


Key Skills Demonstrated


High-Level Architecture

[Figure: Architecture Diagram]