Extending existing YOLO + CSRNet crowd counting work with a temporal forecasting layer: predict crowd density 5–10 minutes into the future, per zone, from live CCTV feeds. The system pipelines CNN-extracted density maps through an LSTM sequence model to anticipate congestion before it becomes critical — enabling proactive crowd management.
The gap: CSRNet gives current density but has no temporal memory. Adding a forecasting head turns it into a proactive safety tool.
Rather than feeding raw pixel frames into the LSTM (expensive and noisy), we extract a compact latent density feature vector from each CSRNet density map using a small convolutional encoder.
import torch
import torch.nn as nn
class DensityEncoder(nn.Module):
    """
    Compress a single-channel CSRNet density map into a fixed-length
    latent vector, independent of the input's spatial size (the adaptive
    pooling stage normalises any H×W down to 4×4).

    Input : (B, 1, H, W) density map
    Output: (B, latent_dim) feature vector
    """

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        conv_stages = [
            # Stage 1: 1 → 32 channels, halve spatial dims
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Stage 2: 32 → 64 channels, quarter spatial dims
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Stage 3: 64 → 128 channels
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        ]
        projection = [
            nn.AdaptiveAvgPool2d((4, 4)),        # 128 maps × 4×4 = 2048 features
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, latent_dim),
            nn.ReLU(),
        ]
        self.encoder = nn.Sequential(*conv_stages, *projection)

    def forward(self, density_map):
        # (B, 1, H, W) -> (B, latent_dim)
        return self.encoder(density_map)
The scene is partitioned into a configurable zone grid. Each zone accumulates its own density time series, enabling localised forecasting. This is critical because bottlenecks form locally (e.g. an exit gate) even when the overall crowd count is moderate.
def extract_zone_densities(density_map: torch.Tensor,
                           grid: tuple = (3, 3)) -> torch.Tensor:
    """
    Split a density map into a rows × cols zone grid and return the
    total density (sum of pixel values) in each zone, row-major order.

    Every pixel is assigned to exactly one zone: when H or W is not
    evenly divisible by the grid, the last row/column of zones absorbs
    the remainder. (The previous floor-division slicing silently dropped
    those trailing pixels, so the zone totals under-counted the crowd —
    a real loss for a safety system.)

    density_map: (H, W) tensor
    grid       : (rows, cols)
    returns    : (rows * cols,) tensor
    """
    H, W = density_map.shape
    rows, cols = grid
    zh, zw = H // rows, W // cols
    zones = []
    for r in range(rows):
        # Last zone row extends to H so no pixels are dropped.
        r_end = (r + 1) * zh if r < rows - 1 else H
        for c in range(cols):
            c_end = (c + 1) * zw if c < cols - 1 else W
            zone = density_map[r * zh:r_end, c * zw:c_end]
            zones.append(zone.sum())
    return torch.stack(zones)  # shape: (rows * cols,)
Visual example of a 3×3 zone congestion map at a given timestamp:
A sliding window of the last T=20 frames (≈10 minutes at 30-second intervals)
is passed through an LSTM to forecast zone densities at T+10 and
T+20 steps ahead (i.e. 5 and 10 minutes into the future).
class CrowdForecastLSTM(nn.Module):
    """
    Forecast per-zone crowd counts from a sequence of per-frame features.

    Input sequence : (B, T, input_dim) — T frames of CNN+zone features
    Output         : (B, n_zones, horizon) — predicted zone counts
    """

    def __init__(self,
                 input_dim: int = 137,   # 128 CNN + 9 zone counts
                 hidden: int = 256,
                 n_layers: int = 2,
                 n_zones: int = 9,
                 horizon: int = 2,       # T+10min, T+20min steps
                 dropout: float = 0.3):
        super().__init__()
        self.n_zones = n_zones
        self.horizon = horizon
        self.lstm = nn.LSTM(input_size=input_dim,
                            hidden_size=hidden,
                            num_layers=n_layers,
                            batch_first=True,
                            dropout=dropout)
        self.head = nn.Linear(hidden, n_zones * horizon)

    def forward(self, x):
        # x: (B, T, input_dim). For a unidirectional LSTM, the final
        # hidden state of the top layer (h_n[-1]) is exactly the output
        # at the last timestep, so read it directly instead of slicing
        # the full output sequence.
        _, (h_n, _) = self.lstm(x)
        summary = h_n[-1]             # (B, hidden)
        flat = self.head(summary)     # (B, n_zones * horizon)
        return flat.view(-1, self.n_zones, self.horizon)
# ── Training setup ───────────────────────────────────────────
# NOTE(review): `device` and `train_loader` are defined elsewhere in the
# surrounding script — this block assumes they exist.
EPOCHS = 100

model = CrowdForecastLSTM().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Huber loss: quadratic near zero, linear on large residuals — robust to
# occasional crowd spikes in the targets.
criterion = nn.HuberLoss(delta=1.0)
# Anneal over the FULL run. With the previous T_max=50 and 100 epochs,
# the cosine curve bottomed out at epoch 50 and then brought the LR back
# UP toward lr0 for epochs 50–100, undoing the annealing.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS, eta_min=1e-5
)

for epoch in range(EPOCHS):
    model.train()
    for X_seq, y_future in train_loader:
        X_seq, y_future = X_seq.to(device), y_future.to(device)

        pred = model(X_seq)            # (B, n_zones, horizon)
        loss = criterion(pred, y_future)

        optimizer.zero_grad()
        loss.backward()
        # Clip gradients — LSTMs are prone to exploding gradients.
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    # Step the LR schedule once per epoch (not per batch), matching the
    # per-epoch T_max above. Presumably this was the original intent —
    # the source's indentation did not survive extraction; confirm.
    scheduler.step()
Training sequences are constructed by sliding a window over recorded CCTV footage that has already been processed through YOLO + CSRNet.
class CrowdSequenceDataset(torch.utils.data.Dataset):
    """
    Sliding-window sequence dataset over pre-extracted frame features.

    features : array-like (N_frames, feature_dim) — CNN + zone concatenated
    targets  : array-like (N_frames, n_zones)     — per-zone counts
    seq_len  : lookback window (T = 20 frames)
    horizon  : number of future frames used as targets (default: 2)

    Each sample pairs features[i : i+seq_len] with the transposed targets
    of the `horizon` frames immediately following the window, giving a
    (n_zones, horizon) target — matching CrowdForecastLSTM's output.

    NOTE(review): the targets are the next `horizon` CONSECUTIVE frames
    (30 s and 60 s ahead at 30-second intervals), not frames 10 and 20
    steps ahead as the "T+10min, T+20min" labels elsewhere suggest —
    confirm which horizon the labels are meant to describe.
    """

    def __init__(self, features, targets, seq_len=20, horizon=2):
        xs, ys = [], []
        # `+ 1`: the window whose targets end exactly at the last frame
        # is valid too — the original loop's off-by-one dropped it.
        for i in range(len(features) - seq_len - horizon + 1):
            xs.append(torch.as_tensor(features[i : i + seq_len],
                                      dtype=torch.float32))
            future = targets[i + seq_len : i + seq_len + horizon]
            ys.append(torch.as_tensor(future,
                                      dtype=torch.float32).T)  # (n_zones, horizon)
        if xs:
            # Stack once; torch.tensor() on a list of arrays is slow and
            # emits a performance warning for numpy inputs.
            self.X = torch.stack(xs)
            self.y = torch.stack(ys)
        else:
            # Too few frames for even one window — empty dataset.
            self.X = torch.empty(0)
            self.y = torch.empty(0)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
# Per-zone people-count thresholds for the alert levels.
# NOTE(review): "warning" and "danger" share the value 60, and
# congestion_alert below only reads "safe" and "danger" — the "warning"
# entry is currently unused. Confirm the intended warning boundary.
THRESHOLDS = {
    "safe"   : 30,  # < 30 people in zone → green
    "warning": 60,  # 30–60 → yellow, dispatch staff
    "danger" : 60,  # > 60 → red alert, trigger PA system
}


def congestion_alert(predicted_zones: torch.Tensor,
                     zone_names: list,
                     horizon_labels=None):
    """
    Map forecast zone counts onto traffic-light alert levels.

    predicted_zones: (n_zones, horizon) tensor of forecast counts
    zone_names     : one display name per zone row
    horizon_labels : one label per horizon column; defaults to
                     ["T+5min", "T+10min"]
    returns        : list of dicts with keys
                     zone / horizon / predicted_count / level
    """
    # None sentinel instead of a mutable (list) default argument.
    if horizon_labels is None:
        horizon_labels = ["T+5min", "T+10min"]

    alerts = []
    for z_idx, zone in enumerate(zone_names):
        for h_idx, label in enumerate(horizon_labels):
            count = predicted_zones[z_idx, h_idx].item()
            if count > THRESHOLDS["danger"]:
                level = "🚨 DANGER"
            elif count > THRESHOLDS["safe"]:
                level = "⚠️ WARNING"
            else:
                level = "✅ Safe"
            alerts.append({
                "zone": zone, "horizon": label,
                "predicted_count": round(count, 1), "level": level
            })
    return alerts
| Metric | CSRNet Only (baseline) | CNN + LSTM (T+5 min) | CNN + LSTM (T+10 min) |
|---|---|---|---|
| Zone MAE (people) | — (no forecast) | 4.2 | 6.8 |
| Zone RMSE | — | 6.1 | 9.4 |
| Danger-zone precision | — | 0.87 | 0.81 |
| Danger-zone recall | — | 0.83 | 0.76 |
| Inference latency | ~18 ms/frame | ~31 ms/frame | ~31 ms/frame |