Technical · March 2026 · 15 min read

Real-time Fraud Detection at Scale: A Technical Deep-Dive

How we architected a fraud detection system processing 50,000 transactions per second with sub-100ms latency. A comprehensive look at feature engineering, model serving, and continuous retraining at scale.


Fraud detection is one of the few ML applications where "fast enough" isn't a nice-to-have—it's existential. A fraudulent transaction approved is money lost. But a legitimate transaction declined is a customer lost. You need models that are both accurate and fast, running at scales that would break traditional architectures.

This article documents the architecture we built for a financial services client processing over 50,000 transactions per second at peak load, with a hard latency requirement of 100ms end-to-end. Here's how we did it.

The Constraints

Before diving into architecture, let's establish the constraints we were working with:

  • Latency: 100ms p99, including network round-trip, feature computation, model inference, and response serialization
  • Throughput: 50,000 TPS sustained, with 3x burst capacity
  • Accuracy: 99.5% precision at 90% recall (false positives are expensive)
  • Availability: 99.99% uptime—degraded service is acceptable, full outage is not
  • Freshness: Models must incorporate feedback within 4 hours

Architecture Overview

The system comprises four main components: feature computation, model serving, feedback loops, and monitoring. Each is designed for independent scaling and graceful degradation.

Feature Computation

Fraud detection models live and die by their features. The most predictive signals often come from behavioral patterns: how does this transaction compare to the user's history? Is this merchant unusual for this card? How does the amount compare to typical spending?

Computing these features in real-time is the hardest part of the system. We use a three-tier architecture:

  • Pre-computed features in Redis, updated by streaming jobs. User spending patterns, merchant risk scores, device fingerprints. Updated every 15 minutes.
  • Real-time aggregations using Apache Flink. Rolling windows of transaction counts, amounts, and velocities computed on the fly.
  • Point-in-time features computed at request time. Transaction-specific signals like amount deviation, time since last transaction, geographic distance.

The key insight: we accept some staleness in exchange for latency. A user's spending pattern from 15 minutes ago is good enough; computing it fresh for every request would blow the latency budget.
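The three tiers come together at request time roughly as follows. This is a minimal sketch, not the production code: the fetcher names and feature values are illustrative stand-ins for the Redis lookups, Flink-backed aggregations, and request-time computation described above. The point is the shape: the two independent lookups run concurrently, and only the cheap point-in-time signals are computed inline.

```python
import asyncio

# Hypothetical fetchers for the three tiers; in production these would hit
# Redis, a Flink-backed aggregate store, and local computation respectively.
async def fetch_precomputed(user_id: str) -> dict:
    # Tier 1: features precomputed by streaming jobs (may be ~15 min stale).
    return {"avg_spend_30d": 84.20, "merchant_risk": 0.12}

async def fetch_realtime_aggregates(user_id: str) -> dict:
    # Tier 2: rolling-window aggregations (counts, amounts, velocities).
    return {"txn_count_1h": 3, "amount_sum_1h": 210.0}

def compute_point_in_time(txn: dict, precomputed: dict) -> dict:
    # Tier 3: request-time signals derived from the transaction itself.
    deviation = txn["amount"] / max(precomputed["avg_spend_30d"], 1e-6)
    return {"amount_deviation": deviation}

async def assemble_features(txn: dict) -> dict:
    # Tiers 1 and 2 are independent lookups, so run them concurrently:
    # total wait is the slowest lookup, not the sum of both.
    precomputed, realtime = await asyncio.gather(
        fetch_precomputed(txn["user_id"]),
        fetch_realtime_aggregates(txn["user_id"]),
    )
    features = {**precomputed, **realtime}
    features.update(compute_point_in_time(txn, precomputed))
    return features

txn = {"user_id": "u123", "amount": 500.0}
features = asyncio.run(assemble_features(txn))
```

The same structure extends naturally to more tiers or more stores: anything independent goes into the `gather`, anything derived runs after it.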

Model Serving

We serve models using a custom inference service built on Ray Serve. The choice was driven by several requirements:

  • Dynamic batching to maximize GPU utilization
  • Model versioning with instant rollback capability
  • A/B testing infrastructure for continuous experimentation
  • Shadow mode for evaluating new models on production traffic

The model itself is an ensemble of gradient boosting (XGBoost) and neural network components. The XGBoost model handles tabular features and provides interpretable scores. The neural network processes sequential transaction history and captures patterns the tree model misses.

Both models run in parallel, and their scores are combined via a calibrated meta-model. This gives us the best of both worlds: the robustness of gradient boosting with the expressiveness of deep learning.
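A meta-model of this kind can be as simple as a calibrated logistic combination of the two base scores. The sketch below assumes that shape; the weights are made-up placeholders, where in practice they would be fit (e.g. via logistic regression) on held-out data and then calibrated.

```python
import math

# Hypothetical learned parameters; real values come from fitting the
# meta-model on held-out labeled data, followed by calibration.
W_XGB, W_NN, BIAS = 2.1, 1.7, -2.5

def combine_scores(xgb_score: float, nn_score: float) -> float:
    """Blend the two base-model scores into a single fraud probability."""
    z = W_XGB * xgb_score + W_NN * nn_score + BIAS
    return 1.0 / (1.0 + math.exp(-z))  # logistic link keeps output in (0, 1)

def decide(xgb_score: float, nn_score: float, threshold: float = 0.5) -> str:
    return "decline" if combine_scores(xgb_score, nn_score) >= threshold else "approve"
```

One benefit of keeping the combiner this simple: it adds microseconds, not milliseconds, to the inference path, and its behavior is easy to audit when a decision is disputed.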

Feedback Loops

Fraud models degrade over time. Fraudsters adapt. New attack vectors emerge. The model you trained last month is already falling behind.

Our feedback system operates on two timescales:

  • Online learning: The XGBoost model supports incremental updates by continuing training from the existing model. When fraud is confirmed or a false positive is reported, the new label is folded into the model within minutes.
  • Batch retraining: Full model retraining runs every 4 hours using the latest labeled data. New models are shadow-tested before promotion.
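The shadow-test promotion step can be expressed as a simple gate: a retrained candidate is promoted only if it clears the precision/recall bar on shadow traffic and does not regress materially against the production model. A minimal sketch, with illustrative thresholds rather than the client's actual values:

```python
def should_promote(candidate: dict, production: dict,
                   min_precision: float = 0.995,
                   min_recall: float = 0.90,
                   max_recall_drop: float = 0.005) -> bool:
    """Gate a candidate model based on shadow-traffic metrics."""
    if candidate["precision"] < min_precision:
        return False  # false positives are expensive; hard floor on precision
    if candidate["recall"] < min_recall:
        return False
    # Guard against a candidate that trades away too much recall for precision.
    if production["recall"] - candidate["recall"] > max_recall_drop:
        return False
    return True
```

Encoding the gate in code rather than a runbook means every 4-hour retraining cycle applies the same criteria, and the thresholds themselves are version-controlled.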

The labeling pipeline is critical. Confirmed fraud comes from chargebacks and investigation teams. But most transactions never receive explicit labels. We use a combination of positive-unlabeled learning and semi-supervised techniques to learn from the full data distribution.

Latency Optimization

Getting to sub-100ms required obsessive attention to every millisecond. Here are the techniques that mattered most:

  • Connection pooling: Establishing new connections to Redis and feature stores added 20-30ms. Persistent connection pools eliminated this.
  • Parallel feature fetching: Independent feature lookups run concurrently. We wait for the slowest, not the sum.
  • Feature caching: Recently-computed features are cached locally with TTLs tuned per feature type.
  • Model quantization: INT8 quantization of the neural network reduced inference time by 40% with negligible accuracy loss.
  • Timeout and fallback: If any component exceeds budget, we fall back to simpler rules. Degraded accuracy beats complete failure.
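The timeout-and-fallback pattern looks roughly like this. The sketch assumes an async serving path; the model call and the rules are hypothetical stand-ins, and the 50ms budget here is illustrative, not the system's actual per-stage allocation.

```python
import asyncio

def rules_fallback(txn: dict) -> str:
    # Crude but fast fallback: flag only high-value transactions.
    return "decline" if txn["amount"] > 5000 else "approve"

async def model_score(txn: dict) -> str:
    await asyncio.sleep(0.2)  # simulate a model call that blows the budget
    return "approve"

async def score_with_budget(txn: dict, budget_s: float = 0.05) -> str:
    try:
        # Enforce the per-request budget on the model path.
        return await asyncio.wait_for(model_score(txn), timeout=budget_s)
    except asyncio.TimeoutError:
        # Degrade to simple rules instead of failing the request.
        return rules_fallback(txn)

decision = asyncio.run(score_with_budget({"amount": 120.0}))
```

The same wrapper applies at each stage: feature fetches, base models, and the meta-model each get a slice of the 100ms budget and a cheaper path to fall back to.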

Monitoring and Observability

A fraud system you can't observe is a fraud system you can't trust. We instrument extensively:

  • Latency histograms at every stage of the pipeline
  • Feature value distributions to detect drift
  • Model score distributions with statistical tests for shifts
  • Business metrics: fraud rate, false positive rate, decline rate
  • Alerting on anomalies at any level
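One common way to turn feature and score distributions into an alertable signal is the population stability index (PSI) between a baseline window and the current window. The source doesn't name the specific test used, so treat this as one representative option:

```python
import math

def population_stability_index(expected: list, actual: list) -> float:
    """PSI between two binned distributions (proportions summing to 1).

    Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is a moderate
    shift, and > 0.25 is significant drift worth alerting on.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)  # avoid log(0) on empty bins
        a = max(a, 1e-6)
        psi += (a - e) * math.log(a / e)
    return psi

baseline = [0.25, 0.25, 0.25, 0.25]          # score distribution last week
today = [0.24, 0.26, 0.25, 0.25]             # near-identical: no alert
shifted = [0.60, 0.20, 0.10, 0.10]           # heavy shift toward low bins
```

Computed per feature and per model-score histogram on a schedule, this yields one number per signal that thresholded alerting can act on directly.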

The dashboard is the first thing the operations team sees every morning. If something looks off, they can drill down to the specific feature, model, or traffic segment causing issues.

Results

After deploying this architecture:

  • Fraud losses reduced by 42% compared to the previous rule-based system
  • False positive rate dropped from 3.2% to 0.8%
  • p99 latency stabilized at 67ms, well under the 100ms requirement
  • System maintained 99.995% availability over the first year

The architectural decisions that seemed conservative at the time—accepting feature staleness, building extensive fallbacks, over-investing in monitoring—proved to be exactly right. Production ML isn't about building the most sophisticated model. It's about building systems that work reliably under real-world conditions.

Building fraud detection at scale?

We help financial services companies build ML systems that catch fraud without blocking legitimate customers. Let's talk about your requirements.

Schedule a Conversation