# Bot Detector

The bot-detector is a Python service that performs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs in a continuous loop (default: one cycle every 5 minutes, set by `CYCLE_INTERVAL_SEC`) and uses an Isolation Forest to flag suspicious traffic patterns. Each detection is then enriched with SHAP feature attributions, DBSCAN campaign clustering, and Anubis bot-rule matches.

## ML Algorithm

### Isolation Forest (Semi-Supervised)

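The core algorithm is **Isolation Forest** (Liu, Ting & Zhou, 2008), an unsupervised anomaly detection algorithm that isolates points by recursively partitioning the feature space with random splits. Anomalies need fewer splits to isolate than normal points, so they end up with shorter average path lengths.

Although the algorithm itself is unsupervised, the detector applies it in a **semi-supervised** way, because labels decide what the model trains on: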
1. **Known bots** are identified a priori via reputation dictionaries (IP, JA4, ASN)
2. **Human baseline** is identified via ASN reputation labels (`asn_label = 'human'`)
3. The model trains **only on human-baseline traffic** (minimum 500 sessions required)
4. Unknown traffic is scored by deviation from the human profile
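
The training scheme above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the service's actual code: the function name `train_on_human_baseline`, the three-feature layout, and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

MIN_HUMAN_SESSIONS = 500  # minimum baseline size stated in the list above


def train_on_human_baseline(human_features: np.ndarray,
                            contamination: float = 0.02) -> IsolationForest:
    """Fit an Isolation Forest on human-labelled sessions only."""
    if len(human_features) < MIN_HUMAN_SESSIONS:
        raise ValueError(
            f"need at least {MIN_HUMAN_SESSIONS} human sessions, "
            f"got {len(human_features)}"
        )
    # random_state is fixed only to make the example reproducible.
    model = IsolationForest(contamination=contamination, random_state=42)
    model.fit(human_features)
    return model


# Synthetic stand-in for the human baseline: 600 sessions, 3 features.
rng = np.random.default_rng(0)
human = rng.normal(loc=0.0, scale=1.0, size=(600, 3))
model = train_on_human_baseline(human)

# Unknown traffic: one session close to the baseline, one far from it.
unknown = np.array([[0.0, 0.0, 0.0], [8.0, 8.0, 8.0]])
scores = model.decision_function(unknown)  # lower means more anomalous
```

`decision_function` returns negative values for points the forest considers outliers relative to the training data, which is why the thresholds in this document are negative numbers.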

### Two-Model Architecture

| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| **Applicatif** | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |

### Threat Levels

| Score Range | Level | Interpretation |
|------------|-------|----------------|
| `score < -0.30` | **CRITICAL** | Extremely anomalous behavior |
| `-0.30 ≤ score < -0.15` | **HIGH** | Strong anomaly signal |
| `-0.15 ≤ score < -0.05` | **MEDIUM** | Moderate anomaly |
| `score ≥ -0.05` | **LOW** | Slightly unusual |
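
Read top to bottom, the table amounts to a chain of comparisons. A small sketch follows; the function name `threat_level` is hypothetical.

```python
def threat_level(score: float) -> str:
    """Map an anomaly score to a threat level using the table's cut-offs."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"
```

For example, `threat_level(-0.25)` returns `"HIGH"`, which matches the `ANOMALY` entry shown in the decision log format.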

## Feature List

### Common Features (31 — Applicatif model)

#### HTTP Behavior

| Feature | Description |
|---------|-------------|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |

#### Connection Management

| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |

#### Browser Fingerprint

| Feature | Description |
|---------|-------------|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |

#### Navigation Patterns

| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |

#### Concentration & Rarity

| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |

#### Temporal & Diversity

| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |

### Additional TCP/TLS Features (Complet model only — 4 extra)

| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |

### L4 Fingerprint Features (Complet model)

| Feature | Description |
|---------|-------------|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |

## Detection Pipeline

```
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Filter scores below threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
```
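
Two of these steps are simple enough to sketch in plain Python: the adaptive threshold (step h) and the per-IP deduplication (step 6). The function names and the row layout (`src_ip`, `score` keys) are assumptions made for illustration; the percentile uses linear interpolation, which may differ from the method the service uses.

```python
import statistics

ANOMALY_THRESHOLD = -0.03  # default from the environment table
ANOMALY_PERCENTILE = 5


def adaptive_threshold(scores, fixed=ANOMALY_THRESHOLD,
                       percentile=ANOMALY_PERCENTILE):
    """Step h: take the stricter (lower) of the percentile cut and the fixed threshold."""
    # quantiles(n=100) returns the 99 percentile cut points; index p-1 is the p-th.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return min(cuts[percentile - 1], fixed)


def keep_anomalies(rows, threshold):
    """Step j: keep only rows whose score is below the threshold."""
    return [row for row in rows if row["score"] < threshold]


def dedupe_lowest(rows):
    """Step 6: keep one row per src_ip, the one with the lowest score."""
    best = {}
    for row in rows:
        current = best.get(row["src_ip"])
        if current is None or row["score"] < current["score"]:
            best[row["src_ip"]] = row
    return list(best.values())
```

Taking the minimum means the percentile can only tighten the cut: when most traffic scores near zero, the fixed `ANOMALY_THRESHOLD` still applies.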

## Concept Drift Detection (A1)

Drift detection uses the two-sample **Kolmogorov-Smirnov test** to compare each feature's distribution in the current data against its distribution in the training data. If the fraction of drifted features exceeds `DRIFT_THRESHOLD` (default: 0.30), the model is retrained.
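
A possible shape for this check, using `scipy.stats.ks_2samp`, is sketched below. The significance level `ALPHA = 0.05` used to call a single feature "drifted" is an assumption, as are the function names; the source only specifies the KS test and the `DRIFT_THRESHOLD` fraction.

```python
import numpy as np
from scipy.stats import ks_2samp

DRIFT_THRESHOLD = 0.30  # default from the environment table
ALPHA = 0.05            # assumed per-feature significance level


def drifted_fraction(train: np.ndarray, current: np.ndarray,
                     alpha: float = ALPHA) -> float:
    """Fraction of feature columns whose distribution differs (KS p-value < alpha)."""
    n_features = train.shape[1]
    drifted = 0
    for j in range(n_features):
        result = ks_2samp(train[:, j], current[:, j])
        if result.pvalue < alpha:
            drifted += 1
    return drifted / n_features


def needs_retrain(train: np.ndarray, current: np.ndarray,
                  threshold: float = DRIFT_THRESHOLD) -> bool:
    """True when more than `threshold` of the features have drifted."""
    return drifted_fraction(train, current) > threshold
```

With four features, one drifted feature gives a fraction of 0.25 and does not trigger retraining at the default threshold; two drifted features (0.50) would.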

## SHAP Explainability (A4)

When enabled (`ENABLE_SHAP=true`), the detector computes SHAP values for each detected anomaly using `shap.TreeExplainer`. The top 5 contributing features are stored in the `reason` field.
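
The sketch below shows one way to turn per-feature SHAP values into a `reason` string. It assumes the SHAP values for a single anomaly are already available as a sequence (for instance one row of the output of `shap.TreeExplainer(model).shap_values(X)`). Ranking by absolute SHAP value and the `name=value` formatting are assumptions inferred from the decision log example, and `top_features_reason` is a hypothetical name.

```python
def top_features_reason(feature_names, feature_values, shap_values, k=5):
    """Build a reason string from the k features with the largest |SHAP| value."""
    ranked = sorted(
        zip(feature_names, feature_values, shap_values),
        key=lambda item: abs(item[2]),
        reverse=True,
    )
    # Report the observed feature value, not the SHAP value itself.
    return ", ".join(f"{name}={value:g}" for name, value, _ in ranked[:k])


names = ["hits", "hit_velocity", "fuzzing_index", "post_ratio", "head_ratio", "has_cookie"]
values = [120, 45.2, 0.8, 0.1, 0.0, 1]
shap_row = [-0.02, -0.30, -0.20, 0.01, -0.005, 0.03]
reason = top_features_reason(names, values, shap_row)
```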

## DBSCAN Clustering (A8)

When enabled (`ENABLE_CLUSTERING=true`), the detector applies DBSCAN to the anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a `campaign_id`; a value of `-1` means the anomaly does not belong to any cluster.
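
A minimal sketch with scikit-learn's `DBSCAN` is shown below. The `eps` value and the two-dimensional toy vectors are assumptions for the example; only `min_samples` (via `CLUSTERING_MIN_SAMPLES`) is documented here, and real feature vectors would likely need scaling first.

```python
import numpy as np
from sklearn.cluster import DBSCAN

CLUSTERING_MIN_SAMPLES = 3  # default from the environment table
EPS = 0.5                   # assumed neighbourhood radius


def assign_campaigns(anomaly_features: np.ndarray, eps: float = EPS,
                     min_samples: int = CLUSTERING_MIN_SAMPLES) -> np.ndarray:
    """Return one campaign_id per anomaly; -1 marks anomalies outside any cluster."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(anomaly_features)


features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # three similar anomalies
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # a second, distinct group
    [20.0, 20.0],                         # an isolated anomaly
])
campaign_ids = assign_campaigns(features)
```

DBSCAN labels noise points `-1`, which lines up with the `campaign_id` convention described above.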

## Anubis Bot-Rule Enrichment

The `view_ai_features_1h` view enriches each IP with Anubis bot detection using a priority cascade:
1. **UA + IP combined** (same `rule_id`) — highest confidence
2. **UA only** (no IP requirement)
3. **IP only** (no UA requirement)
4. **ASN match**
5. **Country match**
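
The cascade itself lives in the ClickHouse view, but its logic can be illustrated in Python. Everything in this sketch is hypothetical: the `Rule` fields, substring matching on the User-Agent, exact matching on the IP, and the session dictionary keys are stand-ins chosen to show the priority order, not the view's actual matching rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    rule_id: str
    bot_name: str
    ua: Optional[str] = None       # substring expected in the User-Agent
    ip: Optional[str] = None       # exact source IP
    asn: Optional[int] = None
    country: Optional[str] = None


def match_rule(session: dict, rules: list):
    """Return (priority, rule) for the first match in cascade order, else None."""
    def ua_ok(r):
        return r.ua is not None and r.ua in session["user_agent"]

    def ip_ok(r):
        return r.ip is not None and r.ip == session["src_ip"]

    levels = [
        lambda r: ua_ok(r) and ip_ok(r),                    # 1. UA + IP combined
        lambda r: ua_ok(r) and r.ip is None,                # 2. UA only
        lambda r: ip_ok(r) and r.ua is None,                # 3. IP only
        lambda r: r.asn is not None and r.asn == session["asn"],              # 4. ASN
        lambda r: r.country is not None and r.country == session["country"],  # 5. Country
    ]
    for priority, predicate in enumerate(levels, start=1):
        for rule in rules:
            if predicate(rule):
                return priority, rule
    return None
```

Checking every rule at one level before moving to the next is what makes a combined UA + IP match win over a looser UA-only match for the same session.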

## Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |
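
As an illustration of how a subset of these variables might be read with their documented defaults, consider the sketch below. The `Settings` class, the `load_settings` helper, and the set of strings accepted as "true" are assumptions; the service's real parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping

TRUE_VALUES = {"1", "true", "yes", "on"}  # accepted spellings are an assumption


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable, falling back to the default when unset."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in TRUE_VALUES


@dataclass(frozen=True)
class Settings:
    clickhouse_host: str
    clickhouse_port: int
    anomaly_threshold: float
    cycle_interval_sec: int
    enable_shap: bool
    enable_clustering: bool
    enable_multiwindow: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the variables above, using the documented defaults."""
    return Settings(
        clickhouse_host=env.get("CLICKHOUSE_HOST", "clickhouse"),
        clickhouse_port=int(env.get("CLICKHOUSE_PORT", "8123")),
        anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.03")),
        cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
        enable_shap=env_bool(env, "ENABLE_SHAP", True),
        enable_clustering=env_bool(env, "ENABLE_CLUSTERING", True),
        enable_multiwindow=env_bool(env, "ENABLE_MULTIWINDOW", False),
    )
```

Passing the environment as a mapping keeps the loader easy to test: `load_settings({"CYCLE_INTERVAL_SEC": "60"})` overrides one value and leaves the rest at their defaults.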

## Output Tables

### ml_detected_anomalies

Detections whose anomaly score falls below the detection threshold. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip)`, TTL 30 days.

Key columns: `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus all ML features.

### ml_all_scores

All classifications (no threshold filter) for observability. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 3 days.

## Decision Log Format

The `decisions.jsonl` file contains structured JSONL entries:

```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```

Log rotation: the file rotates when it reaches 50 MB, and `LOG_BACKUP_COUNT` rotated files are kept (default 7).
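
A JSONL log with this rotation policy can be produced with the standard library's `RotatingFileHandler`. This is a sketch, not the service's logger: the logger name, the helper names, and the interpretation of 50 MB as 50 × 1024 × 1024 bytes are assumptions.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

MAX_LOG_BYTES = 50 * 1024 * 1024  # "50 MB", read here as MiB (assumption)


def build_decision_logger(path: str, backup_count: int = 7,
                          max_bytes: int = MAX_LOG_BYTES) -> logging.Logger:
    """Create a logger that writes one JSON object per line and rotates by size."""
    logger = logging.getLogger("bot_detector.decisions")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep decision entries out of the root logger
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(message)s"))  # raw JSON only
    logger.addHandler(handler)  # call once per process to avoid duplicate handlers
    return logger


def log_event(logger: logging.Logger, event: str, **fields) -> None:
    """Write a single decision entry such as ANOMALY or KNOWN_BOT."""
    logger.info(json.dumps({"event": event, **fields}))
```

A call such as `log_event(logger, "KNOWN_BOT", src_ip="198.51.100.10", bot_name="AhrefsBot")` produces a line in the same shape as the third entry above.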

## Health Check Endpoint

- **URL**: `GET http://localhost:8080/` (port set by `HEALTH_PORT`)
- **Response**: `200 OK` with status JSON
- Runs in a separate thread
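
An endpoint of this kind can be served from a background thread with the standard library alone. The sketch below is illustrative: the handler class, the `start_health_server` helper, and the `{"status": "ok"}` payload are assumptions, since the exact status JSON is not specified here.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder payload; the real status JSON may carry more fields.
        body = json.dumps({"status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request access logging


def start_health_server(port: int = 8080, host: str = "0.0.0.0") -> HTTPServer:
    """Start the health endpoint in a daemon thread and return the server."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the thread is a daemon, it does not keep the process alive once the main detection loop exits.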

## Model Persistence

| File | Description |
|------|-------------|
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_<name>.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |

Models are rotated: only the last `MODEL_HISTORY_COUNT` versions (default 10) are kept.
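
The rotation rule can be sketched as follows, following the file naming in the table above. It assumes that version strings sort lexicographically in chronological order (for example, timestamps); the function name `rotate_models` is hypothetical.

```python
from pathlib import Path


def rotate_models(model_dir, name: str, keep: int = 10) -> list:
    """Delete all but the newest `keep` versions of model `name`; return removed versions."""
    model_dir = Path(model_dir)
    prefix, suffix = f"model_{name}_", ".joblib"
    # Assumes version strings sort oldest-to-newest (e.g. timestamps).
    versions = sorted(
        p.name[len(prefix):-len(suffix)]
        for p in model_dir.glob(f"{prefix}*{suffix}")
    )
    stale = versions[:-keep] if keep > 0 else versions
    for version in stale:
        for ext in (".joblib", ".meta.json"):
            (model_dir / f"{prefix}{version}{ext}").unlink(missing_ok=True)
    return stale
```

The model file and its metadata are removed together so that a `.meta.json` file never outlives the model it describes.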

## Docker Deployment

```bash
# Build
make build-bot-detector

# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```

### Volumes

| Host Path | Container Path | Description |
|-----------|---------------|-------------|
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |