ja4-platform/docs/services/bot-detector.md
toto d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00

# Bot Detector
The bot-detector is a Python service that runs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs in a continuous cycle (default: every 5 minutes) and uses Isolation Forest to identify suspicious traffic patterns. Results are enriched with SHAP explainability, DBSCAN clustering, and Anubis bot rules.
## ML Algorithm
### Isolation Forest (Semi-Supervised)
The core algorithm is **Isolation Forest** (Liu, Ting & Zhou, 2008) — an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning feature space. Anomalies require fewer partitions to isolate than normal points.
The approach is **semi-supervised** because:
1. **Known bots** are identified a priori via reputation dictionaries (IP, JA4, ASN)
2. **Human baseline** is identified via ASN reputation labels (`asn_label = 'human'`)
3. The model trains **only on human-baseline traffic** (minimum 500 sessions required)
4. Unknown traffic is scored by deviation from the human profile
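
The split described in points 1 to 3 can be sketched as follows. This is a minimal illustration rather than the service's code: the `bot_name` and `asn_label` keys mirror column names used elsewhere in this document, while `partition_sessions`, `can_train`, and `MIN_HUMAN_SESSIONS` are hypothetical names, and treating a non-empty `bot_name` as "known bot" is an assumption.

```python
MIN_HUMAN_SESSIONS = 500  # minimum human-baseline size stated above


def partition_sessions(rows):
    """Split session rows into known bots, human baseline, and unknown traffic.

    Each row is a dict. A non-empty `bot_name` marks a known bot (assumed
    convention); `asn_label == "human"` marks the human baseline.
    """
    known_bots, humans, unknown = [], [], []
    for row in rows:
        if row.get("bot_name"):
            known_bots.append(row)
        elif row.get("asn_label") == "human":
            humans.append(row)
        else:
            unknown.append(row)
    return known_bots, humans, unknown


def can_train(humans, minimum=MIN_HUMAN_SESSIONS):
    """Training is only allowed once enough human-baseline sessions exist."""
    return len(humans) >= minimum
```

Only the `humans` partition would be passed to model training; the `unknown` partition is what gets scored.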
### Two-Model Architecture
| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| **Applicatif** | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |
### Threat Levels
| Score Range | Level | Interpretation |
|------------|-------|----------------|
| `< -0.30` | **CRITICAL** | Extremely anomalous behavior |
| `< -0.15` | **HIGH** | Strong anomaly signal |
| `< -0.05` | **MEDIUM** | Moderate anomaly |
| `≥ -0.05` | **LOW** | Slightly unusual |
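
The table is read top to bottom: the first matching row wins, so a score of `-0.40` is CRITICAL even though it also satisfies `< -0.15`. A minimal sketch of that mapping (the function name `threat_level` is hypothetical):

```python
def threat_level(score: float) -> str:
    """Map an anomaly score (lower is more anomalous) to a threat level."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"
```

Because the comparisons are strict, a score exactly on a boundary falls into the milder level: `-0.30` is HIGH and `-0.05` is LOW.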
## Feature List
### Common Features (31 — Applicatif model)
#### HTTP Behavior
| Feature | Description |
|---------|-------------|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |
#### Connection Management
| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |
#### Browser Fingerprint
| Feature | Description |
|---------|-------------|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |
#### Navigation Patterns
| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |
#### Concentration & Rarity
| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |
#### Temporal & Diversity
| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |
### Additional TCP/TLS Features (Complet model only — 4 extra)
| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |
### L4 Fingerprint Features (Complet model)
| Feature | Description |
|---------|-------------|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |
## Detection Pipeline
```
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Filter scores below threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
```
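
Steps 5h, 5j, and 6 are plain arithmetic and can be sketched without any ML dependency. The code below is an illustration under assumptions: the helper names are hypothetical, the percentile uses linear interpolation (the source does not say which method the service uses), and detections are modelled as dicts with `src_ip` and `score` keys.

```python
import math

ANOMALY_THRESHOLD = -0.03  # default from the environment variable table
ANOMALY_PERCENTILE = 5


def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def adaptive_threshold(scores, fixed=ANOMALY_THRESHOLD, pct=ANOMALY_PERCENTILE):
    """Step 5h: the stricter (lower) of the percentile and the fixed threshold."""
    return min(percentile(scores, pct), fixed)


def keep_lowest_per_ip(detections):
    """Step 6: keep one detection per src_ip, the one with the lowest score."""
    best = {}
    for det in detections:
        current = best.get(det["src_ip"])
        if current is None or det["score"] < current["score"]:
            best[det["src_ip"]] = det
    return list(best.values())
```

Taking the `min` means the fixed `ANOMALY_THRESHOLD` acts as a ceiling: when traffic is uniformly benign and the 5th percentile sits near zero, the threshold stays at `-0.03` instead of flagging the least-normal 5% regardless.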
## Concept Drift Detection (A1)
Uses the **Kolmogorov-Smirnov test** to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds `DRIFT_THRESHOLD` (default: 0.30), the model is retrained.
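
A dependency-free sketch of this check follows. It is an illustration, not the service's implementation: the function names are hypothetical, the per-feature significance level of 0.05 is an assumption (the source only documents `DRIFT_THRESHOLD`), and the decision uses the large-sample critical value rather than an exact p-value. A production implementation would more likely call `scipy.stats.ks_2samp`.

```python
import math
from bisect import bisect_right

DRIFT_THRESHOLD = 0.30  # default from the environment variable table


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def feature_drifted(train, current, alpha=0.05):
    """Large-sample KS test: reject equality when D exceeds the critical value."""
    n, m = len(train), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, current) > critical


def needs_retrain(train_features, current_features, threshold=DRIFT_THRESHOLD):
    """Retrain when the fraction of drifted features exceeds the threshold."""
    drifted = sum(
        feature_drifted(train_features[name], current_features[name])
        for name in train_features
    )
    return drifted / len(train_features) > threshold
```

Both arguments to `needs_retrain` are dicts mapping a feature name to a list of observed values, one for the training snapshot and one for the current window.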
## SHAP Explainability (A4)
When enabled (`ENABLE_SHAP=true`), computes SHAP values for each detected anomaly using `shap.TreeExplainer`. The top 5 contributing features are stored in the `reason` field.
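
The sketch below shows one way the `reason` string could be assembled once per-feature SHAP values are available. It does not call the `shap` library; `build_reason` is a hypothetical helper, ranking by absolute SHAP value is an assumption, and the `name=value` layout is inferred from the sample entry in the decision log.

```python
def build_reason(shap_values, feature_values, top_n=5):
    """Format the top contributing features as 'name=value' pairs.

    Features are ranked by absolute SHAP value (an assumption; the source
    only says "top 5 contributing features").
    """
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    return ", ".join(f"{name}={feature_values[name]}" for name in ranked[:top_n])
```

Here `shap_values` maps feature names to their SHAP contribution for one anomaly, and `feature_values` maps the same names to the raw feature values.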
## DBSCAN Clustering (A8)
When enabled (`ENABLE_CLUSTERING=true`), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a `campaign_id` (-1 = no cluster).
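
To make the `campaign_id` semantics concrete, here is a compact pure-Python DBSCAN. It is a teaching sketch, not the service's code: the `eps` radius is an assumed parameter (the source documents only `CLUSTERING_MIN_SAMPLES`), distances are plain Euclidean with no feature scaling, and as in scikit-learn a point counts toward its own `min_samples` neighborhood.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples=3):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels
```

Anomalies that end up with the same non-negative label would share a `campaign_id`; isolated anomalies keep `-1`.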
## Anubis Bot-Rule Enrichment
The `view_ai_features_1h` view enriches each IP with Anubis bot detection using a priority cascade:
1. **UA + IP combined** (same `rule_id`): highest confidence
2. **UA only** (no IP requirement)
3. **IP only** (no UA requirement)
4. **ASN match**
5. **Country match**
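
The cascade itself lives in the ClickHouse view, so the following Python is only a conceptual model of the ordering. The `match_type` labels and the candidate dict shape are hypothetical; only the priority order comes from the list above.

```python
# Priority order from the list above; lower rank wins.
MATCH_PRIORITY = {"ua_ip": 1, "ua": 2, "ip": 3, "asn": 4, "country": 5}


def pick_anubis_match(candidates):
    """Return the highest-priority candidate match, or None if there is none."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: MATCH_PRIORITY[c["match_type"]])
```

An IP that matches both an ASN rule and a UA-only rule would therefore be attributed to the UA rule.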
## Environment Variables
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |
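
A minimal sketch of reading a subset of these variables with their documented defaults. The `DetectorSettings` class, `load_settings`, and `env_bool` are hypothetical names, and the accepted boolean spellings are an assumption; the real service may instead rely on the shared `ClickHouseSettings` class for the connection values.

```python
import os
from dataclasses import dataclass


def env_bool(name, default, env=os.environ):
    """Parse a boolean variable; accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class DetectorSettings:
    clickhouse_host: str
    clickhouse_port: int
    anomaly_threshold: float
    cycle_interval_sec: int
    drift_threshold: float
    enable_shap: bool
    enable_clustering: bool


def load_settings(env=os.environ):
    """Read a subset of the variables above, falling back to documented defaults."""
    return DetectorSettings(
        clickhouse_host=env.get("CLICKHOUSE_HOST", "clickhouse"),
        clickhouse_port=int(env.get("CLICKHOUSE_PORT", "8123")),
        anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.03")),
        cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
        drift_threshold=float(env.get("DRIFT_THRESHOLD", "0.30")),
        enable_shap=env_bool("ENABLE_SHAP", True, env),
        enable_clustering=env_bool("ENABLE_CLUSTERING", True, env),
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the process environment.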
## Output Tables
### ml_detected_anomalies
Detections whose score falls below the anomaly threshold (lower scores are more anomalous). Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip)`, TTL 30 days.
Key columns: `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus all ML features.
### ml_all_scores
All classifications (no threshold filter) for observability. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 3 days.
## Decision Log Format
The `decisions.jsonl` file contains structured JSONL entries:
```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```
Log rotation: 50 MB max size × `LOG_BACKUP_COUNT` backups (default 7).
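
Size-based rotation of a JSONL file maps naturally onto the standard library's `RotatingFileHandler`. The sketch below assumes that "50 MB" means 50 × 1024 × 1024 bytes and uses hypothetical helper names; whether the service uses this handler is not stated in the source.

```python
import json
import logging
from logging.handlers import RotatingFileHandler


def make_decision_logger(path, backup_count=7, max_bytes=50 * 1024 * 1024):
    """Logger that writes one JSON object per line and rotates by size."""
    logger = logging.getLogger("bot_detector.decisions")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def log_event(logger, event, **fields):
    """Serialize one decision as a single JSONL line."""
    logger.info(json.dumps({"event": event, **fields}))
```

For example, `log_event(logger, "KNOWN_BOT", src_ip="198.51.100.10", bot_name="AhrefsBot")` produces a line shaped like the third entry in the sample above.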
## Health Check Endpoint
- **URL**: `GET http://localhost:8080/` (port set by `HEALTH_PORT`, default `8080`)
- **Response**: `200 OK` with status JSON
- Runs in a separate thread
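
An endpoint with these properties can be sketched with the standard library alone. The handler and function names are hypothetical, and the `{"status": "ok"}` payload is an assumption, since the source does not document the fields of the status JSON.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()  # payload shape is assumed
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health probes out of stderr


def start_health_server(port=8080, host="0.0.0.0"):
    """Serve health checks from a daemon thread; returns the server object."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Running the server in a daemon thread keeps the detection loop on the main thread and lets the process exit without waiting for the HTTP server.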
## Model Persistence
| File | Description |
|------|-------------|
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_<name>.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |

Models are rotated: only the last `MODEL_HISTORY_COUNT` versions (default 10) are kept.
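
A rotation step consistent with the file layout above might look like the sketch below. `prune_models` is a hypothetical helper, and it assumes that version strings sort chronologically (for example timestamps), which the source does not specify.

```python
from pathlib import Path


def prune_models(model_dir, name, keep=10):
    """Delete all but the newest `keep` versions of one model.

    Assumes version strings sort chronologically (for example timestamps).
    """
    directory = Path(model_dir)
    models = sorted(directory.glob(f"model_{name}_*.joblib"))
    stale = models[:-keep] if keep > 0 else models
    for model_path in stale:
        meta_path = model_path.with_name(model_path.stem + ".meta.json")
        model_path.unlink()
        if meta_path.exists():
            meta_path.unlink()
    return [p.name for p in stale]
```

The function removes each stale `.joblib` file together with its `.meta.json` companion and returns the names of the deleted model files; the `.current` pointer and `training_history.jsonl` are left untouched.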
## Docker Deployment
```bash
# Build
make build-bot-detector
# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```
### Volumes
| Host Path | Container Path | Description |
|-----------|---------------|-------------|
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |