# Bot Detector

The bot-detector is a Python service that performs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs on a continuous cycle (default: every 5 minutes) and uses Isolation Forest to identify suspicious traffic patterns. Detections are enriched with SHAP explainability, DBSCAN clustering, and Anubis bot-rule data.

## ML Algorithm

### Isolation Forest (Semi-Supervised)

The core algorithm is **Isolation Forest** (Liu, Ting & Zhou, 2008), an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning feature space. Anomalies require fewer partitions to isolate than normal points.

The approach is **semi-supervised** because:

1. **Known bots** are identified a priori via reputation dictionaries (IP, JA4, ASN)
2. **Human baseline** is identified via ASN reputation labels (`asn_label = 'human'`)
3. The model trains **only on human-baseline traffic** (minimum 500 sessions required)
4. Unknown traffic is scored by deviation from the human profile

### Two-Model Architecture

| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| **Applicatif** | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |

### Threat Levels

Lower scores indicate stronger anomalies.

| Score Range | Level | Interpretation |
|------------|-------|----------------|
| `< -0.30` | **CRITICAL** | Extremely anomalous behavior |
| `< -0.15` | **HIGH** | Strong anomaly signal |
| `< -0.05` | **MEDIUM** | Moderate anomaly |
| `≥ -0.05` | **LOW** | Slightly unusual |

## Feature List

### Common Features (31, Applicatif model)

#### HTTP Behavior

| Feature | Description |
|---------|-------------|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |

#### Connection Management

| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |

#### Browser Fingerprint

| Feature | Description |
|---------|-------------|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |

#### Navigation Patterns

| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |

#### Concentration & Rarity

| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |

#### Temporal & Diversity

| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |

### Additional TCP/TLS Features (Complet model only, 4 extra)

| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |

### L4 Fingerprint Features (Complet model)

| Feature | Description |
|---------|-------------|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |

## Detection Pipeline

```
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Keep only scores below the threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
```

## Concept Drift Detection (A1)

Uses the **Kolmogorov-Smirnov test** to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds `DRIFT_THRESHOLD` (default: 0.30), the model is retrained.

## SHAP Explainability (A4)

When enabled (`ENABLE_SHAP=true`), computes SHAP values for each detected anomaly using `shap.TreeExplainer`. The top 5 contributing features are stored in the `reason` field.

## DBSCAN Clustering (A8)

When enabled (`ENABLE_CLUSTERING=true`), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a `campaign_id` (-1 = no cluster).

## Anubis Bot-Rule Enrichment

The `view_ai_features_1h` view enriches each IP with Anubis bot detection using a priority cascade:

1. **UA + IP combined** (same `rule_id`), highest confidence
2. **UA only** (no IP requirement)
3. **IP only** (no UA requirement)
4. **ASN match**
5. **Country match**

## Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `ja4_processing` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |

## Output Tables

### ml_detected_anomalies

Detections whose score falls below the anomaly threshold.

Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip)`, TTL 30 days.

Key columns: `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus all ML features.

### ml_all_scores

All classifications (no threshold filter) for observability.

Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 3 days.

## Decision Log Format

The `decisions.jsonl` file contains structured JSONL entries, one JSON object per line:

```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```

Log rotation: 50 MB max size × `LOG_BACKUP_COUNT` backups (default 7).

## Health Check Endpoint

- **URL**: `GET http://localhost:8080/` (the port is set by `HEALTH_PORT`)
- **Response**: `200 OK` with status JSON
- Runs in a separate thread

## Model Persistence

| File | Description |
|------|-------------|
| `model__.joblib` | Serialized Isolation Forest (joblib) |
| `model__.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |

Models are rotated: only the last `MODEL_HISTORY_COUNT` versions (default 10) are kept.
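
The rotation rule above can be illustrated with a minimal sketch. It is not the service's actual code: the helper name `rotate_models`, the glob pattern `model_*.joblib`, and the use of file modification time as the version order are all assumptions made for the example.

```python
from pathlib import Path


def rotate_models(model_dir: Path, keep: int = 10,
                  pattern: str = "model_*.joblib") -> list[Path]:
    """Delete all but the `keep` newest files matching `pattern`.

    Hypothetical helper: the pattern and the mtime-based ordering are
    assumptions, not the bot-detector's documented behaviour.
    Returns the list of model files that were removed.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")

    # Newest first; the file name breaks ties so the order is deterministic.
    candidates = sorted(
        model_dir.glob(pattern),
        key=lambda p: (p.stat().st_mtime, p.name),
        reverse=True,
    )

    removed = []
    for stale in candidates[keep:]:
        # Remove the metadata file stored next to the model, if there is one.
        meta = stale.with_name(stale.stem + ".meta.json")
        meta.unlink(missing_ok=True)
        stale.unlink()
        removed.append(stale)
    return removed
```

Calling `rotate_models(Path("/var/lib/bot_detector"), keep=10)` would mirror the defaults of `MODEL_DIR` and `MODEL_HISTORY_COUNT`. A narrower `pattern` could restrict rotation to one model, so that Complet and Applicatif versions are counted separately.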
## Docker Deployment

```bash
# Build
make build-bot-detector

# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```

### Volumes

| Host Path | Container Path | Description |
|-----------|---------------|-------------|
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |
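
Once the container is running, one way to confirm the deployment is to poll the health check endpoint until it answers `200 OK`. The sketch below uses only the Python standard library. The helper name `wait_for_health` and its timeout parameters are assumptions for illustration, and it assumes the health port is reachable from where the script runs.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str = "http://localhost:8080/",
                    timeout_sec: float = 30.0,
                    interval_sec: float = 1.0,
                    request_timeout_sec: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_sec` elapses.

    Hypothetical helper, not part of the bot-detector itself.
    Returns True on the first 200 response, False on timeout.
    """
    # Bypass any proxy configured in the environment: the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_sec

    while True:
        try:
            with opener.open(url, timeout=request_timeout_sec) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Service not up yet (connection refused, timeout, non-2xx status).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_sec)
```

For example, a deployment script could call `wait_for_health()` right after `docker-compose up -d` and abort if it returns `False`. The response body is ignored here because the schema of the status JSON is not described in this document.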