# Bot Detector

The bot-detector is a Python service that runs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs in a continuous cycle (default: every 5 minutes) and uses Isolation Forest to identify suspicious traffic patterns. Detections are augmented with SHAP explainability, DBSCAN clustering, and Anubis bot-rule enrichment.

## ML Algorithm

### Isolation Forest (Semi-Supervised)

The core algorithm is **Isolation Forest** (Liu, Ting & Zhou, 2008), an unsupervised anomaly detection algorithm that isolates points by randomly partitioning the feature space. Anomalies require fewer partitions to isolate than normal points.

The approach is **semi-supervised** because:

1. **Known bots** are identified a priori via reputation dictionaries (IP, JA4, ASN)
2. **Human baseline** is identified via ASN reputation labels (`asn_label = 'human'`)
3. The model trains **only on human-baseline traffic** (minimum 500 sessions required)
4. Unknown traffic is scored by its deviation from the human profile

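
The split described above can be sketched in a few lines. This is a minimal illustration, not the service's actual code: the `Session` record, the `partition` and `can_train` helpers, and the use of a non-empty `bot_name` to mark a reputation hit are all assumptions made for the example. Only the `asn_label = 'human'` rule and the 500-session minimum come from the text.

```python
from dataclasses import dataclass

MIN_HUMAN_SESSIONS = 500  # minimum baseline size stated in the text


@dataclass
class Session:
    """Hypothetical record; the real pipeline works on DataFrame rows."""
    src_ip: str
    asn_label: str = ""  # 'human' marks the baseline (from the text)
    bot_name: str = ""   # assumption: non-empty means a reputation-dictionary hit


def partition(sessions):
    """Split sessions into known bots, human baseline, and unknown traffic."""
    known_bots = [s for s in sessions if s.bot_name]
    humans = [s for s in sessions if not s.bot_name and s.asn_label == "human"]
    unknown = [s for s in sessions if not s.bot_name and s.asn_label != "human"]
    return known_bots, humans, unknown


def can_train(humans, minimum=MIN_HUMAN_SESSIONS):
    """Training only proceeds when the human baseline is large enough."""
    return len(humans) >= minimum
```

In this sketch only the `humans` list would be passed to model training, while `unknown` is what gets scored afterwards.
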
### Two-Model Architecture

| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| **Applicatif** | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |

### Threat Levels

Ranges are evaluated from top to bottom, so a score takes the first level whose bound it satisfies.

| Score Range | Level | Interpretation |
|------------|-------|----------------|
| `< -0.30` | **CRITICAL** | Extremely anomalous behavior |
| `< -0.15` | **HIGH** | Strong anomaly signal |
| `< -0.05` | **MEDIUM** | Moderate anomaly |
| `≥ -0.05` | **LOW** | Slightly unusual |

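
As an illustration, the threat-level table maps onto a small first-match lookup. The function name `threat_level` and the `THREAT_BOUNDS` constant are hypothetical; the bounds themselves are taken from the table.

```python
# Upper bounds (exclusive) checked in order; values taken from the table above.
THREAT_BOUNDS = [
    (-0.30, "CRITICAL"),
    (-0.15, "HIGH"),
    (-0.05, "MEDIUM"),
]


def threat_level(score: float) -> str:
    """Return the first level whose bound the score falls below, else LOW."""
    for bound, level in THREAT_BOUNDS:
        if score < bound:
            return level
    return "LOW"
```

A score of exactly `-0.30` is classified as HIGH here because the CRITICAL bound is a strict inequality.
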
## Feature List

### Common Features (31 — Applicatif model)

#### HTTP Behavior

| Feature | Description |
|---------|-------------|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |

#### Connection Management

| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |

#### Browser Fingerprint

| Feature | Description |
|---------|-------------|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |

#### Navigation Patterns

| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |

#### Concentration & Rarity

| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |

#### Temporal & Diversity

| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |

### Additional TCP/TLS Features (Complet model only — 4 extra)

| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |

### L4 Fingerprint Features (Complet model)

| Feature | Description |
|---------|-------------|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |

## Detection Pipeline

```
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Filter scores below threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
```

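
Steps 5h, 5j, and 6 can be sketched with the standard library alone. The helpers below (`percentile`, `adaptive_threshold`, `select_anomalies`) are hypothetical names, and the linear-interpolation percentile is an assumption about how the percentile is computed. For brevity the sketch filters and deduplicates in one pass, whereas the pipeline filters per model and deduplicates after concatenation.

```python
ANOMALY_THRESHOLD = -0.03  # documented default
ANOMALY_PERCENTILE = 5     # documented default


def percentile(values, pct):
    """Linear-interpolation percentile (assumed method, for illustration)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    rank = (len(ordered) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def adaptive_threshold(scores, pct=ANOMALY_PERCENTILE, cap=ANOMALY_THRESHOLD):
    """Step 5h: the lower (stricter) of the percentile and the fixed threshold."""
    return min(percentile(scores, pct), cap)


def select_anomalies(rows):
    """Steps 5j and 6: keep rows under the threshold, one per src_ip."""
    threshold = adaptive_threshold([row["score"] for row in rows])
    best = {}
    for row in rows:
        if row["score"] >= threshold:
            continue  # not anomalous enough
        kept = best.get(row["src_ip"])
        if kept is None or row["score"] < kept["score"]:
            best[row["src_ip"]] = row  # keep the lowest score per IP
    return threshold, list(best.values())
```

Taking the minimum means the fixed `ANOMALY_THRESHOLD` acts as a ceiling: when traffic is mostly benign and the 5th percentile sits near zero, the fixed value still prevents ordinary sessions from being flagged.
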
## Concept Drift Detection (A1)

The detector uses the **Kolmogorov-Smirnov test** to compare each feature's distribution in the current data against its distribution in the training data. If the fraction of drifted features exceeds `DRIFT_THRESHOLD` (default: 0.30), the model is retrained.

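
A dependency-free sketch of this check follows. It is an illustration under stated assumptions: the function names are hypothetical, the per-feature significance level `alpha=0.05` is not given in the text, and the decision uses the standard large-sample critical value for the two-sample KS statistic rather than an exact p-value (the service itself may rely on a library routine instead).

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    gap = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:  # consume ties in both samples
            i += 1
        while j < m and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / n - j / m))
    return gap


def is_drifted(train, current, alpha=0.05):
    """Compare the statistic to the asymptotic critical value (alpha is assumed)."""
    n, m = len(train), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, current) > critical


def needs_retrain(train_features, current_features, drift_threshold=0.30):
    """Retrain when the share of drifted features exceeds DRIFT_THRESHOLD."""
    names = [name for name in train_features if name in current_features]
    if not names:
        return False
    drifted = sum(
        is_drifted(train_features[name], current_features[name]) for name in names
    )
    return drifted / len(names) > drift_threshold
```

Both arguments to `needs_retrain` are assumed to be dictionaries mapping a feature name to a non-empty list of numeric values.
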
## SHAP Explainability (A4)

When enabled (`ENABLE_SHAP=true`), computes SHAP values for each detected anomaly using `shap.TreeExplainer`. The top 5 contributing features are stored in the `reason` field.

## DBSCAN Clustering (A8)

When enabled (`ENABLE_CLUSTERING=true`), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a `campaign_id` (-1 = no cluster).

## Anubis Bot-Rule Enrichment

The `view_ai_features_1h` view enriches each IP with Anubis bot detection using a priority cascade:

1. **UA + IP combined** (same `rule_id`) — highest confidence
2. **UA only** (no IP requirement)
3. **IP only** (no UA requirement)
4. **ASN match**
5. **Country match**

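
The cascade is implemented in the ClickHouse view, but its logic can be illustrated in Python. Everything about the data shapes here is an assumption made for the example: rules are plain dictionaries with optional `ua`, `ip`, `asn`, and `country` keys, user agents are matched by substring, and IPs are matched against a CIDR range. The function names are hypothetical.

```python
import ipaddress


def rule_priority(rule, client):
    """Return the cascade level (1 = strongest) at which the rule matches, or None."""
    ua, ip = rule.get("ua"), rule.get("ip")
    ua_hit = ua is not None and ua in client.get("ua", "")
    ip_hit = ip is not None and (
        ipaddress.ip_address(client["ip"]) in ipaddress.ip_network(ip)
    )
    if ua_hit and ip_hit:
        return 1  # UA + IP combined in the same rule
    if ua_hit and ip is None:
        return 2  # UA only
    if ip_hit and ua is None:
        return 3  # IP only
    if rule.get("asn") is not None and rule["asn"] == client.get("asn"):
        return 4  # ASN match
    if rule.get("country") is not None and rule["country"] == client.get("country"):
        return 5  # country match
    return None


def enrich(client, rules):
    """Pick the name of the best-ranked matching rule, or None if nothing matches."""
    ranked = [(rule_priority(rule, client), rule["name"]) for rule in rules]
    ranked = [item for item in ranked if item[0] is not None]
    return min(ranked)[1] if ranked else None
```

A rule that names both a UA and an IP range only matches when both hit, which is why it outranks the single-criterion levels.
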
## Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |

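
A typed loader for a subset of these variables might look like the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical, and the set of strings accepted as "true" for boolean variables is an assumption; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Parse a boolean flag; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    clickhouse_host: str
    clickhouse_port: int
    anomaly_threshold: float
    cycle_interval_sec: int
    drift_threshold: float
    enable_shap: bool
    enable_multiwindow: bool


def load_settings(env=None):
    """Read settings from the environment, falling back to documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        clickhouse_host=env.get("CLICKHOUSE_HOST", "clickhouse"),
        clickhouse_port=int(env.get("CLICKHOUSE_PORT", "8123")),
        anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.03")),
        cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
        drift_threshold=float(env.get("DRIFT_THRESHOLD", "0.30")),
        enable_shap=_env_bool(env, "ENABLE_SHAP", True),
        enable_multiwindow=_env_bool(env, "ENABLE_MULTIWINDOW", False),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test, and a malformed number fails at startup with a `ValueError` rather than partway through a detection cycle.
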
## Output Tables

### ml_detected_anomalies

Detections whose score falls below the anomaly threshold. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip)`, TTL 30 days.

Key columns: `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus all ML features.

### ml_all_scores

All classifications, with no threshold filter, kept for observability. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 3 days.

## Decision Log Format

The `decisions.jsonl` file contains structured JSONL entries:

```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```

Log rotation: 50 MB max size × `LOG_BACKUP_COUNT` backups (default 7).

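
Because each line is a standalone JSON object, the log can be consumed with the standard library. The `summarize` helper below is a hypothetical reader written for illustration; it relies only on the `event` and `threat_level` fields shown in the sample entries.

```python
import json
from collections import Counter


def summarize(lines):
    """Count events by type and anomalies by threat level in a JSONL stream."""
    events = Counter()
    levels = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        events[entry["event"]] += 1
        if entry["event"] == "ANOMALY":
            levels[entry["threat_level"]] += 1
    return events, levels
```

The function accepts any iterable of lines, so an open file handle on the path configured in `BOT_DETECTOR_LOG` can be passed directly.
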
## Health Check Endpoint

- **URL**: `GET http://localhost:8080/` (port set by `HEALTH_PORT`, default `8080`)
- **Response**: `200 OK` with status JSON
- Runs in a separate thread

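
An endpoint of this kind can be built from the standard library, as in the sketch below. It is not the service's implementation: the handler and function names are hypothetical, and the `{"status": "ok"}` payload is an assumption, since the text only says the response carries status JSON.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode("utf-8")  # assumed payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health probes out of stderr


def start_health_server(port=8080, host="127.0.0.1"):
    """Serve health checks from a daemon thread and return the server object."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Running the server in a daemon thread keeps the detection loop in the main thread and lets the process exit without waiting on the HTTP server. Inside a container the host would typically be `0.0.0.0` rather than the loopback address used here.
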
## Model Persistence

| File | Description |
|------|-------------|
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_<name>.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |

Models are rotated: only the last `MODEL_HISTORY_COUNT` versions (default 10) are kept.

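
The file layout lends itself to a simple rotation routine. The sketch below makes two assumptions that the text does not confirm: the `.current` pointer file holds the version string as plain text, and version strings sort chronologically (for example, timestamps). The function names are hypothetical.

```python
from pathlib import Path


def current_model_path(model_dir, name):
    """Resolve the active model file through the `.current` pointer.

    Assumption: the pointer file contains just the version string.
    """
    version = (Path(model_dir) / f"model_{name}.current").read_text().strip()
    return Path(model_dir) / f"model_{name}_{version}.joblib"


def prune_models(model_dir, name, keep=10):
    """Delete all but the newest `keep` versions, including their metadata."""
    models = sorted(Path(model_dir).glob(f"model_{name}_*.joblib"))
    stale = models[:-keep] if keep > 0 else models
    for model in stale:
        meta = model.with_name(model.name[: -len(".joblib")] + ".meta.json")
        model.unlink()
        meta.unlink(missing_ok=True)  # metadata may be absent
    return [model.name for model in stale]
```

A production version would probably also refuse to delete the version named in the pointer file; that safeguard is left out here for brevity.
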
## Docker Deployment

```bash
# Build
make build-bot-detector

# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```

### Volumes

| Host Path | Container Path | Description |
|-----------|---------------|-------------|
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |