ja4-platform/docs/services/bot-detector.md

Bot Detector

The bot-detector is a Python service that runs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs in a continuous cycle (default: every 5 minutes, set by CYCLE_INTERVAL_SEC) and uses Isolation Forest to flag suspicious traffic patterns. Detections are then augmented with SHAP explanations, DBSCAN campaign clustering, and Anubis bot-rule labels.

ML Algorithm

Isolation Forest (Semi-Supervised)

The core algorithm is Isolation Forest (Liu, Ting & Zhou, 2008), an unsupervised anomaly detection algorithm that isolates points by recursively partitioning the feature space at random. Anomalies need fewer partitions to isolate than normal points, so they end up with shorter average path lengths in the trees.

The approach is semi-supervised because:

  1. Known bots are identified a priori via reputation dictionaries (IP, JA4, ASN)
  2. Human baseline is identified via ASN reputation labels (asn_label = 'human')
  3. The model trains only on human-baseline traffic (minimum 500 sessions required)
  4. Unknown traffic is scored by deviation from the human profile
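
The four steps above amount to partitioning sessions before any training happens. The sketch below is a minimal illustration and not the service's code: the `is_known_bot` key and both helper names are hypothetical stand-ins for the reputation-dictionary lookups, and only `asn_label = 'human'` and the 500-session minimum come from the text.

```python
MIN_HUMAN_SESSIONS = 500  # minimum baseline size stated above


def split_traffic(rows):
    """Partition sessions into known bots, human baseline, and unknown traffic.

    Each row is a dict. `is_known_bot` is a hypothetical flag standing in for
    the IP/JA4/ASN reputation dictionary lookups.
    """
    known_bots, humans, unknown = [], [], []
    for row in rows:
        if row.get("is_known_bot"):
            known_bots.append(row)      # logged as KNOWN_BOT, never scored
        elif row.get("asn_label") == "human":
            humans.append(row)          # training data for the model
        else:
            unknown.append(row)         # scored against the human profile
    return known_bots, humans, unknown


def can_train(humans, minimum=MIN_HUMAN_SESSIONS):
    """Training is only attempted when the human baseline is large enough."""
    return len(humans) >= minimum
```

Only the `humans` partition would be passed to the Isolation Forest for fitting; the `unknown` partition is what gets scored.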

Two-Model Architecture

| Model | Condition | Features | Data |
|---|---|---|---|
| Complet | correlated = 1 | 35 | HTTP + TCP + TLS (full pipeline data) |
| Applicatif | correlated = 0 | 31 | HTTP only (no TLS correlation available) |

Threat Levels

| Score Range | Level | Interpretation |
|---|---|---|
| < -0.30 | CRITICAL | Extremely anomalous behavior |
| -0.30 to < -0.15 | HIGH | Strong anomaly signal |
| -0.15 to < -0.05 | MEDIUM | Moderate anomaly |
| ≥ -0.05 | LOW | Slightly unusual |
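
The thresholds above read as a first-match cascade from most to least severe. A minimal sketch of that mapping follows; the function name `threat_level` is an assumption, and the boundaries are the ones listed in the table.

```python
def threat_level(score: float) -> str:
    """Map a normalized anomaly score (lower = more anomalous) to a level."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"
```

For instance, a score of -0.25 falls between -0.30 and -0.15 and therefore maps to HIGH.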

Feature List

Common Features (31 — Applicatif model)

HTTP Behavior

| Feature | Description |
|---|---|
| hits | Request count in the window |
| hit_velocity | Requests per second |
| fuzzing_index | Path/parameter diversity anomaly score |
| post_ratio | Fraction of POST requests |
| port_exhaustion_ratio | Fraction of distinct source ports / total |
| orphan_ratio | Requests without TLS correlation |
| head_ratio | Fraction of HEAD requests |
| http10_ratio | Fraction of HTTP/1.0 requests |
| generic_accept_ratio | Fraction of short Accept headers |
| sec_fetch_absence_rate | Fraction missing Sec-Fetch-Site |
| missing_accept_enc_ratio | Fraction missing Accept-Encoding |
| http_scheme_ratio | Fraction using HTTP (not HTTPS) |

Connection Management

| Feature | Description |
|---|---|
| max_keepalives | Max requests on a single Keep-Alive connection |
| tcp_shared_count | TCP connections shared between sessions |
| multiplexing_efficiency | HTTP/2 multiplexing efficiency |

Browser Fingerprint

| Feature | Description |
|---|---|
| header_count | HTTP headers sent |
| has_accept_language | Accept-Language header presence |
| has_cookie | Cookie header presence |
| has_referer | Referer header presence |
| modern_browser_score | Composite browser compliance score (0–100) |
| ua_ch_mismatch | User-Agent vs Client Hints inconsistency |
| ip_id_zero_ratio | IP packets with ID=0 (headless/minimal stack) |
| header_order_shared_count | IPs sharing same header order |
| header_order_confidence | Normalized entropy of header order |
| distinct_header_orders | Distinct header orderings per IP |
| is_fake_navigation | Sec-Fetch-Mode=navigate with non-document dest |

Navigation Patterns

| Feature | Description |
|---|---|
| request_size_variance | Variance of request sizes |
| mss_mobile_mismatch | TCP MSS vs mobile profile inconsistency |
| asset_ratio | Static asset request fraction |
| direct_access_ratio | Direct accesses (no referer) |
| is_ua_rotating | User-Agent rotation detected (flag) |
| distinct_ja4_count | Distinct JA4 fingerprints per IP |
| anomalous_payload_ratio | Anomalous payload size fraction |

Concentration & Rarity

| Feature | Description |
|---|---|
| src_port_density | Source port entropy |
| ja4_asn_concentration | JA4 concentration within ASN |
| ja4_country_concentration | JA4 concentration per country |
| is_rare_ja4 | Rare JA4 fingerprint (< 100 total hits) |

Temporal & Diversity

| Feature | Description |
|---|---|
| temporal_entropy | Temporal distribution entropy |
| path_diversity_ratio | URL path diversity |
| url_depth_variance | URL depth variance |
| ja3_diversity_ratio | JA3 diversity ratio per IP |

Additional TCP/TLS Features (Complet model only — 4 extra)

| Feature | Description |
|---|---|
| tcp_jitter_variance | TCP inter-packet jitter variance |
| alpn_http_mismatch | ALPN vs actual HTTP protocol mismatch |
| is_alpn_missing | ALPN absent in ClientHello |
| sni_host_mismatch | TLS SNI vs HTTP Host mismatch |

L4 Fingerprint Features (Complet model)

| Feature | Description |
|---|---|
| avg_ttl | Average IP TTL (OS fingerprint) |
| ttl_std | TTL standard deviation |
| no_window_scale_ratio | Fraction without TCP window scale |
| syn_timing_cv | SYN timing coefficient of variation |
| tls12_ratio | Fraction of TLS 1.2 connections |
| ip_df_variance | IP Don't-Fragment flag variance |

Detection Pipeline

1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Keep rows whose score is below the threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
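
Steps 5h, 5j, and 6 of the pipeline can be sketched in plain Python. This is an illustration under stated assumptions rather than the service's implementation: the helper names, the linear-interpolation percentile, and the strict `<` comparison are all assumptions. Only the `min(percentile_5, ANOMALY_THRESHOLD)` rule and the "keep lowest score per src_ip" rule come from the pipeline description.

```python
def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def adaptive_threshold(scores, anomaly_threshold=-0.03, pct=5):
    """A2: the stricter of the data-driven percentile and the fixed threshold."""
    return min(percentile(scores, pct), anomaly_threshold)


def select_anomalies(rows, threshold):
    """Keep rows scoring below the threshold (lower = more anomalous)."""
    return [row for row in rows if row["score"] < threshold]


def dedupe_by_ip(rows):
    """Step 6: when both models flag the same src_ip, keep the lowest score."""
    best = {}
    for row in rows:
        current = best.get(row["src_ip"])
        if current is None or row["score"] < current["score"]:
            best[row["src_ip"]] = row
    return list(best.values())
```

Taking the minimum means the fixed threshold acts as a ceiling: a quiet window where the 5th percentile sits near zero cannot loosen the cutoff beyond ANOMALY_THRESHOLD.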

Concept Drift Detection (A1)

Uses the Kolmogorov-Smirnov test to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds DRIFT_THRESHOLD (default: 0.30), the model is retrained.
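
A standard-library sketch of this check is shown below. It computes the two-sample KS statistic directly and compares it with the usual asymptotic critical value. The significance level `alpha = 0.05`, the per-feature decision rule, and all function names are assumptions, since the text only specifies the KS test and the DRIFT_THRESHOLD comparison.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def feature_drifted(train, current, alpha=0.05):
    """True when the KS statistic exceeds the asymptotic critical value (assumed rule)."""
    n, m = len(train), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, current) > critical


def needs_retrain(train_features, current_features, drift_threshold=0.30):
    """Retrain when the fraction of drifted features exceeds DRIFT_THRESHOLD."""
    drifted = sum(
        feature_drifted(train_features[name], current_features[name])
        for name in train_features
    )
    return drifted / len(train_features) > drift_threshold
```

Both arguments to `needs_retrain` are dictionaries mapping a feature name to its list of observed values.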

SHAP Explainability (A4)

When enabled (ENABLE_SHAP=true), computes SHAP values for each detected anomaly using shap.TreeExplainer. The top 5 contributing features are stored in the reason field.
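
The step from per-feature SHAP values to the stored `reason` string might look like the sketch below. It assumes the SHAP values have already been computed and are passed in as a plain dictionary. Ranking by absolute contribution and the `name=value` format (inferred from the decision log example) are assumptions, as is the helper name `build_reason`.

```python
def build_reason(shap_values, feature_values, top_n=5):
    """Format the top-N features by absolute SHAP contribution as 'name=value' pairs.

    shap_values:    feature name -> SHAP contribution for one anomaly
    feature_values: feature name -> observed feature value for the same row
    """
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    return ", ".join(f"{name}={feature_values[name]}" for name in ranked[:top_n])
```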

DBSCAN Clustering (A8)

When enabled (ENABLE_CLUSTERING=true), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a campaign_id (-1 = no cluster).
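
To make the `campaign_id` semantics concrete, here is a small pure-Python DBSCAN. It is a teaching sketch, not the service's code, which more plausibly relies on a library implementation. The `eps` radius and the Euclidean metric are assumptions; `min_samples=3` mirrors the CLUSTERING_MIN_SAMPLES default, and a point counts itself among its neighbors.

```python
import math

NOISE = -1  # same convention as campaign_id = -1 (no cluster)


def dbscan(points, eps, min_samples=3):
    """Return one cluster label per point; -1 marks noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is core too: keep expanding
    return labels
```

Two tight groups of three anomalies plus one isolated anomaly would yield two campaigns and a single -1 label.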

Anubis Bot-Rule Enrichment

The view_ai_features_1h view enriches each IP with Anubis bot detection using a priority cascade:

  1. UA + IP combined (same rule_id) — highest confidence
  2. UA only (no IP requirement)
  3. IP only (no UA requirement)
  4. ASN match
  5. Country match
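
The cascade above is implemented in the ClickHouse view, but its logic can be sketched in Python for clarity. The rule dictionaries, the substring User-Agent match, and the exact-IP match are simplifying assumptions, and the function name `match_rule` is hypothetical.

```python
def match_rule(rules, ua, ip, asn, country):
    """Return the first rule that matches, trying the highest-confidence tier first."""
    tiers = [
        # 1. UA and IP required by the same rule
        lambda r: "ua" in r and "ip" in r and r["ua"] in ua and r["ip"] == ip,
        # 2. UA only
        lambda r: "ua" in r and "ip" not in r and r["ua"] in ua,
        # 3. IP only
        lambda r: "ip" in r and "ua" not in r and r["ip"] == ip,
        # 4. ASN match
        lambda r: asn is not None and r.get("asn") == asn,
        # 5. Country match
        lambda r: country is not None and r.get("country") == country,
    ]
    for tier in tiers:
        for rule in rules:
            if tier(rule):
                return rule
    return None
```

Iterating tiers in the outer loop is what gives the priority: a country-level rule can only win when no rule matched at any of the four higher tiers.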

Environment Variables

| Variable | Type | Default | Description |
|---|---|---|---|
| CLICKHOUSE_HOST | string | clickhouse | ClickHouse server hostname |
| CLICKHOUSE_PORT | int | 8123 | ClickHouse HTTP port |
| CLICKHOUSE_DB | string | ja4_processing | Database name |
| CLICKHOUSE_USER | string | admin | ClickHouse username |
| CLICKHOUSE_PASSWORD | string | "" | ClickHouse password |
| ISOLATION_CONTAMINATION | float | 0.02 | Contamination parameter for Isolation Forest |
| ANOMALY_THRESHOLD | float | -0.03 | Score threshold for anomaly detection |
| ANOMALY_PERCENTILE | int | 5 | Percentile for adaptive threshold (A2) |
| CYCLE_INTERVAL_SEC | int | 300 | Seconds between detection cycles |
| MAX_CONSECUTIVE_FAILURES | int | 3 | Max consecutive failures before exit |
| BOT_DETECTOR_LOG | string | /var/log/bot_detector/decisions.jsonl | Decision log file path |
| LOG_BACKUP_COUNT | int | 7 | Number of rotated log backups |
| MODEL_DIR | string | /var/lib/bot_detector | Model persistence directory |
| RETRAIN_INTERVAL_HOURS | int | 24 | Hours between model retraining |
| MODEL_HISTORY_COUNT | int | 10 | Number of model versions to keep |
| DRIFT_THRESHOLD | float | 0.30 | KS-test drift threshold (A1) |
| ENABLE_MULTIWINDOW | bool | false | Enable 24h multi-window analysis (A3) |
| MULTIWINDOW_VIEW | string | view_ai_features_24h | View for multi-window mode |
| ENABLE_SHAP | bool | true | Enable SHAP explainability (A4) |
| DEDUP_TTL_MIN | int | 60 | Deduplication TTL in minutes (A5) |
| RECURRENCE_WEIGHT | float | 0.005 | Recurrence score weighting factor (A6) |
| MIN_VALID_FEATURE_RATIO | float | 0.50 | Min valid feature ratio (A7) |
| ENABLE_CLUSTERING | bool | true | Enable DBSCAN clustering (A8) |
| CLUSTERING_MIN_SAMPLES | int | 3 | DBSCAN min samples per cluster |
| HEALTH_PORT | int | 8080 | Health check HTTP server port |
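
As an illustration of how these variables and defaults could be consumed, the sketch below reads a subset of them with the standard library. The function names, the dictionary keys, and the accepted boolean spellings are assumptions; the variable names and defaults are taken from the table.

```python
import os


def env_bool(name, default):
    """Parse a boolean variable; the accepted spellings are an assumption."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")


def load_settings():
    """Read a subset of the variables above, falling back to the documented defaults."""
    return {
        "clickhouse_host": os.environ.get("CLICKHOUSE_HOST", "clickhouse"),
        "clickhouse_port": int(os.environ.get("CLICKHOUSE_PORT", "8123")),
        "clickhouse_db": os.environ.get("CLICKHOUSE_DB", "ja4_processing"),
        "contamination": float(os.environ.get("ISOLATION_CONTAMINATION", "0.02")),
        "anomaly_threshold": float(os.environ.get("ANOMALY_THRESHOLD", "-0.03")),
        "cycle_interval_sec": int(os.environ.get("CYCLE_INTERVAL_SEC", "300")),
        "enable_shap": env_bool("ENABLE_SHAP", True),
        "enable_multiwindow": env_bool("ENABLE_MULTIWINDOW", False),
    }
```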

Output Tables

ml_detected_anomalies

Anomalies whose score falls below the detection threshold (lower scores are more anomalous). Engine: ReplacingMergeTree(detected_at), ORDER BY (src_ip), TTL 30 days.

Key columns: detected_at, src_ip, ja4, host, bot_name, anomaly_score, raw_anomaly_score, threat_level, model_name, recurrence, campaign_id, reason, anubis_bot_name, anubis_bot_action, anubis_bot_category, plus all ML features.

ml_all_scores

All classifications (no threshold filter) for observability. Engine: ReplacingMergeTree(detected_at), ORDER BY (window_start, src_ip, ja4, host, model_name), TTL 3 days.

Decision Log Format

The decisions.jsonl file contains structured JSONL entries:

```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```

Log rotation: 50 MB max size × LOG_BACKUP_COUNT backups (default 7).
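
One way to get this behavior with the standard library is `logging.handlers.RotatingFileHandler`, sketched below. Whether the service uses this handler is an assumption, as are the logger name, the helper names, and reading "50 MB" as 50 × 1024 × 1024 bytes.

```python
import json
import logging
from logging.handlers import RotatingFileHandler


def make_decision_logger(path, backup_count=7, max_bytes=50 * 1024 * 1024):
    """Create a logger that writes one JSON object per line with size-based rotation."""
    logger = logging.getLogger("bot_detector.decisions")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep decision lines out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, event, **fields):
    """Serialize one decision as a single JSONL line."""
    logger.info(json.dumps({"event": event, **fields}))
```

Calling `log_event(logger, "KNOWN_BOT", src_ip="198.51.100.10", bot_name="AhrefsBot")` would append a line shaped like the KNOWN_BOT entry in the format above.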

Health Check Endpoint

  • URL: GET http://localhost:8080/ (port set by HEALTH_PORT, default 8080)
  • Response: 200 OK with status JSON
  • Runs in a separate thread
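
A health endpoint with these properties can be sketched with the standard library alone. The JSON payload `{"status": "ok"}`, the class name, and the function name are assumptions; the text only states that the endpoint answers 200 with status JSON from a separate thread.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()  # payload shape is assumed
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep health probes out of the service logs


def start_health_server(port=8080):
    """Serve health checks from a daemon thread so the detection loop is not blocked."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Running the server in a daemon thread means it stops with the main process and never delays a detection cycle.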

Model Persistence

| File | Description |
|---|---|
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_<name>.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |

Models are rotated: only the last MODEL_HISTORY_COUNT versions (default 10) are kept.
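
The rotation rule can be illustrated as follows. This sketch assumes version strings sort lexicographically in chronological order (for example timestamps), which the text does not state, and the helper name `prune_models` is hypothetical.

```python
from pathlib import Path


def prune_models(model_dir, name, keep=10):
    """Delete all but the newest `keep` versions of a model; return the removed versions."""
    prefix, suffix = f"model_{name}_", ".joblib"
    versions = sorted(
        p.name[len(prefix):-len(suffix)]
        for p in Path(model_dir).glob(f"{prefix}*{suffix}")
    )
    stale = versions[:-keep] if keep > 0 else versions
    for version in stale:
        # remove the serialized model together with its metadata file
        for ext in (".joblib", ".meta.json"):
            Path(model_dir, f"{prefix}{version}{ext}").unlink(missing_ok=True)
    return stale
```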

Docker Deployment

```bash
# Build
make build-bot-detector

# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```

Volumes

| Host Path | Container Path | Description |
|---|---|---|
| ./bot_detector_logs | /var/log/bot_detector | Decision logs (JSONL) |
| ./bot_detector_models | /var/lib/bot_detector | Persisted ML models |
| ./reputation/data/user_files/bot_ip.csv | /data/bot_ip.csv (ro) | Known bot IP list |
| ./reputation/data/user_files/bot_ja4.csv | /data/bot_ja4.csv (ro) | Known bot JA4 list |
| ./reputation/data/user_files/asn_reputation.csv | /data/asn_reputation.csv (ro) | ASN reputation labels |