
toto 9f3e0621e5 feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)

Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit

Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)

Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
  with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
  with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)

All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-04-07 19:10:35 +02:00


Bot Detector

The bot-detector is a Python service that runs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs in a continuous cycle (default: every 5 minutes) and uses Isolation Forest to identify suspicious traffic patterns. Detections are complemented by SHAP explainability, DBSCAN clustering, and Anubis bot-rule enrichment.

ML Algorithm

Isolation Forest (Semi-Supervised)

The core algorithm is Isolation Forest (Liu, Ting & Zhou, 2008) — an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning feature space. Anomalies require fewer partitions to isolate than normal points.

The approach is semi-supervised because:

- Known bots are identified a priori via reputation dictionaries (IP, JA4, ASN)
- Human baseline is identified via ASN reputation labels (asn_label = 'human')
- The model trains only on human-baseline traffic (minimum 500 sessions required)
- Unknown traffic is scored by deviation from the human profile
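As an illustration of the semi-supervised setup described above, the following sketch fits an Isolation Forest on human-baseline rows only and then scores unlabeled rows. The use of scikit-learn, the function name `fit_on_human_baseline`, and the synthetic data are assumptions made for the example; only the 500-session minimum and the 0.02 contamination default come from this document.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

MIN_HUMAN_SESSIONS = 500  # minimum baseline size stated in this document


def fit_on_human_baseline(human_X, contamination=0.02, seed=0):
    """Fit an Isolation Forest on human-baseline feature rows only."""
    if len(human_X) < MIN_HUMAN_SESSIONS:
        raise ValueError("not enough human-baseline sessions to train")
    model = IsolationForest(contamination=contamination, random_state=seed)
    return model.fit(human_X)


# Synthetic stand-in for feature vectors (hypothetical data, 4 features).
rng = np.random.default_rng(0)
human_X = rng.normal(loc=0.0, scale=1.0, size=(600, 4))
model = fit_on_human_baseline(human_X)

# decision_function is negative for points the model treats as anomalous
# and positive for points that look like the training baseline.
typical = np.zeros((1, 4))
outlier = np.full((1, 4), 25.0)
typical_score = model.decision_function(typical)[0]
outlier_score = model.decision_function(outlier)[0]
```

Because the model only ever sees human-labeled traffic, a low score means "unlike the human profile" rather than "like a known bot".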

Two-Model Architecture

Model	Condition	Features	Data
Complet	`correlated = 1`	35	HTTP + TCP + TLS (full pipeline data)
Applicatif	`correlated = 0`	31	HTTP only (no TLS correlation available)

Threat Levels

Score Range	Level	Interpretation
`score < -0.30`	CRITICAL	Extremely anomalous behavior
`-0.30 ≤ score < -0.15`	HIGH	Strong anomaly signal
`-0.15 ≤ score < -0.05`	MEDIUM	Moderate anomaly
`score ≥ -0.05`	LOW	Slightly unusual
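The table can be read as a first-match cascade from the most severe level down. A minimal sketch of that mapping follows; the function name `threat_level` is hypothetical, and the first-match reading is an assumption about how the ranges are applied.

```python
def threat_level(score: float) -> str:
    """Map a normalized anomaly score to a threat level (lower is worse)."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"
```

Under this reading a score of -0.25 maps to HIGH, which agrees with the sample ANOMALY entry in the decision log.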

Feature List

Common Features (31 — Applicatif model)

HTTP Behavior

Feature	Description
`hits`	Request count in the window
`hit_velocity`	Requests per second
`fuzzing_index`	Path/parameter diversity anomaly score
`post_ratio`	Fraction of POST requests
`port_exhaustion_ratio`	Fraction of distinct source ports / total
`orphan_ratio`	Requests without TLS correlation
`head_ratio`	Fraction of HEAD requests
`http10_ratio`	Fraction of HTTP/1.0 requests
`generic_accept_ratio`	Fraction of short Accept headers
`sec_fetch_absence_rate`	Fraction missing Sec-Fetch-Site
`missing_accept_enc_ratio`	Fraction missing Accept-Encoding
`http_scheme_ratio`	Fraction using HTTP (not HTTPS)

Connection Management

Feature	Description
`max_keepalives`	Max requests on a single Keep-Alive connection
`tcp_shared_count`	TCP connections shared between sessions
`multiplexing_efficiency`	HTTP/2 multiplexing efficiency

Browser Fingerprint

Feature	Description
`header_count`	HTTP headers sent
`has_accept_language`	Accept-Language header presence
`has_cookie`	Cookie header presence
`has_referer`	Referer header presence
`modern_browser_score`	Composite browser compliance score (0–100)
`ua_ch_mismatch`	User-Agent vs Client Hints inconsistency
`ip_id_zero_ratio`	IP packets with ID=0 (headless/minimal stack)
`header_order_shared_count`	IPs sharing same header order
`header_order_confidence`	Normalized entropy of header order
`distinct_header_orders`	Distinct header orderings per IP
`is_fake_navigation`	Sec-Fetch-Mode=navigate with non-document dest

Feature	Description
`request_size_variance`	Variance of request sizes
`mss_mobile_mismatch`	TCP MSS vs mobile profile inconsistency
`asset_ratio`	Static asset request fraction
`direct_access_ratio`	Direct accesses (no referer)
`is_ua_rotating`	User-Agent rotation detected (flag)
`distinct_ja4_count`	Distinct JA4 fingerprints per IP
`anomalous_payload_ratio`	Anomalous payload size fraction

Concentration & Rarity

Feature	Description
`src_port_density`	Source port entropy
`ja4_asn_concentration`	JA4 concentration within ASN
`ja4_country_concentration`	JA4 concentration per country
`is_rare_ja4`	Rare JA4 fingerprint (< 100 total hits)

Temporal & Diversity

Feature	Description
`temporal_entropy`	Temporal distribution entropy
`path_diversity_ratio`	URL path diversity
`url_depth_variance`	URL depth variance
`ja3_diversity_ratio`	JA3 diversity ratio per IP

Additional TCP/TLS Features (Complet model only — 4 extra)

Feature	Description
`tcp_jitter_variance`	TCP inter-packet jitter variance
`alpn_http_mismatch`	ALPN vs actual HTTP protocol mismatch
`is_alpn_missing`	ALPN absent in ClientHello
`sni_host_mismatch`	TLS SNI vs HTTP Host mismatch

L4 Fingerprint Features (Complet model)

Feature	Description
`avg_ttl`	Average IP TTL (OS fingerprint)
`ttl_std`	TTL standard deviation
`no_window_scale_ratio`	Fraction without TCP window scale
`syn_timing_cv`	SYN timing coefficient of variation
`tls12_ratio`	Fraction of TLS 1.2 connections
`ip_df_variance`	IP Don't-Fragment flag variance

Detection Pipeline

1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Keep only scores below the threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
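Steps 5h, 5j and 6 can be sketched in plain Python as below. The helper names and the row layout (dicts with `src_ip` and `score`) are hypothetical, and the percentile uses linear interpolation, which is an assumption about how the service computes it.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no scores to compute a percentile from")
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def adaptive_threshold(scores, anomaly_threshold=-0.03, pct=5):
    """Step 5h: threshold = min(percentile_5, ANOMALY_THRESHOLD)."""
    return min(percentile(scores, pct), anomaly_threshold)


def keep_below_threshold(rows, threshold):
    """Step 5j: keep rows whose score is strictly below the threshold."""
    return [row for row in rows if row["score"] < threshold]


def dedupe_by_ip(rows):
    """Step 6: one row per src_ip, keeping the lowest (most anomalous) score."""
    best = {}
    for row in rows:
        current = best.get(row["src_ip"])
        if current is None or row["score"] < current["score"]:
            best[row["src_ip"]] = row
    return list(best.values())
```

Taking the minimum of the two values means the fixed ANOMALY_THRESHOLD acts as an upper bound: the adaptive percentile can only make the cut stricter, never looser.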

Concept Drift Detection (A1)

Uses the Kolmogorov-Smirnov test to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds DRIFT_THRESHOLD (default: 0.30), the model is retrained.
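The drift check can be illustrated with a pure-Python two-sample KS statistic. This is a sketch, not the service's code: the function names are hypothetical, the significance level (0.05, via the asymptotic critical value 1.358) is an assumption since this document does not state one, and the service may rely on a library routine instead.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def feature_drifted(train, current, c_alpha=1.358):
    """Compare the statistic to the asymptotic critical value (alpha = 0.05)."""
    n, m = len(train), len(current)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, current) > critical


def should_retrain(train_features, current_features, drift_threshold=0.30):
    """Retrain when the fraction of drifted features exceeds DRIFT_THRESHOLD."""
    drifted = [
        name for name in train_features
        if feature_drifted(train_features[name], current_features[name])
    ]
    return len(drifted) / len(train_features) > drift_threshold
```

With the default of 0.30, one drifted feature out of three triggers a retrain, while one out of four does not.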

SHAP Explainability (A4)

When enabled (ENABLE_SHAP=true), computes SHAP values for each detected anomaly using shap.TreeExplainer. The top 5 contributing features are stored in the reason field.
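The step from per-feature contributions to the reason field can be sketched as follows. The example takes the contributions as already computed (it does not call shap.TreeExplainer), ranks them by absolute value, and formats the result like the reason string in the sample decision log. The function name `build_reason` and the ranking by absolute value are assumptions.

```python
def build_reason(feature_values, contributions, top_n=5):
    """Format the top-N contributing features as 'name=value, name=value'."""
    ranked = sorted(
        contributions,
        key=lambda name: abs(contributions[name]),
        reverse=True,
    )
    return ", ".join(f"{name}={feature_values[name]}" for name in ranked[:top_n])


# Hypothetical values and contributions for one flagged session.
values = {
    "hit_velocity": 45.2, "fuzzing_index": 0.8, "post_ratio": 0.1,
    "has_cookie": 0, "header_count": 4, "asset_ratio": 0.0,
}
contribs = {
    "hit_velocity": -0.21, "fuzzing_index": -0.12, "post_ratio": 0.01,
    "has_cookie": -0.05, "header_count": -0.04, "asset_ratio": -0.03,
}
reason = build_reason(values, contribs)
```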

DBSCAN Clustering (A8)

When enabled (ENABLE_CLUSTERING=true), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a campaign_id (-1 = no cluster).
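To show where the campaign_id values come from, here is a minimal pure-Python DBSCAN. It is a teaching sketch: the service may well use a library implementation, and the `eps` radius is a hypothetical parameter that this document does not specify. Only the default of 3 minimum samples (CLUSTERING_MIN_SAMPLES) and the -1 label for unclustered points come from the text.

```python
import math
from collections import deque


def dbscan(points, eps, min_samples=3):
    """Return one cluster label per point; -1 means no cluster (noise)."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point joins the cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels
```

In practice the feature vectors would need scaling before distances are meaningful, since the features listed above live on very different ranges.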

Anubis Bot-Rule Enrichment

The view_ai_features_1h view enriches each IP with Anubis bot detection using a priority cascade:

1. UA + IP combined (same rule_id), highest confidence
2. UA only (no IP requirement)
3. IP only (no UA requirement)
4. ASN match
5. Country match
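The cascade is implemented in the ClickHouse view, but its selection logic can be mirrored in a few lines of Python for clarity. The match kinds, the dict layout, and the function name `pick_rule` are all hypothetical; only the priority order is taken from the list above.

```python
# Match kinds in descending priority, following the cascade above.
PRIORITY = ["ua_ip", "ua", "ip", "asn", "country"]


def pick_rule(matches):
    """Return the highest-priority match, or None when nothing matched."""
    rank = {kind: position for position, kind in enumerate(PRIORITY)}
    candidates = [m for m in matches if m["kind"] in rank]
    if not candidates:
        return None
    return min(candidates, key=lambda m: rank[m["kind"]])
```

For example, when an IP matches both an ASN rule and a UA-only rule, the UA-only rule wins because it sits higher in the cascade.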

Environment Variables

Variable	Type	Default	Description
`CLICKHOUSE_HOST`	string	`clickhouse`	ClickHouse server hostname
`CLICKHOUSE_PORT`	int	`8123`	ClickHouse HTTP port
`CLICKHOUSE_DB`	string	`ja4_processing`	Database name
`CLICKHOUSE_USER`	string	`admin`	ClickHouse username
`CLICKHOUSE_PASSWORD`	string	`""`	ClickHouse password
`ISOLATION_CONTAMINATION`	float	`0.02`	Contamination parameter for Isolation Forest
`ANOMALY_THRESHOLD`	float	`-0.03`	Score threshold for anomaly detection
`ANOMALY_PERCENTILE`	int	`5`	Percentile for adaptive threshold (A2)
`CYCLE_INTERVAL_SEC`	int	`300`	Seconds between detection cycles
`MAX_CONSECUTIVE_FAILURES`	int	`3`	Max consecutive failures before exit
`BOT_DETECTOR_LOG`	string	`/var/log/bot_detector/decisions.jsonl`	Decision log file path
`LOG_BACKUP_COUNT`	int	`7`	Number of rotated log backups
`MODEL_DIR`	string	`/var/lib/bot_detector`	Model persistence directory
`RETRAIN_INTERVAL_HOURS`	int	`24`	Hours between model retraining
`MODEL_HISTORY_COUNT`	int	`10`	Number of model versions to keep
`DRIFT_THRESHOLD`	float	`0.30`	KS-test drift threshold (A1)
`ENABLE_MULTIWINDOW`	bool	`false`	Enable 24h multi-window analysis (A3)
`MULTIWINDOW_VIEW`	string	`view_ai_features_24h`	View for multi-window mode
`ENABLE_SHAP`	bool	`true`	Enable SHAP explainability (A4)
`DEDUP_TTL_MIN`	int	`60`	Deduplication TTL in minutes (A5)
`RECURRENCE_WEIGHT`	float	`0.005`	Recurrence score weighting factor (A6)
`MIN_VALID_FEATURE_RATIO`	float	`0.50`	Min valid feature ratio (A7)
`ENABLE_CLUSTERING`	bool	`true`	Enable DBSCAN clustering (A8)
`CLUSTERING_MIN_SAMPLES`	int	`3`	DBSCAN min samples per cluster
`HEALTH_PORT`	int	`8080`	Health check HTTP server port
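A sketch of reading a subset of these variables with their documented defaults is shown below. The `Settings` class, `load_settings`, and the set of strings accepted as true are assumptions for illustration; the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Parse a boolean variable; accepted truthy spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    clickhouse_host: str
    clickhouse_port: int
    clickhouse_db: str
    anomaly_threshold: float
    cycle_interval_sec: int
    drift_threshold: float
    enable_shap: bool
    enable_multiwindow: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        clickhouse_host=env.get("CLICKHOUSE_HOST", "clickhouse"),
        clickhouse_port=int(env.get("CLICKHOUSE_PORT", "8123")),
        clickhouse_db=env.get("CLICKHOUSE_DB", "ja4_processing"),
        anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.03")),
        cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
        drift_threshold=float(env.get("DRIFT_THRESHOLD", "0.30")),
        enable_shap=_env_bool(env, "ENABLE_SHAP", True),
        enable_multiwindow=_env_bool(env, "ENABLE_MULTIWINDOW", False),
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to test without touching the process environment.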

Output Tables

ml_detected_anomalies

Detections whose anomaly score falls below the detection threshold (lower scores are more anomalous). Engine: ReplacingMergeTree(detected_at), ORDER BY (src_ip), TTL 30 days.

Key columns: detected_at, src_ip, ja4, host, bot_name, anomaly_score, raw_anomaly_score, threat_level, model_name, recurrence, campaign_id, reason, anubis_bot_name, anubis_bot_action, anubis_bot_category, plus all ML features.

ml_all_scores

All classifications (no threshold filter) for observability. Engine: ReplacingMergeTree(detected_at), ORDER BY (window_start, src_ip, ja4, host, model_name), TTL 3 days.

Decision Log Format

The decisions.jsonl file contains structured JSONL entries:

{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}

Log rotation: 50 MB max size × LOG_BACKUP_COUNT backups (default 7).
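Because each line is a standalone JSON object, the log can be consumed with the standard library alone. The following sketch counts events and collects the source IPs of HIGH and CRITICAL anomalies; the function name `summarize` and the choice of what to collect are illustrative assumptions, and the sample lines are shortened versions of the entries shown above.

```python
import json
from collections import Counter


def summarize(lines):
    """Count events and list src_ip values for HIGH/CRITICAL anomalies."""
    counts = Counter()
    severe_ips = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        entry = json.loads(line)
        counts[entry["event"]] += 1
        if entry["event"] == "ANOMALY" and entry.get("threat_level") in {"HIGH", "CRITICAL"}:
            severe_ips.append(entry["src_ip"])
    return counts, severe_ips


sample = [
    '{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000}',
    '{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH"}',
    '{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}',
    '{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15}',
]
counts, severe_ips = summarize(sample)
```

The same function works on an open file handle, for example `summarize(open(path, encoding="utf-8"))`.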

Health Check Endpoint

- URL: GET http://localhost:8080/ (port set by HEALTH_PORT, default 8080)
- Response: 200 OK with status JSON
- Runs in a separate thread
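A health endpoint of this shape can be built with the standard library, as sketched below. The handler class, the `start_health_server` helper, and the `{"status": "ok"}` payload are assumptions; this document only states that the endpoint returns 200 OK with status JSON from a separate thread.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical payload; the real status JSON is not documented here.
        body = json.dumps({"status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health probes out of the service log


def start_health_server(host="0.0.0.0", port=8080):
    """Serve health checks from a daemon thread and return the server."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Running the server in a daemon thread lets the detection cycle own the main thread, and the process can exit without waiting for the HTTP server.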

Model Persistence

File	Description
`model_<name>_<version>.joblib`	Serialized Isolation Forest (joblib)
`model_<name>_<version>.meta.json`	Model metadata (features, thresholds, training stats)
`model_<name>.current`	Pointer to active model version
`training_history.jsonl`	Training history log

Models are rotated: only the last MODEL_HISTORY_COUNT versions (default 10) are kept.
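The rotation rule can be sketched as a small pruning helper. It assumes that version strings sort chronologically when compared as text (for example, timestamps), which this document does not confirm; the function name `prune_old_models` is hypothetical. File names follow the table above.

```python
from pathlib import Path


def prune_old_models(model_dir, name, keep=10):
    """Delete all but the newest `keep` versions; return removed versions."""
    model_dir = Path(model_dir)
    prefix = f"model_{name}_"
    suffix = ".joblib"
    # Assumption: lexicographic order of version strings is chronological.
    versions = sorted(
        path.name[len(prefix):-len(suffix)]
        for path in model_dir.glob(f"{prefix}*{suffix}")
    )
    stale = versions[:-keep] if keep > 0 else versions
    for version in stale:
        for ext in (".joblib", ".meta.json"):
            (model_dir / f"{prefix}{version}{ext}").unlink(missing_ok=True)
    return stale
```

The model and its metadata file are removed together so that no `.meta.json` is left pointing at a missing model; the `.current` pointer is left untouched.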

Docker Deployment

# Build
make build-bot-detector

# Run with docker-compose
cd services/bot-detector
docker-compose up -d

Volumes

Host Path	Container Path	Description
`./bot_detector_logs`	`/var/log/bot_detector`	Decision logs (JSONL)
`./bot_detector_models`	`/var/lib/bot_detector`	Persisted ML models
`./reputation/data/user_files/bot_ip.csv`	`/data/bot_ip.csv` (ro)	Known bot IP list
`./reputation/data/user_files/bot_ja4.csv`	`/data/bot_ja4.csv` (ro)	Known bot JA4 list
`./reputation/data/user_files/asn_reputation.csv`	`/data/asn_reputation.csv` (ro)	ASN reputation labels
