Bot Detector
The bot-detector is a Python service that performs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs on a continuous cycle (default: every 5 minutes), using Isolation Forest to identify suspicious traffic patterns, enriched with SHAP explainability, DBSCAN clustering, and Anubis bot-rule enrichment.
ML Algorithm
Isolation Forest (Semi-Supervised)
The core algorithm is Isolation Forest (Liu, Ting & Zhou, 2008), an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning the feature space. Anomalies require fewer partitions to isolate than normal points.
The approach is semi-supervised because:
- Known bots are identified a priori via reputation dictionaries (IP, JA4, ASN)
- Human baseline is identified via ASN reputation labels (`asn_label = 'human'`)
- The model trains only on human-baseline traffic (minimum 500 sessions required)
- Unknown traffic is scored by deviation from the human profile
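The training scheme above can be sketched with scikit-learn's `IsolationForest`. This is a minimal illustration, not the service's code: the synthetic two-feature data, the `MIN_HUMAN_SESSIONS` constant name, and the `train_on_human_baseline` helper are assumptions. Only the 500-session minimum and the 0.02 contamination default come from this document.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

MIN_HUMAN_SESSIONS = 500  # minimum baseline size stated above (constant name is hypothetical)


def train_on_human_baseline(human_features: np.ndarray) -> IsolationForest:
    """Fit an Isolation Forest on human-labelled sessions only."""
    if len(human_features) < MIN_HUMAN_SESSIONS:
        raise ValueError("not enough human-baseline sessions to train")
    model = IsolationForest(contamination=0.02, random_state=0)
    return model.fit(human_features)


# Synthetic stand-in data with two features (think hit_velocity and post_ratio).
rng = np.random.default_rng(0)
human = rng.normal(loc=[1.0, 0.1], scale=[0.3, 0.05], size=(600, 2))
unknown = np.array([
    [1.1, 0.12],   # close to the human profile
    [45.0, 0.95],  # far from the human profile
])

model = train_on_human_baseline(human)
scores = model.decision_function(unknown)  # lower score = more anomalous
print(scores)
```

Because the forest only ever sees human-baseline rows, known bots never influence what the model treats as normal; they are handled separately by the reputation dictionaries.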
Two-Model Architecture
| Model | Condition | Features | Data |
|---|---|---|---|
| Complet | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| Applicatif | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |
Threat Levels
| Score Range | Level | Interpretation |
|---|---|---|
| `< -0.30` | CRITICAL | Extremely anomalous behavior |
| `< -0.15` | HIGH | Strong anomaly signal |
| `< -0.05` | MEDIUM | Moderate anomaly |
| `≥ -0.05` | LOW | Slightly unusual |
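The score ranges above translate directly into a small lookup. The function name `threat_level` is an assumption made for this sketch; the boundaries are the ones listed in the table.

```python
def threat_level(score: float) -> str:
    """Map a normalized anomaly score to a threat level (lower = more anomalous)."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"


for s in (-0.42, -0.25, -0.10, -0.01):
    print(s, threat_level(s))
```

The checks run from the most negative boundary upward, so each score lands in the first (most severe) range it satisfies.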
Feature List
Common Features (31 — Applicatif model)
HTTP Behavior
| Feature | Description |
|---|---|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |
Connection Management
| Feature | Description |
|---|---|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |
Browser Fingerprint
| Feature | Description |
|---|---|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |
Navigation Patterns
| Feature | Description |
|---|---|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |
Concentration & Rarity
| Feature | Description |
|---|---|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |
Temporal & Diversity
| Feature | Description |
|---|---|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |
Additional TCP/TLS Features (Complet model only — 4 extra)
| Feature | Description |
|---|---|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |
L4 Fingerprint Features (Complet model)
| Feature | Description |
|---|---|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |
Detection Pipeline
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Filter scores below threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
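Two of the steps above are easy to show in isolation: the adaptive threshold (step 5h) and the per-IP deduplication (step 6). The sketch below uses only the standard library. The function names and the row layout (`src_ip`, `score`) are assumptions, and the linear-interpolation percentile is one common convention that may differ from what the service uses.

```python
import math


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def adaptive_threshold(scores, anomaly_threshold=-0.03, pct=5):
    """Step 5h: take the stricter (lower) of the percentile and the fixed threshold."""
    return min(percentile(scores, pct), anomaly_threshold)


def dedupe_lowest(rows):
    """Step 6: keep one row per src_ip, the one with the lowest (most anomalous) score."""
    best = {}
    for row in rows:
        ip = row["src_ip"]
        if ip not in best or row["score"] < best[ip]["score"]:
            best[ip] = row
    return list(best.values())


scores = [-0.5, -0.2] + [-0.01] * 19
threshold = adaptive_threshold(scores)

rows = [
    {"src_ip": "203.0.113.42", "score": -0.25, "model": "Complet"},
    {"src_ip": "203.0.113.42", "score": -0.31, "model": "Applicatif"},
    {"src_ip": "198.51.100.7", "score": -0.22, "model": "Complet"},
]
unique = dedupe_lowest(rows)
print(threshold, unique)
```

Taking the minimum means the fixed `ANOMALY_THRESHOLD` acts as a ceiling: on a quiet day where the 5th percentile sits close to zero, the threshold does not drift upward and flag ordinary traffic.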
Concept Drift Detection (A1)
Uses the Kolmogorov-Smirnov test to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds DRIFT_THRESHOLD (default: 0.30), the model is retrained.
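The drift check can be sketched as follows with the standard library only. This document does not state how a single feature is declared "drifted", so the sketch assumes a two-sample KS statistic compared against the large-sample critical value at α = 0.05 (coefficient 1.358). A real implementation would more likely call `scipy.stats.ks_2samp` and compare its p-value. All function names here are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def feature_drifted(train, current, c_alpha=1.358):
    """Assumed criterion: KS statistic above the alpha=0.05 large-sample critical value."""
    n, m = len(train), len(current)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, current) > critical


def needs_retrain(train_cols, current_cols, drift_threshold=0.30):
    """Retrain when the fraction of drifted features exceeds DRIFT_THRESHOLD."""
    drifted = sum(feature_drifted(train_cols[f], current_cols[f]) for f in train_cols)
    return drifted / len(train_cols) > drift_threshold


train = {"hit_velocity": list(range(100)), "post_ratio": list(range(100))}
current = {"hit_velocity": [v + 200 for v in range(100)], "post_ratio": list(range(100))}
print(needs_retrain(train, current))
```

In this toy input one of the two features has shifted completely, so the drifted fraction is 0.5 and exceeds the 0.30 default.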
SHAP Explainability (A4)
When enabled (ENABLE_SHAP=true), computes SHAP values for each detected anomaly using shap.TreeExplainer. The top 5 contributing features are stored in the reason field.
DBSCAN Clustering (A8)
When enabled (ENABLE_CLUSTERING=true), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a campaign_id (-1 = no cluster).
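As an illustration of how campaign IDs could come out of DBSCAN, the sketch below clusters a handful of made-up anomaly vectors with scikit-learn. The data, the `eps` value, and the standardization step are assumptions; only `min_samples=3` (the `CLUSTERING_MIN_SAMPLES` default) and the `-1` label for unclustered points come from this document.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical anomaly feature vectors (hit_velocity, fuzzing_index).
anomalies = np.array([
    [40.0, 0.80], [41.0, 0.82], [39.5, 0.79],  # three similar sources
    [5.0, 0.10], [5.2, 0.11], [4.9, 0.09],     # a second group
    [90.0, 0.50],                              # isolated anomaly
])

# Standardize so that no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(anomalies)

# fit_predict returns one label per row; -1 marks points outside any cluster.
campaign_ids = DBSCAN(eps=0.5, min_samples=3).fit_predict(scaled)
print(campaign_ids)
```

Anomalies that share a non-negative label behave alike and can be reviewed as one campaign, while `-1` rows are treated individually.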
Anubis Bot-Rule Enrichment
The view_ai_features_1h view enriches each IP with Anubis bot detection using a priority cascade:
- UA + IP combined (same `rule_id`), highest confidence
- UA only (no IP requirement)
- IP only (no UA requirement)
- ASN match
- Country match
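The cascade above lives in the ClickHouse view, but its logic can be mirrored in a few lines of Python. Everything in this sketch is hypothetical: the rule records, their field names (`ua`, `ip`, `asn`, `country`), the substring match on the User-Agent, and the `match_rule` helper are illustrative and do not reflect the real Anubis rule schema.

```python
# Hypothetical rule records; field names are illustrative, not the real schema.
RULES = [
    {"rule_id": "r1", "bot_name": "ExampleBot", "ua": "ExampleBot", "ip": "198.51.100.10"},
    {"rule_id": "r2", "bot_name": "GenericCrawler", "ua": "crawler"},
    {"rule_id": "r3", "bot_name": "DatacenterNet", "asn": 64500},
    {"rule_id": "r4", "bot_name": "GeoRule", "country": "ZZ"},
]


def match_rule(session, rules):
    """Return (bot_name, match_type) from the highest-priority matching level, or None."""

    def ua_hit(rule):
        return "ua" in rule and rule["ua"] in session["user_agent"]

    def ip_hit(rule):
        return "ip" in rule and rule["ip"] == session["src_ip"]

    # Levels are tried in priority order; the first level with a hit wins.
    cascade = [
        ("ua+ip", lambda r: ua_hit(r) and ip_hit(r)),
        ("ua", lambda r: ua_hit(r) and "ip" not in r),
        ("ip", lambda r: ip_hit(r) and "ua" not in r),
        ("asn", lambda r: "asn" in r and r["asn"] == session.get("asn")),
        ("country", lambda r: "country" in r and r["country"] == session.get("country")),
    ]
    for match_type, matches in cascade:
        for rule in rules:
            if matches(rule):
                return rule["bot_name"], match_type
    return None


session = {"user_agent": "ExampleBot/1.0", "src_ip": "198.51.100.10", "asn": 64500}
print(match_rule(session, RULES))
```

The example session also matches the ASN rule, but the combined UA + IP level is evaluated first, so the more specific match is the one reported.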
Environment Variables
| Variable | Type | Default | Description |
|---|---|---|---|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |
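A typed settings object is one way to read a subset of these variables with their documented defaults. The `Settings` class, the `_env_bool` helper, and the accepted boolean spellings (`1`, `true`, `yes`) are assumptions for this sketch and are not taken from the service's code.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")


@dataclass(frozen=True)
class Settings:
    clickhouse_host: str
    clickhouse_port: int
    anomaly_threshold: float
    cycle_interval_sec: int
    enable_shap: bool

    @classmethod
    def from_env(cls) -> "Settings":
        env = os.environ
        return cls(
            clickhouse_host=env.get("CLICKHOUSE_HOST", "clickhouse"),
            clickhouse_port=int(env.get("CLICKHOUSE_PORT", "8123")),
            anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.03")),
            cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
            enable_shap=_env_bool("ENABLE_SHAP", True),
        )


settings = Settings.from_env()
print(settings)
```

Converting to `int` and `float` at load time means a malformed value fails at startup instead of in the middle of a detection cycle.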
Output Tables
ml_detected_anomalies
Anomaly detections above the threat threshold. Engine: ReplacingMergeTree(detected_at), ORDER BY (src_ip), TTL 30 days.
Key columns: detected_at, src_ip, ja4, host, bot_name, anomaly_score, raw_anomaly_score, threat_level, model_name, recurrence, campaign_id, reason, anubis_bot_name, anubis_bot_action, anubis_bot_category, plus all ML features.
ml_all_scores
All classifications (no threshold filter) for observability. Engine: ReplacingMergeTree(detected_at), ORDER BY (window_start, src_ip, ja4, host, model_name), TTL 3 days.
Decision Log Format
The decisions.jsonl file contains structured JSONL entries:
```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```
Log rotation: 50 MB max size × LOG_BACKUP_COUNT backups (default 7).
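Since each line is an independent JSON object, the log can be consumed with the standard `json` module. The sketch below is a hypothetical reader that tallies events by type and anomalies by threat level; the `summarize` name is an assumption, and the sample lines are taken from the entries shown in this section.

```python
import json
from collections import Counter

SAMPLE = """\
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
"""


def summarize(lines):
    """Count events by type and ANOMALY entries by threat level."""
    events, levels = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        entry = json.loads(line)
        events[entry["event"]] += 1
        if entry["event"] == "ANOMALY":
            levels[entry["threat_level"]] += 1
    return events, levels


events, levels = summarize(SAMPLE.splitlines())
print(dict(events), dict(levels))
```

To read the real file, pass an open file handle instead of `SAMPLE.splitlines()`; iterating a file yields one entry per line.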
Health Check Endpoint
- URL: `GET http://localhost:8080/`
- Response: `200 OK` with status JSON
- Runs in a separate thread
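A health endpoint of this kind can be built with the standard library alone. The sketch below is an assumption about the shape of such a server, not the service's implementation: the `{"status": "ok"}` payload, the handler and function names, and the bind address are all hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()  # payload shape is an assumption
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep health probes out of stderr


def start_health_server(port: int = 8080, host: str = "127.0.0.1") -> HTTPServer:
    """Serve health checks from a daemon thread so the detection loop is not blocked."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_health_server()` once at startup is enough; the daemon thread exits with the process, and `server.shutdown()` stops it explicitly. Inside a container the bind address would typically need to be `0.0.0.0` for the probe to reach it.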
Model Persistence
| File | Description |
|---|---|
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_<name>.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |
Models are rotated: only the last MODEL_HISTORY_COUNT versions (default 10) are kept.
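The rotation rule can be sketched as a small pruning helper. This assumes that version strings sort chronologically (for example timestamps), which this document does not confirm; the `prune_models` name and the lowercase model name in the usage note are also hypothetical.

```python
from pathlib import Path


def prune_models(model_dir: Path, name: str, keep: int = 10) -> list:
    """Delete all but the newest `keep` versions of model `name`; return removed versions."""
    prefix = f"model_{name}_"
    # Assumption: version strings sort oldest-to-newest lexicographically.
    versions = sorted(
        p.name[len(prefix):-len(".joblib")]
        for p in model_dir.glob(f"{prefix}*.joblib")
    )
    stale = versions[:-keep] if keep > 0 else versions
    for version in stale:
        # Remove the serialized model together with its metadata file.
        for suffix in (".joblib", ".meta.json"):
            (model_dir / f"{prefix}{version}{suffix}").unlink(missing_ok=True)
    return stale
```

A call such as `prune_models(Path("/var/lib/bot_detector"), "complet")` after each successful training run would keep the directory bounded at `MODEL_HISTORY_COUNT` versions per model.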
Docker Deployment
```shell
# Build
make build-bot-detector
# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```
Volumes
| Host Path | Container Path | Description |
|---|---|---|
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |