Replace single-service-per-endpoint with all-ips mode running nginx, apache,
and hitch+varnish simultaneously on 3 dedicated IPs per VM (eth1 alias IPs).
Add a dedicated traffic VM with curl-impersonate for realistic TLS fingerprints,
parallelized traffic generation, and paired SNI_HOSTS/TARGET_IPS lists for
per-VM per-service hostname identification (e.g. rocky9-nginx-platform.test).
Key changes:
- run-tests-vm.sh: add setup_all_ips(), IP-specific Listen/bind directives
  with reset-before-apply pattern, graceful service availability checks
- run-e2e-test.sh: traffic VM architecture, all-ips mode, eth1 network,
  paired IP/SNI lists, updated cleanup for alias IPs
- generate-traffic.sh: parallel background jobs, curl-impersonate detection,
  auto source interface detection via ip route get, Host header in HTTP traffic
- Vagrantfile: add traffic VM with provision-traffic.sh
- provision-traffic.sh: install curl-impersonate and httpx for traffic gen
- test-rpm.sh: multi-interface TC check, updated ja4ebpf config
- clickhouse-init.sh: load CSV stubs for Anubis/bot-networks dictionaries
- Remove obsolete correlator/sentinel/mod-reqin-log docs
- Add h2_settings_ack column to http_logs schema
- Upgrade Go toolchain to 1.25.0
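The SNI_HOSTS/TARGET_IPS pairing described above can be sketched in Python. This is only an illustration: the real scripts are shell, and build_pairs(), the service labels, the IP addresses, and the <vm>-<service>-platform.test pattern (generalized from the single hostname example above) are all assumptions.

```python
def build_pairs(vms, services, ips_by_vm):
    """Return parallel (sni_hosts, target_ips) lists, one entry per VM/service."""
    sni_hosts, target_ips = [], []
    for vm in vms:
        ips = ips_by_vm[vm]
        # Each service needs its own dedicated IP, so the counts must match.
        if len(ips) != len(services):
            raise ValueError(f"{vm}: expected {len(services)} IPs, got {len(ips)}")
        for service, ip in zip(services, ips):
            sni_hosts.append(f"{vm}-{service}-platform.test")
            target_ips.append(ip)
    return sni_hosts, target_ips


# Hypothetical inputs: service labels and documentation-range (RFC 5737) IPs.
SERVICES = ["nginx", "apache", "hitch"]
IPS_BY_VM = {"rocky9": ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}

SNI_HOSTS, TARGET_IPS = build_pairs(["rocky9"], SERVICES, IPS_BY_VM)
for host, ip in zip(SNI_HOSTS, TARGET_IPS):
    print(f"{host} -> {ip}")
```

Keeping the two lists index-aligned means entry i of SNI_HOSTS always names the VM and service listening on entry i of TARGET_IPS.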
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add .gitignore rules for generated CSV data, eBPF compiled objects,
and vmlinux.h header. Remove 19 tracked files (~175 MB) that can be
regenerated from scripts (generate_*.py), bpftool, or bpf2go.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use two separate //go:generate directives (Ja4Tc for tc_capture.c, Ja4Ssl
  for uprobe_ssl.c) to avoid duplicate LICENSE symbol and multi-file clang issue
- Update loader.go to hold tcObjs/sslObjs separately with correct field names:
  UprobeSslSetFd, UprobeSslReadEntry, UretprobeSslReadExit,
  KprobeAccept4Entry, KretprobeAccept4Exit
- Add systemd-rpm-macros to all three RPM build stages (el8/el9/el10)
  so that %{_unitdir} macro resolves correctly
- RPMs now build successfully for el8, el9, el10
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add dict_browser_h2 to /reflists (read-only via dict_browser_h2)
- New API endpoints:
  GET /api/browser-signatures/entries: list browser_h2_signatures
  (falls back to the CSV dict if migration 06 is not applied)
  POST /api/browser-signatures/entries: add a fingerprint and reload the dict
  DELETE /api/browser-signatures/entries: delete a fingerprint and reload the dict
- /browsers page: 2 new sections
  'Base de signatures H2' (H2 signature database): table of the 10 fingerprints,
  add form, automatic read-only mode if migration 06 is not applied
  'Règles de scoring browser_matcher.py' (browser_matcher.py scoring rules):
  static table of the 7 dimensions (weights, per-family values, bypass thresholds)
- Integration: browser_h2.csv is copied into user_files at ClickHouse startup
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.
Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv
Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments
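The "guaranteed NOT in browser_ja4.csv" property for bot JA4s can be obtained by rejection sampling against the loaded browser set. The sketch below is an assumption about how that might look, not the seeder's actual code: synthetic_ja4() is a hypothetical name, the known set holds a placeholder value instead of real CSV rows, and the a_b_c layout (10-character prefix plus two 12-hex-digit parts) is the general JA4 shape as assumed here.

```python
import random

HEX_DIGITS = "0123456789abcdef"


def synthetic_ja4(rng, known):
    """Draw JA4-shaped strings until one is absent from the known browser set."""
    while True:
        # Prefix: protocol, TLS version, SNI flag, cipher count, extension count, ALPN.
        prefix = (
            "t13d"
            + f"{rng.randint(1, 99):02d}"
            + f"{rng.randint(1, 99):02d}"
            + rng.choice(["h2", "h1", "00"])
        )
        part_b = "".join(rng.choice(HEX_DIGITS) for _ in range(12))
        part_c = "".join(rng.choice(HEX_DIGITS) for _ in range(12))
        candidate = f"{prefix}_{part_b}_{part_c}"
        if candidate not in known:  # reject any collision with a real browser JA4
            return candidate


# Placeholder entry; the real seeder would load this set from browser_ja4.csv.
known_browser_ja4 = {"t13d1516h2_aaaaaaaaaaaa_bbbbbbbbbbbb"}

rng = random.Random(42)  # a fixed seed (cf. --seed) makes runs reproducible
bot_ja4 = synthetic_ja4(rng, known_browser_ja4)
print(bot_ja4)
```

Because the membership check runs on every draw, the guarantee holds by construction rather than by the low probability of a random collision.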
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3 SQL files were missing from the docker-compose.yml volume mounts:
- 10_perf_indexes.sql (performance indexes)
- 11_views.sql (dashboard views)
- 12_thesis_features.sql (thesis §5 MVs and views)
Also make 10_perf_indexes.sql non-fatal in the init script, since ALTER TABLE
ADD INDEX fails if the index already exists.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- EIF: Extended Isolation Forest via isotree (fallback to sklearn IF)
- Benford's Law deviation feature on inter-request timing
- Lag-1 autocorrelation feature for cadence analysis
- Validation gate: reject model if val_anomaly_rate > 20%
- Feature pruning: remove features with variance < 1e-6 before training
- Quantile drift: replace synthetic N(μ,σ) samples with quantile interpolation
- Thread safety: Lock for _service_healthy/_consecutive_failures
- Score normalization: inverted to [0,1] where 1=most anomalous
SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h
Tests: 10 new test functions covering all improvements
Integration: verify_mvs.py checks new thesis feature columns
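For reference, the two new timing features listed above can be computed as in the following sketch. It is a stdlib-only illustration and not the code in this commit: the function names are hypothetical, and the commit does not state which deviation metric is used for Benford's Law, so mean absolute deviation over the nine leading digits is an assumption.

```python
import math


def lag1_autocorrelation(xs):
    """Correlation between each inter-request gap and the next one."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # constant series: no variance to correlate
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return num / denom


def leading_digit(x):
    """First significant digit of a positive float, read from its repr."""
    for ch in repr(x):
        if ch in "123456789":
            return int(ch)
    raise ValueError("x must be positive and finite")


def benford_deviation(xs):
    """Mean absolute gap between observed and Benford leading-digit frequencies."""
    digits = [leading_digit(x) for x in xs if x > 0]
    if not digits:
        return 0.0
    total = len(digits)
    deviation = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)  # Benford probability of digit d
        observed = digits.count(d) / total
        deviation += abs(observed - expected)
    return deviation / 9


gaps = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # a perfectly regular client, in seconds
print(lag1_autocorrelation(gaps), benford_deviation(gaps))
```

A fixed-cadence client like the one above concentrates every gap on a single leading digit, which pushes the Benford deviation up even though its autocorrelation is undefined and reported as 0.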
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
  ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)
New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis
All DB references use ja4_processing/ja4_logs (no mabase_prod).
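The ORDER BY fix for ml_detected_anomalies can be illustrated with a small Python simulation of how ReplacingMergeTree deduplicates on the sorting key during a merge. This is a simplified model, not ClickHouse code: replacing_merge() is a hypothetical helper, the rows are made-up values, and partitioning and the optional version column are ignored.

```python
def replacing_merge(rows, order_by):
    """Keep the last-inserted row per sorting-key tuple, as a merge would."""
    survivors = {}
    for row in rows:
        key = tuple(row[col] for col in order_by)
        survivors[key] = row  # a later row replaces an earlier one with the same key
    return list(survivors.values())


# Made-up detections for one source IP (documentation-range address).
rows = [
    {"src_ip": "198.51.100.7", "ja4": "ja4_a", "host": "a.test", "model_name": "eif"},
    {"src_ip": "198.51.100.7", "ja4": "ja4_b", "host": "a.test", "model_name": "eif"},
    {"src_ip": "198.51.100.7", "ja4": "ja4_a", "host": "b.test", "model_name": "eif"},
]

old_key = replacing_merge(rows, ["src_ip"])
new_key = replacing_merge(rows, ["src_ip", "ja4", "host", "model_name"])
print(len(old_key), len(new_key))
```

With the old single-column key all three detections share one key and only the last survives; with the four-column key each distinct (src_ip, ja4, host, model_name) combination keeps its own row.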
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>