ja4-platform

Author	SHA1	Message	Date
toto	9ea36ad22e	feat(scripts): complete stack init + prod data import with date shift Schema cleanup: - Remove anubis_ua_rules table stub from 03_anubis_tables.sql - Remove anubis_ua_rules from bot-detector deploy_schema.sql - Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency) - Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules New scripts: - scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations + validation + cleanup of obsolete tables). Supports --reset, --import-prod. - scripts/import-prod-data.sh: imports pre-exported prod data (Native format) with dynamic date shift (max(time) → now). Supports --shift, --no-truncate. - scripts/data/prod-export/: directory for cached Native format exports Makefile targets: init-stack, import-prod-data, init-and-import Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts import-prod-data.sh: 3M rows in ~37s with auto date shift Dashboard: 55 routes OK, bot-detector: 36/36 tests pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 21:40:05 +02:00
toto	8180f4af04	refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules - Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any) - Remove Country rule collection from parse_bot_policies_inline - Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules) - Remove insert_ua_rules and insert_country_rules functions - reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn - Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql, view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level UA+IP > UA > IP > ASN > Country cascade) - Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql (UA table kept as stub for REGEXP_TREE catch-all compatibility) - Remove anubis_country_rules table from schema - Remove Anubis UA and Country tabs from dashboard reflists page - Remove anubis_ua_rules/country_rules from API reflist queries - deploy_schema.sql simplified from 339 to 122 lines - 764 lines removed across 9 files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 15:25:33 +02:00
toto	039086a0b3	feat: nouvelles techniques de détection et page tactiques SOC SQL: - Ajout 5 colonnes d'agrégation (count_xff, count_unusual_ct, count_non_std_port, count_login_post, sec_ch_mobile_mismatch) - Exposition de 5 features calculées dans view_ai_features_1h - Migration ALTER TABLE pour déploiements existants Bot-detector: - 7 nouvelles features ML (has_xff, unusual_content_type_ratio, non_standard_port_ratio, login_post_concentration, sec_ch_mobile_mismatch, true_window_size, window_mss_ratio) - Propagation campaign_id vers ml_all_scores (était toujours -1) - Escalade campagne : HIGH→CRITICAL si cluster ≥5 membres Dashboard: - Page Tactiques SOC : brute-force, rotation JA4, récurrence, alertes temps réel — 4 KPIs + 4 panneaux + infobulles doc - Ajout fmtDate() helper global - Navigation sidebar mise à jour Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 14:29:18 +02:00
toto	db306fb9da	fix: P0 audit bugs — bot-detector + dashboard + SQL Bot-detector: - B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies - B1.4/B1.5: log_decision argument order fixed (cycle_id, name) - B1.7: AE broadcast error — model now returns features list, scoring uses model's features instead of current cycle's (prevents dim mismatch) - B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name Dashboard: - C1.1: XSS in ip_detail.html — {{ ip \| tojson }} instead of raw string - C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized - C2.1: status filter now correctly filters http_version column - C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days SQL: - B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomal) - B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 23:33:00 +02:00
toto	98289ccf04	fix: ASN dictionary pipeline + verbose bot-detector logging - Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols) - Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format) - Replace org/domain dictGet calls with empty string literals in MV - Full 714K CIDR stub for complete ASN resolution in tests - Add header generation to generate_asn_data.py - Verbose bot-detector stdout: data summary, triage breakdown, model training details, scoring stats, browser classification, boxed results - Fix IPv6 filter in traffic seeder (_ips_from_cidrs) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 17:43:55 +02:00
toto	9a48fb9d29	feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency Add browser legitimacy classification (A9) to the bot detection pipeline: - New features: is_known_browser (binary) and browser_consistency_score [0..5] combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language, cookies, Sec-Fetch-* presence - Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW threat level are reclassified as LEGITIMATE_BROWSER - Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays in normal anomaly scoring — prevents evasion via JA4 spoofing - XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label) - ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores - Dashboard: browser_family filter/sort on detections and scores endpoints, legitimate_browsers count and browser_stats in overview - 6 new unit tests covering classification threshold, spoofing, exclusion logic Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:46:22 +02:00
toto	7d09c614c3	feat: browser JA4 detection, Anubis bot rules, worldwide ASN data - Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...) - Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck, OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries) - Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs) - Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views - Add /api/browsers dashboard endpoint - Fix CSV quoting for fields containing commas (User-Agent strings) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:27:37 +02:00
toto	b735bab5a5	feat(dashboard): rebuild SOC dashboard + fix ClickHouse SQL Complete rewrite of the SOC dashboard using FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS. Replaces the old React/Vite frontend with server-rendered templates. Dashboard pages: - Overview: KPIs, timeline chart, threat distribution, top IPs - Detections: paginated/filterable anomaly table - Scores: ml_all_scores with AE error & XGB prob columns - Traffic: HTTP logs with method/host filters - IP Investigation: full deep-dive (scores, features, HTTP logs, classify) - Classification: SOC feedback form + history - Features: AI + thesis feature stats - Models: scoring stats + model metadata API: 9 JSON endpoints with parameterized queries, sort whitelists SQL fixes: - 05_aggregation_tables: add deduplicate_merge_projection_mode - 11_views: fix nested aggregate (argMax inside sum) - 12_thesis_features: remove invalid 'let' bindings, fix groupArrayIf type Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 03:21:05 +02:00
toto	8d58f2b932	feat(bot-detector): add XGBoost supervised third voice (#10 ) Triple-voice ensemble architecture: - EIF (non-supervisé, anomalies zero-day) - Autoencoder (non-supervisé, corrélations non-linéaires) - XGBoost (supervisé, patterns connus + feedback SOC) XGBoost implementation: - Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1) - Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required - Score = predict_proba, combined via meta-learner: (1-β)(EIF+AE) + βxgb_prob - Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS - Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only - ClickHouse: xgb_prob column added to ml_all_scores - Tests: 4 new tests (availability, train/predict, meta-learner, save/load) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:45:57 +02:00
toto	57cf6c3828	feat(bot-detector): add parallel Autoencoder scorer (#9 ) - TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU - Trained alongside EIF on human_baseline, saved/loaded with model versioning - Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30) - AE latent space (16-dim) used for HDBSCAN clustering instead of raw features - Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE - Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues - ClickHouse: ae_recon_error column added to ml_all_scores - Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:40:39 +02:00
toto	f6e2d3c0ca	feat(bot-detector): implement 8 state-of-art improvements - EIF: Extended Isolation Forest via isotree (fallback to sklearn IF) - Benford's Law deviation feature on inter-request timing - Lag-1 autocorrelation feature for cadence analysis - Validation gate: reject model if val_anomaly_rate > 20% - Feature pruning: remove variance < 1e-6 features before training - Quantile drift: replace N(μ,σ) synthetic with quantile interpolation - Thread safety: Lock for _service_healthy/_consecutive_failures - Score normalization: inverted to [0,1] where 1=most anomalous SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h Tests: 10 new test functions covering all improvements Integration: verify_mvs.py checks new thesis feature columns Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:31:26 +02:00
toto	6d02f21c1e	feat: implement thesis §5 advanced detection techniques as ClickHouse MVs New aggregation tables + materialized views: - agg_path_sequences_1h + MV (§5.1 Path Sequence Entropy) - agg_request_timing_1h + MV (§5.3 Request Cadence Fingerprint) - agg_ip_behavior_1h + MV (§5.5 JA4 Drift + §5.8 Cross-Domain) - agg_resource_cascade_1h + MV (§5.4 Resource Dependency Tree) New analytical views: - view_thesis_features_1h: unified view exposing all computable features (path_transition_entropy, cadence_cv, burst_ratio, pause_ratio, ja4_drift_ratio, host_diversity, host_sweep_speed, host_coverage_uniformity) - view_resource_cascade_1h: root_to_first_asset_delay, asset_load_stddev Documented future techniques (not feasible as MV): - §5.2 Bipartite Fleet Graph (needs Python networkx) - §5.6 DNS Shadow Analysis (needs sentinel UDP/53 extension) - §5.7 Compression Ratio Invariant (needs mod_reqin_log extension) Updated: deploy_schema.sh, verify_mvs.py (sections 8-10) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 01:42:52 +02:00
toto	ecceb04174	perf(clickhouse): P3 — view_ip_recurrence avec filtre TTL + supprimer FINAL view_ip_recurrence : Ajout de WHERE detected_at >= now() - INTERVAL 30 DAY → Avec PARTITION BY (P1), ClickHouse élagage les partitions hors de cette plage avant même de lire les données. La vue ne scanne que les partitions actives (au lieu des 30 partitions journalières complètes). → ORDER BY (src_ip) garantit que le GROUP BY src_ip lit des données contiguës (aucune réorganisation mémoire). rotation.py — supprimer FINAL sur ml_detected_anomalies : FINAL force une déduplication complète du ReplacingMergeTree en mémoire (équivalent à un DISTINCT sur toute la table) — une des opérations les plus coûteuses dans ClickHouse. Fix : remplacer le sous-SELECT FINAL par view_ip_recurrence (déjà aggrégée par src_ip, retourne recurrence directement sans FINAL). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:33:29 +02:00
toto	14323f7b05	perf(clickhouse): P10 — créer les 4 vues métier manquantes + corriger préfixes DB Bug de production : view_form_bruteforce_detected, view_host_ip_ja4_rotation, view_dashboard_entities, view_dashboard_user_agents étaient référencées dans 13 endpoints du dashboard mais n'existaient nulle part dans le schéma. Tous ces endpoints retournaient HTTP 500 en production. shared/clickhouse/11_views.sql (nouveau) : view_form_bruteforce_detected Source : agg_host_ip_ja4_1h (24h) Logique : GROUP BY (src_ip, host) HAVING count_post >= 10 Usage : bruteforce.py (3 endpoints), investigation_summary.py view_host_ip_ja4_rotation Source : agg_host_ip_ja4_1h (24h) Logique : uniqExact(ja4) par src_ip, HAVING >= 2 (rotation de fingerprint) Usage : rotation.py (3 endpoints), investigation_summary.py view_dashboard_entities Source : http_logs (7 jours), UNION ALL 5 branches (ip/ja4/country/asn/host) Colonnes : entity_type, entity_value, src_ip, ja4, host, log_date, client_headers Array(String), asns Array, countries Array, user_agents Array Usage : entities.py (5 endpoints), clustering.py view_dashboard_user_agents Source : http_logs (7 jours), GROUP BY (src_ip, ja4, hour) Colonnes : src_ip, ja4, hour, log_date, user_agents Array(String), requests Usage : variability.py (4 endpoints), fingerprints.py (5 endpoints) attributes.py (2 endpoints) deploy_schema.sh : ajout de 10_perf_indexes.sql et 11_views.sql dans la liste routes/variability.py + fingerprints.py : Correction de 9 requêtes utilisant view_dashboard_user_agents sans préfixe de base de données → remplacé par {settings.CLICKHOUSE_DB_PROCESSING}.view_* Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:30:09 +02:00
toto	f4ffe3410a	perf(clickhouse): P1 — partition + skipping indexes sur ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h Problème : toutes les requêtes du dashboard WHERE detected_at >= now() - INTERVAL N faisaient un full scan car ml_detected_anomalies avait ORDER BY (src_ip) sans partition ni index temporel. Changements : - 06_ml_tables.sql : * ml_detected_anomalies : PARTITION BY toYYYYMMDD(detected_at) → élagage de partitions journalières sur toutes les requêtes temporelles * INDEX idx_detected_at (minmax) → skip des granules hors plage * INDEX idx_threat_level set(8) → skip pour countIf(threat_level = ...) * INDEX idx_bot_name bloom_filter → skip pour bot_name != '' * ttl_only_drop_parts = 1 → TTL par suppression de partition entière * ml_all_scores : même traitement (PARTITION BY + 2 indexes) - 04_mv_http_logs.sql : * http_logs : INDEX idx_src_ip bloom_filter(0.01) → les requêtes WHERE src_ip = X (analysis.py, variability.py) sautent ~90% des granules sans scanner toute la plage temporelle * INDEX idx_ja4 bloom_filter(0.01) → idem pour filtres JA4 - 05_aggregation_tables.sql : * agg_host_ip_ja4_1h : PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...) → investigation_summary.py et rotation.py (WHERE src_ip = X) utilisent automatiquement la projection au lieu de scanner tous les window_start - 10_perf_indexes.sql (nouveau) : * Migration ALTER TABLE pour instances existantes * ADD INDEX + MATERIALIZE INDEX pour les 4 tables * ADD PROJECTION + MATERIALIZE PROJECTION pour agg_host_ip_ja4_1h * Note : PARTITION BY sur table existante nécessite recréation (documenté) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:28:04 +02:00
toto	3dfeba860b	docs: add standardized comments to all services (Python, Go, Bash) - Add docs/commenting-standard.md defining per-language comment standards (Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners) - services/dashboard: 100% docstring coverage (100/100 functions) - All FastAPI route handlers, helpers, classes, and models documented - Language: French (project convention) - services/bot-detector: 100% docstring coverage (53/53 symbols) - bot_detector.py: 14 functions + module docstring - anubis/fetch_rules.py: 9 functions - shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods) and ClickHouseSettings class - services/correlator: 24 godoc comments added across 6 Go files - correlation_service.go: 10 private helpers - unixsocket/source.go: 6 parsing/socket helpers - correlated_log.go: 4 field extraction helpers - orchestrator.go, logger.go, main.go: 4 comments - services/correlator/scripts/audit-architecture.sh: standardized header block Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 21:32:29 +02:00
toto	d4e7e674d8	feat: full-stack Docker Compose integration tests - 4-container stack: ClickHouse, platform (Rocky 9), bot-detector, dashboard - Platform builds sentinel on Rocky (CGO+libpcap native), correlator static - mod-reqin-log compiled with apxs on Rocky (matching RPM build target) - ClickHouse init script patches credentials for test env (sed-based) - 8-phase test runner: schema, traffic gen, pipeline, dashboard API, bot-detector, sentinel - All 13 checks pass, 3 non-blocking warnings (empty dicts, log paths) SQL schema fixes discovered during integration: - 02_dictionaries: IPv6CIDR → String (not a valid ClickHouse type) - 03_anubis_tables: dict_anubis_ua missing has_ip/rule_id/category attrs - 03_anubis_tables: dict_anubis_country FLAT() → COMPLEX_KEY_HASHED() (String key) - 09_audit_table: CODEC before DEFAULT → DEFAULT before CODEC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 20:33:25 +02:00
toto	9f3e0621e5	feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing) Architecture: - ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs) - ja4_processing: analytics, aggregation, ML, dictionaries, audit Configuration (env vars): - CLICKHOUSE_DB_LOGS (default: ja4_logs) - CLICKHOUSE_DB_PROCESSING (default: ja4_processing) Changes: - SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing with correct cross-database references (MVs, views, dicts) - deploy_schema.sh: substitutes DB names from env vars at deploy time - Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING - Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING - Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable - Correlator: DSN example updated to ja4_logs - Docker-compose + .env files: new env vars with defaults - All documentation updated (14 markdown files) All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 19:10:35 +02:00
toto	dba2676fa7	fix: correct singleton test for ja4_common ClickHouseClient get_client() returns a lazy ClickHouseClient wrapper — clickhouse_connect.get_client is only called on .connect(), not at construction time. Remove incorrect assertion that expected call_count==1 at module-level get_client() call. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 16:45:14 +02:00
toto	d469e39da7	feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized Services: - ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap) - logcorrelator: JA4 log correlation engine (Go, ClickHouse) - mod_reqin_log: Apache module (C, JSON request logging) - bot_detector: ML bot detection pipeline (Python) - dashboard: FastAPI/Streamlit analytics UI (Python) Shared libraries: - shared/go/ja4common: logger, config, shutdown, ipfilter (Go module) - shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package) - shared/clickhouse/: canonical SQL migrations (10 files) Build & packaging: - Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10) - go.work workspace linking sentinel, correlator, ja4common - Makefile with test-all, build-all, rpm-* targets Fixes applied: - go.work: 1.21 → 1.24.6 (required by sentinel) - correlator Dockerfiles: golang:1.21 → golang:1.24 - replace directives in go.mod for ja4common local path - pyproject.toml: setuptools.backends → setuptools.build_meta - Removed static libpcap linking (unavailable on Rocky 9) - Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32) - Rewrote corrupted test files (logger_test.go × 2) Test coverage: - correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%) - sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse) Documentation: - README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 16:42:59 +02:00

20 Commits