ja4-platform

Author	SHA1	Message	Date
Jacquin Antoine	36b5065a0a	feat(e2e): add multi-IP endpoint architecture with dedicated traffic VM Replace single-service-per-endpoint with all-ips mode running nginx, apache, and hitch+varnish simultaneously on 3 dedicated IPs per VM (eth1 alias IPs). Add a dedicated traffic VM with curl-impersonate for realistic TLS fingerprints, parallelized traffic generation, and paired SNI_HOSTS/TARGET_IPS lists for per-VM per-service hostname identification (e.g. rocky9-nginx-platform.test). Key changes: - run-tests-vm.sh: add setup_all_ips(), IP-specific Listen/bind directives with reset-before-apply pattern, graceful service availability checks - run-e2e-test.sh: traffic VM architecture, all-ips mode, eth1 network, paired IP/SNI lists, updated cleanup for alias IPs - generate-traffic.sh: parallel background jobs, curl-impersonate detection, auto source interface detection via ip route get, Host header in HTTP traffic - Vagrantfile: add traffic VM with provision-traffic.sh - provision-traffic.sh: install curl-impersonate and httpx for traffic gen - test-rpm.sh: multi-interface TC check, updated ja4ebpf config - clickhouse-init.sh: load CSV stubs for Anubis/bot-networks dictionaries - Remove obsolete correlator/sentinel/mod-reqin-log docs - Add h2_settings_ack column to http_logs schema - Upgrade Go toolchain to 1.25.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 14:25:24 +02:00
Jacquin Antoine	f88b739992	feat(e2e): add distributed E2E test framework with parametric traffic generation Add run-e2e-test.sh with CLI parameters (--hits, --http-ratio, --dns, --tls, --src-ips, --keep-analysis, --up) for configurable traffic generation. Traffic runs from VM endpoints with multiple source IPs (alias IPs on eth0) to produce distinct sessions for the ML pipeline. Fix curl TLS flags (--tlsv1.2 instead of --tls-v1-2), skip redundant local verification in distributed mode, and fix dashboard is_available() cache that never retried after ClickHouse recovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 00:09:32 +02:00
Jacquin Antoine	c60ce97f23	feat(bot-detector): add dynamic browser profiling engine with HDBSCAN clustering Implement offline profile building (profile_builder.py) and real-time dynamic scoring (browser_matcher_dynamic.py) using HDBSCAN-based browser fingerprint clustering. Add ClickHouse materialized view (13_h2_profiling.sql) for h2_profile_stats aggregation. Update thesis and project documentation to cover the new dynamic profiling architecture. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 02:06:00 +02:00
Jacquin Antoine	d75825278e	feat: multi-distro VM tests, ja4ebpf eBPF improvements, bot-detector scoring ja4ebpf: - Refactor BPF TC capture with improved SYN offset handling and TCP option parsing - Enhance TLS uprobe SSL hooking for better key extraction - Add ClickHouse writer improvements for HTTP log materialized views - Update RPM spec for Rocky Linux 8/9/10, fix systemd service - Simplify loader with cleaner bpf2go integration bot-detector: - Add H2 SETTINGS per-parameter comparison in browser_matcher - Enhance browser signatures and scoring pipeline - Improve preprocessing and cycle detection infra: - Multi-distro Vagrantfile (centos8, rocky9, rocky10) with per-distro provisioning - New Makefile targets: vm-up-all, test-vm-matrix, test-vm-centos8/rocky10 - Add debug helpers and run-test-from-host.sh for host-driven VM testing - Update run-tests-vm.sh for cross-distro compatibility - Remove accidental binary blob (\004) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 01:09:33 +02:00
toto	dc6ffd6474	fix: integration test matrix (procps-ng, varnish h2, hitch ALPN, pgrep→ps) - Add procps-ng to the 4 runtime Dockerfiles (ps/pgrep available) - Replace pgrep with ps -C in all run-tests.sh - Fix nginx-varnish entrypoint: pgrep nginx → cat nginx.pid (exit 127) - Enable HTTP/2 in Varnish: add -p feature=+http2 to the nginx-varnish and hitch-varnish entrypoints - Restore ALPN h2,http/1.1 in hitch.conf (varnish now supports h2) - Fix hitch-varnish healthcheck: curl without --http1.1 (h2 working) - Fix phase_verify queries: http_logs_raw → http_logs, correct columns - Fix clickhouse.go writer: JSON names aligned with the MV (ip_meta_*, tls_sni…) - Fix toStartOfSecond(DateTime) → toStartOfSecond(toDateTime64(col, 3)) - Remove the el8/nginx-varnish SKIP (varnish installs fine on AlmaLinux 8) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-12 01:29:01 +02:00
toto	f704541f83	feat(h2): direct per-parameter SETTINGS comparison in browser_matcher - Rewrote _d1_h2_settings() with 3-signal weighted formula: direct_score×0.60 + dict_match×0.30 + ja4_coherence×0.10 when individual SETTINGS cols are available in the DataFrame - Added _H2_SETTINGS_COLS dict (IDs 1,2,3,4,5,6,8 → column names) - Fallback to dict_match×0.80 + ja4_coherence×0.20 for backward compat - Fix view_ai_features_1h: pass 7 individual SETTINGS columns through base_data CTE (h2_header_table_size, h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol) - Remove non-existent h2_dict_confidence reference from view SQL (dict_browser_h2 only exposes browser_family attribute) - Add 7 new pytest cases: exact match, one wrong setting, forbidden key penalty, unknown fingerprint with correct settings, fallback path, CDN proxy neutralisation, full Chrome simulation - 53/53 bot-detector tests pass - Update thesis §3.9.2: document direct comparison algorithm + fallback Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 03:05:36 +02:00
toto	85d3b95b7b	feat: HTTP/2 passive fingerprinting with individual SETTINGS fields Complete implementation of HTTP/2 passive fingerprinting per thesis §2.5.3: mod-reqin-log (C module): - Replace connection-level filter with ap_hook_process_connection (APR_HOOK_FIRST) to capture H2 preface before mod_http2 takes over the connection - AP_MODE_SPECULATIVE read of 512 bytes from c->input_filters - Parse SETTINGS, WINDOW_UPDATE, PRIORITY flags, pseudo-header order - Output individual SETTINGS params as separate JSON fields (IDs 1-6, 8) - Read H2 notes from c1 (master connection) for mod_http2 secondary conns - Fix header_order_signature JSON length bug (26→strlen) ClickHouse schema: - Add 8 new columns to http_logs: h2_has_priority, h2_header_table_size, h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol - Use Int32/Int64 with DEFAULT -1 to distinguish absent vs zero - Update mv_http_logs to extract individual fields via JSONHas/JSONExtractInt - Migration 04_http2_fields.sql updated for existing deployments Correlator: - Accept both timestamp_ns and timestamp field names (backward compat) Integration: - Enable HTTP/2 in Apache: Protocols h2 http/1.1 in httpd-integration.conf Validated end-to-end via Playwright: H2 curl traffic → mod-reqin-log → correlator → ClickHouse with all 12 H2 columns populated correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 02:33:45 +02:00
toto	e52cdcc01f	feat(bot-detector): Browser Signature Detection engine (parallel mode) Step A - browser_signatures.py Pure data: BROWSER_SIGNATURES (Chrome/Firefox/Safari), NON_BROWSER_SIGNATURES (curl/httpx/go), BROWSER_THRESHOLDS, DIMENSION_WEIGHTS. H2 values extracted from real captures (Akamai format with commas, not semicolons). Step B - browser_matcher.py Vectorized 7-dimension engine (H2 SETTINGS 0.30, WINDOW_UPDATE 0.15, pseudo-header order 0.15, H2 PRIORITY 0.10, HTTP headers 0.15, TLS 0.10, JA4 dict 0.05). run_browser_matcher(df) adds bm_family/bm_score/bm_decision. CDN edge case: H2 dimension neutralized (0.5) if has_xff=1. BROWSER_MATCHER_REPLACE=false by default (DUAL_MODE, logging only). Step C - 06_browser_signature_detection.sql (migration) Creates browser_h2_signatures (MergeTree table with 12 reference fingerprints). Recreates dict_browser_h2 from the table with a confidence field (replaces the CSV). Step D - 07_ai_features_view.sql +h2_wu_val in the http_logs JOIN, +h2_window_update_value, +h2_dict_family, +h2_dict_confidence, +h2_window_{chrome,firefox,safari,absent}, +h2_order_{chromesafari,firefox}, +h2_priority_present, +h2_pseudo_ord_raw, +tls_h2_family_mismatch (detects a mismatch between the JA4 family and the H2 family). Step E - preprocessing.py + pipeline.py preprocessing.py: calls run_browser_matcher() after compute_browser_axes(), adds 7 new binary H2 features to FEATURES and binary_features. pipeline.py: calls log_dual_mode_comparison() after the A9 classification. BROWSER_MATCHER_REPLACE=true enables replacement of the bypass. Step F - test_browser_matcher.py 8 tests: Chrome/Firefox/Safari full match, curl rejected, httpcloak partial, TLS↔H2 mismatch, CDN proxy neutralization, go net/http rejected. All 8 PASSED (+ 36 existing tests unchanged). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 13:52:57 +02:00
toto	9548b1782d	fix: correct ml_detected_anomalies ORDER BY in the base schema CH 24.8 rejects MODIFY ORDER BY on existing columns (error BAD_ARGUMENTS 36). Migration 01 therefore could not fix the ORDER BY after init. Fix: - 06_ml_tables.sql: ORDER BY (src_ip) → ORDER BY (src_ip, ja4, host, model_name) + TTL 30d → 7d (consistent with the documented architecture) - 01_ttl_adjustments.sql: removes the impossible MODIFY ORDER BY, keeps only the MODIFY TTL statements (valid for existing deployments) Result: make init-stack with no ⚠ or ✗ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:34:07 +02:00
toto	7a04e47041	fix(sql+api): fix view column mismatches and ClickHouse 24.8 JOIN issue - view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen - view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start - Replace uniqExact/groupUniqArray with count()/groupArray (no nested-agg error) - api.py campaigns/graph: move a.src_ip < b.src_ip from JOIN ON to WHERE (ClickHouse 24.8 forbids cross-table inequality in JOIN ON condition) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:05:04 +02:00
toto	040437921c	fix(init-stack): pre-drop mv_http_logs + http_logs before schema apply Ensure h2 columns are always included on fresh init. Also add migration loop for fleet_detections and ml_performance_metrics tables. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:00:04 +02:00
toto	b409a70970	fix(views): align SQL views with dashboard API expected columns - view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen - view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start - view_ip_recurrence: add worst_threat alias + top_ja4, top_host columns All three views were missing columns referenced by /api/brute-force, /api/ja4-rotation and /api/recurrence endpoints, causing 500 errors on the Tactiques page. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 00:59:57 +02:00
toto	2f2c5e03bb	fix(sql): work around ClickHouse 24.8 scope bug in view_ai_features_1h - Restructure 07_ai_features_view.sql: single anonymous inner subquery with explicit aliases on all columns (a.xxx AS xxx, h.xxx AS xxx, h2.xxx AS xxx) to resolve the PARTITION BY src_ip ambiguity in the outer SELECT - Remove the multiple CTEs (h2_agg, enriched) that triggered the bug - Fix migration 04_http2_fields.sql: DEFAULT must come before CODEC (ClickHouse syntax) - make init-stack: 0 errors across 13 SQL files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 00:48:05 +02:00
toto	a108814a56	feat: bot detection roadmap §2-9 (HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics) Step 2 - HTTP/2 fingerprinting in the ML pipeline: - Add the dict_browser_h2 dictionary (11 browser families) in 05_aggregation_tables.sql - Add the h2_agg CTE and 4 HTTP/2 features in 07_ai_features_view.sql: h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare - Compute fingerprint_coherence_score (5 weighted axes) in the view - Add the 6th axis axis_h2_coherence in browser.py (weights rebalanced) - browser_h2.csv: 11 Akamai fingerprints → browser family Step 3 - Coherence pre-filter on the human baseline: - pipeline.py excludes sessions with fingerprint_coherence_score < threshold from the training baseline - FINGERPRINT_COHERENCE_THRESHOLD configurable via env (default 0.25) - Log excluded sessions for SOC analysis Step 4 - Improved drift detection: - scoring.py: go from 5 to 9 quantiles (p5…p95) - Add KL divergence alongside the KS test - Adversarial drift detection (≥80% of features drift in the same direction) - Strict temporal split for validation Step 5 - JA4×ASN bipartite graph (§5.2): - fleet.py: fleet detection via NetworkX + Louvain (optional imports) - enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame - cycle.py: called after preprocess_df, logging the number of fleet sessions - SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (TTL 7d) - Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template Step 6 - Cross-domain Jaccard §5.8: - 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity - Signal: same paths (/admin, /wp-login) across several hosts = scanner Step 7 - ExIFFI + per-feature AE errors: - scoring.py: compute_exiffi_importance() by permutation, compute_ae_feature_errors() - pipeline.py: ExIFFI computed on X_test, index → dict mapping for anomalies - build_reason() enriched with exiffi_top when SHAP is inactive Step 8 - Meta-learner for ensemble weighting: - scoring.py: MetaLearner class (LogisticRegression, falls back to fixed weights below 1000 labels) - Collect labels from the current cycle (known_bots, legitimate, Anubis) - pipeline.py: replace fixed weights with MetaLearner.predict() Step 9 - Performance metrics and monitoring: - metrics.py: record_cycle_metrics() records anomaly rate, drift, correlation, latency - SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (TTL 90d) - Dashboard: /health + /api/health + health.html template - cycle.py: call record_cycle_metrics at the end of the cycle (Complet + Applicatif) Tests: 36/36 bot-detector tests pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 00:11:35 +02:00
toto	8ca4a1e849	feat(mod_reqin_log): passive HTTP/2 fingerprinting (Akamai format) Adds a connection input filter (AP_FTYPE_CONNECTION, APR_HOOK_LAST) that sits between mod_ssl and mod_http2 to read the HTTP/2 preface (RFC 9113 §3.4) non-destructively and extract: - h2_fingerprint: full Akamai fingerprint, e.g. '1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p' - h2_settings_fp: raw SETTINGS entries (e.g. '1:65536,4:6291456') - h2_window_update: WINDOW_UPDATE increment (e.g. '15663105') - h2_pseudo_order: pseudo-header order (e.g. 'm,a,s,p' Chrome, 'm,p,s,a' Firefox) Technique: speculative AP_MODE_SPECULATIVE read (non-destructive) of 512 bytes; the data remains available to mod_http2. The filter removes itself from the chain after the first invocation. Stored in c->notes (H2_NOTE_*), then emitted as JSON in log_request(). ClickHouse: 4 new columns in http_logs + JSONExtract in mv_http_logs. Migration for existing deployments: 04_http2_fields.sql. 14 unit tests (cmocka) cover Chrome/Firefox/HTTP1/truncation/HPACK. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 23:46:50 +02:00
toto	14db3d9040	refactor: remove User-Agent dependency from browser detection SQL changes: - modern_browser_score: sec-ch-ua→100, Sec-Fetch→70 (no more UA fallback) - Add has_sec_ch_ua (UInt8) to agg_header_fingerprint_1h and ml_all_scores - mss_mobile_mismatch uses has_sec_ch_ua instead of modern_browser_score - header_order_confidence: PARTITION BY ja4 instead of first_ua - sec_ch_mobile_mismatch: internal Client Hints comparison (no UA) - Migration 03_remove_ua_browser_detection.sql Python changes: - browser.py Axis 3: Client Hints + Sec-Fetch + is_fake_navigation (NO UA) - Axis weights: ja4_known 0.30, tls_coherence 0.20 (TLS signals strengthened) - preprocessing.py: has_sec_ch_ua added to features and binary_features Files modified: 8 SQL/Python + 1 migration, 36/36 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 23:06:01 +02:00
toto	1fa6aec784	fix: SQL view ordering, purge-db flag, ctest directory - 12_thesis_features.sql: move view_resource_cascade_1h before view_thesis_features_1h - Makefile: purge-db uses --reset (not --clean) - mod-reqin-log: ctest --test-dir build/tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:39:25 +02:00
toto	8f5e771096	docs: complete rewrite of the database documentation in French Rewrite of the 3 ClickHouse database documentation files: - docs/database/schema.md: full coverage of the 2 databases, 14+ tables, 7 dictionaries, 8 MVs, 8 views, TTL, partitions, engines and columns - docs/database/migrations.md: 13 SQL files (10-12 added), updated prerequisites (ClickHouse 24.8+, 5 CSVs), deploy_schema.sh, init-stack.sh, full verification and rollback - shared/clickhouse/README.md: quick reference for the 13 files, deploy_schema.sh, dual-database pattern, prerequisites Removed obsolete references: dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:03:37 +02:00
toto	9ea36ad22e	feat(scripts): complete stack init + prod data import with date shift Schema cleanup: - Remove anubis_ua_rules table stub from 03_anubis_tables.sql - Remove anubis_ua_rules from bot-detector deploy_schema.sql - Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency) - Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules New scripts: - scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations + validation + cleanup of obsolete tables). Supports --reset, --import-prod. - scripts/import-prod-data.sh: imports pre-exported prod data (Native format) with dynamic date shift (max(time) → now). Supports --shift, --no-truncate. - scripts/data/prod-export/: directory for cached Native format exports Makefile targets: init-stack, import-prod-data, init-and-import Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts import-prod-data.sh: 3M rows in ~37s with auto date shift Dashboard: 55 routes OK, bot-detector: 36/36 tests pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 21:40:05 +02:00
toto	8180f4af04	refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules - Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any) - Remove Country rule collection from parse_bot_policies_inline - Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules) - Remove insert_ua_rules and insert_country_rules functions - reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn - Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql, view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level UA+IP > UA > IP > ASN > Country cascade) - Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql (UA table kept as stub for REGEXP_TREE catch-all compatibility) - Remove anubis_country_rules table from schema - Remove Anubis UA and Country tabs from dashboard reflists page - Remove anubis_ua_rules/country_rules from API reflist queries - deploy_schema.sql simplified from 339 to 122 lines - 764 lines removed across 9 files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 15:25:33 +02:00
toto	039086a0b3	feat: new detection techniques and SOC tactics page SQL: - Add 5 aggregation columns (count_xff, count_unusual_ct, count_non_std_port, count_login_post, sec_ch_mobile_mismatch) - Expose 5 computed features in view_ai_features_1h - ALTER TABLE migration for existing deployments Bot-detector: - 7 new ML features (has_xff, unusual_content_type_ratio, non_standard_port_ratio, login_post_concentration, sec_ch_mobile_mismatch, true_window_size, window_mss_ratio) - Propagate campaign_id to ml_all_scores (was always -1) - Campaign escalation: HIGH→CRITICAL if cluster ≥5 members Dashboard: - SOC Tactics page: brute-force, JA4 rotation, recurrence, real-time alerts; 4 KPIs + 4 panels + doc tooltips - Add global fmtDate() helper - Sidebar navigation updated Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 14:29:18 +02:00
toto	db306fb9da	fix: P0 audit bugs in bot-detector + dashboard + SQL Bot-detector: - B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies - B1.4/B1.5: log_decision argument order fixed (cycle_id, name) - B1.7: AE broadcast error: model now returns features list, scoring uses model's features instead of current cycle's (prevents dim mismatch) - B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name Dashboard: - C1.1: XSS in ip_detail.html: {{ ip | tojson }} instead of raw string - C1.2: Stored XSS via innerHTML: added escapeHtml() helper, all user-facing formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized - C2.1: status filter now correctly filters http_version column - C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days SQL: - B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomalous) - B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 23:33:00 +02:00
toto	98289ccf04	fix: ASN dictionary pipeline + verbose bot-detector logging - Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols) - Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format) - Replace org/domain dictGet calls with empty string literals in MV - Full 714K CIDR stub for complete ASN resolution in tests - Add header generation to generate_asn_data.py - Verbose bot-detector stdout: data summary, triage breakdown, model training details, scoring stats, browser classification, boxed results - Fix IPv6 filter in traffic seeder (_ips_from_cidrs) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 17:43:55 +02:00
toto	9a48fb9d29	feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency Add browser legitimacy classification (A9) to the bot detection pipeline: - New features: is_known_browser (binary) and browser_consistency_score [0..5] combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language, cookies, Sec-Fetch-* presence - Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW threat level are reclassified as LEGITIMATE_BROWSER - Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays in normal anomaly scoring — prevents evasion via JA4 spoofing - XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label) - ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores - Dashboard: browser_family filter/sort on detections and scores endpoints, legitimate_browsers count and browser_stats in overview - 6 new unit tests covering classification threshold, spoofing, exclusion logic Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:46:22 +02:00
toto	7d09c614c3	feat: browser JA4 detection, Anubis bot rules, worldwide ASN data - Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...) - Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck, OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries) - Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs) - Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views - Add /api/browsers dashboard endpoint - Fix CSV quoting for fields containing commas (User-Agent strings) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:27:37 +02:00
toto	b735bab5a5	feat(dashboard): rebuild SOC dashboard + fix ClickHouse SQL Complete rewrite of the SOC dashboard using FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS. Replaces the old React/Vite frontend with server-rendered templates. Dashboard pages: - Overview: KPIs, timeline chart, threat distribution, top IPs - Detections: paginated/filterable anomaly table - Scores: ml_all_scores with AE error & XGB prob columns - Traffic: HTTP logs with method/host filters - IP Investigation: full deep-dive (scores, features, HTTP logs, classify) - Classification: SOC feedback form + history - Features: AI + thesis feature stats - Models: scoring stats + model metadata API: 9 JSON endpoints with parameterized queries, sort whitelists SQL fixes: - 05_aggregation_tables: add deduplicate_merge_projection_mode - 11_views: fix nested aggregate (argMax inside sum) - 12_thesis_features: remove invalid 'let' bindings, fix groupArrayIf type Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 03:21:05 +02:00
toto	8d58f2b932	feat(bot-detector): add XGBoost supervised third voice (#10) Triple-voice ensemble architecture: - EIF (unsupervised, zero-day anomalies) - Autoencoder (unsupervised, non-linear correlations) - XGBoost (supervised, known patterns + SOC feedback) XGBoost implementation: - Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1) - Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required - Score = predict_proba, combined via meta-learner: (1-β)(EIF+AE) + β·xgb_prob - Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS - Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only - ClickHouse: xgb_prob column added to ml_all_scores - Tests: 4 new tests (availability, train/predict, meta-learner, save/load) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:45:57 +02:00
toto	57cf6c3828	feat(bot-detector): add parallel Autoencoder scorer (#9 ) - TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU - Trained alongside EIF on human_baseline, saved/loaded with model versioning - Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30) - AE latent space (16-dim) used for HDBSCAN clustering instead of raw features - Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE - Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues - ClickHouse: ae_recon_error column added to ml_all_scores - Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:40:39 +02:00
toto	f6e2d3c0ca	feat(bot-detector): implement 8 state-of-the-art improvements - EIF: Extended Isolation Forest via isotree (fallback to sklearn IF) - Benford's Law deviation feature on inter-request timing - Lag-1 autocorrelation feature for cadence analysis - Validation gate: reject model if val_anomaly_rate > 20% - Feature pruning: remove variance < 1e-6 features before training - Quantile drift: replace N(μ,σ) synthetic with quantile interpolation - Thread safety: Lock for _service_healthy/_consecutive_failures - Score normalization: inverted to [0,1] where 1=most anomalous SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h Tests: 10 new test functions covering all improvements Integration: verify_mvs.py checks new thesis feature columns Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:31:26 +02:00
toto	6d02f21c1e	feat: implement thesis §5 advanced detection techniques as ClickHouse MVs New aggregation tables + materialized views: - agg_path_sequences_1h + MV (§5.1 Path Sequence Entropy) - agg_request_timing_1h + MV (§5.3 Request Cadence Fingerprint) - agg_ip_behavior_1h + MV (§5.5 JA4 Drift + §5.8 Cross-Domain) - agg_resource_cascade_1h + MV (§5.4 Resource Dependency Tree) New analytical views: - view_thesis_features_1h: unified view exposing all computable features (path_transition_entropy, cadence_cv, burst_ratio, pause_ratio, ja4_drift_ratio, host_diversity, host_sweep_speed, host_coverage_uniformity) - view_resource_cascade_1h: root_to_first_asset_delay, asset_load_stddev Documented future techniques (not feasible as MV): - §5.2 Bipartite Fleet Graph (needs Python networkx) - §5.6 DNS Shadow Analysis (needs sentinel UDP/53 extension) - §5.7 Compression Ratio Invariant (needs mod_reqin_log extension) Updated: deploy_schema.sh, verify_mvs.py (sections 8-10) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 01:42:52 +02:00
toto	ecceb04174	perf(clickhouse): P3 - view_ip_recurrence with TTL filter + remove FINAL view_ip_recurrence: Add WHERE detected_at >= now() - INTERVAL 30 DAY → With PARTITION BY (P1), ClickHouse prunes partitions outside this range before even reading the data. The view scans only the active partitions (instead of the 30 complete daily partitions). → ORDER BY (src_ip) ensures that the GROUP BY src_ip reads contiguous data (no in-memory reorganization). rotation.py: remove FINAL on ml_detected_anomalies: FINAL forces a full in-memory deduplication of the ReplacingMergeTree (equivalent to a DISTINCT over the whole table), one of the most expensive operations in ClickHouse. Fix: replace the FINAL sub-SELECT with view_ip_recurrence (already aggregated by src_ip, returns recurrence directly without FINAL). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:33:29 +02:00
toto	14323f7b05	perf(clickhouse): P10 - create the 4 missing business views + fix DB prefixes Production bug: view_form_bruteforce_detected, view_host_ip_ja4_rotation, view_dashboard_entities, view_dashboard_user_agents were referenced in 13 dashboard endpoints but did not exist anywhere in the schema. All of these endpoints returned HTTP 500 in production. shared/clickhouse/11_views.sql (new): view_form_bruteforce_detected Source: agg_host_ip_ja4_1h (24h) Logic: GROUP BY (src_ip, host) HAVING count_post >= 10 Usage: bruteforce.py (3 endpoints), investigation_summary.py view_host_ip_ja4_rotation Source: agg_host_ip_ja4_1h (24h) Logic: uniqExact(ja4) per src_ip, HAVING >= 2 (fingerprint rotation) Usage: rotation.py (3 endpoints), investigation_summary.py view_dashboard_entities Source: http_logs (7 days), UNION ALL 5 branches (ip/ja4/country/asn/host) Columns: entity_type, entity_value, src_ip, ja4, host, log_date, client_headers Array(String), asns Array, countries Array, user_agents Array Usage: entities.py (5 endpoints), clustering.py view_dashboard_user_agents Source: http_logs (7 days), GROUP BY (src_ip, ja4, hour) Columns: src_ip, ja4, hour, log_date, user_agents Array(String), requests Usage: variability.py (4 endpoints), fingerprints.py (5 endpoints), attributes.py (2 endpoints) deploy_schema.sh: add 10_perf_indexes.sql and 11_views.sql to the list routes/variability.py + fingerprints.py: Fix 9 queries using view_dashboard_user_agents without a database prefix → replaced with {settings.CLICKHOUSE_DB_PROCESSING}.view_* Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:30:09 +02:00
toto	f4ffe3410a	perf(clickhouse): P1 - partition + skipping indexes on ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h Problem: all dashboard queries with WHERE detected_at >= now() - INTERVAL N did a full scan because ml_detected_anomalies had ORDER BY (src_ip) with no partition or time index. Changes: - 06_ml_tables.sql: * ml_detected_anomalies: PARTITION BY toYYYYMMDD(detected_at) → daily partition pruning on all time-based queries * INDEX idx_detected_at (minmax) → skips granules outside the range * INDEX idx_threat_level set(8) → skip for countIf(threat_level = ...) * INDEX idx_bot_name bloom_filter → skip for bot_name != '' * ttl_only_drop_parts = 1 → TTL enforced by dropping whole partitions * ml_all_scores: same treatment (PARTITION BY + 2 indexes) - 04_mv_http_logs.sql: * http_logs: INDEX idx_src_ip bloom_filter(0.01) → queries with WHERE src_ip = X (analysis.py, variability.py) skip ~90% of granules without scanning the whole time range * INDEX idx_ja4 bloom_filter(0.01) → same for JA4 filters - 05_aggregation_tables.sql: * agg_host_ip_ja4_1h: PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...) → investigation_summary.py and rotation.py (WHERE src_ip = X) automatically use the projection instead of scanning every window_start - 10_perf_indexes.sql (new): * ALTER TABLE migration for existing instances * ADD INDEX + MATERIALIZE INDEX for the 4 tables * ADD PROJECTION + MATERIALIZE PROJECTION for agg_host_ip_ja4_1h * Note: PARTITION BY on an existing table requires recreating the table (documented) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:28:04 +02:00
toto	d4e7e674d8	feat: full-stack Docker Compose integration tests - 4-container stack: ClickHouse, platform (Rocky 9), bot-detector, dashboard - Platform builds sentinel on Rocky (CGO+libpcap native), correlator static - mod-reqin-log compiled with apxs on Rocky (matching RPM build target) - ClickHouse init script patches credentials for test env (sed-based) - 8-phase test runner: schema, traffic gen, pipeline, dashboard API, bot-detector, sentinel - All 13 checks pass, 3 non-blocking warnings (empty dicts, log paths) SQL schema fixes discovered during integration: - 02_dictionaries: IPv6CIDR → String (not a valid ClickHouse type) - 03_anubis_tables: dict_anubis_ua missing has_ip/rule_id/category attrs - 03_anubis_tables: dict_anubis_country FLAT() → COMPLEX_KEY_HASHED() (String key) - 09_audit_table: CODEC before DEFAULT → DEFAULT before CODEC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 20:33:25 +02:00
toto	9f3e0621e5	feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing) Architecture: - ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs) - ja4_processing: analytics, aggregation, ML, dictionaries, audit Configuration (env vars): - CLICKHOUSE_DB_LOGS (default: ja4_logs) - CLICKHOUSE_DB_PROCESSING (default: ja4_processing) Changes: - SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing with correct cross-database references (MVs, views, dicts) - deploy_schema.sh: substitutes DB names from env vars at deploy time - Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING - Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING - Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable - Correlator: DSN example updated to ja4_logs - Docker-compose + .env files: new env vars with defaults - All documentation updated (14 markdown files) All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 19:10:35 +02:00
toto	d469e39da7	feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized Services: - ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap) - logcorrelator: JA4 log correlation engine (Go, ClickHouse) - mod_reqin_log: Apache module (C, JSON request logging) - bot_detector: ML bot detection pipeline (Python) - dashboard: FastAPI/Streamlit analytics UI (Python) Shared libraries: - shared/go/ja4common: logger, config, shutdown, ipfilter (Go module) - shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package) - shared/clickhouse/: canonical SQL migrations (10 files) Build & packaging: - Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10) - go.work workspace linking sentinel, correlator, ja4common - Makefile with test-all, build-all, rpm-* targets Fixes applied: - go.work: 1.21 → 1.24.6 (required by sentinel) - correlator Dockerfiles: golang:1.21 → golang:1.24 - replace directives in go.mod for ja4common local path - pyproject.toml: setuptools.backends → setuptools.build_meta - Removed static libpcap linking (unavailable on Rocky 9) - Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32) - Rewrote corrupted test files (logger_test.go × 2) Test coverage: - correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%) - sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse) Documentation: - README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 16:42:59 +02:00

36 Commits