ja4-platform

Author	SHA1	Message	Date
Jacquin Antoine	6e5eb38efd	docs: update thesis and docs with Cleanlab label filtering integration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 02:19:46 +02:00
Jacquin Antoine	9d27abf43c	fix(ml): integrate Cleanlab to filter noisy SOC labels and prevent model poisoning Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 02:11:25 +02:00
Jacquin Antoine	c60ce97f23	feat(bot-detector): add dynamic browser profiling engine with HDBSCAN clustering Implement offline profile building (profile_builder.py) and real-time dynamic scoring (browser_matcher_dynamic.py) using HDBSCAN-based browser fingerprint clustering. Add ClickHouse materialized view (13_h2_profiling.sql) for h2_profile_stats aggregation. Update thesis and project documentation to cover the new dynamic profiling architecture. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 02:06:00 +02:00
Jacquin Antoine	d75825278e	feat: multi-distro VM tests, ja4ebpf eBPF improvements, bot-detector scoring ja4ebpf: - Refactor BPF TC capture with improved SYN offset handling and TCP option parsing - Enhance TLS uprobe SSL hooking for better key extraction - Add ClickHouse writer improvements for HTTP log materialized views - Update RPM spec for Rocky Linux 8/9/10, fix systemd service - Simplify loader with cleaner bpf2go integration bot-detector: - Add H2 SETTINGS per-parameter comparison in browser_matcher - Enhance browser signatures and scoring pipeline - Improve preprocessing and cycle detection infra: - Multi-distro Vagrantfile (centos8, rocky9, rocky10) with per-distro provisioning - New Makefile targets: vm-up-all, test-vm-matrix, test-vm-centos8/rocky10 - Add debug helpers and run-test-from-host.sh for host-driven VM testing - Update run-tests-vm.sh for cross-distro compatibility - Remove accidental binary blob (\004) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 01:09:33 +02:00
toto	957918c565	fix(ja4ebpf): Rocky Linux RPM builder, remove correlated field, fix thesis - Dockerfile.package: migrate go-builder from golang:bookworm (Debian) to rockylinux:9, install Go from the official tarball, replace apt with dnf (clang llvm libbpf-devel bpftool) - Remove the 'correlated' field from the ja4ebpf agent: with eBPF/XDP, L3/L4↔L7 correlation is always implicit in the presence of the fields. Removed from: session.go, manager.go, main.go (x5), clickhouse.go - Thesis (6 listed corrections + correlated consistency): 1. §3.5 + §3.9.1: SSL_read returns raw bytes without respecting H2 boundaries → circular reassembly buffer in Go userspace 2. §3.1: removed libpcap + CAP_NET_RAW, replaced with the uprobe definition 3. §4 + §7: exact count of 96 features in 8 families (Famille 1–8), removed the obsolete F1–F11 taxonomy, all totals updated 4. §2.4 + §8: replaced 7 fake arXiv URLs with [Référence à vérifier] 5. §4 Famille 2: ja4_drift_ratio → cross-reference to Famille 8 (full definition) 6. §6.4: added the limitation 'Overhead de l'uprobe SSL_read' + §3.6: removed correlated=0/1 from the architecture text Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-12 04:48:40 +02:00
toto	b1218a2367	fix(ja4ebpf): fix TLS capture, SYN offsets, TCP option parsing - Increase MAX_TLS_PAYLOAD from 512 to 2048 bytes to capture full TLS ClientHellos (modern browsers/curl send 1000-1543 byte ClientHellos) - Fix ParseClientHello to tolerate XDP-truncated payloads: clamp recordLength and chLen to available data instead of returning error - Fix cipher suites, compression, extensions truncation to use clamping - Fix consumeSynEvents struct field offsets: dst_ip (4 bytes at offset 4) was not accounted for, causing all L3/L4 metadata to be read from wrong positions (TTL was actually dst_ip[0], windowSize was dst_port, etc.) - Add parseTCPOptions() to extract MSS and Window Scale from raw TCP options (C code sets defaults of mss=0, window_scale=0xFF, expects Go to parse) - Fix consumeAcceptEvents: skip zero-IP events to avoid phantom sessions - Fix consumeSSLEvents: filter zero-IP/port events when proc fallback fails - Add missing consumeHTTPPlainEvents goroutine (was defined but never called) - Fix race condition: SYN consumer sets Correlated=true if TLS already present - Update tls_hello_event struct offsets in Go consumer (payload_len now at offset 2054, was 518, due to payload array growing from 512 to 2048 bytes) - Remove debug logging from consumers and GC E2E verified: HTTP plain (port 80) and HTTPS (port 443) both produce fully correlated sessions in ClickHouse with correct: - ip_meta_ttl=64, ip_meta_df=true, ip_meta_id - tcp_meta_window_size=64240, tcp_meta_window_scale=10, tcp_meta_mss=1460 - ja4=t13i3010_1d37bd780c83_95d2a80e6515 - tls_alpn=http/1.1 - method=GET, path=/, header_order_signature=Host;User-Agent;Accept - correlated=1 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-12 04:16:44 +02:00
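The TCP option parsing described in b1218a2367 can be sketched as follows. This is a minimal Python illustration, not the project's Go `parseTCPOptions()`; the function name `parse_tcp_options` and the returned dict layout are assumptions made for the example. It walks the kind/length/value option list and picks out MSS (kind 2) and Window Scale (kind 3).

```python
def parse_tcp_options(raw: bytes) -> dict:
    """Walk raw TCP options and return MSS and window scale when present.

    Hypothetical helper for illustration; missing options stay None.
    """
    result = {"mss": None, "window_scale": None}
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # NOP is a single byte with no length field
            i += 1
            continue
        if i + 1 >= len(raw):    # truncated: no length byte available
            break
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            break                # malformed or truncated option
        body = raw[i + 2 : i + length]
        if kind == 2 and length == 4:
            result["mss"] = int.from_bytes(body, "big")
        elif kind == 3 and length == 3:
            result["window_scale"] = body[0]
        i += length
    return result


# MSS=1460, NOP, Window Scale=10, End of Option List
sample = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 10, 0])
parsed = parse_tcp_options(sample)
```

The sample bytes encode the same values the commit reports as verified end to end (`tcp_meta_mss=1460`, `tcp_meta_window_scale=10`).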
toto	f85a10b012	feat: complete L7 HTTP pipeline + VM test infrastructure L7 pipeline fixes (SSL_read uprobe): - uprobe_ssl.c: ssl_set_fd no longer returns early when fd_conn_map is empty (accept4 unavailable in Docker). Saves ssl_ptr→{fd,0,0} to allow the /proc fallback on the Go side. - main.go: consumeSSLEvents rewritten with a full magic-bytes router: * HTTP/2 preface → SETTINGS extraction + conversion to correlation.HTTP2Settings * HTTP/1.x request → method, path, query, headers, header_order_sig * HTTP/1.x response → status_code * Fallback to /proc/<tgid>/fd/<fd> when src_ip=0 (accept4 absent) - writer/clickhouse.go: header_order_signature export added New packages: - internal/parser/http1.go: HTTP/1.x parser (IsHTTP1Request, ParseHTTP1Request, IsHTTP1Response, ParseHTTP1Response) - internal/parser/http1_test.go: 11 unit tests (28 total pass) - internal/procutil/proc_lookup.go: fd→IP resolution via /proc with a 5s TTL cache (FDCache). Supports /proc/PID/net/tcp and tcp6, IPv4-mapped IPv6. VM test infrastructure (tests/vm/): - Vagrantfile: Rocky Linux 9 KVM VM, 4 CPU / 4 GB RAM - provision.sh: installs the eBPF toolchain + Go + Docker + nginx - run-tests-vm.sh: full test suite inside the VM (L3/L4+TLS+L7) - README.md: installation and usage guide - Makefile: targets vm-up, vm-down, vm-ssh, test-vm-nginx, test-vm-all, vm-rebuild-ja4ebpf Docker stack fixes: - nginx/apache/nginx-varnish/hitch-varnish Dockerfiles: removed references to shared/go/ja4common/ (deleted directory) - clickhouse-init.sh: restored from git, obsolete anubis_ua_rules seed removed (REGEXP_TREE table dropped from the schema) - traffic-gen: added HTTP/1.0 (http.client) and HTTP/2 (httpx) - verify_db.py: verification script with 35 checks (L3/L4/TLS/L7/correlation) - run-stack-tests.sh: phase 6 verify_db added Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-12 02:37:00 +02:00
toto	9734e21fe3	chore: remove obsolete services (sentinel, correlator, mod-reqin-log) Replaced by the ja4ebpf agent (eBPF CO-RE). Full cleanup: Removed: - old/ (archive of the old architecture) - services/correlator/ (Go logcorrelator) - services/sentinel/ (Go pcap capture) - services/mod-reqin-log/ (Apache C module) - shared/go/ja4common/ (shared Go lib, no longer imported by ja4ebpf) - tests/integration/platform/ (correlator+sentinel+httpd test) - tests/integration/docker-compose.yml (old-architecture compose) - tests/integration/run-tests.sh (correlator/sentinel runner) - tests/integration/verify_mvs.py (orphaned script) Cleaned: - go.work: drop ./shared/go/ja4common - services/ja4ebpf/go.mod: drop the ja4common replace (never imported) - services/ja4ebpf/Dockerfile*: drop the unneeded COPY ja4common lines - Makefile: drop test-ja4common-python, test-integration*, obsolete targets - tests/integration/README.md: rewritten for the ja4ebpf architecture Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-12 01:48:14 +02:00
toto	dc6ffd6474	fix: integration test matrix: procps-ng, varnish h2, hitch ALPN, pgrep→ps - Add procps-ng to the 4 runtime Dockerfiles (ps/pgrep available) - Replace pgrep with ps -C in all run-tests.sh scripts - Fix nginx-varnish entrypoint: pgrep nginx → cat nginx.pid (exit 127) - Enable HTTP/2 in Varnish: add -p feature=+http2 to the nginx-varnish and hitch-varnish entrypoints - Restore ALPN h2,http/1.1 in hitch.conf (varnish now supports h2) - Fix hitch-varnish healthcheck: curl without --http1.1 (h2 works) - Fix phase_verify queries: http_logs_raw → http_logs, correct columns - Fix clickhouse.go writer: JSON names aligned with the MV (ip_meta_*, tls_sni…) - Fix toStartOfSecond(DateTime) → toStartOfSecond(toDateTime64(col, 3)) - Remove the el8/nginx-varnish SKIP (varnish installs fine on AlmaLinux 8) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-12 01:29:01 +02:00
toto	3b047b680a	fix(ja4ebpf): split bpf2go generate into Ja4Tc + Ja4Ssl, fix RPM systemd-rpm-macros - Use two separate //go:generate directives (Ja4Tc for tc_capture.c, Ja4Ssl for uprobe_ssl.c) to avoid duplicate LICENSE symbol and multi-file clang issue - Update loader.go to hold tcObjs/sslObjs separately with correct field names: UprobeSslSetFd, UprobeSslReadEntry, UretprobeSslReadExit, KprobeAccept4Entry, KretprobeAccept4Exit - Add systemd-rpm-macros to all three RPM build stages (el8/el9/el10) so that %{_unitdir} macro resolves correctly - RPMs now build successfully for el8, el9, el10 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 23:21:11 +02:00
toto	a1e4c1dad5	feat: add ja4ebpf service — eBPF-based TLS/TCP fingerprinting daemon - TC ingress hook captures TCP SYN (L3/L4) and TLS ClientHello - Uprobes on SSL_read/SSL_set_fd capture decrypted TLS data - Kprobes on accept4 correlate socket FDs to client IP:port - JA4 fingerprint computed from parsed TLS ClientHello - HTTP/2 SETTINGS and WINDOW_UPDATE extracted from decrypted streams - Session manager with sharded map (256 shards) and GC goroutine - Slowloris detection: sessions with no requests after 10s threshold - ClickHouse batch writer to ja4_logs.http_logs_raw (raw_json) - All tests pass: 17 parser + 10 correlation tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 22:43:26 +02:00
toto	7eb3ad21fd	feat(dashboard): show individual H2 SETTINGS in the mismatch table - /api/browser-signatures: top_mismatches now includes the 7 individual SETTINGS columns (h2_header_table_size, h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol) - stats: added sessions_with_priority (countIf h2_priority_present > 0) - browsers.html: compact SETTINGS column in the suspects table (format '3:100, 4:65536, 2:0': Akamai IDs with non-null values) - The pseudo-priority counter uses the real sessions_with_priority value instead of displaying '—' Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 03:11:17 +02:00
toto	f704541f83	feat(h2): direct per-parameter SETTINGS comparison in browser_matcher - Rewrote _d1_h2_settings() with 3-signal weighted formula: direct_score×0.60 + dict_match×0.30 + ja4_coherence×0.10 when individual SETTINGS cols are available in the DataFrame - Added _H2_SETTINGS_COLS dict (IDs 1,2,3,4,5,6,8 → column names) - Fallback to dict_match×0.80 + ja4_coherence×0.20 for backward compat - Fix view_ai_features_1h: pass 7 individual SETTINGS columns through base_data CTE (h2_header_table_size, h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol) - Remove non-existent h2_dict_confidence reference from view SQL (dict_browser_h2 only exposes browser_family attribute) - Add 7 new pytest cases: exact match, one wrong setting, forbidden key penalty, unknown fingerprint with correct settings, fallback path, CDN proxy neutralisation, full Chrome simulation - 53/53 bot-detector tests pass - Update thesis §3.9.2: document direct comparison algorithm + fallback Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 03:05:36 +02:00
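The weighting in f704541f83 can be written out as a small Python sketch. The weights come from the commit message; the function name `d1_h2_settings_score`, its scalar signature, and the use of `None` to signal that the individual SETTINGS columns are unavailable are assumptions for illustration, since the real `_d1_h2_settings()` operates on a DataFrame.

```python
from typing import Optional


def d1_h2_settings_score(dict_match: float,
                         ja4_coherence: float,
                         direct_score: Optional[float] = None) -> float:
    """Combine the H2 SETTINGS signals with the weights quoted in the commit.

    direct_score is None when per-parameter SETTINGS columns are missing
    (hypothetical convention), which selects the backward-compatible fallback.
    """
    if direct_score is not None:
        # 3-signal formula: direct comparison dominates
        return 0.60 * direct_score + 0.30 * dict_match + 0.10 * ja4_coherence
    # Fallback: dictionary match and JA4 coherence only
    return 0.80 * dict_match + 0.20 * ja4_coherence


full = d1_h2_settings_score(dict_match=1.0, ja4_coherence=1.0, direct_score=0.5)
fallback = d1_h2_settings_score(dict_match=1.0, ja4_coherence=0.0)
```

Both branches use weights that sum to 1.0, so a session matching on every signal scores 1.0 on either path.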
toto	85d3b95b7b	feat: HTTP/2 passive fingerprinting with individual SETTINGS fields Complete implementation of HTTP/2 passive fingerprinting per thesis §2.5.3: mod-reqin-log (C module): - Replace connection-level filter with ap_hook_process_connection (APR_HOOK_FIRST) to capture H2 preface before mod_http2 takes over the connection - AP_MODE_SPECULATIVE read of 512 bytes from c->input_filters - Parse SETTINGS, WINDOW_UPDATE, PRIORITY flags, pseudo-header order - Output individual SETTINGS params as separate JSON fields (IDs 1-6, 8) - Read H2 notes from c1 (master connection) for mod_http2 secondary conns - Fix header_order_signature JSON length bug (26→strlen) ClickHouse schema: - Add 8 new columns to http_logs: h2_has_priority, h2_header_table_size, h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol - Use Int32/Int64 with DEFAULT -1 to distinguish absent vs zero - Update mv_http_logs to extract individual fields via JSONHas/JSONExtractInt - Migration 04_http2_fields.sql updated for existing deployments Correlator: - Accept both timestamp_ns and timestamp field names (backward compat) Integration: - Enable HTTP/2 in Apache: Protocols h2 http/1.1 in httpd-integration.conf Validated end-to-end via Playwright: H2 curl traffic → mod-reqin-log → correlator → ClickHouse with all 12 H2 columns populated correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 02:33:45 +02:00
toto	d098de1a66	fix(bot-detector): neutralize H2 dimensions behind proxy (X-Forwarded-For) When has_xff=1, the H2 connection is terminated by the reverse proxy/CDN, so client H2 fingerprints are lost. Previously only D1 (h2_settings) was neutralized; D2 (window_update), D3 (pseudo_order), and D4 (priority) still penalized proxied traffic — a real Chrome behind Cloudflare scored 0.0 on 3 dimensions (45% of total weight). Now all 4 H2 dimensions return 0.5 (neutral) when has_xff>0, and non-browser H2 detection is also disabled behind proxies. Tests: 10/10 passed including 3 new XFF-specific cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 15:15:20 +02:00
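The neutralisation rule in d098de1a66 amounts to overriding the four H2 dimension scores with 0.5 whenever `has_xff > 0`. A minimal sketch, assuming per-session scores held in a plain dict; the dimension keys and the function name `neutralize_h2_behind_proxy` are hypothetical:

```python
# Hypothetical keys for D1..D4; the real column names may differ.
H2_DIMENSIONS = ("h2_settings", "window_update", "pseudo_order", "priority")
NEUTRAL_SCORE = 0.5


def neutralize_h2_behind_proxy(dim_scores: dict, has_xff: int) -> dict:
    """Return a copy of dim_scores with all H2 dimensions set to neutral
    when the request came through a proxy (X-Forwarded-For present)."""
    out = dict(dim_scores)
    if has_xff > 0:
        for dim in H2_DIMENSIONS:
            out[dim] = NEUTRAL_SCORE
    return out


proxied = neutralize_h2_behind_proxy(
    {"h2_settings": 0.0, "window_update": 0.0, "pseudo_order": 0.0,
     "priority": 0.0, "tls": 0.9},
    has_xff=1,
)
```

Non-H2 dimensions such as `tls` pass through unchanged, so proxied traffic is still scored on the signals the proxy does not rewrite.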
toto	261205028d	fix(dashboard): campaigns scatter chart: show campaigns not IPs - API /api/campaigns/scatter: aggregate by campaign_id instead of per-IP Returns avg_score, avg_velocity, unique_ips, ja4_list, asn_list, country_list - Template: one bubble per campaign, sized by IP count - Tooltip: campaign-level info (IPs, score, velocity, ASNs, countries, JA4s) - Click navigates to campaign detail (not IP detail) - Updated doc panel text Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 15:09:02 +02:00
toto	fb73c60e7d	feat(dashboard): fingerprint discovery page — extract and group JA4/H2/headers from traffic - GET /api/fingerprint-discovery: queries http_logs, groups by JA4, aggregates UA family, header presence rates (Sec-CH-UA, Sec-Fetch, Accept-Language, zstd, brotli, gzip, XFF), H2 data, TLS info, dict lookups - /fingerprints page: KPIs, doughnut chart by family, stacked header bars, filterable/sortable profile table, expandable detail panel - Promote button: push H2 fingerprints to browser_h2_signatures via existing POST /api/browser-signatures/entries endpoint - Nav link: Découverte added after Navigateurs in sidebar Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 15:02:53 +02:00
toto	fde6864311	feat(dashboard): browser signatures management UI - Add dict_browser_h2 to /reflists (read-only via dict_browser_h2) - New API endpoints: GET /api/browser-signatures/entries: lists browser_h2_signatures (falls back to the CSV dict if migration 06 is not applied) POST /api/browser-signatures/entries: add fingerprint + reload dict DELETE /api/browser-signatures/entries: delete + reload dict - /browsers page: 2 new sections 'Base de signatures H2' (H2 signature database): table of the 10 fingerprints, add form, automatic read-only mode if migration 06 is not applied 'Règles de scoring browser_matcher.py' (browser_matcher.py scoring rules): static table of the 7 dimensions (weights, per-family values, bypass thresholds) - Integration: browser_h2.csv copied into user_files at ClickHouse startup Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 14:46:07 +02:00
toto	da1b579d4f	fix(dashboard): rename duplicate /api/browsers route to /api/browser-signatures The /api/browsers route already existed (JA4 distribution by family). The new browser_matcher route conflicted with it: FastAPI used the first definition. Renamed to /api/browser-signatures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 14:17:38 +02:00
toto	9c308747bd	feat(dashboard): Browser Signature Detection page (/browsers) New page dedicated to passive analysis of browser signatures (§4): API, GET /api/browsers: queries view_ai_features_1h for: - Global counters (total, sessions_with_h2, matched, mismatch %) - h2_dict_family distribution (Chrome/Firefox/Safari/Edge) - Breakdown of WINDOW_UPDATE signals (chrome/firefox/safari/absent/other) - TLS↔H2 mismatch by JA4 family (total + count + %) - Top 20 suspicious sessions (tls_h2_family_mismatch=1, sorted by hits) /browsers page: - 6 header KPIs (sessions, with H2, known family, match rate, mismatch, % mismatch) - Doc banner explaining browser_matcher §4 and DUAL_MODE - Donut: H2 families (dict_browser_h2 lookup) - Horizontal bar: WINDOW_UPDATE signals by family - Grouped bar + line: TLS↔H2 mismatch by JA4 family (count + %) - Table: top 20 potential impostors with clickable IP, pseudo-order, coherence - Mini-KPIs: pseudo-header orders Chrome/Safari, Firefox, unknown, PRIORITY frames - 'Navigateurs' nav link in the Surveillance group of base.html Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 14:02:39 +02:00
toto	e52cdcc01f	feat(bot-detector): Browser Signature Detection engine (parallel mode) Step A: browser_signatures.py Pure data: BROWSER_SIGNATURES (Chrome/Firefox/Safari), NON_BROWSER_SIGNATURES (curl/httpx/go), BROWSER_THRESHOLDS, DIMENSION_WEIGHTS. H2 values extracted from real captures (Akamai format with commas, not semicolons). Step B: browser_matcher.py Vectorized 7-dimension engine (H2 SETTINGS 0.30, WINDOW_UPDATE 0.15, pseudo-header order 0.15, H2 PRIORITY 0.10, HTTP headers 0.15, TLS 0.10, JA4 dict 0.05). run_browser_matcher(df) adds bm_family/bm_score/bm_decision. CDN edge case: H2 dimension neutralized (0.5) if has_xff=1. BROWSER_MATCHER_REPLACE=false by default (DUAL_MODE, logging only). Step C: 06_browser_signature_detection.sql (migration) Creates browser_h2_signatures (MergeTree table with 12 reference fingerprints). Recreates dict_browser_h2 from the table with a confidence field (replaces the CSV). Step D: 07_ai_features_view.sql +h2_wu_val in the http_logs JOIN, +h2_window_update_value, +h2_dict_family, +h2_dict_confidence, +h2_window_{chrome,firefox,safari,absent}, +h2_order_{chromesafari,firefox}, +h2_priority_present, +h2_pseudo_ord_raw, +tls_h2_family_mismatch (detects inconsistency between JA4 family and H2 family). Step E: preprocessing.py + pipeline.py preprocessing.py: calls run_browser_matcher() after compute_browser_axes(), adds 7 new binary H2 features to FEATURES and binary_features. pipeline.py: calls log_dual_mode_comparison() after the A9 classification. BROWSER_MATCHER_REPLACE=true enables replacing the bypass. Step F: test_browser_matcher.py 8 tests: Chrome/Firefox/Safari full match, curl rejected, httpcloak partial, TLS↔H2 mismatch, CDN proxy neutralization, go net/http rejected. All 8 PASSED (+ 36 existing tests unchanged). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 13:52:57 +02:00
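The seven dimension weights listed in e52cdcc01f sum to 1.0, so the overall score can be read as a weighted average of per-dimension scores in [0, 1]. A minimal sketch using the weights from the commit message; the dict keys and the function name `combined_score` are hypothetical, and treating a missing dimension as 0.0 is an assumption rather than documented behaviour:

```python
# Weights as quoted in the commit; key names are illustrative only.
DIMENSION_WEIGHTS = {
    "h2_settings": 0.30,
    "window_update": 0.15,
    "pseudo_header_order": 0.15,
    "h2_priority": 0.10,
    "http_headers": 0.15,
    "tls": 0.10,
    "ja4_dict": 0.05,
}


def combined_score(dim_scores: dict) -> float:
    """Weighted sum of per-dimension scores; absent dimensions count as 0.0."""
    return sum(weight * dim_scores.get(name, 0.0)
               for name, weight in DIMENSION_WEIGHTS.items())


perfect = combined_score({name: 1.0 for name in DIMENSION_WEIGHTS})
tls_only = combined_score({"tls": 1.0, "ja4_dict": 1.0})
```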
toto	79dbb23d6f	feat(dashboard): time-range selector on /campaigns Before: all campaign views were fixed at 7 days. After: a selector for 1 / 7 (default) / 14 / 30 / 90 days at the top right. - Added the ?days= parameter (1–90, default 7) to: GET /api/campaigns GET /api/campaigns/graph GET /api/campaigns/scatter GET /api/campaigns/{cid} - The selector reloads the 3 views (cards, scatter, graph) and the detail panel simultaneously with the same time window - The campaign counter shows the active range: (4 campagnes — 30j) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 13:24:08 +02:00
toto	9548b1782d	fix: correct ORDER BY for ml_detected_anomalies in the base schema CH 24.8 rejects MODIFY ORDER BY on existing columns (BAD_ARGUMENTS error 36). Migration 01 therefore could not fix the ORDER BY post-init. Fix: - 06_ml_tables.sql: ORDER BY (src_ip) → ORDER BY (src_ip, ja4, host, model_name) + TTL 30d → 7d (consistent with the documented architecture) - 01_ttl_adjustments.sql: removes the impossible MODIFY ORDER BY, keeps only the MODIFY TTL statements (valid for existing deployments) Result: make init-stack with no ⚠ or ✗ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:34:07 +02:00
toto	92432085e2	fix(campaigns): fix IP navigation URL encoding fmtIP() returns an HTML <a> tag string. Using encodeURIComponent(fmtIP(ip)) was URL-encoding the entire HTML markup instead of the raw IP address, resulting in /ip/%3Ca%20href%3D... navigation. Fix: extract raw IP (stripping ::ffff: prefix) before building the URL. Applied to all 3 click handlers in campaigns.html: - members table row onclick - scatter chart point click - force graph node click Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:08:53 +02:00
toto	7a04e47041	fix(sql+api): fix view column mismatches and ClickHouse 24.8 JOIN issue - view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen - view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start - Replace uniqExact/groupUniqArray with count()/groupArray (no nested-agg error) - api.py campaigns/graph: move a.src_ip < b.src_ip from JOIN ON to WHERE (ClickHouse 24.8 forbids cross-table inequality in JOIN ON condition) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:05:04 +02:00
toto	2f2c5e03bb	fix(sql): work around ClickHouse 24.8 scope bug in view_ai_features_1h - Restructure 07_ai_features_view.sql: single anonymous inner subquery with explicit aliases on all columns (a.xxx AS xxx, h.xxx AS xxx, h2.xxx AS xxx) to resolve the PARTITION BY src_ip ambiguity in the outer SELECT - Remove the multiple CTEs (h2_agg, enriched) that triggered the bug - Fix migration 04_http2_fields.sql: DEFAULT must come before CODEC (ClickHouse syntax) - make init-stack: 0 errors across 13 SQL files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 00:48:05 +02:00
toto	a108814a56	feat: bot detection roadmap §2-9: HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics Step 2, HTTP/2 fingerprinting in the ML pipeline: - Add the dict_browser_h2 dictionary (11 browser families) to 05_aggregation_tables.sql - Add the h2_agg CTE and 4 HTTP/2 features to 07_ai_features_view.sql: h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare - Compute fingerprint_coherence_score (5 weighted axes) in the view - Add the 6th axis axis_h2_coherence to browser.py (weights rebalanced) - browser_h2.csv: 11 Akamai fingerprints → browser family Step 3, coherence pre-filter on the human baseline: - pipeline.py excludes sessions with fingerprint_coherence_score < threshold from the training baseline - FINGERPRINT_COHERENCE_THRESHOLD configurable via env (default 0.25) - Log excluded sessions for SOC analysis Step 4, improved drift detection: - scoring.py: move from 5 to 9 quantiles (p5…p95) - Add KL divergence alongside the KS test - Adversarial drift detection (≥80% of features drift in the same direction) - Strict temporal split for validation Step 5, JA4×ASN bipartite graph (§5.2): - fleet.py: fleet detection via NetworkX + Louvain (optional imports) - enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame - cycle.py: called after preprocess_df, logging the number of fleet sessions - SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (TTL 7d) - Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template Step 6, cross-domain Jaccard §5.8: - 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity - Signal: same paths (/admin, /wp-login) on several hosts = scanner Step 7, ExIFFI + per-feature AE errors: - scoring.py: compute_exiffi_importance() by permutation, compute_ae_feature_errors() - pipeline.py: ExIFFI computed on X_test, index → dict mapping for anomalies - build_reason() enriched with exiffi_top when SHAP is inactive Step 8, meta-learner for ensemble weighting: - scoring.py: MetaLearner class (LogisticRegression, falls back to fixed weights below 1000 labels) - Labels collected from the current cycle (known_bots, legitimate, Anubis) - pipeline.py: fixed weights replaced by MetaLearner.predict() Step 9, performance metrics and monitoring: - metrics.py: record_cycle_metrics(): anomaly rate, drift, correlation, latency - SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (TTL 90d) - Dashboard: /health + /api/health + health.html template - cycle.py: record_cycle_metrics called at end of cycle (Complet + Applicatif) Tests: 36/36 bot-detector tests pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 00:11:35 +02:00
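The cross-domain Jaccard signal from a108814a56 (step 6) compares the sets of paths requested on different hosts. The commit implements it in SQL (`jaccard_paths` CTE); the Python below is only a sketch of the underlying set measure, and the function name `jaccard_similarity` and the sample path sets are invented for the example.

```python
def jaccard_similarity(paths_a, paths_b) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two path collections.

    Returns 0.0 when both are empty (a convention chosen for this sketch).
    """
    a, b = set(paths_a), set(paths_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Same probing paths seen on two different hosts: high similarity
host_one = {"/admin", "/wp-login"}
host_two = {"/admin", "/wp-login", "/"}
similarity = jaccard_similarity(host_one, host_two)
```

A value near 1.0 across several hosts is the pattern the commit describes as a scanner signal.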
toto	8ca4a1e849	feat(mod_reqin_log): passive HTTP/2 fingerprinting (Akamai format) Adds a connection input filter (AP_FTYPE_CONNECTION, APR_HOOK_LAST) that sits between mod_ssl and mod_http2 to read the HTTP/2 preface non-destructively (RFC 9113 §3.4) and extract: - h2_fingerprint: full Akamai fingerprint, e.g. '1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p' - h2_settings_fp: raw SETTINGS entries (e.g. '1:65536,4:6291456') - h2_window_update: WINDOW_UPDATE increment (e.g. '15663105') - h2_pseudo_order: pseudo-header order (e.g. 'm,a,s,p' Chrome, 'm,p,s,a' Firefox) Technique: speculative AP_MODE_SPECULATIVE read (non-destructive) of 512 bytes; the data remains available to mod_http2. The filter removes itself from the chain after the first invocation. Stored in c->notes (H2_NOTE_*) then emitted as JSON in log_request(). ClickHouse: 4 new columns in http_logs + JSONExtract in mv_http_logs. Migration for existing deployments: 04_http2_fields.sql. 14 unit tests (cmocka) cover Chrome/Firefox/HTTP1/truncation/HPACK. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 23:46:50 +02:00
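The Akamai-style fingerprint string emitted by 8ca4a1e849 has four `|`-separated parts: SETTINGS entries, WINDOW_UPDATE increment, priority field, and pseudo-header order. A minimal Python sketch of splitting it back into fields, assuming exactly four parts; the function name `parse_akamai_h2` and the returned dict keys are hypothetical and not part of mod_reqin_log:

```python
def parse_akamai_h2(fingerprint: str) -> dict:
    """Split 'SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO_ORDER' into fields.

    Raises ValueError if the string does not have exactly four parts.
    """
    settings_part, window_update, priority, pseudo = fingerprint.split("|")
    settings = {}
    for entry in filter(None, settings_part.split(",")):
        setting_id, value = entry.split(":")
        settings[int(setting_id)] = int(value)
    return {
        "settings": settings,                 # SETTINGS ID -> value
        "window_update": int(window_update),  # connection-level increment
        "priority": priority,                 # kept as the raw string
        "pseudo_order": pseudo.split(","),    # e.g. ['m', 'a', 's', 'p']
    }


chrome_like = parse_akamai_h2(
    "1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p"
)
```

The sample string is the example given in the commit message; the priority field is left unparsed because the commit does not describe its internal layout.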
toto	14db3d9040	refactor: remove User-Agent dependency from browser detection SQL changes: - modern_browser_score: sec-ch-ua→100, Sec-Fetch→70 (no more UA fallback) - Add has_sec_ch_ua (UInt8) to agg_header_fingerprint_1h and ml_all_scores - mss_mobile_mismatch uses has_sec_ch_ua instead of modern_browser_score - header_order_confidence: PARTITION BY ja4 instead of first_ua - sec_ch_mobile_mismatch: internal Client Hints comparison (no UA) - Migration 03_remove_ua_browser_detection.sql Python changes: - browser.py Axis 3: Client Hints + Sec-Fetch + is_fake_navigation (NO UA) - Axis weights: ja4_known 0.30, tls_coherence 0.20 (TLS signals strengthened) - preprocessing.py: has_sec_ch_ua added to features and binary_features Files changed: 8 SQL/Python + 1 migration, 36/36 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 23:06:01 +02:00
toto	00e99e5464	fix(bot-detector): make scoring functions public (remove underscore prefix) compute_shap_top_features, build_reason, cluster_anomalies renamed from private (_prefixed) to public to match pipeline.py imports. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:49:48 +02:00
toto	629f7b334d	fix(bot-detector): rename _compute_drift_score to public, fix import Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:48:21 +02:00
toto	de6d8da931	fix(bot-detector): FEATURES_BASE → FEATURES import name mismatch Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:42:32 +02:00
toto	6d64c2a8a8	fix(rpm): add systemd-rpm-macros to Dockerfile.package, fix correlator spec_version - sentinel/correlator: install systemd-rpm-macros in rpm-builder stage - correlator: use build_version macro (not version) to avoid recursive expansion - mod-reqin-log: fix ctest --test-dir to find tests in build/tests/ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:33:53 +02:00
toto	6b3cc54652	docs: rewrite audit, DOCUMENTATION.md and IMPROVEMENTS.md for the modular architecture - AUDIT: compliance updated to 97.9% (142/145), modular references - DOCUMENTATION.md: 1083 lines, 7 sections, 11 modules documented - IMPROVEMENTS.md: A1-A10/B1-B10 annotated ✅/🔄/❌ with locations Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 22:14:18 +02:00
toto	9ea36ad22e	feat(scripts): complete stack init + prod data import with date shift Schema cleanup: - Remove anubis_ua_rules table stub from 03_anubis_tables.sql - Remove anubis_ua_rules from bot-detector deploy_schema.sql - Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency) - Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules New scripts: - scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations + validation + cleanup of obsolete tables). Supports --reset, --import-prod. - scripts/import-prod-data.sh: imports pre-exported prod data (Native format) with dynamic date shift (max(time) → now). Supports --shift, --no-truncate. - scripts/data/prod-export/: directory for cached Native format exports Makefile targets: init-stack, import-prod-data, init-and-import Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts import-prod-data.sh: 3M rows in ~37s with auto date shift Dashboard: 55 routes OK, bot-detector: 36/36 tests pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 21:40:05 +02:00
toto	8180f4af04	refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules - Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any) - Remove Country rule collection from parse_bot_policies_inline - Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules) - Remove insert_ua_rules and insert_country_rules functions - reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn - Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql, view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level UA+IP > UA > IP > ASN > Country cascade) - Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql (UA table kept as stub for REGEXP_TREE catch-all compatibility) - Remove anubis_country_rules table from schema - Remove Anubis UA and Country tabs from dashboard reflists page - Remove anubis_ua_rules/country_rules from API reflist queries - deploy_schema.sql simplified from 339 to 122 lines - 764 lines removed across 9 files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 15:25:33 +02:00
toto	98abbc80c7	feat(dashboard): Reference Lists page: CSV/dictionary visualization New /reflists page to view the 9 ClickHouse dictionaries: - bot_ip (3.5K entries): IP/CIDR of known bots - bot_ja4 (31): JA4 fingerprints of bots - browser_ja4 (1.2K): browser JA4 fingerprints → family, TLS lib - asn_reputation (82.5K): ASN → reputation (isp, datacenter, cdn…) - iplocate_asn (714K): IP geolocation → ASN, country, name - anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules Features: - 9 tabs for navigating between the lists - Text search with ClickHouse-side filtering - Pagination (200 entries/page) - Column sorting (ASC/DESC) - Distribution chart (ECharts) by category - Dictionary KPIs at the top of the page - Documentation tooltips API: /api/dictionaries, /api/reflist/{name}, /api/reflist/{name}/stats Helpers: esc() (HTML escape) added to base.html Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 14:56:54 +02:00
toto	039086a0b3	feat: new detection techniques and SOC Tactics page SQL: - Add 5 aggregation columns (count_xff, count_unusual_ct, count_non_std_port, count_login_post, sec_ch_mobile_mismatch) - Expose 5 computed features in view_ai_features_1h - ALTER TABLE migration for existing deployments Bot-detector: - 7 new ML features (has_xff, unusual_content_type_ratio, non_standard_port_ratio, login_post_concentration, sec_ch_mobile_mismatch, true_window_size, window_mss_ratio) - Propagate campaign_id to ml_all_scores (was always -1) - Campaign escalation: HIGH→CRITICAL if the cluster has ≥5 members Dashboard: - SOC Tactics page: brute-force, JA4 rotation, recurrence, real-time alerts; 4 KPIs + 4 panels + doc tooltips - Add global fmtDate() helper - Sidebar navigation updated Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 14:29:18 +02:00
toto	702c0d5edb	feat(dashboard): add JA4 fingerprint and cluster investigation pages - /ja4/{fingerprint} page: 8 KPIs, timeline, threat pie, IP scores table, ASN/geo charts, HTTP logs, AI features — full JA4 investigation - /cluster/{cid} page: 8 KPIs, timeline, threat/JA4/ASN/host charts, member table with bulk classify — full campaign investigation - /api/ja4/{fingerprint} and /api/cluster/{cid} API endpoints - fmtJA4 links now navigate to /ja4/ investigation page - campaigns.html: 'Ouvrir' button links to /cluster/{cid} full page - Fix: double-brace {{param}} in non-f-string queries → single {param} (was causing HTTP 500 on all parameterized ClickHouse queries) - 50 routes total, all tests pass, 0 JS console errors Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 14:05:52 +02:00
toto	70188b508c	fix(dashboard): eliminate @apply CSS, fix status column, fix click propagation Playwright testing revealed 3 critical bugs: 1. Tailwind CDN @apply with custom brand-* colors produces empty CSS rules, breaking ALL design components (kpi-card, data-table, badges, filter-btn, section-card, nav-item). Fix: replace all @apply directives with equivalent raw CSS values. 2. Traffic API and IP detail API reference non-existent 'status' column in http_logs table → HTTP 500 on /traffic and /ip/{ip}. Fix: remove status from SELECT, sort whitelist, filters, and templates. 3. Nested <a> links (fmtJA4, fmtASN, fmtCountry, fmtBotName) inside clickable <tr onclick> capture clicks, preventing row navigation to /ip/ detail. Fix: add event.stopPropagation() to all formatter links. Verified with Playwright: 10 pages × 0 JS errors, all tooltips hidden by default, sidebar toggle works, keyboard shortcuts (Alt+1-9, Alt+B), classification form saves to DB, campaign detail panel opens on click. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 13:54:38 +02:00
toto	6babc55e3e	fix(dashboard): hover tooltips, full-width layout, UX polish - Fix doc tooltips: split CSS into <style type='text/tailwindcss'> for @apply directives + raw CSS for reliable doc panel rendering - Convert doc panels from click-toggle to hover-based tooltips with arrow pointer, fade-in animation, and auto-dismiss on mobile - Replace '?' icons with 'ⓘ' across all 11 templates (51 tooltips) - Full-width layout: reduce padding on mobile (px-3), scale up on desktop (lg:px-5, xl:px-6) for maximum screen utilization - Auto-collapse sidebar on narrow screens (<1024px) - Keyboard shortcuts: Alt+1–9 for page navigation, Alt+B toggle sidebar - Add LEGITIMATE_BROWSER filter button to detections page - Sticky header with stronger blur (backdrop-blur-md) - All 46 routes pass tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 13:30:16 +02:00
toto	63ba6d203c	feat(dashboard): complete SOC dashboard with full monitoring and workflows - models.html: Full rewrite — 6 KPIs, scoring volume timeline, anomaly rate chart, threat breakdown per model, enhanced model cards with validation gate - classify.html: SOC workflow — suggested unclassified IPs, quick-classify buttons, classification stats pie, pre-fill from URL params - traffic.html: Clickable rows → ip_detail, column sorting, status column, search filter, doc tooltips on all chart sections - scores.html: Search input, clickable rows → ip_detail, LEGITIMATE_BROWSER filter button, doc tooltips on distribution + scatter charts - ip_detail.html: Resource cascade section (headless browser detection), status column in HTTP logs table - detections.html: Doc tooltips on threat/reason/ASN chart sections - features.html: Doc tooltips on radar/importance/scatter sections - api.py: 4 new endpoints — /api/models/timeline, /api/models/threats, /api/classify/stats, /api/classify/suggested. Traffic API: status + search. 46 routes total. All tests pass (dashboard + bot-detector 36/36). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 01:25:01 +02:00
toto	396baa90d2	feat(dashboard): HDBSCAN cluster visualization - Dedicated /campaigns page with 4 chart views: · Scatter plot (score vs velocity, bubbles colored by campaign) · Force-directed network graph (IPs linked by shared JA4) · Campaign card grid (KPIs, ASN, country, JA4) · Detail panel (behavioral radar, hourly timeline, member table) - 4 new API endpoints: · GET /api/campaigns (fix: campaign_id >= 0 instead of != '') · GET /api/campaigns/graph (nodes + edges) · GET /api/campaigns/scatter (score/velocity per IP) · GET /api/campaigns/{cid} (detail + profile + timeline) - Sidebar: Campaigns link added under Surveillance - Overview: campaigns clickable → link to /campaigns Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 01:11:16 +02:00
toto	f1547423b5	refactor(bot-detector): remove monolith, multifactorial tests - Remove bot_detector.py (1982 lines), replaced by 11 modules - Browser tests updated for the multifactorial system (browser_confidence) - 36/36 tests pass with the new modular structure Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 01:03:17 +02:00
toto	1f103392ac	refactor(bot-detector): extract monolith into modular package Split bot_detector.py (~1982 lines) into 10 focused modules: - config.py: all configuration constants and optional imports - log.py: logging utilities (log_info, log_decision, append_training_history) - infra.py: ClickHouse client, health check HTTP server, shutdown - browser.py: multifactorial browser identification (5 axes) - scoring.py: drift detection, feature validation, SHAP, clustering - models.py: EIF, Autoencoder, XGBoost model management - preprocessing.py: data preprocessing and feature list definitions - pipeline.py: core semi-supervised scoring loop - cycle.py: main analysis cycle orchestration - __main__.py: entry point with startup banner Update Dockerfile to copy package directory and use python -m bot_detector. All 36 existing tests pass unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 01:02:04 +02:00
toto	2d04288e95	feat(dashboard): SOC workflow overhaul — sidebar nav, doc tooltips, full-width layout - base.html: collapsible sidebar navigation, doc tooltip system, JS helpers (fmtNum, fmtPct, fmtDuration, ecGrid, buildTable, docHTML) - overview.html: SOC command center with stacked timeline, live alerts, campaigns panel, browser donut, 6 KPIs - detections.html: threat color dots, raw score column, click-to-navigate rows - network.html: JA4 rotation, brute-force, persistent threats tables, 6 KPIs - ip_detail.html: ASN/country KPIs, AE/XGB/campaign columns, enriched features - scores/traffic/features/models/classify: page_title blocks + doc tooltips - api.py: 9 new endpoints (campaigns, brute-force, ja4-rotation, recurrence, cascade, alerts, timeline-detail, ua-rotation) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 00:29:34 +02:00
toto	c994ad4466	fix: XGB label query + SHAP isotree compatibility XGB: query was selecting features from ml_all_scores which doesn't store them. Now joins ml_all_scores (labels) with view_ai_features_1h (features). Dynamically discovers available columns to skip thesis §5 features not present in the view. Returns (model, features) tuple. SHAP: TreeExplainer doesn't support isotree. Fall back to permutation- based Explainer(model.decision_function, X_sample) for isotree. Verified: XGB trained on 50000 labels (18436 positives), triple-voice ensemble scoring active (EIF+AE+XGB), SHAP silent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 00:06:54 +02:00
toto	c6666e2bba	fix: isotree score convention — proper sklearn calibration isotree decision_function returns [0,1] (higher=anomalous, 0.5=boundary). The entire pipeline (normalize_scores, score_to_threat_level, compute_adaptive_threshold) expects sklearn convention (negative=anomalous). Previous fix (-raw_scores) negated all values, making everything below -0.30 → all CRITICAL. New fix: 0.5 - isotree_score maps correctly to sklearn's convention: isotree 0.80 → -0.30 (CRITICAL) isotree 0.65 → -0.15 (HIGH) isotree 0.55 → -0.05 (MEDIUM) isotree 0.50 → 0.00 (boundary) Verified: 27,952 LEGITIMATE_BROWSER + 15,843 HIGH + 15,059 MEDIUM Tests: 36/36 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 23:56:05 +02:00
toto	db306fb9da	fix: P0 audit bugs — bot-detector + dashboard + SQL Bot-detector: - B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies - B1.4/B1.5: log_decision argument order fixed (cycle_id, name) - B1.7: AE broadcast error — model now returns features list, scoring uses model's features instead of current cycle's (prevents dim mismatch) - B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name Dashboard: - C1.1: XSS in ip_detail.html — {{ ip \| tojson }} instead of raw string - C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized - C2.1: status filter now correctly filters http_version column - C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days SQL: - B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomal) - B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 23:33:00 +02:00
toto	98289ccf04	fix: ASN dictionary pipeline + verbose bot-detector logging - Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols) - Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format) - Replace org/domain dictGet calls with empty string literals in MV - Full 714K CIDR stub for complete ASN resolution in tests - Add header generation to generate_asn_data.py - Verbose bot-detector stdout: data summary, triage breakdown, model training details, scoring stats, browser classification, boxed results - Fix IPv6 filter in traffic seeder (_ips_from_cidrs) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 17:43:55 +02:00
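
Note on commit c6666e2bba: the score remapping it describes can be sketched as below. This is a minimal illustration, not the repository's code. The function names isotree_to_sklearn and naive_negation are hypothetical, and the only values used are the ones quoted in the commit message (0.5 as the boundary, -0.30 as the CRITICAL cut-off).

```python
def isotree_to_sklearn(score: float) -> float:
    """Map an isotree-style score in [0, 1] (higher = more anomalous,
    0.5 = boundary) onto the sklearn-style convention, where negative
    values are anomalous and 0.0 is the boundary.

    Hypothetical helper name; the formula is the one quoted in the commit.
    """
    return 0.5 - score


def naive_negation(score: float) -> float:
    """The earlier fix described in the commit: plain negation.

    Every score in [0, 1] becomes <= 0, so even a boundary score of 0.5
    lands at -0.5, below the -0.30 cut-off quoted for CRITICAL.
    """
    return -score


if __name__ == "__main__":
    # Example scores taken from the commit message.
    for s in (0.80, 0.65, 0.55, 0.50):
        print(f"isotree {s:.2f} -> shifted {isotree_to_sklearn(s):+.2f}, "
              f"negated {naive_negation(s):+.2f}")
```

Shifting by the 0.5 boundary keeps the ordering of the scores and places the decision boundary at 0.0, which is what the downstream thresholding named in the commit expects. Plain negation preserves the ordering too, but moves the boundary to -0.5.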

72 Commits