Commit Graph

86 Commits

Author SHA1 Message Date
36b5065a0a feat(e2e): add multi-IP endpoint architecture with dedicated traffic VM
Replace single-service-per-endpoint with all-ips mode running nginx, apache,
and hitch+varnish simultaneously on 3 dedicated IPs per VM (eth1 alias IPs).
Add a dedicated traffic VM with curl-impersonate for realistic TLS fingerprints,
parallelized traffic generation, and paired SNI_HOSTS/TARGET_IPS lists for
per-VM per-service hostname identification (e.g. rocky9-nginx-platform.test).

Key changes:
- run-tests-vm.sh: add setup_all_ips(), IP-specific Listen/bind directives
  with reset-before-apply pattern, graceful service availability checks
- run-e2e-test.sh: traffic VM architecture, all-ips mode, eth1 network,
  paired IP/SNI lists, updated cleanup for alias IPs
- generate-traffic.sh: parallel background jobs, curl-impersonate detection,
  auto source interface detection via ip route get, Host header in HTTP traffic
- Vagrantfile: add traffic VM with provision-traffic.sh
- provision-traffic.sh: install curl-impersonate and httpx for traffic gen
- test-rpm.sh: multi-interface TC check, updated ja4ebpf config
- clickhouse-init.sh: load CSV stubs for Anubis/bot-networks dictionaries
- Remove obsolete correlator/sentinel/mod-reqin-log docs
- Add h2_settings_ack column to http_logs schema
- Upgrade Go toolchain to 1.25.0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 14:25:24 +02:00
f0c8fe81c6 feat(ja4ebpf): add multi-interface TC, LPM_TRIE ignore_src, unit tests, and fix bugs
- Add multi-interface TC attachment (default "any" = all UP interfaces)
- Add BPF LPM_TRIE map ignored_src for kernel-side CIDR filtering
- Add userspace ignore_src filtering for SSL/accept4 path via net.IPNet.Contains()
- Add AcceptCache for fd→SessionKey correlation with TTL and Close()
- Add 5 test files covering writer, procutil, dispatcher, accept_cache, and cmd
- Fix formatTCPOptions infinite loop on EOL (case 0 break→return)
- Fix pseudoOrderToShort panic on empty slice (negative cap)
- Fix AcceptCache goroutine leak (add done channel + Close())
- Update config.yml.example with interfaces, listen_ports, ignore_src
- Rewrite docs/services/ja4ebpf.md (was massively stale: XDP, RingBuffer, etc.)
- Fix stale XDP/RingBuffer references in docs/architecture.md, thesis, tls.go
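
The userspace ignore_src check above relies on Go's net.IPNet.Contains(). As a rough illustration only, the same CIDR containment test can be sketched in Python; the CIDR values and the `is_ignored` name are hypothetical and not taken from the repository.

```python
import ipaddress

# Hypothetical ignore list; real values would come from the ignore_src config key.
IGNORE_SRC = [ipaddress.ip_network(cidr) for cidr in ("10.0.0.0/8", "192.0.2.0/24")]


def is_ignored(src_ip: str) -> bool:
    """Return True when src_ip falls inside any ignored CIDR block."""
    addr = ipaddress.ip_address(src_ip)
    # Containment across IP versions simply evaluates to False.
    return any(addr in net for net in IGNORE_SRC)
```

A linear scan like this is fine for a short list; the kernel-side LPM_TRIE map does the equivalent lookup by longest prefix match.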

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 01:49:26 +02:00
fd84aebc44 fix(ja4ebpf): correct HPACK static table per RFC 7541 and decode indexed representations
The HPACK static table was completely wrong from index 15 onwards — entries
were shifted and missing, causing all header name lookups to return wrong
names (e.g. index 19 returned "cookie" instead of "accept"). Rewrite the
entire table as hpackStaticEntry{Name,Value} structs matching RFC 7541 Appendix
A (indices 1-61) plus browser extensions (62-100). Fix DecodeH2HeadersBlock to
properly decode fully-indexed representations (6.1) which were silently dropped
before — now both name and value are extracted from the static table entry.
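
A minimal sketch of decoding fully indexed representations (RFC 7541 section 6.1), written in Python rather than the project's Go. The table below is deliberately partial (a few Appendix A entries only) and the function names are illustrative, not the ones used in ja4ebpf.

```python
# Partial static table from RFC 7541 Appendix A; the full table has 61 entries.
STATIC_TABLE = {
    1: (":authority", ""),
    2: (":method", "GET"),
    3: (":method", "POST"),
    4: (":path", "/"),
    5: (":path", "/index.html"),
    6: (":scheme", "http"),
    7: (":scheme", "https"),
    8: (":status", "200"),
    19: ("accept", ""),
    32: ("cookie", ""),
}


def decode_prefix_int(data: bytes, pos: int, prefix_bits: int) -> tuple[int, int]:
    """Decode an HPACK prefix-coded integer (RFC 7541 section 5.1)."""
    mask = (1 << prefix_bits) - 1
    value = data[pos] & mask
    pos += 1
    if value < mask:
        return value, pos
    shift = 0
    while True:  # continuation bytes carry 7 bits each, least significant first
        byte = data[pos]
        pos += 1
        value += (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def decode_indexed_fields(block: bytes) -> list[tuple[str, str]]:
    """Decode leading indexed fields; stop at any other representation."""
    headers, pos = [], 0
    while pos < len(block) and block[pos] & 0x80:
        index, pos = decode_prefix_int(block, pos, 7)
        entry = STATIC_TABLE.get(index)
        if entry is None:  # index outside this partial table
            break
        headers.append(entry)  # both name and value come from the table
    return headers
```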

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 15:24:04 +02:00
0975d40609 feat(ja4ebpf): add dst_ip/dst_port to TLS and HTTP plain events for complete L3/L4
Add dst_ip and dst_port fields to tls_hello_event BPF struct and populate
them in tc_capture.c. Update Go TLS event handler with new byte offsets
(payload[2048]+src_ip(4)+dst_ip(4)+src_port(2)+dst_port(2)+payload_len(2)+
timestamp_ns(8) = 2070 bytes). Read dst_ip/dst_port from HTTP plain events
and use them to populate L3L4 when SYN was not captured, ensuring dst_ip
and dst_port are always available in ClickHouse for both TLS and HTTP sessions.
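
The 2070-byte total can be cross-checked with a short Python sketch. It assumes the field order listed above, a packed struct (no padding), and a little-endian host; the byte order of the port fields is an assumption, and `parse_tls_hello_event` is a hypothetical helper name.

```python
import struct

# "<" means little-endian with no padding, which matches a packed C struct.
TLS_HELLO_EVENT = struct.Struct("<2048s4s4sHHHQ")

FIELDS = ("payload", "src_ip", "dst_ip", "src_port",
          "dst_port", "payload_len", "timestamp_ns")


def parse_tls_hello_event(raw: bytes) -> dict:
    """Split one raw event into named fields; reject short buffers."""
    if len(raw) < TLS_HELLO_EVENT.size:
        raise ValueError(f"event too short: {len(raw)} < {TLS_HELLO_EVENT.size}")
    return dict(zip(FIELDS, TLS_HELLO_EVENT.unpack_from(raw)))
```

Without packing, a C compiler would align timestamp_ns to 8 bytes and the total would no longer be 2070, so the offsets only hold if the struct is declared packed.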

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 14:31:46 +02:00
65d833bb18 fix(ja4ebpf): correct BPF struct byte offsets and regenerate SSL_write programs
Fix two critical offset bugs introduced when ip_total_length was added to
tcp_syn_event: tcp_options_raw offset 21→23 and tcp_options_len offset 61→63,
plus minimum size check 70→72. Fix ssl_data_event direction field offset from
4118 (inside timestamp_ns) to 4126. Simplify attachSSLWrite to use generated
objects directly instead of dynamic spec loading. Regenerate BPF objects with
SSL_write uprobe programs included.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 14:06:28 +02:00
24306ef390 feat(ja4ebpf): add SSL_write uprobe, HPACK decoder, and AcceptCache for session correlation
Add uprobe_ssl_write_entry/uretprobe_ssl_write_exit to capture server HTTP
responses via SSL_write with direction=1. Implement full HPACK decoder
(RFC 7541 static table, multi-byte integers, literal representations) for
HTTP/2 header extraction. Add AcceptCache mapping {tgid,fd}→SessionKey
from accept4 events as authoritative source for SSL correlation when BPF
ssl_conn_map has src_ip=0. Add ip_total_length to tcp_syn_event BPF struct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:34:43 +02:00
a02423fd18 feat: maximize data completeness across L3/L4/TLS/HTTP layers and add E2E test infra
Add SSL_write uprobe for HTTP response capture, HPACK decoder for HTTP/2
header extraction, and AcceptCache for reliable SSL/TC session correlation.
Populate all ClickHouse fields including tcp_meta_options, ip_meta_total_length,
syn_to_clienthello_ms, client_headers, TLS cipher suites/extensions, and
h2_enable_connect_protocol. Increase BPF capture buffers (HTTP 512B, TLS 1024B).
Add distributed E2E testing infrastructure with multi-VM Vagrant setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:34:33 +02:00
e25caa85da fix(ja4ebpf): remove double bswap16 on accept4 port
The manual byte assembly (sa_buf[2]<<8 | sa_buf[3]) already produces
a host-byte-order port value; __builtin_bswap16 was swapping it again,
causing SSL events to use wrong source ports and preventing TLS/HTTP
session correlation.
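
The effect of the extra swap can be reproduced in a few lines of Python. The buffer below is a hypothetical sockaddr_in prefix carrying port 443 in network byte order.

```python
def bswap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value, like __builtin_bswap16."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


# Bytes 0-1: address family; bytes 2-3: port 443 (0x01BB) in network order.
sa_buf = bytes([0x02, 0x00, 0x01, 0xBB])

port = (sa_buf[2] << 8) | sa_buf[3]   # manual assembly already yields 443
double_swapped = bswap16(port)        # the buggy extra swap yields 47873
```

Because the shift-and-or assembly is independent of host endianness, no further conversion is needed after it.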

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 02:57:29 +02:00
61addc8cfa feat: JA3 fingerprinting, SSL correlation fix, ML pipeline overhaul, E2E test infra
ja4ebpf:
- Add JA3 raw + MD5 hash fingerprinting (ComputeJA3 in TLS parser)
- Fix accept4 port double-swap bug (__builtin_bswap16 on already-host-order value)
- Fix scheme override bug in ClickHouse writer (HTTP block clearing HTTPS)
- Add HTTP/2 passive fingerprinting (Akamai H2 FP, SETTINGS, pseudo-header order)
- Enrich ClickHouse schema with IP/TCP metadata, H2 settings, Sec-* headers
- Ensure maximum data completeness: all available L3/L4, TLS, HTTP fields emitted

bot-detector:
- Replace logistic regression with MLP fusion classifier
- Replace KS drift detection with ADWIN online learning
- Replace NetworkX/Louvain with PyTorch Geometric GraphSAGE for fleet detection
- Replace autoencoder with RealNVP normalizing flow + SessionTransformer embeddings

infra:
- Add distributed E2E test infrastructure (4 VMs: endpoints + analysis)
- Add Vagrant provisioning for analysis VM, e2e Makefile targets, run scripts

docs:
- Restructure thesis into chapter files with corrected references
- Add E2E testing documentation
- Update architecture, schema, deployment, service docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 02:57:07 +02:00
f88b739992 feat(e2e): add distributed E2E test framework with parametric traffic generation
Add run-e2e-test.sh with CLI parameters (--hits, --http-ratio, --dns, --tls,
--src-ips, --keep-analysis, --up) for configurable traffic generation. Traffic
runs from VM endpoints with multiple source IPs (alias IPs on eth0) to produce
distinct sessions for the ML pipeline. Fix curl TLS flags (--tlsv1.2 instead
of --tls-v1-2), skip redundant local verification in distributed mode, and
fix dashboard is_available() cache that never retried after ClickHouse recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 00:09:32 +02:00
7894d39f1c feat(ml): replace logistic regression with MLP fusion and KS drift with ADWIN online learning
Replace the LogisticRegression meta-learner with a PyTorch MetaFusionMLP
(Linear(3,16)->BN->ReLU->Dropout->Linear(16,1)->Sigmoid) for non-linear
fusion of EIF, NF, and XGBoost scores. Replace KS-test + quantile digest
drift detection with ADWIN (adaptive sliding window, Hoeffding bound).
Replace weekly XGBoost batch retraining with River HoeffdingAdaptiveTree
for incremental online learning (learn_one per cycle). Update all thesis
documentation sections (2.4.2c, 2.4.3, 3.8, discussion, conclusion).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 16:32:34 +02:00
c6cb12981c feat(ml): replace NetworkX/Louvain with PyTorch Geometric GraphSAGE for fleet detection
Rewrite fleet.py to use a GNN-based approach: nodes are src_ip with ML feature
vectors, edges connect IPs sharing (JA4, ASN) pairs, GraphSAGE (2 SAGEConv
layers, in→64→32) produces 32D embeddings clustered by HDBSCAN. PyG NeighborLoader
activates for >50k nodes. Update thesis docs (§5.2, §6.4, §2, §8) to reflect
GraphSAGE architecture and PyG scalability.
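
A minimal sketch of the edge rule described above (two IPs are connected when they share a (JA4, ASN) pair), using only the standard library. The `build_edges` name and the tuple layout are assumptions; the actual fleet.py presumably builds tensors for PyTorch Geometric instead.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(sessions):
    """sessions: iterable of (src_ip, ja4, asn). Returns sorted undirected edges."""
    groups = defaultdict(set)
    for src_ip, ja4, asn in sessions:
        groups[(ja4, asn)].add(src_ip)
    edges = set()
    for ips in groups.values():
        # Every pair of IPs inside one (ja4, asn) group gets an edge.
        edges.update(combinations(sorted(ips), 2))
    return sorted(edges)
```

Large groups produce a quadratic number of edges, which is one reason neighbor sampling becomes useful on big graphs.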

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 15:45:34 +02:00
c1821dcbc4 feat(ml): replace Autoencoder with RealNVP Normalizing Flow and add SessionTransformer embeddings
Replace TrafficAutoEncoder (MSE reconstruction scoring) with TrafficNormalizingFlow
(RealNVP via FrEIA, 4 affine coupling blocks, anomaly score = -log p(x)) for
mathematically rigorous density estimation. Add SessionTransformer module producing
32-dimensional sequence embeddings from raw HTTP request sequences (path, method,
timing) via a lightweight TransformerEncoder, replacing path_transition_entropy and
cadence_cv features. Update thesis documentation sections 2.4.2b and 3.8 accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 15:11:21 +02:00
ac75ce2956 chore: remove regenerable data and build artifacts from git tracking
Add .gitignore rules for generated CSV data, eBPF compiled objects,
and vmlinux.h header. Remove 19 tracked files (~175 MB) that can be
regenerated from scripts (generate_*.py), bpftool, or bpf2go.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 02:31:27 +02:00
6e5eb38efd docs: update thesis and docs with Cleanlab label filtering integration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 02:19:46 +02:00
9d27abf43c fix(ml): integrate Cleanlab to filter noisy SOC labels and prevent model poisoning
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 02:11:25 +02:00
c60ce97f23 feat(bot-detector): add dynamic browser profiling engine with HDBSCAN clustering
Implement offline profile building (profile_builder.py) and real-time
dynamic scoring (browser_matcher_dynamic.py) using HDBSCAN-based browser
fingerprint clustering. Add ClickHouse materialized view (13_h2_profiling.sql)
for h2_profile_stats aggregation. Update thesis and project documentation
to cover the new dynamic profiling architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 02:06:00 +02:00
d75825278e feat: multi-distro VM tests, ja4ebpf eBPF improvements, bot-detector scoring
ja4ebpf:
- Refactor BPF TC capture with improved SYN offset handling and TCP option parsing
- Enhance TLS uprobe SSL hooking for better key extraction
- Add ClickHouse writer improvements for HTTP log materialized views
- Update RPM spec for Rocky Linux 8/9/10, fix systemd service
- Simplify loader with cleaner bpf2go integration

bot-detector:
- Add H2 SETTINGS per-parameter comparison in browser_matcher
- Enhance browser signatures and scoring pipeline
- Improve preprocessing and cycle detection

infra:
- Multi-distro Vagrantfile (centos8, rocky9, rocky10) with per-distro provisioning
- New Makefile targets: vm-up-all, test-vm-matrix, test-vm-centos8/rocky10
- Add debug helpers and run-test-from-host.sh for host-driven VM testing
- Update run-tests-vm.sh for cross-distro compatibility
- Remove accidental binary blob (\004)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 01:09:33 +02:00
957918c565 fix(ja4ebpf): Rocky Linux RPM builder, remove correlated field, fix thesis
- Dockerfile.package: migrate go-builder from golang:bookworm (Debian) to
  rockylinux:9, install Go from the official tarball, replace apt with
  dnf (clang llvm libbpf-devel bpftool)

- Remove the 'correlated' field from the ja4ebpf agent: with eBPF/XDP,
  L3/L4↔L7 correlation is always implied by the presence of the fields.
  Removed from: session.go, manager.go, main.go (x5), clickhouse.go

- Thesis (6 listed corrections + consistency of 'correlated'):
  1. §3.5 + §3.9.1: SSL_read returns raw bytes without respecting H2
     frame boundaries → circular reassembly buffer in Go userspace
  2. §3.1: removed libpcap + CAP_NET_RAW, replaced with a uprobe definition
  3. §4 + §7: exact count of 96 features in 8 families (Family 1–8),
     removed the obsolete F1–F11 taxonomy, all totals updated
  4. §2.4 + §8: replaced 7 bogus arXiv URLs with [Référence à vérifier]
     ("reference to be verified")
  5. §4 Family 2: ja4_drift_ratio → cross-reference to Family 8 (full definition)
  6. §6.4: added the limitation 'SSL_read uprobe overhead'
  + §3.6: removed correlated=0/1 from the architecture text

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-12 04:48:40 +02:00
b1218a2367 fix(ja4ebpf): fix TLS capture, SYN offsets, TCP option parsing
- Increase MAX_TLS_PAYLOAD from 512 to 2048 bytes to capture full
  TLS ClientHellos (modern browsers/curl send 1000-1543 byte ClientHellos)
- Fix ParseClientHello to tolerate XDP-truncated payloads: clamp
  recordLength and chLen to available data instead of returning error
- Fix cipher suites, compression, extensions truncation to use clamping
- Fix consumeSynEvents struct field offsets: dst_ip (4 bytes at offset 4)
  was not accounted for, causing all L3/L4 metadata to be read from
  wrong positions (TTL was actually dst_ip[0], windowSize was dst_port, etc.)
- Add parseTCPOptions() to extract MSS and Window Scale from raw TCP options
  (C code sets defaults of mss=0, window_scale=0xFF, expects Go to parse)
- Fix consumeAcceptEvents: skip zero-IP events to avoid phantom sessions
- Fix consumeSSLEvents: filter zero-IP/port events when proc fallback fails
- Add missing consumeHTTPPlainEvents goroutine (was defined but never called)
- Fix race condition: SYN consumer sets Correlated=true if TLS already present
- Update tls_hello_event struct offsets in Go consumer (payload_len now at
  offset 2054, was 518, due to payload array growing from 512 to 2048 bytes)
- Remove debug logging from consumers and GC
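
For illustration, the MSS and Window Scale extraction can be sketched as follows in Python. The defaults mirror the ones named above (mss=0, window_scale=0xFF); the function name and bounds handling are assumptions, not a copy of the Go parseTCPOptions().

```python
def parse_tcp_options(raw: bytes) -> tuple[int, int]:
    """Return (mss, window_scale) from raw TCP option bytes."""
    mss, window_scale = 0, 0xFF
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:            # End of Option List: stop parsing
            break
        if kind == 1:            # NOP is a single byte with no length field
            i += 1
            continue
        if i + 1 >= len(raw):    # truncated before the length byte
            break
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):  # malformed or truncated option
            break
        if kind == 2 and length == 4:
            mss = int.from_bytes(raw[i + 2:i + 4], "big")
        elif kind == 3 and length == 3:
            window_scale = raw[i + 2]
        i += length
    return mss, window_scale
```

The length check guards against a zero-length option, which would otherwise keep the index from advancing.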

E2E verified: HTTP plain (port 80) and HTTPS (port 443) both produce
fully correlated sessions in ClickHouse with correct:
  - ip_meta_ttl=64, ip_meta_df=true, ip_meta_id
  - tcp_meta_window_size=64240, tcp_meta_window_scale=10, tcp_meta_mss=1460
  - ja4=t13i3010_1d37bd780c83_95d2a80e6515
  - tls_alpn=http/1.1
  - method=GET, path=/, header_order_signature=Host;User-Agent;Accept
  - correlated=1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-12 04:16:44 +02:00
f85a10b012 feat: complete L7 HTTP pipeline + VM test infrastructure
L7 pipeline fixes (SSL_read uprobe):
- uprobe_ssl.c: ssl_set_fd no longer returns early when fd_conn_map is
  empty (accept4 unavailable in Docker). Saves ssl_ptr→{fd,0,0}
  to enable the /proc fallback on the Go side.
- main.go: consumeSSLEvents rewritten with a complete magic-bytes router:
  * HTTP/2 preface → SETTINGS extraction + conversion to correlation.HTTP2Settings
  * HTTP/1.x request → method, path, query, headers, header_order_sig
  * HTTP/1.x response → status_code
  * /proc/<tgid>/fd/<fd> fallback when src_ip=0 (accept4 absent)
- writer/clickhouse.go: added header_order_signature export
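
A rough Python sketch of a magic-bytes router of the kind described above. The method list and return labels are illustrative assumptions; only the HTTP/2 client preface string is fixed by the protocol.

```python
# HTTP/2 client connection preface (RFC 9113 section 3.4).
H2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

# Common HTTP/1.x methods followed by a space; not an exhaustive list.
HTTP1_METHODS = (b"GET ", b"POST ", b"PUT ", b"HEAD ", b"DELETE ",
                 b"OPTIONS ", b"PATCH ", b"CONNECT ", b"TRACE ")


def classify(buf: bytes) -> str:
    """Classify a decrypted buffer by its leading bytes."""
    if buf.startswith(H2_PREFACE):
        return "h2_preface"
    if buf.startswith(b"HTTP/1."):
        return "http1_response"
    if buf.startswith(HTTP1_METHODS):
        return "http1_request"
    return "unknown"
```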

New packages:
- internal/parser/http1.go: HTTP/1.x parser (IsHTTP1Request,
  ParseHTTP1Request, IsHTTP1Response, ParseHTTP1Response)
- internal/parser/http1_test.go: 11 unit tests (28 total pass)
- internal/procutil/proc_lookup.go: fd→IP resolution via /proc with a
  5s TTL cache (FDCache). Supports /proc/PID/net/tcp and tcp6, IPv4-mapped IPv6.
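
As an illustration of the /proc address format involved, the sketch below parses one IPv4 address field from /proc/PID/net/tcp and unwraps an IPv4-mapped IPv6 address. It assumes a little-endian host (the kernel prints the address as a native 32-bit word); both function names are hypothetical and the cache is left out.

```python
import ipaddress
import socket
import struct


def parse_proc_tcp_addr(field: str) -> tuple[str, int]:
    """Parse 'HEXIP:HEXPORT' as found in /proc/PID/net/tcp (IPv4 only)."""
    ip_hex, port_hex = field.split(":")
    # Re-pack the word little-endian to recover the on-wire byte order.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def unmap_ipv4(addr: str) -> str:
    """Return the IPv4 form of an IPv4-mapped IPv6 address, else addr unchanged."""
    parsed = ipaddress.ip_address(addr)
    mapped = getattr(parsed, "ipv4_mapped", None)
    return str(mapped) if mapped is not None else addr
```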

VM test infrastructure (tests/vm/):
- Vagrantfile: Rocky Linux 9 KVM VM, 4 CPUs / 4 GB RAM
- provision.sh: installs the eBPF toolchain + Go + Docker + nginx
- run-tests-vm.sh: full test suite inside the VM (L3/L4+TLS+L7)
- README.md: installation and usage guide
- Makefile: targets vm-up, vm-down, vm-ssh, test-vm-nginx, test-vm-all,
  vm-rebuild-ja4ebpf

Docker stack fixes:
- nginx/apache/nginx-varnish/hitch-varnish Dockerfiles: removed
  references to shared/go/ja4common/ (directory deleted)
- clickhouse-init.sh: restored from git, removed the obsolete anubis_ua_rules
  seed (REGEXP_TREE table dropped from the schema)
- traffic-gen: added HTTP/1.0 (http.client) and HTTP/2 (httpx)
- verify_db.py: 35-check verification script (L3/L4/TLS/L7/correlation)
- run-stack-tests.sh: added phase 6 verify_db

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-12 02:37:00 +02:00
9734e21fe3 chore: remove obsolete services (sentinel, correlator, mod-reqin-log)
Replaced by the ja4ebpf agent (eBPF CO-RE). Full cleanup:

Removed:
- old/ (archive of the previous architecture)
- services/correlator/ (Go logcorrelator)
- services/sentinel/ (Go pcap capture)
- services/mod-reqin-log/ (Apache C module)
- shared/go/ja4common/ (shared Go lib, no longer imported by ja4ebpf)
- tests/integration/platform/ (correlator+sentinel+httpd test)
- tests/integration/docker-compose.yml (compose file for the old architecture)
- tests/integration/run-tests.sh (correlator/sentinel runner)
- tests/integration/verify_mvs.py (orphaned script)

Cleaned up:
- go.work: drop ./shared/go/ja4common
- services/ja4ebpf/go.mod: drop the ja4common replace directive (never imported)
- services/ja4ebpf/Dockerfile*: drop the unnecessary COPY ja4common lines
- Makefile: drop test-ja4common-python, test-integration*, obsolete targets
- tests/integration/README.md: rewritten for the ja4ebpf architecture

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-12 01:48:14 +02:00
dc6ffd6474 fix: integration test matrix (procps-ng, varnish h2, hitch ALPN, pgrep→ps)
- Add procps-ng to the 4 runtime Dockerfiles (ps/pgrep available)
- Replace pgrep with ps -C in all run-tests.sh scripts
- Fix nginx-varnish entrypoint: pgrep nginx → cat nginx.pid (exit 127)
- Enable HTTP/2 in Varnish: add -p feature=+http2 to the
  nginx-varnish and hitch-varnish entrypoints
- Restore ALPN h2,http/1.1 in hitch.conf (varnish now supports h2)
- Fix hitch-varnish healthcheck: curl without --http1.1 (h2 works)
- Fix phase_verify queries: http_logs_raw → http_logs, correct columns
- Fix clickhouse.go writer: JSON names aligned with the MV (ip_meta_*, tls_sni…)
- Fix toStartOfSecond(DateTime) → toStartOfSecond(toDateTime64(col, 3))
- Remove the el8/nginx-varnish SKIP (varnish installs fine on AlmaLinux 8)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-12 01:29:01 +02:00
3b047b680a fix(ja4ebpf): split bpf2go generate into Ja4Tc + Ja4Ssl, fix RPM systemd-rpm-macros
- Use two separate //go:generate directives (Ja4Tc for tc_capture.c, Ja4Ssl
  for uprobe_ssl.c) to avoid duplicate LICENSE symbol and multi-file clang issue
- Update loader.go to hold tcObjs/sslObjs separately with correct field names:
  UprobeSslSetFd, UprobeSslReadEntry, UretprobeSslReadExit,
  KprobeAccept4Entry, KretprobeAccept4Exit
- Add systemd-rpm-macros to all three RPM build stages (el8/el9/el10)
  so that %{_unitdir} macro resolves correctly
- RPMs now build successfully for el8, el9, el10

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 23:21:11 +02:00
a1e4c1dad5 feat: add ja4ebpf service — eBPF-based TLS/TCP fingerprinting daemon
- TC ingress hook captures TCP SYN (L3/L4) and TLS ClientHello
- Uprobes on SSL_read/SSL_set_fd capture decrypted TLS data
- Kprobes on accept4 correlate socket FDs to client IP:port
- JA4 fingerprint computed from parsed TLS ClientHello
- HTTP/2 SETTINGS and WINDOW_UPDATE extracted from decrypted streams
- Session manager with sharded map (256 shards) and GC goroutine
- Slowloris detection: sessions with no requests after 10s threshold
- ClickHouse batch writer to ja4_logs.http_logs_raw (raw_json)
- All tests pass: 17 parser + 10 correlation tests
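
The sharding and Slowloris rule above can be sketched roughly as follows in Python. Only the 256-shard count and the 10s threshold come from the commit; the `Session` fields, the CRC32 hash choice, and the function names are assumptions.

```python
import zlib
from dataclasses import dataclass

NUM_SHARDS = 256              # shard count from the commit message
SLOWLORIS_THRESHOLD_S = 10.0  # threshold from the commit message


@dataclass
class Session:
    created_at: float         # seconds, e.g. from time.monotonic()
    request_count: int = 0


def shard_index(session_key: str) -> int:
    """Pick a shard for a session key (hash function is a placeholder choice)."""
    return zlib.crc32(session_key.encode()) % NUM_SHARDS


def is_slowloris(session: Session, now: float) -> bool:
    """Flag sessions that are older than the threshold and sent no request."""
    return (session.request_count == 0
            and now - session.created_at > SLOWLORIS_THRESHOLD_S)
```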

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 22:43:26 +02:00
7eb3ad21fd feat(dashboard): show individual H2 SETTINGS in the mismatch table
- /api/browser-signatures: top_mismatches now includes the 7 individual
  SETTINGS columns (h2_header_table_size, h2_enable_push,
  h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size,
  h2_max_header_list_size, h2_enable_connect_protocol)
- stats: add sessions_with_priority (countIf h2_priority_present > 0)
- browsers.html: compact SETTINGS column in the suspects table
  (format '3:100, 4:65536, 2:0': Akamai IDs with non-null values)
- The pseudo-priority counter uses the real sessions_with_priority value
  instead of displaying a placeholder dash

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 03:11:17 +02:00
f704541f83 feat(h2): direct per-parameter SETTINGS comparison in browser_matcher
- Rewrote _d1_h2_settings() with 3-signal weighted formula:
  direct_score×0.60 + dict_match×0.30 + ja4_coherence×0.10
  when individual SETTINGS cols are available in the DataFrame
- Added _H2_SETTINGS_COLS dict (IDs 1,2,3,4,5,6,8 → column names)
- Fallback to dict_match×0.80 + ja4_coherence×0.20 for backward compat
- Fix view_ai_features_1h: pass 7 individual SETTINGS columns through
  base_data CTE (h2_header_table_size, h2_enable_push,
  h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size,
  h2_max_header_list_size, h2_enable_connect_protocol)
- Remove non-existent h2_dict_confidence reference from view SQL
  (dict_browser_h2 only exposes browser_family attribute)
- Add 7 new pytest cases: exact match, one wrong setting, forbidden key
  penalty, unknown fingerprint with correct settings, fallback path,
  CDN proxy neutralisation, full Chrome simulation
- 53/53 bot-detector tests pass
- Update thesis §3.9.2: document direct comparison algorithm + fallback
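
The weighted formula and its fallback can be written out as a short Python sketch. The weights are the ones listed above; the signature is an assumption and does not reproduce _d1_h2_settings(), which operates on a DataFrame.

```python
from typing import Optional


def d1_h2_settings(dict_match: float, ja4_coherence: float,
                   direct_score: Optional[float] = None) -> float:
    """Combine the H2 SETTINGS signals into one score in [0, 1]."""
    if direct_score is None:
        # Fallback path: individual SETTINGS columns are not available.
        return dict_match * 0.80 + ja4_coherence * 0.20
    return direct_score * 0.60 + dict_match * 0.30 + ja4_coherence * 0.10
```

Both weight sets sum to 1.0, so the score stays in the same range whichever path is taken.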

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 03:05:36 +02:00
85d3b95b7b feat: HTTP/2 passive fingerprinting with individual SETTINGS fields
Complete implementation of HTTP/2 passive fingerprinting per thesis §2.5.3:

mod-reqin-log (C module):
- Replace connection-level filter with ap_hook_process_connection (APR_HOOK_FIRST)
  to capture H2 preface before mod_http2 takes over the connection
- AP_MODE_SPECULATIVE read of 512 bytes from c->input_filters
- Parse SETTINGS, WINDOW_UPDATE, PRIORITY flags, pseudo-header order
- Output individual SETTINGS params as separate JSON fields (IDs 1-6, 8)
- Read H2 notes from c1 (master connection) for mod_http2 secondary conns
- Fix header_order_signature JSON length bug (26→strlen)

ClickHouse schema:
- Add 8 new columns to http_logs: h2_has_priority, h2_header_table_size,
  h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size,
  h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol
- Use Int32/Int64 with DEFAULT -1 to distinguish absent vs zero
- Update mv_http_logs to extract individual fields via JSONHas/JSONExtractInt
- Migration 04_http2_fields.sql updated for existing deployments
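
To illustrate why -1 is used as the default, here is a Python sketch that maps a raw SETTINGS frame payload (6 bytes per setting: 16-bit identifier, 32-bit value) onto the columns named above. The ID-to-column mapping follows the standard HTTP/2 setting identifiers; the function name is hypothetical, and the actual extraction was done in C and in the materialized view.

```python
import struct

# Standard HTTP/2 SETTINGS identifiers mapped to the column names above.
SETTINGS_COLUMNS = {
    1: "h2_header_table_size",
    2: "h2_enable_push",
    3: "h2_max_concurrent_streams",
    4: "h2_initial_window_size",
    5: "h2_max_frame_size",
    6: "h2_max_header_list_size",
    8: "h2_enable_connect_protocol",
}


def settings_to_columns(payload: bytes) -> dict:
    """Return one value per column; -1 means the setting was not sent."""
    columns = {name: -1 for name in SETTINGS_COLUMNS.values()}
    usable = len(payload) - len(payload) % 6  # ignore a trailing partial entry
    for offset in range(0, usable, 6):
        ident, value = struct.unpack_from(">HI", payload, offset)
        name = SETTINGS_COLUMNS.get(ident)
        if name is not None:
            columns[name] = value
    return columns
```

With this convention, ENABLE_PUSH=0 sent explicitly stays distinguishable from a client that never sent ENABLE_PUSH at all.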

Correlator:
- Accept both timestamp_ns and timestamp field names (backward compat)

Integration:
- Enable HTTP/2 in Apache: Protocols h2 http/1.1 in httpd-integration.conf

Validated end-to-end via Playwright: H2 curl traffic → mod-reqin-log →
correlator → ClickHouse with all 12 H2 columns populated correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 02:33:45 +02:00
d098de1a66 fix(bot-detector): neutralize H2 dimensions behind proxy (X-Forwarded-For)
When has_xff=1, the H2 connection is terminated by the reverse proxy/CDN,
so client H2 fingerprints are lost. Previously only D1 (h2_settings) was
neutralized; D2 (window_update), D3 (pseudo_order), and D4 (priority)
still penalized proxied traffic — a real Chrome behind Cloudflare scored
0.0 on 3 dimensions (45% of total weight).

Now all 4 H2 dimensions return 0.5 (neutral) when has_xff>0, and
non-browser H2 detection is also disabled behind proxies.
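
A minimal sketch of the neutralization rule, assuming each H2 dimension yields a raw score in [0, 1]. The function name is hypothetical.

```python
NEUTRAL_SCORE = 0.5


def h2_dimension_score(raw_score: float, has_xff: int) -> float:
    """Return the neutral score behind a proxy, else the raw dimension score."""
    # Behind a proxy the H2 connection belongs to the proxy, not the client,
    # so the observed fingerprint carries no information about the client.
    return NEUTRAL_SCORE if has_xff > 0 else raw_score
```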

Tests: 10/10 passed including 3 new XFF-specific cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:15:20 +02:00
261205028d fix(dashboard): campaigns scatter chart — show campaigns not IPs
- API /api/campaigns/scatter: aggregate by campaign_id instead of per-IP
  Returns avg_score, avg_velocity, unique_ips, ja4_list, asn_list, country_list
- Template: one bubble per campaign, sized by IP count
- Tooltip: campaign-level info (IPs, score, velocity, ASNs, countries, JA4s)
- Click navigates to campaign detail (not IP detail)
- Updated doc panel text

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:09:02 +02:00
fb73c60e7d feat(dashboard): fingerprint discovery page — extract and group JA4/H2/headers from traffic
- GET /api/fingerprint-discovery: queries http_logs, groups by JA4, aggregates
  UA family, header presence rates (Sec-CH-UA, Sec-Fetch, Accept-Language,
  zstd, brotli, gzip, XFF), H2 data, TLS info, dict lookups
- /fingerprints page: KPIs, doughnut chart by family, stacked header bars,
  filterable/sortable profile table, expandable detail panel
- Promote button: push H2 fingerprints to browser_h2_signatures via existing
  POST /api/browser-signatures/entries endpoint
- Nav link: Découverte added after Navigateurs in sidebar

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:02:53 +02:00
fde6864311 feat(dashboard): browser signatures management UI
- Add dict_browser_h2 to /reflists (read-only via dict_browser_h2)
- New API endpoints:
    GET  /api/browser-signatures/entries: list browser_h2_signatures
         (falls back to the CSV dict if migration 06 has not been applied)
    POST /api/browser-signatures/entries: add a fingerprint + reload dict
    DELETE /api/browser-signatures/entries: delete + reload dict
- /browsers page: 2 new sections
    'Base de signatures H2' (H2 signature database): table of the 10
    fingerprints, add form, automatic read-only mode if migration 06 has
    not been applied
    'Règles de scoring browser_matcher.py' (scoring rules): static table of
    the 7 dimensions (weights, per-family values, bypass thresholds)
- Integration: browser_h2.csv copied into user_files at ClickHouse startup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:46:07 +02:00
da1b579d4f fix(dashboard): rename duplicate /api/browsers route to /api/browser-signatures
The /api/browsers route already existed (JA4 distribution by family).
The new browser_matcher route conflicted with it, and FastAPI used
the first definition. Renamed to /api/browser-signatures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:17:38 +02:00
9c308747bd feat(dashboard): Browser Signature Detection page (/browsers)
New page dedicated to passive analysis of browser signatures (§4):

API (GET /api/browsers):
  Queries view_ai_features_1h for:
  - Global counters (total, sessions_with_h2, matched, mismatch %)
  - h2_dict_family distribution (Chrome/Firefox/Safari/Edge)
  - Breakdown of WINDOW_UPDATE signals (chrome/firefox/safari/absent/other)
  - TLS↔H2 mismatch per JA4 family (total + count + %)
  - Top 20 suspicious sessions (tls_h2_family_mismatch=1, sorted by hits)

/browsers page:
  - 6 header KPIs (sessions, with H2, known family, match rate, mismatch, mismatch %)
  - Doc banner explaining browser_matcher §4 and DUAL_MODE
  - Donut: H2 families (dict_browser_h2 lookup)
  - Horizontal bar: WINDOW_UPDATE signals per family
  - Grouped bar + line: TLS↔H2 mismatch per JA4 family (count + %)
  - Table: top 20 potential impostors with clickable IP, pseudo-order, coherence
  - Mini-KPIs: Chrome/Safari, Firefox, and unknown pseudo-header orders, PRIORITY frames
  - 'Navigateurs' nav link in the Surveillance group of base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:02:39 +02:00
e52cdcc01f feat(bot-detector): Browser Signature Detection engine (parallel mode)
Step A: browser_signatures.py
  Pure data: BROWSER_SIGNATURES (Chrome/Firefox/Safari), NON_BROWSER_SIGNATURES
  (curl/httpx/go), BROWSER_THRESHOLDS, DIMENSION_WEIGHTS. H2 values taken
  from real captures (Akamai format with commas, not semicolons).

Step B: browser_matcher.py
  Vectorized 7-dimension engine (H2 SETTINGS 0.30, WINDOW_UPDATE 0.15,
  pseudo-header order 0.15, H2 PRIORITY 0.10, HTTP headers 0.15, TLS 0.10,
  JA4 dict 0.05). run_browser_matcher(df) adds bm_family/bm_score/bm_decision.
  CDN edge case: H2 dimension neutralized (0.5) when has_xff=1.
  BROWSER_MATCHER_REPLACE=false by default (DUAL_MODE, logging only).

Step C: 06_browser_signature_detection.sql (migration)
  Creates browser_h2_signatures (MergeTree table with 12 reference fingerprints).
  Recreates dict_browser_h2 from the table with a confidence field (replaces the CSV).

Step D: 07_ai_features_view.sql
  +h2_wu_val in the http_logs JOIN, +h2_window_update_value, +h2_dict_family,
  +h2_dict_confidence, +h2_window_{chrome,firefox,safari,absent},
  +h2_order_{chromesafari,firefox}, +h2_priority_present, +h2_pseudo_ord_raw,
  +tls_h2_family_mismatch (detects a mismatch between JA4 family and H2 family).

Step E: preprocessing.py + pipeline.py
  preprocessing.py: calls run_browser_matcher() after compute_browser_axes(),
  adds 7 new binary H2 features to FEATURES and binary_features.
  pipeline.py: calls log_dual_mode_comparison() after the A9 classification.
  BROWSER_MATCHER_REPLACE=true enables replacing the bypass.

Step F: test_browser_matcher.py
  8 tests: Chrome/Firefox/Safari full match, curl rejected, httpcloak partial,
  TLS↔H2 mismatch, CDN proxy neutralization, go net/http rejected.
  All 8 PASSED (+ 36 existing tests unchanged).
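
The seven dimension weights listed for browser_matcher.py sum to 1.0, so the overall score can be read as a weighted average of per-dimension scores. A minimal sketch, assuming each dimension produces a score in [0, 1]; the dictionary keys and the `bm_score` function are illustrative names, not the project's.

```python
# Weights as listed in the commit message; key names are hypothetical.
DIMENSION_WEIGHTS = {
    "h2_settings": 0.30,
    "window_update": 0.15,
    "pseudo_order": 0.15,
    "h2_priority": 0.10,
    "http_headers": 0.15,
    "tls": 0.10,
    "ja4_dict": 0.05,
}


def bm_score(dimension_scores: dict) -> float:
    """Weighted sum of the seven per-dimension scores."""
    return sum(weight * dimension_scores[name]
               for name, weight in DIMENSION_WEIGHTS.items())
```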

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:52:57 +02:00
79dbb23d6f feat(dashboard): time range selector on /campaigns
Before: all campaign views were fixed at 7 days.
After: 1d / 7d (default) / 14d / 30d / 90d selector at the top right.

- Add the ?days= parameter (1–90, default 7) to:
    GET /api/campaigns
    GET /api/campaigns/graph
    GET /api/campaigns/scatter
    GET /api/campaigns/{cid}
- The selector reloads the 3 views (cards, scatter, graph) and the
  detail panel simultaneously, using the same time window
- The campaign counter shows the active range, e.g. 4 campaigns over 30 days

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:24:08 +02:00
9548b1782d fix: correct ORDER BY for ml_detected_anomalies in the base schema
CH 24.8 rejects MODIFY ORDER BY on existing columns (error BAD_ARGUMENTS 36).
Migration 01 therefore could not fix the ORDER BY after init.

Fix:
- 06_ml_tables.sql: ORDER BY (src_ip) → ORDER BY (src_ip, ja4, host, model_name)
  + TTL 30d → 7d (consistent with the documented architecture)
- 01_ttl_adjustments.sql: removes the impossible MODIFY ORDER BY, keeps
  only the MODIFY TTL statements (valid for existing deployments)

Result: make init-stack runs with no ⚠ or ✗

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:34:07 +02:00
92432085e2 fix(campaigns): fix IP navigation URL encoding
fmtIP() returns an HTML <a> tag string. Using encodeURIComponent(fmtIP(ip))
was URL-encoding the entire HTML markup instead of the raw IP address,
resulting in /ip/%3Ca%20href%3D... navigation.

Fix: extract raw IP (stripping ::ffff: prefix) before building the URL.
Applied to all 3 click handlers in campaigns.html:
- members table row onclick
- scatter chart point click
- force graph node click
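
The failure mode described above can be reproduced outside the browser. The sketch below uses Python's `urllib.parse.quote` as a stand-in for `encodeURIComponent`; `fmt_ip`, `raw_ip` and `ip_url` are hypothetical names that only mirror the commit's description, not the actual code in campaigns.html.

```python
from urllib.parse import quote


def fmt_ip(ip):
    # Hypothetical stand-in for fmtIP(): returns HTML markup, not a bare address.
    return f'<a href="/ip/{ip}">{ip}</a>'


def raw_ip(ip):
    # Strip the IPv4-mapped IPv6 prefix so the URL carries the plain address.
    prefix = "::ffff:"
    return ip[len(prefix):] if ip.lower().startswith(prefix) else ip


def ip_url(ip):
    # Encode only the raw address, never the rendered markup.
    return "/ip/" + quote(raw_ip(ip), safe="")
```

Encoding `fmt_ip(ip)` instead of `raw_ip(ip)` yields a path beginning with `%3Ca%20href%3D`, which is the broken navigation target the commit reports.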

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:08:53 +02:00
7a04e47041 fix(sql+api): fix view column mismatches and ClickHouse 24.8 JOIN issue
- view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen
- view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start
- Replace uniqExact/groupUniqArray with count()/groupArray (no nested-agg error)
- api.py campaigns/graph: move a.src_ip < b.src_ip from JOIN ON to WHERE
  (ClickHouse 24.8 forbids cross-table inequality in JOIN ON condition)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:05:04 +02:00
2f2c5e03bb fix(sql): work around ClickHouse 24.8 scope bug in view_ai_features_1h
- Restructure 07_ai_features_view.sql: single anonymous inner subquery
  with explicit aliases on every column (a.xxx AS xxx, h.xxx AS xxx,
  h2.xxx AS xxx) to resolve the PARTITION BY src_ip ambiguity in the outer SELECT
- Remove the multiple CTEs (h2_agg, enriched) that triggered the bug
- Fix migration 04_http2_fields.sql: DEFAULT must come before CODEC (ClickHouse syntax)
- make init-stack: 0 errors across 13 SQL files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:48:05 +02:00
a108814a56 feat: bot detection roadmap §2-9: HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics
Step 2 - HTTP/2 fingerprinting in the ML pipeline:
- Add the dict_browser_h2 dictionary (11 browser families) in 05_aggregation_tables.sql
- Add the h2_agg CTE and 4 HTTP/2 features in 07_ai_features_view.sql:
  h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- Compute fingerprint_coherence_score (5 weighted axes) in the view
- Add the 6th axis axis_h2_coherence in browser.py (weights rebalanced)
- browser_h2.csv: 11 Akamai fingerprints → browser family

Step 3 - Coherence pre-filter on the human baseline:
- pipeline.py excludes sessions with fingerprint_coherence_score < threshold from the training baseline
- FINGERPRINT_COHERENCE_THRESHOLD configurable via env (default 0.25)
- Log excluded sessions for SOC analysis
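
A rough sketch of the pre-filter, assuming sessions are plain dicts carrying a `fingerprint_coherence_score` key. The helper names `coherence_threshold` and `split_baseline` are hypothetical; only the environment variable, its default of 0.25 and the strict `<` comparison come from the commit message.

```python
import os


def coherence_threshold():
    # Read the threshold from the environment, defaulting to 0.25.
    return float(os.environ.get("FINGERPRINT_COHERENCE_THRESHOLD", "0.25"))


def split_baseline(sessions, threshold=None):
    """Split sessions into (kept, excluded) for the training baseline.

    A session is excluded when its score is strictly below the threshold.
    """
    if threshold is None:
        threshold = coherence_threshold()
    kept, excluded = [], []
    for session in sessions:
        if session["fingerprint_coherence_score"] < threshold:
            excluded.append(session)  # candidates for the SOC analysis log
        else:
            kept.append(session)
    return kept, excluded
```

Returning the excluded sessions alongside the kept ones makes it straightforward to log them rather than drop them silently.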

Step 4 - Improved drift detection:
- scoring.py: move from 5 to 9 quantiles (p5…p95)
- Add KL divergence alongside the KS test
- Adversarial drift detection (≥80% of features drift in the same direction)
- Strict temporal split for validation
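
One possible reading of the "≥80% in the same direction" rule, as a sketch. The function name `is_adversarial_drift`, the `min_shift` tolerance and the use of all features as the denominator are assumptions; the commit does not describe how scoring.py measures per-feature drift.

```python
def is_adversarial_drift(shifts, min_shift=0.0, ratio=0.8):
    """Flag drift when at least `ratio` of the features move the same way.

    shifts: mapping of feature name -> signed shift (current minus baseline).
    min_shift: shifts with an absolute value at or below this are ignored.
    """
    total = len(shifts)
    if total == 0:
        return False
    up = sum(1 for delta in shifts.values() if delta > min_shift)
    down = sum(1 for delta in shifts.values() if delta < -min_shift)
    # Share of all features moving in the dominant direction.
    return max(up, down) / total >= ratio
```

Natural traffic changes tend to move features in mixed directions, so a large one-sided share is the signal this check looks for.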

Step 5 - JA4×ASN bipartite graph (§5.2):
- fleet.py: fleet detection via NetworkX + Louvain (optional imports)
- enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame
- cycle.py: called after preprocess_df, logging the number of sessions in a fleet
- SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (TTL 7d)
- Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template

Step 6 - Cross-domain Jaccard §5.8:
- 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity
- Signal: the same paths (/admin, /wp-login) on several hosts = scanner
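
The Jaccard signal can be illustrated in a few lines. The actual feature is computed in SQL; the Python below is only a sketch, and averaging over host pairs is an assumption about how the per-host path sets are combined.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_domain_path_similarity(paths_by_host):
    """Mean pairwise Jaccard index of the paths requested on each host.

    paths_by_host: mapping of host -> iterable of request paths for one client.
    Returns 0.0 when fewer than two hosts were contacted.
    """
    pairs = list(combinations(paths_by_host.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A client probing `/admin` and `/wp-login` on every host scores close to 1.0, while a visitor browsing unrelated pages on different sites scores close to 0.0.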

Step 7 - ExIFFI + per-feature AE errors:
- scoring.py: compute_exiffi_importance() by permutation, compute_ae_feature_errors()
- pipeline.py: ExIFFI computed on X_test, index → dict mapping for anomalies
- build_reason() enriched with exiffi_top when SHAP is inactive

Step 8 - Meta-learner for ensemble weighting:
- scoring.py: MetaLearner class (LogisticRegression, falls back to fixed weights below 1000 labels)
- Labels collected from the current cycle (known_bots, legitimate, Anubis)
- pipeline.py: fixed weights replaced by MetaLearner.predict()

Step 9 - Performance metrics and monitoring:
- metrics.py: record_cycle_metrics() (anomaly rate, drift, correlation, latency)
- SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (TTL 90d)
- Dashboard: /health + /api/health + health.html template
- cycle.py: record_cycle_metrics called at the end of the cycle (Complet + Applicatif)

Tests: 36/36 bot-detector tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:11:35 +02:00
8ca4a1e849 feat(mod_reqin_log): passive HTTP/2 fingerprinting (Akamai format)
Adds a connection input filter (AP_FTYPE_CONNECTION, APR_HOOK_LAST)
that sits between mod_ssl and mod_http2 to read the HTTP/2 preface
(RFC 9113 §3.4) non-destructively and extract from it:

- h2_fingerprint    : full Akamai fingerprint
                      e.g. '1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p'
- h2_settings_fp    : raw SETTINGS entries  (e.g. '1:65536,4:6291456')
- h2_window_update  : WINDOW_UPDATE increment  (e.g. '15663105')
- h2_pseudo_order   : pseudo-header order (e.g. 'm,a,s,p' Chrome,
                                                'm,p,s,a' Firefox)
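
To make the field layout concrete, here is a sketch that splits a fingerprint string of the shape shown above into its four `|`-separated parts. The function name `parse_h2_fingerprint` and the returned key names are hypothetical, and the third field is read as a priority field following the usual Akamai layout (the commit does not name it), so it is kept as an opaque string.

```python
def parse_h2_fingerprint(fingerprint):
    """Split 'SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO_ORDER' into a dict."""
    settings_part, window_update, priority, pseudo = fingerprint.split("|")
    settings = {}
    for entry in settings_part.split(","):
        if entry:  # tolerate an empty SETTINGS section
            key, value = entry.split(":")
            settings[int(key)] = int(value)
    return {
        "settings": settings,                 # SETTINGS id -> value
        "window_update": int(window_update),  # WINDOW_UPDATE increment
        "priority": priority,                 # kept as an opaque string
        "pseudo_order": pseudo.split(","),    # e.g. ['m', 'a', 's', 'p']
    }
```

For the example value in the list above, this gives SETTINGS id 4 (initial window size) as 6291456, a window update of 15663105 and the pseudo-header order m, a, s, p.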

Technique: speculative read with AP_MODE_SPECULATIVE (non-destructive)
of 512 bytes; the data remains available to mod_http2. The filter
removes itself from the chain after the first invocation.

Stored in c->notes (H2_NOTE_*), then emitted as JSON in log_request().
ClickHouse: 4 new columns in http_logs + JSONExtract in mv_http_logs.
Migration for existing deployments: 04_http2_fields.sql.
14 unit tests (cmocka) cover Chrome/Firefox/HTTP1/truncation/HPACK.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:46:50 +02:00
14db3d9040 refactor: remove User-Agent dependency from browser detection
SQL changes:
- modern_browser_score: sec-ch-ua→100, Sec-Fetch→70 (no more UA fallback)
- Add has_sec_ch_ua (UInt8) to agg_header_fingerprint_1h and ml_all_scores
- mss_mobile_mismatch uses has_sec_ch_ua instead of modern_browser_score
- header_order_confidence: PARTITION BY ja4 instead of first_ua
- sec_ch_mobile_mismatch: internal Client Hints comparison (no UA)
- Migration 03_remove_ua_browser_detection.sql

Python changes:
- browser.py Axis 3: Client Hints + Sec-Fetch + is_fake_navigation (NO UA)
- Axis weights: ja4_known 0.30, tls_coherence 0.20 (TLS signals strengthened)
- preprocessing.py: has_sec_ch_ua added to features and binary_features

Files changed: 8 SQL/Python + 1 migration, 36/36 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:06:01 +02:00
00e99e5464 fix(bot-detector): make scoring functions public (remove underscore prefix)
compute_shap_top_features, build_reason, cluster_anomalies renamed from
private (_prefixed) to public to match pipeline.py imports.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:49:48 +02:00
629f7b334d fix(bot-detector): rename _compute_drift_score to public, fix import
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:48:21 +02:00
de6d8da931 fix(bot-detector): FEATURES_BASE → FEATURES import name mismatch
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:42:32 +02:00
6d64c2a8a8 fix(rpm): add systemd-rpm-macros to Dockerfile.package, fix correlator spec_version
- sentinel/correlator: install systemd-rpm-macros in rpm-builder stage
- correlator: use build_version macro (not version) to avoid recursive expansion
- mod-reqin-log: fix ctest --test-dir to find tests in build/tests/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:33:53 +02:00
6b3cc54652 docs: rewrite audit, DOCUMENTATION.md and IMPROVEMENTS.md for modular architecture
- AUDIT: compliance updated to 97.9% (142/145), modular references
- DOCUMENTATION.md: 1083 lines, 7 sections, 11 modules documented
- IMPROVEMENTS.md: A1-A10/B1-B10 annotated /🔄/ with locations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
9ea36ad22e feat(scripts): complete stack init + prod data import with date shift
Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules

New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
  + validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
  with dynamic date shift (max(time) → now). Supports --shift, --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports
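
The date shift (max(time) → now) amounts to adding one constant offset to every row. The sketch below shows the idea on a list of timestamps; `shift_to_now` is a hypothetical name, and the real script presumably applies the offset during import rather than in Python.

```python
from datetime import datetime, timezone


def shift_to_now(timestamps, now=None):
    """Shift all timestamps so that the most recent one lands on `now`.

    Relative spacing between rows is preserved because a single
    offset (now - max(timestamps)) is added to every value.
    """
    if not timestamps:
        return []
    if now is None:
        now = datetime.now(timezone.utc)
    offset = now - max(timestamps)
    return [ts + offset for ts in timestamps]
```

Shifting by a single offset keeps traffic patterns such as bursts and day/night cycles intact while bringing the data back inside TTL and dashboard windows.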

Makefile targets: init-stack, import-prod-data, init-and-import

Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts
        import-prod-data.sh: 3M rows in ~37s with auto date shift
        Dashboard: 55 routes OK, bot-detector: 36/36 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:40:05 +02:00
8180f4af04 refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules
- Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any)
- Remove Country rule collection from parse_bot_policies_inline
- Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules)
- Remove insert_ua_rules and insert_country_rules functions
- reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn
- Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql,
  view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level
  UA+IP > UA > IP > ASN > Country cascade)
- Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql
  (UA table kept as stub for REGEXP_TREE catch-all compatibility)
- Remove anubis_country_rules table from schema
- Remove Anubis UA and Country tabs from dashboard reflists page
- Remove anubis_ua_rules/country_rules from API reflist queries
- deploy_schema.sql simplified from 339 to 122 lines
- 764 lines removed across 9 files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:25:33 +02:00