c60ce97f23
feat(bot-detector): add dynamic browser profiling engine with HDBSCAN clustering
Implement offline profile building (profile_builder.py) and real-time
dynamic scoring (browser_matcher_dynamic.py) using HDBSCAN-based browser
fingerprint clustering. Add ClickHouse materialized view (13_h2_profiling.sql)
for h2_profile_stats aggregation. Update thesis and project documentation
to cover the new dynamic profiling architecture.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 02:06:00 +02:00
64dada980f
docs: synchronize thesis with codebase — browser_matcher, fleet detection, module counts
Update THESIS_HTTP_Traffic_Detection.md to reflect actual implementation state:
- Module count: 12 → 14 modules, 2 912 → 3 700 lines
- browser_matcher: [partiel] → [impl.] (all 7 scoring dimensions, Chrome/Firefox/Safari signatures)
- Fleet detection §5.2: HDBSCAN → Louvain with connected components fallback
- §3.9.1 H2 tables: update Firefox/Safari pseudo-header order to real captured values
- §7 conclusion: 96/0/0 features (all implemented, zero partial)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 01:38:57 +02:00
957918c565
fix(ja4ebpf): Rocky Linux RPM builder, remove correlated field, fix thesis
- Dockerfile.package: migrate go-builder from golang:bookworm (Debian) to
rockylinux:9, install Go from the official tarball, replace apt with
dnf (clang llvm libbpf-devel bpftool)
- Remove the 'correlated' field from the ja4ebpf agent: with eBPF/XDP,
the L3/L4↔L7 correlation is always implicit in the presence of the fields.
Removed from: session.go, manager.go, main.go (x5), clickhouse.go
- Thesis (6 listed corrections + 'correlated' consistency):
1. §3.5 + §3.9.1: SSL_read returns raw bytes without respecting H2
frame boundaries → circular reassembly buffer in Go userspace
2. §3.1: removed libpcap + CAP_NET_RAW, replaced with the uprobe definition
3. §4 + §7: exact count of 96 features in 8 families (Family 1–8),
removed the obsolete F1–F11 taxonomy, all totals updated
4. §2.4 + §8: replaced 7 bogus arXiv URLs with [Référence à vérifier]
5. §4 Family 2: ja4_drift_ratio → cross-reference to Family 8 (full definition)
6. §6.4: added the limitation 'Overhead de l'uprobe SSL_read'
+ §3.6: removed correlated=0/1 from the architecture text
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-12 04:48:40 +02:00
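Note on the SSL_read item in the commit above: because each read can end in the middle of an HTTP/2 frame, the consumer has to buffer bytes until a full 9-byte frame header plus its payload is available. The sketch below illustrates that idea in Python; it is not the project's Go code. The class name `H2FrameReassembler` is hypothetical, and a plain growable `bytearray` stands in for the circular buffer mentioned in the commit.

```python
class H2FrameReassembler:
    """Rebuild HTTP/2 frames from chunks that ignore frame boundaries.

    Hypothetical helper for illustration only; the commit describes a
    circular buffer in Go userspace, which is not reproduced here.
    """

    HEADER_LEN = 9  # 3-byte length, 1-byte type, 1-byte flags, 4-byte stream id

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes):
        """Append one chunk and return every frame that is now complete."""
        self._buf += chunk
        frames = []
        while len(self._buf) >= self.HEADER_LEN:
            length = int.from_bytes(self._buf[0:3], "big")
            total = self.HEADER_LEN + length
            if len(self._buf) < total:
                break  # payload not fully received yet, wait for more bytes
            frame_type = self._buf[3]
            flags = self._buf[4]
            # The top bit of the stream identifier is reserved and masked out.
            stream_id = int.from_bytes(self._buf[5:9], "big") & 0x7FFFFFFF
            payload = bytes(self._buf[self.HEADER_LEN:total])
            frames.append((frame_type, flags, stream_id, payload))
            del self._buf[:total]
        return frames
```

Feeding a frame split at arbitrary offsets yields nothing until the last byte arrives; leftover bytes of the next frame stay buffered for the following call.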
3b047b680a
fix(ja4ebpf): split bpf2go generate into Ja4Tc + Ja4Ssl, fix RPM systemd-rpm-macros
- Use two separate //go:generate directives (Ja4Tc for tc_capture.c, Ja4Ssl
for uprobe_ssl.c) to avoid duplicate LICENSE symbol and multi-file clang issue
- Update loader.go to hold tcObjs/sslObjs separately with correct field names:
UprobeSslSetFd, UprobeSslReadEntry, UretprobeSslReadExit,
KprobeAccept4Entry, KretprobeAccept4Exit
- Add systemd-rpm-macros to all three RPM build stages (el8/el9/el10)
so that %{_unitdir} macro resolves correctly
- RPMs now build successfully for el8, el9, el10
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 23:21:11 +02:00
f704541f83
feat(h2): direct per-parameter SETTINGS comparison in browser_matcher
- Rewrote _d1_h2_settings() with 3-signal weighted formula:
direct_score×0.60 + dict_match×0.30 + ja4_coherence×0.10
when individual SETTINGS cols are available in the DataFrame
- Added _H2_SETTINGS_COLS dict (IDs 1,2,3,4,5,6,8 → column names)
- Fallback to dict_match×0.80 + ja4_coherence×0.20 for backward compat
- Fix view_ai_features_1h: pass 7 individual SETTINGS columns through
base_data CTE (h2_header_table_size, h2_enable_push,
h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size,
h2_max_header_list_size, h2_enable_connect_protocol)
- Remove non-existent h2_dict_confidence reference from view SQL
(dict_browser_h2 only exposes browser_family attribute)
- Add 7 new pytest cases: exact match, one wrong setting, forbidden key
penalty, unknown fingerprint with correct settings, fallback path,
CDN proxy neutralisation, full Chrome simulation
- 53/53 bot-detector tests pass
- Update thesis §3.9.2: document direct comparison algorithm + fallback
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 03:05:36 +02:00
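Note on the scoring formula in the commit above: the weights (0.60/0.30/0.10, with a 0.80/0.20 fallback) come from the commit message; everything else in this sketch is an assumption. The function names are hypothetical, and `direct_settings_score` uses a simple "fraction of expected parameters matched" rule, since the commit does not say how the direct score or the forbidden-key penalty is computed.

```python
def direct_settings_score(observed, expected):
    """Fraction of expected SETTINGS values that match exactly.

    Assumed rule for illustration; returns None when no per-parameter
    data is available so the caller can take the fallback path.
    """
    if not observed or not expected:
        return None
    matched = sum(1 for key, value in expected.items() if observed.get(key) == value)
    return matched / len(expected)


def h2_settings_score(direct_score, dict_match, ja4_coherence):
    """Blend the signals using the weights quoted in the commit message."""
    if direct_score is not None:
        # Individual SETTINGS columns are available: 3-signal formula.
        return 0.60 * direct_score + 0.30 * dict_match + 0.10 * ja4_coherence
    # Backward-compatible fallback: dictionary match and JA4 coherence only.
    return 0.80 * dict_match + 0.20 * ja4_coherence


# Made-up profile and observation, for illustration only.
expected = {"h2_enable_push": 0, "h2_initial_window_size": 6291456}
observed = {"h2_enable_push": 0, "h2_initial_window_size": 65535}
score = h2_settings_score(direct_settings_score(observed, expected), 1.0, 1.0)
```

In both branches the weights sum to 1, so a session that matches on every signal scores 1.0 whether or not the individual columns are present.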
95e87149aa
docs: update thesis (passive HTTP/2 capture and individual columns)
Updated sections:
- Architecture §3.1: H2 capture moved from ja4sentinel to mod_reqin_log
- §3.5 L7 layer: description of the process_connection hook (APR_HOOK_FIRST,
AP_MODE_SPECULATIVE), preface parsing, c1/c2 notes, individual columns
- Table of the 12 H2 columns in ja4_logs.http_logs with types and conventions
- §3.9 Browser Matcher: status updated (capture [impl.], scoring [partiel])
- §3.9.1: capture mechanism via process_connection instead of a connection filter
- §3.9.4: h2_window_update_value, h2_has_priority, h2_pseudo_order → [impl.]
- §6.6 Roadmap: H2 capture dependency resolved, remaining work = browser_match_*
- Contribution 3: added a description of the technical implementation
- Summary table: F8 goes from 0/8/0 to 3/5/0 (70 impl. / 5 partial)
- Quantitative summary: 82% impl. (was 79%), 6% partial (was 9%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 02:40:41 +02:00
85d3b95b7b
feat: HTTP/2 passive fingerprinting with individual SETTINGS fields
Complete implementation of HTTP/2 passive fingerprinting per thesis §2.5.3:
mod-reqin-log (C module):
- Replace connection-level filter with ap_hook_process_connection (APR_HOOK_FIRST)
to capture H2 preface before mod_http2 takes over the connection
- AP_MODE_SPECULATIVE read of 512 bytes from c->input_filters
- Parse SETTINGS, WINDOW_UPDATE, PRIORITY flags, pseudo-header order
- Output individual SETTINGS params as separate JSON fields (IDs 1-6, 8)
- Read H2 notes from c1 (master connection) for mod_http2 secondary conns
- Fix header_order_signature JSON length bug (26→strlen)
ClickHouse schema:
- Add 8 new columns to http_logs: h2_has_priority, h2_header_table_size,
h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size,
h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol
- Use Int32/Int64 with DEFAULT -1 to distinguish absent vs zero
- Update mv_http_logs to extract individual fields via JSONHas/JSONExtractInt
- Migration 04_http2_fields.sql updated for existing deployments
Correlator:
- Accept both timestamp_ns and timestamp field names (backward compat)
Integration:
- Enable HTTP/2 in Apache: Protocols h2 http/1.1 in httpd-integration.conf
Validated end-to-end via Playwright: H2 curl traffic → mod-reqin-log →
correlator → ClickHouse with all 12 H2 columns populated correctly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 02:33:45 +02:00
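Note on the capture described in the commit above: the sketch below shows, in Python rather than the module's C, how SETTINGS parameters can be read from the first bytes of an HTTP/2 connection and why a -1 default keeps "absent" distinct from an explicit 0. The preface string, frame header layout, and 6-byte setting entries follow the HTTP/2 wire format; the function name `parse_client_settings` is hypothetical, and the column names are taken from the commit message.

```python
import struct

PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"  # 24-byte client connection preface
SETTINGS_FRAME = 0x4

# SETTINGS IDs 1-6 and 8, mapped to the column names listed in the commit.
SETTING_COLUMNS = {
    1: "h2_header_table_size",
    2: "h2_enable_push",
    3: "h2_max_concurrent_streams",
    4: "h2_initial_window_size",
    5: "h2_max_frame_size",
    6: "h2_max_header_list_size",
    8: "h2_enable_connect_protocol",
}


def parse_client_settings(buf: bytes) -> dict:
    """Return one value per column; -1 means the client did not send it."""
    fields = {name: -1 for name in SETTING_COLUMNS.values()}
    if not buf.startswith(PREFACE):
        return fields  # not an HTTP/2 preface, nothing to extract
    pos = len(PREFACE)
    if len(buf) < pos + 9:
        return fields  # frame header not fully captured
    length = int.from_bytes(buf[pos:pos + 3], "big")
    frame_type = buf[pos + 3]
    payload = buf[pos + 9:pos + 9 + length]
    if frame_type != SETTINGS_FRAME or len(payload) < length:
        return fields
    # Each setting is a 16-bit identifier followed by a 32-bit value.
    for offset in range(0, length - length % 6, 6):
        ident, value = struct.unpack_from(">HI", payload, offset)
        name = SETTING_COLUMNS.get(ident)
        if name is not None:
            fields[name] = value
    return fields
```

With this convention a client that sends ENABLE_PUSH=0 yields `h2_enable_push == 0`, while a client that omits the parameter yields -1, which is the distinction the `DEFAULT -1` columns are meant to preserve.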
bd81331411
update thesis
2026-04-11 00:27:20 +02:00
aa233bc55c
docs(thesis): v3 — corrections + §3.9 browser_matcher + XFF proxy accuracy
User-authored updates verified and corrected:
- Correction 1: 85 features / 8 families (was 65+/7)
- Correction 2: diagram adds MetaLearner, ExIFFI, fleet.py
- Correction 3: axis 5 weight 0.20→0.15, new axis 6 (H2 Coherence 0.05)
- Correction 4: 5 quantiles (p10-p90), p5/p95 as future work
- Correction 5: §5.6 DNS Shadow + §5.7 Compression Ratio named as future
- New §3.9 Browser Signature Detection (browser_matcher 7 dimensions)
- New §2.4.5 ExIFFI + MetaLearner + KL divergence drift
- New §2.5.3 passive HTTP/2 fingerprinting literature
- Updated §5.2 fleet.py implementation details
- Updated §5.8 cross_domain_path_similarity + Jaccard
Additional fixes (code-accuracy alignment):
- XFF proxy: 4 H2 dimensions neutralized (70% weight), not redistributed
- Module count: 12→13 (browser_signatures.py added)
- §6.5 limitations table: precise proxy weight impact (70%→30%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 19:45:02 +02:00
c77d479d6c
docs(thesis): 5 corrections — 85 features, MetaLearner diagram, browser axes note, quantile clarification, §5.6/5.7 named
- Correction 1 (l.65, 701): '65+ features sur 7 familles' → '85 features sur 8 familles'
- Correction 2 (l.374-378): bot_detector ASCII diagram, added MetaLearner, ExIFFI, fleet.py
- Correction 3 (after l.506): note that the axis 5 weight is reduced 0.20→0.15 and axis 6 is added at 0.05
- Correction 4 (l.279): clarified that 5 quantiles (p10→p90) are currently used; p5/p95 = future work
- Correction 5 (l.776): §5.6 (DNS UDP/53) and §5.7 (Apache compression) named explicitly
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:51:31 +02:00
51dd376f7a
docs: full update (7/8 techniques, 85 features, 12 modules)
Reflects the actual state of the system after roadmap steps 1-9:
- §5.2 (fleet_detector NetworkX/Louvain) and §5.8 (Jaccard cross-domain): ✅
- MetaLearner (logistic regression, fixed-weight fallback): documented
- ExIFFI (EIF isolation depth) + per-feature AE error: documented
- KL divergence alongside KS, adversarial drift: documented
- HTTP/2 fingerprinting (h2_fingerprint, dict_browser_h2, axis_h2_coherence): documented
- Cycle metrics (metrics.py, ml_performance_metrics, alerts): documented
- Browser confidence: 5 axes → 6 axes (axis_h2_coherence)
- 85 features (73 FEATURES + 12 FEATURES_COMPLET), 12 modules, 53 dashboard routes
- Thesis conformity: 99.4% (was 97.9%), §5: 87.5% (was 62.5%)
- New tables: fleet_detections, ml_performance_metrics, soc_feedback
- Dictionaries: 8 (dict_browser_h2 added)
- Dashboard: 16 pages + 37 API routes (fleet, health added)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:31:20 +02:00
6b3cc54652
docs: rewrite audit, DOCUMENTATION.md and IMPROVEMENTS.md for the modular architecture
- AUDIT: conformity updated to 97.9% (142/145), modular references
- DOCUMENTATION.md: 1083 lines, 7 sections, 11 modules documented
- IMPROVEMENTS.md: A1-A10/B1-B10 annotated ✅/🔄/❌ with locations
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
c96c41fb45
docs: complete rewrite of the service documentation in French
- bot-detector.md: 11-module architecture, 77/65 features,
triple-voice ensemble (EIF+AE+XGBoost), browser (5 axes), HDBSCAN,
all environment variables verified against the source code
- dashboard.md: corrected stack (Jinja2+htmx, not React+Vite),
14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md: added CLICKHOUSE_DB_PROCESSING/LOGS,
dual-database schema, note that the dashboard does not use ja4_common
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:04:58 +02:00
8f5e771096
docs: complete rewrite of the database documentation in French
Rewrite of the 3 ClickHouse database documentation files:
- docs/database/schema.md: full coverage of both databases, 14+ tables,
7 dictionaries, 8 MVs, 8 views, TTL, partitions, engines and columns
- docs/database/migrations.md: 13 SQL files (added 10-12), prerequisites
updated (ClickHouse 24.8+, 5 CSV), deploy_schema.sh, init-stack.sh,
complete verification and rollback
- shared/clickhouse/README.md: quick reference for the 13 files,
deploy_schema.sh, dual-database pattern, prerequisites
Removed obsolete references: dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:03:37 +02:00
d05969867f
docs: rewrite architecture/README, update deployment/development
- architecture.md: complete rewrite (French) with dual-database diagram,
5-phase data flow, full table ownership, triple-voice ML pipeline,
7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
services table, scripts section, integration tests, full doc index,
Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
add Python ML deps, add scripts/integration test sections,
fix bot-detector run command, add make targets
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865
docs: update the thesis document (§2-§8)
- §2.1.3: Simplified Anubis to 2 dictionaries (dict_anubis_ip, dict_anubis_asn) with COALESCE priority
- §2.4.2: Added the isotree library, calibration formula, ntrees=300, joblib serialization
- §2.4.2b/§2.4.4: Replaced DBSCAN with HDBSCAN throughout
- §2.4.2c: Replaced logistic regression with fixed linear weighting, added formula and weights
- §2.4.3: Clarified the 5-quantile approximation used for drift detection
- §3.1: Updated the ASCII diagram (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Updated the trifurcation + added multifactor browser detection (5 axes)
- §4: Expanded the taxonomy from 51 to 65+ features across 8 families
- §5: Added implementation status (✅/❌) to each technique
- §6: Added §6.6 deployment results (3M+ logs, 34K sessions/cycle)
- §7: Updated the conclusion (65+ features, 5/8 techniques, modular refactoring)
- §8: Added references for isotree, PyTorch, HDBSCAN, XGBoost, SHAP
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
b66d41a200
docs: updated conformity audit bot-detector + dashboard vs thesis
Score: 93% (was 72%) — 4 thesis techniques now implemented,
browser classification, ASN PeeringDB, SOC feedback loop.
Identifies 9 bot-detector bugs (2 critical: campaign_id/raw_anomaly_score
never inserted, worst_score inverted) and 11 dashboard bugs (4 critical:
XSS, no auth, no CSRF, CORS misconfiguration).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:25:19 +02:00
0d1a6a81e0
docs: update thesis with EIF, autoencoders, ensemble architecture, quantile drift
- §2.4.2: Add Extended Isolation Forest theory (Hariri et al., TKDE 2021)
- §2.4.2b: New section on autoencoders for network anomaly detection
(Kitsune, β-VAE, hybrid AE+IF studies)
- §2.4.2c: New section on hybrid supervised+unsupervised ensembles
(triple-voice architecture: EIF + AE + XGBoost + meta-learner)
- §2.4.3: Enhanced drift detection with quantile digest and validation gate
- §6.2: Multi-level baseline contamination mitigation
- §7: Updated conclusion reflecting ensemble architecture
- §8: 10 new references (Hariri 2021, Mirsky 2018, Baptiste 2026, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:23:00 +02:00
0ccd417a02
docs: conformity audit of detection against the state-of-the-art thesis
Exhaustive feature-by-feature analysis of the implemented detection
techniques against what the thesis describes. Score: 97% baseline, 6% advanced
techniques, 72% weighted overall.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 00:12:51 +02:00
51b8eb57a8
feat: port v14 schema fixes, migration, MV verifier, thesis from ja4/
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)
New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis
All DB references use ja4_processing/ja4_logs (no mabase_prod).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:51:56 +02:00
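Note on the CRITICAL fix in the commit above: ReplacingMergeTree deduplicates rows that share the same sorting key when parts are merged, so with ORDER BY (src_ip) every detection for one IP eventually collapses into a single row. The sketch below simulates that behavior in plain Python to show why widening the key preserves distinct detections. The function `replacing_merge` and the sample rows are hypothetical, and the simulation assumes no version column (the last inserted row wins).

```python
def replacing_merge(rows, order_by):
    """Keep the last inserted row per sorting key, as a merge eventually does."""
    merged = {}
    for row in rows:
        key = tuple(row[col] for col in order_by)
        merged[key] = row  # a later insert replaces an earlier one with the same key
    return list(merged.values())


# Three distinct detections for the same source IP (made-up values).
detections = [
    {"src_ip": "203.0.113.7", "ja4": "t13d_a", "host": "a.example", "model_name": "eif"},
    {"src_ip": "203.0.113.7", "ja4": "t13d_a", "host": "b.example", "model_name": "eif"},
    {"src_ip": "203.0.113.7", "ja4": "t13d_b", "host": "a.example", "model_name": "ae"},
]

old_key = replacing_merge(detections, ("src_ip",))
new_key = replacing_merge(detections, ("src_ip", "ja4", "host", "model_name"))
```

Under the old key only the last detection survives; under the new key all three do, because each differs in at least one key column.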
3dfeba860b
docs: add standardized comments to all services (Python, Go, Bash)
- Add docs/commenting-standard.md defining per-language comment standards
(Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners)
- services/dashboard: 100% docstring coverage (100/100 functions)
- All FastAPI route handlers, helpers, classes, and models documented
- Language: French (project convention)
- services/bot-detector: 100% docstring coverage (53/53 symbols)
- bot_detector.py: 14 functions + module docstring
- anubis/fetch_rules.py: 9 functions
- shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods)
and ClickHouseSettings class
- services/correlator: 24 godoc comments added across 6 Go files
- correlation_service.go: 10 private helpers
- unixsocket/source.go: 6 parsing/socket helpers
- correlated_log.go: 4 field extraction helpers
- orchestrator.go, logger.go, main.go: 4 comments
- services/correlator/scripts/audit-architecture.sh: standardized header block
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:32:29 +02:00
7b8dff2925
docs: complete deployment guide (DB, users, services, verification)
Covers the 7 setup steps:
1. ClickHouse installation
2. Schema deployment (deploy_schema.sh + migrations)
3. User configuration (data_writer, analyst, bot_writer)
4. External CSV files for the dictionaries
5. Installation of the Go services via RPM (sentinel, correlator, mod-reqin-log)
6. Installation of the Python services via Docker (bot-detector, dashboard)
7. End-to-end verification (ingestion, aggregation, ML, dashboard)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:28:58 +02:00
9f3e0621e5
feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit
Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)
Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)
All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00
d469e39da7
feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)
Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)
Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets
Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)
Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)
Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00