895d7894a9
docs: update copilot-instructions.md
...
- bot-detector: monolith → 10 modules
- Add UA-free browser detection convention (5 axes, Client Hints)
- Add Makefile targets: init-stack, import-prod-data, purge-db, help
- Anubis: simplified to IP/CIDR + ASN (removed dict_anubis_ua / REGEXP_TREE)
- bot-detector tests: clarify heavy imports
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:11:24 +02:00
14db3d9040
refactor: remove User-Agent dependency from browser detection
...
SQL changes:
- modern_browser_score: sec-ch-ua→100, Sec-Fetch→70 (no more UA fallback)
- Add has_sec_ch_ua (UInt8) to agg_header_fingerprint_1h and ml_all_scores
- mss_mobile_mismatch uses has_sec_ch_ua instead of modern_browser_score
- header_order_confidence: PARTITION BY ja4 instead of first_ua
- sec_ch_mobile_mismatch: internal Client Hints comparison (no UA)
- Migration 03_remove_ua_browser_detection.sql
Python changes:
- browser.py Axis 3: Client Hints + Sec-Fetch + is_fake_navigation (NO UA)
- Axis weighting: ja4_known 0.30, tls_coherence 0.20 (TLS signals strengthened)
- preprocessing.py: has_sec_ch_ua added to features and binary_features
Files modified: 8 SQL/Python + 1 migration, 36/36 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:06:01 +02:00
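The modern_browser_score rule from the commit above can be sketched in Python as follows. This is only an illustration: the function name reuses the SQL column name, while the header-dict input and the score of 0 when neither signal is present are assumptions, not taken from the repository.

```python
def modern_browser_score(headers: dict[str, str]) -> int:
    """Hypothetical sketch of a UA-free modern-browser score."""
    # HTTP header names are case-insensitive, so normalise them first.
    names = {name.lower() for name in headers}
    if "sec-ch-ua" in names:
        return 100  # Client Hints present
    if any(name.startswith("sec-fetch-") for name in names):
        return 70   # Sec-Fetch-* metadata present, but no Client Hints
    return 0        # assumed default; the User-Agent is never consulted


print(modern_browser_score({"Sec-CH-UA": '"Chromium";v="124"'}))
print(modern_browser_score({"Sec-Fetch-Mode": "navigate"}))
print(modern_browser_score({"User-Agent": "Mozilla/5.0"}))
```

Note that a request carrying only a User-Agent header falls through to the default, which is the point of the refactor.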
00e99e5464
fix(bot-detector): make scoring functions public (remove underscore prefix)
...
compute_shap_top_features, build_reason, cluster_anomalies renamed from
private (_prefixed) to public to match pipeline.py imports.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:49:48 +02:00
629f7b334d
fix(bot-detector): rename _compute_drift_score to public, fix import
...
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:48:21 +02:00
de6d8da931
fix(bot-detector): FEATURES_BASE → FEATURES import name mismatch
...
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:42:32 +02:00
1fa6aec784
fix: SQL view ordering, purge-db flag, ctest directory
...
- 12_thesis_features.sql: move view_resource_cascade_1h before view_thesis_features_1h
- Makefile: purge-db uses --reset (not --clean)
- mod-reqin-log: ctest --test-dir build/tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:39:25 +02:00
6d64c2a8a8
fix(rpm): add systemd-rpm-macros to Dockerfile.package, fix correlator spec_version
...
- sentinel/correlator: install systemd-rpm-macros in rpm-builder stage
- correlator: use build_version macro (not version) to avoid recursive expansion
- mod-reqin-log: fix ctest --test-dir to find tests in build/tests/
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:33:53 +02:00
ea488c0b11
feat: add make help with all targets documented
...
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:22:25 +02:00
0ba66729da
feat: add make purge-db target for full database reset
...
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:21:15 +02:00
6b3cc54652
docs: rewrite audit, DOCUMENTATION.md and IMPROVEMENTS.md for modular architecture
...
- AUDIT: conformity updated to 97.9% (142/145), modular references
- DOCUMENTATION.md: 1083 lines, 7 sections, 11 modules documented
- IMPROVEMENTS.md: A1-A10/B1-B10 annotated ✅/🔄/❌ with locations
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
c96c41fb45
docs: complete rewrite of the services documentation in French
...
- bot-detector.md: 11-module architecture, 77/65 features,
triple-voice ensemble (EIF+AE+XGBoost), browser 5 axes, HDBSCAN,
all environment variables verified against the source code
- dashboard.md: corrected stack (Jinja2+htmx, not React+Vite),
14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md: added CLICKHOUSE_DB_PROCESSING/LOGS,
dual-database schema, note that the dashboard does not use ja4_common
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:04:58 +02:00
8f5e771096
docs: complete rewrite of the database documentation in French
...
Rewrite of the 3 ClickHouse database documentation files:
- docs/database/schema.md: full coverage of both databases, 14+ tables,
7 dictionaries, 8 MVs, 8 views, TTL, partitions, engines and columns
- docs/database/migrations.md: 13 SQL files (added 10-12), updated
prerequisites (ClickHouse 24.8+, 5 CSVs), deploy_schema.sh, init-stack.sh,
complete verification and rollback
- shared/clickhouse/README.md: quick reference for the 13 files,
deploy_schema.sh, dual-database pattern, prerequisites
Removed obsolete references: dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:03:37 +02:00
d05969867f
docs: rewrite architecture/README, update deployment/development
...
- architecture.md: complete rewrite (French) with dual-database diagram,
5-phase data flow, full table ownership, triple-voice ML pipeline,
7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
services table, scripts section, integration tests, full doc index,
Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
add Python ML deps, add scripts/integration test sections,
fix bot-detector run command, add make targets
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865
docs: update thesis document (§2-§8)
...
- §2.1.3: Simplified Anubis to 2 dictionaries (dict_anubis_ip, dict_anubis_asn) with COALESCE priority
- §2.4.2: Added isotree library, calibration formula, ntrees=300, joblib serialization
- §2.4.2b/§2.4.4: Replaced DBSCAN with HDBSCAN throughout
- §2.4.2c: Replaced logistic regression with fixed linear weighting, added formula and weights
- §2.4.3: Clarified the 5-quantile approximation used for drift detection
- §3.1: Updated ASCII diagram (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Updated trifurcation + added multifactorial browser detection (5 axes)
- §4: Expanded taxonomy from 51 to 65+ features across 8 families
- §5: Added implementation status (✅/❌) to each technique
- §6: Added §6.6 deployment results (3M+ logs, 34K sessions/cycle)
- §7: Updated conclusion (65+ features, 5/8 techniques, modular refactoring)
- §8: Added references for isotree, PyTorch, HDBSCAN, XGBoost, SHAP
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
9ea36ad22e
feat(scripts): complete stack init + prod data import with date shift
...
Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules
New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
+ validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
with dynamic date shift (max(time) → now). Supports --shift, --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports
Makefile targets: init-stack, import-prod-data, init-and-import
Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts
import-prod-data.sh: 3M rows in ~37s with auto date shift
Dashboard: 55 routes OK, bot-detector: 36/36 tests pass
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:40:05 +02:00
d8ca804a55
feat(scripts): add reload-prod-logs.sh for prod→dev data sync
...
Exports http_logs from prod ClickHouse via HTTP API, imports into dev
with dynamic date shifting (max(time) → now() by default).
Features:
- Batch export in Native format (200K rows/batch, ~10s each)
- Auto date shift: prod max(time) aligned to current time
- --shift N: manual override (seconds)
- --days N: filter to last N days only
- --cron: silent mode for scheduled runs
- Staging table approach: export → staging → INSERT SELECT with shift → cleanup
Tested: 3,054,122 rows imported in ~3 minutes, dates 2026-04-03→2026-04-09.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:41:38 +02:00
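The date-shift step described in the commit above (prod max(time) aligned to the current time, with a manual override in seconds) can be illustrated with a small Python sketch. The actual script is a shell script working against ClickHouse; the function names and signatures here are hypothetical and only show the arithmetic.

```python
from datetime import datetime, timedelta, timezone


def compute_shift(prod_max_time, now, manual_shift_seconds=None):
    """Return the offset to add to every imported timestamp."""
    if manual_shift_seconds is not None:
        # Mirrors the idea of --shift N: an explicit override in seconds.
        return timedelta(seconds=manual_shift_seconds)
    # Auto mode: the newest prod row lands exactly on "now".
    return now - prod_max_time


def shift_timestamps(timestamps, shift):
    """Apply the same offset to every row so relative spacing is preserved."""
    return [ts + shift for ts in timestamps]


prod_times = [
    datetime(2026, 4, 3, 8, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc),
]
now = datetime(2026, 4, 9, 12, 0, tzinfo=timezone.utc)
shift = compute_shift(max(prod_times), now)
print(shift_timestamps(prod_times, shift))
```

Because a single offset is applied to all rows, intervals between requests stay intact, which matters for features computed over hourly windows.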
8180f4af04
refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules
...
- Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any)
- Remove Country rule collection from parse_bot_policies_inline
- Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules)
- Remove insert_ua_rules and insert_country_rules functions
- reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn
- Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql,
view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level
UA+IP > UA > IP > ASN > Country cascade)
- Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql
(UA table kept as stub for REGEXP_TREE catch-all compatibility)
- Remove anubis_country_rules table from schema
- Remove Anubis UA and Country tabs from dashboard reflists page
- Remove anubis_ua_rules/country_rules from API reflist queries
- deploy_schema.sql simplified from 339 to 122 lines
- 764 lines removed across 9 files
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:25:33 +02:00
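The simplified IP > ASN precedence from the commit above can be sketched in Python. The real logic lives in SQL CASE blocks backed by dict_anubis_ip and dict_anubis_asn; the function name, the rule dictionaries, and the first-match (rather than longest-prefix) CIDR lookup below are assumptions made for illustration.

```python
import ipaddress


def anubis_action(ip, asn, ip_rules, asn_rules):
    """Return the action of the first matching IP/CIDR rule, else the ASN rule."""
    addr = ipaddress.ip_address(ip)
    for cidr, action in ip_rules.items():
        # Membership is False when address and network families differ.
        if addr in ipaddress.ip_network(cidr):
            return action  # an IP/CIDR match always wins
    return asn_rules.get(asn)  # None when neither level matches


# Hypothetical rules using documentation-reserved ranges and ASNs.
ip_rules = {"192.0.2.0/24": "ALLOW"}
asn_rules = {64500: "DENY"}

print(anubis_action("192.0.2.10", 64500, ip_rules, asn_rules))
print(anubis_action("198.51.100.7", 64500, ip_rules, asn_rules))
print(anubis_action("198.51.100.7", 64501, ip_rules, asn_rules))
```

This mirrors a two-argument COALESCE: the ASN rule is only consulted when no IP/CIDR rule applies.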
98abbc80c7
feat(dashboard): reference lists page for viewing CSVs/dictionaries
...
New /reflists page for viewing the 9 ClickHouse dictionaries:
- bot_ip (3.5K entries): IP/CIDR of known bots
- bot_ja4 (31): bot JA4 fingerprints
- browser_ja4 (1.2K): browser JA4 fingerprints → family, TLS lib
- asn_reputation (82.5K): ASN → reputation (isp, datacenter, cdn…)
- iplocate_asn (714K): IP geolocation → ASN, country, name
- anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules
Features:
- 9 navigation tabs for switching between lists
- Text search with ClickHouse-side filtering
- Pagination (200 entries/page)
- Column sorting (ASC/DESC)
- Distribution chart (ECharts) by category
- Dictionary KPIs at the top of the page
- Documentation tooltips
API: /api/dictionaries, /api/reflist/{name}, /api/reflist/{name}/stats
Helpers: esc() (HTML escape) added to base.html
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:56:54 +02:00
039086a0b3
feat: new detection techniques and SOC tactics page
...
SQL:
- Add 5 aggregation columns (count_xff, count_unusual_ct,
count_non_std_port, count_login_post, sec_ch_mobile_mismatch)
- Expose 5 computed features in view_ai_features_1h
- ALTER TABLE migration for existing deployments
Bot-detector:
- 7 new ML features (has_xff, unusual_content_type_ratio,
non_standard_port_ratio, login_post_concentration,
sec_ch_mobile_mismatch, true_window_size, window_mss_ratio)
- Propagate campaign_id to ml_all_scores (was always -1)
- Campaign escalation: HIGH→CRITICAL if cluster has ≥5 members
Dashboard:
- SOC Tactics page: brute-force, JA4 rotation, recurrence,
real-time alerts; 4 KPIs + 4 panels + doc tooltips
- Add global fmtDate() helper
- Sidebar navigation updated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:29:18 +02:00
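The campaign escalation rule from the commit above (HIGH→CRITICAL once a cluster reaches 5 members) might look like the following Python sketch. The row layout, the function name, and the treatment of campaign_id < 0 as "not in any campaign" are assumptions for illustration.

```python
from collections import Counter


def escalate_campaigns(rows, min_members=5):
    """Promote HIGH rows to CRITICAL when their campaign is large enough."""
    # Count members per campaign; negative ids are assumed to mean "no campaign".
    sizes = Counter(r["campaign_id"] for r in rows if r["campaign_id"] >= 0)
    escalated = []
    for r in rows:
        level = r["threat_level"]
        if level == "HIGH" and sizes.get(r["campaign_id"], 0) >= min_members:
            level = "CRITICAL"
        # Return copies so the input rows are left untouched.
        escalated.append({**r, "threat_level": level})
    return escalated


rows = [{"ip": f"192.0.2.{i}", "campaign_id": 3, "threat_level": "HIGH"} for i in range(5)]
rows.append({"ip": "198.51.100.1", "campaign_id": -1, "threat_level": "HIGH"})
for row in escalate_campaigns(rows):
    print(row["ip"], row["threat_level"])
```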
702c0d5edb
feat(dashboard): add JA4 fingerprint and cluster investigation pages
...
- /ja4/{fingerprint} page: 8 KPIs, timeline, threat pie, IP scores
table, ASN/geo charts, HTTP logs, AI features — full JA4 investigation
- /cluster/{cid} page: 8 KPIs, timeline, threat/JA4/ASN/host charts,
member table with bulk classify — full campaign investigation
- /api/ja4/{fingerprint} and /api/cluster/{cid} API endpoints
- fmtJA4 links now navigate to /ja4/ investigation page
- campaigns.html: 'Ouvrir' button links to /cluster/{cid} full page
- Fix: double-brace {{param}} in non-f-string queries → single {param}
(was causing HTTP 500 on all parameterized ClickHouse queries)
- 50 routes total, all tests pass, 0 JS console errors
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:05:52 +02:00
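The double-brace fix in the commit above comes down to how Python treats braces in f-strings versus plain strings. The queries below are made up for illustration (only the {name:Type} placeholder syntax is ClickHouse's own); they show why `{{param}}` is only correct inside an f-string.

```python
table = "http_logs"

# In an f-string, "{{" and "}}" are escapes that collapse to single braces,
# so the ClickHouse placeholder survives interpolation of {table}.
f_query = f"SELECT count() FROM {table} WHERE src_ip = {{ip:String}}"

# In a plain string nothing is collapsed: the doubled braces are sent as-is
# and the server no longer sees a valid {ip:String} placeholder.
plain_broken = "SELECT count() FROM http_logs WHERE src_ip = {{ip:String}}"

# The fix for non-f-string queries: single braces.
plain_fixed = "SELECT count() FROM http_logs WHERE src_ip = {ip:String}"

print(f_query == plain_fixed)
print(plain_broken == plain_fixed)
```

A quick way to catch this class of bug is to search non-f-string query literals for `{{`.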
70188b508c
fix(dashboard): eliminate @apply CSS, fix status column, fix click propagation
...
Playwright testing revealed 3 critical bugs:
1. Tailwind CDN @apply with custom brand-* colors produces empty CSS
rules, breaking ALL design components (kpi-card, data-table, badges,
filter-btn, section-card, nav-item). Fix: replace all @apply
directives with equivalent raw CSS values.
2. Traffic API and IP detail API reference non-existent 'status' column
in http_logs table → HTTP 500 on /traffic and /ip/{ip}. Fix: remove
status from SELECT, sort whitelist, filters, and templates.
3. Nested <a> links (fmtJA4, fmtASN, fmtCountry, fmtBotName) inside
clickable <tr onclick> capture clicks, preventing row navigation to
/ip/ detail. Fix: add event.stopPropagation() to all formatter links.
Verified with Playwright: 10 pages × 0 JS errors, all tooltips hidden
by default, sidebar toggle works, keyboard shortcuts (Alt+1-9, Alt+B),
classification form saves to DB, campaign detail panel opens on click.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:54:38 +02:00
6babc55e3e
fix(dashboard): hover tooltips, full-width layout, UX polish
...
- Fix doc tooltips: split CSS into <style type='text/tailwindcss'> for
@apply directives + raw CSS for reliable doc panel rendering
- Convert doc panels from click-toggle to hover-based tooltips with
arrow pointer, fade-in animation, and auto-dismiss on mobile
- Replace '?' icons with 'ⓘ' across all 11 templates (51 tooltips)
- Full-width layout: reduce padding on mobile (px-3), scale up on
desktop (lg:px-5, xl:px-6) for maximum screen utilization
- Auto-collapse sidebar on narrow screens (<1024px)
- Keyboard shortcuts: Alt+1–9 for page navigation, Alt+B toggle sidebar
- Add LEGITIMATE_BROWSER filter button to detections page
- Sticky header with stronger blur (backdrop-blur-md)
- All 46 routes pass tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:30:16 +02:00
63ba6d203c
feat(dashboard): complete SOC dashboard with full monitoring and workflows
...
- models.html: Full rewrite — 6 KPIs, scoring volume timeline, anomaly rate
chart, threat breakdown per model, enhanced model cards with validation gate
- classify.html: SOC workflow — suggested unclassified IPs, quick-classify
buttons, classification stats pie, pre-fill from URL params
- traffic.html: Clickable rows → ip_detail, column sorting, status column,
search filter, doc tooltips on all chart sections
- scores.html: Search input, clickable rows → ip_detail, LEGITIMATE_BROWSER
filter button, doc tooltips on distribution + scatter charts
- ip_detail.html: Resource cascade section (headless browser detection),
status column in HTTP logs table
- detections.html: Doc tooltips on threat/reason/ASN chart sections
- features.html: Doc tooltips on radar/importance/scatter sections
- api.py: 4 new endpoints — /api/models/timeline, /api/models/threats,
/api/classify/stats, /api/classify/suggested. Traffic API: status + search.
46 routes total. All tests pass (dashboard + bot-detector 36/36).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:25:01 +02:00
396baa90d2
feat(dashboard): HDBSCAN cluster visualization
...
- Dedicated /campaigns page with 4 chart views:
· Scatter plot (score vs velocity, bubbles colored by campaign)
· Force-directed network graph (IPs linked by shared JA4)
· Campaign card grid (KPIs, ASN, country, JA4)
· Detail panel (behavioral radar, hourly timeline, member table)
- 4 new API endpoints:
· GET /api/campaigns (fix: campaign_id >= 0 instead of != '')
· GET /api/campaigns/graph (nodes + edges)
· GET /api/campaigns/scatter (score/velocity per IP)
· GET /api/campaigns/{cid} (detail + profile + timeline)
- Sidebar: Campaigns link added under Surveillance
- Overview: campaigns are clickable → link to /campaigns
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:11:16 +02:00
f1547423b5
refactor(bot-detector): remove monolith, multifactorial tests
...
- Remove bot_detector.py (1982 lines), replaced by 11 modules
- Browser tests updated for the multifactorial system (browser_confidence)
- 36/36 tests pass with the new modular structure
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:03:17 +02:00
1f103392ac
refactor(bot-detector): extract monolith into modular package
...
Split bot_detector.py (~1982 lines) into 10 focused modules:
- config.py: all configuration constants and optional imports
- log.py: logging utilities (log_info, log_decision, append_training_history)
- infra.py: ClickHouse client, health check HTTP server, shutdown
- browser.py: multifactorial browser identification (5 axes)
- scoring.py: drift detection, feature validation, SHAP, clustering
- models.py: EIF, Autoencoder, XGBoost model management
- preprocessing.py: data preprocessing and feature list definitions
- pipeline.py: core semi-supervised scoring loop
- cycle.py: main analysis cycle orchestration
- __main__.py: entry point with startup banner
Update Dockerfile to copy package directory and use python -m bot_detector.
All 36 existing tests pass unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:02:04 +02:00
2d04288e95
feat(dashboard): SOC workflow overhaul — sidebar nav, doc tooltips, full-width layout
...
- base.html: collapsible sidebar navigation, doc tooltip system, JS helpers
(fmtNum, fmtPct, fmtDuration, ecGrid, buildTable, docHTML)
- overview.html: SOC command center with stacked timeline, live alerts,
campaigns panel, browser donut, 6 KPIs
- detections.html: threat color dots, raw score column, click-to-navigate rows
- network.html: JA4 rotation, brute-force, persistent threats tables, 6 KPIs
- ip_detail.html: ASN/country KPIs, AE/XGB/campaign columns, enriched features
- scores/traffic/features/models/classify: page_title blocks + doc tooltips
- api.py: 9 new endpoints (campaigns, brute-force, ja4-rotation, recurrence,
cascade, alerts, timeline-detail, ua-rotation)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:29:34 +02:00
c994ad4466
fix: XGB label query + SHAP isotree compatibility
...
XGB: query was selecting features from ml_all_scores which doesn't
store them. Now joins ml_all_scores (labels) with view_ai_features_1h
(features). Dynamically discovers available columns to skip thesis §5
features not present in the view. Returns (model, features) tuple.
SHAP: TreeExplainer doesn't support isotree. Fall back to permutation-
based Explainer(model.decision_function, X_sample) for isotree.
Verified: XGB trained on 50000 labels (18436 positives), triple-voice
ensemble scoring active (EIF+AE+XGB), SHAP silent.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:06:54 +02:00
c6666e2bba
fix: isotree score convention — proper sklearn calibration
...
isotree decision_function returns [0,1] (higher=anomalous, 0.5=boundary).
The entire pipeline (normalize_scores, score_to_threat_level,
compute_adaptive_threshold) expects sklearn convention (negative=anomalous).
The previous fix (-raw_scores) simply negated all values, which pushed
scores below -0.30 and classified everything as CRITICAL. New fix:
0.5 - isotree_score maps correctly to sklearn's convention:
isotree 0.80 → -0.30 (CRITICAL)
isotree 0.65 → -0.15 (HIGH)
isotree 0.55 → -0.05 (MEDIUM)
isotree 0.50 → 0.00 (boundary)
Verified: 27,952 LEGITIMATE_BROWSER + 15,843 HIGH + 15,059 MEDIUM
Tests: 36/36 pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:56:05 +02:00
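The calibration described in the commit above is a one-line transform. The sketch below reproduces the four example values from the commit message; the function name is hypothetical, and the threat labels are copied from the message rather than computed.

```python
def isotree_to_sklearn(score: float) -> float:
    """Map an isotree score in [0, 1] (0.5 = boundary, higher = anomalous)
    onto the sklearn convention (0 = boundary, negative = anomalous)."""
    return 0.5 - score


# Example values and labels as listed in the commit message.
examples = [(0.80, "CRITICAL"), (0.65, "HIGH"), (0.55, "MEDIUM"), (0.50, "boundary")]
for raw, label in examples:
    print(f"isotree {raw:.2f} -> {isotree_to_sklearn(raw):+.2f} ({label})")
```

Compared with plain negation, subtracting from 0.5 keeps the decision boundary at zero, so the existing sign-based thresholds keep their meaning.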
db306fb9da
fix: P0 audit bugs — bot-detector + dashboard + SQL
...
Bot-detector:
- B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies
- B1.4/B1.5: log_decision argument order fixed (cycle_id, name)
- B1.7: AE broadcast error — model now returns features list, scoring
uses model's features instead of current cycle's (prevents dim mismatch)
- B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name
Dashboard:
- C1.1: XSS in ip_detail.html — {{ ip | tojson }} instead of raw string
- C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing
formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized
- C2.1: status filter now correctly filters http_version column
- C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days
SQL:
- B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomalous)
- B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:33:00 +02:00
b66d41a200
docs: updated conformity audit bot-detector + dashboard vs thesis
...
Score: 93% (was 72%) — 4 thesis techniques now implemented,
browser classification, ASN PeeringDB, SOC feedback loop.
Identifies 9 bot-detector bugs (2 critical: campaign_id/raw_anomaly_score
never inserted, worst_score inverted) and 11 dashboard bugs (4 critical:
XSS, no auth, no CSRF, CORS misconfiguration).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:25:19 +02:00
98289ccf04
fix: ASN dictionary pipeline + verbose bot-detector logging
...
- Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols)
- Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format)
- Replace org/domain dictGet calls with empty string literals in MV
- Full 714K CIDR stub for complete ASN resolution in tests
- Add header generation to generate_asn_data.py
- Verbose bot-detector stdout: data summary, triage breakdown, model
training details, scoring stats, browser classification, boxed results
- Fix IPv6 filter in traffic seeder (_ips_from_cidrs)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 17:43:55 +02:00
7b7b69dee3
Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic
...
- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.
Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv
Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:35:40 +02:00
74e0406c38
chore: update ASN stubs with new classification labels
...
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:05:25 +02:00
5c5bca71d1
feat: rewrite ASN classification with PeeringDB + expanded heuristics
...
Major improvements to generate_asn_data.py:
- Add PeeringDB network data source (34K networks with info_type)
- Add new categories: education, government, enterprise
- Rename 'human' label to 'isp' across all consumers
- Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov)
- Add hard-coded lists for education, government, enterprise ASNs
- Support both --output-dir and --output-asn/--output-ipasn CLI interfaces
- Add --no-peeringdb flag for offline use
Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs,
education 3.1K, enterprise 5.7K, government 520.
Updated consumers:
- bot_detector.py: 'human' -> 'isp' for baseline selection
- dashboard api.py: 'human' -> 'isp' in SQL queries
- run-tests.sh: 'human' -> 'isp' in integration test assertions
- update-csv-data.sh: updated label description comment
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:02:07 +02:00
9a48fb9d29
feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency
...
Add browser legitimacy classification (A9) to the bot detection pipeline:
- New features: is_known_browser (binary) and browser_consistency_score [0..5]
combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language,
cookies, Sec-Fetch-* presence
- Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW
threat level are reclassified as LEGITIMATE_BROWSER
- Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays
in normal anomaly scoring — prevents evasion via JA4 spoofing
- XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label)
- ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores
- Dashboard: browser_family filter/sort on detections and scores endpoints,
legitimate_browsers count and browser_stats in overview
- 6 new unit tests covering classification threshold, spoofing, exclusion logic
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:46:22 +02:00
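The reclassification rule from the commit above (known browser JA4, consistency of at least 4 out of 5, and a NORMAL/LOW threat level) can be sketched as below. The function names and signatures are hypothetical, and the modern_browser_score cut-off of 50 is an assumption; the commit does not say how each signal is thresholded.

```python
def browser_consistency_score(ja4_is_browser, modern_browser_score,
                              has_accept_language, has_cookies, has_sec_fetch,
                              modern_threshold=50):
    """Count how many of the 5 signals are present (0..5)."""
    signals = [
        ja4_is_browser,
        modern_browser_score >= modern_threshold,  # assumed cut-off
        has_accept_language,
        has_cookies,
        has_sec_fetch,
    ]
    return sum(bool(s) for s in signals)


def final_label(threat_level, is_known_browser, consistency):
    """Reclassify only low-risk sessions that look consistently browser-like."""
    if is_known_browser and consistency >= 4 and threat_level in ("NORMAL", "LOW"):
        return "LEGITIMATE_BROWSER"
    # Spoofed JA4 (low consistency) or elevated threat keeps its anomaly label.
    return threat_level


genuine = browser_consistency_score(True, 100, True, True, True)
spoofed = browser_consistency_score(True, 0, False, False, False)
print(final_label("LOW", True, genuine))
print(final_label("LOW", True, spoofed))
print(final_label("HIGH", True, genuine))
```

The second call shows the spoofing case: a known browser JA4 alone is not enough to escape normal anomaly scoring.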
7d09c614c3
feat: browser JA4 detection, Anubis bot rules, worldwide ASN data
...
- Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com
covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...)
- Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck,
OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries)
- Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs)
- Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views
- Add /api/browsers dashboard endpoint
- Fix CSV quoting for fields containing commas (User-Agent strings)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:27:37 +02:00
b6184e6529
feat: CSV generation scripts, API filter params, enriched CSV stubs
...
- scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries)
- scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families
- scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes
- scripts/update-csv-data.sh: master orchestrator with --install-stubs
- api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores
- pages.py: add /network route
- csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:05:43 +02:00
c6ca352db9
feat(dashboard): add clickable drill-down to all data elements
...
Add navigation helpers (fmtASN, fmtCountry, fmtJA4, fmtBotName,
fmtThreatLink, fmtLabel) to base.html for SOC analyst drill-down.
Update all templates:
- overview.html: clickable table cells + ECharts click handlers for
ASN, country, JA4, bot, and threat charts
- detections.html: URL param pre-filters, active filter bar with
clear buttons, clickable ASN/country/JA4/threat in table
- scores.html: URL param pre-filters, clickable threat/JA4/country
- traffic.html: clickable JA4 and country columns
- ip_detail.html: clickable threat/JA4 in detections, clickable
asn_org/country_code/asn_label in AI features grid
- network.html: click handlers on ASN treemap and country sunburst,
fmtJA4Full/fmtLabel/fmtBotName/fmtASN in tables
- features.html: scatter plot click navigates to /ip/{ip}
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 14:58:48 +02:00
fc882dd3e7
feat(tests): realistic traffic seeder + IP diversity via mod_remoteip
...
Option A — X-Forwarded-For + mod_remoteip:
- httpd-integration.conf: load mod_remoteip, trust all Docker RFC-1918
subnets (172/192.168/10). mod_reqin_log uses r->useragent_ip which
mod_remoteip updates from XFF → each request logged with distinct src_ip
- generate_traffic.py: XFF always set (was 30% only); human scenarios
use 91.121/78.41/90.x ranges, bot scenarios use 185.220/45.155/193.32;
pool of 1168 human IPs and 180 bot IPs; default --requests 500
Option D — Direct ClickHouse seeder (seed_clickhouse.py, stdlib only):
- Inserts ~4000 rows into http_logs_raw triggering full MV chain:
http_logs_raw → mv_http_logs → http_logs
→ mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h
• 720 human sessions: IPs in OVH/SFR/Orange ASN ranges (16276/15557/3215)
→ dict_asn_reputation maps these to asn_label='human'
→ satisfies bot_detector human_baseline >= 500 threshold
• 150 scanner sessions: datacenter IPs, attack paths (/.env, wp-login,
SQLi, path traversal), scanner UAs, minimal TCP fingerprints
• 100 known-bot sessions: IPs matching bot_ip.csv entries
• 20 brute-force clusters: 20-50 POST /login per IP
All TCP/TLS metadata is profile-realistic (window, MSS, TTL, JA4, JA3)
CSV stubs (mounted at /var/lib/clickhouse/user_files/):
- iplocate-ip-to-asn.csv: 13 CIDR→ASN mappings (OVH/SFR/Orange/Tor/Contabo)
- asn_reputation.csv: 13 ASN→label (8 'human', 3 'datacenter'/'hosting')
- bot_ip.csv: 14 known scanner/Tor IPs (Shodan, Censys, Tor exits)
- bot_ja4.csv: 5 bot JA4 fingerprints (curl, python-requests, masscan, zgrab)
run-tests.sh:
- Phase 4a: seeder runs before live traffic (ensures bot_detector baseline)
- Phase 4b: live traffic gen at 500 requests (up from 200)
- Phase 5f: new assertions — agg_host_ip_ja4_1h populated, ≥500 human
rows in view_ai_features_1h, known-bot labels present
- Phase 7: verifies ml_all_scores populated (bot_detector ran a cycle)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 11:35:34 +02:00
f448dcb4b0
fix(rpm): standardize systemd scriptlets and unit installation paths
...
- Add BuildRequires: systemd-rpm-macros to sentinel and correlator specs
- Replace manual systemctl calls with %systemd_post, %systemd_preun,
%systemd_postun_with_restart macros (handles daemon-reload, stop/disable,
try-restart on upgrade correctly and is a no-op in containers)
- ja4sentinel.spec: use %{_unitdir} macro instead of hardcoded path
(/usr/lib/systemd/system); remove cross-service /var/run/logcorrelator
from %files and %post (owned by logcorrelator package, not sentinel)
- logcorrelator.spec: move unit from /etc/systemd/system (admin namespace)
to %{_unitdir} (/usr/lib/systemd/system) — correct packaging location;
move user/group creation from %post to %pre so file ownership is valid
during RPM install phase; add Requires(pre): shadow-utils; fix bare
directory entries in %files with %dir macro; add version fallback macro
so spec is buildable without --define version
- test-rpm.sh: auto-build RPM via Dockerfile.package if dist/rpm/ is
empty; update service file path check to /usr/lib/systemd/system/
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 10:49:21 +02:00
f7ee5e63f8
fix(docker): add g++ for isotree build, add dashboard Dockerfile.tests
...
- bot-detector Dockerfile + Dockerfile.tests: install g++ for isotree C++ extension
- dashboard Dockerfile.tests: new smoke test (verify FastAPI app loads)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 08:08:13 +02:00
77c0450a22
docs: update copilot-instructions.md for dashboard rewrite and ML upgrades
...
- Dashboard: FastAPI+React → FastAPI+Jinja2+htmx+Chart.js (2 route modules)
- Bot-detector: IsolationForest → triple-voice EIF+Autoencoder+XGBoost ensemble
- SQL schema: 10 → 13 files (added thesis features, perf indexes, views)
- Added ClickHouse 24.8 gotchas (projections, nested aggregates, let bindings)
- Added IPv4/IPv6 duality pattern, bot-detector test patterns
- Updated data retention table with 4 new thesis aggregation tables
- Fixed single-test commands to reference existing files
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 07:31:10 +02:00
b735bab5a5
feat(dashboard): rebuild SOC dashboard + fix ClickHouse SQL
...
Complete rewrite of the SOC dashboard using FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS.
Replaces the old React/Vite frontend with server-rendered templates.
Dashboard pages:
- Overview: KPIs, timeline chart, threat distribution, top IPs
- Detections: paginated/filterable anomaly table
- Scores: ml_all_scores with AE error & XGB prob columns
- Traffic: HTTP logs with method/host filters
- IP Investigation: full deep-dive (scores, features, HTTP logs, classify)
- Classification: SOC feedback form + history
- Features: AI + thesis feature stats
- Models: scoring stats + model metadata
API: 9 JSON endpoints with parameterized queries, sort whitelists
SQL fixes:
- 05_aggregation_tables: add deduplicate_merge_projection_mode
- 11_views: fix nested aggregate (argMax inside sum)
- 12_thesis_features: remove invalid 'let' bindings, fix groupArrayIf type
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 03:21:05 +02:00
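The "parameterized queries, sort whitelists" item in the dashboard commit above can be illustrated with a minimal sketch. The helper name, the whitelist contents, the defaults, and the row cap are assumptions made for illustration; only the table name ml_all_scores and the columns xgb_prob and ae_recon_error come from the log. The `{limit:UInt32}` placeholder follows ClickHouse's server-side parameter syntax.

```python
# Hypothetical whitelist: only these column names may ever reach ORDER BY.
ALLOWED_SORT_COLUMNS = {"xgb_prob", "ae_recon_error"}
ALLOWED_DIRECTIONS = {"asc", "desc"}


def build_scores_query(sort_by: str, direction: str, limit: int):
    """Return (sql, params) for a sorted, limited read of ml_all_scores.

    Identifiers cannot be bound as query parameters, so the sort column and
    direction are checked against whitelists; values (here the limit) are
    passed as bound parameters instead of being formatted into the SQL text.
    """
    if sort_by not in ALLOWED_SORT_COLUMNS:
        sort_by = "xgb_prob"  # assumed default sort column
    direction = direction.lower()
    if direction not in ALLOWED_DIRECTIONS:
        direction = "desc"  # assumed default direction
    sql = (
        "SELECT * FROM ml_all_scores "
        f"ORDER BY {sort_by} {direction.upper()} "
        "LIMIT {limit:UInt32}"  # plain string: left for the driver to bind
    )
    params = {"limit": max(1, min(int(limit), 1000))}  # assumed cap of 1000
    return sql, params
```

An untrusted `sort_by` value such as `"xgb_prob; DROP TABLE x"` falls back to the default column instead of being interpolated into the query.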
228ad7026a
fix(integration): mount missing SQL files 10-12 in ClickHouse init
...
3 SQL files were missing from the docker-compose.yml volume mounts:
- 10_perf_indexes.sql (performance indexes)
- 11_views.sql (dashboard views)
- 12_thesis_features.sql (thesis §5 MVs and views)
Also make 10_perf_indexes.sql non-fatal in init script since ALTER TABLE
ADD INDEX may fail if index already exists.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:55:43 +02:00
8d58f2b932
feat(bot-detector): add XGBoost supervised third voice (#10)
...
Triple-voice ensemble architecture:
- EIF (unsupervised, zero-day anomalies)
- Autoencoder (unsupervised, non-linear correlations)
- XGBoost (supervised, known patterns + SOC feedback)
XGBoost implementation:
- Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1)
- Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required
- Score = predict_proba, combined via meta-learner: (1-β)*(EIF+AE) + β*xgb_prob
- Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS
- Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only
- ClickHouse: xgb_prob column added to ml_all_scores
- Tests: 4 new tests (availability, train/predict, meta-learner, save/load)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:45:57 +02:00
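The meta-learner formula in the XGBoost commit above, (1-β)*(EIF+AE) + β*xgb_prob, can be sketched as follows. This is a hedged illustration, not the project's code: the function names are hypothetical, and reading "(EIF+AE)" as the α-weighted blend (1-α)*EIF + α*AE is an assumption based on the AE_WEIGHT setting from the Autoencoder commit. The fallback branches mirror the "EIF+AE only" and "EIF-only" degradation described in the log.

```python
AE_WEIGHT = 0.30   # α, default taken from the Autoencoder commit
XGB_WEIGHT = 0.20  # β, default taken from the XGBoost commit

# Label mapping used to train the supervised voice, as listed in the commit.
POSITIVE_LABELS = {"HIGH", "CRITICAL", "DENY", "KNOWN"}


def to_binary_label(threat_level: str) -> int:
    """Map a historical ml_all_scores label to 0 (NORMAL) or 1 (bot-like)."""
    return 1 if threat_level in POSITIVE_LABELS else 0


def combine_scores(eif, ae=None, xgb_prob=None,
                   alpha=AE_WEIGHT, beta=XGB_WEIGHT):
    """Blend the three voices; all inputs are assumed to lie in [0, 1].

    ae=None models the EIF-only fallback (torch unavailable or AE failed);
    xgb_prob=None models the EIF+AE fallback (xgboost unavailable or too
    few labels).
    """
    # Assumption: "(EIF+AE)" means the α-weighted unsupervised blend.
    unsupervised = eif if ae is None else (1 - alpha) * eif + alpha * ae
    if xgb_prob is None:
        return unsupervised
    return (1 - beta) * unsupervised + beta * xgb_prob
```

With the default weights, the supervised voice can shift the final score by at most β = 0.20, so a confident XGBoost prediction nudges the unsupervised ensemble without overriding it.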
57cf6c3828
feat(bot-detector): add parallel Autoencoder scorer (#9)
...
- TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU
- Trained alongside EIF on human_baseline, saved/loaded with model versioning
- Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30)
- AE latent space (16-dim) used for HDBSCAN clustering instead of raw features
- Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE
- Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues
- ClickHouse: ae_recon_error column added to ml_all_scores
- Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:40:39 +02:00
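Two details of the Autoencoder commit above lend themselves to a small framework-free sketch: the symmetric layer layout (n→64→32→16→32→64→n) and the per-sample MSE reconstruction error used as the score. The helper names are hypothetical, and the real TrafficAutoEncoder presumably builds these layers in torch with BatchNorm+ReLU, which this sketch deliberately leaves out.

```python
def layer_sizes(n_features, hidden=(64, 32), latent_dim=16):
    """Return the width of every layer in a symmetric autoencoder.

    The encoder narrows from n_features down to latent_dim; the decoder
    mirrors the encoder widths back up to n_features.
    """
    encoder = [n_features, *hidden, latent_dim]
    decoder = encoder[-2::-1]  # mirror of the encoder, without the latent layer
    return encoder + decoder


def per_sample_mse(inputs, reconstructions):
    """Mean squared reconstruction error, computed one row at a time.

    A row the model reconstructs poorly gets a high error, which is what
    makes it usable as an anomaly score.
    """
    errors = []
    for row, recon in zip(inputs, reconstructions):
        squared = [(x - y) ** 2 for x, y in zip(row, recon)]
        errors.append(sum(squared) / len(squared))
    return errors
```

For example, `layer_sizes(20)` yields the seven widths of the architecture named in the commit for a 20-feature input, and the 16-wide middle layer is the latent space the commit reuses for HDBSCAN clustering.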
f6e2d3c0ca
feat(bot-detector): implement 8 state-of-the-art improvements
...
- EIF: Extended Isolation Forest via isotree (fallback to sklearn IF)
- Benford's Law deviation feature on inter-request timing
- Lag-1 autocorrelation feature for cadence analysis
- Validation gate: reject model if val_anomaly_rate > 20%
- Feature pruning: remove variance < 1e-6 features before training
- Quantile drift: replace the synthetic N(μ,σ) reference with quantile interpolation
- Thread safety: Lock for _service_healthy/_consecutive_failures
- Score normalization: inverted to [0,1] where 1=most anomalous
SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h
Tests: 10 new test functions covering all improvements
Integration: verify_mvs.py checks new thesis feature columns
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:31:26 +02:00
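Three of the improvements listed in the commit above (lag-1 autocorrelation, variance-based feature pruning, and the validation gate) can be sketched with the standard library alone. The function names and data shapes are assumptions for illustration; the thresholds (1e-6 and 20%) are the ones stated in the log. The autocorrelation uses the usual sample estimator, which may differ in detail from the SQL version in view_thesis_features_1h.

```python
import statistics


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence (e.g. inter-request gaps).

    Values near +1 suggest a smooth, machine-like cadence; values near 0
    suggest no dependence between consecutive gaps.
    """
    if len(values) < 2:
        return 0.0
    mean = sum(values) / len(values)
    denominator = sum((v - mean) ** 2 for v in values)
    if denominator == 0:
        return 0.0  # constant series: autocorrelation is undefined, use 0
    numerator = sum(
        (values[i] - mean) * (values[i + 1] - mean)
        for i in range(len(values) - 1)
    )
    return numerator / denominator


def prune_low_variance(columns, threshold=1e-6):
    """Drop features whose population variance is below the threshold.

    `columns` maps a feature name to its list of values (assumed layout).
    """
    return {
        name: vals
        for name, vals in columns.items()
        if statistics.pvariance(vals) >= threshold
    }


def passes_validation_gate(val_anomaly_rate, max_rate=0.20):
    """Reject a freshly trained model that flags more than 20% of the
    validation set as anomalous."""
    return val_anomaly_rate <= max_rate
```

The gate is a cheap sanity check: a model trained on a human baseline that marks a fifth of held-out human traffic as anomalous is more likely broken than insightful.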
0d1a6a81e0
docs: update thesis with EIF, autoencoders, ensemble architecture, quantile drift
...
- §2.4.2: Add Extended Isolation Forest theory (Hariri et al., TKDE 2021)
- §2.4.2b: New section on autoencoders for network anomaly detection
(Kitsune, β-VAE, hybrid AE+IF studies)
- §2.4.2c: New section on hybrid supervised+unsupervised ensembles
(triple-voice architecture: EIF + AE + XGBoost + meta-learner)
- §2.4.3: Enhanced drift detection with quantile digest and validation gate
- §6.2: Multi-level baseline contamination mitigation
- §7: Updated conclusion reflecting ensemble architecture
- §8: 10 new references (Hariri 2021, Mirsky 2018, Baptiste 2026, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:23:00 +02:00
3ae8c7d9c9
feat(bot-detector): upgrade to state-of-the-art detection pipeline
...
- Fix UnboundLocalError on global _consecutive_failures/_service_healthy
- Add SQL identifier validation for DB names at startup
- Replace Z-score drift detection with KS test (scipy.stats.ks_2samp)
- Replace DBSCAN with HDBSCAN (adaptive clustering, no epsilon needed)
- Replace blanket NaN→0 imputation with a per-feature median/sentinel strategy
- Add 80/20 temporal train/validation split with offline metrics logging
- Integrate thesis §5 features from view_thesis_features_1h:
path_transition_entropy, cadence_cv, burst/pause ratios,
host_diversity, host_sweep_speed, host_coverage_uniformity,
ja4_drift_ratio (Complet model only)
- Add SOC feedback loop: read classifications from audit_logs,
reclassify FP IPs as human, exclude TP IPs from baseline
- Update dependencies: clickhouse-connect 0.8.12, scikit-learn 1.6.1,
pandas 2.2.3, shap 0.47.2, add scipy>=1.14, hdbscan>=0.8.38
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:09:18 +02:00
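Two items from the commit above, SQL identifier validation for DB names and the 80/20 temporal train/validation split, can be sketched as follows. The function names, the accepted identifier pattern, and the `ts` row key are assumptions for illustration; the log does not say which pattern or column the project actually uses.

```python
import re

# Assumed rule: ASCII letters, digits and underscores, not starting with a digit.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name):
    """Return the name unchanged if it is a safe SQL identifier, else raise.

    Database and table names cannot be passed as bound parameters, so they
    are validated once at startup before being formatted into queries.
    """
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def temporal_split(rows, train_fraction=0.8, ts_key="ts"):
    """Split rows into (train, validation) by time, not at random.

    The oldest 80% of rows train the model and the newest 20% validate it,
    so validation never sees data older than the training window.
    """
    ordered = sorted(rows, key=lambda row: row[ts_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

A temporal split is preferred over a shuffled one here because traffic drifts over time; shuffling would leak future behaviour into training and make the offline metrics look better than they are.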