Commit Graph

8 Commits

Author SHA1 Message Date
c6cb12981c feat(ml): replace NetworkX/Louvain with PyTorch Geometric GraphSAGE for fleet detection
Rewrite fleet.py to use a GNN-based approach: nodes are src_ip with ML feature
vectors, edges connect IPs sharing (JA4, ASN) pairs, GraphSAGE (2 SAGEConv
layers, in→64→32) produces 32D embeddings clustered by HDBSCAN. PyG NeighborLoader
activates for >50k nodes. Update thesis docs (§5.2, §6.4, §2, §8) to reflect
GraphSAGE architecture and PyG scalability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 15:45:34 +02:00
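
The edge rule in c6cb12981c (connect IPs that share a (JA4, ASN) pair) can be sketched without any graph library. This is a minimal illustration, not the code in fleet.py: the `build_edges` name, the `(src_ip, ja4, asn)` record shape, and the sample JA4 strings and ASN numbers are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(records):
    """Return sorted undirected edges between IPs sharing a (JA4, ASN) pair.

    `records` is an iterable of (src_ip, ja4, asn) tuples; this input
    shape is hypothetical and chosen only for the sketch.
    """
    groups = defaultdict(set)
    for src_ip, ja4, asn in records:
        groups[(ja4, asn)].add(src_ip)

    edges = set()
    for ips in groups.values():
        # Every pair of distinct IPs in the same group gets one edge.
        for a, b in combinations(sorted(ips), 2):
            edges.add((a, b))
    return sorted(edges)


records = [
    ("10.0.0.1", "ja4_placeholder_a", 64500),
    ("10.0.0.2", "ja4_placeholder_a", 64500),
    ("10.0.0.3", "ja4_placeholder_a", 64501),  # same JA4, different ASN
    ("10.0.0.4", "ja4_placeholder_b", 64500),  # same ASN, different JA4
]
edges = build_edges(records)
```

Only the first two IPs end up connected, because both the JA4 and the ASN must match. Pairwise expansion is quadratic in group size, so a very common (JA4, ASN) pair would produce a dense clique; how the real code handles that is not stated in the commit message.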
c1821dcbc4 feat(ml): replace Autoencoder with RealNVP Normalizing Flow and add SessionTransformer embeddings
Replace TrafficAutoEncoder (MSE reconstruction scoring) with TrafficNormalizingFlow
(RealNVP via FrEIA, 4 affine coupling blocks, anomaly score = -log p(x)) for
mathematically rigorous density estimation. Add SessionTransformer module producing
32-dimensional sequence embeddings from raw HTTP request sequences (path, method,
timing) via a lightweight TransformerEncoder, replacing path_transition_entropy and
cadence_cv features. Update thesis documentation sections 2.4.2b and 3.8 accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 15:11:21 +02:00
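
The anomaly score named in c1821dcbc4, -log p(x), follows from the change-of-variables formula for a flow with a standard normal base. The sketch below hand-writes a single affine coupling step in plain Python to show the arithmetic; it does not use FrEIA, and `coupling_forward`, `anomaly_score`, and the fixed scale/shift functions are hypothetical stand-ins for the learned subnetworks of the four coupling blocks.

```python
import math


def coupling_forward(x, scale_fn, shift_fn):
    """One affine coupling step on an even-length list.

    The first half passes through unchanged; the second half is scaled
    and shifted by functions of the first half.
    Returns (z, log_det_jacobian).
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s = scale_fn(x1)
    t = shift_fn(x1)
    z2 = [v * math.exp(si) + ti for v, si, ti in zip(x2, s, t)]
    return x1 + z2, sum(s)


def anomaly_score(x, scale_fn, shift_fn):
    """Return -log p(x) = -(log N(z; 0, I) + log|det J|)."""
    z, log_det = coupling_forward(x, scale_fn, shift_fn)
    log_base = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return -(log_base + log_det)


def zeros(x1):
    # Toy subnetwork: zero scale/shift makes the flow the identity map.
    return [0.0 for _ in x1]


def half_scale(x1):
    # Toy subnetwork: constant log-scale of 0.5 per dimension.
    return [0.5 for _ in x1]


typical = anomaly_score([0.0, 0.0], zeros, zeros)
outlier = anomaly_score([4.0, 4.0], zeros, zeros)
scaled = anomaly_score([0.0, 0.0], half_scale, zeros)
```

With the identity flow the score reduces to the negative log-density of a standard normal, so the point far from the origin scores higher than the point at the origin.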
9d27abf43c fix(ml): integrate Cleanlab to filter noisy SOC labels and prevent model poisoning
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 02:11:25 +02:00
8d58f2b932 feat(bot-detector): add XGBoost supervised third voice (#10)
Triple-voice ensemble architecture:
- EIF (unsupervised, zero-day anomalies)
- Autoencoder (unsupervised, non-linear correlations)
- XGBoost (supervised, known patterns + SOC feedback)

XGBoost implementation:
- Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1)
- Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required
- Score = predict_proba, combined via meta-learner: (1-β)*(EIF+AE) + β*xgb_prob
- Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS
- Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only
- ClickHouse: xgb_prob column added to ml_all_scores
- Tests: 4 new tests (availability, train/predict, meta-learner, save/load)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:45:57 +02:00
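
The meta-learner formula in 8d58f2b932, (1-β)*(EIF+AE) + β*xgb_prob, together with the fallback to EIF+AE only, can be written as a small function. This is a sketch under assumptions: the commit does not spell out how EIF and AE are blended, so the convex combination with AE_WEIGHT (α=0.30) below is a guess, all scores are assumed to lie in [0, 1], and `combine_scores` is a hypothetical name.

```python
def combine_scores(eif, ae, xgb_prob=None, alpha=0.30, beta=0.20):
    """Blend three scores; fall back to EIF+AE when xgb_prob is None."""
    # Assumption: "EIF+AE" means a convex combination weighted by alpha.
    unsupervised = (1.0 - alpha) * eif + alpha * ae
    if xgb_prob is None:
        # Fallback: xgboost unavailable or too few labels.
        return unsupervised
    return (1.0 - beta) * unsupervised + beta * xgb_prob


without_xgb = combine_scores(eif=0.5, ae=1.0)
with_xgb = combine_scores(eif=0.5, ae=1.0, xgb_prob=1.0)
```

Because both steps are convex combinations, the result stays in [0, 1] whenever the inputs do.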
57cf6c3828 feat(bot-detector): add parallel Autoencoder scorer (#9)
- TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU
- Trained alongside EIF on human_baseline, saved/loaded with model versioning
- Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30)
- AE latent space (16-dim) used for HDBSCAN clustering instead of raw features
- Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE
- Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues
- ClickHouse: ae_recon_error column added to ml_all_scores
- Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:40:39 +02:00
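
Two details from 57cf6c3828 are easy to check in isolation: the symmetric layer layout (n→64→32→16→32→64→n) and the per-sample MSE reconstruction score. The sketch below uses plain Python lists rather than torch tensors; `layer_sizes` and `per_sample_mse` are hypothetical helper names, not functions from the TrafficAutoEncoder class.

```python
def layer_sizes(n_features, hidden=(64, 32), latent=16):
    """Return the width of each layer of a symmetric autoencoder."""
    encoder = [n_features, *hidden, latent]
    # Mirror the encoder (minus the latent layer) to get the decoder.
    return encoder + encoder[-2::-1]


def per_sample_mse(inputs, reconstructions):
    """Mean squared error per row; one score per sample."""
    return [
        sum((x - r) ** 2 for x, r in zip(row, rec)) / len(row)
        for row, rec in zip(inputs, reconstructions)
    ]


sizes = layer_sizes(10)
scores = per_sample_mse(
    [[1.0, 2.0], [0.0, 0.0]],
    [[1.0, 2.0], [1.0, 3.0]],
)
```

A perfectly reconstructed row scores 0.0, and the score grows with the squared distance between a row and its reconstruction.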
f6e2d3c0ca feat(bot-detector): implement 8 state-of-the-art improvements
- EIF: Extended Isolation Forest via isotree (fallback to sklearn IF)
- Benford's Law deviation feature on inter-request timing
- Lag-1 autocorrelation feature for cadence analysis
- Validation gate: reject model if val_anomaly_rate > 20%
- Feature pruning: remove variance < 1e-6 features before training
- Quantile drift: replace N(μ,σ) synthetic data with quantile interpolation
- Thread safety: Lock for _service_healthy/_consecutive_failures
- Score normalization: inverted to [0,1] where 1=most anomalous
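
The two timing features in the list above (Benford's Law deviation and lag-1 autocorrelation) can be sketched in plain Python. The commit does not say which deviation metric is used, so the mean absolute difference between observed and expected first-digit frequencies below is an assumption, as are the function names and the choice to return 0.0 for degenerate inputs.

```python
import math

# Expected first-digit frequencies under Benford's Law, digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def benford_deviation(intervals):
    """Mean absolute gap between observed and Benford first-digit rates."""
    # Scientific notation puts the leading significant digit first.
    digits = [int(f"{v:.15e}"[0]) for v in intervals if v > 0]
    if not digits:
        return 0.0
    observed = [digits.count(d) / len(digits) for d in range(1, 10)]
    return sum(abs(o - e) for o, e in zip(observed, BENFORD)) / 9


def lag1_autocorrelation(values):
    """Correlation between each value and its successor."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant series: autocorrelation undefined
    numer = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return numer / denom


fixed_cadence = [0.5] * 10                    # identical gaps between requests
spread_cadence = [1.1 ** k for k in range(200)]  # gaps spanning several decades
alternating = lag1_autocorrelation([1.0, -1.0, 1.0, -1.0])
ramp = lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])
```

A client that fires requests at a fixed interval concentrates all first digits on one value and deviates strongly from Benford's distribution, while gaps spread over several orders of magnitude stay closer to it.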

SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h
Tests: 10 new test functions covering all improvements
Integration: verify_mvs.py checks new thesis feature columns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:31:26 +02:00
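
The validation gate and feature pruning rules from f6e2d3c0ca (reject the model if val_anomaly_rate > 20%, drop features with variance < 1e-6) amount to two threshold checks. The sketch below is an illustration only: the function names, the dict-of-columns input, the use of population variance, and the `anomaly_threshold` parameter are assumptions not stated in the commit.

```python
from statistics import pvariance


def prune_low_variance(columns, threshold=1e-6):
    """Keep only features whose variance is at least `threshold`.

    `columns` maps a feature name to its list of values.
    """
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) >= threshold
    }


def passes_validation_gate(val_scores, anomaly_threshold, max_rate=0.20):
    """Accept the model unless too many validation samples look anomalous."""
    if not val_scores:
        return False  # assumption: no validation data means no acceptance
    flagged = sum(1 for s in val_scores if s >= anomaly_threshold)
    return flagged / len(val_scores) <= max_rate


kept = prune_low_variance({
    "constant_feature": [1.0, 1.0, 1.0, 1.0],
    "useful_feature": [0.1, 0.9, 0.4, 0.7],
})
accepted = passes_validation_gate([0.1, 0.2, 0.1, 0.3, 0.9], anomaly_threshold=0.8)
rejected = passes_validation_gate([0.9, 0.9, 0.1, 0.2, 0.1], anomaly_threshold=0.8)
```

A validation set drawn from the human baseline should contain few anomalies, so a high flagged rate suggests the model is miscalibrated rather than that the baseline is full of bots.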
3ae8c7d9c9 feat(bot-detector): upgrade to state-of-the-art detection pipeline
- Fix UnboundLocalError on global _consecutive_failures/_service_healthy
- Add SQL identifier validation for DB names at startup
- Replace Z-score drift detection with KS test (scipy.stats.ks_2samp)
- Replace DBSCAN with HDBSCAN (adaptive clustering, no epsilon needed)
- Replace blanket NaN→0 imputation with per-feature median/sentinel strategy
- Add 80/20 temporal train/validation split with offline metrics logging
- Integrate thesis §5 features from view_thesis_features_1h:
  path_transition_entropy, cadence_cv, burst/pause ratios,
  host_diversity, host_sweep_speed, host_coverage_uniformity,
  ja4_drift_ratio (Complet model only)
- Add SOC feedback loop: read classifications from audit_logs,
  reclassify FP IPs as human, exclude TP IPs from baseline
- Update dependencies: clickhouse-connect 0.8.12, scikit-learn 1.6.1,
  pandas 2.2.3, shap 0.47.2, add scipy>=1.14, hdbscan>=0.8.38

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:09:18 +02:00
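
Commit 3ae8c7d9c9 switches drift detection to the two-sample Kolmogorov-Smirnov test via scipy.stats.ks_2samp, which returns both a statistic and a p-value. To show what the statistic measures, here is a dependency-free sketch that computes only the KS distance (the largest gap between the two empirical CDFs); `ks_statistic` is a hypothetical helper and is not a substitute for the scipy call.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    distance = 0.0
    # The ECDFs only change at sample points, so checking those suffices.
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        distance = max(distance, abs(f_ref - f_cur))
    return distance


no_drift = ks_statistic([1, 2, 3], [1, 2, 3])
partial_drift = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
full_drift = ks_statistic([1, 2, 3], [10, 11, 12])
```

Unlike a Z-score on the mean, this distance reacts to any change in the shape of a feature's distribution, which is the usual motivation for preferring the KS test for drift detection.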
d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00