feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized

Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/React analytics UI (Python/JS)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (including api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-07 16:42:59 +02:00
commit d469e39da7
278 changed files with 1621301 additions and 0 deletions

35
.gitignore vendored Normal file

@@ -0,0 +1,35 @@
# Secrets — never commit
.env
.env.*
!.env.example
*.env
# Python
__pycache__/
*.pyc
*.pyo
.pytest_cache/
*.egg-info/
dist/
build/
.coverage
coverage.xml
htmlcov/
# Go
*.test
coverage.out
coverage.html
# Node
node_modules/
frontend/dist/
# Models and logs (runtime artifacts)
bot_detector_models/
bot_detector_logs/
# IDE
.vscode/
.idea/
*.swp

129
Makefile Normal file

@@ -0,0 +1,129 @@
# =============================================================================
# ja4-platform — Monorepo Makefile
# All targets use new service names:
# sentinel, correlator, bot-detector, dashboard, mod-reqin-log
# =============================================================================
.PHONY: build-all test-all rpm-all dist \
build-sentinel test-sentinel rpm-sentinel \
test-mod-reqin-log rpm-mod-reqin-log \
build-correlator test-correlator rpm-correlator \
build-bot-detector test-bot-detector \
build-dashboard test-dashboard \
test-ja4common-python
# --- Root -------------------------------------------------------------------
build-all: build-sentinel build-correlator build-bot-detector build-dashboard
	@echo "All services built."
test-all: test-sentinel test-correlator test-bot-detector test-dashboard test-ja4common-python
	@echo "All tests completed."
rpm-all: rpm-sentinel rpm-correlator rpm-mod-reqin-log
	@echo "All RPMs built."
dist: rpm-all
	@echo "Distribution packages ready in services/*/dist/"
# --- sentinel (was ja4sentinel) ---------------------------------------------
build-sentinel:
	docker build \
		--build-arg VERSION=$$(git -C services/sentinel describe --tags --always 2>/dev/null || echo dev) \
		--build-arg GIT_COMMIT=$$(git rev-parse --short HEAD 2>/dev/null || echo unknown) \
		--build-arg BUILD_TIME=$$(date -u +%Y-%m-%dT%H:%M:%SZ) \
		-f services/sentinel/Dockerfile \
		-t ja4-platform/sentinel:latest \
		.
test-sentinel:
# Tests run inside Docker — no native Go required on the host
	docker build -f services/sentinel/Dockerfile.dev -t ja4-platform/sentinel-tests:latest .
	docker run --rm --cap-add=NET_RAW --cap-add=NET_ADMIN ja4-platform/sentinel-tests:latest
rpm-sentinel:
# Method: Dockerfile.package → Go builder → rpm-builder (rpmbuild ×3) → alpine output stage
	docker build \
		-f services/sentinel/Dockerfile.package \
		--target output \
		--output type=local,dest=services/sentinel/dist \
		--build-arg VERSION=$(shell git -C services/sentinel describe --tags --always 2>/dev/null || echo dev) \
		.
	@echo "📦 sentinel RPMs in services/sentinel/dist/"
# --- mod-reqin-log (was mod_reqin_log) --------------------------------------
test-mod-reqin-log:
	docker build -f services/mod-reqin-log/Dockerfile.tests -t ja4-platform/mod-reqin-log-tests:latest .
	docker run --rm ja4-platform/mod-reqin-log-tests:latest
rpm-mod-reqin-log:
# Method: Dockerfile.package → C builder (×3 distros) → rpm-builder (rpmbuild ×3) → alpine output stage
	docker build \
		-f services/mod-reqin-log/Dockerfile.package \
		--target output \
		--output type=local,dest=services/mod-reqin-log/dist \
		.
	@echo "📦 mod-reqin-log RPMs in services/mod-reqin-log/dist/"
# --- correlator (was logcorrelator) -----------------------------------------
build-correlator:
	docker build \
		-f services/correlator/Dockerfile \
		-t ja4-platform/correlator:latest \
		.
test-correlator:
# Tests run inside the Dockerfile builder stage (80% coverage gate enforced)
	docker build --target builder -f services/correlator/Dockerfile -t ja4-platform/correlator-tests:latest .
rpm-correlator:
# Method: Dockerfile.package → Go builder → rpm-builder (rpmbuild ×3) → alpine output stage
	docker build \
		-f services/correlator/Dockerfile.package \
		--target output \
		--output type=local,dest=services/correlator/dist \
		--build-arg VERSION=$(shell git -C services/correlator describe --tags --always 2>/dev/null || echo dev) \
		.
	@echo "📦 correlator RPMs in services/correlator/dist/"
# --- bot-detector (was bot_detector) ----------------------------------------
build-bot-detector:
	docker build \
		-f services/bot-detector/bot_detector/Dockerfile \
		-t ja4-platform/bot-detector:latest \
		.
test-bot-detector:
	docker build \
		-f services/bot-detector/bot_detector/Dockerfile.tests \
		-t ja4-platform/bot-detector-tests:latest \
		.
	docker run --rm ja4-platform/bot-detector-tests:latest
# --- dashboard --------------------------------------------------------------
build-dashboard:
	docker build \
		-f services/dashboard/Dockerfile \
		-t ja4-platform/dashboard:latest \
		.
test-dashboard:
	docker build \
		-f services/dashboard/Dockerfile.tests \
		-t ja4-platform/dashboard-tests:latest \
		.
	docker run --rm ja4-platform/dashboard-tests:latest
# --- shared/python/ja4_common -----------------------------------------------
test-ja4common-python:
	docker build \
		-f shared/python/ja4_common/Dockerfile.tests \
		-t ja4-platform/ja4common-python-tests:latest \
		shared/python/ja4_common/
	docker run --rm ja4-platform/ja4common-python-tests:latest

123
README.md Normal file

@@ -0,0 +1,123 @@
# ja4-platform
**ja4-platform** is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, detects anomalous behavior using machine learning (Isolation Forest), and presents results through a SOC analyst dashboard — all backed by ClickHouse as the central data store.
## Pipeline Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Linux Server (Apache) │
│ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ mod-reqin-log │───────▶│ UNIX socket (HTTP) │──┐ │
│ │ (Apache module) │ JSON │ /var/run/logcorr/ │ │ │
│ │ C · httpd DSO │ │ http.socket │ │ │
│ └─────────────────┘ └─────────────────────┘ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────────┐ ┌──────────────────┐ │
│ │ sentinel │───────▶│ UNIX socket (TLS) │─▶│ correlator │ │
│ │ (TLS capture) │ JSON │ /var/run/logcorr/ │ │ (event join) │ │
│ │ Go · libpcap │ │ network.socket │ │ Go · hex. arch │ │
│ └─────────────────┘ └─────────────────────┘ └────────┬─────────┘ │
│ │ │
└────────────────────────────────────────────────────────────────┼────────────┘
│ INSERT
┌──────────────────┐
│ ClickHouse │
│ mabase_prod │
│ (all tables) │
└────────┬─────────┘
│ SELECT
┌────────────────────┼────────────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ bot-detector │ │ dashboard │
│ (ML anomaly det) │ │ (SOC web UI) │
│ Python · sklearn │ │ FastAPI + React │
└──────────────────┘ └──────────────────┘
```
## Services
| Service | Language | Purpose | Interface |
|---------|----------|---------|-----------|
| [sentinel](docs/services/sentinel.md) | Go | Live TLS packet capture, JA4/JA3 fingerprint generation | UNIX socket (`network.socket`) |
| [mod-reqin-log](docs/services/mod-reqin-log.md) | C | Apache HTTPD module, HTTP request JSON logging | UNIX socket (`http.socket`) |
| [correlator](docs/services/correlator.md) | Go | Joins HTTP + TLS events by `src_ip:src_port` + time window | ClickHouse INSERT, file, stdout |
| [bot-detector](docs/services/bot-detector.md) | Python | Isolation Forest ML anomaly detection on aggregated traffic | ClickHouse read/write, HTTP `:8080` |
| [dashboard](docs/services/dashboard.md) | Python/JS | SOC analyst web dashboard (FastAPI + React) | HTTP `:8000` |
## Shared Libraries
| Library | Language | Description |
|---------|----------|-------------|
| [go/ja4common](docs/shared/go-ja4common.md) | Go | Logger, config loader, shutdown handler, IP filter |
| [python/ja4_common](docs/shared/python-ja4common.md) | Python | ClickHouse client singleton, settings |
## Quickstart
### Prerequisites
- Docker (with BuildKit) and Docker Compose
- `make`
- No native Go, Python, or C toolchains required — all builds run inside Docker
### Build All Services
```bash
make build-all
```
### Run All Tests
```bash
make test-all
```
### Build RPM Packages
```bash
make rpm-all
# RPMs written to services/<service>/dist/
```
## Documentation
| Document | Description |
|----------|-------------|
| [Architecture](docs/architecture.md) | System architecture, data flow, component interactions |
| [Development](docs/development.md) | Build, test, package, and extend the platform |
| [Database Schema](docs/database/schema.md) | Every ClickHouse table, view, dictionary, and materialized view |
| [Database Migrations](docs/database/migrations.md) | Migration order, application, verification, and rollback |
### Service Documentation
- [Sentinel](docs/services/sentinel.md) — TLS capture daemon
- [mod-reqin-log](docs/services/mod-reqin-log.md) — Apache HTTP logging module
- [Correlator](docs/services/correlator.md) — HTTP/TLS event correlation engine
- [Bot Detector](docs/services/bot-detector.md) — ML anomaly detection
- [Dashboard](docs/services/dashboard.md) — SOC web dashboard and API
### Shared Library Documentation
- [go-ja4common](docs/shared/go-ja4common.md) — Go shared library
- [python-ja4common](docs/shared/python-ja4common.md) — Python shared library
## Go Workspace
The repository uses a Go workspace (`go.work`) to link the Go modules:
```
go 1.24.6
use (
./services/sentinel
./services/correlator
./shared/go/ja4common
)
```
## License
See individual service directories for license information.

162
docs/architecture.md Normal file

@@ -0,0 +1,162 @@
# Architecture
The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services.
## System Architecture
```
┌───────────────────────────────────────────────────────────────────────────────────┐
│ Target Linux Server │
│ │
│ ┌─────────────┐ HTTP req ┌───────────────────────┐ UNIX socket (DGRAM) │
│ │ Client │────────────▶│ Apache HTTPD │──────────────┐ │
│ │ (browser / │ │ + mod-reqin-log │ │ │
│ │ bot) │ └───────────────────────┘ │ │
│ │ │ ▼ │
│ │ │ TLS CH ┌───────────────────────┐ ┌─────────────────────┐ │
│ │ │────────────▶│ sentinel │ │ correlator │ │
│ │ │ (pcap) │ (packet capture) │──▶│ (event join) │ │
│ └─────────────┘ └───────────────────────┘ └────────┬────────────┘ │
│ │ │
└────────────────────────────────────────────────────────────────────┼──────────────┘
│ INSERT JSON
┌─────────────────────┐
│ ClickHouse │
│ mabase_prod │
│ │
│ http_logs_raw │
│ ──(MV)──▶ http_logs│
│ ──(MV)──▶ agg_* │
│ view_ai_features │
│ ml_detected_anom. │
│ ml_all_scores │
└──────┬──────┬───────┘
│ │
┌──────────────────┘ └──────────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ bot-detector │ │ dashboard │
│ (Python) │ │ (FastAPI + React) │
│ │ │ │
│ Reads: │ │ Reads: │
│ view_ai_features │ │ ml_detected_anom. │
│ view_ip_recurrence │ │ ml_all_scores │
│ Writes: │ │ http_logs │
│ ml_detected_anom. │ │ agg_* tables │
│ ml_all_scores │ │ audit_logs │
└──────────────────────┘ └──────────────────────┘
```
## Data Flow
### 1. Capture Phase
1. **mod-reqin-log** (Apache C module) hooks into `post_read_request`. On each HTTP request, it serializes method, path, headers, client IP/port into JSON and sends it via UNIX datagram socket to `/var/run/logcorrelator/http.socket`.
2. **sentinel** (Go daemon) uses libpcap to capture live TLS ClientHello packets on configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON via UNIX datagram socket to `/var/run/logcorrelator/network.socket`.
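Both producers use the same transport: one JSON document per UNIX datagram. The following Python sketch illustrates that pattern only. It is not the C or Go code of the services, the event fields are hypothetical, and a temporary socket path stands in for `/var/run/logcorrelator/http.socket`.

```python
import json
import os
import socket
import tempfile


def send_event(sock_path: str, event: dict) -> None:
    """Send one JSON event as a single datagram to a UNIX socket."""
    payload = json.dumps(event).encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(payload, sock_path)


def demo() -> dict:
    """Bind a throwaway receiver, send one event, and return what arrived."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "http.socket")  # stand-in path (assumption)
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as server:
            server.bind(path)
            # Hypothetical event shape; the real JSON schema is not shown here.
            send_event(path, {"src_ip": "192.0.2.10", "src_port": 51234,
                              "method": "GET", "path": "/"})
            data, _ = server.recvfrom(65535)
    return json.loads(data)
```

Datagram sockets preserve message boundaries, so the receiver gets each JSON document whole and needs no framing logic.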
### 2. Correlation Phase
3. **correlator** (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching `src_ip:src_port` within a configurable time window (default: 10 s). HTTP Keep-Alive connections are supported via `one_to_many` matching mode where a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Correlated events merge HTTP fields (method, path, headers) with TLS fields (JA4, JA3, IP/TCP metadata) into a single `CorrelatedLog` JSON object, which is inserted into `http_logs_raw`.
### 3. Enrichment Phase (ClickHouse)
4. **mv_http_logs** materialized view automatically transforms `http_logs_raw` JSON into the structured `http_logs` table, enriching each row with:
- ASN/geo data via `dict_iplocate_asn`
- Anubis bot identification via `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`
5. **mv_agg_host_ip_ja4_1h** and **mv_agg_header_fingerprint_1h** aggregate `http_logs` into 1-hour behavioral windows.
6. **view_ai_features_1h** joins the two aggregation tables and computes 50+ ML features per `(src_ip, ja4, host)` tuple.
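The 1-hour windows in step 5 can be pictured as truncating each timestamp to the start of its hour and grouping on `(window_start, src_ip, ja4, host)`. The Python sketch below models only a `hits` counter under that assumption. The real aggregation runs inside ClickHouse materialized views and computes many more columns.

```python
from collections import defaultdict
from datetime import datetime


def window_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_hits(rows):
    """Count requests per (window_start, src_ip, ja4, host).

    `rows` is an iterable of (timestamp, src_ip, ja4, host) tuples;
    this input shape is an assumption made for the illustration.
    """
    hits = defaultdict(int)
    for ts, src_ip, ja4, host in rows:
        hits[(window_start(ts), src_ip, ja4, host)] += 1
    return dict(hits)
```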
### 4. Detection Phase
7. **bot-detector** (Python) runs on a 5-minute cycle:
- Reads `view_ai_features_1h` for the last 24 hours
- Separates known bots (via reputation dictionaries) from unknown traffic
- Trains/loads Isolation Forest models on human-baseline traffic
- Scores unknown traffic and writes anomalies to `ml_detected_anomalies` and all scores to `ml_all_scores`
### 5. Visualization Phase
8. **dashboard** (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts.
## Component Interaction Matrix
| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:|
| **mod-reqin-log** | — | — | UNIX socket (DGRAM) | — | — | — |
| **sentinel** | — | — | UNIX socket (DGRAM) | — | — | — |
| **correlator** | — | — | — | Native TCP :9000 (INSERT) | — | — |
| **ClickHouse** | — | — | — | — | — | — |
| **bot-detector** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
| **dashboard** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
## ClickHouse Table Ownership
| Table/View | Written By | Read By |
|------------|-----------|---------|
| `http_logs_raw` | correlator | mv_http_logs (MV) |
| `http_logs` | mv_http_logs (MV) | mv_agg_*, dashboard |
| `agg_host_ip_ja4_1h` | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h |
| `agg_header_fingerprint_1h` | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| `view_ai_features_1h` | — (view) | bot-detector |
| `view_ip_recurrence` | — (view) | bot-detector |
| `ml_detected_anomalies` | bot-detector | dashboard |
| `ml_all_scores` | bot-detector | dashboard |
| `audit_logs` | dashboard | dashboard |
## Correlation Algorithm
The correlator joins HTTP events (source A) with TLS/network events (source B) according to the following rules:
1. **Key**: `src_ip + src_port` — the client's source IP and ephemeral port uniquely identify a TCP connection.
2. **Time window**: Events must arrive within the configured window (default 10 seconds).
3. **Matching mode**:
- `one_to_one`: Each B event matches at most one A event (consumed after match).
- `one_to_many` (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
4. **Orphan handling**: Unmatched A events are emitted after a configurable delay (default 500 ms) with `correlated=false` and `orphan_side=A`.
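The rules above can be sketched in a few lines of Python. This is a simplified model of the `one_to_many` mode, not the correlator's Go implementation. The class and method names are hypothetical, and the sketch emits an orphan immediately instead of waiting for the configurable delay. How the real service combines the time window with the TTL is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class TlsEvent:
    ja4: str
    expires_at: float  # absolute time after which the B event is stale


class Correlator:
    """Toy one_to_many correlator keyed by (src_ip, src_port)."""

    def __init__(self, ttl: float = 120.0):
        self.ttl = ttl
        self.pending_b = {}  # (src_ip, src_port) -> TlsEvent

    def on_tls(self, src_ip: str, src_port: int, ja4: str, now: float) -> None:
        """Store a TLS (source B) event for later HTTP matches."""
        self.pending_b[(src_ip, src_port)] = TlsEvent(ja4, now + self.ttl)

    def on_http(self, src_ip: str, src_port: int, path: str, now: float) -> dict:
        """Match an HTTP (source A) event against a stored B event."""
        key = (src_ip, src_port)
        b = self.pending_b.get(key)
        if b is None or now > b.expires_at:
            self.pending_b.pop(key, None)
            # Simplification: no orphan delay, the A event is emitted at once.
            return {"path": path, "correlated": False, "orphan_side": "A"}
        b.expires_at = now + self.ttl  # the TTL resets on each match
        return {"path": path, "ja4": b.ja4, "correlated": True}
```

Because the B event stays in the buffer after a match, several requests on one Keep-Alive connection all inherit the same JA4 fingerprint.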
## JA4/JA3 Fingerprint Format
### JA4
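JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure:
```
t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}{ALPN}_{CIPHER_HASH}_{EXT_HASH}
```
Example: `t13d1516h2_8daaf6152771_b0da82dd1658`
- Prefix `t` = TLS over TCP, followed by the version (`13` = TLS 1.3)
- `d` = SNI present, `i` = SNI absent
- Two-digit cipher suite count (`15`) and extension count (`16`)
- First ALPN value, abbreviated to its first and last characters (`h2`)
- Truncated SHA-256 hashes (12 hex characters each) of the sorted cipher suites and of the sorted extensions (the JA4 specification also folds the signature algorithms into the second hash)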
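To make the layout concrete, the following Python sketch splits a JA4 string such as the example above into its fields. It is an illustration, not sentinel's code. The regular expression and the `parse_ja4` name are assumptions, and the two characters before the first underscore are treated as the ALPN marker (`h2` in the example).

```python
import re

# One lowercase protocol letter, 2-char version, SNI flag, two 2-digit counts,
# 2-char ALPN marker, then two 12-hex-character truncated hashes.
JA4_RE = re.compile(
    r"^(?P<proto>[a-z])(?P<version>[0-9a-z]{2})(?P<sni>[di])"
    r"(?P<ciphers>\d{2})(?P<extensions>\d{2})(?P<alpn>[0-9a-z]{2})"
    r"_(?P<cipher_hash>[0-9a-f]{12})_(?P<ext_hash>[0-9a-f]{12})$"
)


def parse_ja4(fingerprint: str) -> dict:
    """Split a JA4 fingerprint into named parts; raise ValueError if malformed."""
    match = JA4_RE.match(fingerprint)
    if match is None:
        raise ValueError(f"not a JA4 fingerprint: {fingerprint!r}")
    parts = match.groupdict()
    parts["ciphers"] = int(parts["ciphers"])
    parts["extensions"] = int(parts["extensions"])
    return parts
```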
### JA3
JA3 is the original TLS fingerprinting format:
```
{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}
```
The `ja3_hash` is the MD5 hash of the JA3 string.
Both fingerprints are generated by sentinel from the TLS ClientHello payload.
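As a rough illustration of how a JA3 string and its hash relate, the Python sketch below joins the five fields with commas, joins the values inside each field with hyphens (as in the public JA3 description), and takes the MD5 digest. The input values are made up for the example and the function names are hypothetical. Sentinel's Go implementation is not shown here.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the JA3 string: comma-separated fields, hyphen-separated values."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in group) for group in groups]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    """Return the 32-character hex MD5 digest of a JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Made-up ClientHello values, for illustration only.
example = ja3_string(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
example_hash = ja3_hash(example)
```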
## Technology Stack
| Component | Technology |
|-----------|-----------|
| Packet capture | Go + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) |
| Event correlation | Go (hexagonal architecture) |
| ML detection | Python 3.11 + scikit-learn |
| Dashboard backend | FastAPI (Python) |
| Dashboard frontend | React + Vite |
| Data store | ClickHouse |
| Deployment | systemd, Docker, RPM |
| IPC | UNIX datagram sockets |

256
docs/database/migrations.md Normal file

@@ -0,0 +1,256 @@
# Database Migrations
The ClickHouse schema for ja4-platform is managed through numbered SQL migration files in `shared/clickhouse/`. Migrations are idempotent (using `IF NOT EXISTS` / `IF EXISTS`) and must be applied in numeric order.
## Migration Order
| File | Purpose |
|------|---------|
| `00_database.sql` | Creates the `mabase_prod` database |
| `01_raw_tables.sql` | Creates `http_logs_raw` ingest table (MergeTree, 1-day TTL) |
| `02_dictionaries.sql` | Creates ASN geo dictionary (`dict_iplocate_asn`), bot IP/JA4 reference tables, `ref_bot_networks` |
| `03_anubis_tables.sql` | Creates Anubis crawler rule tables (`anubis_ua_rules`, `anubis_ip_rules`, `anubis_asn_rules`, `anubis_country_rules`) and their dictionaries (`dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`) |
| `04_mv_http_logs.sql` | Creates the canonical `http_logs` table and `mv_http_logs` materialized view with full Anubis enrichment |
| `05_aggregation_tables.sql` | Creates reputation dictionaries (`dict_bot_ip`, `dict_bot_ja4`, `dict_asn_reputation`), behavioral aggregation tables (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`), and their materialized views |
| `06_ml_tables.sql` | Creates ML output tables (`ml_detected_anomalies`, `ml_all_scores`) and `view_ip_recurrence` |
| `07_ai_features_view.sql` | Creates `view_ai_features_1h` — the 50+ feature view used by bot-detector |
| `08_users.sql` | Creates ClickHouse users (`data_writer`, `analyst`) and grants permissions |
| `09_audit_table.sql` | Creates `audit_logs` table for SOC dashboard audit trail |
## Prerequisites
### 1. ClickHouse Server
A running ClickHouse server (version 23.8+ recommended for `REGEXP_TREE` dictionary support).
### 2. CSV Data Files
Place the following files in `/var/lib/clickhouse/user_files/`:
| File | Source | Description |
|------|--------|-------------|
| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | IP-to-ASN mapping with country, org, domain |
| `bot_ip.csv` | Custom | Known bot IP prefixes (CIDR format) |
| `bot_ja4.csv` | Custom | Known bot JA4 fingerprints |
| `asn_reputation.csv` | Custom | ASN reputation labels (`human`, `bot`, `unknown`) |
### 3. Anubis Passwords
Migration `03_anubis_tables.sql` contains placeholder passwords (`CHANGE_ME`) for the Anubis dictionaries. Replace these with the actual ClickHouse admin password before applying:
```bash
sed -i "s/CHANGE_ME/your_actual_password/g" 03_anubis_tables.sql
```
## How to Apply
### Full Initial Setup
Apply all migrations in order:
```bash
cd shared/clickhouse/
clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
```
### With Authentication
```bash
clickhouse-client --user admin --password 'your_password' --multiquery < 00_database.sql
# ... repeat for each file
```
### One-Liner (All at Once)
```bash
cd shared/clickhouse/
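for f in 0*.sql; do
echo "Applying $f..."
# Stop at the first failure so later migrations never run out of order
clickhouse-client --multiquery < "$f" || { echo "Failed on $f" >&2; break; }
done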
```
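Because the migrations must run in numeric order, a small pre-flight check can confirm that the numbered files sort as expected and that no number is missing. The Python sketch below is a hypothetical helper, not part of the repository. It assumes the two-digit `NN_name.sql` naming used in the table above.

```python
import re

# Matches names such as "04_mv_http_logs.sql" and captures the number.
MIGRATION_RE = re.compile(r"^(\d{2})_.+\.sql$")


def ordered_migrations(names):
    """Return the migration file names sorted by numeric prefix."""
    numbered = []
    for name in names:
        match = MIGRATION_RE.match(name)
        if match:  # anything that is not NN_*.sql is ignored
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def missing_numbers(names):
    """Return prefixes absent between 0 and the highest prefix found."""
    present = {int(MIGRATION_RE.match(n).group(1)) for n in ordered_migrations(names)}
    if not present:
        return []
    return [n for n in range(max(present) + 1) if n not in present]
```

Passing the result of `os.listdir("shared/clickhouse")` to both functions would give the apply order and flag any gap before `clickhouse-client` is invoked.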
## How to Verify
After applying all migrations, run these queries to verify each migration was successful:
### 00 — Database
```sql
SHOW DATABASES LIKE 'mabase_prod';
-- Expected: mabase_prod
```
### 01 — Raw Tables
```sql
EXISTS mabase_prod.http_logs_raw;
-- Expected: 1
```
### 02 — Dictionaries
```sql
SELECT dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'country_code',
toIPv6(toIPv4('8.8.8.8')), 'MISSING');
-- Expected: US (if CSV loaded) or MISSING
```
### 03 — Anubis Tables
```sql
EXISTS mabase_prod.anubis_ua_rules;
EXISTS mabase_prod.anubis_ip_rules;
EXISTS mabase_prod.anubis_asn_rules;
EXISTS mabase_prod.anubis_country_rules;
-- Expected: 1 for each
```
### 04 — MV + http_logs
```sql
EXISTS mabase_prod.http_logs;
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name = 'mv_http_logs';
-- Expected: mv_http_logs
```
### 05 — Aggregation Tables
```sql
EXISTS mabase_prod.agg_host_ip_ja4_1h;
EXISTS mabase_prod.agg_header_fingerprint_1h;
SELECT name FROM system.dictionaries WHERE database = 'mabase_prod' AND name = 'dict_bot_ip';
-- Expected: dict_bot_ip
```
### 06 — ML Tables
```sql
EXISTS mabase_prod.ml_detected_anomalies;
EXISTS mabase_prod.ml_all_scores;
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name LIKE 'view_ip%';
-- Expected: view_ip_recurrence
```
### 07 — AI Features View
```sql
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name = 'view_ai_features_1h';
-- Expected: view_ai_features_1h
```
### 08 — Users
```sql
SHOW GRANTS FOR data_writer;
-- Expected: GRANT INSERT, SELECT ON mabase_prod.http_logs_raw TO data_writer
SHOW GRANTS FOR analyst;
-- Expected: GRANT SELECT ON multiple tables
```
### 09 — Audit Table
```sql
EXISTS mabase_prod.audit_logs;
-- Expected: 1
```
### Full Verification Query
```sql
SELECT
count() AS total_tables
FROM system.tables
WHERE database = 'mabase_prod'
AND name IN (
'http_logs_raw', 'http_logs', 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h',
'ml_detected_anomalies', 'ml_all_scores', 'ref_bot_networks',
'anubis_ua_rules', 'anubis_ip_rules', 'anubis_asn_rules', 'anubis_country_rules',
'audit_logs', 'bot_ip', 'bot_ja4'
);
-- Expected: 14
```
## Rollback Notes
### General Approach
ClickHouse does not support transactional DDL. To roll back a migration:
1. **Tables**: `DROP TABLE IF EXISTS mabase_prod.<table_name>`
2. **Materialized Views**: `DROP VIEW IF EXISTS mabase_prod.<mv_name>` (drop MV before its target table)
3. **Dictionaries**: `DROP DICTIONARY IF EXISTS mabase_prod.<dict_name>`
4. **Views**: `DROP VIEW IF EXISTS mabase_prod.<view_name>`
5. **Users**: `DROP USER IF EXISTS <username>`
### Rollback Order (Reverse of Apply)
```sql
-- 09: Audit
DROP TABLE IF EXISTS mabase_prod.audit_logs;
-- 08: Users
DROP USER IF EXISTS data_writer;
DROP USER IF EXISTS analyst;
-- 07: AI Features View
DROP VIEW IF EXISTS mabase_prod.view_ai_features_1h;
-- 06: ML Tables
DROP VIEW IF EXISTS mabase_prod.view_ip_recurrence;
DROP TABLE IF EXISTS mabase_prod.ml_all_scores;
DROP TABLE IF EXISTS mabase_prod.ml_detected_anomalies;
-- 05: Aggregation
DROP VIEW IF EXISTS mabase_prod.mv_agg_header_fingerprint_1h;
DROP VIEW IF EXISTS mabase_prod.mv_agg_host_ip_ja4_1h;
DROP TABLE IF EXISTS mabase_prod.agg_header_fingerprint_1h;
DROP TABLE IF EXISTS mabase_prod.agg_host_ip_ja4_1h;
DROP DICTIONARY IF EXISTS mabase_prod.dict_asn_reputation;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ja4;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ip;
-- 04: MV + http_logs
DROP VIEW IF EXISTS mabase_prod.mv_http_logs;
DROP TABLE IF EXISTS mabase_prod.http_logs;
-- 03: Anubis
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_country;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_asn;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ip;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ua;
DROP TABLE IF EXISTS mabase_prod.anubis_country_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_asn_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_ip_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_ua_rules;
-- 02: Dictionaries
DROP DICTIONARY IF EXISTS mabase_prod.dict_iplocate_asn;
DROP TABLE IF EXISTS mabase_prod.bot_ja4;
DROP TABLE IF EXISTS mabase_prod.bot_ip;
DROP TABLE IF EXISTS mabase_prod.ref_bot_networks;
-- 01: Raw Tables
DROP TABLE IF EXISTS mabase_prod.http_logs_raw;
-- 00: Database
DROP DATABASE IF EXISTS mabase_prod;
```
### Important Notes
- **Data loss**: Dropping tables destroys all data. Always back up before rollback.
- **MV dependency**: Materialized views must be dropped before their target tables.
- **Dictionary dependency**: Views/MVs using dictionaries will fail if dictionaries are dropped while they still reference them.
- **Idempotent re-apply**: After rollback, migrations can be safely re-applied since they use `IF NOT EXISTS`.
- **`04_mv_http_logs.sql`** is the canonical version of the MV, superseding any base version in `services/correlator/sql/init.sql`.

334
docs/database/schema.md Normal file

@@ -0,0 +1,334 @@
# Database Schema
The ja4-platform uses ClickHouse as its central data store with database `mabase_prod`. This document describes every table, materialized view, dictionary, and view in the schema.
## Tables
### http_logs_raw
Raw JSON ingest table — direct target for correlator INSERTs.
| Column | Type | Description |
|--------|------|-------------|
| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string |
| `ingest_time` | DateTime | Insertion timestamp (default: `now()`) |
- **Engine**: MergeTree
- **Partition by**: `toDate(ingest_time)`
- **Order by**: `ingest_time`
- **TTL**: `ingest_time + INTERVAL 1 DAY`
---
### http_logs
Parsed and enriched HTTP log table — populated by `mv_http_logs` materialized view.
| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `time` | DateTime | No | Request timestamp |
| `log_date` | Date | No | Date partition key (default: `toDate(time)`) |
| `src_ip` | IPv4 | No | Client source IP |
| `src_port` | UInt16 | No | Client source port |
| `dst_ip` | IPv4 | No | Server destination IP |
| `dst_port` | UInt16 | No | Server destination port |
| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | No | Source country code |
| `src_as_name` | LowCardinality(String) | No | AS name |
| `src_org` | LowCardinality(String) | No | AS organization |
| `src_domain` | LowCardinality(String) | No | AS domain |
| `method` | LowCardinality(String) | No | HTTP method |
| `scheme` | LowCardinality(String) | No | URL scheme (http/https) |
| `host` | LowCardinality(String) | No | HTTP Host header |
| `path` | String (ZSTD(3)) | No | Request path |
| `query` | String (ZSTD(3)) | No | Query string |
| `http_version` | LowCardinality(String) | No | HTTP version |
| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated |
| `keepalives` | UInt16 | No | Keep-alive request sequence |
| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) |
| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) |
| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier |
| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag |
| `ip_meta_id` | UInt16 | No | IP identification |
| `ip_meta_total_length` | UInt16 | No | IP total length |
| `ip_meta_ttl` | UInt8 | No | IP TTL |
| `tcp_meta_options` | LowCardinality(String) | No | TCP options list |
| `tcp_meta_window_size` | UInt32 | No | TCP window size |
| `tcp_meta_mss` | UInt16 | No | TCP MSS |
| `tcp_meta_window_scale` | UInt8 | No | TCP window scale |
| `syn_to_clienthello_ms` | Int32 | No | SYN-to-ClientHello timing (ms) |
| `tls_version` | LowCardinality(String) | No | TLS version |
| `tls_sni` | LowCardinality(String) | No | TLS SNI |
| `tls_alpn` | LowCardinality(String) | No | TLS ALPN |
| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint |
| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash |
| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint |
| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names |
| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header |
| `header_accept` | String (ZSTD(3)) | No | Accept header |
| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header |
| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header |
| `header_content_type` | String (ZSTD(3)) | No | Content-Type header |
| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header |
| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header |
| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header |
| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header |
| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') |
- **Engine**: MergeTree
- **Partition by**: `log_date`
- **Order by**: `(time, src_ip, dst_ip, ja4)`
- **TTL**: `log_date + INTERVAL 7 DAY`
---
### agg_host_ip_ja4_1h
Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses `AggregatingMergeTree` with `SimpleAggregateFunction` and `AggregateFunction` columns for incremental aggregation.
Key columns include: `window_start`, `src_ip`, `ja4`, `host`, `src_asn`, `hits`, `count_post`, `uniq_paths`, `uniq_query_params`, `tcp_jitter_variance`, `unique_src_ports`, `unique_conn_id`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `uniq_ua`, `url_depth_variance`, `count_anomalous_payload`, `uniq_ja3`, `avg_syn_ms`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `ip_df_var`, `avg_ttl`, `ttl_var`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`.
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
---
### agg_header_fingerprint_1h
Header-level behavioral fingerprint aggregation per `(src_ip)` per hour.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`
---
### ml_detected_anomalies
Anomaly detections above the threat threshold.
Key columns: `detected_at`, `src_ip` (IPv6), `ja4`, `host`, `bot_name`, `anomaly_score` (Float32), `raw_anomaly_score` (Float32), `threat_level`, `model_name`, `recurrence` (UInt32), `campaign_id` (Int32), `reason`, plus all ML feature columns and Anubis enrichment (`anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`).
- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(src_ip)`
- **TTL**: `detected_at + INTERVAL 30 DAY`
---
### ml_all_scores
All ML classifications (no threshold filter) for observability.
Key columns: `detected_at`, `window_start`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `correlated`, `campaign_id`, plus ASN and Anubis enrichment.
- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(window_start, src_ip, ja4, host, model_name)`
- **TTL**: `window_start + INTERVAL 3 DAY`
---
### ref_bot_networks
Bot network CIDR reference table.
| Column | Type | Description |
|--------|------|-------------|
| `network` | IPv6CIDR | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |
- **Engine**: ReplacingMergeTree(last_update)
- **Order by**: `(network, bot_name)`
---
### bot_ip / bot_ja4
CSV-backed flat tables for quick bot lookups.
- `bot_ip`: single column `ip` (String) — Engine: File(CSV, 'bot_ip.csv')
- `bot_ja4`: single column `ja4` (String) — Engine: File(CSV, 'bot_ja4.csv')
---
### Anubis Rule Tables
| Table | Key | Columns | Engine |
|-------|-----|---------|--------|
| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree |
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree |
---
### audit_logs
SOC audit trail for dashboard activity.
| Column | Type | Default | Description |
|--------|------|---------|-------------|
| `timestamp` | DateTime | `now()` | Event time |
| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name |
| `action` | LowCardinality(String) | — | Action performed |
| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) |
| `entity_id` | String | `''` | Entity identifier |
| `entity_count` | UInt32 | `0` | Entity count |
| `details` | String (ZSTD(3)) | `''` | JSON details |
| `client_ip` | String | `''` | Analyst client IP |
- **Engine**: MergeTree
- **Partition by**: `toDate(timestamp)`
- **Order by**: `(timestamp, user_name, action)`
- **TTL**: `toDate(timestamp) + INTERVAL 90 DAY`
---
## Materialized Views
### mv_http_logs
- **Source**: `http_logs_raw`
- **Target**: `http_logs`
- **Transformation**: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.
### mv_agg_host_ip_ja4_1h
- **Source**: `http_logs`
- **Target**: `agg_host_ip_ja4_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.
### mv_agg_header_fingerprint_1h
- **Source**: `http_logs`
- **Target**: `agg_header_fingerprint_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch.
---
## Dictionaries
### dict_iplocate_asn
- **Source**: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv`
- **Key**: `network` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `asn` (UInt32), `country_code`, `name`, `org`, `domain`
- **Lifetime**: 3600–7200 seconds
### dict_bot_ip
- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds
### dict_bot_ja4
- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv`
- **Key**: `ja4` (String)
- **Layout**: `COMPLEX_KEY_HASHED`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds
### dict_asn_reputation
- **Source**: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv`
- **Key**: `src_asn` (UInt64)
- **Layout**: `HASHED`
- **Attributes**: `label` (String)
- **Lifetime**: 300 seconds
### dict_anubis_ua
- **Source**: ClickHouse table `anubis_ua_rules`
- **Key**: `regexp` (String)
- **Layout**: `REGEXP_TREE`
- **Attributes**: `bot_name`, `action`, `has_ip`, `rule_id`, `category`
- **Lifetime**: 300–600 seconds
### dict_anubis_ip
- **Source**: ClickHouse table `anubis_ip_rules`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name`, `action`, `rule_id`, `has_ua`, `category`
- **Lifetime**: 300–600 seconds
### dict_anubis_asn
- **Source**: ClickHouse table `anubis_asn_rules`
- **Key**: `asn` (UInt32)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300–600 seconds
### dict_anubis_country
- **Source**: ClickHouse table `anubis_country_rules`
- **Key**: `country_code` (String)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300–600 seconds
---
## Views
### view_ai_features_1h
Computes 50+ ML features per `(src_ip, ja4, host)` from the last 24 hours by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`. Includes:
- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`
- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`
- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`
- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`
- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`
- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category`
- Derived: `temporal_entropy`, `ja3_diversity_ratio`
### view_ip_recurrence
Aggregates recurrence data from `ml_detected_anomalies`:
```sql
SELECT src_ip, count() AS recurrence,
min(detected_at) AS first_seen, max(detected_at) AS last_seen,
min(anomaly_score) AS worst_score,
argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies GROUP BY src_ip;
```
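Lower anomaly scores are more anomalous, which is why the view takes `min(anomaly_score)` as the "worst" score and uses `argMin` to pick the threat level of that same row. A minimal Python sketch of the same aggregation, assuming rows are plain dictionaries with the four columns used by the view (the function name `ip_recurrence` is illustrative, not part of the codebase):

```python
from datetime import datetime


def ip_recurrence(rows):
    """Aggregate detections per src_ip, mirroring view_ip_recurrence."""
    result = {}
    for row in rows:
        agg = result.get(row["src_ip"])
        if agg is None:
            # First detection seen for this IP.
            result[row["src_ip"]] = {
                "recurrence": 1,
                "first_seen": row["detected_at"],
                "last_seen": row["detected_at"],
                "worst_score": row["anomaly_score"],
                "worst_threat_level": row["threat_level"],
            }
            continue
        agg["recurrence"] += 1
        agg["first_seen"] = min(agg["first_seen"], row["detected_at"])
        agg["last_seen"] = max(agg["last_seen"], row["detected_at"])
        # argMin semantics: keep the threat level of the lowest-scoring row.
        if row["anomaly_score"] < agg["worst_score"]:
            agg["worst_score"] = row["anomaly_score"]
            agg["worst_threat_level"] = row["threat_level"]
    return result
```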
---
## User Accounts
| User | Permissions | Purpose |
|------|------------|---------|
| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service |
| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts |
> **Security note**: Default passwords are `ChangeMe` — replace with strong passwords before production use. Store credentials in a secrets manager.

docs/development.md Normal file

@ -0,0 +1,246 @@
# Development Guide
This guide covers building, testing, packaging, and extending the ja4-platform monorepo. All build and test operations run inside Docker — no native Go, Python, or C toolchains are required on the host.
## Prerequisites
| Requirement | Minimum Version | Notes |
|-------------|----------------|-------|
| Docker | 20.10+ | BuildKit enabled (`DOCKER_BUILDKIT=1`) |
| Docker Compose | 2.x | For bot-detector and dashboard |
| make | 3.81+ | GNU Make |
| git | 2.x | For version tagging |
No Go, Python, or C compilers are needed on the host machine.
## Building All Services
```bash
make build-all
```
This builds Docker images for:
- `ja4-platform/sentinel:latest`
- `ja4-platform/correlator:latest`
- `ja4-platform/bot-detector:latest`
- `ja4-platform/dashboard:latest`
mod-reqin-log is an Apache module and is only built as part of the RPM packaging process.
### Building Individual Services
```bash
make build-sentinel # Go binary in Docker
make build-correlator # Go binary in Docker
make build-bot-detector # Python image
make build-dashboard # FastAPI + React image
```
## Running Tests
```bash
make test-all
```
### Per-Service Testing
| Service | Command | Details |
|---------|---------|---------|
| sentinel | `make test-sentinel` | Go tests with `-race` flag, requires `NET_RAW`/`NET_ADMIN` caps |
| correlator | `make test-correlator` | Go tests with 80% coverage gate enforced |
| mod-reqin-log | `make test-mod-reqin-log` | C unit tests (JSON serialization, config parsing, header handling) |
| bot-detector | `make test-bot-detector` | Python pytest suite |
| dashboard | `make test-dashboard` | Python pytest for FastAPI routes |
| ja4_common (Python) | `make test-ja4common-python` | Shared Python library tests |
## Building RPM Packages
```bash
make rpm-all
```
Builds RPMs for sentinel, correlator, and mod-reqin-log targeting Rocky Linux 8/9/10:
```bash
make rpm-sentinel # → services/sentinel/dist/rpm/
make rpm-correlator # → services/correlator/dist/rpm/
make rpm-mod-reqin-log # → services/mod-reqin-log/dist/rpm/
```
Each RPM build uses a multi-stage Docker pipeline:
1. Builder stage compiles the binary (Go) or shared object (C)
2. RPM builder stage runs `rpmbuild` for each target distro (el8, el9, el10)
3. Output stage copies RPMs to the host via `--output type=local`
### Distribution Packages
```bash
make dist # Alias for rpm-all
# RPMs in services/<service>/dist/rpm/el{8,9,10}/
```
## Local Development Workflow
### Go Services (sentinel, correlator)
The `go.work` workspace links Go modules:
```
go 1.24.6

use (
	./services/sentinel
	./services/correlator
	./shared/go/ja4common
)
```
If you have Go 1.24.6+ installed locally, you can develop without Docker:
```bash
# Run sentinel tests locally
cd services/sentinel && go test ./... -race -v
# Run correlator tests locally
cd services/correlator && go test ./... -race -cover -v
# Build sentinel binary locally (requires libpcap-dev)
cd services/sentinel && go build -o ja4sentinel ./cmd/ja4sentinel/
```
### Python Services (bot-detector, dashboard)
```bash
# Install shared library in development mode
cd shared/python/ja4_common && pip install -e .
# Run bot-detector locally
cd services/bot-detector && pip install -r bot_detector/requirements.txt
python -m bot_detector.bot_detector
# Run dashboard locally
cd services/dashboard && pip install -r backend/requirements.txt
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
```
### C Module (mod-reqin-log)
Requires `apxs` (Apache extension tool) and development headers:
```bash
cd services/mod-reqin-log
make build # Compiles mod_reqin_log.so
make test # Runs unit tests
make rpm # Builds RPM packages
```
## Adding a New Service
### Go Service
1. Create the service directory:
```bash
mkdir -p services/my-service/cmd/my-service
mkdir -p services/my-service/internal
```
2. Initialize the Go module:
```bash
cd services/my-service
go mod init github.com/antitbone/ja4/my-service
```
3. Add to `go.work`:
```
use (
	./services/sentinel
	./services/correlator
	./services/my-service   # ← add this
	./shared/go/ja4common
)
```
4. Import the shared library:
```go
import (
	"github.com/antitbone/ja4/ja4common/config"
	"github.com/antitbone/ja4/ja4common/logger"
	"github.com/antitbone/ja4/ja4common/shutdown"
)
```
5. Add Makefile targets:
```makefile
build-my-service:
	docker build -f services/my-service/Dockerfile -t ja4-platform/my-service:latest .

test-my-service:
	docker build -f services/my-service/Dockerfile.dev -t ja4-platform/my-service-tests:latest .
	docker run --rm ja4-platform/my-service-tests:latest
```
6. Update `build-all` and `test-all` dependencies.
### Python Service
1. Create the service directory with a `requirements.txt` or `pyproject.toml`.
2. Add `ja4-common` as a dependency (installed from `shared/python/ja4_common`).
3. Use `from ja4_common.clickhouse import get_client` for ClickHouse access.
4. Add Makefile targets following the bot-detector/dashboard pattern.
## go.work Workspace
The `go.work` file at the repository root links all Go modules, allowing cross-module development without publishing:
```
go 1.24.6

use (
	./services/sentinel
	./services/correlator
	./shared/go/ja4common
)
```
When adding a new Go module:
1. `go mod init` in the service directory
2. Add the path to `go.work`
3. Reference shared packages via their module path: `github.com/antitbone/ja4/ja4common/...`
4. Run `go work sync` to update the workspace
## ja4_common Python Package
The shared Python package (`shared/python/ja4_common`) provides:
- `ClickHouseSettings` — pydantic-settings model reading from `.env`
- `ClickHouseClient` — singleton client with auto-reconnect
- `get_client()` — module-level singleton accessor
### Extending ja4_common
1. Add new modules under `shared/python/ja4_common/ja4_common/`
2. Export them in `__init__.py`
3. Add dependencies to `pyproject.toml`
4. Run tests: `make test-ja4common-python`
### Using in a New Service
Add to `requirements.txt`:
```
ja4-common @ file:///app/shared/python/ja4_common
```
Or in Docker, copy the shared library and install:
```dockerfile
COPY shared/python/ja4_common /app/shared/python/ja4_common
RUN pip install /app/shared/python/ja4_common
```
## Environment Variables
Each service reads configuration from environment variables and/or YAML config files. See individual service documentation for the full reference:
- [Sentinel configuration](services/sentinel.md#configuration-reference)
- [Correlator configuration](services/correlator.md#configuration-reference)
- [Bot Detector configuration](services/bot-detector.md#environment-variables)
- [Dashboard configuration](services/dashboard.md#configuration)


@ -0,0 +1,265 @@
# Bot Detector
The bot-detector is a Python service that performs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs on a continuous cycle (default: every 5 minutes), using Isolation Forest to identify suspicious traffic patterns, enriched with SHAP explainability, DBSCAN clustering, and Anubis bot-rule enrichment.
## ML Algorithm
### Isolation Forest (Semi-Supervised)
The core algorithm is **Isolation Forest** (Liu, Ting & Zhou, 2008) — an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning feature space. Anomalies require fewer partitions to isolate than normal points.
The approach is **semi-supervised** because:
1. **Known bots** are identified a priori via reputation dictionaries (IP, JA4, ASN)
2. **Human baseline** is identified via ASN reputation labels (`asn_label = 'human'`)
3. The model trains **only on human-baseline traffic** (minimum 500 sessions required)
4. Unknown traffic is scored by deviation from the human profile
### Two-Model Architecture
| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| **Applicatif** | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |
### Threat Levels
| Score Range | Level | Interpretation |
|------------|-------|----------------|
| `< -0.30` | **CRITICAL** | Extremely anomalous behavior |
| `< -0.15` | **HIGH** | Strong anomaly signal |
| `< -0.05` | **MEDIUM** | Moderate anomaly |
| `≥ -0.05` | **LOW** | Slightly unusual |
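The table reads top to bottom: the first matching bound wins. A minimal sketch of that mapping, assuming the score has already been normalized (the function name `threat_level` is illustrative):

```python
def threat_level(score: float) -> str:
    """Map a normalized anomaly score to a threat level (lower = more anomalous)."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"
```

Note that the bounds are strict: a score of exactly `-0.30` falls into HIGH, not CRITICAL.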
## Feature List
### Common Features (31 — Applicatif model)
#### HTTP Behavior
| Feature | Description |
|---------|-------------|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |
#### Connection Management
| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |
#### Browser Fingerprint
| Feature | Description |
|---------|-------------|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |
#### Navigation Patterns
| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |
#### Concentration & Rarity
| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |
#### Temporal & Diversity
| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |
### Additional TCP/TLS Features (Complet model only — 4 extra)
| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |
### L4 Fingerprint Features (Complet model)
| Feature | Description |
|---------|-------------|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |
## Detection Pipeline
```
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Filter scores below threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
```
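Step 5h (A2) can be illustrated in isolation. The sketch below assumes a linearly interpolated percentile, which is a common default but not confirmed for this service; `percentile` and `adaptive_threshold` are illustrative names. Taking the minimum means the adaptive threshold can only be stricter (more negative) than the configured `ANOMALY_THRESHOLD`, never looser.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty sequence")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)                      # rank is non-negative, so int() floors
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def adaptive_threshold(scores, anomaly_threshold=-0.03, pct=5):
    """A2: threshold = min(percentile_pct(scores), ANOMALY_THRESHOLD)."""
    return min(percentile(scores, pct), anomaly_threshold)
```

When most traffic scores close to zero, the percentile sits above `ANOMALY_THRESHOLD` and the configured value applies; when the score distribution has a long negative tail, the percentile takes over.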
## Concept Drift Detection (A1)
Uses the **Kolmogorov-Smirnov test** to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds `DRIFT_THRESHOLD` (default: 0.30), the model is retrained.
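As a rough illustration, the drift check can be sketched with the standard library only. This is not the service's implementation: it assumes a feature counts as "drifted" when the two-sample KS statistic exceeds the asymptotic critical value at α = 0.05 (coefficient 1.358), and the names `ks_statistic`, `drift_fraction`, and `should_retrain` are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: max distance between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_fraction(train, current, c_alpha=1.358):
    """Fraction of features whose KS statistic exceeds the critical value."""
    drifted = 0
    for name in train:
        n, m = len(train[name]), len(current[name])
        critical = c_alpha * math.sqrt((n + m) / (n * m))
        if ks_statistic(train[name], current[name]) > critical:
            drifted += 1
    return drifted / len(train)


def should_retrain(train, current, drift_threshold=0.30):
    """Retrain when the drifted fraction exceeds DRIFT_THRESHOLD."""
    return drift_fraction(train, current) > drift_threshold
```

Here `train` and `current` are dictionaries mapping a feature name to its list of observed values.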
## SHAP Explainability (A4)
When enabled (`ENABLE_SHAP=true`), computes SHAP values for each detected anomaly using `shap.TreeExplainer`. The top 5 contributing features are stored in the `reason` field.
## DBSCAN Clustering (A8)
When enabled (`ENABLE_CLUSTERING=true`), applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a `campaign_id` (-1 = no cluster).
## Anubis Bot-Rule Enrichment
The `view_ai_features_1h` view enriches each IP with Anubis bot detection using a priority cascade:
1. **UA + IP combined** (same `rule_id`): highest confidence
2. **UA only** (no IP requirement)
3. **IP only** (no UA requirement)
4. **ASN match**
5. **Country match**
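The cascade can be sketched as a first-match-wins function. The real logic lives in ClickHouse SQL; this Python version is only an illustration. It assumes each lookup returns either `None` or a dictionary, and it interprets "no IP requirement" and "no UA requirement" through the `has_ip` and `has_ua` dictionary attributes, which is an assumption about how those flags are used.

```python
EMPTY = {"bot_name": "", "action": "", "category": ""}


def resolve_anubis(ua=None, ip=None, asn=None, country=None):
    """Return the highest-priority Anubis match, or empty defaults."""
    # 1. UA and IP rules that belong to the same rule_id.
    if ua and ip and ua["rule_id"] == ip["rule_id"]:
        return ua
    # 2. UA rule that does not also require an IP match.
    if ua and not ua.get("has_ip"):
        return ua
    # 3. IP rule that does not also require a UA match.
    if ip and not ip.get("has_ua"):
        return ip
    # 4. ASN match, then 5. country match.
    if asn:
        return asn
    if country:
        return country
    return dict(EMPTY)
```

A UA rule with `has_ip = 1` whose IP rule did not match is skipped, so the lookup falls through to the lower-priority levels.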
## Environment Variables
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |
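For illustration, a few of these variables could be read with their documented defaults as follows. This is a sketch, not the service's loader: the accepted boolean spellings are an assumption, and `load_settings` is a hypothetical name.

```python
import os


def _as_bool(raw: str) -> bool:
    # Assumption: common truthy spellings; the service may parse differently.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> dict:
    """Read a subset of the documented variables, applying table defaults."""
    env = os.environ if env is None else env
    return {
        "cycle_interval_sec": int(env.get("CYCLE_INTERVAL_SEC", "300")),
        "anomaly_threshold": float(env.get("ANOMALY_THRESHOLD", "-0.03")),
        "drift_threshold": float(env.get("DRIFT_THRESHOLD", "0.30")),
        "enable_shap": _as_bool(env.get("ENABLE_SHAP", "true")),
        "enable_multiwindow": _as_bool(env.get("ENABLE_MULTIWINDOW", "false")),
    }
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in tests.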
## Output Tables
### ml_detected_anomalies
Anomaly detections above the threat threshold. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip)`, TTL 30 days.
Key columns: `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus all ML features.
### ml_all_scores
All classifications (no threshold filter) for observability. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 3 days.
## Decision Log Format
The `decisions.jsonl` file contains structured JSONL entries:
```json
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "reason": "hit_velocity=45.2, fuzzing_index=0.8, ...", "campaign_id": 3}
{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
```
Log rotation: 50 MB max size × `LOG_BACKUP_COUNT` backups (default 7).
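Because each line is a standalone JSON object, the log can be processed line by line. The sketch below counts events by type and collects the source IPs of HIGH and CRITICAL anomalies; `summarize` and `SAMPLE` are illustrative names, and the sample lines are shortened versions of the entries above.

```python
import json
from collections import Counter

SAMPLE = [
    '{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000}',
    '{"event": "ANOMALY", "src_ip": "203.0.113.42", "score": -0.25, "threat_level": "HIGH", "campaign_id": 3}',
    '{"event": "KNOWN_BOT", "src_ip": "198.51.100.10", "bot_name": "AhrefsBot"}',
    '{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15}',
]


def summarize(lines):
    """Return (event counts, IPs flagged HIGH or CRITICAL) for JSONL lines."""
    events = Counter()
    flagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        events[entry["event"]] += 1
        if entry["event"] == "ANOMALY" and entry.get("threat_level") in ("HIGH", "CRITICAL"):
            flagged.append(entry["src_ip"])
    return events, flagged
```

The same function works on an open file handle, for example `summarize(open(path, encoding="utf-8"))`.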
## Health Check Endpoint
- **URL**: `GET http://localhost:8080/`
- **Response**: `200 OK` with status JSON
- Runs in a separate thread
## Model Persistence
| File | Description |
|------|-------------|
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
| `model_<name>.current` | Pointer to active model version |
| `training_history.jsonl` | Training history log |
Models are rotated: only the last `MODEL_HISTORY_COUNT` versions (default 10) are kept.
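A rotation step along these lines could look like the sketch below. It assumes version strings sort lexicographically in chronological order (for example, timestamps); the actual version format is not documented here, and `prune_models` is a hypothetical helper.

```python
from pathlib import Path


def prune_models(model_dir, name, keep=10):
    """Delete all but the newest `keep` versions; return removed versions."""
    model_dir = Path(model_dir)
    prefix = f"model_{name}_"
    versions = sorted(
        p.name[len(prefix):-len(".joblib")]
        for p in model_dir.glob(f"{prefix}*.joblib")
    )
    # versions[:-0] would be empty, so handle keep <= 0 explicitly.
    stale = versions[:-keep] if keep > 0 else versions
    for version in stale:
        for suffix in (".joblib", ".meta.json"):
            (model_dir / f"{prefix}{version}{suffix}").unlink(missing_ok=True)
    return stale
```

The `model_<name>.current` pointer is untouched because it does not match the `model_<name>_` prefix.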
## Docker Deployment
```bash
# Build
make build-bot-detector
# Run with docker-compose
cd services/bot-detector
docker-compose up -d
```
### Volumes
| Host Path | Container Path | Description |
|-----------|---------------|-------------|
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |

docs/services/correlator.md Normal file

@ -0,0 +1,220 @@
# Correlator
The correlator (`logcorrelator`) is a Go daemon that joins HTTP events from [mod-reqin-log](mod-reqin-log.md) (source A) with TLS/network events from [sentinel](sentinel.md) (source B) into unified correlated log entries. It uses a `src_ip:src_port` key with a configurable time window to match events, supports HTTP Keep-Alive connections, and writes results to ClickHouse, file, and/or stdout.
## Correlation Algorithm
### Key Matching
Events are correlated by their **correlation key**: `src_ip:src_port`. At any given moment, a client's IP address and ephemeral source port identify a single TCP connection to the server, so matching on this pair joins the HTTP request (seen by Apache) with the TLS handshake (seen by sentinel) from the same connection. Because ephemeral ports are eventually reused, the match is also bounded in time.
### Time Window
Events must arrive within the configured time window (default: **10 seconds**) to be matched. This accounts for:
- Processing latency between Apache and sentinel
- Packet capture buffering
- UNIX socket delivery ordering
### Keep-Alive Support
In `one_to_many` mode (default), a single TLS handshake event (source B) can match **multiple** HTTP requests (source A) on the same TCP connection:
1. Source B event arrives → buffered with TTL (default: 120 s)
2. Source A event arrives with same key → correlation match, B event TTL resets
3. Next A event on same connection → matches same B event (TTL resets again)
4. Connection closes → B event expires after TTL
Each A event within a Keep-Alive session gets an incrementing `keepalives` counter.
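The steps above can be sketched as a small in-memory matcher. The service itself is written in Go; this Python version only illustrates the TTL-reset behavior. It assumes the `keepalives` counter starts at 1 for the first matched request, and it leaves out the time window, buffer limits, and orphan emission. `KeepAliveMatcher` is a hypothetical name.

```python
class KeepAliveMatcher:
    """one_to_many matching: one B event serves many A events."""

    def __init__(self, network_ttl_s: float = 120.0):
        self.network_ttl_s = network_ttl_s
        self._pending = {}  # correlation key -> buffered B entry

    @staticmethod
    def key(event: dict) -> str:
        return f"{event['src_ip']}:{event['src_port']}"

    def on_network(self, event: dict, now: float) -> None:
        """Buffer a source B event with a fresh TTL."""
        self._pending[self.key(event)] = {
            "event": event,
            "expires_at": now + self.network_ttl_s,
            "keepalives": 0,
        }

    def on_http(self, event: dict, now: float):
        """Return (b_event, keepalives) on a match, else None."""
        k = self.key(event)
        entry = self._pending.get(k)
        if entry is None:
            return None
        if now > entry["expires_at"]:
            del self._pending[k]  # B event expired
            return None
        entry["keepalives"] += 1
        entry["expires_at"] = now + self.network_ttl_s  # TTL resets
        return entry["event"], entry["keepalives"]
```

A return value of `None` corresponds to a source A event that would go down the orphan path.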
### Orphan Handling
- **Source A orphans** (HTTP without TLS match): Emitted after `apache_emit_delay_ms` (default: 500 ms) with `correlated=false`, `orphan_side=A`
- **Source B orphans** (TLS without HTTP match): Not emitted by default (`network_emit: false`)
- **Buffer overflow**: Oldest events are rotated out and emitted as orphans
### Field Merging
When two events are correlated:
- HTTP fields (method, path, headers, etc.) come from source A
- TLS/network fields (JA4, JA3, IP/TCP metadata) come from source B
- On field collision with different values: both are kept with `a_` and `b_` prefixes
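A minimal sketch of these merge rules, written in Python for brevity (the service is Go). It assumes that on a collision the unprefixed key is dropped in favor of the two prefixed ones, which the rules above do not state explicitly; `merge_events` is an illustrative name.

```python
def merge_events(a: dict, b: dict) -> dict:
    """Merge source A (HTTP) and source B (network) fields."""
    merged = {}
    for field in a.keys() | b.keys():
        if field in a and field in b and a[field] != b[field]:
            # Collision with different values: keep both, prefixed.
            merged[f"a_{field}"] = a[field]
            merged[f"b_{field}"] = b[field]
        else:
            merged[field] = a[field] if field in a else b[field]
    return merged
```

Fields that appear on both sides with identical values, such as `src_ip`, are emitted once without a prefix.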
## Configuration Reference
Configuration is loaded from a YAML file (default: `/etc/logcorrelator/logcorrelator.yml`).
### Log Settings
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `log.level` | string | `INFO` | Log level: `DEBUG`, `INFO`, `WARN`, `ERROR` |
### Input Settings
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `inputs.unix_sockets[].name` | string | — | Human-readable source name (e.g., `http`, `network`) |
| `inputs.unix_sockets[].path` | string | — | UNIX socket path to listen on |
| `inputs.unix_sockets[].format` | string | `json` | Input format |
| `inputs.unix_sockets[].source_type` | string | — | Event source: `A` (HTTP), `B` (Network) |
| `inputs.unix_sockets[].socket_permissions` | string | `0666` | Socket file permissions (octal) |
### Output Settings
#### File Output
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `outputs.file.enabled` | bool | `true` | Enable file output |
| `outputs.file.path` | string | `/var/log/logcorrelator/correlated.log` | Output file path |
#### ClickHouse Output
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `outputs.clickhouse.enabled` | bool | `false` | Enable ClickHouse output |
| `outputs.clickhouse.dsn` | string | — | ClickHouse DSN (e.g., `clickhouse://user:pass@host:9000/db`) |
| `outputs.clickhouse.table` | string | — | Target table name |
| `outputs.clickhouse.batch_size` | int | `500` | Records per batch insert |
| `outputs.clickhouse.flush_interval_ms` | int | `200` | Flush interval in milliseconds |
| `outputs.clickhouse.max_buffer_size` | int | `5000` | Maximum in-memory buffer size |
| `outputs.clickhouse.drop_on_overflow` | bool | `true` | Drop records when buffer is full |
| `outputs.clickhouse.async_insert` | bool | `true` | Use ClickHouse async inserts |
| `outputs.clickhouse.timeout_ms` | int | `1000` | Operation timeout in milliseconds |
#### Stdout Output
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `outputs.stdout.enabled` | bool | `false` | Enable stdout output |
| `outputs.stdout.level` | string | — | Output verbosity filter |
### Correlation Settings
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `correlation.time_window.value` | int | `10` | Time window value |
| `correlation.time_window.unit` | string | `s` | Time window unit (`s`, `ms`) |
| `correlation.orphan_policy.apache_always_emit` | bool | `true` | Always emit A events even without B match |
| `correlation.orphan_policy.apache_emit_delay_ms` | int | `500` | Delay before emitting orphan A (ms) |
| `correlation.orphan_policy.network_emit` | bool | `false` | Emit B events without A match |
| `correlation.matching.mode` | string | `one_to_many` | Matching mode: `one_to_one` or `one_to_many` |
| `correlation.buffers.max_http_items` | int | `10000` | Max buffered HTTP (source A) events |
| `correlation.buffers.max_network_items` | int | `20000` | Max buffered network (source B) events |
| `correlation.ttl.network_ttl_s` | int | `120` | TTL for source B events (seconds) |
| `correlation.exclude_source_ips` | []string | `[]` | IPs or CIDRs to exclude from correlation |
| `correlation.include_dest_ports` | []int | `[]` | If non-empty, only correlate events on these ports |
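The filtering and time-window settings above can be illustrated with a short sketch. It is written in Python for brevity (the correlator itself is a Go service) and is not the correlator's code: the function names are hypothetical, and the symmetric window check is an assumption about how `time_window` is applied.

```python
import ipaddress
from datetime import datetime, timedelta


def time_window(value: int, unit: str) -> timedelta:
    """Convert correlation.time_window.{value,unit} into a timedelta."""
    if unit == "s":
        return timedelta(seconds=value)
    if unit == "ms":
        return timedelta(milliseconds=value)
    raise ValueError(f"unsupported time window unit: {unit!r}")


def is_excluded(src_ip: str, exclude_source_ips: list) -> bool:
    """True if src_ip matches any plain IP or CIDR in the exclusion list."""
    addr = ipaddress.ip_address(src_ip)
    # strict=False lets a plain IP such as "192.168.1.1" parse as a /32 network.
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in exclude_source_ips
    )


def port_allowed(dst_port: int, include_dest_ports: list) -> bool:
    """An empty include list means every destination port is eligible."""
    return not include_dest_ports or dst_port in include_dest_ports


def in_window(a_time: datetime, b_time: datetime, window: timedelta) -> bool:
    """Assumed rule: two events match if they are at most `window` apart."""
    return abs(a_time - b_time) <= window
```

With the defaults (`value: 10`, `unit: s`), two events ten seconds apart would still match under this assumed rule, while events eleven seconds apart would not.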
### Metrics Settings
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `metrics.enabled` | bool | `false` | Enable metrics HTTP server |
| `metrics.addr` | string | `:8080` | Metrics server listen address |
## Input Events
### Source A (HTTP — from mod-reqin-log)
JSON fields: `time`, `src_ip`, `src_port`, `dst_ip`, `dst_port`, `method`, `scheme`, `host`, `path`, `query`, `http_version`, `client_headers`, `header_*`
### Source B (Network — from sentinel)
JSON fields: `src_ip`, `src_port`, `dst_ip`, `dst_port`, `ip_meta_*`, `tcp_meta_*`, `tls_version`, `tls_sni`, `tls_alpn`, `ja4`, `ja3`, `ja3_hash`, `conn_id`, `syn_to_clienthello_ms`, `timestamp`
## Output CorrelatedLog JSON Schema
```json
{
"timestamp": "2026-03-09T14:30:00Z",
"src_ip": "203.0.113.42",
"src_port": 52341,
"dst_ip": "192.168.1.10",
"dst_port": 443,
"correlated": true,
"method": "GET",
"host": "example.com",
"path": "/api/v1/users",
"ja4": "t13d1516h2_8daaf6152771_b0da82dd1658",
"ja3_hash": "e7d705a3286e19ea42f587b344ee6865",
"ip_meta_ttl": 64,
"tcp_meta_window_size": 65535,
"tls_version": "1.3",
"tls_sni": "example.com",
"tls_alpn": "h2",
"header_User-Agent": "Mozilla/5.0 ...",
"keepalives": 3
}
```
Core fields are always present; additional fields are merged from A and B event raw data.
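As a rough illustration of how a flat correlated record could be assembled from the two inputs, here is a minimal Python sketch. It is not the correlator's implementation: `merge_events` is a hypothetical name, the rule that source A wins on a key collision is an assumption, and so is the renaming of A's `time` field to `timestamp`.

```python
from typing import Optional


def merge_events(a: Optional[dict], b: Optional[dict]) -> dict:
    """Merge one HTTP event (A) and one network event (B) into a flat record.

    Assumption: on a key collision the A value wins; the real precedence
    rule is not stated in this document.
    """
    if a is None and b is None:
        raise ValueError("at least one event is required")
    merged = {}
    if b is not None:
        merged.update(b)
    if a is not None:
        merged.update(a)
    # Source A carries its time under "time"; the output schema uses "timestamp".
    if "time" in merged:
        merged["timestamp"] = merged.pop("time")
    # An orphan (only A or only B present) is emitted with correlated=False.
    merged["correlated"] = a is not None and b is not None
    return merged
```

An orphan B event keeps its nanosecond `timestamp` unchanged in this sketch; converting it to ISO 8601 would be a separate step.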
## ClickHouse Sink
- **Protocol**: ClickHouse native TCP (port 9000) via `clickhouse-go/v2`
- **Target table**: `http_logs_raw` (raw JSON stored, then parsed by materialized views)
- **Batch inserts**: Buffered up to `batch_size` records (default 500)
- **Flush interval**: Default 200 ms timer triggers flush if batch not full
- **Retry behavior**: Up to 3 retries with exponential backoff (100 ms base)
- **Connection ping**: 5-second timeout on startup
- **Buffer overflow**: Once the buffer holds `max_buffer_size` records, new records are dropped when `drop_on_overflow` is enabled (the default)
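The buffering and retry rules listed above can be sketched as follows. This is an illustrative Python model, not the sink's Go code: `BatchBuffer` and `backoff_delays_ms` are hypothetical names, the doubling factor is an assumption (the document only says "exponential"), and raising on overflow when `drop_on_overflow` is off is a placeholder for whatever the real sink does.

```python
class BatchBuffer:
    """Bounded in-memory buffer that hands out batches of records."""

    def __init__(self, batch_size=500, max_buffer_size=5000, drop_on_overflow=True):
        self.batch_size = batch_size
        self.max_buffer_size = max_buffer_size
        self.drop_on_overflow = drop_on_overflow
        self.records = []
        self.dropped = 0

    def add(self, record) -> bool:
        """Queue a record; return False if it was dropped on overflow."""
        if len(self.records) >= self.max_buffer_size:
            if self.drop_on_overflow:
                self.dropped += 1
                return False
            # Assumption: the real sink may block or apply back-pressure instead.
            raise BufferError("buffer full")
        self.records.append(record)
        return True

    def ready(self) -> bool:
        """True once a full batch is available (the size-based flush trigger)."""
        return len(self.records) >= self.batch_size

    def take_batch(self) -> list:
        """Remove and return up to batch_size records (also used by a timer flush)."""
        batch = self.records[: self.batch_size]
        self.records = self.records[self.batch_size :]
        return batch


def backoff_delays_ms(base_ms=100, retries=3, factor=2):
    """Delay before each retry; 100, 200, 400 ms with these defaults."""
    return [base_ms * factor**attempt for attempt in range(retries)]
```

A flush loop would call `take_batch()` either when `ready()` returns true or when the flush timer fires, and would sleep for each delay in turn before retrying a failed insert.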
## Metrics HTTP Server
When `metrics.enabled: true`, exposes:
| Endpoint | Description |
|----------|-------------|
| `GET /metrics` | Correlation metrics as JSON (events received, correlated, orphans, buffer sizes) |
| `GET /health` | Health check endpoint |
## systemd Service
```ini
[Unit]
Description=logcorrelator service
After=network.target
[Service]
Type=simple
User=logcorrelator
Group=logcorrelator
ExecStart=/usr/bin/logcorrelator -config /etc/logcorrelator/logcorrelator.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
RuntimeDirectory=logcorrelator
RuntimeDirectoryMode=0755
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/logcorrelator /etc/logcorrelator
# Resource limits
LimitNOFILE=65536
TimeoutStartSec=10
TimeoutStopSec=30
[Install]
WantedBy=multi-user.target
```
### Security Hardening
- Runs as dedicated `logcorrelator` user/group
- `NoNewPrivileges=true` — prevents privilege escalation
- `ProtectSystem=strict` — read-only filesystem except `ReadWritePaths`
- `ProtectHome=true` — no access to home directories
- `RuntimeDirectory=logcorrelator` — systemd creates socket directory with correct ownership
## RPM Package Contents
| Path | Description |
|------|-------------|
| `/usr/bin/logcorrelator` | Binary |
| `/etc/logcorrelator/logcorrelator.yml` | Configuration file |
| `/usr/lib/systemd/system/logcorrelator.service` | systemd unit |
| `/var/log/logcorrelator/` | Log directory |
| `/var/run/logcorrelator/` | Socket directory (RuntimeDirectory) |

docs/services/dashboard.md Normal file

@ -0,0 +1,308 @@
# Dashboard
The dashboard is a SOC (Security Operations Center) web application built with FastAPI (backend) and React (frontend) that provides real-time visualization, investigation, and analysis of bot detections generated by the [bot-detector](bot-detector.md). It queries ClickHouse (`mabase_prod`) for all data.
## Technology Stack
| Component | Technology |
|-----------|-----------|
| Backend | Python 3.11 + FastAPI |
| Frontend | React + Vite |
| Database | ClickHouse (via `ja4_common` shared client) |
| API Docs | Swagger UI (`/docs`) and ReDoc (`/redoc`) |
## Configuration
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse user |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `API_HOST` | string | `0.0.0.0` | API listen address |
| `API_PORT` | int | `8000` | API listen port |
| `CORS_ORIGINS` | list | `["http://localhost:3000", "http://127.0.0.1:3000"]` | Allowed CORS origins |
## API Reference
All endpoints except the `/health` check are prefixed with `/api/`. The dashboard exposes **74+ endpoints** across 20 routers.
### Health
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check — returns ClickHouse connection status |
---
### Metrics (`/api/metrics`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/metrics` | Global dashboard metrics: detection counts by threat level, unique IPs, time series |
| GET | `/api/metrics/threats` | Threat distribution summary |
| GET | `/api/metrics/baseline` | Human baseline statistics |
---
### Detections (`/api/detections`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/detections` | Paginated detection list with filtering, sorting, and text search |
| GET | `/api/detections/{detection_id}` | Single detection details |
**Query Parameters** (GET `/api/detections`):
| Parameter | Type | Description |
|-----------|------|-------------|
| `page` | int | Page number (default: 1) |
| `page_size` | int | Items per page (default: 20) |
| `threat_level` | string | Filter by threat level |
| `model_name` | string | Filter by model name |
| `search` | string | Full-text search across IP, JA4, host, bot_name |
| `sort_by` | string | Sort field |
| `sort_order` | string | `asc` or `desc` |
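A small helper shows how these query parameters combine into a request URL. It is only a sketch: `detections_url` is a hypothetical name, and the base URL in the usage note assumes the default `API_PORT` of 8000 on localhost.

```python
from urllib.parse import urlencode


def detections_url(base_url: str, **params) -> str:
    """Build a GET /api/detections URL, skipping parameters set to None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/api/detections"
    return f"{url}?{query}" if query else url
```

For example, `detections_url("http://localhost:8000", page=2, page_size=50, sort_order="desc")` yields `http://localhost:8000/api/detections?page=2&page_size=50&sort_order=desc`.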
---
### Investigation (`/api/investigation`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/investigation/{ip}/summary` | **Primary investigation endpoint.** Aggregates ML score, brute-force, TCP spoofing, JA4 rotation, persistence, and 24h timeline into a single response with a `risk_score` (0 to 100) |
---
### Reputation (`/api/reputation`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/reputation/ip/{ip_address}` | Full IP reputation from IP-API.com and IPinfo.io (proxy, VPN, Tor, hosting detection) |
| GET | `/api/reputation/ip/{ip_address}/summary` | Simplified reputation summary |
---
### Analysis (`/api/analysis`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/analysis/{ip}/subnet` | Subnet analysis for an IP (related IPs in same /24) |
| GET | `/api/analysis/{ip}/country` | Country-level analysis for an IP |
| GET | `/api/analysis/country` | Global country analysis across all detections |
| GET | `/api/analysis/{ip}/ja4` | JA4 fingerprint analysis for an IP |
| GET | `/api/analysis/{ip}/user-agents` | User-agent analysis for an IP |
| GET | `/api/analysis/{ip}/recommendation` | SOC classification recommendation |
| POST | `/api/analysis/classifications` | Create a classification (legitimate/suspicious/malicious) |
| GET | `/api/analysis/classifications` | List all classifications |
| GET | `/api/analysis/classifications/stats` | Classification statistics |
---
### Entities (`/api/entities`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/entities/types` | List available entity types |
| GET | `/api/entities/subnet/{subnet}` | Investigate a subnet |
| GET | `/api/entities/{entity_type}/{entity_value}` | Investigate any entity (IP, JA4, subnet, UA, host) |
| GET | `/api/entities/{entity_type}/{entity_value}/related` | Related entities |
| GET | `/api/entities/{entity_type}/{entity_value}/user_agents` | User-agents for entity |
| GET | `/api/entities/{entity_type}/{entity_value}/client_headers` | Client headers for entity |
| GET | `/api/entities/{entity_type}/{entity_value}/paths` | URL paths for entity |
| GET | `/api/entities/{entity_type}/{entity_value}/query_params` | Query parameters for entity |
---
### Incidents (`/api/incidents`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/incidents` | List all incidents |
| GET | `/api/incidents/clusters` | Active incident clusters (behavioral similarity grouping) |
| GET | `/api/incidents/{cluster_id}` | Incident cluster details |
| POST | `/api/incidents/{cluster_id}/classify` | Classify an incident cluster |
---
### Fingerprints (`/api/fingerprints`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/fingerprints/spoofing` | TLS fingerprint spoofing detection |
| GET | `/api/fingerprints/ja4-ua-matrix` | JA4 ↔ User-Agent correlation matrix |
| GET | `/api/fingerprints/ua-analysis` | Suspicious user-agent analysis |
| GET | `/api/fingerprints/ip/{ip}/coherence` | Fingerprint coherence analysis per IP |
| GET | `/api/fingerprints/legitimate-ja4` | Known legitimate JA4 fingerprints |
| GET | `/api/fingerprints/asn-correlation` | JA4-ASN correlation analysis |
---
### Brute Force (`/api/bruteforce`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/bruteforce/targets` | Brute-force target hosts |
| GET | `/api/bruteforce/attackers` | Brute-force attacker IPs |
| GET | `/api/bruteforce/timeline` | Brute-force attack timeline |
| GET | `/api/bruteforce/host/{host}/attackers` | Attackers for a specific host |
---
### TCP Spoofing (`/api/tcp-spoofing`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/tcp-spoofing/overview` | TCP/OS fingerprint spoofing overview |
| GET | `/api/tcp-spoofing/list` | Spoofing detection list |
| GET | `/api/tcp-spoofing/matrix` | TTL × MSS anomaly matrix |
---
### Header Fingerprint (`/api/headers`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/headers/clusters` | Header fingerprint clusters (suspicious patterns) |
| GET | `/api/headers/cluster/{hash}/ips` | IPs sharing a header fingerprint |
---
### Heatmap (`/api/heatmap`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/heatmap/hourly` | Hourly traffic heatmap |
| GET | `/api/heatmap/top-hosts` | Top hosts by traffic volume |
| GET | `/api/heatmap/matrix` | Activity/hour matrix |
---
### Botnets (`/api/botnets`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/botnets/ja4-spread` | JA4 geographic spread (botnet indicator) |
| GET | `/api/botnets/ja4/{ja4}/countries` | Country distribution for a JA4 fingerprint |
| GET | `/api/botnets/summary` | Global botnet detection summary |
---
### Rotation (`/api/rotation`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/rotation/ja4-rotators` | IPs rotating JA4 fingerprints (evasion detection) |
| GET | `/api/rotation/persistent-threats` | Persistent threats across time windows |
| GET | `/api/rotation/ip/{ip}/ja4-history` | JA4 fingerprint history for an IP |
| GET | `/api/rotation/sophistication` | Sophistication score analysis |
| GET | `/api/rotation/proactive-hunt` | Proactive threat hunting suggestions |
---
### ML Features (`/api/ml`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/ml/top-anomalies` | Top anomalies with feature details |
| GET | `/api/ml/ip/{ip}/radar` | Feature radar chart data for an IP |
| GET | `/api/ml/score-distribution` | Anomaly score distribution histogram |
| GET | `/api/ml/score-trends` | Score trends over time |
| GET | `/api/ml/b-features` | Source B (TCP/TLS) feature analysis |
| GET | `/api/ml/campaigns` | ML-detected campaign analysis |
| GET | `/api/ml/scatter` | Feature scatter plot data |
---
### Attributes (`/api/attributes`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/attributes/{attr_type}` | List distinct values for an attribute (ja4, user_agent, asn, country, host) with counts |
---
### Variability (`/api/variability`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/variability/{attr_type}/{value}` | Behavioral variability analysis for an attribute value |
| GET | `/api/variability/{attr_type}/{value}/ips` | IPs associated with an attribute value |
| GET | `/api/variability/{attr_type}/{value}/attributes` | Attribute breakdown for a value |
| GET | `/api/variability/{attr_type}/{value}/user_agents` | User-agents for an attribute value |
---
### Clustering (`/api/clustering`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/clustering/status` | Clustering cache status |
| GET | `/api/clustering/clusters` | K-Means cluster list |
| GET | `/api/clustering/cluster/{cluster_id}/points` | Data points in a cluster |
| GET | `/api/clustering/cluster/{cluster_id}/ips` | IPs in a cluster |
---
### Search (`/api/search`)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/search/quick` | Cross-entity search (IP, JA4, host, UA, country, ASN) |
---
### Audit (`/api/audit`)
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/audit/logs` | Create an audit log entry |
| GET | `/api/audit/logs` | Query audit logs (filtered, paginated) |
| GET | `/api/audit/stats` | Audit statistics |
| GET | `/api/audit/users/activity` | Per-user activity summary |
## Frontend Structure
The React frontend is built with Vite and served as static assets:
- **Entry point**: `/` → `frontend/dist/index.html`
- **Static assets**: `/assets/*` → `frontend/dist/assets/`
- **SPA routing**: All non-`/api/` paths fall through to `index.html` (React Router)
- **API proxy**: Frontend calls `/api/*` which is handled by FastAPI routers
## Services
### IPReputationService
Queries public IP reputation databases (IP-API.com, IPinfo.io) without API keys:
- Proxy/VPN/Tor detection
- ASN, country, ISP information
- Hosting provider identification
### ClusteringEngine
K-Means clustering on ML features with caching:
- Automatic cluster count selection
- Feature normalization via StandardScaler
- In-memory cache with TTL
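The "in-memory cache with TTL" idea can be sketched in a few lines. This is a generic illustration, not the ClusteringEngine's code: `TTLCache` and its `get(key, compute)` signature are hypothetical, and the injectable clock exists only to make the behaviour easy to test.

```python
import time


class TTLCache:
    """Cache whose entries expire a fixed number of seconds after being stored."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        """Return the cached value, or call compute() and cache its result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A clustering run would be passed as `compute`, so repeated requests within the TTL reuse the previous result instead of re-running K-Means.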
## Deployment
```bash
# Build Docker image
make build-dashboard
# Run tests
make test-dashboard
# Run locally (development)
cd services/dashboard
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
```
### Health Check
```
GET /health → {"status": "healthy", "clickhouse": "connected"}
```


@ -0,0 +1,200 @@
# mod-reqin-log
`mod_reqin_log` is an Apache HTTPD module (C shared object) that captures HTTP request metadata and sends it as JSON to a UNIX datagram socket. It serves as the HTTP-layer ingestion point for the ja4-platform pipeline, feeding request data to the [correlator](correlator.md) for joining with TLS fingerprint data from [sentinel](sentinel.md).
## Purpose
Apache processes HTTP requests after TLS termination, so it has access to the decoded HTTP method, path, headers, and client IP/port. mod-reqin-log hooks into the `post_read_request` phase to serialize this data immediately, before any rewrite or auth module modifies the request.
## Apache Directives Reference
All directives are server-level (`RSRC_CONF`):
| Directive | Type | Default | Description |
|-----------|------|---------|-------------|
| `JsonSockLogEnabled` | Flag (On/Off) | Off | Enable or disable the module |
| `JsonSockLogSocket` | String | — | UNIX domain socket path for JSON output |
| `JsonSockLogHeaders` | String list | — | HTTP header names to log (repeatable) |
| `JsonSockLogMaxHeaders` | Integer | `25` | Maximum number of headers to log |
| `JsonSockLogMaxHeaderValueLen` | Integer | `256` | Maximum length of each header value (truncated beyond) |
| `JsonSockLogReconnectInterval` | Integer (seconds) | `10` | Minimum seconds between reconnection attempts |
| `JsonSockLogErrorReportInterval` | Integer (seconds) | `10` | Minimum seconds between error log entries (throttling) |
| `JsonSockLogLevel` | String | `WARNING` | Module log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `EMERG` |
### Example httpd.conf
```apache
LoadModule reqin_log_module modules/mod_reqin_log.so
JsonSockLogEnabled On
JsonSockLogSocket /var/run/logcorrelator/http.socket
JsonSockLogHeaders User-Agent Accept Accept-Encoding Accept-Language
JsonSockLogHeaders Content-Type X-Request-Id X-Trace-Id X-Forwarded-For
JsonSockLogHeaders Sec-CH-UA Sec-CH-UA-Mobile Sec-CH-UA-Platform
JsonSockLogHeaders Sec-Fetch-Dest Sec-Fetch-Mode Sec-Fetch-Site
JsonSockLogMaxHeaders 25
JsonSockLogMaxHeaderValueLen 256
JsonSockLogReconnectInterval 10
JsonSockLogErrorReportInterval 10
JsonSockLogLevel WARNING
```
## Output JSON Schema
Each HTTP request is serialized as a flat JSON object and sent as a single UNIX datagram:
```json
{
"time": "2026-03-09T14:30:00Z",
"src_ip": "203.0.113.42",
"src_port": 52341,
"dst_ip": "192.168.1.10",
"dst_port": 443,
"method": "GET",
"scheme": "https",
"host": "example.com",
"path": "/api/v1/users",
"query": "page=1&limit=20",
"http_version": "HTTP/2.0",
"client_headers": "User-Agent,Accept,Accept-Encoding,Accept-Language",
"header_User-Agent": "Mozilla/5.0 ...",
"header_Accept": "text/html,application/xhtml+xml",
"header_Accept-Encoding": "gzip, deflate, br",
"header_Accept-Language": "en-US,en;q=0.9",
"header_Sec-Fetch-Dest": "document",
"header_Sec-Fetch-Mode": "navigate",
"header_Sec-Fetch-Site": "none"
}
```
### Field Reference
| Field | Type | Description |
|-------|------|-------------|
| `time` | string (ISO 8601) | Request timestamp (UTC) |
| `src_ip` | string | Client IP address |
| `src_port` | int | Client source port |
| `dst_ip` | string | Server IP address |
| `dst_port` | int | Server port |
| `method` | string | HTTP method (`GET`, `POST`, etc.) |
| `scheme` | string | URL scheme (`http` or `https`) |
| `host` | string | HTTP Host header value |
| `path` | string | Request URI path |
| `query` | string | Query string (without `?`) |
| `http_version` | string | HTTP version (`HTTP/1.1`, `HTTP/2.0`) |
| `client_headers` | string | Comma-separated list of header names sent by client (order preserved) |
| `header_<Name>` | string | Value of each configured header (one field per header) |
### Sensitive Headers
The following headers are **always excluded** from output regardless of `JsonSockLogHeaders`:
- `Authorization`
- `Cookie`
- `Set-Cookie`
- `X-Api-Key`
- `X-Auth-Token`
- `Proxy-Authorization`
- `WWW-Authenticate`
### Size Limits
- Maximum JSON size: **64 KB** (prevents memory exhaustion DoS)
- Header values are truncated to `JsonSockLogMaxHeaderValueLen` bytes
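The header rules above (configured list, sensitive-header exclusion, header cap, value truncation) can be modelled in a short Python sketch. It illustrates the documented rules rather than the C module's code: `select_headers` is a hypothetical name, and case-insensitive name matching is an assumption based on HTTP header semantics.

```python
SENSITIVE_HEADERS = frozenset(
    name.lower()
    for name in (
        "Authorization", "Cookie", "Set-Cookie", "X-Api-Key",
        "X-Auth-Token", "Proxy-Authorization", "WWW-Authenticate",
    )
)


def select_headers(request_headers, configured, max_headers=25, max_value_len=256):
    """Return the header_<Name> fields that would be logged for one request."""
    by_lower = {name.lower(): value for name, value in request_headers.items()}
    fields = {}
    for name in configured:
        if len(fields) >= max_headers:
            break
        if name.lower() in SENSITIVE_HEADERS:
            continue  # never logged, even if listed in JsonSockLogHeaders
        value = by_lower.get(name.lower())
        if value is None:
            continue  # header not present on this request
        # Truncate by bytes, then drop any partial trailing character.
        raw = value.encode("utf-8")[:max_value_len]
        fields[f"header_{name}"] = raw.decode("utf-8", errors="ignore")
    return fields
```

Listing `Cookie` in the configured headers has no effect in this model, which mirrors the "always excluded" rule.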
## Thread Safety
mod-reqin-log is designed for Apache's `worker` and `event` MPMs (multi-threaded):
- **Socket FD** is protected by an `apr_thread_mutex_t` (`fd_mutex`)
- **Per-child process state** includes the socket file descriptor, mutex, and error tracking
- **Error reporting** uses `LOG_THROTTLED` macro with timestamp-based deduplication
- All JSON serialization uses per-request pool allocation — no shared buffers
### Architecture
```
Apache HTTPD process
├── child process 1
│ ├── fd_mutex (apr_thread_mutex_t)
│ ├── socket_fd (shared across threads)
│ ├── thread 1 → post_read_request → serialize JSON → mutex lock → sendto() → unlock
│ ├── thread 2 → post_read_request → serialize JSON → mutex lock → sendto() → unlock
│ └── ...
├── child process 2
│ ├── fd_mutex
│ ├── socket_fd (independent)
│ └── ...
```
## Reconnection Behavior
- Socket is opened during `child_init` (per-child process startup)
- If the socket is unavailable at startup, connection is deferred
- On send failure, reconnection is attempted respecting `JsonSockLogReconnectInterval`
- Failed sends are silently dropped (HTTP request processing is not blocked)
- Error log entries are throttled by `JsonSockLogErrorReportInterval`
- Socket type: `SOCK_DGRAM` (connectionless UNIX datagram)
- Non-blocking sends with `MSG_NOSIGNAL`
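The drop-on-failure and throttled-reconnect behaviour described above can be sketched as follows. This is a Python model of the documented behaviour, not the module's C code: `DatagramSender` is a hypothetical class, it uses `connect()` plus `send()` for simplicity, and the injectable clock is only there for testing.

```python
import socket
import time


class DatagramSender:
    """Best-effort UNIX datagram sender with throttled reconnection."""

    def __init__(self, path, reconnect_interval=10.0, clock=time.monotonic):
        self._path = path
        self._interval = reconnect_interval
        self._clock = clock
        self._sock = None
        self._last_attempt = None

    def _try_connect(self) -> bool:
        now = self._clock()
        # Respect the minimum delay between reconnection attempts.
        if self._last_attempt is not None and now - self._last_attempt < self._interval:
            return False
        self._last_attempt = now
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sock.connect(self._path)
        except OSError:
            sock.close()
            return False
        sock.setblocking(False)
        self._sock = sock
        return True

    def send(self, payload: bytes) -> bool:
        """Send one datagram; on any failure the payload is dropped."""
        if self._sock is None and not self._try_connect():
            return False
        try:
            self._sock.send(payload)
            return True
        except OSError:
            # Drop the record and reconnect later; the caller is never blocked.
            self._sock.close()
            self._sock = None
            return False
```

The caller ignores the return value in the hot path, so a missing or full socket costs at most one failed system call per request.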
## Deployment
### Installation via RPM
```bash
rpm -ivh mod_reqin_log-1.0.19-1.el10.x86_64.rpm
```
### LoadModule Directive
```apache
LoadModule reqin_log_module modules/mod_reqin_log.so
```
### Verifying Installation
```bash
httpd -M | grep reqin_log
# Expected: reqin_log_module (shared)
```
## Build
All builds run inside Docker:
```bash
# Run unit tests
make test-mod-reqin-log
# Build RPM packages (el8, el9, el10)
make rpm-mod-reqin-log
# RPMs in services/mod-reqin-log/dist/rpm/el{8,9,10}/
```
### Local Build (requires Apache development headers)
```bash
cd services/mod-reqin-log
make build # Compiles mod_reqin_log.so via apxs
make test # Runs unit tests
```
### Test Coverage
Unit tests cover:
- JSON serialization (escaping, size limits, field output)
- Config parsing (all directives, edge cases)
- Header handling (sensitive header exclusion, max headers, truncation)
- Module integration (real Apache module hooks)
## Source Files
| File | Description |
|------|-------------|
| `src/mod_reqin_log.c` | Main module source |
| `src/mod_reqin_log.h` | Header with types, constants, defaults |
| `conf/mod_reqin_log.conf` | Example Apache configuration |
| `tests/unit/test_json_serialization.c` | JSON output tests |
| `tests/unit/test_config_parsing.c` | Directive parsing tests |
| `tests/unit/test_header_handling.c` | Header filtering tests |
| `tests/unit/test_module_real.c` | Integration tests |

docs/services/sentinel.md Normal file

@ -0,0 +1,247 @@
# Sentinel
Sentinel (`ja4sentinel`) is a Go daemon that performs live network packet capture on a Linux server, extracts TLS ClientHello handshakes, generates JA4 and JA3 fingerprints, enriches them with IP/TCP metadata, and outputs structured JSON log records to configurable destinations (UNIX socket, file, or stdout).
## Role in the Pipeline
Sentinel is the **network-layer ingestion point**. It sits on the target server, captures TLS traffic via libpcap, and feeds fingerprinted events to the [correlator](correlator.md) through a UNIX datagram socket.
```
Network traffic (port 443/8443)
│ pcap
┌───────────────┐
│ sentinel │
│ ┌─────────┐ │
│ │ capture │──▶ Raw packets
│ └─────────┘ │
│ ┌─────────┐ │
│ │ tlsparse│──▶ TLS ClientHello extraction + TCP reassembly
│ └─────────┘ │
│ ┌─────────┐ │
│ │ finger- │──▶ JA4/JA3 fingerprint generation
│ │ print │ │
│ └─────────┘ │
│ ┌─────────┐ │
│ │ output │──▶ UNIX socket / file / stdout
│ └─────────┘ │
└───────────────┘
```
## Architecture
Sentinel uses a pipeline of goroutines:
1. **Capture goroutine** — Opens pcap handle on the configured interface, applies BPF filter, reads raw packets into a buffered channel (`packet_buffer_size`).
2. **Packet processor goroutine** — Reads from the channel, feeds packets to the TLS parser, generates fingerprints, and writes output.
3. **Watchdog goroutine** — Sends systemd watchdog heartbeats at half the configured interval.
4. **Signal handler** — Listens for `SIGINT`/`SIGTERM` (graceful shutdown) and `SIGHUP` (log rotation).
### Key Interfaces
| Interface | Package | Description |
|-----------|---------|-------------|
| `Capture` | `internal/capture` | Packet capture via libpcap |
| `Parser` | `internal/tlsparse` | TCP reassembly + ClientHello extraction |
| `Engine` | `internal/fingerprint` | JA4/JA3 fingerprint generation |
| `Writer` | `internal/output` | Log record output (stdout, file, UNIX socket) |
| `MultiWriter` | `internal/output` | Fan-out to multiple writers |
| `Builder` | `internal/output` | Factory for constructing writers from config |
## Configuration Reference
Configuration is loaded from a YAML file (default: `config.yml`) with environment variable overrides.
### Core Settings
| Name | Type | Default | Env Override | Description |
|------|------|---------|-------------|-------------|
| `core.interface` | string | `any` | `JA4SENTINEL_INTERFACE` | Network interface to capture (`any` = all interfaces) |
| `core.listen_ports` | []uint16 | `[443]` | `JA4SENTINEL_PORTS` | TCP ports to monitor (comma-separated in env) |
| `core.bpf_filter` | string | `""` (auto) | `JA4SENTINEL_BPF_FILTER` | Custom BPF filter (empty = auto-generated) |
| `core.local_ips` | []string | `[]` (auto) | — | Local IPs to monitor (empty = auto-detect, excludes loopback) |
| `core.exclude_source_ips` | []string | `[]` | — | Source IPs or CIDRs to exclude (e.g., `["10.0.0.0/8"]`) |
| `core.flow_timeout_sec` | int | `30` | `JA4SENTINEL_FLOW_TIMEOUT` | Timeout for TLS handshake extraction, in seconds (1 to 300) |
| `core.packet_buffer_size` | int | `1000` | `JA4SENTINEL_PACKET_BUFFER_SIZE` | Packet channel buffer size (1 to 1,000,000) |
| `core.log_level` | string | `info` | — | Log level: `debug`, `info`, `warn`, `error` (YAML only) |
> **Note:** `log_level` is intentionally not overridable via environment variable (architecture decision since v1.1.12).
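As an illustration of the comma-separated environment format used by `JA4SENTINEL_PORTS`, the following Python sketch parses and validates such a value. It is not sentinel's Go code: `parse_ports` is a hypothetical name, and rejecting port 0 and tolerating stray commas are assumptions.

```python
def parse_ports(raw: str) -> list:
    """Parse a comma-separated port list such as "443,8443"."""
    ports = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and stray spaces
        port = int(item)  # raises ValueError on non-numeric input
        if not 0 < port <= 65535:
            raise ValueError(f"port out of range: {port}")
        ports.append(port)
    return ports
```

Setting `JA4SENTINEL_PORTS=443,8443` would correspond to `listen_ports: [443, 8443]` in the YAML file.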
### Output Settings
Each output is an entry in the `outputs` array:
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `type` | string | — | Output type: `unix_socket`, `stdout`, `file` |
| `enabled` | bool | — | Whether this output is active |
| `async_buffer` | int | `1000` | Queue size for async writes |
| `params.socket_path` | string | — | Path for `unix_socket` type |
| `params.path` | string | — | File path for `file` type |
### Example Configuration
```yaml
core:
  interface: any
  listen_ports: [443, 8443]
  bpf_filter: ""
  local_ips: []
  exclude_source_ips: ["10.0.0.0/8", "192.168.1.1"]
  flow_timeout_sec: 30
  packet_buffer_size: 1000
  log_level: info
outputs:
  - type: unix_socket
    enabled: true
    params:
      socket_path: /var/run/logcorrelator/network.socket
  - type: file
    enabled: false
    params:
      path: /var/log/ja4sentinel/ja4.log
```
## Output Format (LogRecord JSON Schema)
Each output record is a flat JSON object:
```json
{
"src_ip": "203.0.113.42",
"src_port": 52341,
"dst_ip": "192.168.1.10",
"dst_port": 443,
"ip_meta_ttl": 64,
"ip_meta_total_length": 583,
"ip_meta_id": 12345,
"ip_meta_df": true,
"tcp_meta_window_size": 65535,
"tcp_meta_mss": 1460,
"tcp_meta_window_scale": 8,
"tcp_meta_options": "MSS,NOP,WScale,NOP,NOP,Timestamps,SACK",
"conn_id": "203.0.113.42:52341-192.168.1.10:443",
"sensor_id": "",
"tls_version": "1.3",
"tls_sni": "example.com",
"tls_alpn": "h2",
"syn_to_clienthello_ms": 12,
"ja4": "t13d1516h2_8daaf6152771_b0da82dd1658",
"ja3": "771,4866-4867-4865-49196-49200...",
"ja3_hash": "e7d705a3286e19ea42f587b344ee6865",
"timestamp": 1709312345678901234
}
```
### Field Reference
| Field | Type | Description |
|-------|------|-------------|
| `src_ip` | string | Client source IP address |
| `src_port` | uint16 | Client source port |
| `dst_ip` | string | Server destination IP address |
| `dst_port` | uint16 | Server destination port |
| `ip_meta_ttl` | uint8 | IP Time-To-Live |
| `ip_meta_total_length` | uint16 | IP total packet length |
| `ip_meta_id` | uint16 | IP identification field |
| `ip_meta_df` | bool | IP Don't Fragment flag |
| `tcp_meta_window_size` | uint16 | TCP window size |
| `tcp_meta_mss` | uint16 | TCP Maximum Segment Size (omitted if 0) |
| `tcp_meta_window_scale` | uint8 | TCP window scale factor (omitted if 0) |
| `tcp_meta_options` | string | Comma-separated TCP options |
| `conn_id` | string | Unique flow identifier |
| `sensor_id` | string | Sensor identifier |
| `tls_version` | string | Max TLS version from ClientHello |
| `tls_sni` | string | Server Name Indication |
| `tls_alpn` | string | ALPN protocol (e.g., `h2`, `http/1.1`) |
| `syn_to_clienthello_ms` | uint32 | Time from SYN to ClientHello (ms) |
| `ja4` | string | JA4 TLS fingerprint |
| `ja3` | string | JA3 TLS fingerprint |
| `ja3_hash` | string | MD5 hash of JA3 string |
| `timestamp` | int64 | Unix nanoseconds |
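Because `timestamp` is expressed in Unix nanoseconds while the HTTP side uses ISO 8601 strings, a consumer may want to convert between the two. A minimal Python sketch, with `ns_to_iso` as a hypothetical helper name:

```python
from datetime import datetime, timezone


def ns_to_iso(ts_ns: int) -> str:
    """Format Unix nanoseconds as an ISO 8601 UTC string with nanosecond precision."""
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only stores microseconds, so the fraction is appended by hand.
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{nanos:09d}Z"
```

The sample value `1709312345678901234` from the record above converts to `2024-03-01T16:59:05.678901234Z`.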
## UNIX Socket Output Protocol
- **Socket type**: `unixgram` (DGRAM — connectionless)
- **Encoding**: One JSON object per datagram (no delimiter)
- **Max datagram size**: 64 KB
- **Reconnection**: Exponential backoff (100 ms → 2 s), max 3 attempts per write
- **Queue**: Async write queue (default 1000 items) absorbs transient socket failures
- **Error callback**: Consecutive failures are tracked and reported
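From the consumer's point of view, the protocol above means one `recv` call returns exactly one JSON record. The following Python sketch shows a minimal receiver under that assumption; `open_receiver` and `read_record` are hypothetical names, and the real consumer is the Go correlator.

```python
import json
import socket

MAX_DATAGRAM = 64 * 1024  # matches the documented 64 KB datagram limit


def open_receiver(path: str) -> socket.socket:
    """Bind a UNIX datagram socket at `path` (the path must not exist yet)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock


def read_record(sock: socket.socket) -> dict:
    """Block until one datagram arrives and decode it as a JSON object."""
    data = sock.recv(MAX_DATAGRAM)
    return json.loads(data)
```

No framing or delimiter handling is needed, because datagram boundaries are preserved by the socket type.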
## Signal Handling
| Signal | Behavior |
|--------|----------|
| `SIGTERM` / `SIGINT` | Graceful shutdown: cancel context, close capture, flush outputs, log filter stats |
| `SIGHUP` | Log rotation: reopen file outputs (used by `systemctl reload` + logrotate) |
## JA4 Fingerprint Algorithm
1. Extract TLS ClientHello from the TCP payload (with TCP reassembly for fragmented handshakes)
2. Parse cipher suites, extensions, ALPN, SNI, supported versions
3. Build JA4 string: `t{version}{sni_flag}{cipher_count}{ext_count}{alpn}_{cipher_hash}_{ext_hash}` (for example `t13d1516h2_8daaf6152771_b0da82dd1658`, where `h2` is the ALPN marker)
4. Build JA3 string: `{version},{ciphers},{extensions},{curves},{formats}`
5. Compute JA3 MD5 hash
Sentinel uses the `tlsfingerprint` library for ALPN and TLS version parsing, with custom sanitization for malformed/truncated ClientHellos.
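To make the string layout concrete, here is a Python sketch of the first JA4 section and of the JA3 hash. It follows the publicly documented JA4 convention, in which the first section ends with the first and last characters of the first ALPN value (`00` when there is none); it is not sentinel's Go code, and `ja4_prefix` and `ja3_hash` are hypothetical names. The two truncated hashes that complete a JA4 string are out of scope here.

```python
import hashlib

_TLS_VERSION_CODES = {"1.0": "10", "1.1": "11", "1.2": "12", "1.3": "13"}


def ja4_prefix(tls_version, has_sni, cipher_count, ext_count, first_alpn=""):
    """Build the first JA4 section, e.g. "t13d1516h2" for TLS over TCP."""
    version = _TLS_VERSION_CODES.get(tls_version, "00")
    sni_flag = "d" if has_sni else "i"  # d = SNI (domain) present, i = absent
    alpn = first_alpn[0] + first_alpn[-1] if first_alpn else "00"
    # Counts are two digits wide and capped at 99.
    return f"t{version}{sni_flag}{min(cipher_count, 99):02d}{min(ext_count, 99):02d}{alpn}"


def ja3_hash(ja3_string: str) -> str:
    """JA3 hash: the MD5 hex digest of the comma-separated JA3 string."""
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()
```

With TLS 1.3, SNI present, 15 cipher suites, 16 extensions and ALPN `h2`, `ja4_prefix` returns `t13d1516h2`, which matches the first section of the sample fingerprint in the LogRecord example.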
## Deployment
### systemd
```ini
[Unit]
Description=ja4sentinel TLS fingerprinting daemon
After=network.target
[Service]
Type=notify
ExecStart=/usr/bin/ja4sentinel -config /etc/ja4sentinel/config.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
WatchdogSec=30
TimeoutStopSec=2
[Install]
WantedBy=multi-user.target
```
Sentinel uses systemd `sd_notify` for:
- `READY` — sent after initialization
- `WATCHDOG` — sent at half the `WatchdogSec` interval
- `STOPPING` — sent before shutdown
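The notification protocol itself is small: each state is a short text datagram sent to the socket named by the `NOTIFY_SOCKET` environment variable. The Python sketch below illustrates that protocol and the "half the interval" rule; it is not sentinel's Go code, and `sd_notify` and `watchdog_period_s` are hypothetical helper names.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send one state string (e.g. "READY=1") to systemd, if it is listening."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by systemd with Type=notify
    if path.startswith("@"):
        path = "\0" + path[1:]  # "@" marks an abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_period_s(watchdog_usec: int) -> float:
    """Heartbeat period: half of the watchdog timeout, in seconds."""
    return watchdog_usec / 1_000_000 / 2
```

With `WatchdogSec=30`, systemd exports `WATCHDOG_USEC=30000000`, so a `WATCHDOG=1` message would be sent every 15 seconds; `READY=1` and `STOPPING=1` are sent once at startup and shutdown.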
### Docker
```bash
make build-sentinel
docker run --cap-add=NET_RAW --cap-add=NET_ADMIN \
-v /var/run/logcorrelator:/var/run/logcorrelator \
ja4-platform/sentinel:latest
```
## RPM Package Contents
| Path | Description |
|------|-------------|
| `/usr/bin/ja4sentinel` | Binary (statically linked Go) |
| `/etc/ja4sentinel/config.yml.default` | Default configuration (noreplace) |
| `/usr/share/ja4sentinel/config.yml` | Reference configuration |
| `/usr/lib/systemd/system/ja4sentinel.service` | systemd unit |
| `/etc/logrotate.d/ja4sentinel` | logrotate configuration |
| `/var/lib/ja4sentinel/` | State directory |
| `/var/log/ja4sentinel/` | Log directory |
| `/var/run/logcorrelator/` | Socket directory |
### RPM Dependencies
- `systemd`
- `libpcap >= 1.9.0`
### Supported Distributions
- Rocky Linux 8, 9, 10
- AlmaLinux 8, 9
- RHEL 8, 9

docs/shared/go-ja4common.md Normal file

@ -0,0 +1,244 @@
# go-ja4common
`ja4common` is the shared Go library for the ja4-platform, providing unified logging, YAML configuration loading with environment variable overrides, graceful shutdown handling, and IP address filtering. It is used by both [sentinel](../services/sentinel.md) and [correlator](../services/correlator.md).
**Module path**: `github.com/antitbone/ja4/ja4common`
**Go version**: 1.21+
**Dependencies**: `gopkg.in/yaml.v3`
## Packages
### logger
Unified structured logging with two styles:
- **Prefix+Fields style** (correlator pattern) — `Logger`
- **Component style** (sentinel pattern) — `ComponentLogger`
#### Types
```go
type LogLevel int

const (
	DEBUG LogLevel = iota
	INFO
	WARN
	ERROR
)
```
#### Logger API
| Method | Signature | Description |
|--------|-----------|-------------|
| `New` | `New(prefix string) *Logger` | Create logger with INFO level |
| `NewWithLevel` | `NewWithLevel(prefix, level string) *Logger` | Create logger with specified level |
| `SetLevel` | `(l *Logger) SetLevel(level string)` | Change minimum log level at runtime |
| `ShouldLog` | `(l *Logger) ShouldLog(level LogLevel) bool` | Check if level would be logged |
| `WithFields` | `(l *Logger) WithFields(fields map[string]any) *Logger` | Return new logger with additional fields |
| `Info` | `(l *Logger) Info(msg string)` | Log info message |
| `Infof` | `(l *Logger) Infof(msg string, args ...any)` | Log formatted info |
| `Warn` | `(l *Logger) Warn(msg string)` | Log warning |
| `Warnf` | `(l *Logger) Warnf(msg string, args ...any)` | Log formatted warning |
| `Error` | `(l *Logger) Error(msg string, err error)` | Log error with optional error value |
| `Debug` | `(l *Logger) Debug(msg string)` | Log debug message |
| `Debugf` | `(l *Logger) Debugf(msg string, args ...any)` | Log formatted debug |
| `ParseLogLevel` | `ParseLogLevel(level string) LogLevel` | Parse string to LogLevel |
#### ComponentLogger API
Wraps `Logger` to satisfy sentinel's component-based logging interface:
| Method | Signature | Description |
|--------|-----------|-------------|
| `NewComponentLogger` | `NewComponentLogger(level string) *ComponentLogger` | Create component logger |
| `Log` | `(c *ComponentLogger) Log(component, level, message string, details map[string]string)` | Log with component context |
| `Debug` | `(c *ComponentLogger) Debug(component, message string, details map[string]string)` | Debug with component |
| `Info` | `(c *ComponentLogger) Info(component, message string, details map[string]string)` | Info with component |
| `Warn` | `(c *ComponentLogger) Warn(component, message string, details map[string]string)` | Warn with component |
| `Error` | `(c *ComponentLogger) Error(component, message string, details map[string]string)` | Error with component |
#### Usage Example
```go
import "github.com/antitbone/ja4/ja4common/logger"
// Prefix+Fields style
log := logger.NewWithLevel("myservice", "DEBUG")
log.Info("starting up")
log.WithFields(map[string]any{"port": 8080}).Info("listening")
// Component style (sentinel compatibility)
clog := logger.NewComponentLogger("info")
clog.Info("capture", "packets received", map[string]string{"count": "1000"})
```
---
### config
Generic YAML configuration loading with environment variable overrides using struct tags.
#### API
| Function | Signature | Description |
|----------|-----------|-------------|
| `LoadYAML` | `LoadYAML[T any](path string, optional bool) (T, error)` | Load and unmarshal YAML file |
| `OverrideFromEnv` | `OverrideFromEnv[T any](cfg *T, envPrefix string) error` | Apply env var overrides via `env` struct tags |
#### Supported Types for Environment Override
- `string`
- `int`, `int8`, `int16`, `int32`, `int64`
- `uint`, `uint8`, `uint16`, `uint32`, `uint64`
- `bool`
- `[]string` (comma-separated)
#### Usage Example
```go
import "github.com/antitbone/ja4/ja4common/config"
type MyConfig struct {
    Host  string   `yaml:"host" env:"HOST"`
    Port  int      `yaml:"port" env:"PORT"`
    Debug bool     `yaml:"debug" env:"DEBUG"`
    Tags  []string `yaml:"tags" env:"TAGS"`
}

// Load YAML (optional=true means a missing file returns the zero value)
cfg, err := config.LoadYAML[MyConfig]("config.yml", true)
if err != nil {
    return err
}

// Apply environment overrides; an empty prefix would use the env tags as-is.
// With prefix "MYAPP" this reads MYAPP_HOST, MYAPP_PORT, MYAPP_DEBUG, MYAPP_TAGS.
if err := config.OverrideFromEnv(&cfg, "MYAPP"); err != nil {
    return err
}
```
---
### shutdown
Graceful shutdown handler that blocks until `SIGTERM`/`SIGINT`, then runs cleanup hooks.
#### API
```go
type Hook struct {
    Name string
    Fn   func() error
}

func Handle(ctx context.Context, cancel context.CancelFunc, hooks []Hook, logger simpleLogger)
```
The `Handle` function:
1. Blocks until `SIGTERM`, `SIGINT`, or context cancellation
2. Calls `cancel()` to propagate shutdown
3. Runs all hooks in order, logging errors but not aborting
#### Usage Example
```go
import "github.com/antitbone/ja4/ja4common/shutdown"
ctx, cancel := context.WithCancel(context.Background())

hooks := []shutdown.Hook{
    {Name: "close-db", Fn: func() error { return db.Close() }},
    {Name: "flush-logs", Fn: func() error { return logger.Flush() }},
}

// This blocks until a signal is received
shutdown.Handle(ctx, cancel, hooks, myLogger)
```
---
### ipfilter
IP address and CIDR range matching for source IP exclusion.
#### API
| Method | Signature | Description |
|--------|-----------|-------------|
| `New` | `New(excludeList []string) (*Filter, error)` | Create filter from IP/CIDR list |
| `ShouldExclude` | `(f *Filter) ShouldExclude(ipStr string) bool` | Check if IP should be excluded |
| `Count` | `(f *Filter) Count() (ips int, networks int)` | Return number of loaded entries |
Accepts: single IPs (`192.168.1.1`), CIDR ranges (`10.0.0.0/8`), IPv6 addresses and ranges.
#### Usage Example
```go
import "github.com/antitbone/ja4/ja4common/ipfilter"
filter, err := ipfilter.New([]string{
    "10.0.0.0/8",
    "192.168.1.1",
    "2001:db8::/32",
})
if err != nil {
    // Handle the error returned by New before using the filter
    return err
}

if filter.ShouldExclude("10.0.0.5") {
    // Skip this IP
}

ips, nets := filter.Count() // 1 IP, 2 networks
```
## Using from a New Service
### 1. Add to go.mod
```bash
cd services/my-service
go mod init github.com/antitbone/ja4/my-service
```
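Add the dependency (the existing services pair it with a `replace` directive pointing at the local `shared/go/ja4common` path):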
```
require github.com/antitbone/ja4/ja4common v0.0.0
```
### 2. Add to go.work
In the repository root `go.work`:
```
use (
./services/sentinel
./services/correlator
./services/my-service // ← add
./shared/go/ja4common
)
```
### 3. Import and Use
```go
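package main

import (
    "context"

    "github.com/antitbone/ja4/ja4common/config"
    "github.com/antitbone/ja4/ja4common/logger"
    "github.com/antitbone/ja4/ja4common/shutdown"
)

// MyConfig is a placeholder; declare the fields your service needs.
type MyConfig struct {
    Host string `yaml:"host" env:"HOST"`
}

func main() {
    log := logger.NewWithLevel("myservice", "INFO")

    cfg, err := config.LoadYAML[MyConfig]("config.yml", true)
    if err != nil {
        log.Error("loading config", err)
        return
    }
    if err := config.OverrideFromEnv(&cfg, "MYSERVICE"); err != nil {
        log.Error("applying environment overrides", err)
        return
    }

    ctx, cancel := context.WithCancel(context.Background())
    shutdown.Handle(ctx, cancel, nil, log) // blocks until SIGTERM/SIGINT
}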
```
### 4. Sync Workspace
```bash
go work sync
```


@ -0,0 +1,216 @@
# python-ja4common
`ja4_common` is the shared Python library for the ja4-platform, providing a unified ClickHouse client singleton and configuration settings. It is used by [bot-detector](../services/bot-detector.md) and [dashboard](../services/dashboard.md).
**Package name**: `ja4-common`
**Python version**: ≥ 3.11
**Dependencies**:
- `clickhouse-connect >= 0.8.0`
- `pydantic-settings >= 2.1.0`
## ClickHouseSettings
Pydantic-settings model that reads configuration from environment variables and `.env` files.
### Fields
| Field | Type | Default | Env Variable | Description |
|-------|------|---------|-------------|-------------|
| `CLICKHOUSE_HOST` | str | `"clickhouse"` | `CLICKHOUSE_HOST` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | `CLICKHOUSE_PORT` | ClickHouse HTTP API port |
| `CLICKHOUSE_DB` | str | `"mabase_prod"` | `CLICKHOUSE_DB` | Database name |
| `CLICKHOUSE_USER` | str | `"admin"` | `CLICKHOUSE_USER` | Username for authentication |
| `CLICKHOUSE_PASSWORD` | str | `""` | `CLICKHOUSE_PASSWORD` | Password for authentication |
### Configuration Sources
Settings are loaded in order of precedence:
1. **Environment variables** (highest priority)
2. **`.env` file** in the current working directory
3. **Default values** (lowest priority)
Environment variable names are **case-sensitive** (e.g., `CLICKHOUSE_HOST`, not `clickhouse_host`).
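The precedence described above can be illustrated with a minimal standard-library sketch. It is not how `pydantic-settings` is implemented; `resolve_setting`, `dotenv_values`, and `fake_env` are hypothetical names used only to show the lookup order and the case-sensitive match.

```python
import os

# Defaults copied from the Fields table above (kept as strings for simplicity).
DEFAULTS = {
    "CLICKHOUSE_HOST": "clickhouse",
    "CLICKHOUSE_PORT": "8123",
}


def resolve_setting(name, dotenv, environ=None):
    """Return the value for `name`: environment first, then .env, then default."""
    environ = os.environ if environ is None else environ
    if name in environ:      # 1. environment variable (exact, case-sensitive name)
        return environ[name]
    if name in dotenv:       # 2. value parsed from the .env file
        return dotenv[name]
    return DEFAULTS[name]    # 3. built-in default


# Hypothetical inputs: a parsed .env file and a process environment.
dotenv_values = {"CLICKHOUSE_HOST": "ch.internal"}
fake_env = {"CLICKHOUSE_HOST": "ch.prod", "clickhouse_port": "9000"}

host = resolve_setting("CLICKHOUSE_HOST", dotenv_values, fake_env)
port = resolve_setting("CLICKHOUSE_PORT", dotenv_values, fake_env)
```

In this sketch the lowercase `clickhouse_port` entry is ignored, so `port` falls back to the default, which mirrors the case-sensitivity note above.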
### Usage
```python
from ja4_common.settings import settings
print(settings.CLICKHOUSE_HOST) # "clickhouse" or from env
print(settings.CLICKHOUSE_PORT) # 8123 or from env
```
## ClickHouseClient
Wraps `clickhouse_connect` with auto-reconnection and a clean API.
### Methods
| Method | Signature | Description |
|--------|-----------|-------------|
| `connect` | `connect() -> Client` | Returns the underlying `clickhouse_connect` client, creating or reconnecting as needed |
| `query` | `query(query: str, params: dict = None)` | Execute a SELECT query, returns result set |
| `command` | `command(query: str, params: dict = None)` | Execute a DDL/DML command (CREATE, INSERT, etc.) |
| `insert` | `insert(table: str, data, column_names=None)` | Bulk insert data into a table |
| `close` | `close()` | Close the connection and release resources |
### Auto-Reconnection
The `connect()` method automatically reconnects if the current connection is lost:
```python
def connect(self):
    if self._client is None or not self._ping():
        self._client = clickhouse_connect.get_client(
            host=settings.CLICKHOUSE_HOST,
            port=settings.CLICKHOUSE_PORT,
            database=settings.CLICKHOUSE_DB,
            user=settings.CLICKHOUSE_USER,
            password=settings.CLICKHOUSE_PASSWORD,
            connect_timeout=10,
        )
    return self._client
```
### Usage Example
```python
from datetime import datetime

from ja4_common.clickhouse import get_client

client = get_client()

# SELECT query
result = client.query(
    "SELECT count() FROM http_logs WHERE src_ip = {ip:String}",
    {"ip": "203.0.113.42"},
)
print(result.result_rows)

# INSERT
client.insert(
    "audit_logs",
    [[datetime.now(), "analyst1", "investigate", "ip", "203.0.113.42"]],
    column_names=["timestamp", "user_name", "action", "entity_type", "entity_id"],
)

# Command
client.command("OPTIMIZE TABLE http_logs FINAL")
```
## get_client() Singleton
The `get_client()` function provides a module-level singleton `ClickHouseClient`:
```python
from ja4_common.clickhouse import get_client
# First call creates the client
client1 = get_client()
# Subsequent calls return the same instance
client2 = get_client()
assert client1 is client2
```
### Implementation
```python
from typing import Optional

_client: Optional[ClickHouseClient] = None


def get_client() -> ClickHouseClient:
    global _client
    if _client is None:
        _client = ClickHouseClient()
    return _client
```
## Using from a New Service
### 1. Add Dependency
In your service's `requirements.txt`:
```
ja4-common @ file:///app/shared/python/ja4_common
```
Or in `pyproject.toml`:
```toml
[project]
dependencies = [
"ja4-common",
]
```
### 2. Docker Setup
```dockerfile
# Copy shared library
COPY shared/python/ja4_common /app/shared/python/ja4_common
RUN pip install /app/shared/python/ja4_common
# Copy service code
COPY services/my-service /app/services/my-service
```
### 3. Use in Code
```python
from ja4_common.clickhouse import get_client
from ja4_common.settings import settings
# Access settings
print(f"Connecting to {settings.CLICKHOUSE_HOST}:{settings.CLICKHOUSE_PORT}")
# Use client
db = get_client()
result = db.query("SELECT count() FROM ml_detected_anomalies")
```
### 4. Environment Configuration
Create a `.env` file or set environment variables:
```bash
CLICKHOUSE_HOST=clickhouse.example.com
CLICKHOUSE_PORT=8123
CLICKHOUSE_DB=mabase_prod
CLICKHOUSE_USER=data_writer
CLICKHOUSE_PASSWORD=secret
```
## Testing: Mocking the Client
### Using unittest.mock
```python
from unittest.mock import MagicMock, patch

from ja4_common.clickhouse import ClickHouseClient


def test_my_service():
    mock_client = MagicMock(spec=ClickHouseClient)
    mock_client.query.return_value = MagicMock(result_rows=[(42,)])

    with patch("ja4_common.clickhouse._client", mock_client):
        from ja4_common.clickhouse import get_client

        client = get_client()
        result = client.query("SELECT count() FROM http_logs")
        assert result.result_rows == [(42,)]
```
### Overriding Settings in Tests
```python
from ja4_common.settings import ClickHouseSettings
# Create custom settings for tests
test_settings = ClickHouseSettings(
    CLICKHOUSE_HOST="localhost",
    CLICKHOUSE_PORT=8123,
    CLICKHOUSE_DB="test_db",
    CLICKHOUSE_USER="test_user",
    CLICKHOUSE_PASSWORD="test_pass",
)
```
## Source Files
| File | Description |
|------|-------------|
| `ja4_common/settings.py` | `ClickHouseSettings` pydantic-settings model |
| `ja4_common/clickhouse.py` | `ClickHouseClient` class and `get_client()` singleton |
| `pyproject.toml` | Package metadata and dependencies |

7
go.work Normal file

@ -0,0 +1,7 @@
go 1.24.6
use (
./services/sentinel
./services/correlator
./shared/go/ja4common
)

2
go.work.sum Normal file

@ -0,0 +1,2 @@
github.com/ClickHouse/clickhouse-go v1.5.4 h1:cKjXeYLNWVJIx2J1K6H2CqyRmfwVJVY1OV1coaaFcI0=
github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=


@ -0,0 +1,10 @@
# bot-detector configuration — DO NOT COMMIT real values
CLICKHOUSE_HOST=clickhouse
CLICKHOUSE_PORT=8123
CLICKHOUSE_DB=mabase_prod
CLICKHOUSE_USER=admin
CLICKHOUSE_PASSWORD=
ANOMALY_THRESHOLD=-0.1
DEDUP_TTL_MIN=60
HEALTH_PORT=8080
MIN_VALID_FEATURE_RATIO=0.5

2
services/bot-detector/.gitignore vendored Normal file

@ -0,0 +1,2 @@
bot_detector_models/
bot_detector_logs/


@ -0,0 +1,204 @@
# Diagnostic: missing features in `view_ai_features_1h`
> Generated on 2026-03-17, updated on 2026-03-17 (fixes applied). Intended for the ClickHouse administrator.
## ✅ Fix status (2026-03-17 13:05)
| Problem | Fix applied | Result |
|----------|---------------------|----------|
| **1**: MV `mv_agg_header_fingerprint_1h` missing | MV created + 25 h backfill | ✅ 10 header features active |
| **2**: `header_order_shared_count` / `distinct_header_orders` are global values | Resolved by the fix for Problem 1 | ✅ Resolved automatically |
| **3**: `orphan_ratio` = 0 for `correlated=1` | Normal behavior (by design) | No action required |
| **4**: 4 dashboard views missing | Views created | ✅ |
| **5**: `view_dashboard_variability` references the nonexistent column `header_user_agent` | Column replaced with `reason` | ✅ Bug fixed |
| **6**: Orphaned legacy heuristic views | Dropped | ✅ |
Post-fix cycle (13:05), features still listed in the warnings:
- `Complet`: only `orphan_ratio` (by design)
- `Applicatif`: `request_size_variance`, `mss_mobile_mismatch`, `is_rare_ja4` (see the "Problem 4" section below)
- Header features **no longer appear in the warnings** → pipeline operational ✅
---
## Summary
The Bot Detector service reports **non-discriminating features** on every cycle. This document explains the exact causes and the fixes required on the ClickHouse side.
These warnings **do not prevent the service from running**: invalid features are automatically excluded from the model (A7). Their absence does, however, reduce detection quality.
---
## Problem 1: `agg_header_fingerprint_1h` pipeline stopped ⚠️ CRITICAL
### Symptom
The following features are always **0** in `view_ai_features_1h`:
- `header_count`
- `has_accept_language`
- `has_cookie`
- `has_referer`
- `modern_browser_score`
- `ua_ch_mismatch`
- `mss_mobile_mismatch` *(depends on `modern_browser_score`)*
### Cause
The table `mabase_prod.agg_header_fingerprint_1h` (AggregatingMergeTree) has received no data since **2026-03-13 23:00**:
```sql
SELECT max(window_start), count()
FROM mabase_prod.agg_header_fingerprint_1h;
-- Result: 2026-03-13 23:00:00, 73024 rows
```
The view performs a `LEFT JOIN` with the condition `window_start >= now() - INTERVAL 24 HOUR`. Because `agg_header_fingerprint_1h` contains no recent rows, **every column coming from this JOIN returns NULL** (→ 0 after coalesce).
### Searching for the source MV
The list of materialized views shows no MV dedicated to `agg_header_fingerprint_1h`:
```sql
SELECT name FROM system.tables
WHERE database = 'mabase_prod' AND engine = 'MaterializedView';
-- mv_agg_host_ip_ja4_1h
-- mv_http_logs
-- view_dashboard_entities_mv
-- view_dashboard_user_agents_mv
```
No MV targets `agg_header_fingerprint_1h`. At diagnosis time, the working hypothesis was that the table is fed by an **external process** (ETL, script, Kafka pipeline, etc.) that had stopped.
### Fix applied ✅
The MV `mv_agg_header_fingerprint_1h` was **defined in `deploy_views.sql`** but had never been created in the database. It was created on 2026-03-17:
```sql
-- MV creation (already applied)
CREATE MATERIALIZED VIEW mabase_prod.mv_agg_header_fingerprint_1h
TO mabase_prod.agg_header_fingerprint_1h AS
SELECT
toStartOfHour(src.time) AS window_start,
toIPv6(src.src_ip) AS src_ip,
any(toString(cityHash64(src.client_headers))) AS header_order_hash,
max(toUInt16(length(src.client_headers) - length(replaceAll(src.client_headers, ',', '')) + 1)) AS header_count,
-- ... (see deploy_views.sql §5)
FROM mabase_prod.http_logs AS src
GROUP BY window_start, src.src_ip;
```
A **25-hour backfill** from `http_logs` was run to populate the table with historical data (377,689 rows inserted). New data is now fed in real time by the MV.
### Historical cause
The MV had been omitted during the initial deployment. The table `agg_header_fingerprint_1h` held 73,024 rows dated 2026-03-13 (probably from a one-off manual backfill) and had not been fed since.
---
## Problem 2: non-discriminating features (global aggregate instead of per-IP)
### Symptom
The following features have a **single non-zero value that is identical for every IP**:
- `header_order_shared_count` (value ≈ 421,000 on every row)
- `distinct_header_orders` (identical value on every row)
### Cause
These features are computed with window functions over `header_order_hash`:
```sql
-- In the view:
count() OVER (PARTITION BY h.header_order_hash) AS header_order_shared_count
uniqExact(h.header_order_hash) OVER (PARTITION BY a.src_ip) AS distinct_header_orders
```
Because `h.header_order_hash` is **NULL on every row** (Problem 1 above), `PARTITION BY NULL` groups **all rows into a single partition** → `count()` returns the total row count for every IP.
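The collapse into a single partition can be reproduced outside ClickHouse. The sketch below emulates `count() OVER (PARTITION BY header_order_hash)` in plain Python; the rows, the hash values `h1`/`h2`, and the helper `shared_count` are made up for illustration, and `None` stands in for SQL NULL.

```python
from collections import Counter

# Hypothetical rows: (src_ip, header_order_hash). None plays the role of NULL.
rows_broken = [("10.0.0.1", None), ("10.0.0.2", None), ("10.0.0.3", None)]
rows_fixed = [("10.0.0.1", "h1"), ("10.0.0.2", "h1"), ("10.0.0.3", "h2")]


def shared_count(rows):
    """Emulate count() OVER (PARTITION BY header_order_hash) for each row."""
    partition_sizes = Counter(h for _, h in rows)
    return {ip: partition_sizes[h] for ip, h in rows}


broken = shared_count(rows_broken)  # every IP sees the same total
fixed = shared_count(rows_fixed)    # counts now differ per hash
```

With every hash set to NULL, each IP receives the size of the whole table, which is why the feature carries no information for the model.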
### Fix ✅ (resolved together with Problem 1)
This problem resolved itself once the MV `mv_agg_header_fingerprint_1h` was created. `header_order_hash` is now non-NULL, so the window-function partitions are computed correctly per header-order hash.
---
## Problem 3: `orphan_ratio` always zero for TCP-correlated traffic
### Symptom
`orphan_ratio` = 0 for **every row with `correlated = 1`** (TCP-enriched traffic).
### Cause
The `orphan_count` column in `mabase_prod.agg_host_ip_ja4_1h` is computed by the MV `mv_agg_host_ip_ja4_1h`:
sum(IF(src.orphan_side = 'A' OR src.correlated = 0, 1, 0)) AS orphan_count
```
For `correlated=1` connections, `correlated = 0` is always false and `orphan_side = 'A'` is never true → `orphan_count = 0` in every case.
**This is intentional**: correlated TCP connections have a confirmed response, so by definition they are not orphan requests.
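The SQL expression above can be checked row by row with a small Python mirror. The sample rows are hypothetical, and the empty string used as `orphan_side` for correlated traffic is an assumption made only for this sketch.

```python
def orphan_increment(orphan_side, correlated):
    """Mirror IF(orphan_side = 'A' OR correlated = 0, 1, 0) for one row."""
    return 1 if (orphan_side == "A" or correlated == 0) else 0


# Hypothetical (orphan_side, correlated) rows.
correlated_rows = [("", 1), ("", 1), ("", 1)]
uncorrelated_rows = [("A", 0), ("", 0)]

orphan_count_correlated = sum(orphan_increment(s, c) for s, c in correlated_rows)
orphan_count_uncorrelated = sum(orphan_increment(s, c) for s, c in uncorrelated_rows)
```

Every correlated row contributes 0, so the sum (and therefore `orphan_ratio`) stays at 0 for that traffic class.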
### Status
No action required. The feature remains automatically excluded by A7 for the `Complet` model (correlated=1).
---
## Problem 4: persistent zero-valued features in the Applicatif model
### Symptom (post-fix)
Since 2026-03-17 13:05, the `Applicatif` model (uncorrelated traffic) still reports these features as 0:
- `request_size_variance`
- `mss_mobile_mismatch`
- `is_rare_ja4`
### Cause
These features are computed from L4/TCP columns that are **absent or irrelevant for purely application-level traffic** (`correlated=0`):
| Feature | Cause |
|---------|-------|
| `request_size_variance` | `varPopMerge(total_ip_length_var)`: IP length variance; uncorrelated traffic has no reliable raw IP data |
| `mss_mobile_mismatch` | Depends on `tcp_meta_mss` and `modern_browser_score`; MSS is unreliable without TCP correlation |
| `is_rare_ja4` | `sum(hits) OVER (PARTITION BY ja4) < 100`: in the Applicatif window (1 h, reduced traffic), every JA4 is rare |
### Impact
Low: these features are excluded automatically (A7) and do not degrade the model.
---
## Impact on the ML model
| Feature | Impact if missing | Status |
|---------|-------------------|--------|
| `header_count` | Loss of a strong signal: bots often send few headers | ✅ Fixed |
| `has_accept_language` | Loss of detection for bots without locale settings | ✅ Fixed |
| `has_cookie` | Loss of detection for stateless sessions | ✅ Fixed |
| `has_referer` | Loss of the direct-navigation signal | ✅ Fixed |
| `modern_browser_score` | Loss of the composite browser-conformance score | ✅ Fixed |
| `ua_ch_mismatch` | Loss of detection for spoofed UA declarations | ✅ Fixed |
| `header_order_shared_count` | Loss of detection for shared header fingerprints | ✅ Fixed |
| `orphan_ratio` | Weak signal for correlated traffic | By design |
| `request_size_variance` | Weak L4 signal for Applicatif | Normal |
| `mss_mobile_mismatch` | Weak TCP signal for Applicatif | Normal |
---
## Post-fix verification
Cycle of 2026-03-17 13:05, observed result:
```
[Complet] Features à 0 : ['orphan_ratio'] ← by design ✅
[Applicatif] Features à 0 : ['request_size_variance', 'mss_mobile_mismatch', 'is_rare_ja4'] ← expected ✅
[Applicatif] Features non-discriminantes : ['tcp_shared_count'] ← residual global aggregate
```
The **10 header features** (`header_count`, `has_accept_language`, `has_cookie`, `has_referer`, `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`, `distinct_header_orders`, `header_order_confidence`, and `mss_mobile_mismatch` for Complet) **no longer appear in the warnings**. The pipeline is operational.


@ -0,0 +1,710 @@
# Bot Detector IA: Technical Documentation
> Code version: v11 | Last updated: 2026-03-17
---
## Table of contents
1. [Overview](#1-overview)
2. [System architecture](#2-system-architecture)
3. [Detection pipeline](#3-detection-pipeline)
4. [Models and features](#4-models-and-features)
5. [Semi-supervised approach](#5-semi-supervised-approach)
6. [Model management](#6-model-management)
7. [Input data: ClickHouse view](#7-input-data-clickhouse-view)
8. [Output data](#8-output-data)
9. [Configuration](#9-configuration)
10. [Observability](#10-observability)
11. [Reputation and enrichment](#11-reputation-and-enrichment)
12. [Scientific foundations](#12-scientific-foundations)
13. [Improvements implemented (v11)](#13-improvements-implemented-v11)
14. [ClickHouse schema migration](#14-clickhouse-schema-migration)
---
## 1. Overview
**Bot Detector IA** is a service that detects suspicious activity and bots in HTTP traffic. It runs in a continuous loop (every 5 minutes by default) and analyzes aggregated data from ClickHouse.
### General principle
```
ClickHouse (view_ai_features_1h)
┌───────────────────────┐
│  Traffic separation   │
│  ├─ Known bots        │ → Labeled via IP / JA4 / ASN reputation
│  ├─ Human traffic     │ → Training baseline for the IF
│  └─ Unknown traffic   │ → Scored by Isolation Forest
└───────────────────────┘
┌───────────────────────┐
│  Isolation Forest     │
│  (semi-supervised)    │
│  ├─ Complet model     │ TCP + TLS + HTTP (35 features, correlated=1)
│  └─ Applicatif model  │ HTTP only (31 features, correlated=0)
└───────────────────────┘
ClickHouse (ml_detected_anomalies)
```
### Key characteristics
| Property | Value |
|-----------|--------|
| Algorithm | Isolation Forest (sklearn) |
| Supervision | Semi-supervised (human baseline + reputation) |
| Analysis window | 1-hour sliding window (optional: 24 h with `ENABLE_MULTIWINDOW`) |
| Execution cycle | 300 s (configurable) |
| Retraining | Every 1 h (configurable) + forced retrain on concept drift |
| Contamination | 2 % (expected fraction of anomalies in the baseline) |
| Anomaly threshold | Adaptive: min(percentile_5, -0.03) |
---
## 2. System architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Docker Compose │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ bot_detector_ai │ │
│ │ │ │
│ │ ┌────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │
│ │ │ Health │ │ Main Loop │ │ ClickHouse │ │ │
│ │ │ :8080 │ │ (300s cycle)│ │ Client │ │ │
│ │ │ (thread) │ │ │ │ (reconnect) │ │ │
│ │ └────────────┘ └──────────────┘ └─────────────────┘ │ │
│ │ │ │
│ │ Volumes: │ │
│ │ ├─ ./bot_detector_logs → /var/log/bot_detector │ │
│ │ ├─ ./bot_detector_models → /var/lib/bot_detector │ │
│ │ ├─ ./reputation/data/user_files/bot_ip.csv (ro) │ │
│ │ ├─ ./reputation/data/user_files/bot_ja4.csv (ro) │ │
│ │ └─ ./reputation/data/user_files/asn_reputation.csv (ro) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────┬────────────────────────────────────┘
│ HTTP :8123
External ClickHouse
(test-sdv-anubis.sdv.fr)
```
### Files and directories
| Path | Role |
|--------|------|
| `bot_detector/bot_detector.py` | Main source code |
| `bot_detector/requirements.txt` | Python dependencies |
| `bot_detector/Dockerfile` | Python 3.11-slim image |
| `docker-compose.yml` | Docker orchestration |
| `.env` | Environment variables (not committed) |
| `bot_detector_logs/decisions.jsonl` | Structured JSONL log (rotation 50 MB × 7) |
| `bot_detector_models/model_<name>_<version>.joblib` | Serialized model |
| `bot_detector_models/model_<name>_<version>.meta.json` | Model metadata |
| `bot_detector_models/model_<name>.current` | Pointer to the active version |
| `bot_detector_models/training_history.jsonl` | Training history |
| `reputation/bot_ip.csv` | ~288 k IP/CIDR entries for known bots |
| `reputation/bot_ja4.csv` | JA4 fingerprints of bots |
| `reputation/asn_reputation.csv` | ASN labels (human / bot) |
---
## 3. Detection pipeline
### 3.1 Main cycle (`fetch_and_analyze`)
```
1. Generate a cycle_id (timestamp)
2. Query view_ai_features_1h → DataFrame df
3. Query view_ip_recurrence → recurrence_map {src_ip: count}
4. Clean columns (fillna, astype)
5. Log CYCLE_START (total, human, known_bot, correlated)
6. Split df → correlated=1 / correlated=0
7. Call run_semi_supervised_logic() × 2 (Complet + Applicatif models)
8. Concatenate, deduplicate by src_ip (keep the lowest score)
9. Insert into ml_detected_anomalies
10. Log CYCLE_END
11. Wait CYCLE_INTERVAL seconds
```
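Step 8 of the cycle (one row per `src_ip`, keeping the lowest score) could look roughly like the sketch below. It uses plain dictionaries instead of the service's DataFrame; the `model` key and the sample values are assumptions made for illustration.

```python
def dedupe_lowest_score(records):
    """Keep one record per src_ip: the one with the lowest (most anomalous) score."""
    best = {}
    for rec in records:
        ip = rec["src_ip"]
        if ip not in best or rec["anomaly_score"] < best[ip]["anomaly_score"]:
            best[ip] = rec
    return list(best.values())


# Hypothetical output of the two models for the same cycle.
records = [
    {"src_ip": "203.0.113.42", "model": "Complet", "anomaly_score": -0.12},
    {"src_ip": "203.0.113.42", "model": "Applicatif", "anomaly_score": -0.31},
    {"src_ip": "198.51.100.7", "model": "Applicatif", "anomaly_score": -0.08},
]

deduped = dedupe_lowest_score(records)
by_ip = {rec["src_ip"]: rec for rec in deduped}
```

Keeping the lowest score means that an IP flagged by both models is reported once, with its most severe result.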
### 3.2 Semi-supervised logic (`run_semi_supervised_logic`)
```
df (traffic in the 1 h window)
├─ A7 → validate_features(): exclude missing or constant features
├─ bot_name != '' → known_bots → KNOWN_BOT (log + insert)
└─ bot_name == '' → unknown_traffic
   ├─ asn_label == 'human' → human_baseline
   │    (min. 500 sessions required)
   │    └──► load_or_train_model()
   │         ├─ A1: drift check (z-score / features)
   │         └─ If drift ≥ DRIFT_THRESHOLD: forced retrain
   └─ remaining unknown traffic
        IsolationForest.decision_function() → raw_scores
        A10: normalize_scores() → anomaly_score [-1, 0]
        A2: effective_threshold = min(percentile_5, ANOMALY_THRESHOLD)
        A6: raw_score -= log1p(recurrence) × RECURRENCE_WEIGHT
        raw_score < effective_threshold ?
          YES → A4: SHAP top-5 features → reason
                A8: DBSCAN clustering → campaign_id
                ANOMALY (log + insert)
          NO  → ignored
```
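Steps A2 and A6 are simple arithmetic and can be sketched with the standard library. This is an illustration under assumptions, not the service's code: `RECURRENCE_WEIGHT = 0.01` is a made-up value, the `percentile` helper is hypothetical, and the scores are invented.

```python
import math

ANOMALY_THRESHOLD = -0.03  # value quoted in this document
RECURRENCE_WEIGHT = 0.01   # assumed value, for illustration only


def percentile(values, pct):
    """Linear-interpolation percentile (same convention as numpy's default)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def effective_threshold(scores):
    """A2: the lower of the 5th percentile and ANOMALY_THRESHOLD."""
    return min(percentile(scores, 5), ANOMALY_THRESHOLD)


def penalize(score, recurrence):
    """A6: push the score down for IPs already seen in earlier cycles."""
    return score - math.log1p(recurrence) * RECURRENCE_WEIGHT


# Invented scores for one cycle.
scores = [-0.20, -0.04, 0.01, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22]
threshold = effective_threshold(scores)
adjusted = penalize(-0.04, recurrence=3)
is_anomaly = adjusted < threshold
```

Because the threshold is a `min`, it can only move below `ANOMALY_THRESHOLD`, never above it; the recurrence penalty grows logarithmically, so repeated sightings matter less and less.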
### 3.3 Threat levels
| Score | Level | Interpretation |
|-------|--------|----------------|
| `< -0.30` | **CRITICAL** | Extremely abnormal behavior |
| `< -0.15` | **HIGH** | Strong anomaly signal |
| `< -0.05` | **MEDIUM** | Moderate anomaly |
| `≥ -0.05` | **LOW** | Slightly unusual |
> The insertion threshold (`ANOMALY_THRESHOLD = -0.03`) is more permissive than the LOW boundary. Every IP whose score falls below this threshold is inserted, whatever its threat level.
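The table translates directly into a chain of comparisons. The function name `threat_level` is hypothetical; the boundaries are the ones listed above.

```python
def threat_level(score):
    """Map an anomaly score to the threat level from the table above."""
    if score < -0.30:
        return "CRITICAL"
    if score < -0.15:
        return "HIGH"
    if score < -0.05:
        return "MEDIUM"
    return "LOW"


levels = [threat_level(s) for s in (-0.45, -0.20, -0.10, -0.04)]
```

A score of -0.04 is below the insertion threshold of -0.03, so it is stored, but it is still classified as LOW.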
---
## 4. Models and features
### 4.1 Two-tier architecture
| Model | Condition | Feature count | Data used |
|--------|-----------|-------------|-------------------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS |
| **Applicatif** | `correlated = 0` | 31 | HTTP only |
The `correlated` flag indicates whether the HTTP logs could be enriched with TCP/TLS data from the same connection. Without correlation (incomplete capture, or encrypted traffic without inspection), only the Applicatif model is used.
### 4.2 Common features (31, Applicatif model)
#### Basic HTTP behavior
| Feature | Description |
|---------|-------------|
| `hits` | Number of requests in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Score for abnormal diversity of paths/parameters |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Distinct source ports as a fraction of total ports |
| `orphan_ratio` | Requests with no associated response |
#### Connection handling
| Feature | Description |
|---------|-------------|
| `max_keepalives` | Maximum number of requests on a single keep-alive connection |
| `tcp_shared_count` | TCP connections shared by several HTTP sessions |
#### Browser fingerprint
| Feature | Description |
|---------|-------------|
| `header_count` | Number of HTTP headers sent |
| `has_accept_language` | Accept-Language present |
| `has_cookie` | Cookie present |
| `has_referer` | Referer present |
| `modern_browser_score` | Composite score of modern-browser conformance |
| `ua_ch_mismatch` | Inconsistency between User-Agent and Client Hints |
| `ip_id_zero_ratio` | Ratio of IP packets with ID=0 (headless / minimal stack) |
| `header_order_shared_count` | Sharing of the same header order across IPs |
| `header_order_confidence` | Confidence in the header order (normalized entropy) |
| `distinct_header_orders` | Number of distinct header orders observed |
#### Navigation patterns
| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request size |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |
| `mss_mobile_mismatch` | Inconsistency between TCP MSS and the advertised mobile profile |
| `asset_ratio` | Fraction of requests for static resources |
| `direct_access_ratio` | Fraction of direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (0/1 flag) |
| `distinct_ja4_count` | Number of distinct JA4 fingerprints per IP |
#### Concentration and rarity
| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port density (entropy) |
| `ja4_asn_concentration` | Concentration of a given JA4 within an ASN |
| `ja4_country_concentration` | Concentration of a given JA4 per country |
| `is_rare_ja4` | JA4 uncommon in the population (0/1 flag) |
#### Temporal and diversity dimensions (academic)
| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Entropy of the temporal distribution of requests |
| `path_diversity_ratio` | Diversity of the URL paths accessed |
| `url_depth_variance` | Variance of URL depth |
| `anomalous_payload_ratio` | Fraction of payloads with abnormal patterns |
| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | Variance de la gigue inter-paquets TCP |
| `alpn_http_mismatch` | Incohérence entre ALPN négocié et protocole HTTP effectif |
| `is_alpn_missing` | ALPN absent dans le TLS ClientHello |
| `sni_host_mismatch` | Incohérence entre SNI TLS et Host HTTP |
---
## 5. Semi-supervised approach
### 5.1 Theoretical basis
**Isolation Forest** (Liu, Ting & Zhou, 2008) is an unsupervised learning algorithm designed for anomaly detection. Its principle: anomalies, being rare and different, are **isolated in fewer partitions** of a random decision tree than normal points.
The decision score (`decision_function`) lies between -1 (highly abnormal) and +1 (highly normal); negative values mark outliers and positive values mark inliers. The `contamination` parameter sets the fraction of points treated as anomalies in the training set.
### 5.2 Semi-supervised dimension
The approach is **semi-supervised** because:
1. **Partial labeling**: known bots (via IP/JA4 reputation) and humans (via ASN reputation) are identified *a priori*.
2. **Training on the normal class only**: the IF is trained **exclusively on the human baseline** (`asn_label = 'human'`, `bot_name = ''`), so it learns the profile of legitimate traffic.
3. **Detection by deviation**: any unknown traffic that departs from the human profile receives a negative score.
This approach follows the **One-Class Classification** paradigm (Tax & Duin, 2004) applied to bot detection, and is close to the work of Kruegel & Vigna (2003) on network anomaly detection.
### 5.3 Quality of the human baseline
The 500-session minimum for the human baseline is an empirical safeguard. Below that threshold, the IF does not have enough examples to define a robust normal profile, which increases the risk of false positives.
In practice, observed cycles show between **1,264** and **1,725** human sessions per one-hour window.
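The labeling rules and the 500-session guard can be sketched as follows. The helpers `split_traffic` and `can_train` are hypothetical names, the rows are plain dictionaries rather than the service's DataFrame, and what the service does when the baseline is too small is not covered here.

```python
MIN_HUMAN_SESSIONS = 500  # minimum baseline size stated above


def split_traffic(rows):
    """Split rows into known bots, the human baseline, and traffic to score."""
    known_bots = [r for r in rows if r["bot_name"] != ""]
    unknown = [r for r in rows if r["bot_name"] == ""]
    human_baseline = [r for r in unknown if r["asn_label"] == "human"]
    to_score = [r for r in unknown if r["asn_label"] != "human"]
    return known_bots, human_baseline, to_score


def can_train(human_baseline):
    """True when the baseline is large enough to train the Isolation Forest."""
    return len(human_baseline) >= MIN_HUMAN_SESSIONS


# Hypothetical rows for one window.
rows = (
    [{"src_ip": f"198.51.100.{i}", "bot_name": "", "asn_label": "human"} for i in range(3)]
    + [{"src_ip": "203.0.113.1", "bot_name": "examplebot", "asn_label": "bot"}]
    + [{"src_ip": "203.0.113.2", "bot_name": "", "asn_label": "unknown"}]
)
known_bots, human_baseline, to_score = split_traffic(rows)
```

Only `human_baseline` feeds training; `to_score` is what the model evaluates afterwards.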
---
## 6. Model management
### 6.1 Model lifecycle
```
Cycle start
     │
     ▼
.current exists? ──NO──► Train new model
     │
    YES
     │
     ▼
Age < RETRAIN_INTERVAL_H ?
     │            │
    YES           NO
     │            │
     ▼            └──► Train new model
A1: Drift check        (MODEL_TRAINED)
(z-score vs baseline_stats)
     │
     ▼
Drift ≥ DRIFT_THRESHOLD ?
     │            │
     NO          YES
     │            │
     ▼            ▼
Load model     Train new model
(MODEL_LOADED) (DRIFT_DETECTED + MODEL_TRAINED)
```
### 6.2 Model versioning
Each model is identified by a `version_id` in the format `YYYYMMDD_HHMMSS`. The associated files are:
- `model_{name}_{version_id}.joblib`: serialized model (joblib/pickle)
- `model_{name}_{version_id}.meta.json`: metadata (features, contamination, number of samples, etc.)
- `model_{name}.current`: atomic pointer to the active version
The history is limited to `MODEL_HISTORY_COUNT` versions (72 in production = 3 days at a 1 h retrain interval).
The `.meta.json` file now contains a `baseline_stats` field with the distribution statistics (mean, std, p25, p75) of each feature, used for drift detection (A1).
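The atomic `.current` pointer can be illustrated with a temporary file followed by a rename. This is a minimal sketch, not the service's actual code: the helper names `set_current_version` and `read_current_version` are hypothetical, and the pointer is assumed to hold the `version_id` as plain text.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def set_current_version(model_dir: Path, name: str, version_id: str) -> Path:
    """Atomically point model_{name}.current at the given version_id (hypothetical helper)."""
    pointer = model_dir / f"model_{name}.current"
    # Write to a temporary file in the same directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=model_dir, prefix=f".{name}.", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(version_id + "\n")
    # os.replace swaps the pointer in one step: a reader sees either the old
    # or the new version_id, never a half-written file.
    os.replace(tmp_path, pointer)
    return pointer


def read_current_version(model_dir: Path, name: str) -> Optional[str]:
    """Return the active version_id, or None when no pointer exists yet."""
    pointer = model_dir / f"model_{name}.current"
    if not pointer.exists():
        return None
    return pointer.read_text(encoding="utf-8").strip()
```

Writing the temporary file next to the pointer keeps the rename on a single filesystem, which is what makes the swap atomic on POSIX systems.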
### 6.3 Isolation Forest parameters
```python
IsolationForest(
    n_estimators=300,    # Number of trees (accuracy/time trade-off)
    contamination=0.02,  # 2% of anomalies assumed in the baseline
    random_state=42,     # Reproducibility
    n_jobs=-1            # Parallelize across all cores
)
```
---
## 7. Input data: ClickHouse views
### 7.1 Main view: `view_ai_features_1h`
Aggregation over a sliding 1-hour window, one row per `src_ip`. Key columns:
| Column | Type | Source |
|---------|------|--------|
| `src_ip` | String | TCP/IP |
| `ja4` | String | TLS fingerprint (JA4+) |
| `host` | String | HTTP Host header |
| `bot_name` | String | IP/JA4 reputation (empty if unknown) |
| `asn_number` | String | GeoIP/ASN lookup |
| `asn_org` | String | ASN organization |
| `asn_domain` | String | ASN domain |
| `country_code` | String | Source country |
| `asn_label` | String | `human` / `bot` / `unknown` |
| `correlated` | Int | 1 if TCP/TLS data is available, 0 otherwise |
| `hits` | Float | Number of requests |
| `hit_velocity` | Float | Req/s |
| *…(26+ features)* | Float | See section 4.2 |
### 7.2 Recurrence view: `view_ip_recurrence`
```sql
SELECT src_ip, recurrence FROM {DB}.view_ip_recurrence
```
Returns the number of times an IP has already been detected as a threat in the history. It populates the `recurrence` field in the output.
---
## 8. Output data
### 8.1 ClickHouse table: `ml_detected_anomalies`
All detected anomalies and known bots are inserted into this table. Notable columns:
| Column | Description |
|---------|-------------|
| `detected_at` | Detection timestamp |
| `src_ip` | Source IP |
| `ja4` | TLS/JA4 fingerprint (`HTTP_CLEAR_TEXT` if absent) |
| `host` | Targeted vhost |
| `bot_name` | Bot name (empty for an IF anomaly) |
| `anomaly_score` | IF score (0.0 for known bots) |
| `threat_level` | `CRITICAL` / `HIGH` / `MEDIUM` / `LOW` / `KNOWN_BOT` |
| `model_name` | `Complet` or `Applicatif` |
| `recurrence` | Number of historical occurrences + 1 |
| `reason` | Textual description of the anomaly |
| `is_headless` | Derived from `is_fake_navigation` |
| *…(all features)* | For post-mortem analysis |
### 8.2 JSONL journal: `decisions.jsonl`
Structured events in JSON Lines format, with rotation (50 MB × 7 files).
| Event | Trigger |
|-----------|-------------|
| `SERVICE_START` | Container start |
| `SERVICE_STOP` | Clean shutdown (SIGTERM/SIGINT) |
| `CYCLE_START` | Start of an analysis cycle |
| `CYCLE_END` | End of the cycle (summary of inserted rows) |
| `MODEL_LOADED` | Reuse of an existing model |
| `MODEL_TRAINED` | New training run |
| `KNOWN_BOT` | Known bot identified |
| `ANOMALY` | IF anomaly detected |
| `SKIPPED_LOW_DATA` | Cycle skipped (baseline < 500) |
| `CONSECUTIVE_FAILURES` | Repeated ClickHouse error |
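A rotating JSON Lines journal of this kind can be approximated with the standard library's `RotatingFileHandler`. The sketch below is illustrative only: the helper names, the logger name and the `ts` field are assumptions, while the `event` key and the 50 MB × 7 rotation come from this document.

```python
import json
import logging
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler


def build_decision_logger(path: str, max_bytes: int = 50 * 1024 * 1024,
                          backup_count: int = 7) -> logging.Logger:
    """Return a logger that writes one raw JSON document per line, with rotation."""
    logger = logging.getLogger("decisions_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep JSONL events out of the console logs
    if not logger.handlers:
        handler = RotatingFileHandler(path, maxBytes=max_bytes,
                                      backupCount=backup_count, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def log_event(logger: logging.Logger, event: str, **fields) -> None:
    """Serialize one event; the 'ts' field name is an assumption."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    logger.info(json.dumps(record, ensure_ascii=False))
```

A call such as `log_event(logger, "KNOWN_BOT", bot_name="examplebot")` then appends a single line that `jq` can filter on `.event`.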
---
## 9. Configuration
All values are passed through environment variables (`.env` file).
| Variable | Default | Description |
|----------|--------|-------------|
| `CLICKHOUSE_HOST` | `clickhouse` | ClickHouse host |
| `CLICKHOUSE_DB` | `mabase_prod` | Database |
| `CLICKHOUSE_USER` | `default` | User |
| `CLICKHOUSE_PASSWORD` | *(empty)* | Password |
| `ISOLATION_CONTAMINATION` | `0.001` | Expected fraction of anomalies (0 < x < 0.5) |
| `ANOMALY_THRESHOLD` | `-0.05` | Static score threshold for insertion |
| `CYCLE_INTERVAL_SEC` | `300` | Delay between cycles (seconds) |
| `MAX_CONSECUTIVE_FAILURES` | `3` | Failures before switching to DEGRADED |
| `BOT_DETECTOR_LOG` | `/var/log/bot_detector/decisions.jsonl` | Log file |
| `LOG_BACKUP_COUNT` | `7` | Number of rotated files kept |
| `MODEL_DIR` | `/var/lib/bot_detector` | Model directory |
| `RETRAIN_INTERVAL_HOURS` | `24` | Retraining frequency |
| `MODEL_HISTORY_COUNT` | `10` | Number of model versions kept |
| `HEALTH_PORT` | `8080` | HTTP health check port |
| **A1** `DRIFT_THRESHOLD` | `0.30` | Fraction of drifting features that triggers a forced retrain |
| **A2** `ANOMALY_PERCENTILE` | `5` | Percentile for the adaptive threshold (0-20) |
| **A3** `ENABLE_MULTIWINDOW` | `false` | Enables analysis over the 24h window |
| **A3** `MULTIWINDOW_VIEW` | `view_ai_features_24h` | Name of the 24h view in ClickHouse |
| **A4** `ENABLE_SHAP` | `true` | Enables SHAP computation (disabled if `shap` is not installed) |
| **A5** `DEDUP_TTL_MIN` | `60` | Inter-cycle deduplication TTL (0 = disabled) |
| **A6** `RECURRENCE_WEIGHT` | `0.005` | Score penalty per log(recurrence) |
| **A7** `MIN_VALID_FEATURE_RATIO` | `0.50` | Minimum ratio of valid features required to proceed |
| **A8** `ENABLE_CLUSTERING` | `true` | Enables DBSCAN clustering of anomalies |
| **A8** `CLUSTERING_MIN_SAMPLES` | `3` | Minimum size of a DBSCAN cluster |
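As an illustration of how a few of these variables might be read with their documented defaults, here is a small sketch. The `DetectorConfig` class and `load_config` function are hypothetical and cover only four variables; the range check mirrors the `0 < x < 0.5` constraint from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DetectorConfig:
    """Hypothetical container for a subset of the documented variables."""
    contamination: float
    anomaly_threshold: float
    cycle_interval_sec: int
    enable_shap: bool


def _env_bool(env: Mapping[str, str], key: str, default: str) -> bool:
    # Accept the usual truthy spellings; anything else counts as false.
    return env.get(key, default).strip().lower() in {"1", "true", "yes", "on"}


def load_config(env: Optional[Mapping[str, str]] = None) -> DetectorConfig:
    """Read the configuration from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    contamination = float(env.get("ISOLATION_CONTAMINATION", "0.001"))
    if not 0.0 < contamination < 0.5:
        raise ValueError("ISOLATION_CONTAMINATION must satisfy 0 < x < 0.5")
    return DetectorConfig(
        contamination=contamination,
        anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.05")),
        cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
        enable_shap=_env_bool(env, "ENABLE_SHAP", "true"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests.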
---
## 10. Observability
### 10.1 Health check
```bash
curl -i http://localhost:8080/
# → 200 OK        service operational
# → 503 DEGRADED  ≥ MAX_CONSECUTIVE_FAILURES consecutive ClickHouse failures
```
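The 200/503 behavior described above could be served by a handler along these lines. This is a sketch under assumptions: the `HealthState` class, `make_handler` and `serve` are hypothetical names, and the response bodies simply reuse the `OK` and `DEGRADED` labels.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


class HealthState:
    """Thread-safe counter of consecutive ClickHouse failures (hypothetical)."""

    def __init__(self, max_failures: int = 3) -> None:
        self._lock = threading.Lock()
        self._failures = 0
        self.max_failures = max_failures

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1

    def status(self) -> Tuple[int, str]:
        with self._lock:
            degraded = self._failures >= self.max_failures
        return (503, "DEGRADED") if degraded else (200, "OK")


def make_handler(state: HealthState):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            code, body = state.status()
            payload = body.encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args) -> None:
            pass  # silence per-request access logs

    return HealthHandler


def serve(state: HealthState, port: int = 8080) -> None:
    """Blocking call; typically run in a daemon thread next to the main loop."""
    HTTPServer(("", port), make_handler(state)).serve_forever()
```

The main loop would call `record_failure()` after each failed ClickHouse query and `record_success()` after a successful one.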
### 10.2 Operational logs
Console logs follow the format `[YYYY-MM-DD HH:MM:SS] message`. The JSONL file supports post-mortem analysis with tools such as `jq`:
```bash
# Show the latest CRITICAL anomalies
jq 'select(.event=="ANOMALY" and .threat_level=="CRITICAL")' decisions.jsonl
# Show the top SHAP features for HIGH anomalies
jq 'select(.event=="ANOMALY" and .threat_level=="HIGH") | .reason' decisions.jsonl
# Find distribution drifts
jq 'select(.event=="DRIFT_DETECTED")' decisions.jsonl
# Show coordinated campaigns (campaign_id >= 0)
jq 'select(.event=="ANOMALY" and .campaign_id >= 0) | {src_ip, campaign_id, threat_level}' decisions.jsonl
# Count known bots by name
jq -r 'select(.event=="KNOWN_BOT") | .bot_name' decisions.jsonl | sort | uniq -c | sort -rn
# Cycle summaries
jq 'select(.event=="CYCLE_END")' decisions.jsonl
```
| Event | Trigger |
|-----------|-------------|
| `SERVICE_START` | Container start |
| `SERVICE_STOP` | Clean shutdown (SIGTERM/SIGINT) |
| `CYCLE_START` | Start of an analysis cycle |
| `CYCLE_END` | End of the cycle (summary of inserted rows + dedup_ttl_min) |
| `MODEL_LOADED` | Reuse of an existing model (+ drift_score) |
| `MODEL_TRAINED` | New training run |
| `DRIFT_DETECTED` | Concept drift detected → forced retrain |
| `FEATURE_WARNING` | Missing / constant / global-aggregate features detected (logged only when the situation changes) |
| `SKIPPED_INVALID_FEATURES` | Cycle skipped (too few valid features) |
| `KNOWN_BOT` | Known bot identified |
| `ANOMALY` | IF anomaly detected (+ effective_threshold, campaign_id, raw_anomaly_score) |
| `SKIPPED_LOW_DATA` | Cycle skipped (baseline < 500) |
| `CONSECUTIVE_FAILURES` | Repeated ClickHouse error |
### 10.3 Feature warnings (A7)
Feature warnings are printed to the console **only once** (on first detection or when the situation changes). Subsequent cycles with the same situation generate no noise. The `FEATURE_WARNING` event remains in the JSONL for traceability.
| Category | Console message | Typical cause |
|-----------|-----------------|---------------|
| `zero` | `Features à 0 (pipeline non-alimenté)` | Empty source table / LEFT JOIN without a match |
| `unique_nonzero` | `Features non-discriminantes (agrégat global)` | `PARTITION BY` on a NULL value → single partition |
| `missing` | `Features absentes du schéma` | Column missing from the ClickHouse view |
See [`CLICKHOUSE_FEATURES_DIAGNOSTIC.md`](CLICKHOUSE_FEATURES_DIAGNOSTIC.md) for details of the required ClickHouse fixes.
### 11.1 Reputation sources
| File | Format | Content |
|---------|--------|---------|
| `bot_ip.csv` | `ip_cidr,bot_name` | ~288k IPs/CIDRs of listed bots |
| `bot_ja4.csv` | `ja4,bot_name` | JA4 fingerprints of bots |
| `asn_reputation.csv` | `asn_number,label` | ASN labels (human/bot) |
These files are mounted read-only in the container. They are written by ClickHouse (FILE engine) and shared through a Docker volume.
### 11.2 Classification hierarchy
```
1. bot_name != '' (from view_ai_features_1h)
→ KNOWN_BOT: bot identified by IP or JA4 reputation
2. asn_label == 'human' (from view_ai_features_1h)
→ Used as the training baseline for the IF
3. Remaining traffic
→ Scored by the Isolation Forest
→ Anomaly if score < ANOMALY_THRESHOLD
```
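The three-step hierarchy above can be written as a small decision function. This is a sketch: the function name and the `HUMAN_BASELINE` and `NORMAL` labels are hypothetical, and the default threshold is the documented `ANOMALY_THRESHOLD` default.

```python
from typing import Mapping, Optional


def classify(row: Mapping[str, str], score: Optional[float],
             threshold: float = -0.05) -> str:
    """Apply the hierarchy: known bot, then human baseline, then IF score."""
    # 1. Reputation match on IP or JA4 wins over everything else.
    if row.get("bot_name", "") != "":
        return "KNOWN_BOT"
    # 2. Human-labelled ASN traffic feeds the training baseline, it is not scored.
    if row.get("asn_label") == "human":
        return "HUMAN_BASELINE"  # hypothetical label
    # 3. Remaining traffic is judged on its Isolation Forest score.
    if score is not None and score < threshold:
        return "ANOMALY"
    return "NORMAL"  # hypothetical label
```

For example, `classify({"bot_name": "", "asn_label": "unknown"}, -0.12)` returns `"ANOMALY"` with the default threshold.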
---
## 12. Scientific foundations
### 12.1 Isolation Forest (Liu et al., 2008)
The algorithm relies on the property that anomalies are **isolated more quickly** in random partitioning trees. The average isolation path length is normalized to produce a score between 0 and 1 (re-expressed here by `decision_function` as a signed score in which negative values indicate anomalies).
**Key properties:**
- O(n log n) training complexity
- Robust to high-dimensional data (31-35 features here)
- No assumption about the data distribution
- Efficient on large volumes (n_estimators=300, n_jobs=-1)
### 12.2 JA4+ Fingerprinting (FoxIO, 2023)
JA4 is the 4th generation of TLS/QUIC/HTTP fingerprints and the successor to JA3. It captures the characteristics of the TLS ClientHello (versions, ciphers, extensions) in a compact fingerprint that identifies families of clients (browsers, bots, tools). The features `is_rare_ja4`, `distinct_ja4_count` and `ja4_asn_concentration` exploit this property.
### 12.3 One-Class Classification applied to bots
The approach builds on prior work on web bot detection:
- **Stevanovic et al. (2013)**: bot detection through behavioral analysis of HTTP flows
- **Kruegel & Vigna (2003)**: network anomaly detection using normal profiles
- **Barford & Yegneswaran (2007)**: behavioral classification of botnets
Combining behavioral HTTP features (velocity, fuzzing, post_ratio), fingerprint features (JA4, headers) and TCP/TLS features (jitter, ALPN, SNI) reproduces the multi-layer approach recommended by recent literature.
### 12.4 Temporal entropy as an anomaly signal
The `temporal_entropy` feature measures the Shannon entropy of the temporal distribution of requests within the window. A bot with regular timing (scripted polling) produces low entropy, whereas a human browsing naturally produces a more random distribution. This signal is used in the work of **Wang et al. (2014)** on web crawler detection.
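The exact definition of `temporal_entropy` lives in the ClickHouse view and is not shown here. One reading that matches the behavior described (regular polling gives low entropy) is the Shannon entropy of the binned inter-arrival gaps; the sketch below assumes that reading, and the function name and `bin_sec` parameter are illustrative.

```python
import math
from collections import Counter
from typing import Iterable


def temporal_entropy(timestamps: Iterable[float], bin_sec: float = 1.0) -> float:
    """Shannon entropy (in bits) of binned inter-arrival gaps.

    Assumption: entropy is taken over the gaps between consecutive requests,
    which is one possible interpretation of the feature.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    # Bucket each gap so that near-identical delays share a bin.
    counts = Counter(int(gap // bin_sec) for gap in gaps)
    total = len(gaps)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())
```

With this definition a client polling every 10 seconds has a single gap bin and an entropy of zero, while irregular human-like timing spreads the gaps over several bins and yields a higher value.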
---
## 13. Implemented improvements (v11)
### A1: Concept drift detection
**How it works**: At each cycle, before deciding whether to load or retrain the model, the current distribution of the human baseline is compared with the one saved at the last training. For each feature, a z-score is computed:
```
z = |mean_current - mean_trained| / std_trained
```
If the fraction of features with `z > 2.0` exceeds `DRIFT_THRESHOLD` (30% by default), a retrain is forced and the `DRIFT_DETECTED` event is logged.
**Saved metadata**: `baseline_stats` in the `.meta.json` contains `{mean, std, p25, p75}` per feature.
**References**: Gama et al. (2014) *A Survey on Concept Drift Adaptation*
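The z-score check can be sketched in a few lines. The function name `drift_score` and the handling of features with a zero standard deviation (skipped here) are assumptions; the `z > 2.0` limit and the `baseline_stats` layout come from the description above.

```python
from typing import Dict, List, Mapping, Tuple


def drift_score(baseline_stats: Mapping[str, Mapping[str, float]],
                current_means: Mapping[str, float],
                z_limit: float = 2.0) -> Tuple[float, List[str]]:
    """Return (fraction of drifting features, their names)."""
    drifting: List[str] = []
    usable = 0
    for feat, stats in baseline_stats.items():
        # Assumption: features absent from the current cycle or with std == 0
        # cannot produce a z-score and are left out of the ratio.
        if feat not in current_means or stats["std"] <= 0:
            continue
        usable += 1
        z = abs(current_means[feat] - stats["mean"]) / stats["std"]
        if z > z_limit:
            drifting.append(feat)
    fraction = len(drifting) / usable if usable else 0.0
    return fraction, drifting
```

A forced retrain would then be triggered when the returned fraction reaches `DRIFT_THRESHOLD`, for example `fraction >= 0.30`.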
---
### A2: Adaptive percentile threshold
**How it works**:
```python
effective_threshold = min(np.percentile(raw_scores[raw_scores < 0], ANOMALY_PERCENTILE),
                          ANOMALY_THRESHOLD)
```
The effective threshold is the minimum of the `ANOMALY_PERCENTILE`-th percentile of the negative scores and the static threshold. This guarantees that the threshold can never rise above the configured value, but can adapt downward to the current distribution.
The threshold used is logged in each `ANOMALY` event.
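A dependency-free worked example of this rule is given below. The `percentile` helper reproduces the linear interpolation that `np.percentile` uses by default; falling back to the static threshold when no score is negative is an assumption of this sketch, not documented behavior.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile over a non-empty sequence."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def effective_threshold(raw_scores: Sequence[float], anomaly_percentile: float = 5,
                        static_threshold: float = -0.05) -> float:
    """min(percentile of the negative scores, static threshold)."""
    negatives = [s for s in raw_scores if s < 0]
    if not negatives:
        # Assumption: with no negative score, only the static threshold applies.
        return static_threshold
    return min(percentile(negatives, anomaly_percentile), static_threshold)
```

When all negative scores are mild (for instance `-0.01` and `-0.02`), the percentile sits above the static value and the result stays at `-0.05`; a cycle with strongly negative scores pushes the threshold lower.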
---
### A3: Multi-window analysis (optional)
**Activation**: `ENABLE_MULTIWINDOW=true` plus a `view_ai_features_24h` view in ClickHouse.
**How it works**: An additional pair of models (`Complet_24h`, `Applicatif_24h`) runs on the 24h window. Anomalies from both windows are merged with OR logic: an IP is flagged if it is anomalous in at least one window. For duplicates, the lowest (most anomalous) score is kept.
**Purpose**: Detecting low-and-slow bots that are invisible over 1h but clearly anomalous over 24h.
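The OR merge with "lowest score wins" can be sketched on plain dictionaries. The function name and the `{src_ip: score}` shape are assumptions made for illustration; the addresses in the usage note are documentation-range examples.

```python
from typing import Dict, Mapping


def merge_windows(anomalies_1h: Mapping[str, float],
                  anomalies_24h: Mapping[str, float]) -> Dict[str, float]:
    """Union of both windows; on duplicates keep the most anomalous score."""
    merged = dict(anomalies_1h)
    for ip, score in anomalies_24h.items():
        # OR logic: an IP present in either window is kept.
        if ip not in merged or score < merged[ip]:
            merged[ip] = score
    return merged
```

Merging `{"192.0.2.1": -0.10}` with `{"192.0.2.1": -0.25, "192.0.2.9": -0.07}` keeps both IPs and retains `-0.25` for the first one.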
---
### A4: Explainability with SHAP
**How it works**: For each detected anomaly, `shap.TreeExplainer` computes the contribution of each feature to the anomaly score. The 5 most negative features (those most responsible for the anomaly) are included in the `reason` field:
```
[Complet] Score: -0.112 | SHAP: is_alpn_missing(-1.081) | tcp_jitter_variance(-1.073) |
ja4_asn_concentration(-1.062) | temporal_entropy(-0.887) |
direct_access_ratio(-0.886) | Threat: MEDIUM
```
**Disabling**: `ENABLE_SHAP=false`, or automatically when the `shap` package is not installed.
**References**: Lundberg & Lee (2017) *A Unified Approach to Interpreting Model Predictions*
---
### A5: Inter-cycle deduplication with TTL
**How it works**: Before each insertion, the `ml_detected_anomalies` table is queried to identify the IPs already inserted in the last `DEDUP_TTL_MIN` minutes. An IP is re-inserted only if its raw score has worsened by at least 0.05 points.
**Disabling**: `DEDUP_TTL_MIN=0`
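The re-insertion rule can be expressed as a pure function once the recent detections are in memory. In this sketch the function name is hypothetical and `recent` is assumed to map each IP already inserted within the TTL to its last raw score.

```python
from typing import Dict, Mapping


def filter_recent(candidates: Mapping[str, float], recent: Mapping[str, float],
                  min_degradation: float = 0.05) -> Dict[str, float]:
    """Keep new IPs, and known IPs whose raw score dropped by min_degradation or more."""
    kept: Dict[str, float] = {}
    for ip, score in candidates.items():
        previous = recent.get(ip)
        # A lower score is more anomalous, so "worsened" means a decrease.
        if previous is None or score <= previous - min_degradation:
            kept[ip] = score
    return kept
```

An IP last seen at `-0.10` is dropped when it scores `-0.12` in the next cycle, and kept when it scores `-0.20`.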
---
### A6: Recurrence-weighted score
**How it works**:
```python
raw_score_adjusted = raw_score - np.log1p(recurrence) * RECURRENCE_WEIGHT
```
An IP detected 10 times receives a penalty of `log(11) × 0.005 ≈ 0.012` on its raw score, which lowers the score and moves it toward, or further past, the detection threshold. This mechanism mimics a Bayesian prior: repeat-offender IPs are more likely to be malicious.
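The penalty arithmetic is easy to check with the standard library. The helper name `adjust_score` is hypothetical; the formula and the default weight of `0.005` are those given above.

```python
import math


def adjust_score(raw_score: float, recurrence: int, weight: float = 0.005) -> float:
    """Subtract log1p(recurrence) * weight from the raw Isolation Forest score."""
    return raw_score - math.log1p(recurrence) * weight


# Penalties for a few recurrence counts, starting from a raw score of 0.0
for count in (0, 1, 10, 100):
    print(count, round(adjust_score(0.0, count), 4))
```

The logarithm keeps the penalty bounded in practice: going from 10 to 100 past detections roughly doubles it (about 0.012 to about 0.023) rather than multiplying it by ten.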
---
### A7: Feature completeness validation
**How it works**: Before training and scoring, `validate_features()` detects:
- Features absent from the ClickHouse view
- Constant features (std = 0, hence non-discriminating)
Invalid features are excluded from the model. If the fraction of valid features is below `MIN_VALID_FEATURE_RATIO` (50%), the cycle is skipped.
**Benefit**: Constant features (often caused by columns not yet implemented in the view) no longer bias the model.
---
### A8: Behavioral clustering (DBSCAN)
**How it works**: After detection, DBSCAN is applied to the normalized features of the anomalies:
```python
X_scaled = StandardScaler().fit_transform(anomalies[valid_features])
labels = DBSCAN(eps=0.5, min_samples=CLUSTERING_MIN_SAMPLES).fit_predict(X_scaled)
```
- `campaign_id = -1`: isolated IP (unique behavior)
- `campaign_id >= 0`: member of a coordinated campaign
The `campaign_id` is logged in `ANOMALY` events (JSONL). It is not yet part of the ClickHouse schema (see §14).
**References**: Ester et al. (1996) *A Density-Based Algorithm for Discovering Clusters*
---
### A10: Score normalization across models
**How it works**:
```python
# Negative scores are normalized to [-1, 0], positive scores are left unchanged
anomaly_score_normalized = normalize_scores(raw_score)
```
The `anomaly_score` field in ClickHouse now contains the normalized score, which allows a consistent comparison between the Complet model (35 features) and the Applicatif model (31 features). The raw IF score is kept in `raw_anomaly_score` (JSONL logs only) and is used to assign the threat level.
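The body of `normalize_scores` is not shown in this document. One way to obtain the stated contract (negatives mapped into `[-1, 0]`, positives untouched) is to divide negative scores by the magnitude of the most negative one; the sketch below assumes that scheme and uses a distinct name to avoid confusion with the real function.

```python
from typing import List, Sequence


def normalize_scores_sketch(scores: Sequence[float]) -> List[float]:
    """Scale negative scores into [-1, 0]; leave positive scores unchanged.

    Assumption: scaling by the most negative score of the batch. The actual
    normalize_scores() may use a different scheme.
    """
    negatives = [s for s in scores if s < 0]
    if not negatives:
        return list(scores)
    scale = abs(min(negatives))
    return [s / scale if s < 0 else s for s in scores]
```

With this scheme the most anomalous IP of each model always lands on `-1`, which puts the two models on the same relative scale.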
---
## 14. ClickHouse schema migration
The following new columns are available in the JSONL logs but not yet in the `ml_detected_anomalies` table. To enable them:
```sql
ALTER TABLE mabase_prod.ml_detected_anomalies
    ADD COLUMN IF NOT EXISTS campaign_id Int32 DEFAULT -1,
    ADD COLUMN IF NOT EXISTS raw_anomaly_score Float32 DEFAULT 0;
```
After this migration, add these columns to the `cols` list in `fetch_and_analyze()` (they are already computed in memory).


@ -0,0 +1,756 @@
# Bot Detector IA: Improvement areas
> Technical proposals document, to be validated before implementation
---
## Summary of proposed areas
| # | Area | Impact | Complexity | Suggested priority |
|---|-----|--------|------------|-------------------|
| A1 | [Concept drift detection](#a1-détection-de-dérive-conceptuelle) | 🔴 High | Medium | ⭐⭐⭐ |
| A2 | [Adaptive percentile threshold](#a2-seuil-adaptatif-par-percentile) | 🔴 High | Low | ⭐⭐⭐ |
| A3 | [Multi-window temporal analysis](#a3-analyse-multi-fenêtres-temporelles) | 🔴 High | High | ⭐⭐ |
| A4 | [Explainability with SHAP](#a4-explainabilité-par-shap) | 🟠 Medium | Medium | ⭐⭐⭐ |
| A5 | [Inter-cycle deduplication with TTL](#a5-déduplication-avec-ttl-inter-cycles) | 🟠 Medium | Low | ⭐⭐⭐ |
| A6 | [Recurrence weighting in the score](#a6-pondération-par-récurrence-dans-le-score) | 🟠 Medium | Low | ⭐⭐ |
| A7 | [Feature completeness validation](#a7-validation-de-complétude-des-features) | 🟠 Medium | Low | ⭐⭐⭐ |
| A8 | [Behavioral clustering of anomalies](#a8-clustering-comportemental-des-anomalies) | 🟡 Useful | Medium | ⭐⭐ |
| A9 | [Prometheus metrics / enriched health check](#a9-métriques-prometheus--health-check-enrichi) | 🟡 Useful | Low | ⭐⭐ |
| A10 | [Score normalization across models](#a10-normalisation-des-scores-entre-modèles) | 🟡 Useful | Low | ⭐ |
---
## A1: Concept drift detection
### Problem
The Isolation Forest is trained on the current human baseline. If the profile of legitimate traffic evolves gradually (a newly popular browser, a change in user behavior, a network migration), the aging model can:
- Generate **false positives** on newly appeared human traffic
- Produce **false negatives** if bots imitate the old patterns
Periodic retraining (every X hours) mitigates the problem but does not detect when a significant drift has occurred **between two retraining cycles**.
### Proposed approach
At each cycle, compute a **statistical drift** score between the training baseline of the active model and the current baseline. If the drift exceeds a threshold, force an early retrain.
**Method: Kolmogorov-Smirnov (KS test) or Maximum Mean Discrepancy (MMD)**
For each feature:
```python
from scipy import stats
ks_stat, p_value = stats.ks_2samp(baseline_trained[feat], baseline_current[feat])
```
If the fraction of features with `p_value < 0.05` exceeds a configurable threshold (e.g. 30%), trigger a retrain and log a `DRIFT_DETECTED` event.
### Benefits
- Opportunistic retraining rather than a fixed schedule
- Proactive detection of changes in network behavior
- Fewer false positives caused by drift
### References
- Gama et al. (2014), *A Survey on Concept Drift Adaptation*
- Rabanser et al. (2019), *Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift*
### Suggested implementation
- Save the distribution of the training baseline in the `.meta.json`
- Run the KS test at the start of each cycle, before the load decision
- Add a `DRIFT_THRESHOLD` parameter (default: 0.30)
- Log the `DRIFT_DETECTED` event with the drifting features
---
## A2: Adaptive percentile threshold
### Problem
`ANOMALY_THRESHOLD = -0.03` is a **global, static** threshold. Its meaning varies with:
- The traffic volume (more traffic = a tighter score distribution)
- The effective contamination of the cycle (quiet day vs active attack)
- The characteristics of the active model (trained on 1,264 vs 1,725 sessions)
A fixed threshold can produce **bursts of false positives** during unusual legitimate events (marketing campaign, partner crawler) or miss real threats during atypical traffic.
### Proposed approach
Compute the threshold dynamically from the **score distribution of the current cycle**:
```python
scores = model.decision_function(X_test)
# Threshold = percentile P of the score distribution
adaptive_threshold = np.percentile(scores, ANOMALY_PERCENTILE)
# Take the min with the static threshold so the result never goes too high
threshold = min(adaptive_threshold, ANOMALY_THRESHOLD)
```
**Parameter to add**: `ANOMALY_PERCENTILE` (default: 5 → the 5% most negative scores).
This approach complements the static threshold (safeguard): it adapts downward but never rises above the configured threshold.
### Benefits
- Stable false positive rate over time
- Self-adaptation to volume variations
- More predictable behavior in production
### Suggested implementation
- Add `ANOMALY_PERCENTILE` (0-20, default 5) as an environment variable
- Compute the adaptive threshold in `run_semi_supervised_logic()`
- Log the effective threshold used in `CYCLE_START` / `ANOMALY`
---
## A3: Multi-window temporal analysis
### Problem
The 1h window is a compromise. It misses:
- **Fast attacks** (bursts of a few minutes): the signal is diluted
- **Slow bots** (low-and-slow, 1-2 req/min over 24h): the behavior looks normal over 1h
### Proposed approach
Add a second ClickHouse view aggregated over **24h** and a third model on this window. The scores of the two models can be combined:
```
score_final = w1 * score_1h + w2 * score_24h
```
Or, more simply, a logical AND: an IP is flagged only if it is anomalous on **both windows**, which drastically reduces false positives.
### Benefits
- Detection of low-and-slow bots (reconnaissance, discreet scraping)
- Fewer false positives through multi-temporal correlation
- Complements the existing 1h model
### Considerations
- Requires a `view_ai_features_24h` view in ClickHouse
- The 24h model is much more stable (less noise)
- The volume of data to process increases
### References
- Stalmans & Irwin (2011), *A Framework for Web Bot Detection Using Request Rate Monitoring*
- Stevanovic et al. (2013), *An Efficient Flow-based Botnet Detection Using Supervised Machine Learning*
---
## A4: Explainability with SHAP
### Problem
The current `reason` field is basic:
```
"[Complet] Score: -0.312 | Vel: 45.2 req/s | Fuzzing: 8.3 | Threat: CRITICAL"
```
For a security operator, the following is missing:
- **Which features** contributed most to this score?
- Is it mainly behavioral (velocity) or fingerprint-related (JA4)?
- How can two anomalies with the same score be compared?
### Proposed approach
Use **TreeSHAP** (Lundberg & Lee, 2017), which natively supports tree ensembles:
```python
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test.iloc[[idx]])
top_features = sorted(zip(features, shap_values[0]), key=lambda x: abs(x[1]), reverse=True)[:5]
```
Enrich the `reason` field with the 5 most contributing features and their SHAP values.
### Benefits
- Easier alert triage for SOC analysts
- Detection of systematically over-represented features (potential feature engineering bug)
- Compliance with traceability requirements for AI decisions
### Suggested implementation
- Add `shap` to the requirements (compatible with sklearn)
- Compute SHAP only for flagged IPs (not for the whole dataset)
- Store `shap_top5` as JSON in the JSONL log
- Option: `ENABLE_SHAP=true/false` to control CPU load
### References
- Lundberg & Lee (2017), *A Unified Approach to Interpreting Model Predictions*
---
## A5: Inter-cycle deduplication with TTL
### Problem
With a 5 min cycle and a 1h window, the same malicious IP is potentially **re-inserted 12 times per hour** into `ml_detected_anomalies`. This:
- Inflates the table artificially
- Complicates analysis queries (a DISTINCT is required)
- Skews counting metrics
The current `drop_duplicates(subset=['src_ip'])` mechanism only works within a single cycle, not across cycles.
### Proposed approach
Before insertion, query ClickHouse to filter out IPs that were inserted recently:
```python
# Fetch the IPs already detected in the last N minutes
recent_ips = client.query_df(f"""
    SELECT DISTINCT src_ip
    FROM {DB}.ml_detected_anomalies
    WHERE detected_at > now() - INTERVAL {DEDUP_TTL_MIN} MINUTE
""")
# Exclude these IPs (the variant below re-inserts them when the score has degraded significantly)
new_anomalies = anomalies[~anomalies['src_ip'].isin(recent_ips['src_ip'])]
```
**Parameter to add**: `DEDUP_TTL_MIN` (default: 60 minutes).
**Variant**: re-insert only if `new_score < existing_score - 0.05` (significant degradation).
### Benefits
- Smaller detection table
- Simpler analysis queries
- Better behavior under load (fewer insertions)
### Suggested implementation
- `DEDUP_TTL_MIN` parameter (0 to disable)
- The deduplication query is lightweight (index on `detected_at`)
- Log the number of filtered IPs in `CYCLE_END`
---
## A6: Recurrence weighting in the score
### Problem
Recurrence is currently an **informational field only**: an IP detected 50 times has the same filtering threshold as an IP seen for the first time. A persistent, known bot receives no score penalty.
### Proposed approach
Adjust the decision score according to the recurrence:
```python
# Adjusted score: the more recurrent an IP is, the worse its score becomes
recurrence_penalty = np.log1p(recurrence) * RECURRENCE_WEIGHT
adjusted_score = anomaly_score - recurrence_penalty
```
With `RECURRENCE_WEIGHT = 0.005` by default (configurable). An IP seen 10 times has its score penalized by ~0.012, an IP seen 100 times by ~0.023.
This approach mimics a **Bayesian prior**: the probability that an IP is malicious increases with its past detections.
### Benefits
- Persistent threats are classified more severely
- Less noise from ephemeral anomalies
- Stronger signal for automated blocking
### Suggested implementation
- Add `RECURRENCE_WEIGHT` (default 0.005, 0 to disable)
- Store `raw_score` and `adjusted_score` separately in the logs
---
## A7: Feature completeness validation
### Problem
If a feature is absent from the view (missing column, schema error), it is silently replaced by `0` via `fillna(0)`. This **degrades model quality without warning**: a feature that is entirely zero carries no discriminating information and biases the scores.
### Proposed approach
At the start of each cycle, after loading the DataFrame:
```python
import pandas as pd

def validate_features(df: pd.DataFrame, features: list, name: str) -> list:
    # log_info: logging helper assumed to be defined elsewhere in the service
    zero_features = [f for f in features if f in df.columns and df[f].std() == 0]
    missing_features = [f for f in features if f not in df.columns]
    if missing_features:
        log_info(f"[{name}] ATTENTION: {len(missing_features)} features manquantes: {missing_features}")
    if zero_features:
        log_info(f"[{name}] ATTENTION: {len(zero_features)} features constantes (=0): {zero_features}")
    # Return only the usable features
    valid = [f for f in features if f in df.columns and df[f].std() > 0]
    return valid
```
A `FEATURE_WARNING` event would be logged, and if more than 20% of the features are invalid, the cycle can be `SKIPPED`.
### Benefits
- Fast detection of ClickHouse schema regressions
- Assured model quality
- Easier debugging when the view evolves
### Suggested implementation
- `MIN_VALID_FEATURE_RATIO` parameter (default 0.8)
- Comparison with the features of the loaded model (detects schema drift after an update)
---
## A8: Behavioral clustering of anomalies
### Problem
Anomalies are analyzed and inserted individually. Yet a coordinated botnet campaign can involve **dozens of IPs with similar profiles**. This **horizontal correlation** information is currently invisible.
### Proposed approach
After detection, apply **DBSCAN** to the features of the anomalies to identify attack clusters:
```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X_anomalies = anomalies[features].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_anomalies)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_scaled)
anomalies['campaign_id'] = labels  # -1 = isolated, 0+ = cluster
```
IPs in the same cluster share similar behavior and may belong to the same attack infrastructure.
### Benefits
- Identification of coordinated campaigns (distributed botnets)
- Enrichment of `reason` with a campaign identifier
- Enables blocking of entire IP ranges
### Suggested implementation
- Run DBSCAN only if there are ≥ 5 anomalies in the cycle (no cost when anomalies are few)
- Store `campaign_id` in `ml_detected_anomalies`
- Make `eps` and `min_samples` configurable
### References
- Ester et al. (1996), *A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases*
---
## A9: Prometheus metrics / enriched health check
### Problem
The current health check is binary (OK/DEGRADED). It does not allow:
- Monitoring the drift of the anomaly rate over time
- Alerting when no cycle has run for X minutes
- Tracking the age of the model in production
### Proposed approach
Expose a `/metrics` endpoint in **Prometheus text** format on the same port:
```
# HELP botdetector_cycle_duration_seconds Duration of last analysis cycle
# TYPE botdetector_cycle_duration_seconds gauge
botdetector_cycle_duration_seconds 12.4
# HELP botdetector_anomalies_total Total anomalies detected in last cycle
# TYPE botdetector_anomalies_total gauge
botdetector_anomalies_total{model="Complet"} 3
botdetector_anomalies_total{model="Applicatif"} 7
# HELP botdetector_model_age_hours Age of active model in hours
botdetector_model_age_hours{model="Applicatif"} 0.91
# HELP botdetector_human_baseline_size Nb of human samples used for training
botdetector_human_baseline_size{model="Applicatif"} 1725
```
This can be implemented without an external dependency (hand-written text format) or with the lightweight `prometheus_client` library.
### Benefits
- Grafana/Alertmanager integration
- Alerts on anomaly rate drift (e.g. >50% from one hour to the next)
- Monitoring of model freshness
### Suggested implementation
- Add `prometheus_client` or generate the text format manually
- `/metrics` endpoint on the existing `HTTPServer`
- Metrics stored in a thread-safe dict updated after each cycle
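The hand-written variant of the text format can be sketched as follows. The `render_metrics` function and its input shape are hypothetical, and label values are assumed not to need escaping.

```python
from typing import Dict, List, Mapping, Tuple

Sample = Tuple[Mapping[str, str], float]


def render_metrics(metrics: Mapping[str, Tuple[str, str, List[Sample]]]) -> str:
    """Render {name: (help, type, [(labels, value), ...])} as Prometheus text."""
    lines: List[str] = []
    for name, (help_text, metric_type, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            # Assumption: label values contain no quotes or backslashes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


example: Dict[str, Tuple[str, str, List[Sample]]] = {
    "botdetector_anomalies_total": (
        "Total anomalies detected in last cycle", "gauge",
        [({"model": "Complet"}, 3), ({"model": "Applicatif"}, 7)],
    ),
}
print(render_metrics(example), end="")
```

The returned string can be written as the body of a `GET /metrics` response with the content type `text/plain; version=0.0.4`.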
---
## A10 — Normalisation des scores entre modèles
### Problème
Les scores `decision_function` de l'IF ne sont **pas comparables entre modèles** entraînés sur des données différentes. Un score de -0.10 sur le modèle Complet et -0.10 sur le modèle Applicatif n'ont pas la même signification si les baselines et les features sont différentes.
La déduplication actuelle par `src_ip` prend le score le plus bas sans tenir compte de cette non-comparabilité.
### Approche proposée
Normaliser les scores par rapport à la distribution des scores négatifs du cycle courant :
```python
# Min-max normalization over the subset of scores < 0, mapped to [-1, 0]:
# the most anomalous score (score_min) becomes -1, scores >= score_max become 0.
# Run this once per model so that each model is scaled against its own distribution.
neg_scores = unknown_traffic['anomaly_score'][unknown_traffic['anomaly_score'] < 0]
if len(neg_scores) > 0:
    score_min, score_max = neg_scores.min(), neg_scores.max()
    unknown_traffic['normalized_score'] = (
        (unknown_traffic['anomaly_score'] - score_max) / (score_max - score_min + 1e-9)
    ).clip(-1, 0)
```
Threat levels would then be computed from the normalized score, making comparisons between models consistent.
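To make the per-model aspect concrete, here is a dependency-free sketch of the same normalization applied model by model. The row layout (dicts with `model` and `anomaly_score` keys) and the function name are assumptions made for illustration; the real pipeline works on a DataFrame.

```python
from collections import defaultdict


def normalize_per_model(rows, eps=1e-9):
    """Return copies of rows with a 'normalized_score' in [-1, 0], scaled per model.

    rows: list of dicts with (assumed) keys 'model' and 'anomaly_score'.
    """
    # Collect the negative scores of each model separately.
    neg_by_model = defaultdict(list)
    for row in rows:
        if row["anomaly_score"] < 0:
            neg_by_model[row["model"]].append(row["anomaly_score"])

    out = []
    for row in rows:
        neg = neg_by_model.get(row["model"])
        if not neg:
            norm = 0.0  # no anomaly for this model in this cycle
        else:
            lo, hi = min(neg), max(neg)
            # lo (most anomalous) maps to about -1, hi maps to 0.
            # Edge case: a model with a single negative score maps it to 0.
            norm = (row["anomaly_score"] - hi) / (hi - lo + eps)
            norm = max(-1.0, min(0.0, norm))
        out.append({**row, "normalized_score": norm})
    return out
```

With this scaling, a raw score of -0.10 can be the mildest anomaly of one model and the strongest of another, which is the situation the deduplication by `src_ip` needs to distinguish.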
### Benefits
- Consistent CRITICAL/HIGH/MEDIUM levels across models
- Fairer deduplication
- `threat_level` thresholds that can be interpreted the same way everywhere
---
## General implementation notes
- **Compatibility**: every improvement must stay backward compatible with the existing `ml_detected_anomalies` schema (optional column additions only)
- **Readability**: keep the code in sections delimited by the existing `═══` banners
- **Tests**: validate each change with a Docker run against the real database
- **Documentation**: update `DOCUMENTATION.md` after each implementation
- **Feature flags**: new behavioral features should be switchable through an environment variable to allow a progressive rollout
---
# New feature dimensions — B proposals
> Proposals for additional Isolation Forest features, validated on the real data of `mabase_prod`.
> Each proposal gives the signal strength observed in the database, the data source, the computation formula and the scientific references.
## Signal summary
| # | Feature | Observed signal | Model | Estimated impact |
|---|---------|---------------|--------|--------------|
| B1 | JA3/JA4 diversity ratio | 1619 JA3 for 2 JA4 (ratio 809.5, known bot IP) | Complet | 🔴 High |
| B2 | SYN timing regularity | 386/3222 IPs (12%) with variance=0 | Complet | 🔴 High |
| B3 | TLS 1.2 exclusive ratio | 136/3259 IPs (4%), never TLS 1.3 | Complet | 🔴 High |
| B4 | HEAD method ratio | 67/3335 IPs (2%) with >50% HEAD | Both | 🟠 Medium |
| B5 | Sec-Fetch absence rate | Universal L7 signal (also available when correlated=0) | Both | 🟠 Medium |
| B6 | Accept header entropy | Bots = empty Accept or constant `*/*` | Both | 🟠 Medium |
| B7 | TLS version entropy | TLS 1.3 = 97.3% of legitimate traffic | Complet | 🟠 Medium |
| B8 | HTTP/TLS protocol mismatch | HTTP/1.1 + TLS 1.3 = abnormal ratio | Complet | 🟡 Useful |
| B9 | IP DF-bit variance | Inconsistent DF = spoofed stack | Complet | 🟡 Useful |
| B10 | JA4 concentration intra-ASN | JA4 rare within its ASN = exotic tool | Complet | 🟡 Useful |
---
## B1 — JA3/JA4 Diversity Ratio (TLS fingerprint rotation)
### Observation
```
185.177.72.60 → 1619 distinct JA3 / 2 JA4 → ratio 809.5
194.187.171.160 → 153 distinct JA3 / 2 JA4 → ratio 76.5
```
The JA4 stays stable (it encodes the TLS client type + ALPN, and sorts extensions before hashing) while the JA3 varies massively. This is the signature of a **bot that randomizes its TLS extensions** to evade fingerprint-based detection.
### Proposed feature
```sql
-- In mv_agg_host_ip_ja4_1h
uniqState(ja3) AS uniq_ja3 -- to be added to the aggregation table
```
```sql
-- In view_ai_features_1h
uniqMerge(uniq_ja3) / greatest(uniq_ja4_merged, 1) AS ja3_diversity_ratio
```
### Signal in the database
- Human traffic: ratio typically 1–3 (same browser, slight variations)
- Bot with rotation: ratio 17–809 → extremely discriminating signal
- Availability: `ja3` is present in `http_logs` with 100% non-empty values for correlated=1
### Required changes
1. Add `uniqState(ja3) AS uniq_ja3` to `mv_agg_host_ip_ja4_1h` and the matching `uniq_ja3` column to `agg_host_ip_ja4_1h`
2. Add `uniqMerge(uniq_ja3) / greatest(uniq_ja4_merged, 1) AS ja3_diversity_ratio` to `view_ai_features_1h`
3. Add `ja3_diversity_ratio` to `feats_complet` in `bot_detector.py`
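For offline checks, the same ratio can be reproduced outside ClickHouse. The sketch below assumes records arrive as `(src_ip, ja3, ja4)` tuples, which is an illustrative layout and not the pipeline's; it counts exactly, whereas ClickHouse `uniq` is approximate.

```python
from collections import defaultdict


def ja3_diversity_ratio(records):
    """Return {src_ip: distinct JA3 count / max(distinct JA4 count, 1)}.

    records: iterable of (src_ip, ja3, ja4) tuples (assumed layout).
    """
    ja3s, ja4s = defaultdict(set), defaultdict(set)
    for src_ip, ja3, ja4 in records:
        ja3s[src_ip].add(ja3)
        ja4s[src_ip].add(ja4)
    # max(..., 1) mirrors greatest(uniq_ja4, 1) in the view.
    return {ip: len(ja3s[ip]) / max(len(ja4s[ip]), 1) for ip in ja3s}
```

A stable client yields a ratio close to 1, while an IP cycling through many JA3 values under the same JA4 yields a ratio that grows with the number of rotations.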
### References
- Siby et al. (2020) — *Encrypted DNS → Privacy? A Traffic Analysis Perspective* — fingerprint diversity methods
- Anderson & McGrew (2016) — *Machine Learning for Encrypted Malware Traffic Classification* — JA3 as a primary feature
- Husák et al. (2022) — *TLS fingerprinting for bot detection* — JA3 rotation as signature evasion
---
## B2 — SYN-to-ClientHello Timing Regularity
### Observation
```
88.202.237.59 : 45 connections, avg=22ms, std=0.00ms → perfectly robotic timing
92.184.144.129: 41 connections, avg=10ms, std=0.00ms → same
386/3222 analyzed IPs (12%) have variance=0
```
Human-driven clients show a random distribution (Weibull or log-normal) of network response times. A bot running on a fixed scheduler or over a local connection has a variance close to zero.
### Proposed feature
```sql
-- In view_ai_features_1h (CTE)
varPopMerge(tcp_jitter_variance) AS syn_jitter_variance, -- already present (tcp_jitter_variance)
-- Add the coefficient of variation (normalized)
```
```sql
-- cv = std / mean → 0 = robotic, >0.5 = human
sqrt(syn_jitter_variance) / greatest(avg_syn_ms, 1) AS syn_timing_cv
```
**Note**: `tcp_jitter_variance` is already in the model, but as the raw variance. The **coefficient of variation** (std/mean) normalizes by the mean delay, so it separates robotic from human timing regardless of whether the client is fast (10ms) or slow (100ms).
### Required changes
1. Add `avgState(toFloat64(syn_to_clienthello_ms))` to `mv_agg_host_ip_ja4_1h` as `avg_syn_ms` (`avg` has no `SimpleAggregateFunction` form, so the state is stored and read back with `avgMerge`)
2. Compute `syn_timing_cv = sqrt(tcp_jitter_variance) / greatest(avg_syn_ms, 1)` in `view_ai_features_1h`
3. Add `syn_timing_cv` to `feats_complet`
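As a quick sanity check of the formula, the coefficient of variation can be computed from raw delays with the standard library. The function below is an illustrative sketch; it uses the population standard deviation to match `varPop`, and the floor of 1 ms on the mean mirrors `greatest(avg_syn_ms, 1)`.

```python
import statistics


def syn_timing_cv(delays_ms):
    """Coefficient of variation of SYN-to-ClientHello delays for one IP."""
    if not delays_ms:
        return 0.0
    mean = statistics.fmean(delays_ms)
    std = statistics.pstdev(delays_ms)  # population std, i.e. sqrt(varPop)
    return std / max(mean, 1.0)         # same floor as greatest(avg_syn_ms, 1)
```

Forty-five connections at exactly 22 ms give a coefficient of 0, while delays alternating between 10 ms and 30 ms give 0.5.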
### References
- Zeber et al. (2020) — *The Measurement of Web Timing* — log-normal distribution for humans
- Beugin et al. (2021) — *Robustness of Traffic Analysis Against Adversarial Timing* — variance as a discriminant
- Stevanovic & Pedersen (2015) — *Detecting Bots Using Multi-level Traffic Analysis* — timing regularity as an L4 bot signal
---
## B3 — TLS 1.2 Exclusive Ratio
### Observation
```
95.217.144.244 : 360/360 requests over TLS 1.2 (never TLS 1.3)
37.65.177.201 : 267/267 requests over TLS 1.2
136 of 3259 analyzed IPs (4.2%) use TLS 1.2 exclusively
```
TLS 1.3 accounts for 97.3% of the observed traffic in 2026. Modern browsers only fall back to TLS 1.2 in exceptional cases. An IP that uses TLS 1.2 **exclusively** is running an outdated client, a custom library, or a scanning tool.
### Proposed feature
```sql
-- In mv_agg_host_ip_ja4_1h
sum(IF(tls_version = '1.2', 1, 0)) AS tls12_count -- new
-- tls_version is already stored via tls_alpn_raw → to be split out or added
```
```sql
-- In view_ai_features_1h
tls12_count / greatest(hits, 1) AS tls12_ratio
```
### Required changes
1. Add `sum(IF(src.tls_version = '1.2', 1, 0)) AS tls12_count` to `mv_agg_host_ip_ja4_1h`
2. Add `tls12_count` to `agg_host_ip_ja4_1h`
3. Compute `tls12_count / greatest(hits, 1) AS tls12_ratio` in `view_ai_features_1h`
### References
- Kotzias et al. (2018) — *Coming of Age: A Longitudinal Study of TLS Deployment* — aging of TLS stacks
- Naylor et al. (2014) — *The Cost of the S in HTTPS* — HTTPS adoption by legitimate browsers (predates TLS 1.3)
- Cloudflare Radar 2024 — TLS 1.3 = 95%+ of worldwide web traffic
---
## B4 — HEAD Method Ratio
### Observation
```
34.140.199.84 : 11/12 HEAD requests (91.7%) → Google Cloud uptime checker
67/3335 IPs send >50% HEAD requests
```
The HEAD method checks that a resource is available without downloading its content. It is the signature of:
- **Uptime checkers** (Pingdom, UptimeRobot, Google Cloud Health Check)
- **Vulnerability scanners** (Nikto, Nuclei)
- **Discreet reconnaissance bots**
### Proposed feature
```sql
-- head_ratio follows the existing per-method breakdown (cf. count_post)
-- Add to mv_agg_host_ip_ja4_1h:
sum(IF(method = 'HEAD', 1, 0)) AS count_head
```
```sql
count_head / greatest(hits, 1) AS head_ratio
```
### Note: availability in both models
Unlike the TCP features, `head_ratio` is also available for `correlated=0` because it is a pure HTTP feature. Add it to both lists, `feats` and `feats_complet`.
### References
- Barracuda Networks (2023) — *Bot Traffic Report* — HEAD request patterns
- OWASP Automated Threat Handbook — OAT-011: Scraping, OAT-018: Footprinting
---
## B5 — Sec-Fetch Absence Rate
### Observation
The `Sec-Fetch-Site`, `Sec-Fetch-Mode` and `Sec-Fetch-Dest` headers are added **automatically** by modern browsers (Chrome 76+ since 2019, Firefox 90+ since 2021). Their absence signals:
- A non-browser HTTP client (curl, requests, Scrapy, headless Chrome without full headers)
- An old browser or a spoofed UA
- An HTTP CONNECT proxy
### Proposed feature
```sql
-- In mv_agg_host_ip_ja4_1h
sum(IF(length(src.header_sec_fetch_site) = 0, 1, 0)) AS count_no_sec_fetch
```
```sql
count_no_sec_fetch / greatest(hits, 1) AS sec_fetch_absence_rate
```
### Combination with `modern_browser_score`
`sec_fetch_absence_rate` and `modern_browser_score` form a complementary pair:
- Bot with a forged Chrome UA → high `modern_browser_score` but `sec_fetch_absence_rate` = 1 → strong contradiction
### Required changes
1. Add `count_no_sec_fetch` to the MV and the table
2. Compute the rate in the view
### References
- West & Loshbough (2019) — *Fetch Metadata Request Headers* (W3C Spec)
- Invernizzi et al. (2016) — *CLOAK of Visibility* — header inconsistency = bot
---
## B6 — Accept Header Entropy
### Observation
Legitimate browsers send complex, consistent `Accept` headers:
```
image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
```
Bots send:
```
*/* (curl, wget, Scrapy)
(empty) (minimalist bots)
text/html (basic tools)
```
### Proposed feature
```python
# Diversity of Accept values per IP (proxy for browser behavior)
accept_entropy = -sum(p * log2(p+1e-9) for p in accept_value_probs)
# Or more simply: fraction of requests with a generic/empty Accept
generic_accept_ratio = count_generic_accept / hits
# where generic = length(Accept) < 10, which covers '*/*', '' and 'text/html'
```
```sql
sum(IF(length(src.header_accept) < 10, 1, 0)) AS count_generic_accept
```
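The two variants above can be made runnable as follows. This is a sketch under assumptions: the input is the list of `Accept` values seen for one IP, and the function names and the `max_len` parameter are illustrative, with the default taken from the `< 10` rule stated in the comment.

```python
import math
from collections import Counter


def accept_entropy(accept_values):
    """Shannon entropy (in bits) of the Accept values sent by one IP."""
    counts = Counter(accept_values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def generic_accept_ratio(accept_values, max_len=10):
    """Fraction of requests whose Accept header is empty or very short."""
    if not accept_values:
        return 0.0
    generic = sum(1 for value in accept_values if len(value) < max_len)
    return generic / len(accept_values)
```

A client that always sends `*/*` has an entropy of 0 and a generic ratio of 1, while a browser fetching pages, images and scripts spreads its requests over several long `Accept` values.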
### References
- Nikiforakis et al. (2013) — *Cookieless Monster: Exploring the Ecosystem of Web-based Device Fingerprinting* — Accept as a stable component
- Acar et al. (2014) — *The Web Never Forgets* — entropy of HTTP headers
---
## B7 — HTTP/TLS Protocol Version Mismatch
### Observation
```
HTTP/2.0 → 160855 requests (84%)
HTTP/1.1 → 26421 requests (14%)
TLS 1.3 → 177330 requests (97%)
```
HTTP/2 requires TLS in modern browsers. Abnormal combinations:
- HTTP/1.1 + TLS 1.3: legitimate but rare for real browsers (they use HTTP/2 when TLS 1.3 is available)
- HTTP/1.0 + TLS: extremely suspicious (custom tool or old bot)
- HTTP/2 + TLS 1.2: possible but declining
### Proposed feature
```sql
-- Fraction of requests using HTTP/1.x even though TLS 1.3 was negotiated
count_http1_tls13 / greatest(hits, 1) AS http1_tls13_ratio
-- count_http10 / greatest(hits, 1) AS http10_ratio -- strong signal
```
```sql
sum(IF(http_version = 'HTTP/1.0', 1, 0)) AS count_http10,
sum(IF(http_version LIKE 'HTTP/1%' AND tls_version = '1.3', 1, 0)) AS count_http1_tls13
```
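The two counters can be checked on a handful of requests with a short sketch. The input layout, a list of `(http_version, tls_version)` pairs for one IP, and the literal `'1.3'` for the TLS version are assumptions carried over from the SQL above.

```python
def protocol_mismatch_ratios(requests):
    """Return the HTTP/1.0 ratio and the HTTP/1.x-over-TLS-1.3 ratio for one IP.

    requests: list of (http_version, tls_version) pairs (assumed layout).
    """
    hits = max(len(requests), 1)  # same floor as greatest(hits, 1)
    count_http10 = sum(1 for http, _ in requests if http == "HTTP/1.0")
    count_http1_tls13 = sum(
        1 for http, tls in requests if http.startswith("HTTP/1") and tls == "1.3"
    )
    return {
        "http10_ratio": count_http10 / hits,
        "http1_tls13_ratio": count_http1_tls13 / hits,
    }
```

Note that an HTTP/1.0 request over TLS 1.3 counts in both ratios, exactly as it does with the `LIKE 'HTTP/1%'` condition.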
---
## B8 — IP DF-Bit Consistency
### Observation
```
df=1 : 172490 packets (92%)
df=0 : 15016 packets (8%)
```
The "Don't Fragment" bit is usually constant within a given TCP session. An IP that alternates between DF=0 and DF=1 within a session, or across sessions, may indicate:
- **IP spoofing** (spoofed source packets in a botnet)
- **Custom TCP stack** (bots implementing their own TCP)
- **NAT traversal** with packet rewriting
### Proposed feature
```python
# ip_df_variance = population variance of ip_meta_df per IP
# 0 = consistent, >0 = mixed DF values
```
```sql
varPop(toFloat64(ip_meta_df)) AS ip_df_variance
```
Low impact on its own, but useful in combination with TTL variance for multi-dimensional TCP fingerprinting.
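Because `ip_meta_df` only takes the values 0 and 1, its population variance equals p(1 - p), where p is the fraction of packets with DF=1. The feature is therefore bounded by 0.25, reached at an even mix. A small sketch (function name illustrative) makes this visible:

```python
import statistics


def ip_df_variance(df_bits):
    """Population variance of the DF bit (0 or 1) over the packets of one IP."""
    if not df_bits:
        return 0.0
    return float(statistics.pvariance(df_bits))


def df_variance_from_share(p):
    """Closed form for a 0/1 variable: variance = p * (1 - p)."""
    return p * (1.0 - p)
```

An IP that always sets DF scores 0, one that sets it on three packets out of four scores 0.1875, and the maximum of 0.25 corresponds to a 50/50 mix.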
---
## Summary of the required ClickHouse changes
### Columns to add to `agg_host_ip_ja4_1h`
```sql
ALTER TABLE mabase_prod.agg_host_ip_ja4_1h
    ADD COLUMN uniq_ja3 AggregateFunction(uniq, String),
    -- avg is not supported by SimpleAggregateFunction: store avgState(toFloat64(...)), read with avgMerge
    ADD COLUMN avg_syn_ms AggregateFunction(avg, Float64),
    ADD COLUMN tls12_count SimpleAggregateFunction(sum, UInt64),
    ADD COLUMN count_head SimpleAggregateFunction(sum, UInt64),
    ADD COLUMN count_no_sec_fetch SimpleAggregateFunction(sum, UInt64),
    ADD COLUMN count_generic_accept SimpleAggregateFunction(sum, UInt64),
    ADD COLUMN count_http10 SimpleAggregateFunction(sum, UInt64);
```
### New features in `view_ai_features_1h`
| Feature | Formula | Model |
|---------|---------|--------|
| `ja3_diversity_ratio` | `uniq_ja3 / greatest(uniq_ja4, 1)` | Complet |
| `syn_timing_cv` | `sqrt(tcp_jitter_variance) / greatest(avg_syn_ms, 1)` | Complet |
| `tls12_ratio` | `tls12_count / greatest(hits, 1)` | Complet |
| `head_ratio` | `count_head / greatest(hits, 1)` | Both |
| `sec_fetch_absence_rate` | `count_no_sec_fetch / greatest(hits, 1)` | Both |
| `generic_accept_ratio` | `count_generic_accept / greatest(hits, 1)` | Both |
| `http10_ratio` | `count_http10 / greatest(hits, 1)` | Both |
> ⚠️ Columns added with ALTER are not backfilled for historical data. A backfill from `http_logs` will be required.
> ⚠️ The MV `mv_agg_host_ip_ja4_1h` must be **recreated** (or its SELECT changed with `ALTER TABLE ... MODIFY QUERY`) to include the new fields; adding columns to the target table is not enough.

@ -0,0 +1,339 @@
-- ============================================================================
-- ANUBIS CRAWLER RULES — http_logs labeling + ML pipeline
-- Architecture:
-- anubis_ua_rules (table) → dict_anubis_ua (REGEXP_TREE)
-- anubis_ip_rules (table) → dict_anubis_ip (IP_TRIE)
-- http_logs : +anubis_bot_name, +anubis_bot_action
-- mv_http_logs : rebuilt with Anubis enrichment
-- view_ai_features_1h : +anubis_bot_name, +anubis_bot_action (via dictGet)
-- ml_detected_anomalies : +anubis_bot_name, +anubis_bot_action
-- ml_all_scores : +anubis_bot_name, +anubis_bot_action
-- ============================================================================
-- ----------------------------------------------------------------------------
-- 1. SOURCE TABLE — User-Agent rules (for the REGEXP_TREE dictionary)
--
-- Format expected by ClickHouse regexp_tree (v23.5+):
-- id UInt64 : unique identifier
-- parent_id UInt64 : 0 = root, otherwise id of the parent (attribute inheritance)
-- regexp String : regular expression (re2/vectorscan)
-- keys Array(String) : attribute names, e.g. ['bot_name', 'action']
-- values Array(String) : matching values
--
-- Hierarchy used for priority:
-- Generic DENY rules (parent_id=0) → specific ALLOW children
-- Example: ai-crawlers-training (parent) → openai-gptbot (child)
-- When the UA matches both child AND parent, the child's name is
-- returned (the child inherits AND overrides the parent's attributes).
-- ----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS mabase_prod.anubis_ua_rules
(
id UInt64,
parent_id UInt64,
regexp String,
keys Array(String),
values Array(String)
)
ENGINE = ReplacingMergeTree()
ORDER BY id;
-- ----------------------------------------------------------------------------
-- 2. SOURCE TABLE — IP/CIDR rules (for the IP_TRIE dictionary)
--
-- Columns required by dict_anubis_ip and mv_http_logs:
-- rule_id : rule identifier, cross-referenced with dict_anubis_ua for
-- the UA+IP logic (same rule_id → combined match)
-- has_ua : 1 if the rule also has a UA regex (cross-check required)
-- category : Anubis category (bots, crawlers, clients, policies…)
-- ----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS mabase_prod.anubis_ip_rules
(
prefix String,
bot_name LowCardinality(String),
action LowCardinality(String),
rule_id UInt64,
has_ua UInt8,
category LowCardinality(String)
)
ENGINE = ReplacingMergeTree()
ORDER BY prefix;
-- ----------------------------------------------------------------------------
-- 3. UA DICTIONARY — REGEXP_TREE
-- dictGet('mabase_prod.dict_anubis_ua', 'bot_name', header_user_agent)
--
-- The PRIMARY KEY is 'regexp' (String), as required by ClickHouse 26.x.
-- Internal connection (HOST localhost PORT 9000) to avoid an HTTP deadlock.
-- Replace 'admin' and the password with the actual ClickHouse credentials.
-- ----------------------------------------------------------------------------
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ua;
CREATE DICTIONARY mabase_prod.dict_anubis_ua
(
regexp String,
bot_name String,
action String
)
PRIMARY KEY regexp
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'admin' PASSWORD 'CHANGE_ME' DB 'mabase_prod' TABLE 'anubis_ua_rules'))
LAYOUT(REGEXP_TREE)
LIFETIME(MIN 300 MAX 600);
-- ----------------------------------------------------------------------------
-- 4. IP DICTIONARY — IP_TRIE
-- dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', toIPv6(src_ip), '')
-- Internal connection (HOST localhost PORT 9000), for the same reason as dict_anubis_ua.
-- ----------------------------------------------------------------------------
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ip;
CREATE DICTIONARY mabase_prod.dict_anubis_ip
(
prefix String,
bot_name String,
action String,
rule_id UInt64,
has_ua UInt8,
category String
)
PRIMARY KEY prefix
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'admin' PASSWORD 'CHANGE_ME' DB 'mabase_prod' TABLE 'anubis_ip_rules'))
LAYOUT(IP_TRIE())
LIFETIME(MIN 300 MAX 600);
-- ----------------------------------------------------------------------------
-- 5. SOURCE TABLE — ASN rules (for the Flat dictionary)
-- Fed from botPolicies.yaml by fetch_rules.py → insert_asn_rules()
-- ----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS mabase_prod.anubis_asn_rules
(
asn UInt32,
bot_name LowCardinality(String),
action LowCardinality(String),
category LowCardinality(String)
)
ENGINE = ReplacingMergeTree()
ORDER BY asn;
-- ----------------------------------------------------------------------------
-- 6. SOURCE TABLE — ISO-3166 country rules (for the country dictionary)
-- Fed from botPolicies.yaml by fetch_rules.py → insert_country_rules()
-- ----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS mabase_prod.anubis_country_rules
(
country_code LowCardinality(String),
bot_name LowCardinality(String),
action LowCardinality(String),
category LowCardinality(String)
)
ENGINE = ReplacingMergeTree()
ORDER BY country_code;
-- ----------------------------------------------------------------------------
-- 7. ASN DICTIONARY — Flat
-- dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', src_asn, '')
-- ----------------------------------------------------------------------------
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_asn;
CREATE DICTIONARY mabase_prod.dict_anubis_asn
(
asn UInt32,
bot_name String,
action String,
category String
)
PRIMARY KEY asn
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'admin' PASSWORD 'CHANGE_ME' DB 'mabase_prod' TABLE 'anubis_asn_rules'))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);
-- ----------------------------------------------------------------------------
-- 8. COUNTRY DICTIONARY — Complex-key hashed
-- The key is a String, which the FLAT layout (unsigned integer keys only) cannot index.
-- dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', tuple(src_country_code), '')
-- ----------------------------------------------------------------------------
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_country;
CREATE DICTIONARY mabase_prod.dict_anubis_country
(
    country_code String,
    bot_name String,
    action String,
    category String
)
PRIMARY KEY country_code
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'admin' PASSWORD 'CHANGE_ME' DB 'mabase_prod' TABLE 'anubis_country_rules'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 300 MAX 600);
-- ----------------------------------------------------------------------------
-- 9. ADD THE ANUBIS COLUMNS to http_logs
-- Idempotent: does not fail if the columns already exist
-- ----------------------------------------------------------------------------
ALTER TABLE mabase_prod.http_logs
ADD COLUMN IF NOT EXISTS anubis_bot_name LowCardinality(String) DEFAULT '',
ADD COLUMN IF NOT EXISTS anubis_bot_action LowCardinality(String) DEFAULT '',
ADD COLUMN IF NOT EXISTS anubis_bot_category LowCardinality(String) DEFAULT '';
-- ----------------------------------------------------------------------------
-- 10. REBUILD mv_http_logs with Anubis enrichment
-- Priority logic:
-- 1. UA regex (most informative: identifies the exact bot)
-- 2. IP/CIDR (fallback: identifies the cloud network)
-- ----------------------------------------------------------------------------
DROP VIEW IF EXISTS mabase_prod.mv_http_logs;
CREATE MATERIALIZED VIEW mabase_prod.mv_http_logs
TO mabase_prod.http_logs
(
`time` DateTime,
`log_date` Date,
`src_ip` IPv4,
`src_port` UInt16,
`src_asn` UInt32,
`src_country_code` String,
`dst_ip` IPv4,
`dst_port` UInt16,
`src_as_name` String,
`src_org` String,
`src_domain` String,
`method` String,
`scheme` String,
`host` String,
`path` String,
`query` String,
`http_version` String,
`orphan_side` String,
`correlated` UInt8,
`keepalives` UInt16,
`a_timestamp` UInt64,
`b_timestamp` UInt64,
`conn_id` String,
`ip_meta_df` UInt8,
`ip_meta_id` UInt16,
`ip_meta_total_length` UInt16,
`ip_meta_ttl` UInt8,
`tcp_meta_options` String,
`tcp_meta_window_size` UInt32,
`tcp_meta_mss` UInt16,
`tcp_meta_window_scale` UInt8,
`syn_to_clienthello_ms` Int32,
`tls_version` String,
`tls_sni` String,
`tls_alpn` String,
`ja3` String,
`ja3_hash` String,
`ja4` String,
`client_headers` String,
`header_user_agent` String,
`header_accept` String,
`header_accept_encoding` String,
`header_accept_language` String,
`header_content_type` String,
`header_x_request_id` String,
`header_x_trace_id` String,
`header_x_forwarded_for` String,
`header_sec_ch_ua` String,
`header_sec_ch_ua_mobile` String,
`header_sec_ch_ua_platform` String,
`header_sec_fetch_dest` String,
`header_sec_fetch_mode` String,
`header_sec_fetch_site` String,
`anubis_bot_name` String,
`anubis_bot_action` String
)
AS SELECT
parseDateTimeBestEffort(coalesce(JSONExtractString(raw_json, 'time'), '1970-01-01T00:00:00Z')) AS time,
toDate(time) AS log_date,
toIPv4(coalesce(JSONExtractString(raw_json, 'src_ip'), '0.0.0.0')) AS src_ip,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'src_port'), 0)) AS src_port,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'asn', toIPv6(src_ip), toUInt32(0)) AS src_asn,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'country_code', toIPv6(src_ip), '') AS src_country_code,
toIPv4(coalesce(JSONExtractString(raw_json, 'dst_ip'), '0.0.0.0')) AS dst_ip,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'dst_port'), 0)) AS dst_port,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'name', toIPv6(src_ip), '') AS src_as_name,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'org', toIPv6(src_ip), '') AS src_org,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'domain', toIPv6(src_ip), '') AS src_domain,
coalesce(JSONExtractString(raw_json, 'method'), '') AS method,
coalesce(JSONExtractString(raw_json, 'scheme'), '') AS scheme,
coalesce(JSONExtractString(raw_json, 'host'), '') AS host,
coalesce(JSONExtractString(raw_json, 'path'), '') AS path,
coalesce(JSONExtractString(raw_json, 'query'), '') AS query,
coalesce(JSONExtractString(raw_json, 'http_version'), '') AS http_version,
coalesce(JSONExtractString(raw_json, 'orphan_side'), '') AS orphan_side,
toUInt8(coalesce(JSONExtractBool(raw_json, 'correlated'), 0)) AS correlated,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'keepalives'), 0)) AS keepalives,
coalesce(JSONExtractUInt(raw_json, 'a_timestamp'), 0) AS a_timestamp,
coalesce(JSONExtractUInt(raw_json, 'b_timestamp'), 0) AS b_timestamp,
coalesce(JSONExtractString(raw_json, 'conn_id'), '') AS conn_id,
toUInt8(coalesce(JSONExtractBool(raw_json, 'ip_meta_df'), 0)) AS ip_meta_df,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'ip_meta_id'), 0)) AS ip_meta_id,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'ip_meta_total_length'), 0)) AS ip_meta_total_length,
toUInt8(coalesce(JSONExtractUInt(raw_json, 'ip_meta_ttl'), 0)) AS ip_meta_ttl,
coalesce(JSONExtractString(raw_json, 'tcp_meta_options'), '') AS tcp_meta_options,
toUInt32(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_window_size'), 0)) AS tcp_meta_window_size,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_mss'), 0)) AS tcp_meta_mss,
toUInt8(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_window_scale'), 0)) AS tcp_meta_window_scale,
toInt32(coalesce(JSONExtractInt(raw_json, 'syn_to_clienthello_ms'), 0)) AS syn_to_clienthello_ms,
coalesce(JSONExtractString(raw_json, 'tls_version'), '') AS tls_version,
coalesce(JSONExtractString(raw_json, 'tls_sni'), '') AS tls_sni,
coalesce(JSONExtractString(raw_json, 'tls_alpn'), '') AS tls_alpn,
coalesce(JSONExtractString(raw_json, 'ja3'), '') AS ja3,
coalesce(JSONExtractString(raw_json, 'ja3_hash'), '') AS ja3_hash,
coalesce(JSONExtractString(raw_json, 'ja4'), '') AS ja4,
coalesce(JSONExtractString(raw_json, 'client_headers'), '') AS client_headers,
coalesce(JSONExtractString(raw_json, 'header_User-Agent'), '') AS header_user_agent,
coalesce(JSONExtractString(raw_json, 'header_Accept'), '') AS header_accept,
coalesce(JSONExtractString(raw_json, 'header_Accept-Encoding'), '') AS header_accept_encoding,
coalesce(JSONExtractString(raw_json, 'header_Accept-Language'), '') AS header_accept_language,
coalesce(JSONExtractString(raw_json, 'header_Content-Type'), '') AS header_content_type,
coalesce(JSONExtractString(raw_json, 'header_X-Request-Id'), '') AS header_x_request_id,
coalesce(JSONExtractString(raw_json, 'header_X-Trace-Id'), '') AS header_x_trace_id,
coalesce(JSONExtractString(raw_json, 'header_X-Forwarded-For'), '') AS header_x_forwarded_for,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA'), '') AS header_sec_ch_ua,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA-Mobile'), '') AS header_sec_ch_ua_mobile,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA-Platform'), '') AS header_sec_ch_ua_platform,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Dest'), '') AS header_sec_fetch_dest,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Mode'), '') AS header_sec_fetch_mode,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Site'), '') AS header_sec_fetch_site,
-- ── Anubis enrichment ────────────────────────────────────────────────────
-- Priority: UA regex > IP/CIDR (the UA identifies the bot precisely)
COALESCE(
nullIf(dictGet('mabase_prod.dict_anubis_ua', 'bot_name',
coalesce(JSONExtractString(raw_json, 'header_User-Agent'), '')), ''),
nullIf(dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name',
toIPv6(toIPv4(coalesce(JSONExtractString(raw_json, 'src_ip'), '0.0.0.0'))), ''), ''),
''
) AS anubis_bot_name,
COALESCE(
nullIf(dictGet('mabase_prod.dict_anubis_ua', 'action',
coalesce(JSONExtractString(raw_json, 'header_User-Agent'), '')), ''),
nullIf(dictGetOrDefault('mabase_prod.dict_anubis_ip', 'action',
toIPv6(toIPv4(coalesce(JSONExtractString(raw_json, 'src_ip'), '0.0.0.0'))), ''), ''),
''
) AS anubis_bot_action
FROM mabase_prod.http_logs_raw;
-- ============================================================================
-- ML INTEGRATION — Propagating Anubis labels to the bot_detector pipeline
-- ============================================================================
-- ----------------------------------------------------------------------------
-- 11. ANUBIS COLUMNS in ml_detected_anomalies
-- ----------------------------------------------------------------------------
ALTER TABLE mabase_prod.ml_detected_anomalies
ADD COLUMN IF NOT EXISTS anubis_bot_name LowCardinality(String) DEFAULT '',
ADD COLUMN IF NOT EXISTS anubis_bot_action LowCardinality(String) DEFAULT '',
ADD COLUMN IF NOT EXISTS anubis_bot_category LowCardinality(String) DEFAULT '';
-- ----------------------------------------------------------------------------
-- 12. ANUBIS COLUMNS in ml_all_scores
-- ----------------------------------------------------------------------------
ALTER TABLE mabase_prod.ml_all_scores
ADD COLUMN IF NOT EXISTS anubis_bot_name LowCardinality(String) DEFAULT '',
ADD COLUMN IF NOT EXISTS anubis_bot_action LowCardinality(String) DEFAULT '',
ADD COLUMN IF NOT EXISTS anubis_bot_category LowCardinality(String) DEFAULT '';
-- ----------------------------------------------------------------------------
-- 13. VIEW view_ai_features_1h — Anubis enrichment
-- Adds anubis_bot_name and anubis_bot_action via dictGet.
-- Priority: UA regex (first_ua → dict_anubis_ua) > IP/CIDR (src_ip → dict_anubis_ip)
-- See the full file in /tmp/update_view_ai_features.sql, or recreate the view
-- with CREATE OR REPLACE VIEW after applying the previous steps.
-- ----------------------------------------------------------------------------
-- NOTE: run the content of /tmp/update_view_ai_features.sql here (too long to inline).
-- Or run it from the repo: clickhouse-client --queries-file bot_detector/anubis/view_ai_features_anubis.sql

@ -0,0 +1,486 @@
#!/usr/bin/env python3
"""
fetch_rules.py: fetches ALL Anubis rules from GitHub and inserts them into ClickHouse.

Sources:
  - data/bots/**/*.yaml (pathological bots, AI, IRC)
  - data/crawlers/*.yaml (legitimate crawlers and clouds)
  - data/clients/*.yaml (AI clients acting on behalf of users)
  - data/common/*.yaml (common rules: private IPs, etc.)
  - data/botPolicies.yaml (inline ASN and country rules)

Usage (from the dashboard_web container):
  python /tmp/fetch_rules.py

Environment variables:
  CLICKHOUSE_HOST, CLICKHOUSE_DB, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD
"""
import json
import os
import re
import sys
import urllib.request
import urllib.error
try:
    import yaml
except ImportError:
    print("[ERROR] pyyaml is missing.", file=sys.stderr)
    sys.exit(1)
try:
    import clickhouse_connect
except ImportError:
    print("[ERROR] clickhouse-connect is missing.", file=sys.stderr)
    sys.exit(1)
# ──────────────────────────────────────────────────────────────────────────────
# Config
# ──────────────────────────────────────────────────────────────────────────────
GITHUB_API = "https://api.github.com/repos/TecharoHQ/anubis/contents"
GITHUB_RAW = "https://raw.githubusercontent.com/TecharoHQ/anubis/main"
# Directories to walk. ORDER IS CRITICAL for REGEXP_TREE:
# among root-level rules, the rule with the lowest ID wins when several match.
# → SPECIFIC rules must be loaded FIRST (low IDs) so they win over the catch-alls.
# → Catch-alls (ai-robots-txt, ai-catchall) must be loaded LAST (high IDs).
#
# Within each directory, files are sorted in REVERSE ALPHABETICAL ORDER. The sort is
# by name, not by length: it works because the catch-all files (ai*.yaml) sort first
# alphabetically, so reversing puts them after the specific rules.
DIRECTORIES = [
    ("data/clients", "clients"),              # AI client rules with IPs (openai-chatgpt-user, etc.)
    ("data/bots/irc-bots", "bots/irc-bots"),  # Specific IRC bots
    ("data/crawlers", "crawlers"),            # Specific crawlers + clouds
    ("data/common", "common"),                # Private IPs, common routes
    ("data/bots", "bots"),                    # Broad catch-alls (ai-robots-txt, ai-catchall), LAST
]
# Main policy file (inline ASN + country rules)
BOT_POLICIES_PATH = "data/botPolicies.yaml"
# UA_PARENT_OVERRIDE: maps rule_name → parent_name to force the REGEXP_TREE hierarchy.
# Intentionally left empty: the load order (specific before catch-all)
# guarantees priority without an explicit parent_id hierarchy.
# Populate this dict if a rule must inherit from another one through parent_id.
UA_PARENT_OVERRIDE: dict[str, str] = {}
# ──────────────────────────────────────────────────────────────────────────────
# HTTP helpers
# ──────────────────────────────────────────────────────────────────────────────
def _fetch_url(url: str, timeout: int = 15) -> str | None:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.URLError as e:
        print(f"[WARN] {url}: {e}", file=sys.stderr)
        return None


def fetch_yaml_url(url: str) -> list | dict | None:
    content = _fetch_url(url)
    if content:
        return yaml.safe_load(content)
    return None
def list_yaml_files(api_path: str) -> list[str]:
    """
    Return the raw URLs of the .yaml/.yml files in api_path, listed via the GitHub API.
    Files are sorted in REVERSE ALPHABETICAL order so that specific rules
    (long names, e.g. openai-chatgpt-user.yaml) get a lower ID than the catch-alls (ai.yaml).
    """
    content = _fetch_url(f"{GITHUB_API}/{api_path}")
    if not content:
        return []
    try:
        entries = json.loads(content)
    except json.JSONDecodeError:
        return []
    # The contents API returns a JSON object (not a list) when api_path is a file.
    if not isinstance(entries, list):
        return []
    files = [
        entry for entry in entries
        if entry.get("type") == "file" and entry.get("name", "").endswith((".yaml", ".yml"))
    ]
    # Reverse sort: long (specific) names before short (catch-all) names
    files.sort(key=lambda e: e["name"], reverse=True)
    return [f["download_url"] for f in files]
# ──────────────────────────────────────────────────────────────────────────────
# Extracting UA patterns from CEL-like expressions
# ──────────────────────────────────────────────────────────────────────────────
def _extract_ua_from_all(conditions: list) -> str | None:
    """Extract a UA regex from an 'all' expression (e.g. yandexbot userAgent.matches)."""
    for cond in conditions:
        if not isinstance(cond, str):
            continue
        m = re.search(r'userAgent\.matches\("(.+?)"\)', cond)
        if m:
            return m.group(1).replace("\\\\", "\\")
    return None


def _extract_ua_from_any(conditions: list) -> str | None:
    """
    Extract a UA regex from an 'any' expression using userAgent.contains(...).
    Example: aggressive-brazilian-scrapers.yaml
    Returns an OR regex: MSIE|Trident|...
    """
    patterns = []
    for cond in conditions:
        if not isinstance(cond, str):
            continue
        m = re.search(r'userAgent\.contains\("(.+?)"\)', cond)
        if m:
            patterns.append(re.escape(m.group(1)))
    if patterns:
        return "|".join(patterns)
    return None


def extract_ua_regex(rule: dict) -> str | None:
    """Extract the User-Agent regex from every supported rule form."""
    # Direct form
    if ua := rule.get("user_agent_regex"):
        return ua.strip()
    expr = rule.get("expression")
    if not expr:
        return None
    # Scalar expression (CEL string)
    if isinstance(expr, str):
        m = re.search(r'userAgent\.matches\("(.+?)"\)', expr)
        if m:
            return m.group(1).replace("\\\\", "\\")
        m = re.search(r'userAgent\.contains\("(.+?)"\)', expr)
        if m:
            return re.escape(m.group(1))
        return None
    # Structured expression (dict)
    if isinstance(expr, dict):
        if ua := _extract_ua_from_all(expr.get("all", [])):
            return ua
        if ua := _extract_ua_from_any(expr.get("any", [])):
            return ua
    return None
# ──────────────────────────────────────────────────────────────────────────────
# YAML file parsing
# ──────────────────────────────────────────────────────────────────────────────
def parse_file(
    url: str,
    category: str,
    ua_name_to_id: dict,
    ua_id_counter_ref: list,    # [int]: mutable counter shared across calls
    rule_id_counter_ref: list,  # [int]: same
) -> tuple[list[dict], list[dict]]:
    """
    Parse an Anubis YAML file.
    Returns (ua_rules, ip_rules).
    Note: ua_name_to_id is maintained to support the parent_id hierarchy in
    REGEXP_TREE (via UA_PARENT_OVERRIDE). As long as UA_PARENT_OVERRIDE is empty,
    parent_id is always 0 and ua_name_to_id is never consulted in practice.
    """
    data = fetch_yaml_url(url)
    if not data or not isinstance(data, list):
        return [], []
    ua_rules, ip_rules = [], []
    for rule in data:
        if not isinstance(rule, dict):
            continue
        # Skip imports (references to other files)
        if "import" in rule:
            continue
        name = rule.get("name", "").strip()
        action = rule.get("action", "").strip()
        if not name or not action:
            continue
        remote_addrs = [str(c).strip() for c in rule.get("remote_addresses", []) if c]
        has_ip = bool(remote_addrs)
        rule_id = rule_id_counter_ref[0]
        rule_id_counter_ref[0] += 1
        # ── User-Agent regex ─────────────────────────────────────────────────
        ua_regex = extract_ua_regex(rule)
        if ua_regex:
            parent_name = UA_PARENT_OVERRIDE.get(name)
            parent_id = ua_name_to_id.get(parent_name, 0) if parent_name else 0
            uid = ua_id_counter_ref[0]
            ua_id_counter_ref[0] += 1
            ua_name_to_id[name] = uid
            # REGEXP_TREE attribute values are stored as strings, hence "1"/"0" and str(rule_id).
            ua_rules.append({
                "id": uid,
                "parent_id": parent_id,
                "regexp": ua_regex,
                "bot_name": name,
                "action": action,
                "has_ip": "1" if has_ip else "0",
                "rule_id": str(rule_id),
                "category": category,
            })
        # ── IP/CIDR ranges ───────────────────────────────────────────────────
        has_ua = bool(ua_regex)
        for cidr in remote_addrs:
            ip_rules.append({
                "prefix": cidr,
                "bot_name": name,
                "action": action,
                "rule_id": rule_id,
                "has_ua": 1 if has_ua else 0,
                "category": category,
            })
    return ua_rules, ip_rules
def parse_bot_policies_inline(url: str) -> tuple[list[dict], list[dict]]:
    """
    Parse botPolicies.yaml for inline rules using geoip.countries and asns.match.
    Returns (asn_rules, country_rules).
    """
    data = fetch_yaml_url(url)
    if not data or not isinstance(data, dict):
        return [], []
    asn_rules: list[dict] = []
    country_rules: list[dict] = []
    for rule in data.get("bots", []):
        if not isinstance(rule, dict):
            continue
        if "import" in rule:
            continue
        name = rule.get("name", "").strip()
        action = rule.get("action", "").strip()
        if not name or not action:
            continue
        # ASN rules
        asns = rule.get("asns", {})
        if isinstance(asns, dict):
            for asn in asns.get("match", []):
                asn_rules.append({
                    "asn": int(asn),
                    "bot_name": name,
                    "action": action,
                    "category": "policies",
                })
        # Country rules
        geoip = rule.get("geoip", {})
        if isinstance(geoip, dict):
            for cc in geoip.get("countries", []):
                country_rules.append({
                    "country_code": str(cc).upper(),
                    "bot_name": name,
                    "action": action,
                    "category": "policies",
                })
    return asn_rules, country_rules
# ──────────────────────────────────────────────────────────────────────────────
# Collecting all rules
# ──────────────────────────────────────────────────────────────────────────────
def collect_all_rules() -> tuple[list, list, list, list]:
    """Return (ua_rules, ip_rules, asn_rules, country_rules)."""
    ua_name_to_id: dict[str, int] = {}
    ua_id_counter_ref: list[int] = [1]
    rule_id_counter: list[int] = [1]
    all_ua: list[dict] = []
    all_ip: list[dict] = []
    for api_path, category in DIRECTORIES:
        print(f"[INFO] Walking {api_path} ({category})…")
        file_urls = list_yaml_files(api_path)
        print(f" {len(file_urls)} files found")
        for url in file_urls:
            ua, ip = parse_file(url, category, ua_name_to_id, ua_id_counter_ref, rule_id_counter)
            all_ua.extend(ua)
            all_ip.extend(ip)
    # ASN + country rules from botPolicies.yaml
    print("[INFO] Reading botPolicies.yaml…")
    policies_url = f"{GITHUB_RAW}/{BOT_POLICIES_PATH}"
    asn_rules, country_rules = parse_bot_policies_inline(policies_url)
    return all_ua, all_ip, asn_rules, country_rules
# ──────────────────────────────────────────────────────────────────────────────
# ClickHouse
# ──────────────────────────────────────────────────────────────────────────────
def get_ch_client():
    return clickhouse_connect.get_client(
        host=os.environ.get("CLICKHOUSE_HOST", "clickhouse"),
        database=os.environ.get("CLICKHOUSE_DB", "mabase_prod"),
        username=os.environ.get("CLICKHOUSE_USER", "admin"),
        password=os.environ.get("CLICKHOUSE_PASSWORD", ""),
    )
def insert_ua_rules(client, rules: list[dict]) -> None:
    if not rules:
        print("[INFO] No UA rules.")
        return
    client.command("TRUNCATE TABLE mabase_prod.anubis_ua_rules")
    # REGEXP_TREE format: id, parent_id, regexp, keys[], values[]
    # keys = ['bot_name', 'action', 'has_ip', 'rule_id', 'category']
    data = [
        [
            r["id"], r["parent_id"], r["regexp"],
            ["bot_name", "action", "has_ip", "rule_id", "category"],
            [r["bot_name"], r["action"], r["has_ip"], r["rule_id"], r["category"]],
        ]
        for r in rules
    ]
    client.insert("mabase_prod.anubis_ua_rules", data,
                  column_names=["id", "parent_id", "regexp", "keys", "values"])
    print(f"[OK] {len(rules)} UA rules inserted.")


def insert_ip_rules(client, rules: list[dict]) -> None:
    if not rules:
        print("[INFO] No IP rules.")
        return
    client.command("TRUNCATE TABLE mabase_prod.anubis_ip_rules")
    data = [
        [r["prefix"], r["bot_name"], r["action"],
         r["rule_id"], r["has_ua"], r["category"]]
        for r in rules
    ]
    client.insert("mabase_prod.anubis_ip_rules", data,
                  column_names=["prefix", "bot_name", "action", "rule_id", "has_ua", "category"])
    print(f"[OK] {len(rules)} IP rules inserted.")


def insert_asn_rules(client, rules: list[dict]) -> None:
    if not rules:
        print("[INFO] No ASN rules.")
        return
    client.command("TRUNCATE TABLE mabase_prod.anubis_asn_rules")
    data = [[r["asn"], r["bot_name"], r["action"], r["category"]] for r in rules]
    client.insert("mabase_prod.anubis_asn_rules", data,
                  column_names=["asn", "bot_name", "action", "category"])
    print(f"[OK] {len(rules)} ASN rules inserted.")


def insert_country_rules(client, rules: list[dict]) -> None:
    if not rules:
        print("[INFO] No country rules.")
        return
    client.command("TRUNCATE TABLE mabase_prod.anubis_country_rules")
    data = [[r["country_code"], r["bot_name"], r["action"], r["category"]] for r in rules]
    client.insert("mabase_prod.anubis_country_rules", data,
                  column_names=["country_code", "bot_name", "action", "category"])
    print(f"[OK] {len(rules)} country rules inserted.")
def reload_dicts(client) -> None:
    dicts = [
        "mabase_prod.dict_anubis_ua",
        "mabase_prod.dict_anubis_ip",
        "mabase_prod.dict_anubis_asn",
        "mabase_prod.dict_anubis_country",
    ]
    for d in dicts:
        try:
            client.command(f"SYSTEM RELOAD DICTIONARY {d}")
            print(f"[OK] {d} reloaded.")
        except Exception as e:
            print(f"[WARN] Reloading {d}: {e}", file=sys.stderr)
# ──────────────────────────────────────────────────────────────────────────────
# Report
# ──────────────────────────────────────────────────────────────────────────────
def print_summary(ua_rules, ip_rules, asn_rules, country_rules):
    print("\n── UA rules ──")
    by_cat: dict[str, list] = {}
    for r in ua_rules:
        by_cat.setdefault(r["category"], []).append(r)
    for cat, rules in sorted(by_cat.items()):
        print(f" [{cat}] {len(rules)} rule(s)")
        for r in rules[:5]:
            has = " [+IP]" if r["has_ip"] == "1" else ""
            par = f" [parent={r['parent_id']}]" if r["parent_id"] else ""
            print(f" [{r['action']:9s}] {r['bot_name']}{has}{par}: {r['regexp'][:50]}")
        if len(rules) > 5:
            print(f" … and {len(rules) - 5} more")
    print(f"\n── IP rules: {len(ip_rules)} CIDRs ──")
    by_bot: dict[str, list] = {}
    for r in ip_rules:
        by_bot.setdefault(r["bot_name"], []).append(r)
    for bot, rs in sorted(by_bot.items())[:15]:
        print(f" [{rs[0]['action']:9s}] {bot}: {len(rs)} CIDRs (cat={rs[0]['category']}, has_ua={rs[0]['has_ua']})")
    if len(by_bot) > 15:
        print(f" … and {len(by_bot) - 15} more bots")
    if asn_rules:
        print(f"\n── ASN rules: {len(asn_rules)} ──")
        for r in asn_rules:
            print(f" [{r['action']:9s}] ASN {r['asn']}: {r['bot_name']}")
    if country_rules:
        print(f"\n── Country rules: {len(country_rules)} ──")
        for r in country_rules:
            print(f" [{r['action']:9s}] {r['country_code']}: {r['bot_name']}")
# ──────────────────────────────────────────────────────────────────────────────
# Main
# ──────────────────────────────────────────────────────────────────────────────
def main() -> None:
    print("[INFO] Collecting Anubis rules from GitHub…")
    ua_rules, ip_rules, asn_rules, country_rules = collect_all_rules()
    total = len(ua_rules) + len(ip_rules) + len(asn_rules) + len(country_rules)
    print(f"\n[INFO] {len(ua_rules)} UA rules, {len(ip_rules)} IP CIDRs, "
          f"{len(asn_rules)} ASNs, {len(country_rules)} countries (total={total})")
    if total == 0:
        print("[ERROR] No rules retrieved.", file=sys.stderr)
        sys.exit(1)
    print_summary(ua_rules, ip_rules, asn_rules, country_rules)
    print("\n[INFO] Connecting to ClickHouse…")
    client = get_ch_client()
    insert_ua_rules(client, ua_rules)
    insert_ip_rules(client, ip_rules)
    insert_asn_rules(client, asn_rules)
    insert_country_rules(client, country_rules)
    reload_dicts(client)
    print("\n[OK] Anubis rules loaded successfully.")


if __name__ == "__main__":
    main()
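
The ordering rule described in the loader's configuration comments (directories in list order, files in descending name order, IDs handed out sequentially) can be sketched in isolation. This is a minimal illustration, not part of the script: `assign_ids` and the file listing are hypothetical, and no network or ClickHouse access is involved.

```python
def assign_ids(directories):
    """Hand out increasing IDs: directory order first, then descending file name."""
    ids = {}
    next_id = 1
    for _category, files in directories:
        for name in sorted(files, reverse=True):
            ids[name] = next_id
            next_id += 1
    return ids


# Hypothetical listing: a specific rule and a catch-all in the same directory,
# followed by a broad catch-all directory loaded last.
listing = [
    ("clients", ["ai.yaml", "openai-chatgpt-user.yaml"]),
    ("bots", ["ai-catchall.yaml"]),
]
ids = assign_ids(listing)
for name, rule_id in sorted(ids.items(), key=lambda kv: kv[1]):
    print(rule_id, name)
```

Note that the descending sort depends on the names themselves rather than on their length, so a new specific file whose name sorts before a catch-all's would need a closer look.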

@ -0,0 +1,126 @@
CREATE MATERIALIZED VIEW mabase_prod.mv_http_logs
TO mabase_prod.http_logs
AS
WITH
coalesce(JSONExtractString(raw_json, 'header_User-Agent'), '') AS _ua,
toIPv6(toIPv4(coalesce(JSONExtractString(raw_json, 'src_ip'), '0.0.0.0'))) AS _ip,
toUInt32(dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'asn', _ip, toUInt32(0))) AS _asn,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'country_code', _ip, '') AS _cc
SELECT
parseDateTimeBestEffort(coalesce(JSONExtractString(raw_json, 'time'), '1970-01-01T00:00:00Z')) AS time,
toDate(time) AS log_date,
toIPv4(coalesce(JSONExtractString(raw_json, 'src_ip'), '0.0.0.0')) AS src_ip,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'src_port'), 0)) AS src_port,
_asn AS src_asn,
_cc AS src_country_code,
toIPv4(coalesce(JSONExtractString(raw_json, 'dst_ip'), '0.0.0.0')) AS dst_ip,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'dst_port'), 0)) AS dst_port,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'name', _ip, '') AS src_as_name,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'org', _ip, '') AS src_org,
dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'domain', _ip, '') AS src_domain,
coalesce(JSONExtractString(raw_json, 'method'), '') AS method,
coalesce(JSONExtractString(raw_json, 'scheme'), '') AS scheme,
coalesce(JSONExtractString(raw_json, 'host'), '') AS host,
coalesce(JSONExtractString(raw_json, 'path'), '') AS path,
coalesce(JSONExtractString(raw_json, 'query'), '') AS query,
coalesce(JSONExtractString(raw_json, 'http_version'), '') AS http_version,
coalesce(JSONExtractString(raw_json, 'orphan_side'), '') AS orphan_side,
toUInt8(coalesce(JSONExtractBool(raw_json, 'correlated'), 0)) AS correlated,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'keepalives'), 0)) AS keepalives,
coalesce(JSONExtractUInt(raw_json, 'a_timestamp'), 0) AS a_timestamp,
coalesce(JSONExtractUInt(raw_json, 'b_timestamp'), 0) AS b_timestamp,
coalesce(JSONExtractString(raw_json, 'conn_id'), '') AS conn_id,
toUInt8(coalesce(JSONExtractBool(raw_json, 'ip_meta_df'), 0)) AS ip_meta_df,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'ip_meta_id'), 0)) AS ip_meta_id,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'ip_meta_total_length'), 0)) AS ip_meta_total_length,
toUInt8(coalesce(JSONExtractUInt(raw_json, 'ip_meta_ttl'), 0)) AS ip_meta_ttl,
coalesce(JSONExtractString(raw_json, 'tcp_meta_options'), '') AS tcp_meta_options,
toUInt32(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_window_size'), 0)) AS tcp_meta_window_size,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_mss'), 0)) AS tcp_meta_mss,
toUInt8(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_window_scale'), 0)) AS tcp_meta_window_scale,
toInt32(coalesce(JSONExtractInt(raw_json, 'syn_to_clienthello_ms'), 0)) AS syn_to_clienthello_ms,
coalesce(JSONExtractString(raw_json, 'tls_version'), '') AS tls_version,
coalesce(JSONExtractString(raw_json, 'tls_sni'), '') AS tls_sni,
coalesce(JSONExtractString(raw_json, 'tls_alpn'), '') AS tls_alpn,
coalesce(JSONExtractString(raw_json, 'ja3'), '') AS ja3,
coalesce(JSONExtractString(raw_json, 'ja3_hash'), '') AS ja3_hash,
coalesce(JSONExtractString(raw_json, 'ja4'), '') AS ja4,
coalesce(JSONExtractString(raw_json, 'client_headers'), '') AS client_headers,
coalesce(JSONExtractString(raw_json, 'header_User-Agent'), '') AS header_user_agent,
coalesce(JSONExtractString(raw_json, 'header_Accept'), '') AS header_accept,
coalesce(JSONExtractString(raw_json, 'header_Accept-Encoding'), '') AS header_accept_encoding,
coalesce(JSONExtractString(raw_json, 'header_Accept-Language'), '') AS header_accept_language,
coalesce(JSONExtractString(raw_json, 'header_Content-Type'), '') AS header_content_type,
coalesce(JSONExtractString(raw_json, 'header_X-Request-Id'), '') AS header_x_request_id,
coalesce(JSONExtractString(raw_json, 'header_X-Trace-Id'), '') AS header_x_trace_id,
coalesce(JSONExtractString(raw_json, 'header_X-Forwarded-For'), '') AS header_x_forwarded_for,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA'), '') AS header_sec_ch_ua,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA-Mobile'), '') AS header_sec_ch_ua_mobile,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA-Platform'), '') AS header_sec_ch_ua_platform,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Dest'), '') AS header_sec_fetch_dest,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Mode'), '') AS header_sec_fetch_mode,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Site'), '') AS header_sec_fetch_site,
-- Anubis enrichment: combined UA+IP matching logic
-- Priority: (1) UA+IP [same rule_id] > (2) UA only > (3) IP only > (4) ASN > (5) Country
CASE
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', _ua) = '1'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua) != ''
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '') != ''
AND toUInt64OrZero(dictGet('mabase_prod.dict_anubis_ua', 'rule_id', _ua))
= dictGetOrDefault('mabase_prod.dict_anubis_ip', 'rule_id', _ip, toUInt64(0))
THEN dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua)
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', _ua) = '0'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua) != ''
THEN dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua)
WHEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'has_ua', _ip, toUInt8(0)) = 0
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', _asn, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', _asn, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', _cc, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', _cc, '')
ELSE ''
END AS anubis_bot_name,
CASE
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', _ua) = '1'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua) != ''
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '') != ''
AND toUInt64OrZero(dictGet('mabase_prod.dict_anubis_ua', 'rule_id', _ua))
= dictGetOrDefault('mabase_prod.dict_anubis_ip', 'rule_id', _ip, toUInt64(0))
THEN dictGet('mabase_prod.dict_anubis_ua', 'action', _ua)
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', _ua) = '0'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua) != ''
THEN dictGet('mabase_prod.dict_anubis_ua', 'action', _ua)
WHEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'has_ua', _ip, toUInt8(0)) = 0
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'action', _ip, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', _asn, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'action', _asn, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', _cc, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'action', _cc, '')
ELSE ''
END AS anubis_bot_action,
CASE
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', _ua) = '1'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua) != ''
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '') != ''
AND toUInt64OrZero(dictGet('mabase_prod.dict_anubis_ua', 'rule_id', _ua))
= dictGetOrDefault('mabase_prod.dict_anubis_ip', 'rule_id', _ip, toUInt64(0))
THEN dictGet('mabase_prod.dict_anubis_ua', 'category', _ua)
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', _ua) = '0'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', _ua) != ''
THEN dictGet('mabase_prod.dict_anubis_ua', 'category', _ua)
WHEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'has_ua', _ip, toUInt8(0)) = 0
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', _ip, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'category', _ip, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', _asn, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'category', _asn, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', _cc, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'category', _cc, '')
ELSE ''
END AS anubis_bot_category
FROM mabase_prod.http_logs_raw
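
The three CASE expressions in this view repeat one priority cascade. As a reading aid, here is a hedged Python sketch of that cascade. The function name `resolve_anubis_match` and the dict-based inputs are assumptions for illustration: each argument stands for the dictionary hit (or `None` for no hit), and booleans replace the `'1'`/`'0'` strings and 0/1 flags used in SQL.

```python
def resolve_anubis_match(ua, ip, asn, country):
    """Return the winning rule match, or None when nothing applies."""
    # (1) UA rule that also carries IP ranges: both must hit the same rule_id.
    if ua and ua["has_ip"] and ip and ua["rule_id"] == ip["rule_id"]:
        return ua
    # (2) UA-only rule.
    if ua and not ua["has_ip"]:
        return ua
    # (3) IP-only rule (an IP rule that has no UA condition).
    if ip and not ip["has_ua"]:
        return ip
    # (4) ASN rule, then (5) country rule.
    return asn or country


ua_hit = {"bot_name": "example-bot", "has_ip": True, "rule_id": 7}
ip_same = {"bot_name": "example-bot", "has_ua": True, "rule_id": 7}
ip_other = {"bot_name": "other-bot", "has_ua": True, "rule_id": 8}
asn_hit = {"bot_name": "example-asn"}

matched = resolve_anubis_match(ua_hit, ip_same, None, None)
fallback = resolve_anubis_match(ua_hit, ip_other, asn_hit, None)
```

When the UA rule requires an IP and the IP hit belongs to a different rule, neither the UA nor the IP branch fires and the lookup falls through to the ASN and country rules, which matches the SQL above.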

@ -0,0 +1,183 @@
CREATE OR REPLACE VIEW mabase_prod.view_ai_features_1h AS
WITH base_data AS (
SELECT
a.window_start, a.src_ip, a.ja4, a.host,
toString(a.src_asn) AS asn_number,
a.src_as_name AS asn_org, a.src_org AS asn_detail, a.src_domain AS asn_domain,
a.src_country_code AS country_code,
dictGetOrDefault('mabase_prod.dict_asn_reputation', 'label', toUInt64(a.src_asn), 'unknown') AS asn_label,
-- Known bot via JA4/IP (existing dictionaries)
COALESCE(
nullIf(dictGetOrDefault('mabase_prod.dict_bot_ip', 'bot_name', a.src_ip, ''), ''),
nullIf(dictGetOrDefault('mabase_prod.dict_bot_ja4', 'bot_name', tuple(a.ja4), ''), ''),
''
) AS bot_name,
-- Anubis: combined UA+IP logic (same rule_id) > UA only > IP only > ASN > Country
CASE
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', a.first_ua) = '1'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua) != ''
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '') != ''
AND toUInt64OrZero(dictGet('mabase_prod.dict_anubis_ua', 'rule_id', a.first_ua))
= dictGetOrDefault('mabase_prod.dict_anubis_ip', 'rule_id', a.src_ip, toUInt64(0))
THEN dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua)
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', a.first_ua) = '0'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua) != ''
THEN dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua)
WHEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'has_ua', a.src_ip, toUInt8(0)) = 0
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', toUInt32(a.src_asn), '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', toUInt32(a.src_asn), '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', a.src_country_code, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', a.src_country_code, '')
ELSE ''
END AS anubis_bot_name,
CASE
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', a.first_ua) = '1'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua) != ''
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '') != ''
AND toUInt64OrZero(dictGet('mabase_prod.dict_anubis_ua', 'rule_id', a.first_ua))
= dictGetOrDefault('mabase_prod.dict_anubis_ip', 'rule_id', a.src_ip, toUInt64(0))
THEN dictGet('mabase_prod.dict_anubis_ua', 'action', a.first_ua)
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', a.first_ua) = '0'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua) != ''
THEN dictGet('mabase_prod.dict_anubis_ua', 'action', a.first_ua)
WHEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'has_ua', a.src_ip, toUInt8(0)) = 0
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'action', a.src_ip, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', toUInt32(a.src_asn), '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'action', toUInt32(a.src_asn), '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', a.src_country_code, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'action', a.src_country_code, '')
ELSE ''
END AS anubis_bot_action,
CASE
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', a.first_ua) = '1'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua) != ''
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '') != ''
AND toUInt64OrZero(dictGet('mabase_prod.dict_anubis_ua', 'rule_id', a.first_ua))
= dictGetOrDefault('mabase_prod.dict_anubis_ip', 'rule_id', a.src_ip, toUInt64(0))
THEN dictGet('mabase_prod.dict_anubis_ua', 'category', a.first_ua)
WHEN dictGet('mabase_prod.dict_anubis_ua', 'has_ip', a.first_ua) = '0'
AND dictGet('mabase_prod.dict_anubis_ua', 'bot_name', a.first_ua) != ''
THEN dictGet('mabase_prod.dict_anubis_ua', 'category', a.first_ua)
WHEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'has_ua', a.src_ip, toUInt8(0)) = 0
AND dictGetOrDefault('mabase_prod.dict_anubis_ip', 'bot_name', a.src_ip, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_ip', 'category', a.src_ip, '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'bot_name', toUInt32(a.src_asn), '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_asn', 'category', toUInt32(a.src_asn), '')
WHEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'bot_name', a.src_country_code, '') != ''
THEN dictGetOrDefault('mabase_prod.dict_anubis_country', 'category', a.src_country_code, '')
ELSE ''
END AS anubis_bot_category,
a.hits AS hits,
sum(a.hits) OVER (PARTITION BY a.src_ip) AS total_ip_hits,
a.correlated AS correlated,
a.tcp_jitter_variance AS tcp_jitter_variance,
a.true_window_size AS true_window_size,
a.window_mss_ratio AS window_mss_ratio,
a.max_keepalives AS max_keepalives,
h.header_order_hash AS header_order_hash, h.header_count AS header_count,
h.has_accept_language AS has_accept_language, h.has_cookie AS has_cookie,
h.has_referer AS has_referer, h.modern_browser_score AS modern_browser_score,
h.ua_ch_mismatch AS ua_ch_mismatch,
(a.count_post / (a.hits + 1)) AS post_ratio,
(a.uniq_query_params / (a.uniq_paths + 1)) AS fuzzing_index,
(a.hits / (dateDiff('second', a.first_seen, a.last_seen) + 1)) AS hit_velocity,
(a.unique_src_ports / (a.hits + 1)) AS port_exhaustion_ratio,
(a.orphan_count / (a.hits + 1)) AS orphan_ratio,
(a.ip_id_zero_count / (a.hits + 1)) AS ip_id_zero_ratio,
(a.hits / (a.unique_conn_id + 1)) AS multiplexing_efficiency,
IF(a.mss_1460_count > (a.hits * 0.8) AND h.modern_browser_score > 70, 1, 0) AS mss_mobile_mismatch,
a.request_size_variance AS request_size_variance,
IF(a.tls_alpn = 'h2' AND a.http_version != '2', 1, 0) AS alpn_http_mismatch,
IF(length(a.tls_alpn) = 0 OR a.tls_alpn = '00', 1, 0) AS is_alpn_missing,
IF(length(a.tls_sni) > 0 AND a.tls_sni != a.host, 1, 0) AS sni_host_mismatch,
IF(h.sec_fetch_mode = 'navigate' AND h.sec_fetch_dest != 'document', 1, 0) AS is_fake_navigation,
count() OVER (PARTITION BY a.tcp_fingerprint) AS tcp_shared_count,
count() OVER (PARTITION BY h.header_order_hash) AS header_order_shared_count,
(a.count_assets / (a.hits + 1)) AS asset_ratio,
(a.count_no_referer / (a.hits + 1)) AS direct_access_ratio,
IF(a.unique_ua > 2, 1, 0) AS is_ua_rotating,
uniqExact(a.ja4) OVER (PARTITION BY a.src_ip) AS distinct_ja4_count,
((a.hits / (a.unique_src_ports + 1)) / (dateDiff('second', a.first_seen, a.last_seen) + 1)) AS src_port_density,
(sum(a.hits) OVER (PARTITION BY a.ja4, a.src_asn) / (sum(a.hits) OVER (PARTITION BY a.ja4) + 1)) AS ja4_asn_concentration,
(sum(a.hits) OVER (PARTITION BY a.ja4, a.src_country_code) / (sum(a.hits) OVER (PARTITION BY a.ja4) + 1)) AS ja4_country_concentration,
IF(sum(a.hits) OVER (PARTITION BY a.ja4) < 100, 1, 0) AS is_rare_ja4,
(count() OVER (PARTITION BY h.header_order_hash, a.first_ua) / (count() OVER (PARTITION BY a.first_ua) + 1)) AS header_order_confidence,
uniqExact(h.header_order_hash) OVER (PARTITION BY a.src_ip) AS distinct_header_orders,
(a.uniq_paths / (a.hits + 1)) AS path_diversity_ratio,
a.url_depth_variance AS url_depth_variance,
(a.count_anomalous_payload / (a.hits + 1)) AS anomalous_payload_ratio,
a.uniq_ja3_val AS uniq_ja3_per_row,
sqrt(a.tcp_jitter_variance) / greatest(a.avg_syn_ms_val, 1) AS syn_timing_cv,
a.tls12_count / (a.hits + 1) AS tls12_ratio,
a.count_head / (a.hits + 1) AS head_ratio,
a.count_no_sec_fetch / (a.hits + 1) AS sec_fetch_absence_rate,
a.count_generic_accept / (a.hits + 1) AS generic_accept_ratio,
a.count_http10 / (a.hits + 1) AS http10_ratio,
a.ip_df_variance AS ip_df_variance,
-- New TTL features (OS fingerprint, L4 → "Complet" (full) model)
a.avg_ttl_val AS avg_ttl,
sqrt(a.ttl_variance_val) AS ttl_std,
IF(a.count_correlated_val > 0, a.count_no_wscale_val / a.count_correlated_val, 0) AS no_window_scale_ratio,
-- New HTTP features (available to both models)
a.count_no_accept_enc_val / (a.hits + 1) AS missing_accept_enc_ratio,
a.count_http_scheme_val / (a.hits + 1) AS http_scheme_ratio
FROM (
SELECT
window_start, src_ip, ja4, host, src_asn,
any(src_country_code) AS src_country_code, any(src_as_name) AS src_as_name,
any(src_org) AS src_org, any(src_domain) AS src_domain, any(first_ua) AS first_ua,
sum(hits) AS hits, uniqMerge(uniq_paths) AS uniq_paths,
uniqMerge(uniq_query_params) AS uniq_query_params, sum(count_post) AS count_post,
min(first_seen) AS first_seen, max(last_seen) AS last_seen,
any(tcp_fp_raw) AS tcp_fingerprint, varPopMerge(tcp_jitter_variance) AS tcp_jitter_variance,
varPopMerge(total_ip_length_var) AS request_size_variance,
any(tcp_win_raw * exp2(tcp_scale_raw)) AS true_window_size,
IF(any(tcp_mss_raw) > 0, any(tcp_win_raw) / any(tcp_mss_raw), 0) AS window_mss_ratio,
any(http_ver_raw) AS http_version, any(tls_alpn_raw) AS tls_alpn, any(tls_sni_raw) AS tls_sni,
max(correlated_raw) AS correlated, uniqMerge(unique_src_ports) AS unique_src_ports,
uniqMerge(unique_conn_id) AS unique_conn_id, max(max_keepalives) AS max_keepalives,
sum(orphan_count) AS orphan_count, sum(ip_id_zero_count) AS ip_id_zero_count,
sum(mss_1460_count) AS mss_1460_count,
sum(count_assets) AS count_assets, sum(count_no_referer) AS count_no_referer,
uniqMerge(uniq_ua) AS unique_ua,
varPopMerge(url_depth_variance) AS url_depth_variance,
sum(count_anomalous_payload) AS count_anomalous_payload,
uniqMerge(uniq_ja3) AS uniq_ja3_val,
avgMerge(avg_syn_ms) AS avg_syn_ms_val,
sum(tls12_count) AS tls12_count,
sum(count_head) AS count_head,
sum(count_no_sec_fetch) AS count_no_sec_fetch,
sum(count_generic_accept) AS count_generic_accept,
sum(count_http10) AS count_http10,
varPopMerge(ip_df_var) AS ip_df_variance,
-- New features: TTL fingerprint (L4) + HTTP
avgIfMerge(avg_ttl) AS avg_ttl_val,
varPopIfMerge(ttl_var) AS ttl_variance_val,
sum(count_no_wscale) AS count_no_wscale_val,
sum(count_correlated) AS count_correlated_val,
sum(count_no_accept_enc) AS count_no_accept_enc_val,
sum(count_http_scheme) AS count_http_scheme_val
FROM mabase_prod.agg_host_ip_ja4_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY window_start, src_ip, ja4, host, src_asn
) a
LEFT JOIN (
SELECT
window_start, src_ip, any(header_order_hash) AS header_order_hash,
max(header_count) AS header_count, max(has_accept_language) AS has_accept_language,
max(has_cookie) AS has_cookie, max(has_referer) AS has_referer,
max(modern_browser_score) AS modern_browser_score, max(ua_ch_mismatch) AS ua_ch_mismatch,
any(sec_fetch_mode) AS sec_fetch_mode, any(sec_fetch_dest) AS sec_fetch_dest
FROM mabase_prod.agg_header_fingerprint_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY window_start, src_ip
) h ON a.src_ip = h.src_ip AND a.window_start = h.window_start
)
SELECT
*,
-(sum((hits / (total_ip_hits + 1)) * log2((hits / (total_ip_hits + 1)) + 0.000001)) OVER (PARTITION BY src_ip)) AS temporal_entropy,
sum(uniq_ja3_per_row) OVER (PARTITION BY src_ip) / greatest(distinct_ja4_count, 1) AS ja3_diversity_ratio
FROM base_data;
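
The `temporal_entropy` column is a smoothed Shannon entropy over the rows that share a `src_ip`. The sketch below mirrors the SQL expression term by term in plain Python; the function name and the sample inputs are illustrative assumptions, and each list element stands for the `hits` value of one row of `base_data` for a given IP.

```python
import math


def temporal_entropy(hits_per_row):
    """Mirror of -(sum(p * log2(p + 1e-6))) with p = hits / (total_ip_hits + 1)."""
    total_ip_hits = sum(hits_per_row)
    entropy = 0.0
    for hits in hits_per_row:
        p = hits / (total_ip_hits + 1)        # +1 avoids division by zero
        entropy -= p * math.log2(p + 0.000001)  # epsilon keeps log2 defined at p = 0
    return entropy


concentrated = temporal_entropy([10])   # all traffic in a single row
spread = temporal_entropy([5, 5])       # same volume split over two rows
print(concentrated, spread)
```

Traffic spread evenly over several rows yields a higher value than the same volume concentrated in one row, which is what makes the feature useful for separating bursty from steady sources.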

@ -0,0 +1,15 @@
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
WORKDIR /app
# Install shared package first
COPY shared/python/ja4_common/ /app/shared/ja4_common/
RUN pip install --no-cache-dir /app/shared/ja4_common/
COPY services/bot-detector/bot_detector/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY services/bot-detector/bot_detector/bot_detector.py .
CMD ["python", "bot_detector.py"]

@ -0,0 +1,10 @@
FROM python:3.11-slim
WORKDIR /app
COPY shared/python/ja4_common/ /app/shared/ja4_common/
RUN pip install --no-cache-dir /app/shared/ja4_common/
COPY services/bot-detector/bot_detector/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir pytest pytest-mock
COPY services/bot-detector/bot_detector/ /app/bot_detector/
WORKDIR /app
CMD ["pytest", "bot_detector/tests/", "-v"]


@@ -0,0 +1,906 @@
import time
import os
import json
import glob
import signal
import sys
import logging
import threading
import joblib
import pandas as pd
import numpy as np
import clickhouse_connect
from logging.handlers import RotatingFileHandler
from http.server import HTTPServer, BaseHTTPRequestHandler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import warnings
from datetime import datetime
try:
import shap as _shap
SHAP_AVAILABLE = True
except ImportError:
SHAP_AVAILABLE = False
warnings.filterwarnings('ignore')
# ═══════════════════════════════════════════════════════════════════════════════
# CONFIGURATION
# ═══════════════════════════════════════════════════════════════════════════════
def _require_float(name, default, lo=None, hi=None):
raw = os.getenv(name, str(default))
try:
v = float(raw)
except ValueError:
raise SystemExit(f"[CONFIG] {name}={raw!r} invalide — doit être un nombre décimal.")
if lo is not None and hi is not None and not (lo < v < hi):
raise SystemExit(f"[CONFIG] {name}={v} hors plage ({lo} < valeur < {hi}).")
return v
# ClickHouse database name.
# Note: interpolated into SQL queries via f-strings (e.g. f'SELECT * FROM {DB}.view_ai_features_1h').
# The value comes only from controlled environment variables (docker-compose, K8s, etc.)
# and is never exposed to user input, so the SQL injection risk is considered negligible.
DB = os.getenv('CLICKHOUSE_DB', 'mabase_prod')
CONTAMINATION = _require_float('ISOLATION_CONTAMINATION', 0.001, 0, 0.5)
ANOMALY_THRESHOLD = _require_float('ANOMALY_THRESHOLD', -0.05)
LOG_FILE = os.getenv('BOT_DETECTOR_LOG', '/var/log/bot_detector/decisions.jsonl')
LOG_BACKUP_COUNT = int(os.getenv('LOG_BACKUP_COUNT', '7'))
MODEL_DIR = os.getenv('MODEL_DIR', '/var/lib/bot_detector')
RETRAIN_INTERVAL_H = int(os.getenv('RETRAIN_INTERVAL_HOURS', '24'))
MODEL_HISTORY_COUNT = int(os.getenv('MODEL_HISTORY_COUNT', '10'))
MAX_FAILURES = int(os.getenv('MAX_CONSECUTIVE_FAILURES', '3'))
HEALTH_PORT = int(os.getenv('HEALTH_PORT', '8080'))
CYCLE_INTERVAL = int(os.getenv('CYCLE_INTERVAL_SEC', '300'))
# ── Improvements A1 / A2 / A3 / A4 / A5 / A6 / A7 / A8 / A10 ───────────────
# A1: Concept drift
DRIFT_THRESHOLD = _require_float('DRIFT_THRESHOLD', 0.30, 0, 1)
# A2: Adaptive threshold
ANOMALY_PERCENTILE = int(os.getenv('ANOMALY_PERCENTILE', '5'))
# A3: Multi-window analysis
ENABLE_MULTIWINDOW = os.getenv('ENABLE_MULTIWINDOW', 'false').lower() == 'true'
MULTIWINDOW_VIEW = os.getenv('MULTIWINDOW_VIEW', 'view_ai_features_24h')
# A4: SHAP explainability
ENABLE_SHAP = SHAP_AVAILABLE and os.getenv('ENABLE_SHAP', 'true').lower() == 'true'
# A5: Cross-cycle deduplication with TTL
DEDUP_TTL_MIN = int(os.getenv('DEDUP_TTL_MIN', '60'))
# A6: Recurrence weighting
RECURRENCE_WEIGHT = _require_float('RECURRENCE_WEIGHT', 0.005)
# A7: Feature completeness validation
MIN_VALID_FEATURE_RATIO = _require_float('MIN_VALID_FEATURE_RATIO', 0.50, 0, 1)
# A8: Behavioral clustering of anomalies
ENABLE_CLUSTERING = os.getenv('ENABLE_CLUSTERING', 'true').lower() == 'true'
CLUSTERING_MIN_SAMPLES = int(os.getenv('CLUSTERING_MIN_SAMPLES', '3'))
# Features structurally unavailable per model (no L4 data for uncorrelated traffic).
# These features do not trigger "pipeline" warnings: their absence is by design.
STRUCTURAL_EXCLUDED_FEATURES: dict[str, list] = {
'Complet': ['orphan_ratio'],
'Applicatif': ['orphan_ratio', 'is_rare_ja4', 'tcp_shared_count',
'request_size_variance', 'mss_mobile_mismatch',
# B features (TLS/TCP): unavailable for uncorrelated traffic
'ja3_diversity_ratio', 'syn_timing_cv', 'tls12_ratio', 'ip_df_variance',
# L4 only: TTL and window scale are unavailable without TCP capture
'avg_ttl', 'ttl_std', 'no_window_scale_ratio'],
}
TRAINING_HISTORY_FILE = os.path.join(MODEL_DIR, 'training_history.jsonl')
# ═══════════════════════════════════════════════════════════════════════════════
# LOGGING
# ═══════════════════════════════════════════════════════════════════════════════
os.makedirs(os.path.dirname(LOG_FILE), exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)
logger = logging.getLogger('bot_detector')
logger.setLevel(logging.DEBUG)
_console_handler = logging.StreamHandler()
_console_handler.setFormatter(logging.Formatter('[%(asctime)s] %(message)s', '%Y-%m-%d %H:%M:%S'))
logger.addHandler(_console_handler)
_file_handler = RotatingFileHandler(
LOG_FILE, maxBytes=50 * 1024 * 1024, backupCount=LOG_BACKUP_COUNT, encoding='utf-8'
)
_file_handler.setFormatter(logging.Formatter('%(message)s'))
logger.addHandler(_file_handler)
# Short wrapper to keep logging calls uniform (avoids importing logger everywhere).
def log_info(message: str):
logger.info(message)
def log_decision(event: str, cycle_id: str, model: str = '', row: dict = None):
entry = {
'ts': datetime.now().strftime('%Y-%m-%dT%H:%M:%S'),
'cycle_id': cycle_id,
'event': event,
'model': model,
'contamination': CONTAMINATION,
'threshold': ANOMALY_THRESHOLD,
}
if row:
entry.update(row)
_file_handler.stream.write(json.dumps(entry, ensure_ascii=False, default=str) + '\n')
_file_handler.stream.flush()
def _append_training_history(entry: dict):
with open(TRAINING_HISTORY_FILE, 'a', encoding='utf-8') as f:
f.write(json.dumps(entry, ensure_ascii=False, default=str) + '\n')
# ═══════════════════════════════════════════════════════════════════════════════
# GRACEFUL SHUTDOWN AND HEALTH CHECK
# ═══════════════════════════════════════════════════════════════════════════════
def _shutdown(sig, frame):
log_info(f"Signal {sig} reçu — arrêt propre.")
log_decision('SERVICE_STOP', 'shutdown', '', {'signal': sig})
sys.exit(0)
signal.signal(signal.SIGTERM, _shutdown)
signal.signal(signal.SIGINT, _shutdown)
_service_healthy = True
class _HealthHandler(BaseHTTPRequestHandler):
def do_GET(self):
code = 200 if _service_healthy else 503
self.send_response(code)
self.end_headers()
self.wfile.write(b'OK' if _service_healthy else b'DEGRADED')
def log_message(self, *args): pass
threading.Thread(
target=lambda: HTTPServer(('', HEALTH_PORT), _HealthHandler).serve_forever(),
daemon=True
).start()
# ═══════════════════════════════════════════════════════════════════════════════
# CLICKHOUSE CONNECTION: delegated to the ja4_common shared client
# ═══════════════════════════════════════════════════════════════════════════════
from ja4_common.clickhouse import get_client as _ja4_get_client
def get_client():
"""Return the shared ja4_common ClickHouse client, reconnecting on ping failure."""
return _ja4_get_client().connect()
def score_to_threat_level(score: float) -> str:
# Thresholds: CRITICAL < -0.30 | HIGH < -0.15 | MEDIUM < -0.05 | LOW < 0 | NORMAL ≥ 0
if score < -0.30: return 'CRITICAL'
if score < -0.15: return 'HIGH'
if score < -0.05: return 'MEDIUM'
if score < 0: return 'LOW'
return 'NORMAL'
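
The threshold ladder above can also be expressed as a table lookup. The sketch below is an illustrative alternative, not the module's implementation: it uses the standard-library bisect module, and the names _BOUNDS, _LABELS and threat_level_from_table are hypothetical. The boundaries are copied from the comment in score_to_threat_level.

```python
from bisect import bisect_right

# Upper bounds (exclusive) for each level, in ascending order.
_BOUNDS = [-0.30, -0.15, -0.05, 0.0]
_LABELS = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW', 'NORMAL']


def threat_level_from_table(score: float) -> str:
    # bisect_right counts how many bounds are <= score, which is exactly
    # the number of "score < bound" tests that fail in the if-ladder.
    return _LABELS[bisect_right(_BOUNDS, score)]
```

A score sitting exactly on a boundary (for example -0.30) falls into the milder level, matching the strict `<` comparisons of the original function.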
# ═══════════════════════════════════════════════════════════════════════════════
# MODEL MANAGEMENT
# ═══════════════════════════════════════════════════════════════════════════════
def _current_pointer_path(name: str) -> str:
return os.path.join(MODEL_DIR, f'model_{name}.current')
def _get_current_version(name: str):
pointer = _current_pointer_path(name)
if not os.path.exists(pointer): return None, None
with open(pointer) as f: version_id = f.read().strip()
model_path = os.path.join(MODEL_DIR, f'model_{name}_{version_id}.joblib')
meta_path = os.path.join(MODEL_DIR, f'model_{name}_{version_id}.meta.json')
if not os.path.exists(model_path) or not os.path.exists(meta_path): return None, None
with open(meta_path) as f: meta = json.load(f)
return model_path, meta
def _purge_old_versions(name: str):
pattern = os.path.join(MODEL_DIR, f'model_{name}_*.joblib')
versions = sorted(glob.glob(pattern))
to_delete = versions[:-MODEL_HISTORY_COUNT] if len(versions) > MODEL_HISTORY_COUNT else []
for joblib_path in to_delete:
version_id = os.path.basename(joblib_path).replace(f'model_{name}_', '').replace('.joblib', '')
meta_path = os.path.join(MODEL_DIR, f'model_{name}_{version_id}.meta.json')
os.remove(joblib_path)
if os.path.exists(meta_path): os.remove(meta_path)
log_info(f"[{name}] Version purgée : {version_id} (limite={MODEL_HISTORY_COUNT})")
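
The purge logic relies on version identifiers of the form YYYYMMDD_HHMMSS sorting lexicographically in chronological order. A minimal sketch of the selection step follows, assuming plain version strings rather than file paths; the function name versions_to_purge is hypothetical.

```python
def versions_to_purge(version_ids, keep=10):
    """Return the oldest version ids that exceed the retention limit."""
    ordered = sorted(version_ids)  # timestamp-shaped ids sort chronologically
    # The explicit length check matters: a bare ordered[:-keep] with keep == 0
    # would be ordered[:0], i.e. an empty list, which is easy to misread.
    return ordered[:-keep] if len(ordered) > keep else []


old = versions_to_purge(
    ["20260403_120000", "20260401_120000", "20260402_120000"], keep=2
)
```

Here only the oldest version, 20260401_120000, is selected for deletion.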
def load_or_train_model(name: str, human_baseline: pd.DataFrame, features: list, cycle_id: str):
model_path, meta = _get_current_version(name)
if model_path and meta:
trained_at = datetime.fromisoformat(meta['trained_at'])
age_h = (datetime.now() - trained_at).total_seconds() / 3600
age_ok = age_h < RETRAIN_INTERVAL_H
# A1: Concept drift. Compare the current distribution with the one seen at training time.
drift_score = 0.0
drift_forced = False
if age_ok and 'baseline_stats' in meta:
drift_score = _compute_drift_score(meta['baseline_stats'], human_baseline, features)
if drift_score >= DRIFT_THRESHOLD:
drift_forced = True
log_info(f"[{name}] Dérive détectée ({drift_score:.0%} features) — retraining forcé.")
log_decision('DRIFT_DETECTED', cycle_id, name, {
'version_id': meta['version_id'], 'drift_score': round(drift_score, 3),
'drift_threshold': DRIFT_THRESHOLD, 'model_age_hours': round(age_h, 2)
})
if age_ok and not drift_forced:
log_info(f"[{name}] Modèle v{meta['version_id']} valide ({age_h:.1f}h / {RETRAIN_INTERVAL_H}h, drift={drift_score:.0%}) — réutilisation.")
log_decision('MODEL_LOADED', cycle_id, name, {
'version_id': meta['version_id'], 'model_age_hours': round(age_h, 2),
'trained_at': meta['trained_at'], 'human_samples': meta.get('human_samples', '?'),
'retrain_in_hours': round(RETRAIN_INTERVAL_H - age_h, 1), 'drift_score': round(drift_score, 3)
})
return joblib.load(model_path)
elif not drift_forced:
log_info(f"[{name}] Modèle v{meta['version_id']} expiré ({age_h:.1f}h ≥ {RETRAIN_INTERVAL_H}h) — retraining.")
version_id = datetime.now().strftime('%Y%m%d_%H%M%S')
log_info(f"[{name}] Entraînement version {version_id} sur {len(human_baseline)} sessions humaines... (contamination={CONTAMINATION})")
X = human_baseline[features].replace([np.inf, -np.inf], np.nan).fillna(0)
model = IsolationForest(n_estimators=300, contamination=CONTAMINATION, random_state=42, n_jobs=-1)
model.fit(X)
# A1: Save the baseline distribution statistics for future drift detection
baseline_stats = {
f: {'mean': float(X[f].mean()), 'std': float(X[f].std()), 'p25': float(X[f].quantile(0.25)), 'p75': float(X[f].quantile(0.75))}
for f in features
}
new_model_path = os.path.join(MODEL_DIR, f'model_{name}_{version_id}.joblib')
new_meta_path = os.path.join(MODEL_DIR, f'model_{name}_{version_id}.meta.json')
joblib.dump(model, new_model_path)
previous_version = meta.get('version_id', None) if meta else None
new_meta = {
'version_id': version_id, 'trained_at': datetime.now().isoformat(),
'human_samples': len(human_baseline), 'contamination': CONTAMINATION,
'threshold': ANOMALY_THRESHOLD, 'features': features,
'model_name': name, 'previous_version': previous_version,
'retrain_interval': RETRAIN_INTERVAL_H, 'baseline_stats': baseline_stats
}
with open(new_meta_path, 'w') as f: json.dump(new_meta, f, indent=2)
with open(_current_pointer_path(name), 'w') as f: f.write(version_id)
_append_training_history({k: v for k, v in new_meta.items() if k != 'baseline_stats'})
_purge_old_versions(name)
log_info(f"[{name}] Modèle v{version_id} sauvegardé → {new_model_path}")
log_decision('MODEL_TRAINED', cycle_id, name, {
'version_id': version_id, 'previous_version': previous_version,
'human_samples': len(human_baseline), 'next_retrain_in_h': RETRAIN_INTERVAL_H,
'history_kept': MODEL_HISTORY_COUNT
})
return model
# ═══════════════════════════════════════════════════════════════════════════════
# A1: CONCEPT DRIFT DETECTION
# ═══════════════════════════════════════════════════════════════════════════════
def _compute_drift_score(baseline_stats: dict, current_baseline: pd.DataFrame, features: list) -> float:
"""
Compares the current human-baseline distribution with the one used at training time.
For each feature, computes the z-score of the current mean against the training mean and
standard deviation; a feature counts as drifted when z > 2. Returns the fraction of drifted features.
A value >= DRIFT_THRESHOLD triggers a forced retraining.
"""
if not baseline_stats or current_baseline.empty:
return 0.0
drifted = 0
tested = 0
for feat in features:
if feat not in baseline_stats or feat not in current_baseline.columns:
continue
stats = baseline_stats[feat]
curr_mean = current_baseline[feat].mean()
trained_std = stats.get('std', 0)
if trained_std < 1e-9:
continue
# Z-score: gap between the current mean and the training mean
z = abs(curr_mean - stats['mean']) / trained_std
# z > 2 indicates a significant shift in the distribution
if z > 2.0:
drifted += 1
tested += 1
return drifted / max(tested, 1)
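
The drift score above is a mean-shift check, not a full distribution test. The following standard-library sketch reproduces the same arithmetic on plain dicts and lists instead of a DataFrame; the function name drift_fraction and the sample values are hypothetical.

```python
from statistics import fmean


def drift_fraction(baseline_stats, current, z_limit=2.0):
    """Fraction of testable features whose current mean moved more than
    z_limit training standard deviations away from the training mean."""
    drifted = tested = 0
    for feat, stats in baseline_stats.items():
        values = current.get(feat)
        if not values or stats["std"] < 1e-9:
            continue  # feature missing now, or constant at training time
        z = abs(fmean(values) - stats["mean"]) / stats["std"]
        drifted += z > z_limit
        tested += 1
    return drifted / max(tested, 1)


baseline = {
    "hits": {"mean": 10.0, "std": 2.0},
    "post_ratio": {"mean": 0.1, "std": 0.05},
    "header_count": {"mean": 7.0, "std": 0.0},  # skipped: zero variance
}
current = {
    "hits": [20, 22, 18],            # mean 20, z = 5: drifted
    "post_ratio": [0.1, 0.12, 0.08],
    "header_count": [7, 7],
}
score = drift_fraction(baseline, current)
```

One of the two testable features drifted, so the score is 0.5, which would exceed the default DRIFT_THRESHOLD of 0.30.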
# Per-model cache holding the last known state of invalid features.
# Suppresses repetitive logs: warnings are emitted only when the state changed since the previous cycle.
_feature_warning_cache: dict = {}
# ═══════════════════════════════════════════════════════════════════════════════
# A7: FEATURE COMPLETENESS VALIDATION
# ═══════════════════════════════════════════════════════════════════════════════
def validate_features(df: pd.DataFrame, features: list, name: str, cycle_id: str):
"""
Checks that the features are present and non-constant in the DataFrame.
Invalid features are categorized as:
- structural : absent by design for this model (defined in STRUCTURAL_EXCLUDED_FEATURES)
- zero : column always 0 (pipeline problem)
- unique : column with a single non-zero value (non-discriminating global aggregate)
- missing : column absent from the DataFrame
Returns the list of valid features, or None if too many features are invalid.
Warnings are logged only when the state changed since the previous cycle
(via _feature_warning_cache), to avoid polluting the logs on every cycle.
"""
structural = STRUCTURAL_EXCLUDED_FEATURES.get(name, [])
# Exclude structural features up front (no pipeline warning)
active_features = [f for f in features if f not in structural]
missing = [f for f in active_features if f not in df.columns]
present = [f for f in active_features if f in df.columns]
zero_val = [f for f in present if df[f].nunique() == 1 and df[f].max() == 0]
unique_val = [f for f in present if df[f].nunique() == 1 and df[f].max() != 0]
constant = zero_val + unique_val
valid = [f for f in present if f not in constant]
current_state = (frozenset(missing), frozenset(zero_val), frozenset(unique_val))
state_changed = _feature_warning_cache.get(name) != current_state
_feature_warning_cache[name] = current_state
if structural:
log_info(f"[{name}] Features exclues (structurelles / L4 indisponible) : {structural}")
# Log warnings only when the state changed (new problem or resolution)
if state_changed:
if missing:
log_info(f"[{name}] Features absentes du schéma : {missing}")
if zero_val:
log_info(f"[{name}] Features à 0 (pipeline non-alimenté) : {zero_val}")
if unique_val:
log_info(f"[{name}] Features non-discriminantes (agrégat global) : {unique_val}")
if missing or zero_val or unique_val:
log_decision('FEATURE_WARNING', cycle_id, name, {
'structural': structural, 'missing': missing,
'zero': zero_val, 'unique_nonzero': unique_val,
'valid_count': len(valid), 'total': len(active_features)
})
ratio = len(valid) / max(len(active_features), 1)
if ratio < MIN_VALID_FEATURE_RATIO:
log_info(f"[{name}] Ratio features valides insuffisant ({ratio:.0%} < {MIN_VALID_FEATURE_RATIO:.0%}) — cycle ignoré.")
log_decision('SKIPPED_INVALID_FEATURES', cycle_id, name, {
'valid_ratio': round(ratio, 3), 'threshold': MIN_VALID_FEATURE_RATIO
})
return None
return valid
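
The categorization performed by validate_features can be illustrated without pandas. This is a minimal sketch under the assumption that columns are given as a dict of value lists; the function name classify_features and the sample data are hypothetical.

```python
def classify_features(columns, features, structural=()):
    """columns: dict mapping column name -> list of values."""
    active = [f for f in features if f not in structural]
    missing = [f for f in active if f not in columns]
    present = [f for f in active if f in columns]
    # Always 0: the pipeline is probably not feeding this column.
    zero = [f for f in present if set(columns[f]) == {0}]
    # A single non-zero value: a global aggregate that cannot discriminate.
    unique = [f for f in present if len(set(columns[f])) == 1 and f not in zero]
    valid = [f for f in present if f not in zero and f not in unique]
    return {
        "missing": missing, "zero": zero, "unique": unique, "valid": valid,
        "ratio": len(valid) / max(len(active), 1),
    }


report = classify_features(
    {"hits": [1, 5, 9], "has_cookie": [0, 0, 0], "header_count": [7, 7, 7]},
    ["hits", "has_cookie", "header_count", "avg_ttl", "orphan_ratio"],
    structural=("orphan_ratio",),
)
```

With one valid feature out of four active ones, the ratio is 0.25, below the default MIN_VALID_FEATURE_RATIO of 0.50, so the real function would skip the cycle.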
# ═══════════════════════════════════════════════════════════════════════════════
# A2 / A10: ADAPTIVE THRESHOLD AND SCORE NORMALIZATION
# ═══════════════════════════════════════════════════════════════════════════════
def compute_adaptive_threshold(scores: np.ndarray) -> float:
"""
A2: Computes an adaptive threshold from the ANOMALY_PERCENTILE percentile of the negative scores.
Returns the minimum of the adaptive threshold and the configured static threshold.
"""
neg_scores = scores[scores < 0]
if len(neg_scores) == 0:
return ANOMALY_THRESHOLD
adaptive = float(np.percentile(neg_scores, ANOMALY_PERCENTILE))
return min(adaptive, ANOMALY_THRESHOLD)
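
Because the function returns the minimum of the two values, the adaptive threshold can only make detection stricter than the static one, never looser. The sketch below shows this with a pure-Python percentile using linear interpolation (the default behaviour of np.percentile); the helper names percentile and adaptive_threshold are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def adaptive_threshold(scores, static=-0.05, pct=5):
    neg = [s for s in scores if s < 0]
    if not neg:
        return static  # no negative score: fall back to the static threshold
    return min(percentile(neg, pct), static)


# Mildly negative scores: the percentile (-0.0195) is looser than -0.05,
# so the static threshold wins.
mild = adaptive_threshold([-0.01, -0.02, 0.3])
# Strongly negative scores: the percentile is stricter and takes over.
strict = adaptive_threshold([-0.5, -0.4, -0.3, -0.2, -0.1, 0.1], pct=50)
```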
def normalize_scores(scores: np.ndarray) -> np.ndarray:
"""
A10: Normalizes negative scores into [-1, 0] so that different models can be compared.
Positive scores (normal traffic) are left unchanged.
Caution: the formula maps the MOST negative score (most anomalous) to 0
and the LEAST negative score (least anomalous) to -1.
This counter-intuitive result is intentional: anomaly_score is only informational
in the result tables. Actual decisions rely on raw_anomaly_score.
"""
result = scores.copy()
mask = scores < 0
if mask.sum() == 0:
return result
s_min, s_max = scores[mask].min(), scores[mask].max()
if s_min == s_max:
return result
result[mask] = (scores[mask] - s_min) / (s_max - s_min + 1e-9) * -1
return result
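
The direction of the mapping in normalize_scores is easy to get wrong, so a small worked example helps. This sketch applies the same formula to a plain list; the function name normalize_negative is hypothetical.

```python
def normalize_negative(scores):
    neg = [s for s in scores if s < 0]
    if not neg or min(neg) == max(neg):
        return list(scores)  # nothing to rescale
    s_min, s_max = min(neg), max(neg)
    span = s_max - s_min + 1e-9
    # Same formula as the module: (s - s_min) / span * -1 for negative scores.
    return [-(s - s_min) / span if s < 0 else s for s in scores]


out = normalize_negative([-0.4, -0.1, 0.2])
```

The most anomalous input (-0.4) becomes 0, the least anomalous negative input (-0.1) becomes approximately -1, and the positive score 0.2 is unchanged.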
# ═══════════════════════════════════════════════════════════════════════════════
# A4: SHAP EXPLAINABILITY
# ═══════════════════════════════════════════════════════════════════════════════
def _compute_shap_top_features(model, X: pd.DataFrame, features: list, n_top: int = 5) -> list:
"""
Computes SHAP values for each row of X and returns the n_top most contributing
features (most negative SHAP value = most responsible for the anomaly).
Returns one {feature: shap_value} dict per row.
"""
if not ENABLE_SHAP or X.empty:
return [{}] * len(X)
try:
explainer = _shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
result = []
for sv in shap_values:
# Most negative values = features most responsible for the anomaly
pairs = sorted(zip(features, sv), key=lambda x: x[1])
result.append({f: round(float(v), 4) for f, v in pairs[:n_top]})
return result
except Exception as e:
log_info(f"[SHAP] Erreur de calcul SHAP: {e}")
return [{}] * len(X)
def _build_reason(name: str, row: pd.Series, shap_top: dict) -> str:
"""Construit le champ reason enrichi avec le top SHAP ou les métriques clés."""
# Uses the raw score for display (more interpretable than the normalized score)
score = round(float(row.get('raw_anomaly_score', row.get('anomaly_score', 0))), 3)
threat = row.get('threat_level', '')
if shap_top:
top_str = ' | '.join(f"{f}({v:+.3f})" for f, v in shap_top.items())
return f"[{name}] Score: {score} | SHAP: {top_str} | Threat: {threat}"
vel = round(float(row.get('hit_velocity', 0)), 1)
fuzz = round(float(row.get('fuzzing_index', 0)), 1)
return f"[{name}] Score: {score} | Vel: {vel} req/s | Fuzzing: {fuzz} | Threat: {threat}"
# ═══════════════════════════════════════════════════════════════════════════════
# A8: BEHAVIORAL CLUSTERING OF ANOMALIES (DBSCAN)
# ═══════════════════════════════════════════════════════════════════════════════
def _cluster_anomalies(anomalies: pd.DataFrame, features: list) -> pd.DataFrame:
"""
A8: Runs DBSCAN on the standardized features of the anomalies.
Adds a campaign_id column: -1 = isolated IP, ≥0 = coordinated campaign identifier.
"""
anomalies = anomalies.copy()
if len(anomalies) < CLUSTERING_MIN_SAMPLES:
anomalies['campaign_id'] = -1
return anomalies
try:
X = anomalies[features].replace([np.inf, -np.inf], np.nan).fillna(0)
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=CLUSTERING_MIN_SAMPLES).fit_predict(X_scaled)
anomalies['campaign_id'] = labels
n_campaigns = len(set(labels)) - (1 if -1 in labels else 0)
if n_campaigns > 0:
log_info(f"[DBSCAN] {n_campaigns} campagne(s) détectée(s) parmi {len(anomalies)} anomalies.")
except Exception as e:
log_info(f"[DBSCAN] Erreur de clustering: {e}")
anomalies['campaign_id'] = -1
return anomalies
# ═══════════════════════════════════════════════════════════════════════════════
# SEMI-SUPERVISED ANALYSIS
# ═══════════════════════════════════════════════════════════════════════════════
def run_semi_supervised_logic(df, features, name, cycle_id, recurrence_map):
# ── Three-way traffic split based on bot_name and Anubis ────────────────
# 1. Known bots (dict_bot_ip / dict_bot_ja4): excluded from IF scoring
known_bots = df[df['bot_name'] != ''].copy()
rest = df[df['bot_name'] == ''].copy()
# 2. Anubis ALLOW bots: legitimate bots, excluded from IF scoring
anubis_allow = rest[rest['anubis_bot_action'] == 'ALLOW'].copy()
# 3. Everything else goes through the IsolationForest for a real score:
# - DENY: threats identified by Anubis rules; IF provides the severity score
# - WEIGH / unknown: scored normally (anubis_is_flagged=1 for WEIGH)
# DENY rows are ALWAYS included in the threats, regardless of the IF threshold.
unknown_traffic = rest[rest['anubis_bot_action'] != 'ALLOW'].copy()
human_baseline = unknown_traffic[unknown_traffic['asn_label'] == 'human']
# A7: Validate features before any processing
valid_features = validate_features(df, features, name, cycle_id)
if valid_features is None:
return pd.DataFrame(), pd.DataFrame()
if len(human_baseline) < 500:
log_info(f"[{name}] Données humaines insuffisantes ({len(human_baseline)} < 500).")
log_decision('SKIPPED_LOW_DATA', cycle_id, name, {
'human_count': len(human_baseline), 'unknown_count': len(unknown_traffic)
})
return pd.DataFrame(), pd.DataFrame()
# A1: Concept drift handling is built into load_or_train_model
model = load_or_train_model(name, human_baseline, valid_features, cycle_id)
unknown_traffic = unknown_traffic.copy()
X_test = unknown_traffic[valid_features].replace([np.inf, -np.inf], np.nan).fillna(0)
raw_scores = model.decision_function(X_test)
# raw_anomaly_score: raw IF score, compared against the threshold and used to assign threat_level
# anomaly_score: normalized score in [-1, 0] for cross-model consistency (A10)
unknown_traffic['raw_anomaly_score'] = raw_scores
unknown_traffic['anomaly_score'] = normalize_scores(raw_scores)
unknown_traffic['model_name'] = name
# A2: Adaptive threshold computed on the RAW scores (same scale as ANOMALY_THRESHOLD)
effective_threshold = compute_adaptive_threshold(raw_scores)
log_info(f"[{name}] Seuil effectif : {effective_threshold:.4f} (statique={ANOMALY_THRESHOLD}, percentile={ANOMALY_PERCENTILE})")
# A6: Penalize recurring IPs on the RAW score before comparing against the threshold
if RECURRENCE_WEIGHT > 0:
recurrences = unknown_traffic['src_ip'].map(recurrence_map).fillna(0)
penalty = np.log1p(recurrences.values) * RECURRENCE_WEIGHT
unknown_traffic['raw_anomaly_score'] = unknown_traffic['raw_anomaly_score'] - penalty
# Assign threat_level to ALL scored sessions (for ml_all_scores)
unknown_traffic['threat_level'] = unknown_traffic['raw_anomaly_score'].apply(score_to_threat_level)
unknown_traffic['recurrence'] = unknown_traffic['src_ip'].map(recurrence_map).fillna(0).astype(int) + 1
unknown_traffic['campaign_id'] = -1
# Extract DENY rows (now carrying their real IF score) and force their threat_level
deny_mask = unknown_traffic['anubis_bot_action'] == 'DENY'
unknown_traffic.loc[deny_mask, 'threat_level'] = 'ANUBIS_DENY'
# Capture all scored sessions (before threshold filtering) for ml_all_scores
all_scored = unknown_traffic.copy()
if not known_bots.empty:
known_bots = known_bots.copy()
known_bots['anomaly_score'] = 0.0
known_bots['raw_anomaly_score'] = 0.0
known_bots['threat_level'] = 'KNOWN_BOT'
known_bots['model_name'] = name
known_bots['campaign_id'] = -1
known_bots['reason'] = '[Identification] Bot légitime: ' + known_bots['bot_name']
known_bots['recurrence'] = known_bots['src_ip'].map(recurrence_map).fillna(0).astype(int) + 1
for _, row in known_bots.iterrows():
log_decision('KNOWN_BOT', cycle_id, name, {
'src_ip': row.get('src_ip', ''), 'bot_name': row.get('bot_name', ''),
'asn_number': row.get('asn_number', ''), 'asn_org': row.get('asn_org', ''),
'asn_domain': row.get('asn_domain', ''), 'country_code': row.get('country_code', ''),
'recurrence': int(row.get('recurrence', 1))
})
# ── Anubis ALLOW: legitimate bots identified by Anubis rules ─────────────
if not anubis_allow.empty:
anubis_allow = anubis_allow.copy()
anubis_allow['anomaly_score'] = 0.0
anubis_allow['raw_anomaly_score'] = 0.0
anubis_allow['threat_level'] = 'KNOWN_BOT'
anubis_allow['model_name'] = name
anubis_allow['campaign_id'] = -1
anubis_allow['reason'] = '[Anubis ALLOW] ' + anubis_allow['anubis_bot_name']
anubis_allow['recurrence'] = anubis_allow['src_ip'].map(recurrence_map).fillna(0).astype(int) + 1
for _, row in anubis_allow.iterrows():
log_decision('KNOWN_BOT', cycle_id, name, {
'src_ip': row.get('src_ip', ''), 'bot_name': row.get('anubis_bot_name', ''),
'anubis_bot_name': row.get('anubis_bot_name', ''),
'anubis_bot_action': row.get('anubis_bot_action', ''),
'anubis_bot_category': row.get('anubis_bot_category', ''),
'asn_number': row.get('asn_number', ''), 'asn_org': row.get('asn_org', ''),
'asn_domain': row.get('asn_domain', ''), 'country_code': row.get('country_code', ''),
'recurrence': int(row.get('recurrence', 1)),
})
# ── Anubis DENY: scored by IF, always included in the threats ────────────
# Extracted from unknown_traffic after scoring, so they carry their real IF score.
anubis_deny = unknown_traffic[deny_mask].copy()
if not anubis_deny.empty:
anubis_deny['reason'] = '[Anubis DENY] ' + anubis_deny['anubis_bot_name'].fillna('') + \
' | ' + anubis_deny['raw_anomaly_score'].apply(lambda s: f'IF={s:.4f}')
log_info(f"[{name}] Anubis DENY: {len(anubis_deny)} IP(s) scorées par IF "
f"(score moyen: {anubis_deny['raw_anomaly_score'].mean():.4f}).")
for _, row in anubis_deny.iterrows():
log_decision('ANUBIS_DENY', cycle_id, name, {
'src_ip': row.get('src_ip', ''), 'anubis_bot_name': row.get('anubis_bot_name', ''),
'anubis_bot_action': row.get('anubis_bot_action', ''),
'anubis_bot_category': row.get('anubis_bot_category', ''),
'anomaly_score': round(float(row.get('anomaly_score', 0)), 4),
'raw_anomaly_score': round(float(row.get('raw_anomaly_score', 0)), 4),
'asn_number': row.get('asn_number', ''), 'asn_org': row.get('asn_org', ''),
'asn_domain': row.get('asn_domain', ''), 'country_code': row.get('country_code', ''),
'recurrence': int(row.get('recurrence', 1)),
})
# Filter on raw_anomaly_score (A6 included), non-DENY traffic only.
# DENY rows are always threats, regardless of the IF threshold.
non_deny_traffic = unknown_traffic[~deny_mask]
anomalies = non_deny_traffic[non_deny_traffic['raw_anomaly_score'] < effective_threshold].copy()
if not anomalies.empty:
log_info(f"[{name}] ALERT: {len(anomalies)} anomalies détectées (seuil={effective_threshold:.4f}).")
anomalies['recurrence'] = anomalies['src_ip'].map(recurrence_map).fillna(0).astype(int) + 1
# A4: SHAP explainability, top features responsible for each anomaly
X_anomalies = X_test.loc[anomalies.index]
shap_tops = _compute_shap_top_features(model, X_anomalies, valid_features)
anomalies['reason'] = [
_build_reason(name, row, shap)
for (_, row), shap in zip(anomalies.iterrows(), shap_tops)
]
# A8: DBSCAN clustering to identify coordinated campaigns
if ENABLE_CLUSTERING:
anomalies = _cluster_anomalies(anomalies, valid_features)
anomalies['ja4'] = anomalies['ja4'].replace({'': 'HTTP_CLEAR_TEXT'})
for _, row in anomalies.iterrows():
log_decision('ANOMALY', cycle_id, name, {
'src_ip': row.get('src_ip', ''), 'anomaly_score': round(float(row.get('anomaly_score', 0)), 4),
'raw_anomaly_score': round(float(row.get('raw_anomaly_score', 0)), 4),
'threat_level': row.get('threat_level', ''), 'recurrence': int(row.get('recurrence', 1)),
'hit_velocity': round(float(row.get('hit_velocity', 0)), 2),
'fuzzing_index': round(float(row.get('fuzzing_index', 0)), 2),
'post_ratio': round(float(row.get('post_ratio', 0)), 3),
'asn_number': row.get('asn_number', ''), 'asn_org': row.get('asn_org', ''),
'asn_detail': row.get('asn_detail', ''), 'asn_domain': row.get('asn_domain', ''),
'country_code': row.get('country_code', ''), 'asn_label': row.get('asn_label', ''),
'ja4': row.get('ja4', ''), 'host': row.get('host', ''),
'correlated': int(row.get('correlated', 0)), 'campaign_id': int(row.get('campaign_id', -1)),
'effective_threshold': round(effective_threshold, 4), 'reason': row.get('reason', '')
})
# pd.concat raises ValueError on an empty list, so fall back to an empty frame.
parts = [part for part in (anomalies, known_bots, anubis_allow, anubis_deny) if not part.empty]
threats = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()
# Include anubis_allow in all_scored for traceability in ml_all_scores.
# These IPs are excluded from the IF analysis but must appear in the scores
# table with threat_level='KNOWN_BOT' and anomaly_score=0.0.
if not anubis_allow.empty:
all_scored = pd.concat([all_scored, anubis_allow], ignore_index=True)
return threats, all_scored
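
The A6 recurrence penalty used in run_semi_supervised_logic subtracts log1p(recurrence) * RECURRENCE_WEIGHT from the raw score, so repeat offenders drift toward the anomaly threshold. A minimal sketch of that arithmetic, with the hypothetical function name penalized_score and the default weight of 0.005:

```python
import math


def penalized_score(raw_score, recurrence, weight=0.005):
    # log1p grows slowly, so the tenth detection adds far less than the first.
    return raw_score - math.log1p(recurrence) * weight


first_seen = penalized_score(-0.04, 0)   # no history: score unchanged
repeat = penalized_score(-0.04, 10)      # 10 prior detections: about -0.052
```

With the default static threshold of -0.05, the borderline score -0.04 stays below the alert level for a first-time IP but crosses it once the IP has ten prior detections.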
# ═══════════════════════════════════════════════════════════════════════════════
# A5: CROSS-CYCLE DEDUPLICATION WITH TTL
# ═══════════════════════════════════════════════════════════════════════════════
def _filter_recent_detections(client, all_anom: pd.DataFrame) -> pd.DataFrame:
"""
A5: Filters out IPs already inserted into ml_detected_anomalies within the last DEDUP_TTL_MIN minutes.
Exception: an IP is reinserted when its new score is more than 0.05 lower than the best recent score (worsening).
"""
if DEDUP_TTL_MIN <= 0 or all_anom.empty:
return all_anom
try:
recent_df = client.query_df(
f"SELECT src_ip, min(anomaly_score) AS best_score "
f"FROM {DB}.ml_detected_anomalies "
f"WHERE detected_at > now() - INTERVAL {DEDUP_TTL_MIN} MINUTE "
f"GROUP BY src_ip"
)
if recent_df.empty:
return all_anom
recent_map = dict(zip(recent_df['src_ip'], recent_df['best_score']))
def _should_insert(row):
prev = recent_map.get(row['src_ip'])
if prev is None:
return True
# Reinsert only if the raw score has worsened significantly
return float(row.get('raw_anomaly_score', row['anomaly_score'])) < float(prev) - 0.05
mask = all_anom.apply(_should_insert, axis=1)
filtered = all_anom[mask]
skipped = len(all_anom) - len(filtered)
if skipped > 0:
log_info(f"[Dedup TTL={DEDUP_TTL_MIN}min] {skipped} IP(s) filtrée(s) (déjà détectées récemment).")
return filtered
except Exception as e:
log_info(f"[Dedup] Erreur lors de la déduplication TTL : {e}")
return all_anom
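
The reinsertion rule in _filter_recent_detections reduces to one comparison per IP. The sketch below isolates it on a plain dict of recent best scores; the function name should_insert and the sample addresses are hypothetical.

```python
def should_insert(src_ip, new_score, recent_best, margin=0.05):
    """recent_best: dict mapping src_ip -> lowest score seen within the TTL."""
    prev = recent_best.get(src_ip)
    # Unknown IP: always insert. Known IP: insert only on a clear worsening.
    return prev is None or new_score < prev - margin


recent = {"192.0.2.10": -0.10}
decisions = [
    should_insert("192.0.2.99", -0.06, recent),  # never seen: inserted
    should_insert("192.0.2.10", -0.12, recent),  # only 0.02 worse: skipped
    should_insert("192.0.2.10", -0.20, recent),  # 0.10 worse: reinserted
]
```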
# ═══════════════════════════════════════════════════════════════════════════════
# A3: MULTI-WINDOW ANALYSIS, SHARED PREPROCESSING
# ═══════════════════════════════════════════════════════════════════════════════
def _preprocess_df(df: pd.DataFrame) -> pd.DataFrame:
"""Normalise les colonnes et remplit les valeurs manquantes (commun 1h et 24h)."""
df.columns = [c.split('.')[-1] for c in df.columns]
for col in ['src_ip', 'ja4', 'host', 'bot_name', 'anubis_bot_name', 'anubis_bot_action', 'anubis_bot_category',
'asn_number', 'asn_org', 'asn_detail', 'asn_domain', 'country_code', 'asn_label']:
if col in df.columns:
df[col] = df[col].fillna('').astype(str)
df.fillna(0, inplace=True)
# ── Numeric features derived from Anubis labels (for the IsolationForest) ──
# anubis_is_flagged: 1 when Anubis marks the traffic WEIGH/CHALLENGE
# → moderate suspicion signal passed to the IF (ALLOW and DENY never set this flag)
df['anubis_is_flagged'] = (
(df.get('anubis_bot_name', pd.Series('', index=df.index)) != '') &
(~df.get('anubis_bot_action', pd.Series('', index=df.index)).isin(['ALLOW', 'DENY', '']))
).astype(int)
return df
# ═══════════════════════════════════════════════════════════════════════════════
# MAIN CYCLE
# ═══════════════════════════════════════════════════════════════════════════════
_consecutive_failures = 0
def fetch_and_analyze():
global _service_healthy, _consecutive_failures
cycle_id = datetime.now().strftime('%Y%m%d_%H%M%S')
log_info('=== Lancement cycle IA ===')
client = get_client()
# ── Récupération du trafic (fenêtre 1h) ──────────────────────────────────
try:
df = client.query_df(f'SELECT * FROM {DB}.view_ai_features_1h')
except Exception as e:
log_info(f'ERREUR REQUETE: {e}')
_consecutive_failures += 1
if _consecutive_failures >= MAX_FAILURES:
_service_healthy = False
log_decision('CONSECUTIVE_FAILURES', cycle_id, '', {'count': _consecutive_failures, 'error': str(e)})
return
_consecutive_failures = 0
_service_healthy = True
if df is None or df.empty:
log_info('Aucun trafic trouvé.')
return
df = _preprocess_df(df)
log_decision('CYCLE_START', cycle_id, '', {
'total_rows': len(df),
'human_rows': int((df.get('asn_label', pd.Series()) == 'human').sum()),
'known_bot_rows': int((df.get('bot_name', pd.Series()) != '').sum()),
'correlated_rows': int((df.get('correlated', pd.Series()) == 1).sum()),
'anubis_allow_rows': int((df.get('anubis_bot_action', pd.Series()) == 'ALLOW').sum()),
'anubis_deny_rows': int((df.get('anubis_bot_action', pd.Series()) == 'DENY').sum()),
'anubis_weigh_rows': int((df.get('anubis_bot_action', pd.Series()) == 'WEIGH').sum()),
'multiwindow': ENABLE_MULTIWINDOW,
})
    try:
        rec_df = client.query_df(f'SELECT src_ip, recurrence FROM {DB}.view_ip_recurrence')
        recurrence_map = dict(zip(rec_df['src_ip'], rec_df['recurrence']))
    except Exception:
        recurrence_map = {}
    # ── Features per model (see DOCUMENTATION.md §4) ─────────────────────────
    # Features shared by both models (pure L7 HTTP, available for correlated=0 and 1)
    feats = [
        'hits', 'hit_velocity', 'fuzzing_index', 'post_ratio', 'port_exhaustion_ratio',
        'orphan_ratio', 'max_keepalives', 'tcp_shared_count', 'header_order_shared_count',
        'header_count', 'has_accept_language', 'has_cookie', 'has_referer',
        'modern_browser_score', 'ua_ch_mismatch', 'ip_id_zero_ratio',
        'request_size_variance', 'multiplexing_efficiency', 'mss_mobile_mismatch',
        'asset_ratio', 'direct_access_ratio', 'is_ua_rotating', 'distinct_ja4_count',
        'src_port_density', 'ja4_asn_concentration', 'ja4_country_concentration', 'is_rare_ja4',
        'header_order_confidence', 'distinct_header_orders', 'temporal_entropy',
        'path_diversity_ratio', 'url_depth_variance', 'anomalous_payload_ratio',
        # B4-B7: pure L7 features (available for correlated=0 and 1)
        'head_ratio', 'sec_fetch_absence_rate', 'generic_accept_ratio', 'http10_ratio',
        # Anubis: moderate suspicion signal (WEIGH/CHALLENGE); bypassed for ALLOW/DENY
        'anubis_is_flagged',
        # HTTP: incomplete headers and plain-HTTP usage (available to both models)
        'missing_accept_enc_ratio', 'http_scheme_ratio',
    ]
    # Additional features for the Complet model (they require TCP/TLS data)
    feats_complet = feats + [
        'tcp_jitter_variance', 'alpn_http_mismatch', 'is_alpn_missing', 'sni_host_mismatch',
        # B1-B3, B8: TLS/TCP features (available for correlated=1 only)
        'ja3_diversity_ratio', 'syn_timing_cv', 'tls12_ratio', 'ip_df_variance',
        # OS TTL fingerprinting + TCP window scale (L4 only)
        'avg_ttl', 'ttl_std', 'no_window_scale_ratio',
    ]
    # ── 1h window analysis ────────────────────────────────────────────────────
    anom_a, scored_a = run_semi_supervised_logic(df[df['correlated'] == 1].copy(), feats_complet, 'Complet', cycle_id, recurrence_map)
    anom_b, scored_b = run_semi_supervised_logic(df[df['correlated'] == 0].copy(), feats, 'Applicatif', cycle_id, recurrence_map)
    all_anom = pd.concat([anom_a, anom_b], ignore_index=True)
    all_scored = pd.concat([scored_a, scored_b], ignore_index=True)
    # ── A3: 24h window analysis (optional) ────────────────────────────────────
    if ENABLE_MULTIWINDOW:
        try:
            df_24h = client.query_df(f'SELECT * FROM {DB}.{MULTIWINDOW_VIEW}')
            if df_24h is not None and not df_24h.empty:
                df_24h = _preprocess_df(df_24h)
                log_info(f"[24h] {len(df_24h)} sessions in the 24h window.")
                anom_c, scored_c = run_semi_supervised_logic(df_24h[df_24h['correlated'] == 1].copy(), feats_complet, 'Complet_24h', cycle_id, recurrence_map)
                anom_d, scored_d = run_semi_supervised_logic(df_24h[df_24h['correlated'] == 0].copy(), feats, 'Applicatif_24h', cycle_id, recurrence_map)
                all_anom_24h = pd.concat([anom_c, anom_d], ignore_index=True)
                all_scored_24h = pd.concat([scored_c, scored_d], ignore_index=True)
                # Merge: for IPs present in both windows, the lowest score is kept
                # (enforced later by the intra-cycle deduplication on src_ip).
                if not all_anom_24h.empty:
                    all_anom = pd.concat([all_anom, all_anom_24h], ignore_index=True)
                    log_info(f"[24h] Merged 1h+24h: {len(all_anom)} entries before deduplication.")
                all_scored = pd.concat([all_scored, all_scored_24h], ignore_index=True)
            else:
                log_info(f"[24h] View {MULTIWINDOW_VIEW} is empty; falling back to single-window analysis.")
        except Exception as e:
            log_info(f"[24h] View {MULTIWINDOW_VIEW} unavailable: {e}; falling back to single-window analysis.")
    # ── Insert every classification into ml_all_scores ───────────────────────
    if not all_scored.empty:
        try:
            now = datetime.now().replace(microsecond=0)
            all_scored['detected_at'] = now
            all_scored['ja4'] = all_scored['ja4'].replace({'': 'HTTP_CLEAR_TEXT'})
            all_scores_cols = [
                'detected_at', 'window_start', 'src_ip', 'ja4', 'host', 'bot_name',
                'anubis_bot_name', 'anubis_bot_action', 'anubis_bot_category',
                'anomaly_score', 'raw_anomaly_score', 'threat_level', 'model_name',
                'correlated', 'asn_number', 'asn_org', 'country_code', 'asn_label',
                'hits', 'hit_velocity', 'fuzzing_index', 'post_ratio', 'campaign_id'
            ]
            # Keep only the columns that are actually present in the scored frame.
            scores_df = all_scored[[c for c in all_scores_cols if c in all_scored.columns]]
            client.insert_df(f'{DB}.ml_all_scores', scores_df)
            log_info(f'[ml_all_scores] {len(scores_df)} scored sessions recorded.')
        except Exception as e:
            log_info(f'[ml_all_scores] INSERT ERROR: {e}')
    if not all_anom.empty:
        # Intra-cycle deduplication: keep the lowest raw score per src_ip.
        all_anom = all_anom.sort_values('raw_anomaly_score', ascending=True).drop_duplicates(subset=['src_ip'], keep='first')
        log_info(f'After intra-cycle deduplication: {len(all_anom)} unique IPs.')
        # A5 - Inter-cycle deduplication with TTL
        all_anom = _filter_recent_detections(client, all_anom)
        if all_anom.empty:
            log_info('All anomalies filtered out by TTL deduplication.')
            log_decision('CYCLE_END', cycle_id, '', {'inserted': 0, 'anomalies': 0, 'known_bots': 0, 'critical': 0, 'high': 0, 'dedup_ttl_min': DEDUP_TTL_MIN})
            return
        all_anom['detected_at'] = datetime.now().replace(microsecond=0)
        fake_nav_col = 'is_fake_navigation'
        all_anom['is_headless'] = all_anom[fake_nav_col].astype(int) if fake_nav_col in all_anom.columns else 0
        cols = [
            'detected_at', 'src_ip', 'ja4', 'host', 'bot_name', 'anomaly_score',
            'threat_level', 'model_name', 'recurrence',
            'asn_number', 'asn_org', 'asn_detail', 'asn_domain', 'country_code', 'asn_label',
            'hits', 'hit_velocity', 'fuzzing_index', 'post_ratio', 'port_exhaustion_ratio', 'max_keepalives', 'orphan_ratio',
            'tcp_jitter_variance', 'tcp_shared_count', 'true_window_size', 'window_mss_ratio',
            'alpn_http_mismatch', 'is_alpn_missing', 'sni_host_mismatch',
            'header_count', 'has_accept_language', 'has_cookie', 'has_referer',
            'modern_browser_score', 'is_headless', 'ua_ch_mismatch',
            'header_order_shared_count', 'ip_id_zero_ratio', 'request_size_variance',
            'multiplexing_efficiency', 'mss_mobile_mismatch',
            'correlated', 'reason', 'asset_ratio', 'direct_access_ratio', 'is_ua_rotating',
            'distinct_ja4_count', 'src_port_density', 'ja4_asn_concentration',
            'ja4_country_concentration', 'is_rare_ja4',
            'header_order_confidence', 'distinct_header_orders', 'temporal_entropy',
            'path_diversity_ratio', 'url_depth_variance', 'anomalous_payload_ratio',
            'anubis_bot_name', 'anubis_bot_action', 'anubis_bot_category',
        ]
        try:
            final_df = all_anom[[c for c in cols if c in all_anom.columns]]
            client.insert_df(f'{DB}.ml_detected_anomalies', final_df)
            log_info(f'Success: {len(final_df)} threats recorded.')
            log_decision('CYCLE_END', cycle_id, '', {
                'inserted': len(final_df),
                'anomalies': int((final_df.get('bot_name', pd.Series()) == '').sum()),
                'known_bots': int((final_df.get('bot_name', pd.Series()) != '').sum()),
                'critical': int((final_df.get('threat_level', pd.Series()) == 'CRITICAL').sum()),
                'high': int((final_df.get('threat_level', pd.Series()) == 'HIGH').sum()),
                'dedup_ttl_min': DEDUP_TTL_MIN,
            })
        except Exception as e:
            log_info(f'INSERT ERROR: {e}')
    else:
        log_info('No threat detected.')
        log_decision('CYCLE_END', cycle_id, '', {'inserted': 0, 'anomalies': 0, 'known_bots': 0, 'critical': 0, 'high': 0, 'dedup_ttl_min': DEDUP_TTL_MIN})
if __name__ == '__main__':
    log_info('*' * 65)
    log_info(' STARTING BOT DETECTOR AI SERVICE v12 (+ Anubis)')
    log_info(f' DB : {DB}')
    log_info(f' Contamination : {CONTAMINATION}')
    log_info(f' Anomaly threshold : {ANOMALY_THRESHOLD} (adaptive percentile={ANOMALY_PERCENTILE})')
    log_info(f' Cycle : {CYCLE_INTERVAL}s | 1h window | Multi-window : {ENABLE_MULTIWINDOW}')
    log_info(f' Retraining : every {RETRAIN_INTERVAL_H}h | Drift threshold : {DRIFT_THRESHOLD:.0%}')
    log_info(f' Models : {MODEL_DIR}')
    log_info(f' SHAP : {"enabled" if ENABLE_SHAP else "disabled (shap not installed)" if not SHAP_AVAILABLE else "disabled"}')
    log_info(f' Clustering : {"enabled" if ENABLE_CLUSTERING else "disabled"} | Dedup TTL : {DEDUP_TTL_MIN}min')
    log_info(f' Recurrence weight : {RECURRENCE_WEIGHT} | Min features ratio : {MIN_VALID_FEATURE_RATIO:.0%}')
    log_info(' Anubis : ALLOW→KNOWN_BOT (score=0), DENY→ANUBIS_DENY (real IF score)')
    log_info('*' * 65)
    log_decision('SERVICE_START', 'boot', '', {
        'db': DB, 'contamination': CONTAMINATION, 'anomaly_threshold': ANOMALY_THRESHOLD,
        'cycle_interval': CYCLE_INTERVAL, 'retrain_interval_h': RETRAIN_INTERVAL_H
    })
    while True:
        # A failed cycle must never stop the service: log and retry on the next tick.
        try:
            fetch_and_analyze()
        except Exception as e:
            log_info(f"Global error: {e}")
        time.sleep(CYCLE_INTERVAL)
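
The cycle above deduplicates anomalies in two stages: within a cycle it keeps the lowest raw score per `src_ip`, and across cycles it re-inserts an IP only if its score has dropped more than 0.05 below the best score seen recently. The following is a minimal sketch of that rule in plain Python, without pandas. The names `dedup_within_cycle`, `should_reinsert`, and `SCORE_MARGIN` are hypothetical and introduced only for illustration; the real pipeline works on DataFrames and reads recent scores from ClickHouse.

```python
from typing import Dict, List, Optional

# Assumption: mirrors the hard-coded 0.05 used by _should_insert.
SCORE_MARGIN = 0.05


def dedup_within_cycle(rows: List[dict]) -> List[dict]:
    """Keep, for each src_ip, the row with the lowest (most anomalous) raw score."""
    best: Dict[str, dict] = {}
    for row in rows:
        current = best.get(row["src_ip"])
        if current is None or row["raw_anomaly_score"] < current["raw_anomaly_score"]:
            best[row["src_ip"]] = row
    # Return the survivors ordered from most to least anomalous.
    return sorted(best.values(), key=lambda r: r["raw_anomaly_score"])


def should_reinsert(raw_score: float, previous_best: Optional[float]) -> bool:
    """Insert new IPs; re-insert known IPs only if the score worsened past the margin."""
    if previous_best is None:
        return True
    return raw_score < previous_best - SCORE_MARGIN


if __name__ == "__main__":
    cycle_rows = [
        {"src_ip": "10.0.0.1", "raw_anomaly_score": -0.20},
        {"src_ip": "10.0.0.1", "raw_anomaly_score": -0.40},
        {"src_ip": "10.0.0.2", "raw_anomaly_score": -0.10},
    ]
    recent_best = {"10.0.0.2": -0.30}  # hypothetical scores from earlier cycles
    kept = [
        row for row in dedup_within_cycle(cycle_rows)
        if should_reinsert(row["raw_anomaly_score"], recent_best.get(row["src_ip"]))
    ]
    print(kept)
```

Keeping the margin as a named constant makes the re-insertion sensitivity easy to tune and to test in isolation.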


@ -0,0 +1,6 @@
clickhouse-connect==0.8.0
pandas==2.2.0
scikit-learn==1.4.0
shap==0.44.1
pyyaml>=6.0
ja4-common @ file:///app/shared/ja4_common


@ -0,0 +1,17 @@
import pytest
from unittest.mock import MagicMock, patch


@pytest.fixture
def mock_ch_client():
    """Mock ClickHouse client."""
    client = MagicMock()
    client.query.return_value = MagicMock(result_rows=[])
    client.command.return_value = None
    return client


@pytest.fixture(autouse=False)
def mock_get_client(mock_ch_client):
    with patch("ja4_common.clickhouse.get_client", return_value=mock_ch_client):
        yield mock_ch_client
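
A mock client shaped like the fixture above can drive code that reads `result_rows` without a running ClickHouse server. The sketch below shows one way to do that; `make_mock_client` and `count_rows` are hypothetical helpers written for this example and are not part of `ja4_common`.

```python
from unittest.mock import MagicMock


def make_mock_client(rows):
    """Build a mock with the same shape as the mock_ch_client fixture."""
    client = MagicMock()
    client.query.return_value = MagicMock(result_rows=rows)
    client.command.return_value = None
    return client


def count_rows(client, table: str) -> int:
    """Hypothetical code under test: count the rows returned for a table."""
    result = client.query(f"SELECT * FROM {table}")
    return len(result.result_rows)


if __name__ == "__main__":
    client = make_mock_client([(1,), (2,)])
    print(count_rows(client, "mabase_prod.ml_all_scores"))
    # The mock also records how it was called, which a test can assert on.
    client.query.assert_called_once_with("SELECT * FROM mabase_prod.ml_all_scores")
```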


@ -0,0 +1,166 @@
import os
import pytest
import pandas as pd
import numpy as np
from unittest.mock import patch, MagicMock


def test_settings_from_env(monkeypatch):
    """ClickHouseSettings loads CLICKHOUSE_HOST from env."""
    monkeypatch.setenv("CLICKHOUSE_HOST", "testhost")
    from ja4_common.settings import ClickHouseSettings
    s = ClickHouseSettings()
    assert s.CLICKHOUSE_HOST == "testhost"


def test_feature_dataframe_validation():
    """MIN_VALID_FEATURE_RATIO logic: if < ratio of features have data, skip."""
    MIN_VALID_FEATURE_RATIO = 0.5
    df = pd.DataFrame({"f1": [1.0], "f2": [None], "f3": [None], "f4": [None]})
    non_null_ratio = df.notna().mean().mean()
    assert non_null_ratio < MIN_VALID_FEATURE_RATIO, "Should detect insufficient features"


def test_anomaly_threshold():
    """Scores below ANOMALY_THRESHOLD trigger detection."""
    ANOMALY_THRESHOLD = -0.1
    anomaly_scores = np.array([-0.5, -0.3, 0.1, 0.2])
    anomalies = anomaly_scores[anomaly_scores < ANOMALY_THRESHOLD]
    assert len(anomalies) == 2, "Should detect 2 anomalies"
def test_dedup_logic():
    """Duplicate detections within DEDUP_TTL_MIN are skipped."""
    from datetime import datetime, timedelta
    DEDUP_TTL_MIN = 60
    dedup_cache = {}

    def should_insert(ip: str, now: datetime) -> bool:
        if ip in dedup_cache:
            if (now - dedup_cache[ip]).total_seconds() < DEDUP_TTL_MIN * 60:
                return False
        dedup_cache[ip] = now
        return True

    now = datetime(2024, 1, 1, 12, 0, 0)
    assert should_insert("1.2.3.4", now) is True
    assert should_insert("1.2.3.4", now + timedelta(minutes=30)) is False  # within TTL
    assert should_insert("1.2.3.4", now + timedelta(minutes=61)) is True  # past TTL
def test_health_check():
    """Health check endpoint returns 200."""
    import threading
    import urllib.request
    from http.server import HTTPServer, BaseHTTPRequestHandler

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()

        def log_message(self, *args):
            pass

    # Port 0 lets the OS pick a free port; the thread serves exactly one request.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    port = server.server_address[1]
    t = threading.Thread(target=server.handle_request)
    t.start()
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
        assert resp.status == 200
    t.join()
    server.server_close()
def test_dedup_different_ips_are_independent():
    """Different IPs are tracked independently in dedup cache."""
    from datetime import datetime, timedelta
    DEDUP_TTL_MIN = 60
    dedup_cache = {}

    def should_insert(ip: str, now: datetime) -> bool:
        if ip in dedup_cache:
            if (now - dedup_cache[ip]).total_seconds() < DEDUP_TTL_MIN * 60:
                return False
        dedup_cache[ip] = now
        return True

    now = datetime(2024, 1, 1, 12, 0, 0)
    assert should_insert("1.1.1.1", now) is True
    assert should_insert("2.2.2.2", now) is True  # Different IP, should be allowed
    assert should_insert("1.1.1.1", now + timedelta(minutes=30)) is False  # Same IP within TTL
    assert should_insert("2.2.2.2", now + timedelta(minutes=30)) is False  # Same IP within TTL
def test_dedup_exact_ttl_boundary():
    """Dedup: insertion exactly at the TTL boundary is allowed again."""
    from datetime import datetime, timedelta
    DEDUP_TTL_MIN = 60
    dedup_cache = {}

    def should_insert(ip: str, now: datetime) -> bool:
        if ip in dedup_cache:
            if (now - dedup_cache[ip]).total_seconds() < DEDUP_TTL_MIN * 60:
                return False
        dedup_cache[ip] = now
        return True

    now = datetime(2024, 1, 1, 12, 0, 0)
    assert should_insert("1.2.3.4", now) is True
    # One second before the boundary is still inside the TTL and is blocked.
    assert should_insert("1.2.3.4", now + timedelta(minutes=59, seconds=59)) is False
    # Exactly at 60 minutes the elapsed time is 3600 s, and 3600 < 3600 is False,
    # so the strict comparison (< not <=) lets the insertion through.
    assert should_insert("1.2.3.4", now + timedelta(minutes=60)) is True
def test_anomaly_threshold_no_anomalies():
    """No anomalies when all scores are above threshold."""
    import numpy as np
    ANOMALY_THRESHOLD = -0.1
    scores = np.array([0.0, 0.1, 0.5, 1.0])
    anomalies = scores[scores < ANOMALY_THRESHOLD]
    assert len(anomalies) == 0


def test_anomaly_threshold_all_anomalies():
    """All items flagged when all scores are below threshold."""
    import numpy as np
    ANOMALY_THRESHOLD = -0.1
    scores = np.array([-0.5, -0.3, -0.2, -0.15])
    anomalies = scores[scores < ANOMALY_THRESHOLD]
    assert len(anomalies) == 4


def test_feature_dataframe_all_valid():
    """Feature dataframe with all valid values passes ratio check."""
    import pandas as pd
    MIN_VALID_FEATURE_RATIO = 0.5
    df = pd.DataFrame({"f1": [1.0], "f2": [2.0], "f3": [3.0], "f4": [4.0]})
    non_null_ratio = df.notna().mean().mean()
    assert non_null_ratio >= MIN_VALID_FEATURE_RATIO
def test_health_check_returns_correct_status():
    """Health check endpoint body is readable."""
    import threading
    import urllib.request
    from http.server import HTTPServer, BaseHTTPRequestHandler

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')

        def log_message(self, *args):
            pass

    server = HTTPServer(("127.0.0.1", 0), StatusHandler)
    port = server.server_address[1]
    t = threading.Thread(target=server.handle_request)
    t.start()
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
        assert resp.status == 200
        body = resp.read()
    assert b"ok" in body
    t.join()
    server.server_close()
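
Both health-check tests repeat the same server setup, so the pattern could be factored into a small helper. This is a sketch using only the standard library; `one_shot_server`, `fetch_status`, and `_OkHandler` are hypothetical names for this example, and the `/health` path is taken from the test above rather than from the service itself.

```python
import threading
import urllib.request
from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, HTTPServer


class _OkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


@contextmanager
def one_shot_server(handler=_OkHandler):
    """Serve exactly one request on a free local port, then clean up."""
    server = HTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.handle_request, daemon=True)
    thread.start()
    try:
        yield server.server_address[1]
    finally:
        thread.join(timeout=5)
        server.server_close()


def fetch_status(port: int, path: str = "/health"):
    """Return (HTTP status, body) for a GET against the local server."""
    with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}", timeout=5) as resp:
        return resp.status, resp.read()


if __name__ == "__main__":
    with one_shot_server() as port:
        print(fetch_status(port))
```

Joining the thread and closing the socket in the `finally` clause means a failing assertion inside the `with` body does not leak the listening socket.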


@ -0,0 +1,411 @@
-- ============================================================================
-- FULL DETECTION ARCHITECTURE (v13 - bot_detector v11 + ml_all_scores)
-- Database: mabase_prod | Window: 24h | Deduplication by src_ip
-- v11 changes: added campaign_id and raw_anomaly_score to ml_detected_anomalies;
--              fixed view_dashboard_variability (header_user_agent → reason)
-- v12 changes: added table ml_all_scores (all classifications, no threshold)
-- ============================================================================
-- 1. FULL CLEANUP
DROP TABLE IF EXISTS mabase_prod.ml_all_scores;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ip;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ja4;
DROP DICTIONARY IF EXISTS mabase_prod.dict_asn_reputation;
DROP TABLE IF EXISTS mabase_prod.ml_detected_anomalies;
DROP VIEW IF EXISTS mabase_prod.view_ip_recurrence;
DROP VIEW IF EXISTS mabase_prod.view_ai_features_1h;
-- Drop the old heuristic views
DROP VIEW IF EXISTS mabase_prod.view_host_ip_ja4_rotation;
DROP VIEW IF EXISTS mabase_prod.view_host_ja4_anomalies;
DROP VIEW IF EXISTS mabase_prod.view_form_bruteforce_detected;
DROP VIEW IF EXISTS mabase_prod.view_alpn_mismatch_detected;
DROP VIEW IF EXISTS mabase_prod.view_tcp_spoofing_detected;
DROP VIEW IF EXISTS mabase_prod.mv_agg_host_ip_ja4_1h;
DROP TABLE IF EXISTS mabase_prod.agg_host_ip_ja4_1h;
DROP VIEW IF EXISTS mabase_prod.mv_agg_header_fingerprint_1h;
DROP TABLE IF EXISTS mabase_prod.agg_header_fingerprint_1h;
-- ============================================================================
-- 2. IN-MEMORY REPUTATION DICTIONARIES
-- ============================================================================
CREATE DICTIONARY mabase_prod.dict_bot_ip (prefix String, bot_name String)
PRIMARY KEY prefix SOURCE(FILE(path '/var/lib/clickhouse/user_files/bot_ip.csv' format 'CSV'))
LAYOUT(IP_TRIE()) LIFETIME(MIN 300 MAX 300);
CREATE DICTIONARY mabase_prod.dict_bot_ja4 (ja4 String, bot_name String)
PRIMARY KEY ja4 SOURCE(FILE(path '/var/lib/clickhouse/user_files/bot_ja4.csv' format 'CSV'))
LAYOUT(COMPLEX_KEY_HASHED()) LIFETIME(MIN 300 MAX 300);
CREATE DICTIONARY mabase_prod.dict_asn_reputation (src_asn UInt64, label String)
PRIMARY KEY src_asn SOURCE(FILE(path '/var/lib/clickhouse/user_files/asn_reputation.csv' format 'CSV'))
LAYOUT(HASHED()) LIFETIME(MIN 300 MAX 300);
-- ============================================================================
-- 3. BEHAVIORAL AGGREGATION TABLE (L4 / L5 / L7)
-- ============================================================================
CREATE TABLE mabase_prod.agg_host_ip_ja4_1h
(
window_start DateTime,
src_ip IPv6, ja4 String, host String, src_asn UInt32,
src_country_code SimpleAggregateFunction(any, String),
src_as_name SimpleAggregateFunction(any, String),
src_org SimpleAggregateFunction(any, String),
src_domain SimpleAggregateFunction(any, String),
first_seen SimpleAggregateFunction(min, DateTime),
last_seen SimpleAggregateFunction(max, DateTime),
hits SimpleAggregateFunction(sum, UInt64),
count_post SimpleAggregateFunction(sum, UInt64),
uniq_paths AggregateFunction(uniq, String),
uniq_query_params AggregateFunction(uniq, String),
tcp_fp_raw SimpleAggregateFunction(any, String),
tcp_jitter_variance AggregateFunction(varPop, Float64),
tcp_win_raw SimpleAggregateFunction(any, UInt32),
tcp_scale_raw SimpleAggregateFunction(any, UInt32),
tcp_mss_raw SimpleAggregateFunction(any, UInt32),
tcp_ttl_raw SimpleAggregateFunction(any, UInt32),
http_ver_raw SimpleAggregateFunction(any, String),
tls_alpn_raw SimpleAggregateFunction(any, String),
tls_sni_raw SimpleAggregateFunction(any, String),
first_ua SimpleAggregateFunction(any, String),
correlated_raw SimpleAggregateFunction(max, UInt8),
unique_src_ports AggregateFunction(uniq, UInt16),
unique_conn_id AggregateFunction(uniq, String),
max_keepalives SimpleAggregateFunction(max, UInt32),
orphan_count SimpleAggregateFunction(sum, UInt64),
ip_id_zero_count SimpleAggregateFunction(sum, UInt64),
total_ip_length_var AggregateFunction(varPop, Float64),
mss_1460_count SimpleAggregateFunction(sum, UInt64),
count_assets SimpleAggregateFunction(sum, UInt64),
count_no_referer SimpleAggregateFunction(sum, UInt64),
uniq_ua AggregateFunction(uniq, String),
max_requests_per_sec SimpleAggregateFunction(max, UInt32),
url_depth_variance AggregateFunction(varPop, Float64),
count_anomalous_payload SimpleAggregateFunction(sum, UInt64),
-- B features (added in v14)
uniq_ja3 AggregateFunction(uniq, String), -- B1: JA3/JA4 diversity
avg_syn_ms AggregateFunction(avg, Float64), -- B2: mean SYN timing (used for the CV)
tls12_count SimpleAggregateFunction(sum, UInt64), -- B3: TLS 1.2 ratio
count_head SimpleAggregateFunction(sum, UInt64), -- B4: HEAD request ratio
count_no_sec_fetch SimpleAggregateFunction(sum, UInt64), -- B5: missing Sec-Fetch-*
count_generic_accept SimpleAggregateFunction(sum, UInt64), -- B6: generic Accept
count_http10 SimpleAggregateFunction(sum, UInt64), -- B7: HTTP/1.0 ratio
ip_df_var AggregateFunction(varPop, Float64) -- B8: DF bit variance
)
ENGINE = AggregatingMergeTree()
ORDER BY (window_start, src_ip, ja4, host);
-- ============================================================================
-- 4. MATERIALIZED VIEW → agg_host_ip_ja4_1h
-- ============================================================================
CREATE MATERIALIZED VIEW mabase_prod.mv_agg_host_ip_ja4_1h
TO mabase_prod.agg_host_ip_ja4_1h AS
SELECT
toStartOfHour(src.time) AS window_start,
toIPv6(src.src_ip) AS src_ip, src.ja4, src.host, src.src_asn,
any(src.src_country_code) AS src_country_code, any(src.src_as_name) AS src_as_name,
any(src.src_org) AS src_org, any(src.src_domain) AS src_domain,
min(src.time) AS first_seen, max(src.time) AS last_seen, count() AS hits,
sum(IF(src.method = 'POST', 1, 0)) AS count_post,
uniqState(src.path) AS uniq_paths, uniqState(src.query) AS uniq_query_params,
any(toString(cityHash64(concat(toString(src.tcp_meta_window_size), toString(src.tcp_meta_mss), toString(src.tcp_meta_window_scale), src.tcp_meta_options)))) AS tcp_fp_raw,
varPopState(toFloat64(src.syn_to_clienthello_ms)) AS tcp_jitter_variance,
any(src.tcp_meta_window_size) AS tcp_win_raw, any(src.tcp_meta_window_scale) AS tcp_scale_raw,
any(src.tcp_meta_mss) AS tcp_mss_raw, any(src.ip_meta_ttl) AS tcp_ttl_raw,
any(src.http_version) AS http_ver_raw, any(src.tls_alpn) AS tls_alpn_raw, any(src.tls_sni) AS tls_sni_raw,
any(src.header_user_agent) AS first_ua, max(toUInt8(src.correlated)) AS correlated_raw,
uniqState(toUInt16(src.src_port)) AS unique_src_ports, uniqState(src.conn_id) AS unique_conn_id,
max(toUInt32(src.keepalives)) AS max_keepalives,
sum(IF(src.orphan_side = 'A' OR src.correlated = 0, 1, 0)) AS orphan_count,
sum(IF(src.ip_meta_id == 0, 1, 0)) AS ip_id_zero_count,
varPopState(toFloat64(src.ip_meta_total_length)) AS total_ip_length_var,
sum(IF(src.tcp_meta_mss == 1460, 1, 0)) AS mss_1460_count,
sum(IF(match(src.path, '(?i)\.(png|jpg|jpeg|gif|css|js|ico|woff2|svg|eot)$'), 1, 0)) AS count_assets,
sum(IF(position(src.client_headers, 'Referer') = 0, 1, 0)) AS count_no_referer,
uniqState(src.header_user_agent) AS uniq_ua,
0 AS max_requests_per_sec,
varPopState(toFloat64(length(replaceAll(src.path, '/', '//')) - length(src.path))) AS url_depth_variance,
sum(IF(src.ip_meta_total_length < 60 OR src.ip_meta_total_length > 1500, 1, 0)) AS count_anomalous_payload,
-- B features
uniqState(src.ja3) AS uniq_ja3,
avgState(toFloat64(src.syn_to_clienthello_ms)) AS avg_syn_ms,
sum(IF(src.tls_version = '1.2', 1, 0)) AS tls12_count,
sum(IF(src.method = 'HEAD', 1, 0)) AS count_head,
sum(IF(length(src.header_sec_fetch_site) = 0, 1, 0)) AS count_no_sec_fetch,
sum(IF(length(src.header_accept) < 5, 1, 0)) AS count_generic_accept,
sum(IF(src.http_version = 'HTTP/1.0', 1, 0)) AS count_http10,
varPopState(toFloat64(src.ip_meta_df)) AS ip_df_var
FROM mabase_prod.http_logs AS src
GROUP BY window_start, src_ip, ja4, host, src_asn;
-- ============================================================================
-- 5. HEADER AGGREGATION TABLE (L7)
-- ============================================================================
CREATE TABLE mabase_prod.agg_header_fingerprint_1h
(
window_start DateTime,
src_ip IPv6,
header_order_hash SimpleAggregateFunction(any, String),
header_count SimpleAggregateFunction(max, UInt16),
has_accept_language SimpleAggregateFunction(max, UInt8),
has_cookie SimpleAggregateFunction(max, UInt8),
has_referer SimpleAggregateFunction(max, UInt8),
modern_browser_score SimpleAggregateFunction(max, UInt8),
ua_ch_mismatch SimpleAggregateFunction(max, UInt8),
sec_fetch_mode SimpleAggregateFunction(any, String),
sec_fetch_dest SimpleAggregateFunction(any, String)
)
ENGINE = AggregatingMergeTree()
ORDER BY (window_start, src_ip);
CREATE MATERIALIZED VIEW mabase_prod.mv_agg_header_fingerprint_1h
TO mabase_prod.agg_header_fingerprint_1h AS
SELECT
toStartOfHour(src.time) AS window_start,
toIPv6(src.src_ip) AS src_ip,
any(toString(cityHash64(src.client_headers))) AS header_order_hash,
max(toUInt16(length(src.client_headers) - length(replaceAll(src.client_headers, ',', '')) + 1)) AS header_count,
max(toUInt8(if(position(src.client_headers, 'Accept-Language') > 0, 1, 0))) AS has_accept_language,
max(toUInt8(if(position(src.client_headers, 'Cookie') > 0, 1, 0))) AS has_cookie,
max(toUInt8(if(position(src.client_headers, 'Referer') > 0, 1, 0))) AS has_referer,
max(toUInt8(if(length(src.header_sec_ch_ua) > 0, 100, if(length(src.header_user_agent) > 0, 50, 0)))) AS modern_browser_score,
max(toUInt8(if((position(src.header_user_agent, 'Windows') > 0 AND position(src.header_sec_ch_ua_platform, 'Windows') == 0) OR (position(src.header_user_agent, 'iPhone') > 0 AND position(src.header_sec_ch_ua_platform, 'iOS') == 0), 1, 0))) AS ua_ch_mismatch,
any(src.header_sec_fetch_mode) AS sec_fetch_mode,
any(src.header_sec_fetch_dest) AS sec_fetch_dest
FROM mabase_prod.http_logs AS src
GROUP BY window_start, src.src_ip;
-- ============================================================================
-- 6. ML RESULTS TABLE: THREATS ONLY (scores < threshold)
-- ============================================================================
CREATE TABLE mabase_prod.ml_detected_anomalies
(
detected_at DateTime, src_ip IPv6, ja4 String, host String, bot_name String,
anomaly_score Float32, threat_level String, model_name String, recurrence UInt32,
asn_number String, asn_org String, asn_detail String, asn_domain String, country_code String, asn_label String,
hits UInt64, hit_velocity Float32, fuzzing_index Float32, post_ratio Float32, port_exhaustion_ratio Float32,
max_keepalives UInt32, orphan_ratio Float32, tcp_jitter_variance Float32, tcp_shared_count UInt32,
true_window_size UInt64, window_mss_ratio Float32, alpn_http_mismatch UInt8, is_alpn_missing UInt8, sni_host_mismatch UInt8,
header_count UInt16, has_accept_language UInt8, has_cookie UInt8, has_referer UInt8, modern_browser_score UInt8,
is_headless UInt8, ua_ch_mismatch UInt8, header_order_shared_count UInt32, ip_id_zero_ratio Float32,
request_size_variance Float32, multiplexing_efficiency Float32, mss_mobile_mismatch UInt8, correlated UInt8, reason String,
asset_ratio Float32, direct_access_ratio Float32, is_ua_rotating UInt8, distinct_ja4_count UInt32,
src_port_density Float32, ja4_asn_concentration Float32, ja4_country_concentration Float32, is_rare_ja4 UInt8,
header_order_confidence Float32, distinct_header_orders UInt32, temporal_entropy Float32,
path_diversity_ratio Float32, url_depth_variance Float32, anomalous_payload_ratio Float32,
-- Columns added in v11 (bot_detector v11)
campaign_id Int32 DEFAULT -1,
raw_anomaly_score Float32 DEFAULT 0
)
ENGINE = ReplacingMergeTree(detected_at)
ORDER BY (src_ip)
TTL detected_at + INTERVAL 30 DAY;
-- ============================================================================
-- 6b. TABLE OF ALL CLASSIFICATIONS (no threshold, for observability)
-- ============================================================================
CREATE TABLE mabase_prod.ml_all_scores
(
detected_at DateTime,
window_start DateTime,
src_ip IPv6,
ja4 String,
host String,
bot_name String,
anomaly_score Float32,
raw_anomaly_score Float32,
threat_level String,
model_name String,
correlated UInt8,
asn_number String,
asn_org String,
country_code String,
asn_label String,
hits UInt64,
hit_velocity Float32,
fuzzing_index Float32,
post_ratio Float32,
campaign_id Int32
)
ENGINE = ReplacingMergeTree(detected_at)
ORDER BY (window_start, src_ip, ja4, host, model_name)
TTL window_start + INTERVAL 3 DAY
SETTINGS index_granularity = 8192;
-- ============================================================================
-- 7. RECURRENCE VIEW
-- ============================================================================
CREATE OR REPLACE VIEW mabase_prod.view_ip_recurrence AS
SELECT src_ip, count() AS recurrence, min(detected_at) AS first_seen, max(detected_at) AS last_seen,
min(anomaly_score) AS worst_score, argMin(threat_level, anomaly_score) AS worst_threat_level
FROM mabase_prod.ml_detected_anomalies GROUP BY src_ip;
-- ============================================================================
-- 8. MAIN AI VIEW (with a CTE for temporal entropy)
-- ============================================================================
CREATE OR REPLACE VIEW mabase_prod.view_ai_features_1h AS
WITH base_data AS (
SELECT
a.window_start, a.src_ip, a.ja4, a.host,
toString(a.src_asn) AS asn_number, a.src_as_name AS asn_org,
a.src_org AS asn_detail, a.src_domain AS asn_domain, a.src_country_code AS country_code,
dictGetOrDefault('mabase_prod.dict_asn_reputation', 'label', toUInt64(a.src_asn), 'unknown') AS asn_label,
COALESCE(
nullIf(dictGetOrDefault('mabase_prod.dict_bot_ip', 'bot_name', a.src_ip, ''), ''),
nullIf(dictGetOrDefault('mabase_prod.dict_bot_ja4', 'bot_name', tuple(a.ja4), ''), ''),
''
) AS bot_name,
a.hits AS hits,
sum(a.hits) OVER (PARTITION BY a.src_ip) AS total_ip_hits,
a.correlated AS correlated, a.tcp_jitter_variance AS tcp_jitter_variance,
a.true_window_size AS true_window_size, a.window_mss_ratio AS window_mss_ratio, a.max_keepalives AS max_keepalives,
h.header_order_hash AS header_order_hash, h.header_count AS header_count,
h.has_accept_language AS has_accept_language, h.has_cookie AS has_cookie,
h.has_referer AS has_referer, h.modern_browser_score AS modern_browser_score, h.ua_ch_mismatch AS ua_ch_mismatch,
(a.count_post / (a.hits + 1)) AS post_ratio, (a.uniq_query_params / (a.uniq_paths + 1)) AS fuzzing_index,
(a.hits / (dateDiff('second', a.first_seen, a.last_seen) + 1)) AS hit_velocity,
(a.unique_src_ports / (a.hits + 1)) AS port_exhaustion_ratio, (a.orphan_count / (a.hits + 1)) AS orphan_ratio,
(a.ip_id_zero_count / (a.hits + 1)) AS ip_id_zero_ratio, (a.hits / (a.unique_conn_id + 1)) AS multiplexing_efficiency,
IF(a.mss_1460_count > (a.hits * 0.8) AND h.modern_browser_score > 70, 1, 0) AS mss_mobile_mismatch,
a.request_size_variance AS request_size_variance,
IF(a.tls_alpn = 'h2' AND a.http_version != '2', 1, 0) AS alpn_http_mismatch,
IF(length(a.tls_alpn) = 0 OR a.tls_alpn = '00', 1, 0) AS is_alpn_missing,
IF(length(a.tls_sni) > 0 AND a.tls_sni != a.host, 1, 0) AS sni_host_mismatch,
IF(h.sec_fetch_mode = 'navigate' AND h.sec_fetch_dest != 'document', 1, 0) AS is_fake_navigation,
count() OVER (PARTITION BY a.tcp_fingerprint) AS tcp_shared_count,
count() OVER (PARTITION BY h.header_order_hash) AS header_order_shared_count,
(a.count_assets / (a.hits + 1)) AS asset_ratio, (a.count_no_referer / (a.hits + 1)) AS direct_access_ratio,
IF(a.unique_ua > 2, 1, 0) AS is_ua_rotating, uniqExact(a.ja4) OVER (PARTITION BY a.src_ip) AS distinct_ja4_count,
((a.hits / (a.unique_src_ports + 1)) / (dateDiff('second', a.first_seen, a.last_seen) + 1)) AS src_port_density,
(sum(a.hits) OVER (PARTITION BY a.ja4, a.src_asn) / (sum(a.hits) OVER (PARTITION BY a.ja4) + 1)) AS ja4_asn_concentration,
(sum(a.hits) OVER (PARTITION BY a.ja4, a.src_country_code) / (sum(a.hits) OVER (PARTITION BY a.ja4) + 1)) AS ja4_country_concentration,
IF(sum(a.hits) OVER (PARTITION BY a.ja4) < 100, 1, 0) AS is_rare_ja4,
(count() OVER (PARTITION BY h.header_order_hash, a.first_ua) / (count() OVER (PARTITION BY a.first_ua) + 1)) AS header_order_confidence,
uniqExact(h.header_order_hash) OVER (PARTITION BY a.src_ip) AS distinct_header_orders,
(a.uniq_paths / (a.hits + 1)) AS path_diversity_ratio,
a.url_depth_variance AS url_depth_variance,
(a.count_anomalous_payload / (a.hits + 1)) AS anomalous_payload_ratio,
-- B features: TLS/TCP (available for correlated=1 only)
a.uniq_ja3_val AS uniq_ja3_per_row,
sqrt(a.tcp_jitter_variance) / greatest(a.avg_syn_ms_val, 1) AS syn_timing_cv, -- B2
a.tls12_count / (a.hits + 1) AS tls12_ratio, -- B3
-- B features: pure HTTP (available when correlated=0 and 1)
a.count_head / (a.hits + 1) AS head_ratio, -- B4
a.count_no_sec_fetch / (a.hits + 1) AS sec_fetch_absence_rate, -- B5
a.count_generic_accept / (a.hits + 1) AS generic_accept_ratio, -- B6
a.count_http10 / (a.hits + 1) AS http10_ratio, -- B7
a.ip_df_variance AS ip_df_variance -- B8
FROM (
SELECT
window_start, src_ip, ja4, host, src_asn,
any(src_country_code) AS src_country_code, any(src_as_name) AS src_as_name,
any(src_org) AS src_org, any(src_domain) AS src_domain, any(first_ua) AS first_ua,
sum(hits) AS hits, uniqMerge(uniq_paths) AS uniq_paths,
uniqMerge(uniq_query_params) AS uniq_query_params, sum(count_post) AS count_post,
min(first_seen) AS first_seen, max(last_seen) AS last_seen,
any(tcp_fp_raw) AS tcp_fingerprint, varPopMerge(tcp_jitter_variance) AS tcp_jitter_variance,
varPopMerge(total_ip_length_var) AS request_size_variance,
any(tcp_win_raw * exp2(tcp_scale_raw)) AS true_window_size,
IF(any(tcp_mss_raw) > 0, any(tcp_win_raw) / any(tcp_mss_raw), 0) AS window_mss_ratio,
any(http_ver_raw) AS http_version, any(tls_alpn_raw) AS tls_alpn, any(tls_sni_raw) AS tls_sni,
max(correlated_raw) AS correlated, uniqMerge(unique_src_ports) AS unique_src_ports,
uniqMerge(unique_conn_id) AS unique_conn_id, max(max_keepalives) AS max_keepalives,
sum(orphan_count) AS orphan_count, sum(ip_id_zero_count) AS ip_id_zero_count,
sum(mss_1460_count) AS mss_1460_count,
sum(count_assets) AS count_assets, sum(count_no_referer) AS count_no_referer, uniqMerge(uniq_ua) AS unique_ua,
varPopMerge(url_depth_variance) AS url_depth_variance,
sum(count_anomalous_payload) AS count_anomalous_payload,
-- B feature aggregates
uniqMerge(uniq_ja3) AS uniq_ja3_val,
avgMerge(avg_syn_ms) AS avg_syn_ms_val,
sum(tls12_count) AS tls12_count,
sum(count_head) AS count_head,
sum(count_no_sec_fetch) AS count_no_sec_fetch,
sum(count_generic_accept) AS count_generic_accept,
sum(count_http10) AS count_http10,
varPopMerge(ip_df_var) AS ip_df_variance
FROM mabase_prod.agg_host_ip_ja4_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY window_start, src_ip, ja4, host, src_asn
) a
LEFT JOIN (
SELECT
window_start, src_ip, any(header_order_hash) AS header_order_hash,
max(header_count) AS header_count, max(has_accept_language) AS has_accept_language,
max(has_cookie) AS has_cookie, max(has_referer) AS has_referer,
max(modern_browser_score) AS modern_browser_score, max(ua_ch_mismatch) AS ua_ch_mismatch,
any(sec_fetch_mode) AS sec_fetch_mode, any(sec_fetch_dest) AS sec_fetch_dest
FROM mabase_prod.agg_header_fingerprint_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY window_start, src_ip
) h ON a.src_ip = h.src_ip AND a.window_start = h.window_start
)
SELECT
*,
-(sum((hits / (total_ip_hits + 1)) * log2((hits / (total_ip_hits + 1)) + 0.000001)) OVER (PARTITION BY src_ip)) AS temporal_entropy,
-- B1: JA3/JA4 diversity ratio per src_ip (signal: bots rotating JA3 values over few JA4 values)
sum(uniq_ja3_per_row) OVER (PARTITION BY src_ip) / greatest(distinct_ja4_count, 1) AS ja3_diversity_ratio
FROM base_data;
-- ============================================================================
-- VIEWS FOR THE WEB DASHBOARD
-- ============================================================================
-- View for the dashboard's global metrics
CREATE OR REPLACE VIEW mabase_prod.view_dashboard_summary AS
SELECT
count() AS total_detections,
countIf(threat_level = 'CRITICAL') AS critical_count,
countIf(threat_level = 'HIGH') AS high_count,
countIf(threat_level = 'MEDIUM') AS medium_count,
countIf(threat_level = 'LOW') AS low_count,
countIf(bot_name != '') AS known_bots_count,
countIf(bot_name = '') AS anomalies_count,
uniq(src_ip) AS unique_ips
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 24 HOUR;
-- View for the time series (hourly)
CREATE OR REPLACE VIEW mabase_prod.view_dashboard_timeseries AS
SELECT
toStartOfHour(detected_at) AS hour,
count() AS total,
countIf(threat_level = 'CRITICAL') AS critical,
countIf(threat_level = 'HIGH') AS high,
countIf(threat_level = 'MEDIUM') AS medium,
countIf(threat_level = 'LOW') AS low
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;
-- View for the threat distribution
CREATE OR REPLACE VIEW mabase_prod.view_dashboard_threat_dist AS
SELECT
threat_level,
count() AS count,
round(count() * 100.0 / sum(count()) OVER (), 2) AS percentage
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 24 HOUR
GROUP BY threat_level
ORDER BY count DESC;
-- View for variability (used by the API)
-- Note v12: header_user_agent does not exist in ml_detected_anomalies → replaced by reason
CREATE OR REPLACE VIEW mabase_prod.view_dashboard_variability AS
SELECT
detected_at,
src_ip,
ja4,
host,
reason AS sample_reason,
country_code,
asn_number,
asn_org,
threat_level,
model_name,
anomaly_score,
campaign_id,
raw_anomaly_score
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 24 HOUR;


@ -0,0 +1,78 @@
version: '3.8'  # Field deprecated since Docker Compose v2.x but still tolerated; can be removed
services:
  bot_detector_ai:
    build: bot_detector
    container_name: bot_detector_ai
    restart: unless-stopped
    ports:
      - "8080:8080"  # Health check → GET http://localhost:8080/
    env_file:
      - .env
    environment:
      # ── ClickHouse ────────────────────────────────────────────────────────
      CLICKHOUSE_HOST: ${CLICKHOUSE_HOST:-clickhouse}
      CLICKHOUSE_DB: ${CLICKHOUSE_DB:-mabase_prod}
      CLICKHOUSE_USER: ${CLICKHOUSE_USER:-admin}
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-}
      # ── AI model ──────────────────────────────────────────────────────────
      ISOLATION_CONTAMINATION: ${ISOLATION_CONTAMINATION:-0.02}
      ANOMALY_THRESHOLD: ${ANOMALY_THRESHOLD:--0.03}
      # ── Cycle ─────────────────────────────────────────────────────────────
      CYCLE_INTERVAL_SEC: ${CYCLE_INTERVAL_SEC:-300}
      MAX_CONSECUTIVE_FAILURES: ${MAX_CONSECUTIVE_FAILURES:-3}
      # ── Logs ──────────────────────────────────────────────────────────────
      BOT_DETECTOR_LOG: ${BOT_DETECTOR_LOG:-/var/log/bot_detector/decisions.jsonl}
      LOG_BACKUP_COUNT: ${LOG_BACKUP_COUNT:-7}
      # ── Persistent models ─────────────────────────────────────────────────
      MODEL_DIR: ${MODEL_DIR:-/var/lib/bot_detector}
      RETRAIN_INTERVAL_HOURS: ${RETRAIN_INTERVAL_HOURS:-24}
      MODEL_HISTORY_COUNT: ${MODEL_HISTORY_COUNT:-10}
      # ── A1: Concept drift ─────────────────────────────────────────────────
      DRIFT_THRESHOLD: ${DRIFT_THRESHOLD:-0.30}
      # ── A2: Adaptive threshold ────────────────────────────────────────────
      ANOMALY_PERCENTILE: ${ANOMALY_PERCENTILE:-5}
      # ── A3: Multi-window analysis ─────────────────────────────────────────
      ENABLE_MULTIWINDOW: ${ENABLE_MULTIWINDOW:-false}
      MULTIWINDOW_VIEW: ${MULTIWINDOW_VIEW:-view_ai_features_24h}
      # ── A4: SHAP explainability ───────────────────────────────────────────
      ENABLE_SHAP: ${ENABLE_SHAP:-true}
      # ── A5: Cross-cycle deduplication with TTL ────────────────────────────
      DEDUP_TTL_MIN: ${DEDUP_TTL_MIN:-60}
      # ── A6: Recurrence-based score weighting ──────────────────────────────
      RECURRENCE_WEIGHT: ${RECURRENCE_WEIGHT:-0.005}
      # ── A7: Feature completeness validation ───────────────────────────────
      MIN_VALID_FEATURE_RATIO: ${MIN_VALID_FEATURE_RATIO:-0.50}
      # ── A8: Behavioral clustering of anomalies ────────────────────────────
      ENABLE_CLUSTERING: ${ENABLE_CLUSTERING:-true}
      CLUSTERING_MIN_SAMPLES: ${CLUSTERING_MIN_SAMPLES:-3}
      # ── Health check ──────────────────────────────────────────────────────
      HEALTH_PORT: ${HEALTH_PORT:-8080}
    volumes:
      # Structured JSONL logs (post-hoc analysis)
      - ./bot_detector_logs:/var/log/bot_detector
      # Serialized Isolation Forest models (joblib)
      - ./bot_detector_models:/var/lib/bot_detector
      # Reputation CSV files shared with ClickHouse (FILE engine)
      # Mounted read-only on the bot_detector side (writes go through ClickHouse only)
      - ./reputation/data/user_files/bot_ip.csv:/data/bot_ip.csv:ro
      - ./reputation/data/user_files/bot_ja4.csv:/data/bot_ja4.csv:ro
      - ./reputation/data/user_files/asn_reputation.csv:/data/asn_reputation.csv:ro


@ -0,0 +1,36 @@
3215,human
12322,human
5410,human
15557,human
21502,human
9036,human
8218,human
39180,human
3303,human
6730,human
9044,human
15600,human
13030,human
25256,human
5432,human
6848,human
12392,human
49686,human
6714,human
49203,human
6661,human
8469,human
20676,human
3320,human
3209,human
8881,human
6805,human
29562,human
31334,human
8422,human
25255,human
8447,human
12635,human
6830,human
8412,human
35369,human



@ -0,0 +1,19 @@
# Build outputs
dist/
# Dependency directories
vendor/
# IDE
.idea/
.vscode/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Aider cache
.aider*


@ -0,0 +1,2 @@
# correlator configuration — DO NOT COMMIT real values
LOGCORRELATOR_CLICKHOUSE_DSN=clickhouse://data_writer:ChangeMe@clickhouse:9000/mabase_prod


@ -0,0 +1,73 @@
name: Build and Test
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version: '1.24'
- name: Download dependencies
run: go mod download
- name: Run tests with coverage
run: |
go test -race -coverprofile=coverage.txt -covermode=atomic ./...
TOTAL=$(go tool cover -func=coverage.txt | grep total | awk '{gsub(/%/, "", $3); print $3}')
echo "Coverage: ${TOTAL}%"
if (( $(echo "$TOTAL < 80" | bc -l) )); then
echo "Coverage ${TOTAL}% is below 80% threshold"
exit 1
fi
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage.txt
build:
runs-on: ubuntu-latest
needs: test
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version: '1.24'
- name: Build binary
run: |
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
-ldflags="-w -s" \
-o logcorrelator \
./cmd/logcorrelator
- name: Upload binary artifact
uses: actions/upload-artifact@v4
with:
name: logcorrelator-linux-amd64
path: logcorrelator
docker:
runs-on: ubuntu-latest
needs: test
steps:
- uses: actions/checkout@v4
- name: Build Docker image
run: docker build -t logcorrelator:latest .
- name: Run tests in Docker
run: |
docker run --rm logcorrelator:latest --help || true

32
services/correlator/.gitignore vendored Normal file

@ -0,0 +1,32 @@
# Build directory
/build/
/dist/
# Binaries
*.exe
*.exe~
*.dll
*.so
*.dylib
/logcorrelator
# Test binary
*.test
# Output of the go coverage tool
*.out
# Dependency directories
vendor/
# IDE
.idea/
.vscode/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
.aider*


@ -0,0 +1,43 @@
# syntax=docker/dockerfile:1
FROM golang:1.24 AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends git bc && rm -rf /var/lib/apt/lists/*
COPY go.work go.work.sum* ./
COPY shared/go/ja4common/ ./shared/go/ja4common/
COPY services/sentinel/go.mod services/sentinel/go.sum* ./services/sentinel/
COPY services/correlator/go.mod services/correlator/go.sum* ./services/correlator/
WORKDIR /build/services/correlator
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY services/correlator/ /build/services/correlator/
ARG SKIP_TESTS=false
RUN --mount=type=cache,target=/go/pkg/mod \
if [ "$SKIP_TESTS" = "false" ]; then \
go test -race -coverprofile=coverage.txt -covermode=atomic ./... && \
echo "=== Coverage Report ===" && \
go tool cover -func=coverage.txt | grep total && \
TOTAL=$(go tool cover -func=coverage.txt | grep total | awk '{gsub(/%/, "", $3); print $3}') && \
echo "Total coverage: ${TOTAL}%" && \
if [ "$(echo "$TOTAL < 60" | bc -l)" -eq 1 ]; then \
echo "ERROR: Coverage ${TOTAL}% is below 60% threshold"; \
exit 1; \
fi && \
echo "Coverage check passed!"; \
else \
echo "Skipping tests (SKIP_TESTS=true)"; \
fi
RUN --mount=type=cache,target=/go/pkg/mod \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
-ldflags="-w -s" -o /usr/bin/correlator ./cmd/logcorrelator
FROM scratch AS runtime
COPY --from=builder /usr/bin/correlator /usr/bin/correlator
COPY --from=builder /build/services/correlator/config.example.yml /etc/correlator/correlator.yml
ENTRYPOINT ["/usr/bin/correlator"]
CMD ["-config", "/etc/correlator/correlator.yml"]


@ -0,0 +1,110 @@
# syntax=docker/dockerfile:1
# =============================================================================
# correlator: RPM packaging Dockerfile (Rocky Linux 8/9, AlmaLinux 10)
# Build context: monorepo root (ja4-platform/)
# Method: 1 Go builder → 1 rpm-builder (rpmbuild, 3 × dist) → 1 alpine output
# =============================================================================
# =============================================================================
# Stage 1: builder, compiles the Go binary
# Official golang:1.24 image (CGO_ENABLED=0 → statically linked, portable binary)
# =============================================================================
FROM golang:1.24 AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends git bc && \
rm -rf /var/lib/apt/lists/*
# Copy the Go workspace and the shared module first (better layer caching)
COPY go.work go.work.sum* ./
COPY shared/go/ja4common/ ./shared/go/ja4common/
COPY services/sentinel/go.mod services/sentinel/go.sum* ./services/sentinel/
COPY services/correlator/go.mod services/correlator/go.sum* ./services/correlator/
WORKDIR /build/services/correlator
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY services/correlator/ /build/services/correlator/
ARG VERSION=dev
RUN --mount=type=cache,target=/go/pkg/mod \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -ldflags="-w -s -X main.Version=${VERSION}" \
-o /tmp/correlator \
./cmd/logcorrelator
# =============================================================================
# Stage 2: rpm-builder, builds the RPMs with rpmbuild
# A single stage with three successive rpmbuild calls (el8, el9, el10).
# The spec reads its files from %{_builddir} (rpmbuild's BUILD directory).
# =============================================================================
FROM rockylinux:9 AS rpm-builder
WORKDIR /package
ARG VERSION=dev
RUN dnf install -y rpm-build rpmdevtools && dnf clean all
RUN mkdir -p /root/rpmbuild/{BUILD,BUILDROOT,RPMS,SOURCES,SPECS,SRPMS} && \
mkdir -p /packages/rpm/{el8,el9,el10}
# File layout under BUILD/ (expected by the correlator spec)
RUN mkdir -p /root/rpmbuild/BUILD/usr/bin \
/root/rpmbuild/BUILD/etc/logcorrelator \
/root/rpmbuild/BUILD/etc/systemd/system \
/root/rpmbuild/BUILD/etc/logrotate.d
COPY --from=builder /tmp/correlator /root/rpmbuild/BUILD/usr/bin/logcorrelator
COPY services/correlator/config.example.yml /root/rpmbuild/BUILD/etc/logcorrelator/logcorrelator.yml
COPY services/correlator/config.example.yml /root/rpmbuild/BUILD/etc/logcorrelator/logcorrelator.yml.example
COPY services/correlator/logcorrelator.service /root/rpmbuild/BUILD/etc/systemd/system/logcorrelator.service
COPY services/correlator/packaging/rpm/logrotate /root/rpmbuild/BUILD/etc/logrotate.d/logcorrelator
RUN chmod 755 /root/rpmbuild/BUILD/usr/bin/logcorrelator && \
chmod 640 /root/rpmbuild/BUILD/etc/logcorrelator/logcorrelator.yml && \
chmod 640 /root/rpmbuild/BUILD/etc/logcorrelator/logcorrelator.yml.example && \
chmod 644 /root/rpmbuild/BUILD/etc/systemd/system/logcorrelator.service && \
chmod 644 /root/rpmbuild/BUILD/etc/logrotate.d/logcorrelator
COPY services/correlator/packaging/rpm/logcorrelator.spec /root/rpmbuild/SPECS/logcorrelator.spec
# el8
RUN rpmbuild --define "_topdir /root/rpmbuild" \
--define "dist .el8" \
--define "version ${VERSION}" \
--target x86_64 \
-bb /root/rpmbuild/SPECS/logcorrelator.spec && \
cp /root/rpmbuild/RPMS/x86_64/*.el8.x86_64.rpm /packages/rpm/el8/
# el9
RUN rpmbuild --define "_topdir /root/rpmbuild" \
--define "dist .el9" \
--define "version ${VERSION}" \
--target x86_64 \
-bb /root/rpmbuild/SPECS/logcorrelator.spec && \
cp /root/rpmbuild/RPMS/x86_64/*.el9.x86_64.rpm /packages/rpm/el9/
# el10
RUN rpmbuild --define "_topdir /root/rpmbuild" \
--define "dist .el10" \
--define "version ${VERSION}" \
--target x86_64 \
-bb /root/rpmbuild/SPECS/logcorrelator.spec && \
cp /root/rpmbuild/RPMS/x86_64/*.el10.x86_64.rpm /packages/rpm/el10/
# =============================================================================
# Stage 3: output, final image containing only the RPMs
# =============================================================================
FROM alpine:latest AS output
WORKDIR /packages
COPY --from=rpm-builder /packages/rpm/el8/*.rpm /packages/rpm/el8/
COPY --from=rpm-builder /packages/rpm/el9/*.rpm /packages/rpm/el9/
COPY --from=rpm-builder /packages/rpm/el10/*.rpm /packages/rpm/el10/
CMD ["sh", "-c", \
"echo '=== RPM el8 ===' && ls -la /packages/rpm/el8/ && \
echo '' && echo '=== RPM el9 ===' && ls -la /packages/rpm/el9/ && \
echo '' && echo '=== RPM el10 ===' && ls -la /packages/rpm/el10/"]


@ -0,0 +1,148 @@
.PHONY: build build-docker test test-docker lint clean help docker-build-dev docker-build-runtime package package-rpm
# Docker parameters
DOCKER=docker
# Use buildx for better cache management and parallel builds
DOCKER_BUILD=$(DOCKER) build
DOCKER_BUILDX=$(DOCKER) buildx
DOCKER_RUN=$(DOCKER) run
# Image names
DEV_IMAGE=logcorrelator-dev:latest
RUNTIME_IMAGE=logcorrelator:latest
PACKAGER_IMAGE=logcorrelator-packager:latest
PACKAGER_IMAGE_EL8=logcorrelator-packager-el8:latest
PACKAGER_IMAGE_EL9=logcorrelator-packager-el9:latest
PACKAGER_IMAGE_EL10=logcorrelator-packager-el10:latest
# Binary name
BINARY_NAME=logcorrelator
DIST_DIR=dist
# Package version
PKG_VERSION ?= 1.1.22
# Enable BuildKit for better performance
export DOCKER_BUILDKIT=1
## build: Build the logcorrelator binary locally
build:
mkdir -p $(DIST_DIR)
go build -ldflags="-w -s" -o $(DIST_DIR)/$(BINARY_NAME) ./cmd/$(BINARY_NAME)
## docker-build-dev: Build the development Docker image (with tests and coverage)
docker-build-dev:
$(DOCKER_BUILD) --target builder -t $(DEV_IMAGE) -f Dockerfile .
## docker-build-dev-no-test: Build the development Docker image WITHOUT tests (faster)
docker-build-dev-no-test:
$(DOCKER_BUILD) --target builder --no-cache --build-arg SKIP_TESTS=true -t $(DEV_IMAGE) -f Dockerfile .
## docker-build-runtime: Build the runtime Docker image (fast, no tests)
docker-build-runtime:
$(DOCKER_BUILD) --target runtime -t $(RUNTIME_IMAGE) -f Dockerfile .
## test: Run unit tests locally
test:
go test -race -coverprofile=coverage.out ./...
## test-docker: Run unit tests inside Docker container
test-docker: docker-build-dev
@echo "Tests already run in builder stage"
## lint: Run linters
lint:
go vet ./...
gofmt -l .
## fmt: Format all Go files
fmt:
gofmt -w .
## package: Build RPM packages for all target distributions
package: package-rpm
## package-rpm: Build RPM packages for Rocky Linux 8/9, AlmaLinux 10 (requires Docker)
## Uses buildx; Dockerfile.package runs rpmbuild for el8, el9, el10 in a single stage
package-rpm:
mkdir -p $(DIST_DIR)/rpm/el8 $(DIST_DIR)/rpm/el9 $(DIST_DIR)/rpm/el10
@echo "Starting RPM builds for el8, el9, el10..."
# Build all three distributions with one buildx invocation
$(DOCKER_BUILDX) build --target output -t $(PACKAGER_IMAGE) \
--build-arg VERSION=$(PKG_VERSION) \
-f Dockerfile.package . \
--load
@echo "Extracting RPM packages from Docker image..."
$(DOCKER_RUN) --rm -v $(PWD)/$(DIST_DIR)/rpm:/output/rpm $(PACKAGER_IMAGE) sh -c \
"cp -r /packages/rpm/el8 /output/rpm/ && \
cp -r /packages/rpm/el9 /output/rpm/ && \
cp -r /packages/rpm/el10 /output/rpm/"
@echo "RPM packages created:"
@echo " Enterprise Linux 8 (el8):"
ls -la $(DIST_DIR)/rpm/el8/ 2>/dev/null || echo " (no packages)"
@echo " Enterprise Linux 9 (el9):"
ls -la $(DIST_DIR)/rpm/el9/ 2>/dev/null || echo " (no packages)"
@echo " Enterprise Linux 10 (el10):"
ls -la $(DIST_DIR)/rpm/el10/ 2>/dev/null || echo " (no packages)"
## package-rpm-sequential: Build RPM packages with plain docker build (fallback if buildx fails)
package-rpm-sequential:
mkdir -p $(DIST_DIR)/rpm/el8 $(DIST_DIR)/rpm/el9 $(DIST_DIR)/rpm/el10
@echo "Building RPMs for el8, el9, el10..."
# Dockerfile.package defines only the builder, rpm-builder and output stages,
# so the fallback builds the same output target without buildx.
$(DOCKER_BUILD) --target output -t $(PACKAGER_IMAGE) \
--build-arg VERSION=$(PKG_VERSION) \
-f Dockerfile.package .
@echo "Extracting RPM packages..."
$(DOCKER_RUN) --rm -v $(PWD)/$(DIST_DIR)/rpm:/output/rpm $(PACKAGER_IMAGE) sh -c \
"cp -r /packages/rpm/el8 /output/rpm/ && \
cp -r /packages/rpm/el9 /output/rpm/ && \
cp -r /packages/rpm/el10 /output/rpm/"
## test-package-rpm: Test RPM package installation in Docker
test-package-rpm: package-rpm
./packaging/test/test-rpm.sh
## test-package: Test RPM package installation
test-package: test-package-rpm
## ci: Full CI pipeline (tests, build, packages, package tests)
ci: ci-test ci-build ci-package ci-package-test
## ci-test: Run all tests for CI
ci-test: test lint
## ci-build: Build for CI (production binary)
ci-build: build
## ci-package: Build all packages for CI
ci-package: package
## ci-package-test: Test all packages for CI
ci-package-test: test-package
## clean: Clean build artifacts and Docker images
clean:
rm -rf $(DIST_DIR)/
rm -f coverage.out
$(DOCKER) rmi $(DEV_IMAGE) 2>/dev/null || true
$(DOCKER) rmi $(RUNTIME_IMAGE) 2>/dev/null || true
$(DOCKER) rmi $(PACKAGER_IMAGE) 2>/dev/null || true
## help: Show this help message
help:
@echo "Usage: make [target]"
@echo ""
@echo "Targets:"
@sed -n 's/^##//p' $(MAKEFILE_LIST) | column -t -s ':' | sed -e 's/^/ /'


@ -0,0 +1,426 @@
# logcorrelator
HTTP and network log correlation service written in Go.
## Description
**logcorrelator** receives two JSON log streams over Unix datagram sockets (SOCK_DGRAM):
- **Source A**: application HTTP logs (Apache, reverse proxy)
- **Source B**: network logs (IP/TCP metadata, JA3/JA4, etc.)
It correlates events on `src_ip + src_port` within a configurable time window and writes the correlated logs to:
- A local file (JSON lines)
- ClickHouse (for analysis and archiving)
The service's operational logs (startup, errors, metrics) are written to **stderr** and collected by journald. No correlated data is ever written to stdout.
## Architecture
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Source A       │────▶│                  │────▶│  File Sink      │
│  HTTP/Apache    │     │  Correlation     │     │  (JSON lines)   │
│  (Unix DGRAM)   │     │  Service         │     └─────────────────┘
└─────────────────┘     │                  │
                        │  - Buffers       │     ┌─────────────────┐
┌─────────────────┐     │  - Time Window   │────▶│  ClickHouse     │
│  Source B       │────▶│  - Orphan Policy │     │  Sink           │
│  Network/JA4    │     │  - Keep-Alive    │     └─────────────────┘
│  (Unix DGRAM)   │     └──────────────────┘
└─────────────────┘
```
Hexagonal architecture: pure domain (`internal/domain`), abstract ports (`internal/ports`), adapters (`internal/adapters`), orchestration (`internal/app`).
## Build (100% Docker)
All builds, tests and RPM packaging run inside containers:
```bash
# Full build with tests (builder stage)
make docker-build-dev
# RPM packaging (el8, el9, el10)
make package-rpm
# Fast build without tests
make docker-build-dev-no-test
# Local tests (requires Go 1.24+)
make test
```
### Prerequisites
- Docker 20.10+
## Installation
### RPM packages
```bash
# Build the packages
make package-rpm
# Install the package matching your distribution (Rocky Linux / AlmaLinux)
sudo dnf install -y dist/rpm/el8/logcorrelator-1.1.22-1.el8.x86_64.rpm
sudo dnf install -y dist/rpm/el9/logcorrelator-1.1.22-1.el9.x86_64.rpm
sudo dnf install -y dist/rpm/el10/logcorrelator-1.1.22-1.el10.x86_64.rpm
# Start the service
sudo systemctl enable --now logcorrelator
sudo systemctl status logcorrelator
```
### Manual build
```bash
# Local binary (requires Go 1.24+)
go build -o logcorrelator ./cmd/logcorrelator
./logcorrelator -config config.example.yml
```
## Configuration
YAML file. See `config.example.yml` for a complete example.
```yaml
log:
  level: INFO   # DEBUG, INFO, WARN, ERROR
inputs:
  unix_sockets:
    - name: http
      source_type: A   # HTTP source
      path: /var/run/logcorrelator/http.socket
      format: json
      socket_permissions: "0666"
    - name: network
      source_type: B   # Network source
      path: /var/run/logcorrelator/network.socket
      format: json
      socket_permissions: "0666"
outputs:
  file:
    path: /var/log/logcorrelator/correlated.log
  clickhouse:
    enabled: false
    dsn: clickhouse://user:pass@localhost:9000/db
    table: http_logs_raw
    batch_size: 500
    flush_interval_ms: 200
    max_buffer_size: 5000
    drop_on_overflow: true
    timeout_ms: 1000
  stdout:
    enabled: false   # no-op for data; operational logs always go to stderr
correlation:
  time_window:
    value: 10
    unit: s
  orphan_policy:
    apache_always_emit: true
    apache_emit_delay_ms: 500   # delay before emitting an orphan A (ms)
    network_emit: false
  matching:
    mode: one_to_many   # Keep-Alive: one B can correlate with several successive A events
  buffers:
    max_http_items: 10000
    max_network_items: 20000
  ttl:
    network_ttl_s: 120   # TTL reset on every correlation (Keep-Alive)
  # Exclude source IPs (single IPs or CIDR ranges)
  exclude_source_ips:
    - 10.0.0.1
    - 172.16.0.0/12
  # Restrict correlation to certain destination ports (optional)
  # If the list is empty, all ports are correlated
  include_dest_ports:
    - 80
    - 443
metrics:
  enabled: false
  addr: ":8080"
```
### ClickHouse DSN format
```
clickhouse://username:password@host:port/database
```
Ports: `9000` (native, recommended) or `8123` (HTTP).
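As a rough illustration of how a DSN in this format breaks down, the sketch below splits it with Go's standard `net/url` package. The `DSN` struct and `ParseDSN` function are hypothetical names introduced here; the service itself may rely on the ClickHouse driver's own DSN parsing.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// DSN holds the components of a clickhouse:// URL (hypothetical helper type).
type DSN struct {
	User, Password, Host, Port, Database string
}

// ParseDSN splits clickhouse://username:password@host:port/database.
func ParseDSN(raw string) (DSN, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return DSN{}, err
	}
	if u.Scheme != "clickhouse" {
		return DSN{}, fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	password, _ := u.User.Password() // nil-safe when no credentials are given
	return DSN{
		User:     u.User.Username(),
		Password: password,
		Host:     u.Hostname(),
		Port:     u.Port(),
		Database: strings.TrimPrefix(u.Path, "/"),
	}, nil
}

func main() {
	d, err := ParseDSN("clickhouse://user:pass@localhost:9000/db")
	if err != nil {
		fmt.Println("invalid DSN:", err)
		return
	}
	// The password is deliberately not printed.
	fmt.Println(d.User, d.Host, d.Port, d.Database)
}
```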
## Log formats
### Source A (HTTP)
```json
{
"src_ip": "192.168.1.1", "src_port": 8080,
"dst_ip": "10.0.0.1", "dst_port": 443,
"timestamp": 1704110400000000000,
"method": "GET", "path": "/api/test"
}
```
### Source B (Network)
```json
{
"src_ip": "192.168.1.1", "src_port": 8080,
"dst_ip": "10.0.0.1", "dst_port": 443,
"ja3": "abc123", "ja4": "xyz789"
}
```
### Correlated log (output)
Flat JSON structure: all A and B fields are merged at the root:
```json
{
"timestamp": "2024-01-01T12:00:00Z",
"src_ip": "192.168.1.1", "src_port": 8080,
"dst_ip": "10.0.0.1", "dst_port": 443,
"correlated": true,
"method": "GET", "path": "/api/test",
"ja3": "abc123", "ja4": "xyz789"
}
```
If a field collides between A and B, both values are kept with the `a_` and `b_` prefixes.
Orphan A events (with no matching B) are emitted with `"correlated": false, "orphan_side": "A"`.
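The merge and orphan rules can be sketched as below. One assumption is made loudly here: a "collision" is taken to mean the same field name carrying different values in A and B (which is why shared fields such as `src_ip` stay unprefixed in the example output). The service's exact rule may differ, and `Merge` and `OrphanA` are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
)

// Merge flattens the fields of A and B into one map.
// Assumption: a field present in both with different values is a collision
// and is stored as a_<name> and b_<name>; identical values are kept once.
func Merge(a, b map[string]any) map[string]any {
	out := make(map[string]any, len(a)+len(b)+1)
	for k, v := range a {
		out[k] = v
	}
	for k, vb := range b {
		va, inA := a[k]
		switch {
		case !inA:
			out[k] = vb
		case !reflect.DeepEqual(va, vb):
			delete(out, k)
			out["a_"+k] = va
			out["b_"+k] = vb
		}
	}
	out["correlated"] = true
	return out
}

// OrphanA marks an A event that found no matching B.
func OrphanA(a map[string]any) map[string]any {
	out := make(map[string]any, len(a)+2)
	for k, v := range a {
		out[k] = v
	}
	out["correlated"] = false
	out["orphan_side"] = "A"
	return out
}

func main() {
	a := map[string]any{"src_ip": "192.168.1.1", "method": "GET", "timestamp": 1}
	b := map[string]any{"src_ip": "192.168.1.1", "ja4": "xyz789", "timestamp": 2}
	fmt.Println(Merge(a, b))
	fmt.Println(OrphanA(a))
}
```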
## ClickHouse schema
The `sql/init.sql` file contains the complete, ready-to-use schema.
```bash
clickhouse-client --multiquery < sql/init.sql
```
### Table architecture
```
http_logs_raw          ← inserts from the service (raw_json String)
  └─ mv_http_logs      ← materialized view (parses JSON → typed columns)
       http_logs       ← table queried by analysts
```
### Table `http_logs`: columns
| Group | Columns |
|---|---|
| Time | `time` DateTime, `log_date` Date |
| Network | `src_ip` IPv4, `src_port` UInt16, `dst_ip` IPv4, `dst_port` UInt16 |
| HTTP | `method`, `scheme`, `host`, `path`, `query`, `http_version` (LowCardinality) |
| Correlation | `orphan_side`, `correlated` UInt8, `keepalives` UInt16, `a_timestamp`/`b_timestamp` UInt64, `conn_id` |
| IP meta | `ip_meta_df` UInt8, `ip_meta_id` UInt16, `ip_meta_total_length` UInt16, `ip_meta_ttl` UInt8 |
| TCP meta | `tcp_meta_options`, `tcp_meta_window_size` UInt32, `tcp_meta_mss` UInt16, `tcp_meta_window_scale` UInt8, `syn_to_clienthello_ms` Int32 |
| TLS / fingerprint | `tls_version`, `tls_sni`, `tls_alpn` (LowCardinality), `ja3`, `ja3_hash`, `ja4` |
| HTTP headers | `header_user_agent`, `header_accept`, `header_accept_encoding`, `header_accept_language`, `header_x_request_id`, `header_x_trace_id`, `header_x_forwarded_for`, `header_sec_ch_ua*`, `header_sec_fetch_*` |
### Users and permissions
```sql
-- data_writer: service account, INSERT (and SELECT) on http_logs_raw only
GRANT INSERT ON mabase_prod.http_logs_raw TO data_writer;
GRANT SELECT ON mabase_prod.http_logs_raw TO data_writer;
-- analyst: read access to the parsed table
GRANT SELECT ON mabase_prod.http_logs TO analyst;
```
### Verifying ingestion
```sql
-- Raw data received
SELECT count(*), min(ingest_time), max(ingest_time) FROM mabase_prod.http_logs_raw;
-- Data parsed by the materialized view
SELECT count(*), min(time), max(time) FROM mabase_prod.http_logs;
-- Latest correlated logs
SELECT time, src_ip, dst_ip, method, host, path, ja4
FROM mabase_prod.http_logs
WHERE correlated = 1
ORDER BY time DESC LIMIT 10;
```
## Signals
| Signal | Behavior |
|--------|--------------|
| `SIGINT` / `SIGTERM` | Graceful shutdown (drain buffers, flush sinks) |
| `SIGHUP` | Reopen output files (log rotation) |
## Internal logs
Operational logs go to **stderr**:
```bash
# Systemd
journalctl -u logcorrelator -f
# Docker
docker logs -f logcorrelator
```
## Project layout
```
cmd/logcorrelator/          # Entry point
internal/
  adapters/
    inbound/unixsocket/     # Reads SOCK_DGRAM → NormalizedEvent
    outbound/
      clickhouse/           # ClickHouse sink (batching, retry, full logging)
      file/                 # File sink (JSON lines, SIGHUP reopen)
      multi/                # Fan-out to several sinks
      stdout/               # No-op for data (operational logs go to stderr)
  app/                      # Orchestrator (sources → correlation → sinks)
  config/                   # YAML loading/validation
  domain/                   # CorrelationService, NormalizedEvent, CorrelatedLog
  observability/            # Logger, metrics, HTTP server for /metrics and /health
  ports/                    # EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
config.example.yml          # Example configuration
Dockerfile                  # Multi-stage build (builder, runtime)
Dockerfile.package          # Multi-distro RPM packaging (el8, el9, el10)
Makefile                    # Build targets
architecture.yml            # Architecture specification
logcorrelator.service       # systemd unit
```
## Debugging
### DEBUG logs
```yaml
log:
level: DEBUG
```
Sample log output:
```
[unixsocket:http] DEBUG event received: source=A src_ip=192.168.1.1 src_port=8080
[correlation] DEBUG processing A event: key=192.168.1.1:8080
[correlation] DEBUG correlation found: A(src_ip=... src_port=... ts=...) + B(...)
[correlation] DEBUG A event has no matching B key in buffer: key=...
[correlation] DEBUG event excluded by IP filter: source=A src_ip=10.0.0.1 src_port=8080
[correlation] DEBUG event excluded by dest port filter: source=A dst_port=22
[correlation] DEBUG TTL reset for B event (Keep-Alive): key=... new_ttl=120s
[clickhouse] DEBUG batch sent: rows=42 table=http_logs_raw
```
### Metrics server
```yaml
metrics:
  enabled: true
  addr: ":8080"
```
`GET /health` → `{"status":"healthy"}`
`GET /metrics`:
```json
{
"events_received_a": 1542, "events_received_b": 1498,
"correlations_success": 1450, "correlations_failed": 92,
"failed_no_match_key": 45, "failed_time_window": 23,
"failed_buffer_eviction": 5, "failed_ttl_expired": 12,
"failed_ip_excluded": 7, "failed_dest_port_filtered": 3,
"buffer_a_size": 23, "buffer_b_size": 18,
"orphans_emitted_a": 92, "orphans_pending_a": 4,
"keepalive_resets": 892
}
```
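A monitoring script might reduce this payload to a single health indicator. The sketch below decodes two of the counters shown above and computes a success ratio. Using `correlations_success + correlations_failed` as the denominator is an assumption made for illustration, and `metricsSnapshot` and `successRatio` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricsSnapshot holds the subset of /metrics fields used in this sketch.
type metricsSnapshot struct {
	Success int64 `json:"correlations_success"`
	Failed  int64 `json:"correlations_failed"`
}

// successRatio parses a /metrics payload and returns
// success / (success + failed), or 0 when nothing has been processed.
func successRatio(raw []byte) (float64, error) {
	var m metricsSnapshot
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, fmt.Errorf("decode metrics: %w", err)
	}
	total := m.Success + m.Failed
	if total == 0 {
		return 0, nil
	}
	return float64(m.Success) / float64(total), nil
}

func main() {
	sample := []byte(`{"correlations_success": 1450, "correlations_failed": 92}`)
	ratio, err := successRatio(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("correlation success ratio: %.3f\n", ratio)
}
```

In practice the payload would come from an HTTP GET on the configured `addr`; the literal sample keeps the sketch runnable without a live service.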
### Metrics-based diagnosis
| High metric | Cause | Fix |
|---|---|---|
| `failed_no_match_key` | A and B do not share the same `src_ip:src_port` | Check both sources |
| `failed_time_window` | Timestamps too far apart | Increase `time_window.value` or check NTP |
| `failed_ttl_expired` | B expires before correlation | Increase `ttl.network_ttl_s` |
| `failed_buffer_eviction` | Buffers too small | Increase `buffers.max_http_items` / `max_network_items` |
| `failed_ip_excluded` | Traffic from excluded IPs | Normal if expected |
| `failed_dest_port_filtered` | Traffic on unlisted ports | Check `include_dest_ports` |
| `orphans_emitted_a` | Many A events without a B | Check that source B is sending events |
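The table lends itself to a small lookup that points at the most frequent failure reason. The following is a sketch under the assumption that the counters have already been decoded into a map; `failureHints` and `dominantFailure` are hypothetical helpers, and the hint strings paraphrase the table.

```go
package main

import "fmt"

// failureHints lists the failure counters from the table with their remedy.
var failureHints = []struct{ metric, hint string }{
	{"failed_no_match_key", "check that both sources report the same src_ip:src_port"},
	{"failed_time_window", "increase time_window.value or check NTP"},
	{"failed_ttl_expired", "increase ttl.network_ttl_s"},
	{"failed_buffer_eviction", "increase buffers.max_http_items / max_network_items"},
	{"failed_ip_excluded", "normal if the exclusion is expected"},
	{"failed_dest_port_filtered", "check include_dest_ports"},
}

// dominantFailure returns the failure counter with the highest value and its
// hint. Ties go to the earlier table entry; all-zero input returns "", "".
func dominantFailure(m map[string]int64) (string, string) {
	var bestMetric, bestHint string
	var best int64
	for _, f := range failureHints {
		if v := m[f.metric]; v > best {
			best, bestMetric, bestHint = v, f.metric, f.hint
		}
	}
	return bestMetric, bestHint
}

func main() {
	metrics := map[string]int64{
		"failed_no_match_key":       45,
		"failed_time_window":        23,
		"failed_buffer_eviction":    5,
		"failed_ttl_expired":        12,
		"failed_ip_excluded":        7,
		"failed_dest_port_filtered": 3,
	}
	name, hint := dominantFailure(metrics)
	fmt.Printf("%s: %s\n", name, hint)
}
```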
### Source IP filtering
```yaml
correlation:
exclude_source_ips:
- 10.0.0.1 # Single IP (health checks)
- 172.16.0.0/12 # CIDR range
```
Events from these IPs are silently dropped: they are neither correlated nor emitted as orphans. The `failed_ip_excluded` metric counts the exclusions.
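One way to implement such a list, which accepts both single addresses and CIDR ranges, is to normalize every entry to a prefix. This is a sketch using the standard `net/netip` package, not the service's own filter; `parseExcludeList` and `isExcluded` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseExcludeList turns entries such as "10.0.0.1" or "172.16.0.0/12"
// into prefixes. A single IP becomes a full-length prefix (/32 or /128).
func parseExcludeList(entries []string) ([]netip.Prefix, error) {
	prefixes := make([]netip.Prefix, 0, len(entries))
	for _, e := range entries {
		if strings.Contains(e, "/") {
			p, err := netip.ParsePrefix(e)
			if err != nil {
				return nil, fmt.Errorf("invalid CIDR %q: %w", e, err)
			}
			prefixes = append(prefixes, p.Masked())
			continue
		}
		a, err := netip.ParseAddr(e)
		if err != nil {
			return nil, fmt.Errorf("invalid IP %q: %w", e, err)
		}
		prefixes = append(prefixes, netip.PrefixFrom(a, a.BitLen()))
	}
	return prefixes, nil
}

// isExcluded reports whether srcIP falls inside any excluded prefix.
// Unparsable addresses are treated as not excluded in this sketch.
func isExcluded(prefixes []netip.Prefix, srcIP string) bool {
	a, err := netip.ParseAddr(srcIP)
	if err != nil {
		return false
	}
	for _, p := range prefixes {
		if p.Contains(a) {
			return true
		}
	}
	return false
}

func main() {
	prefixes, err := parseExcludeList([]string{"10.0.0.1", "172.16.0.0/12"})
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for _, ip := range []string{"10.0.0.1", "10.0.0.2", "172.20.1.1"} {
		fmt.Println(ip, isExcluded(prefixes, ip))
	}
}
```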
### Destination port filtering
```yaml
correlation:
include_dest_ports:
- 80 # HTTP
- 443 # HTTPS
- 8080
- 8443
```
If the list is non-empty, only events whose `dst_port` is in the list take part in correlation; all others are silently dropped. An empty list means every port is correlated (default behavior). The `failed_dest_port_filtered` metric counts the exclusions.
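The "empty list means allow everything" rule is easy to get backwards, so here is a minimal sketch of the allow-list semantics described above. `portFilter` and its methods are hypothetical and only mirror the documented behavior.

```go
package main

import "fmt"

// portFilter is a set of allowed destination ports.
type portFilter map[int]struct{}

// newPortFilter builds the set from the include_dest_ports list.
func newPortFilter(ports []int) portFilter {
	f := make(portFilter, len(ports))
	for _, p := range ports {
		f[p] = struct{}{}
	}
	return f
}

// allows reports whether an event with this dst_port takes part in
// correlation. An empty filter allows every port (default behavior).
func (f portFilter) allows(dstPort int) bool {
	if len(f) == 0 {
		return true
	}
	_, ok := f[dstPort]
	return ok
}

func main() {
	filter := newPortFilter([]int{80, 443, 8080, 8443})
	fmt.Println(filter.allows(443)) // listed port
	fmt.Println(filter.allows(22))  // unlisted port
	fmt.Println(newPortFilter(nil).allows(22))
}
```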
### Test scripts
```bash
# Bash script (simple)
./scripts/test-correlation.sh -c 10 -v
# Python script (full scenarios: basic, time window, keepalive, different IPs)
pip install requests
python3 scripts/test-correlation-advanced.py --all
```
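For readers who would rather drive the sockets from Go, the sketch below sends one synthetic event to each input socket. It is not a port of the scripts above. The field names follow the required fields listed in `architecture.yml` (`src_ip`, `src_port`, `timestamp` in nanoseconds), the socket paths are the ones from the example configuration, and `testEvent`, `encodeEvent` and `sendDatagram` are hypothetical names. Real source-A events normally carry more fields (method, path, `header_*`).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// testEvent carries only the fields required for correlation.
type testEvent struct {
	SrcIP     string `json:"src_ip"`
	SrcPort   int    `json:"src_port"`
	Timestamp int64  `json:"timestamp"` // nanoseconds since the Unix epoch
}

// encodeEvent serializes one event as a single JSON document.
func encodeEvent(srcIP string, srcPort int, tsNanos int64) ([]byte, error) {
	return json.Marshal(testEvent{SrcIP: srcIP, SrcPort: srcPort, Timestamp: tsNanos})
}

// sendDatagram writes the payload as one datagram to a Unix datagram socket.
func sendDatagram(socketPath string, payload []byte) error {
	conn, err := net.Dial("unixgram", socketPath)
	if err != nil {
		return fmt.Errorf("dial %s: %w", socketPath, err)
	}
	defer conn.Close()
	_, err = conn.Write(payload)
	return err
}

func main() {
	payload, err := encodeEvent("192.168.1.1", 8080, time.Now().UnixNano())
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	// Same key on both sockets so the pair can correlate.
	for _, path := range []string{
		"/var/run/logcorrelator/http.socket",
		"/var/run/logcorrelator/network.socket",
	} {
		if err := sendDatagram(path, payload); err != nil {
			fmt.Println("send failed:", err)
		}
	}
}
```

Comparing `/metrics` before and after a run shows whether the pair was counted as a successful correlation.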
## Troubleshooting
### ClickHouse: insert errors
- **`No such column`**: check that the `http_logs_raw` table uses the single `raw_json` column (no separate columns)
- **`ACCESS_DENIED`**: `GRANT INSERT ON mabase_prod.http_logs_raw TO data_writer;`
- Flush errors are logged at ERROR level in the service logs
### Empty materialized view
If `http_logs_raw` has data but `http_logs` is empty:
```sql
-- Check the view
SHOW CREATE TABLE mabase_prod.mv_http_logs;
-- Check permissions (the MV runs under the service account)
GRANT SELECT ON mabase_prod.http_logs_raw TO data_writer;
```
### Unix sockets: permission denied
Check that `socket_permissions: "0666"` is configured and that the `/var/run/logcorrelator` directory is owned by the `logcorrelator` user.
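The same check can be scripted. The helper below is a hypothetical diagnostic, not part of the service: it verifies that a path is a Unix socket and that other users have read/write access, which is what mode 0666 grants.

```go
package main

import (
	"fmt"
	"os"
)

// checkSocket returns nil when path is a Unix socket that other users can
// read and write; otherwise it returns an error describing the problem.
func checkSocket(path string) error {
	fi, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if fi.Mode()&os.ModeSocket == 0 {
		return fmt.Errorf("%s is not a Unix socket", path)
	}
	if perm := fi.Mode().Perm(); perm&0o006 != 0o006 {
		return fmt.Errorf("%s has mode %04o: other users cannot read/write", path, uint32(perm))
	}
	return nil
}

func main() {
	// Path taken from the documented runtime directory; adjust as needed.
	if err := checkSocket("/var/run/logcorrelator/http.socket"); err != nil {
		fmt.Println("socket check failed:", err)
		return
	}
	fmt.Println("socket looks usable")
}
```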
### systemd service does not start
```bash
journalctl -u logcorrelator -n 50 --no-pager
/usr/bin/logcorrelator -config /etc/logcorrelator/logcorrelator.yml
```
## License
MIT


@ -0,0 +1,974 @@
service:
name: logcorrelator
context: http-network-correlation
language: go
pattern: hexagonal
description: >
logcorrelator is a system service (started by systemd) written in Go that
receives two JSON log streams over Unix sockets and correlates
application-level HTTP events (source A, typically Apache or a reverse proxy)
with network events (source B: IP/TCP metadata, JA3/JA4, etc.)
strictly on the src_ip + src_port pair, within
a configurable time window. The service supports HTTP Keep-Alive
connections: one network log can be correlated with several successive HTTP
logs (1-to-N strategy). In-memory retention is bounded by
configurable cache sizes and a dynamic TTL for source B. The
service always emits A events even when no B event is
available, never emits B logs on their own, and pushes results to
ClickHouse and/or a local file.
Built-in debugging features:
- HTTP metrics server (/metrics, /health)
- Detailed DEBUG logs with the reasons for correlation failures
- Source IP filtering (exclude_source_ips)
- Test scripts (Bash and Python)
- Metrics: events received, correlations, failures by reason, buffers, orphans
runtime:
deployment:
unit_type: systemd
description: >
logcorrelator ships as a standalone binary run as a
systemd service. The systemd unit handles automatic start at boot,
restart after a crash, and standard integration with the Linux
ecosystem.
binary_path: /usr/bin/logcorrelator
config_path: /etc/logcorrelator/logcorrelator.yml
user: logcorrelator
group: logcorrelator
restart: on-failure
systemd_unit:
path: /etc/systemd/system/logcorrelator.service
content_example: |
[Unit]
Description=logcorrelator service
After=network.target
[Service]
Type=simple
User=logcorrelator
Group=logcorrelator
ExecStart=/usr/bin/logcorrelator -config /etc/logcorrelator/logcorrelator.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/logcorrelator /var/run/logcorrelator /etc/logcorrelator
# Resource limits
LimitNOFILE=65536
# Systemd timeouts
TimeoutStartSec=10
TimeoutStopSec=30
[Install]
WantedBy=multi-user.target
os:
supported:
- rocky-linux-8
- rocky-linux-9
- almalinux-10
- autres-linux-recentes
logs:
stdout_stderr: journald
structured: true
description: >
The service's internal logs (errors, informational messages) are sent
to stdout/stderr and collected by journald. They are structured and
contain no personal data.
signals:
graceful_shutdown:
- SIGINT
- SIGTERM
reload:
- SIGHUP
description: >
SIGINT/SIGTERM: graceful shutdown (sockets closed, buffers flushed, sinks
closed). SIGHUP: output files are reopened (useful for
log rotation via logrotate) without stopping the service.
filesystem:
description: >
Ownership and permissions of the files and directories used by logcorrelator.
directories:
- path: /var/run/logcorrelator
owner: logcorrelator:logcorrelator
permissions: "0755"
purpose: >
Holds the Unix sockets (http.socket, network.socket).
The sockets are created with 0666 permissions (world read/write).
- path: /var/log/logcorrelator
owner: logcorrelator:logcorrelator
permissions: "0750"
purpose: >
Holds the correlated logs (correlated.log).
- path: /var/lib/logcorrelator
owner: logcorrelator:logcorrelator
permissions: "0750"
purpose: >
Home directory of the service (internal data).
- path: /etc/logcorrelator
owner: logcorrelator:logcorrelator
permissions: "0750"
purpose: >
Holds the configuration (logcorrelator.yml, logcorrelator.yml.example).
files:
- path: /etc/logcorrelator/logcorrelator.yml
owner: logcorrelator:logcorrelator
permissions: "0640"
rpm_directive: "%config(noreplace)"
- path: /etc/logcorrelator/logcorrelator.yml.example
owner: logcorrelator:logcorrelator
permissions: "0640"
- path: /etc/systemd/system/logcorrelator.service
owner: root:root
permissions: "0644"
- path: /etc/logrotate.d/logcorrelator
owner: root:root
permissions: "0644"
rpm_directive: "%config(noreplace)"
sockets:
- path: /var/run/logcorrelator/http.socket
owner: logcorrelator:logcorrelator
permissions: "0666"
type: unix_datagram
purpose: "Source A - application HTTP logs"
- path: /var/run/logcorrelator/network.socket
owner: logcorrelator:logcorrelator
permissions: "0666"
type: unix_datagram
purpose: "Source B - network logs"
packaging:
description: >
logcorrelator is distributed as .rpm packages (Rocky Linux, AlmaLinux,
RHEL) built entirely in containers. The RPM changelog is updated
on every version change.
All version numbers must be consistent across the RPM spec, the Makefile
(PKG_VERSION), CHANGELOG.md, and the git tags.
Configuration update policy:
- logcorrelator.yml is marked %config(noreplace): it is NEVER
overwritten during an upgrade. The existing configuration is preserved.
- logcorrelator.yml.example is ALWAYS updated to reflect
newly available configuration options.
- On first installation, if logcorrelator.yml does not exist, it is
created from logcorrelator.yml.example.
formats:
- rpm
target_distros:
- rocky-linux-8
- rocky-linux-9
- almalinux-10
- rhel-8
- rhel-9
- rhel-10
rpm:
tool: fpm
changelog:
source: git # or CHANGELOG.md
description: >
On each build, a script generates an RPM changelog file from
the history (tags/commits) and passes it to fpm (--rpm-changelog option).
contents:
- path: /usr/bin/logcorrelator
type: binary
- path: /etc/logcorrelator/logcorrelator.yml
type: config
directives: "%config(noreplace)"
behavior: >
Never overwritten during upgrades. Preserved automatically by RPM.
Created only on first installation if it does not already exist.
- path: /etc/logcorrelator/logcorrelator.yml.example
type: doc
behavior: >
ALWAYS updated during upgrades. Serves as the reference for
newly available configuration options.
- path: /etc/systemd/system/logcorrelator.service
type: systemd_unit
- path: /etc/logrotate.d/logcorrelator
type: logrotate_script
directives: "%config(noreplace)"
logrotate_example: |
/var/log/logcorrelator/correlated.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
create 0640 logcorrelator logcorrelator
sharedscripts
postrotate
/bin/systemctl reload logcorrelator > /dev/null 2>&1 || true
endscript
}
config:
format: yaml
location: /etc/logcorrelator/logcorrelator.yml
reload_strategy: signal_sighup_for_files
description: >
All configuration is centralized in one human-readable YAML file. The RPM
also ships an example file that is updated with each version.
example: |
# /etc/logcorrelator/logcorrelator.yml
log:
level: INFO # DEBUG, INFO, WARN, ERROR
inputs:
unix_sockets:
# HTTP source (A): application logs in JSON, 1 datagram = 1 log.
- name: http
source_type: A
path: /var/run/logcorrelator/http.socket
format: json
socket_permissions: "0666"
# Network source (B): IP/TCP/JA3... logs in JSON, 1 datagram = 1 log.
- name: network
source_type: B
path: /var/run/logcorrelator/network.socket
format: json
socket_permissions: "0666"
outputs:
file:
enabled: true
path: /var/log/logcorrelator/correlated.log
clickhouse:
enabled: false
dsn: clickhouse://user:pass@localhost:9000/db
table: correlated_logs_http_network
batch_size: 500
flush_interval_ms: 200
max_buffer_size: 5000
drop_on_overflow: true
async_insert: true
timeout_ms: 1000
stdout:
enabled: false
level: INFO # DEBUG: all logs (including orphans), INFO: correlated only, WARN: correlated only, ERROR: none
correlation:
# Correlation window: if the HTTP log arrives before the network log, it waits
# at most this long (unless evicted from the HTTP cache).
# Raised to 10s to support HTTP Keep-Alive.
time_window:
value: 10
unit: s
orphan_policy:
apache_always_emit: true # Always emit A events, even without a matching B
network_emit: false # Never emit B events on their own
matching:
mode: one_to_many # Keep-Alive: one B can correlate with several A events.
buffers:
# Maximum in-memory cache sizes (in number of logs).
max_http_items: 10000
max_network_items: 20000
ttl:
# Standard lifetime of a network log (B) in memory. Each successful
# correlation with an A resets this TTL.
# Raised to 120s to support long HTTP Keep-Alive sessions.
network_ttl_s: 120
# Source IPs to exclude (optional)
exclude_source_ips:
- 10.0.0.1 # Single IP
- 172.16.0.0/12 # CIDR range
# Events from these IPs are silently dropped
# HTTP metrics server (optional, for debugging and monitoring)
metrics:
enabled: false
addr: ":8080" # Listen address of the HTTP server
# Endpoints:
# GET /metrics - Returns the correlation metrics as JSON
# GET /health - Health check
inputs:
description: >
Two JSON log streams over Unix datagram sockets (SOCK_DGRAM). Each datagram
carries one complete JSON document. The source_type field ("A" or "B") should be set
for each socket; if it is omitted, the source is inferred automatically (headers
present = source A, otherwise source B).
unix_sockets:
- name: http
id: A
description: >
Source A, application HTTP logs (Apache, reverse proxy, etc.). Variable JSON
schema, mandatory timestamp field (int64, nanoseconds), dynamic headers (header_*).
path: /var/run/logcorrelator/http.socket
source_type: A
permissions: "0666"
protocol: unix
socket_type: dgram
mode: datagram
format: json
framing: message
max_datagram_bytes: 65535
retry_on_error: true
- name: network
id: B
description: >
Source B, network logs (IP/TCP metadata, JA3/JA4, etc.). Only src_ip
and src_port are required for correlation. The timestamp field is optional;
if it is absent, the reception time is used.
path: /var/run/logcorrelator/network.socket
source_type: B
permissions: "0666"
protocol: unix
socket_type: dgram
mode: datagram
format: json
framing: message
max_datagram_bytes: 65535
retry_on_error: true
outputs:
description: >
Correlated logs are sent to one or more sinks (MultiSink).
sinks:
file:
enabled: true
description: >
Local file sink. One JSON document per line. Rotation is handled by logrotate,
with the file reopened on SIGHUP. Setting `enabled: false` disables
file output entirely (the sink is not created).
path: /var/log/logcorrelator/correlated.log
format: json_lines
rotate_managed_by: external_logrotate
clickhouse:
enabled: false
description: >
Primary sink for archival and near-real-time analysis. Asynchronous
batch inserts, with drops under saturation. The service inserts only
into a RAW table (raw_json String, ingest_time DateTime DEFAULT now()).
The parsed table and the materialized view are managed externally (separate DDL).
All connection, flush, and retry errors are logged:
INFO on connection, ERROR on flush failure, WARN on drop/retry, DEBUG on successful send.
dsn: clickhouse://user:pass@host:9000/db
table: correlated_logs_http_network
batch_size: 500
flush_interval_ms: 200
max_buffer_size: 5000
drop_on_overflow: true
async_insert: true
timeout_ms: 1000
stdout:
enabled: false
description: >
No-op sink for data. No correlated or orphan data is
ever written to stdout. This sink exists only to satisfy
the CorrelatedLogSink interface. The service's operational logs
(startup, errors, debugging metrics) always go to stderr
via observability.Logger, independently of this sink.
correlation:
description: >
Strict correlation based on src_ip + src_port and a configurable time
window. No other field is used in the correlation decision.
key:
- src_ip
- src_port
time_window:
value: 10
unit: s
description: >
Time window applied to the timestamps of A and B. If B does not arrive within
this delay, A is emitted as an orphan. Raised to 10s for Keep-Alive.
retention_limits:
max_http_items: 10000
max_network_items: 20000
description: >
Cache limits. When max_http_items is reached, the oldest A is
evicted and emitted as an orphan. When max_network_items is reached, the oldest B
is dropped silently.
ttl_management:
network_ttl_s: 120
description: >
TTL of network logs. Each time a B is correlated with an A (Keep-Alive),
its TTL is reset to this value. Raised to 120s for long sessions.
timestamp_source:
apache: timestamp (int64 field, nanoseconds)
network: timestamp (int64 field, nanoseconds) if present, otherwise time (RFC3339),
otherwise reception_time (time.Now())
orphan_policy:
apache_always_emit: true
network_emit: false
matching:
mode: one_to_many
description: >
1-to-N strategy: one network log can be used for several successive HTTP
logs as long as it has neither expired nor been evicted.
ip_filtering:
directive: exclude_source_ips
description: >
List of source IPs (exact addresses or CIDR ranges) to ignore silently.
Matching events are neither correlated nor emitted as orphans. Metric: failed_ip_excluded.
dest_port_filtering:
directive: include_dest_ports
description: >
Allow-list of destination ports. If non-empty, only events
whose dst_port is in the list take part in correlation. All others
are silently ignored (neither correlated nor emitted as orphans).
Empty list = all ports allowed (default behavior).
Metric: failed_dest_port_filtered.
example:
include_dest_ports: [80, 443, 8080, 8443]
schema:
description: >
Variable schemas for A and B. Only a few fields are required
for correlation; the rest are accepted without code changes.
source_A:
description: >
Application HTTP logs in JSON format.
required_fields:
- name: src_ip
type: string
- name: src_port
type: int
- name: timestamp
type: int64
unit: ns
optional_fields:
- name: dst_ip
type: string
- name: dst_port
type: int
- name: method
type: string
- name: path
type: string
- name: host
type: string
- name: http_version
type: string
dynamic_fields:
- pattern: header_*
target_map: headers
- pattern: "*"
target_map: extra
source_B:
description: JSON network logs (IP/TCP, JA3/JA4...).
required_fields:
- name: src_ip
type: string
- name: src_port
type: int
optional_fields:
- name: dst_ip
type: string
- name: dst_port
type: int
- name: timestamp
type: int64
unit: ns
- name: time
type: string
format: RFC3339 or RFC3339Nano
dynamic_fields:
- pattern: "*"
target_map: extra
normalized_event:
description: >
Unified internal representation of A/B events.
fields:
- name: source
type: enum("A","B")
- name: timestamp
type: time.Time
- name: src_ip
type: string
- name: src_port
type: int
- name: dst_ip
type: string
optional: true
- name: dst_port
type: int
optional: true
- name: headers
type: map[string]string
optional: true
- name: extra
type: map[string]any
correlated_log:
description: >
Structure of the correlated log emitted to the sinks.
fields:
- name: timestamp
type: time.Time
- name: src_ip
type: string
- name: src_port
type: int
- name: dst_ip
type: string
optional: true
- name: dst_port
type: int
optional: true
- name: correlated
type: bool
- name: orphan_side
type: string
- name: "*"
type: map[string]any
clickhouse_schema:
strategy: external_ddls
database: mabase_prod
description: >
The ClickHouse table is managed outside the service. The service inserts into a
RAW table with a single raw_json column holding the complete correlated log
serialized as JSON. The ingest_time column uses DEFAULT now().
Any field extraction (parsed table, materialized view) is handled externally
through separate DDL that the service does not implement.
tables:
- name: http_logs_raw
description: >
Raw ingestion table. A single raw_json column holds the complete correlated
log serialized as JSON. The ingest_time column is auto-populated with
DEFAULT now(). Partitioned by day to optimize TTL handling.
engine: MergeTree
partition_by: toDate(ingest_time)
order_by: ingest_time
columns:
- name: raw_json
type: String
- name: ingest_time
type: DateTime
default: now()
insert_format: |
INSERT INTO mabase_prod.http_logs_raw (raw_json) VALUES
('{...correlated log serialized as JSON...}')
notes: >
The service uses the native clickhouse-go/v2 API (PrepareBatch + Append + Send).
The ingest_time column is NOT inserted explicitly (DEFAULT now() is used).
- name: http_logs
description: >
Parsed table (optional, managed externally). The service does NOT implement
extraction of the fields below. If this table is used, it must be
populated by a materialized view or an external ETL process.
engine: MergeTree
partition_by: log_date
order_by: (time, src_ip, dst_ip, ja4)
columns:
- name: time
type: DateTime
- name: log_date
type: Date
default: toDate(time)
- name: src_ip
type: IPv4
- name: src_port
type: UInt16
- name: dst_ip
type: IPv4
- name: dst_port
type: UInt16
- name: method
type: LowCardinality(String)
- name: scheme
type: LowCardinality(String)
- name: host
type: LowCardinality(String)
- name: path
type: String
- name: query
type: String
- name: http_version
type: LowCardinality(String)
- name: orphan_side
type: LowCardinality(String)
- name: correlated
type: UInt8
- name: keepalives
type: UInt16
status: non_implémenté
- name: a_timestamp
type: UInt64
status: non_implémenté
- name: b_timestamp
type: UInt64
status: non_implémenté
- name: conn_id
type: String
status: non_implémenté
- name: ip_meta_df
type: UInt8
status: non_implémenté
- name: ip_meta_id
type: UInt32
status: non_implémenté
- name: ip_meta_total_length
type: UInt32
status: non_implémenté
- name: ip_meta_ttl
type: UInt8
status: non_implémenté
- name: tcp_meta_options
type: LowCardinality(String)
status: non_implémenté
- name: tcp_meta_window_size
type: UInt32
status: non_implémenté
- name: syn_to_clienthello_ms
type: Int32
status: non_implémenté
- name: tls_version
type: LowCardinality(String)
status: non_implémenté
- name: tls_sni
type: LowCardinality(String)
status: non_implémenté
- name: ja3
type: String
status: non_implémenté
- name: ja3_hash
type: String
status: non_implémenté
- name: ja4
type: String
status: non_implémenté
- name: header_user_agent
type: String
status: non_implémenté
- name: header_accept
type: String
status: non_implémenté
- name: header_accept_encoding
type: String
status: non_implémenté
- name: header_accept_language
type: String
status: non_implémenté
- name: header_x_request_id
type: String
status: non_implémenté
- name: header_x_trace_id
type: String
status: non_implémenté
- name: header_x_forwarded_for
type: String
status: non_implémenté
- name: header_sec_ch_ua
type: String
status: non_implémenté
- name: header_sec_ch_ua_mobile
type: String
status: non_implémenté
- name: header_sec_ch_ua_platform
type: String
status: non_implémenté
- name: header_sec_fetch_dest
type: String
status: non_implémenté
- name: header_sec_fetch_mode
type: String
status: non_implémenté
- name: header_sec_fetch_site
type: String
status: non_implémenté
notes: >
This table and its materialized view are managed externally (separate DDL).
The service only inserts the raw JSON into http_logs_raw.
Fields marked "non_implémenté" (not implemented) are NOT extracted by the service.
users:
description: >
ClickHouse user management is external to the service. The DSN is
set in the YAML configuration file.
notes: >
This section is provided for reference for ClickHouse administration.
migration:
description: >
The service implements no migrations. Schema management
(tables, materialized views) is entirely external (separate DDL).
architecture:
description: >
Hexagonal architecture: an independent correlation domain, abstract ports
for sources/sinks, adapters for Unix sockets, file, ClickHouse, and
stdout, an application layer for orchestration, and infrastructure modules (config, observability).
modules:
- name: cmd/logcorrelator
type: entrypoint
responsibilities:
- Loading the YAML configuration.
- Initializing the inbound/outbound adapters.
- Creating the CorrelationService.
- Starting the orchestrator.
- Signal handling (SIGINT, SIGTERM, SIGHUP).
- Versioning via -ldflags (main.Version).
- name: internal/domain
type: domain
responsibilities:
- NormalizedEvent and CorrelatedLog models.
- CorrelationService (window, TTL, bounded buffers, one-to-many/Keep-Alive, orphans).
- Source IP filtering (exclude_source_ips, CIDR).
- Destination port filtering (include_dest_ports, allow-list).
- Custom JSON marshaling for CorrelatedLog (flat structure).
- name: internal/ports
type: ports
responsibilities:
- EventSource, CorrelatedLogSink, and CorrelationProcessor interfaces.
- name: internal/app
type: application
responsibilities:
- Orchestrator: EventSource → CorrelationService → MultiSink.
- Shutdown context handling and event draining.
- name: internal/adapters/inbound/unixsocket
type: adapter_inbound
responsibilities:
- Reading Unix datagrams (SOCK_DGRAM) and parsing JSON → NormalizedEvent.
- Automatic source detection (A/B) via source_type or headers.
- Socket permission handling (default 0666).
- Cleanup of the socket file on shutdown.
- name: internal/adapters/outbound/file
type: adapter_outbound
responsibilities:
- Writing JSON lines.
- Reopening the file on SIGHUP (log rotation).
- Path validation (allowed directory).
- name: internal/adapters/outbound/clickhouse
type: adapter_outbound
responsibilities:
- Buffering + asynchronous batch inserts.
- drop_on_overflow handling.
- Retry with exponential backoff (MaxRetries=3).
- Native clickhouse-go/v2 API (PrepareBatch + Append + Send).
- Full logging via observability.Logger (SetLogger): INFO on connection,
DEBUG on successful send (rows/table), WARN on buffer drops and retries,
ERROR on flush failure (periodic, batch, close).
- name: internal/adapters/outbound/stdout
type: adapter_outbound
responsibilities:
- No-op sink for correlated data.
- Write/Flush/Close do nothing: data never goes through stdout.
- Operational logs go to stderr via observability.Logger (independent of this sink).
- name: internal/adapters/outbound/multi
type: adapter_outbound
responsibilities:
- Fan-out to multiple sinks.
- Reopen() implementation for log rotation.
- name: internal/config
type: infrastructure
responsibilities:
- Loading/validating the YAML configuration.
- Default values and fallbacks for deprecated fields.
- name: internal/observability
type: infrastructure
responsibilities:
- Structured logger with levels (DEBUG, INFO, WARN, ERROR).
- CorrelationMetrics: tracks correlation statistics.
- MetricsServer: HTTP server exposing the metrics (/metrics, /health).
- Tracing of excluded events (exclude_source_ips).
- Logs for: events received, correlations, orphans, buffer full.
testing:
unit:
description: >
Table-driven unit tests, target coverage ≥ 80%. Current coverage
is roughly 74-80% depending on the version. The tests focus on the correlation
logic, the caches, the sinks, and datagram parsing.
coverage_minimum: 0.8
coverage_actual: ~0.74-0.80
focus:
- CorrelationService (window, TTL, evictions, one-to-many/Keep-Alive)
- Parsing A/B → NormalizedEvent (JSON datagrams)
- ClickHouseSink (batching, retry, overflow, error/success logging)
- FileSink (reopen on SIGHUP)
- MultiSink (fan-out)
- StdoutSink (no-op for data; test checks that stdout stays empty)
- Config (validation, default values, exclude_source_ips)
- UnixSocketSource (reading, permissions, cleanup)
- CorrelationMetrics (statistics tracking)
- MetricsServer (/metrics and /health endpoints)
integration:
description: >
Limited integration tests. The full flow A+B → correlation → sinks is
tested through unit tests with mocks. ClickHouse is mocked (no tests
against a real ClickHouse). Keep-Alive scenarios are tested in correlation_service_test.go.
Test scripts provided: scripts/test-correlation.sh and scripts/test-correlation-advanced.py.
docker:
description: >
Build, tests, and RPM packaging all run inside containers
via a multi-stage build. Two Dockerfiles: Dockerfile (build + runtime + dev)
and Dockerfile.package (multi-distro RPM: el8, el9, el10).
build_pipeline:
multi_stage: true
stages:
- name: builder
base: golang:1.21
description: >
go test -race -coverprofile=coverage.txt ./... with a coverage check
(fails if < 80%). Builds a static binary (CGO_ENABLED=0,
GOOS=linux, GOARCH=amd64).
- name: runtime
base: scratch
description: >
Minimal image containing only the binary and the example config.
- name: rpm_builder_el8
base: rockylinux:8
description: >
Installs fpm (via Ruby) and builds the RPM for Enterprise Linux 8.
- name: rpm_builder_el9
base: rockylinux:9
description: >
Installs fpm (via Ruby) and builds the RPM for Enterprise Linux 9.
- name: rpm_builder_el10
base: almalinux:10
description: >
Installs fpm (via Ruby) and builds the RPM for Enterprise Linux 10.
- name: output_export
base: alpine:latest
description: >
Exports the RPM packages built for the 3 distributions (el8, el9, el10).
files:
- path: Dockerfile
description: Main build (builder, runtime, dev) and single-distro RPM packaging.
- path: Dockerfile.package
description: Multi-distro RPM packaging (el8, el9, el10) with post/preun/postun scripts.
observability:
description: >
The service includes full debugging and monitoring features
for diagnosing correlation problems and tracking performance.
logging:
levels:
- DEBUG: All events received, correlation attempts, failure reasons
- INFO: Correlated events, service start/stop
- WARN: Orphans emitted, buffer full, TTL expired
- ERROR: Parsing errors, sink failures, critical errors
debug_logs:
- "event received: source=A src_ip=192.168.1.1 src_port=8080 timestamp=..."
- "processing A event: key=192.168.1.1:8080 timestamp=..."
- "correlation found: A(src_ip=... src_port=... ts=...) + B(src_ip=... src_port=... ts=...)"
- "A event has no matching B key in buffer: key=..."
- "A event has same key as B but outside time window: key=... time_diff=15s window=10s"
- "event excluded by IP filter: source=A src_ip=10.0.0.1 src_port=8080"
- "event excluded by dest port filter: source=A dst_port=22"
- "TTL reset for B event (Keep-Alive): key=... new_ttl=120s"
- "[clickhouse] DEBUG batch sent: rows=42 table=correlated_logs_http_network"
info_logs:
- "[clickhouse] INFO connected to ClickHouse: table=... batch_size=500 flush_interval_ms=200"
warn_logs:
- "[clickhouse] WARN buffer full, dropping log: table=... buffer_size=5000"
- "[clickhouse] WARN retrying batch insert: attempt=2/3 delay=100ms rows=42 err=connection refused"
error_logs:
- "[clickhouse] ERROR periodic flush failed: ..."
- "[clickhouse] ERROR batch flush failed: ..."
- "[clickhouse] ERROR final flush on close failed: ..."
metrics_server:
enabled: true
endpoints:
- path: /metrics
method: GET
description: Returns the correlation metrics as JSON
response_example: |
{
"events_received_a": 1542,
"events_received_b": 1498,
"correlations_success": 1450,
"correlations_failed": 92,
"failed_no_match_key": 45,
"failed_time_window": 23,
"failed_buffer_eviction": 5,
"failed_ttl_expired": 12,
"failed_ip_excluded": 7,
"failed_dest_port_filtered": 3,
"buffer_a_size": 23,
"buffer_b_size": 18,
"orphans_emitted_a": 92,
"keepalive_resets": 892
}
- path: /health
method: GET
description: Health check
response_example: |
{"status":"healthy"}
metrics_tracked:
events_received:
- events_received_a: Number of HTTP events (source A) received
- events_received_b: Number of network events (source B) received
correlations:
- correlations_success: Successful correlations
- correlations_failed: Failed correlations
failure_reasons:
- failed_no_match_key: src_ip:src_port key not found in the buffer
- failed_time_window: Events outside the time window
- failed_buffer_eviction: Buffer full, event evicted
- failed_ttl_expired: TTL of the B event expired
- failed_ip_excluded: Event excluded by the IP filter (exclude_source_ips)
- failed_dest_port_filtered: Event excluded by the destination port filter (include_dest_ports)
buffers:
- buffer_a_size: Current size of the HTTP buffer
- buffer_b_size: Current size of the network buffer
orphans:
- orphans_emitted_a: A orphans emitted (no matching B)
- orphans_emitted_b: B orphans emitted (always 0, policy: network_emit=false)
- orphans_pending_a: A orphans pending (delay before emission)
- pending_orphan_match: B correlated with a pending A orphan
keepalive:
- keepalive_resets: TTL resets in Keep-Alive mode (one-to-many)
troubleshooting:
description: >
Diagnostic guide based on metrics and logs
common_issues:
- symptom: failed_no_match_key is high
cause: A and B logs do not share the same src_ip + src_port
solution: Check that both sources use the same IP/port combination
- symptom: failed_time_window is high
cause: Timestamps too far apart (> time_window.value)
solution: Increase correlation.time_window.value or synchronize clocks (NTP)
- symptom: failed_ttl_expired is high
cause: B events expire before correlation
solution: Increase correlation.ttl.network_ttl_s
- symptom: failed_buffer_eviction is high
cause: Buffers too small for the log volume
solution: Increase correlation.buffers.max_http_items and max_network_items
- symptom: failed_ip_excluded is high
cause: Traffic from IPs listed in exclude_source_ips
solution: Check the configuration; this is normal if expected
- symptom: failed_dest_port_filtered is high
cause: Traffic on ports not listed in include_dest_ports
solution: Check the include_dest_ports setting, or empty the list to accept everything
- symptom: orphans_emitted_a is high
cause: Many A logs without a matching B
solution: Check that source B is actually sending the expected events
test_scripts:
- name: scripts/test-correlation.sh
description: Bash script to test correlation with synthetic events
features:
- Sends A+B pairs with the same src_ip:src_port
- Checks metrics before and after
- Options: -c (count), -d (delay), -v (verbose), -m (metrics-url)
- name: scripts/test-correlation-advanced.py
description: Advanced Python script with multiple test scenarios
features:
- Basic test: simple correlations
- Time window test: checks time window expiration
- Different IP test: checks that different IPs are not correlated
- Keep-Alive test: checks one-to-many mode
- Real-time metrics


@ -0,0 +1,202 @@
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/antitbone/ja4/correlator/internal/adapters/inbound/unixsocket"
	"github.com/antitbone/ja4/correlator/internal/adapters/outbound/clickhouse"
	"github.com/antitbone/ja4/correlator/internal/adapters/outbound/file"
	"github.com/antitbone/ja4/correlator/internal/adapters/outbound/multi"
	"github.com/antitbone/ja4/correlator/internal/adapters/outbound/stdout"
	"github.com/antitbone/ja4/correlator/internal/app"
	"github.com/antitbone/ja4/correlator/internal/config"
	"github.com/antitbone/ja4/correlator/internal/domain"
	"github.com/antitbone/ja4/correlator/internal/observability"
	"github.com/antitbone/ja4/correlator/internal/ports"
)

var Version = "dev"

func main() {
	configPath := flag.String("config", "config.yml", "path to configuration file")
	version := flag.Bool("version", false, "print version and exit")
	flag.Parse()

	if *version {
		fmt.Println(Version)
		os.Exit(0)
	}

	// Load configuration
	cfg, err := config.Load(*configPath)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error loading configuration: %v\n", err)
		os.Exit(1)
	}

	// Initialize logger with configured level
	logger := observability.NewLoggerWithLevel("logcorrelator", cfg.Log.GetLevel())
	logger.Info(fmt.Sprintf("Starting logcorrelator version %s (log_level=%s)", Version, cfg.Log.GetLevel()))

	// Create sources
	sources := make([]ports.EventSource, 0, len(cfg.Inputs.UnixSockets))
	for _, inputCfg := range cfg.Inputs.UnixSockets {
		source := unixsocket.NewUnixSocketSource(unixsocket.Config{
			Name:              inputCfg.Name,
			Path:              inputCfg.Path,
			SourceType:        inputCfg.SourceType,
			SocketPermissions: inputCfg.GetSocketPermissions(),
		})
		// Set logger for debug logging
		source.SetLogger(logger)
		sources = append(sources, source)
		logger.Info(fmt.Sprintf("Configured input source: name=%s, path=%s, permissions=%o", inputCfg.Name, inputCfg.Path, inputCfg.GetSocketPermissions()))
	}

	// Create sinks
	sinks := make([]ports.CorrelatedLogSink, 0)

	if cfg.Outputs.File.Enabled && cfg.Outputs.File.Path != "" {
		fileSink, err := file.NewFileSink(file.Config{
			Path: cfg.Outputs.File.Path,
		})
		if err != nil {
			logger.Error("Failed to create file sink", err)
			os.Exit(1)
		}
		sinks = append(sinks, fileSink)
		logger.Info(fmt.Sprintf("Configured file sink: path=%s", cfg.Outputs.File.Path))
	}

	if cfg.Outputs.ClickHouse.Enabled {
		clickHouseSink, err := clickhouse.NewClickHouseSink(clickhouse.Config{
			DSN:             cfg.Outputs.ClickHouse.DSN,
			Table:           cfg.Outputs.ClickHouse.Table,
			BatchSize:       cfg.Outputs.ClickHouse.BatchSize,
			FlushIntervalMs: cfg.Outputs.ClickHouse.FlushIntervalMs,
			MaxBufferSize:   cfg.Outputs.ClickHouse.MaxBufferSize,
			DropOnOverflow:  cfg.Outputs.ClickHouse.DropOnOverflow,
			AsyncInsert:     cfg.Outputs.ClickHouse.AsyncInsert,
			TimeoutMs:       cfg.Outputs.ClickHouse.TimeoutMs,
		})
		if err != nil {
			logger.Error("Failed to create ClickHouse sink", err)
			os.Exit(1)
		}
		clickHouseSink.SetLogger(logger)
		sinks = append(sinks, clickHouseSink)
		logger.Info(fmt.Sprintf("Configured ClickHouse sink: table=%s", cfg.Outputs.ClickHouse.Table))
	}

	if cfg.Outputs.Stdout.Enabled {
		stdoutSink := stdout.NewStdoutSink(stdout.Config{Enabled: true})
		sinks = append(sinks, stdoutSink)
		logger.Info("Configured stdout sink (operational logs on stderr)")
	}

	// Create multi-sink wrapper
	multiSink := multi.NewMultiSink(sinks...)

	// Create correlation service
	correlationSvc := domain.NewCorrelationService(domain.CorrelationConfig{
		TimeWindow:           cfg.Correlation.GetTimeWindow(),
		ApacheAlwaysEmit:     cfg.Correlation.GetApacheAlwaysEmit(),
		ApacheEmitDelayMs:    cfg.Correlation.GetApacheEmitDelayMs(),
		NetworkEmit:          false,
		MaxHTTPBufferSize:    cfg.Correlation.GetMaxHTTPBufferSize(),
		MaxNetworkBufferSize: cfg.Correlation.GetMaxNetworkBufferSize(),
		NetworkTTLS:          cfg.Correlation.GetNetworkTTLS(),
		MatchingMode:         cfg.Correlation.GetMatchingMode(),
		ExcludeSourceIPs:     cfg.Correlation.GetExcludeSourceIPs(),
		IncludeDestPorts:     cfg.Correlation.GetIncludeDestPorts(),
	}, &domain.RealTimeProvider{})

	// Set logger for correlation service
	correlationSvc.SetLogger(logger.WithFields(map[string]any{"component": "correlation"}))
	logger.Info(fmt.Sprintf("Correlation service initialized: time_window=%s, emit_orphans=%v, emit_delay_ms=%d",
		cfg.Correlation.GetTimeWindow().String(),
		cfg.Correlation.GetApacheAlwaysEmit(),
		cfg.Correlation.GetApacheEmitDelayMs()))

	// Start metrics server if enabled
	var metricsServer *observability.MetricsServer
	if cfg.Metrics.Enabled {
		addr := cfg.Metrics.Addr
		if addr == "" {
			addr = ":8080" // Default address
		}
		var err error
		metricsServer, err = observability.NewMetricsServer(addr, correlationSvc.GetMetricsSnapshot)
		if err != nil {
			logger.Error("Failed to create metrics server", err)
			os.Exit(1)
		}
		if err := metricsServer.Start(); err != nil {
			logger.Error("Failed to start metrics server", err)
			os.Exit(1)
		}
		logger.Info(fmt.Sprintf("Metrics server started: addr=%s", metricsServer.Addr()))
		logger.Info("Metrics endpoints: /metrics (JSON), /health")
	}

	// Create orchestrator
	orchestrator := app.NewOrchestrator(app.OrchestratorConfig{
		Sources: sources,
		Sink:    multiSink,
	}, correlationSvc)

	// Start the application
	if err := orchestrator.Start(); err != nil {
		logger.Error("Failed to start orchestrator", err)
		os.Exit(1)
	}
	logger.Info("logcorrelator started successfully")

	// Wait for shutdown signal
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP)

	for {
		sig := <-sigChan
		if sig == syscall.SIGHUP {
			// Reopen file sinks for log rotation
			logger.Info("SIGHUP received, reopening file sinks...")
			if err := multiSink.Reopen(); err != nil {
				logger.Error("Error reopening file sinks", err)
			} else {
				logger.Info("File sinks reopened successfully")
			}
			continue
		}
		// Shutdown signal received
		logger.Info(fmt.Sprintf("Shutdown signal received: %v", sig))
		break
	}

	// Graceful shutdown
	if err := orchestrator.Stop(); err != nil {
		logger.Error("Error during shutdown", err)
	}

	// Stop metrics server
	if metricsServer != nil {
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := metricsServer.Stop(shutdownCtx); err != nil {
			logger.Error("Error stopping metrics server", err)
		}
	}

	logger.Info("logcorrelator stopped")
}


@ -0,0 +1,92 @@
# logcorrelator configuration file
# Format: YAML

# Logging configuration
log:
  level: INFO  # DEBUG, INFO, WARN, ERROR

inputs:
  unix_sockets:
    - name: http
      source_type: A
      path: /var/run/logcorrelator/http.socket
      format: json
      socket_permissions: "0666"  # world read/write
    - name: network
      source_type: B
      path: /var/run/logcorrelator/network.socket
      format: json
      socket_permissions: "0666"

outputs:
  file:
    enabled: true
    path: /var/log/logcorrelator/correlated.log
  clickhouse:
    enabled: false
    dsn: clickhouse://user:pass@localhost:9000/db
    table: correlated_logs_http_network
    batch_size: 500
    flush_interval_ms: 200
    max_buffer_size: 5000
    drop_on_overflow: true
    async_insert: true
    timeout_ms: 1000
  stdout:
    enabled: false

correlation:
  # Time window for correlation (A and B must be within this window)
  # Increased to 10s to support HTTP Keep-Alive scenarios
  time_window:
    value: 10
    unit: s

  # Orphan policy: what to do when no match is found
  orphan_policy:
    apache_always_emit: true    # Always emit A events, even without B match
    apache_emit_delay_ms: 500   # Wait 500ms before emitting as orphan (allows B to arrive)
    network_emit: false         # Never emit B events alone

  # Matching mode: one_to_one or one_to_many (Keep-Alive)
  matching:
    mode: one_to_many

  # Buffer limits (max events in memory)
  buffers:
    max_http_items: 10000
    max_network_items: 20000

  # TTL for network events (source B)
  # Increased to 120s to support long-lived HTTP Keep-Alive sessions
  ttl:
    network_ttl_s: 120

  # Exclude specific source IPs or CIDR ranges from correlation
  # Events from these IPs will be silently dropped (not correlated, not emitted)
  # Useful for excluding health checks, internal traffic, or known bad actors
  exclude_source_ips:
    - 10.0.0.1        # Single IP
    - 192.168.1.100   # Another single IP
    - 172.16.0.0/12   # CIDR range (private network)
    - 10.10.10.0/24   # Another CIDR range

  # Restrict correlation to specific destination ports (optional)
  # If non-empty, only events whose dst_port matches one of these values will be correlated
  # Events on other ports are silently ignored (not correlated, not emitted as orphans)
  # Useful to focus on HTTP/HTTPS traffic only and ignore unrelated connections
  # include_dest_ports:
  #   - 80    # HTTP
  #   - 443   # HTTPS
  #   - 8080  # HTTP alt
  #   - 8443  # HTTPS alt

# Metrics server configuration (optional, for debugging/monitoring)
metrics:
  enabled: false
  addr: ":8080"  # Address to listen on (e.g., ":8080", "localhost:8080")
  # Endpoints:
  #   GET /metrics - Returns correlation metrics as JSON
  #   GET /health  - Health check endpoint


@ -0,0 +1,224 @@
# Detection architecture: logcorrelator
## Overview
The detection system is made up of **three layers** chained into a pipeline:
```
Captured HTTP/TLS traffic
┌───────────────────┐
│ ClickHouse        │  Storage, aggregation, heuristic views
│ (SQL pipeline)    │
└────────┬──────────┘
┌───────────────────┐
│ bot_detector.py   │  AI model (Isolation Forest, 5 min cycle)
│ (Python / ML)     │
└────────┬──────────┘
┌───────────────────┐
│ ml_detected_      │  Results table (ReplacingMergeTree)
│ anomalies         │
└───────────────────┘
```
---
## 1. Log ingestion (`http_logs_raw` → `http_logs`)
Raw logs arrive as JSON in the `http_logs_raw` table. A **materialized view** (`mv_http_logs`) parses them in real time and feeds the `http_logs` table, which holds the following structured fields:
| Category | Key fields |
|---|---|
| Network | `src_ip`, `src_port`, `dst_ip`, `dst_port` |
| Enrichment | `src_asn`, `src_country_code`, `src_as_name` (via the IPLocate dictionary) |
| HTTP | `method`, `host`, `path`, `query`, `http_version` |
| Correlation | `correlated`, `orphan_side`, `conn_id`, `keepalives` |
| IP metadata | `ip_meta_ttl`, `ip_meta_id`, `ip_meta_df`, `ip_meta_total_length` |
| TCP metadata | `tcp_meta_window_size`, `tcp_meta_mss`, `tcp_meta_window_scale`, `tcp_meta_options` |
| TLS / Fingerprint | `tls_version`, `tls_sni`, `tls_alpn`, `ja3`, `ja3_hash`, `ja4` |
| HTTP headers | `header_user_agent`, `header_sec_ch_ua*`, `header_sec_fetch_*`, … |
IP enrichment is performed through the `dict_iplocate_asn` dictionary (a CSV file loaded into memory and reloaded every 1-2 hours).
---
## 2. Behavioral aggregation (hourly window)
Two `AggregatingMergeTree` aggregation tables are fed continuously by materialized views.
### 2.1 `agg_host_ip_ja4_1h`: network and application behavior
Aggregates by the key **(window_start, src_ip, ja4, host)** every hour:
| Aggregated metric | Meaning |
|---|---|
| `hits` | Total number of requests |
| `count_post` | POST requests |
| `uniq_paths` | Distinct paths visited |
| `uniq_query_params` | Distinct query parameters |
| `unique_src_ports` | Distinct source ports |
| `unique_conn_id` | Distinct TCP connections |
| `max_keepalives` | Maximum reuse of a single connection |
| `orphan_count` | Requests without a complete TCP correlation |
| `ip_id_zero_count` | Packets with IP ID = 0 (potential spoofing) |
| `tcp_fp_raw` | Hash of the TCP fingerprint (window, MSS, scale, options) |
| `tcp_jitter_variance` | Variance of the SYN→ClientHello delay (TLS jitter) |
| `total_ip_length_var` | Variance of the IP packet size |
| `mss_1460_count` | Requests with MSS = 1460 (Ethernet/desktop signature) |
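
To make the grouping concrete, here is a minimal Python sketch of an hourly roll-up over a handful of the metrics above. It is only an illustration: the real work is done by ClickHouse materialized views and `AggregatingMergeTree` aggregate states, and the function name `aggregate_hourly` and the event dictionary layout are assumptions, not part of the project.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(events):
    """Group parsed log events by (window_start, src_ip, ja4, host).

    Hypothetical helper: each event is a dict with the keys used below.
    """
    buckets = defaultdict(
        lambda: {"hits": 0, "count_post": 0, "paths": set(), "src_ports": set()}
    )
    for event in events:
        # Truncate the timestamp to the start of its hour.
        window_start = event["time"].replace(minute=0, second=0, microsecond=0)
        key = (window_start, event["src_ip"], event["ja4"], event["host"])
        bucket = buckets[key]
        bucket["hits"] += 1
        if event["method"] == "POST":
            bucket["count_post"] += 1
        bucket["paths"].add(event["path"])
        bucket["src_ports"].add(event["src_port"])
    # Replace the working sets with their cardinalities.
    return {
        key: {
            "hits": b["hits"],
            "count_post": b["count_post"],
            "uniq_paths": len(b["paths"]),
            "unique_src_ports": len(b["src_ports"]),
        }
        for key, b in buckets.items()
    }
```

Exact sets are used here for the distinct counts purely for readability; this sketch makes no claim about which aggregate functions the real views use.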
### 2.2 `agg_header_fingerprint_1h`: HTTP header fingerprint
Aggregates by **(window_start, src_ip)**:
| Metric | Meaning |
|---|---|
| `header_order_hash` | Hash of the header order (JA4H fingerprint) |
| `header_count` | Number of distinct headers |
| `has_accept_language` | Presence of `Accept-Language` |
| `has_cookie` | Presence of `Cookie` |
| `has_referer` | Presence of `Referer` |
| `modern_browser_score` | Score of 0/50/100 depending on the presence of UA and `Sec-CH-UA` |
| `ua_ch_mismatch` | Inconsistency between `User-Agent` and `Sec-CH-UA-Platform` |
| `sec_fetch_mode/dest` | Declared navigation context |
---
## 3. Exclusions (allowlists)
Before any analysis, the following reference tables make it possible to **exclude known legitimate bots**:
- `bot_ip` (file `bot_ip.csv`): IPs to ignore (crawlers, monitoring…)
- `bot_ja4` (file `bot_ja4.csv`): JA4 fingerprints to ignore
- `ref_bot_networks`: IPv4/IPv6 CIDR networks, categorized (legitimate or malicious)
These exclusions are applied in the `view_ai_features_1h` view.
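
The exclusion logic can be pictured with a short Python sketch. This is not the SQL of `view_ai_features_1h`: the function name `is_excluded`, its parameters, and the assumption that only the legitimate subset of `ref_bot_networks` is used for exclusion are all hypothetical.

```python
import ipaddress


def is_excluded(src_ip, ja4, bot_ips, bot_ja4s, legit_networks):
    """Return True when a row should be skipped before analysis.

    bot_ips and bot_ja4s stand in for the bot_ip and bot_ja4 tables;
    legit_networks stands in for the CIDR ranges of ref_bot_networks
    that are categorized as legitimate (an assumption of this sketch).
    """
    if src_ip in bot_ips or ja4 in bot_ja4s:
        return True
    addr = ipaddress.ip_address(src_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists of CIDR ranges are safe to iterate over.
    return any(addr in ipaddress.ip_network(cidr) for cidr in legit_networks)
```

For example, with `legit_networks = ["192.0.2.0/24", "2001:db8::/32"]` (documentation ranges used as placeholders), `192.0.2.10` and `2001:db8::1` would be skipped while `198.51.100.7` would be analyzed.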
---
## 4. AI view: `view_ai_features_1h`
This consolidated view, computed **over a rolling 24 hours**, calculates the **28 features** passed to the ML model. It joins the two aggregation tables and derives the following metrics:
| Feature | Computation | Signal detected |
|---|---|---|
| `hit_velocity` | `hits / duration_in_seconds` | Abnormally high request volume |
| `fuzzing_index` | `uniq_query_params / uniq_paths` | Parameter exploration (fuzzing) |
| `post_ratio` | `count_post / hits` | Mass form submission |
| `port_exhaustion_ratio` | `unique_src_ports / hits` | Port rotation (scan) |
| `orphan_ratio` | `orphan_count / hits` | Requests without a complete handshake |
| `ip_id_zero_ratio` | `ip_id_zero_count / hits` | IP address spoofing |
| `multiplexing_efficiency` | `hits / unique_conn_id` | Connection reuse (H2/H3) |
| `true_window_size` | `tcp_win * 2^tcp_scale` | Actual TCP window size |
| `window_mss_ratio` | `tcp_win / tcp_mss` | TCP stack consistency |
| `tcp_jitter_variance` | Variance of SYN→ClientHello | Irregular TLS timing |
| `alpn_http_mismatch` | ALPN=h2 but HTTP/1.1 | Misleading TLS negotiation |
| `is_alpn_missing` | ALPN absent or `00` | Non-standard client |
| `sni_host_mismatch` | SNI ≠ Host header | Transparent proxy / bot |
| `mss_mobile_mismatch` | MSS=1460 + high browser score | Mobile client simulated from a desktop |
| `is_fake_navigation` | `sec_fetch_mode=navigate` but `sec_fetch_dest≠document` | Simulated navigation |
| `tcp_shared_count` | Number of IPs sharing the same TCP fingerprint | Shared infrastructure / botnet |
| `header_order_shared_count` | Number of IPs sharing the same header order | Common automated tool |
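
The ratio features in the table can be worked through with a small Python sketch. The formulas come from the table; the function name `derive_features`, the `duration_s` key, and the choice to return `0.0` on a zero denominator are assumptions made for the example, not a description of the SQL view.

```python
def derive_features(row):
    """Compute a few of the derived features from one aggregated row (a dict)."""

    def ratio(numerator, denominator):
        # Assumption: a zero denominator yields 0.0 instead of an error.
        return numerator / denominator if denominator else 0.0

    hits = row["hits"]
    return {
        "hit_velocity": ratio(hits, row["duration_s"]),
        "fuzzing_index": ratio(row["uniq_query_params"], row["uniq_paths"]),
        "post_ratio": ratio(row["count_post"], hits),
        "port_exhaustion_ratio": ratio(row["unique_src_ports"], hits),
        "orphan_ratio": ratio(row["orphan_count"], hits),
        "multiplexing_efficiency": ratio(hits, row["unique_conn_id"]),
        # The advertised window is scaled by 2^window_scale.
        "true_window_size": row["tcp_win"] * 2 ** row["tcp_scale"],
        "window_mss_ratio": ratio(row["tcp_win"], row["tcp_mss"]),
    }
```

As a worked example, a row with 1200 hits over 600 seconds, 300 distinct query parameters over 10 paths, and one source port per hit gives `hit_velocity = 2.0`, `fuzzing_index = 30.0` and `port_exhaustion_ratio = 1.0`, a profile that combines fuzzing with port rotation.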
---
## 5. AI model: Isolation Forest (`bot_detector.py`)
### Execution cycle
The service runs in a loop with a **5-minute cycle**:
```
fetch_and_analyze()
├─ Query SELECT * FROM view_ai_features_1h
├─ Data cleaning (fillna)
├─ Dual-Model routing:
│ ├─ [Full] correlated=1 → 23 features (network + TLS + headers)
│ └─ [Application] correlated=0 → 19 features (headers + behavior)
└─ INSERT INTO ml_detected_anomalies
```
### Model parameters
| Parameter | Value | Meaning |
|---|---|---|
| `n_estimators` | 200 | Number of isolation trees |
| `contamination` | 0.2% | Expected proportion of bots in the traffic |
| `score threshold` | < -0.05 | Score below which a session is flagged as an anomaly |
| `minimum volume` | 500 sessions | Below this, the model is skipped (too little data) |
### Dual-Model routing
Traffic is **split into two populations** based on the `correlated` field:
- **Full model** (`correlated=1`): TCP↔HTTP correlation is available, so the network features (TTL, TLS jitter, ALPN, SNI) are reliable and added to the analysis.
- **Application model** (`correlated=0`): only the HTTP layer is available, so the analysis focuses on application behavior (headers, paths, POST ratio…).
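
The routing and thresholding steps can be sketched in plain Python as follows. This is not the code of `bot_detector.py`: the two scoring callables stand in for the fitted Isolation Forest models, the function name `route_and_flag` is hypothetical, and applying the 500-session minimum to each population separately is an assumption.

```python
MIN_SESSIONS = 500       # "minimum volume" from the parameter table
SCORE_THRESHOLD = -0.05  # scores below this value are flagged


def route_and_flag(rows, score_full, score_app):
    """Split rows on `correlated`, score each population, return flagged rows.

    score_full and score_app are hypothetical callables standing in for
    the two fitted models; each takes a list of rows and returns one
    anomaly score per row (lower means more anomalous).
    """
    populations = (
        ("full", [r for r in rows if r["correlated"] == 1], score_full),
        ("application", [r for r in rows if r["correlated"] == 0], score_app),
    )
    flagged = []
    for name, population, scorer in populations:
        if len(population) < MIN_SESSIONS:
            continue  # too little data: skip this model for the cycle
        for row, score in zip(population, scorer(population)):
            if score < SCORE_THRESHOLD:
                flagged.append({**row, "model": name, "score": score})
    return flagged
```

Keeping the scorers as parameters makes the routing testable without a trained model; in the real service they would presumably wrap the Isolation Forest scoring call.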
---
## 6. Static heuristic views
Alongside the AI model, five SQL views provide **deterministic detections** without ML, over a 24h window:
| View | Detection rule |
|---|---|
| `view_host_ip_ja4_rotation` | IP with 5 distinct JA4 fingerprints and > 100 requests → identity rotation |
| `view_host_ja4_anomalies` | JA4 fingerprint seen from ≥ 20 IPs across ≥ 3 hosts → distributed scanning tool |
| `view_form_bruteforce_detected` | ≥ 10 distinct query params and ≥ 20 hits → form brute-force |
| `view_alpn_mismatch_detected` | HTTP/1.1 with ALPN h2 or h3 and ≥ 10 hits → fraudulent TLS negotiation |
| `view_tcp_spoofing_detected` | TTL ≤ 64 with a Windows or iPhone User-Agent → inconsistent OS fingerprint |
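
As an illustration of how such a deterministic rule works, here is a Python sketch of the `view_host_ja4_anomalies` rule (a JA4 fingerprint seen from at least 20 IPs across at least 3 hosts). The thresholds come from the table; the function name `ja4_anomalies` and the row layout are assumptions, and the actual view is written in SQL.

```python
from collections import defaultdict


def ja4_anomalies(rows, min_ips=20, min_hosts=3):
    """Return JA4 fingerprints seen from >= min_ips IPs across >= min_hosts hosts."""
    ips = defaultdict(set)
    hosts = defaultdict(set)
    for row in rows:
        ips[row["ja4"]].add(row["src_ip"])
        hosts[row["ja4"]].add(row["host"])
    return sorted(
        ja4
        for ja4 in ips
        if len(ips[ja4]) >= min_ips and len(hosts[ja4]) >= min_hosts
    )
```

Because the rule is a plain threshold on two distinct counts, it needs no training data and gives the same answer for the same input, which is what distinguishes these views from the model in the previous section.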
---
## 7. Results: `ml_detected_anomalies`
Detected anomalies are stored in a `ReplacingMergeTree(detected_at)` table with a **30-day TTL**. The ordering key `(src_ip, ja4, host)` ensures that each triplet keeps only the **most recent detection** (automatic deduplication).
Each record contains:
- The scores and features that led to the detection
- The `reason` field: human-readable text with score, velocity, and fuzzing index
- The `is_headless` field: derived from the `sec_fetch_mode` inconsistency
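
The deduplication semantics can be emulated in a few lines of Python. This is only a model of the end result: in ClickHouse, `ReplacingMergeTree` collapses rows during background merges, so duplicates can still be visible until a merge has run unless the query uses `FINAL`. The function name `keep_latest` is hypothetical.

```python
def keep_latest(rows):
    """Keep, for each (src_ip, ja4, host) key, the row with the highest detected_at.

    On equal detected_at values the later row in the input wins.
    """
    latest = {}
    for row in rows:
        key = (row["src_ip"], row["ja4"], row["host"])
        if key not in latest or row["detected_at"] >= latest[key]["detected_at"]:
            latest[key] = row
    return list(latest.values())
```

A consumer that reads the table without `FINAL` could apply the same reduction on its side to obtain one row per triplet.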
---
## 8. Complete flow diagram
```
┌─────────────────────────────────────┐
│ http_logs_raw (JSON)                │
└──────────────┬──────────────────────┘
│ mv_http_logs (MV)
┌─────────────────────────────────────┐
│ http_logs (parsed)                  │
└────────┬──────────────┬─────────────┘
│ │
mv_agg_host_ip_ja4 │ │ mv_agg_header_fingerprint
▼ ▼
┌──────────────────┐ ┌──────────────────────────┐
│ agg_host_ip_ja4  │ │ agg_header_fingerprint   │
│ _1h              │ │ _1h                      │
└────────┬─────────┘ └──────────┬──────────────┘
│ │
└──────────┬─────────────┘
│ view_ai_features_1h (JOIN + computations)
┌─────────────────────────────────────┐
│ bot_detector.py (Isolation Forest)  │
│ Cycle: 5 min | Window: 24h          │
└──────────────┬──────────────────────┘
┌─────────────────────────────────────┐
│ ml_detected_anomalies               │
│ (ReplacingMergeTree, TTL 30d)       │
└─────────────────────────────────────┘
```


@ -0,0 +1,29 @@
module github.com/antitbone/ja4/correlator
go 1.21
require (
github.com/ClickHouse/clickhouse-go/v2 v2.23.0
gopkg.in/yaml.v3 v3.0.1
)
require (
github.com/ClickHouse/ch-go v0.61.5 // indirect
github.com/andybalholm/brotli v1.1.0 // indirect
github.com/go-faster/city v1.0.1 // indirect
github.com/go-faster/errors v0.7.1 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/klauspost/compress v1.17.7 // indirect
github.com/paulmach/orb v0.11.1 // indirect
github.com/pierrec/lz4/v4 v4.1.21 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/segmentio/asm v1.2.0 // indirect
github.com/shopspring/decimal v1.3.1 // indirect
go.opentelemetry.io/otel v1.24.0 // indirect
go.opentelemetry.io/otel/trace v1.24.0 // indirect
golang.org/x/sys v0.18.0 // indirect
)
require github.com/antitbone/ja4/ja4common v0.1.0
replace github.com/antitbone/ja4/ja4common => ../../shared/go/ja4common

services/correlator/go.sum Normal file

@ -0,0 +1,110 @@
github.com/ClickHouse/ch-go v0.61.5 h1:zwR8QbYI0tsMiEcze/uIMK+Tz1D3XZXLdNrlaOpeEI4=
github.com/ClickHouse/ch-go v0.61.5/go.mod h1:s1LJW/F/LcFs5HJnuogFMta50kKDO0lf9zzfrbl0RQg=
github.com/ClickHouse/clickhouse-go/v2 v2.23.0 h1:srmRrkS0BR8gEut87u8jpcZ7geOob6nGj9ifrb+aKmg=
github.com/ClickHouse/clickhouse-go/v2 v2.23.0/go.mod h1:tBhdF3f3RdP7sS59+oBAtTyhWpy0024ZxDMhgxra0QE=
github.com/andybalholm/brotli v1.1.0 h1:eLKJA0d02Lf0mVpIDgYnqXcUn0GqVmEFny3VuID1U3M=
github.com/andybalholm/brotli v1.1.0/go.mod h1:sms7XGricyQI9K10gOSf56VKKWS4oLer58Q+mhRPtnY=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/go-faster/city v1.0.1 h1:4WAxSZ3V2Ws4QRDrscLEDcibJY8uf41H6AhXDrNDcGw=
github.com/go-faster/city v1.0.1/go.mod h1:jKcUJId49qdW3L1qKHH/3wPeUstCVpVSXTM6vO3VcTw=
github.com/go-faster/errors v0.7.1 h1:MkJTnDoEdi9pDabt1dpWf7AA8/BaSYZqibYyhZ20AYg=
github.com/go-faster/errors v0.7.1/go.mod h1:5ySTjWFiphBs07IKuiL69nxdfd5+fzh1u7FPGZP2quo=
github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q=
github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=
github.com/golang/snappy v0.0.1/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
github.com/google/go-cmp v0.5.2/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/kisielk/errcheck v1.5.0/go.mod h1:pFxgyoBC7bSaBwPgfKdkLd5X25qrDl4LWUI2bnpBCr8=
github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
github.com/klauspost/compress v1.13.6/go.mod h1:/3/Vjq9QcHkK5uEr5lBEmyoZ1iFhe47etQ6QUkpK6sk=
github.com/klauspost/compress v1.17.7 h1:ehO88t2UGzQK66LMdE8tibEd1ErmzZjNEqWkjLAKQQg=
github.com/klauspost/compress v1.17.7/go.mod h1:Di0epgTjJY877eYKx5yC51cX2A2Vl2ibi7bDH9ttBbw=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0 h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/montanaflynn/stats v0.0.0-20171201202039-1bf9dbcd8cbe/go.mod h1:wL8QJuTMNUDYhXwkmfOly8iTdp5TEcJFWZD2D7SIkUc=
github.com/paulmach/orb v0.11.1 h1:3koVegMC4X/WeiXYz9iswopaTwMem53NzTJuTF20JzU=
github.com/paulmach/orb v0.11.1/go.mod h1:5mULz1xQfs3bmQm63QEJA6lNGujuRafwA5S/EnuLaLU=
github.com/paulmach/protoscan v0.2.1/go.mod h1:SpcSwydNLrxUGSDvXvO0P7g7AuhJ7lcKfDlhJCDw2gY=
github.com/pierrec/lz4/v4 v4.1.21 h1:yOVMLb6qSIDP67pl/5F7RepeKYu/VmTyEXvuMI5d9mQ=
github.com/pierrec/lz4/v4 v4.1.21/go.mod h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ=
github.com/rogpeppe/go-internal v1.10.0/go.mod h1:UQnix2H7Ngw/k4C5ijL5+65zddjncjaFoBhdsK/akog=
github.com/segmentio/asm v1.2.0 h1:9BQrFxC+YOHJlTlHGkTrFWf59nbL3XnCoFLTwDCI7ys=
github.com/segmentio/asm v1.2.0/go.mod h1:BqMnlJP91P8d+4ibuonYZw9mfnzI9HfxselHZr5aAcs=
github.com/shopspring/decimal v1.3.1 h1:2Usl1nmF/WZucqkFZhnfFYxxxu8LG21F6nPQBE5gKV8=
github.com/shopspring/decimal v1.3.1/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFRcu2hWCYk4o=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/tidwall/pretty v1.0.0/go.mod h1:XNkn88O1ChpSDQmQeStsy+sBenx6DDtFZJxhVysOjyk=
github.com/xdg-go/pbkdf2 v1.0.0/go.mod h1:jrpuAogTd400dnrH08LKmI/xc1MbPOebTwRqcT5RDeI=
github.com/xdg-go/scram v1.1.1/go.mod h1:RaEWvsqvNKKvBPvcKeFjrG2cJqOkHTiyTpzz23ni57g=
github.com/xdg-go/stringprep v1.0.3/go.mod h1:W3f5j4i+9rC0kuIEJL0ky1VpHXQU3ocBgklLGvcBnW8=
github.com/youmark/pkcs8 v0.0.0-20181117223130-1be2e3e5546d/go.mod h1:rHwXgn7JulP+udvsHwJoVG1YGAP6VLg4y9I5dyZdqmA=
github.com/yuin/goldmark v1.1.27/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74=
github.com/yuin/goldmark v1.2.1/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74=
go.mongodb.org/mongo-driver v1.11.4/go.mod h1:PTSz5yu21bkT/wXpkS7WR5f0ddqw5quethTUn9WM+2g=
go.opentelemetry.io/otel v1.24.0 h1:0LAOdjNmQeSTzGBzduGe/rU4tZhMwL5rWgtp9Ku5Jfo=
go.opentelemetry.io/otel v1.24.0/go.mod h1:W7b9Ozg4nkF5tWI5zsXkaKKDjdVjpD4oAt9Qi/MArHo=
go.opentelemetry.io/otel/trace v1.24.0 h1:CsKnnL4dUAr/0llH9FKuc698G04IrpWV0MQA/Y1YELI=
go.opentelemetry.io/otel/trace v1.24.0/go.mod h1:HPc3Xr/cOApsBI154IU0OI0HJexz+aw5uPdbs3UCjNU=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
golang.org/x/crypto v0.0.0-20220622213112-05595931fe9d/go.mod h1:IxCIyHEi3zRg3s0A5j5BB6A9Jmi73HwBIUl50j+osU4=
golang.org/x/mod v0.2.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU=
golang.org/x/net v0.0.0-20211112202133-69e39bad7dc2/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20210220032951-036812b2e83c/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.18.0 h1:DBdB3niSjOA/O0blCZBqDefyWNYveAYMNF1Wum0DYQ4=
golang.org/x/sys v0.18.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ=
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE=
golang.org/x/tools v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
google.golang.org/protobuf v1.27.1/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=


@ -0,0 +1,111 @@
time
log_date
src_ip
- Source IP of the connection
src_port
- Source port of the connection
dst_ip
- Destination IP of the connection
dst_port
- Destination port of the connection
src_asn
- AS number of the source IP
src_country_code
- Country code of the source IP
src_as_name
- AS name of the source IP
src_org
- Organization of the source AS
src_domain
- Domain of the source IP's AS
method
- HTTP method [GET, POST, ...]
scheme
- HTTP connection type [http, https]
host
- Hostname requested in the URL
path
- Path requested in the URL
query
- Query string requested in the URL
http_version
- HTTP protocol version used
orphan_side
- Indicates whether the HTTP log could be enriched with the ip_, tcp, ja3_ and ja4_ information
- "A" means that only the HTTP log is present, without enrichment
correlated
- Whether the correlation algorithm (HTTP log + TCP parameters) succeeded (tcp + ja4/3)
keepalives
- Sequence number within an HTTP keep-alive connection
a_timestamp
b_timestamp
conn_id
ip_meta_df
- Don't Fragment flag
ip_meta_id
- ID of the IP packet
ip_meta_total_length
- Total length of the IP packet (Total Length field)
ip_meta_ttl
- TTL of the IP packet as seen by the server receiving the packet
tcp_meta_options
- TCP packet options as seen by the server receiving the packet
tcp_meta_window_size
- TCP window size as seen by the server receiving the packet
tcp_meta_mss
- TCP MSS as seen by the server receiving the packet
tcp_meta_window_scale
- TCP window scale as seen by the server receiving the packet
syn_to_clienthello_ms
- Duration in ms between the first SYN packet and the TLS ClientHello
tls_version
- TLS version negotiated with the server receiving the packet
tls_sni
- SNI, the domain name requested for the TLS certificate
tls_alpn
- ALPN advertised during the TLS handshake
ja3
- List of algorithms used for the JA3 signature
ja3_hash
- JA3 hash
ja4
- JA4 hash
client_headers
- List of headers sent by the HTTP client, in the form Header,Header2,Header3,...
header_user_agent
- HTTP User-Agent header
header_accept
- HTTP Accept header
header_accept_encoding
- HTTP Accept-Encoding header
header_accept_language
- HTTP Accept-Language header
header_content_type
- Content-Type header
header_x_request_id
- X-Request-ID header
header_x_trace_id
- X-Trace-ID header
header_x_forwarded_for
- X-Forwarded-For header
header_sec_ch_ua
- Sec-CH-UA header
header_sec_ch_ua_mobile
- Sec-CH-UA-Mobile header
header_sec_ch_ua_platform
- Sec-CH-UA-Platform header
header_sec_fetch_dest
- Sec-Fetch-Dest header
header_sec_fetch_mode
- Sec-Fetch-Mode header
header_sec_fetch_site
- Sec-Fetch-Site header


@ -0,0 +1,30 @@
1. Signature Inconsistencies (Spoofing)
User-Agent vs TLS: The header_user_agent claims to be a browser (Chrome/Safari) but the ja3/ja4 matches a scripting tool.
User-Agent vs modern headers: The header_user_agent indicates a recent browser, but the header_sec_ch_ua_* headers are empty or missing from client_headers.
User-Agent vs ALPN: The declared browser does not match the protocol negotiated in tls_alpn (e.g. Chrome without h2).
OS vs TCP TTL: The OS declared in the header_user_agent (e.g. Windows) contradicts the value of ip_meta_ttl (e.g. 64, typical of Linux).
Host vs SNI: The domain name in the host header does not match the tls_sni requested during the TLS handshake.
2. Header Anomalies (HTTP Fingerprinting)
Order fingerprint: Sudden appearance of a very rare client_headers layout (exact order) that generates a lot of traffic.
Header poverty: The total number of headers in client_headers is abnormally low (e.g. < 5), typical of basic scripts.
Missing vital headers: The traffic claims to be human but does not send header_accept_language or header_accept_encoding.
Fatal combination: A specific ja4 combined with a previously unseen client_headers order (detects bots that alter their TLS layer but are betrayed by the application layer).
3. Network and TCP Anomalies (Layers 3 & 4)
Mass TCP mechanics: The same combination (tcp_meta_window_size, tcp_meta_window_scale, tcp_meta_mss) seen on thousands of different IPs.
Robotic handshake: An abnormally constant syn_to_clienthello_ms delay (near-zero variance) across a large number of connections, typical of a bot in a datacenter.
Atypical TCP options: tcp_meta_options values that are unusual for the regular web traffic of your real users.
4. Behavioral and Volumetric Anomalies (Request Side)
Request burst (Spike): Call volume (count) per src_ip or per ja4 far exceeding the historical 99th percentile over 5 minutes.
Stealthy distributed scraping: The same (non-standard) ja4 used by hundreds of different src_ip, each making very few requests.
Blind sweep (Scanner): An abnormal volume of unique path values (or path + query) visited by the same IP or the same ja4 within a few minutes (replaces detection of 404 errors).
Target hammering (Blind brute force): An extreme concentration of requests targeting only sensitive path values (login, API, password-reset) with no normal browsing of the rest of the site (replaces detection of 401/403).
Suspicious methods: Massive or unusual use of less common method values (PUT, DELETE, OPTIONS, TRACE) compared with the baseline.
Suspicious payloads: Injection patterns or very unusual characters in query or path (extreme length, multiple encodings).
"Low and Slow" bot: An IP or ja4 that stays under the radar over 5 minutes, but whose cumulative volume over 24 hours or 7 days is statistically implausible for a human.


@ -0,0 +1,521 @@
# 🛡️ Technical Reference Manual: Antispam & Bot Detection Engine
This document details the detection algorithms implemented in the platform's ClickHouse views.
---
## 1. Transport Layer Analysis (L4): The "Physical Trace"
Before even looking at the URL, the engine inspects how the connection was established. This is the hardest layer for an attacker to forge.
### A. TCP Stack Fingerprint (`tcp_fingerprint`)
* **How it works:** We use `cityHash64` to build an identifier from three handshake parameters set by the client's TCP stack: the **MSS** (Maximum Segment Size), the **Window Size** and the **Window Scale**.
* **What it detects:** The software stack behind the connection. A bot running on an Alpine Linux image will have a different TCP signature from a user on iOS 17 or Windows 11.
* **Botnet detection:** If 500 different IPs share exactly the same `tcp_fingerprint` AND the same `ja4`, it is very likely (estimated at 99%) to be a cluster of cloned bots.
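The fingerprint-and-count logic above can be sketched in Go. This is a minimal illustration, not the view's implementation: the `conn` struct is hypothetical, and FNV-1a from the standard library stands in for ClickHouse's `cityHash64`, so the hash values will not match those stored by the views.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// conn holds the handshake fields used by this sketch (hypothetical struct).
type conn struct {
	SrcIP       string
	JA4         string
	MSS         uint16
	WindowSize  uint16
	WindowScale uint8
}

// tcpFingerprint hashes (MSS, window size, window scale).
// Assumption: FNV-1a replaces cityHash64, which is not in the Go standard library.
func tcpFingerprint(mss, window uint16, scale uint8) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d|%d|%d", mss, window, scale)
	return h.Sum64()
}

// clonedClusters returns the number of distinct source IPs for every
// (tcp_fingerprint, ja4) pair seen on at least minIPs distinct IPs.
func clonedClusters(conns []conn, minIPs int) map[string]int {
	ips := make(map[string]map[string]struct{})
	for _, c := range conns {
		key := fmt.Sprintf("%016x/%s", tcpFingerprint(c.MSS, c.WindowSize, c.WindowScale), c.JA4)
		if ips[key] == nil {
			ips[key] = make(map[string]struct{})
		}
		ips[key][c.SrcIP] = struct{}{}
	}
	out := make(map[string]int)
	for key, set := range ips {
		if len(set) >= minIPs {
			out[key] = len(set)
		}
	}
	return out
}

func main() {
	conns := []conn{
		{"10.0.0.1", "t13d1516h2_aaa", 1460, 64240, 7},
		{"10.0.0.2", "t13d1516h2_aaa", 1460, 64240, 7},
		{"10.0.0.3", "t13d1516h2_bbb", 1400, 65535, 6},
	}
	fmt.Println(clonedClusters(conns, 2))
}
```

Keying on the pair rather than on the TCP hash alone keeps two unrelated clients that merely share a common OS default from being grouped together.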
### B. Jitter and Handshake Analysis
* **How it works:** We compute the variance (`varPop`) of the delay between the `SYN` and the TLS `ClientHello`.
* **What it detects:** Robotic stability.
    * **Human:** Variable latency (4G, Wi-Fi, movement). The variance is high.
    * **Datacenter bot:** Ultra-stable latency (dedicated fiber). A variance close to 0 indicates an automated connection from cloud infrastructure.
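As a rough illustration of the variance check, the Go sketch below computes a population variance the same way `varPop` is defined and flags clients whose delays barely move. The `minSamples` parameter is an assumption of this sketch, added so that one or two connections are never flagged; the 0.5 threshold is the one listed in the scoring summary of this manual.

```go
package main

import "fmt"

// varPop returns the population variance of xs (sum of squared deviations / n).
func varPop(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var mean float64
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var sum float64
	for _, x := range xs {
		d := x - mean
		sum += d * d
	}
	return sum / float64(len(xs))
}

// isRoboticHandshake flags a client whose SYN-to-ClientHello delays are
// suspiciously stable. minSamples and maxVariance are illustrative parameters.
func isRoboticHandshake(delaysMs []float64, minSamples int, maxVariance float64) bool {
	return len(delaysMs) >= minSamples && varPop(delaysMs) < maxVariance
}

func main() {
	bot := []float64{12, 12, 12, 13, 12, 12}
	human := []float64{35, 120, 48, 210, 64, 90}
	fmt.Println(isRoboticHandshake(bot, 5, 0.5), isRoboticHandshake(human, 5, 0.5))
}
```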
---
## 2. Session Analysis (L5): The "TLS Passport"
The TLS handshake is a gold mine for identifying the software library in use (OpenSSL, Go-TLS, etc.).
### A. UA vs JA4 Inconsistency
* **How it works:** The engine cross-checks the `header_user_agent` (declarative) against the `ja4` (structural).
* **What it detects:** **Browser spoofing**. A Python script can easily write `User-Agent: Mozilla/5.0...Chrome/120`, but it cannot reproduce the exact order of TLS extensions and cipher suites of a real Chrome without complex engineering (such as `utls`).
* **Scoring logic:** If UA = Chrome but JA4 != Signature_Chrome -> **+50 risk points**.
### B. Host vs SNI Mismatch
* **How it works:** Comparison between the `tls_sni` field (sent in clear text during the handshake) and the `Host` header (sent later inside the encrypted request).
* **What it detects:** **Domain fronting** or tunneling attacks. A bot can request a certificate for `domaine-innocent.com` (SNI) but try to attack `api-critique.com` (Host).
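A minimal Go sketch of the Host vs SNI comparison follows. The normalization rules (lowercasing, stripping a port and a trailing dot) are assumptions of this sketch, chosen so that `Example.com:443` and `example.com` are not reported as a mismatch; the actual view may compare the raw values.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// normalizeHost lowercases a host value and strips an optional port and trailing dot.
func normalizeHost(host string) string {
	host = strings.TrimSpace(host)
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	return strings.TrimSuffix(strings.ToLower(host), ".")
}

// hostSNIMismatch reports whether both values are present and name different hosts.
func hostSNIMismatch(hostHeader, sni string) bool {
	if hostHeader == "" || sni == "" {
		return false // nothing to compare, for example plain HTTP without TLS
	}
	return normalizeHost(hostHeader) != normalizeHost(sni)
}

func main() {
	fmt.Println(hostSNIMismatch("Example.com:443", "example.com"))
	fmt.Println(hostSNIMismatch("api-critique.com", "domaine-innocent.com"))
}
```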
---
## 3. Application Analysis (L7): The "HTTP Behavior"
Once the tunnel is established, we analyze the structure of the HTTP request.
### A. Header Order Fingerprint (`http_fp`)
* **How it works:** We hash the ordered list of header keys (`Accept`, `User-Agent`, `Connection`, etc.).
* **What it detects:** The signature of the client engine. Each browser family (Firefox, Safari, Chromium) sends its headers in a characteristic, stable order.
* **Detection:** If a client sends its headers in an unusual order or sends very few of them (fewer than 6 headers), it is marked as suspicious.
### B. Payload and Entropy Analysis
* **How it works:** Regex pattern search in `query` and `path` (SQLi, XSS and path traversal detection).
* **Complexity:** We detect multiple encodings (e.g. `%2520`) that try to fool simple firewalls.
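One way to spot multiple encodings is to decode repeatedly and count how many rounds still change the value. The Go sketch below does this with `url.PathUnescape`; it is an illustration of the idea, not the regex used by the ClickHouse views, and `maxRounds` is a hypothetical safety bound.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodingDepth counts how many rounds of percent-decoding change s, up to maxRounds.
// A depth of 2 or more means the value was encoded several times (for example %2520).
func encodingDepth(s string, maxRounds int) int {
	depth := 0
	for depth < maxRounds {
		decoded, err := url.PathUnescape(s)
		if err != nil || decoded == s {
			break // malformed escape or nothing left to decode
		}
		s = decoded
		depth++
	}
	return depth
}

func main() {
	for _, v := range []string{"plain", "hello%20world", "%2520", "%252e%252e%252f"} {
		fmt.Println(v, encodingDepth(v, 5))
	}
}
```

`PathUnescape` is used rather than `QueryUnescape` so that a literal `+` is not counted as a decoding round. Malformed sequences such as `%%32%65` make the decoder return an error and are reported with depth 0 here, so they would need a more tolerant decoder.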
---
## 4. Temporal Correlation & Baseline: The "Statistical Neighborhood"
The final score depends on the history of the TLS signature.
### A. The Novelty Penalty (`agg_novelty`)
* **Logic:** A signature (JA4 + FP) seen for the first time today is "cold".
* **Handling:** A penalty is applied if `first_seen` is less than 2 hours old. A botnet that has just launched a signature-rotation campaign is immediately penalized for its lack of history.
### B. Baseline Overrun (`tbl_baseline_ja4_7d`)
* **How it works:** The current `hits` are compared with the historical 99th percentile (`p99`) of that exact signature.
* **Example:** If the JA4 of "Chrome 122" usually makes 10 requests/min/IP on your site and one IP suddenly makes 300, the score spikes even if the request is technically perfect.
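The baseline comparison can be illustrated with a small Go sketch. It uses a nearest-rank percentile, which is an assumption of this example: ClickHouse quantile functions may interpolate or approximate, so the values stored in `tbl_baseline_ja4_7d` can differ slightly.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of xs.
func percentile(xs []float64, p float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), xs...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// exceedsBaseline reports whether the current hit count is above the historical p99.
func exceedsBaseline(history []float64, current float64) bool {
	return len(history) > 0 && current > percentile(history, 99)
}

func main() {
	history := []float64{8, 9, 10, 10, 11, 12, 10, 9} // requests/min/IP for one JA4
	fmt.Println(exceedsBaseline(history, 300), exceedsBaseline(history, 10))
}
```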
---
## 5. Scoring Summary (The Verdict)
| Algorithm | Signal | Score Impact |
| :--- | :--- | :--- |
| **Fingerprint Mismatch** | UA vs TLS (Spoofing) | **High (50)** |
| **L4 Anomaly** | Latency variance < 0.5 ms | **Medium (30)** |
| **Path Sensitivity** | Hit on `/admin` or `/config` | **High (40)** |
| **Payload Security** | Injection characters (SQL/XSS) | **Critical (60)** |
| **Mass Distribution** | 1 JA4 on > 50 different IPs | **Medium (30)** |
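To make the table concrete, here is a hedged Go sketch that turns the five signals into a score. The `signals` struct and its field names are hypothetical, and simply adding the points together is an assumption of this sketch: the manual gives per-signal impacts but does not spell out how they are combined.

```go
package main

import "fmt"

// signals mirrors the rows of the scoring table; names are illustrative only.
type signals struct {
	FingerprintMismatch bool    // UA contradicts the TLS fingerprint
	HasLatencySamples   bool    // false when no handshake timing is available
	LatencyVariance     float64 // variance of syn_to_clienthello_ms
	SensitivePathHit    bool    // hit on /admin or /config
	InjectionPayload    bool    // SQLi/XSS characters in path or query
	IPsSharingJA4       int     // distinct IPs seen with the same JA4
}

// riskScore adds the impact of every signal that fires (assumed additive model).
func riskScore(s signals) int {
	score := 0
	if s.FingerprintMismatch {
		score += 50
	}
	if s.HasLatencySamples && s.LatencyVariance < 0.5 {
		score += 30
	}
	if s.SensitivePathHit {
		score += 40
	}
	if s.InjectionPayload {
		score += 60
	}
	if s.IPsSharingJA4 > 50 {
		score += 30
	}
	return score
}

func main() {
	fmt.Println(riskScore(signals{FingerprintMismatch: true, InjectionPayload: true}))
}
```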
---
## 6. Host Identification by IP and JA4 (sql/hosts.sql)
This section details the aggregation and detection views used to identify which hosts are associated with which signatures (IP + JA4).
### A. Base Aggregates
| Table | Granularity | Description |
|-------|-------------|-------------|
| `agg_host_ip_ja4_1h` | hour | Hits, unique paths, query params, methods per (IP, JA4, host) |
| `agg_host_ip_ja4_24h` | day | Daily rollup for long-term history |
### B. Identification Views
**`view_host_identification`** - Top hosts per signature
```sql
-- Which host is associated with this IP/JA4?
SELECT src_ip, ja4, host, total_hits, unique_paths, user_agent
FROM mabase_prod.view_host_identification
WHERE src_ip = '1.2.3.4'
ORDER BY total_hits DESC;
```
**`view_host_ja4_anomalies`** - JA4 shared across several hosts (botnet)
```sql
-- Is this JA4 used against several different hosts?
SELECT ja4, hosts, unique_hosts, unique_ips
FROM mabase_prod.view_host_ja4_anomalies
WHERE unique_hosts >= 3;
-- Interpretation: 1 JA4 on 3+ hosts = probable cloned botnet
```
**`view_host_ip_ja4_rotation`** - IP rotating fingerprints
```sql
-- Does this IP change JA4 frequently?
SELECT src_ip, ja4s, unique_ja4s
FROM mabase_prod.view_host_ip_ja4_rotation
WHERE unique_ja4s >= 5;
-- Interpretation: 1 IP with 5+ different JA4 = fingerprint spoofing
```
---
## 7. Brute Force Detection (sql/hosts.sql)
### A. Brute Force on POST (sensitive endpoints)
**Table:** `agg_bruteforce_post_5m` - 5-minute windows
**View:** `view_bruteforce_post_detected`
```sql
-- Detect brute force attempts on login endpoints
SELECT window, src_ip, ja4, host, path, attempts, attempts_per_minute
FROM mabase_prod.view_bruteforce_post_detected
WHERE host = 'api.example.com'
ORDER BY attempts DESC;
-- Threshold: ≥10 POST in 5 minutes on sensitive endpoints
-- Targeted endpoints: login, auth, signin, password, admin, wp-login, etc.
```
### B. Brute Force on Forms (variable query params)
**Table:** `agg_form_bruteforce_5m`
**View:** `view_form_bruteforce_detected`
```sql
-- Detect requests with highly variable query params
SELECT window, src_ip, ja4, host, path, requests, unique_query_patterns
FROM mabase_prod.view_form_bruteforce_detected
WHERE requests >= 20 AND unique_query_patterns >= 10;
-- Interpretation: 20+ requests with 10+ different query patterns
-- = fuzzing attempt or brute force on parameters
```
---
## 8. Header Fingerprinting (sql/hosts.sql)
The `client_headers` field contains the comma-separated list of the headers present.
Example: `"Accept,Accept-Encoding,Sec-CH-UA,Sec-Fetch-Dest,User-Agent"`
### A. Signature by Header Order
**Table:** `agg_header_fingerprint_1h`
| Field | Description |
|-------|-------------|
| `header_count` | Total number of headers (commas + 1) |
| `has_*` | Flags for each modern header (Sec-CH-UA, Sec-Fetch-*, etc.) |
| `header_order_hash` | MD5(client_headers) = unique signature of the order |
| `modern_browser_score` | Score 0-100 based on the modern headers present |
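The derived fields of this table can be approximated in a few lines of Go. The count and the MD5 follow the table literally; `modernBrowserScore` is only a guess at the idea, since the real weighting is not documented here: it is assumed to be the percentage of six modern headers (a hypothetical list) found in `client_headers`.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// modernHeaders is a hypothetical list used only by this sketch.
var modernHeaders = []string{
	"sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform",
	"sec-fetch-dest", "sec-fetch-mode", "sec-fetch-site",
}

// headerCount returns the number of headers: commas + 1, or 0 for an empty value.
func headerCount(clientHeaders string) int {
	if clientHeaders == "" {
		return 0
	}
	return strings.Count(clientHeaders, ",") + 1
}

// headerOrderHash returns the hex MD5 of the raw, order-preserving header list.
func headerOrderHash(clientHeaders string) string {
	sum := md5.Sum([]byte(clientHeaders))
	return hex.EncodeToString(sum[:])
}

// modernBrowserScore returns the share (0-100) of modernHeaders that are present.
func modernBrowserScore(clientHeaders string) int {
	present := make(map[string]bool)
	for _, h := range strings.Split(clientHeaders, ",") {
		present[strings.ToLower(strings.TrimSpace(h))] = true
	}
	found := 0
	for _, h := range modernHeaders {
		if present[h] {
			found++
		}
	}
	return 100 * found / len(modernHeaders)
}

func main() {
	h := "Accept,Accept-Encoding,Sec-CH-UA,Sec-Fetch-Dest,User-Agent"
	fmt.Println(headerCount(h), headerOrderHash(h), modernBrowserScore(h))
}
```

Because the hash covers the raw string, two clients sending the same headers in a different order get different `header_order_hash` values, which is what the order-based views rely on.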
### B. Detection Views
**`view_header_missing_modern_headers`** - Missing modern headers
```sql
-- "Modern" browsers with missing headers
SELECT src_ip, ja4, header_user_agent, modern_browser_score, header_count
FROM mabase_prod.view_header_missing_modern_headers
WHERE header_user_agent ILIKE '%Chrome%';
-- Threshold: score < 70 for Chrome/Firefox = suspicious
-- A real Chrome automatically sends Sec-CH-UA, Sec-Fetch-*, etc.
```
**`view_header_ua_order_mismatch`** - Spoofing detected
```sql
-- Same User-Agent with a different header order
SELECT header_user_agent, ja4, unique_hashes, unique_ips
FROM mabase_prod.view_header_ua_order_mismatch
WHERE unique_hashes > 1;
-- Interpretation: 1 UA with 2+ header orders = spoofing or custom tool
```
**`view_header_minimalist_count`** - Minimalist bot
```sql
-- Clients sending too few headers
SELECT src_ip, ja4, header_count, header_user_agent
FROM mabase_prod.view_header_minimalist_count
WHERE header_count < 6;
-- Threshold: < 6 headers = scripted bot (curl, Python requests, etc.)
```
**`view_header_sec_ch_missing`** - Chrome inconsistency
```sql
-- Chrome without Sec-CH-UA (not expected from a real, recent Chrome)
SELECT src_ip, ja4, header_user_agent
FROM mabase_prod.view_header_sec_ch_missing
WHERE header_user_agent ILIKE '%Chrome/%';
```
**`view_header_known_bot_signature`** - Botnet signature
```sql
-- Same header order on 10+ different IPs
SELECT header_order_hash, header_user_agent, unique_ips, total_hits
FROM mabase_prod.view_header_known_bot_signature
WHERE unique_ips >= 10;
-- Interpretation: 1 signature on 10+ IPs = cluster of cloned bots
```
---
## 9. ALPN Mismatch Detection (sql/hosts.sql)
### Principle
ALPN (Application-Layer Protocol Negotiation) is a TLS extension that negotiates the HTTP protocol **before** the request.
| Declared ALPN | Actual HTTP | Interpretation |
|---------------|-------------|----------------|
| `h2` | `HTTP/2` | ✅ Normal |
| `h2` | `HTTP/1.1` | ❌ Misconfigured bot |
| `http/1.1` | `HTTP/1.1` | ✅ Normal |
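The table above maps to a simple per-client check, sketched here in Go. The `request` struct is hypothetical, and the thresholds (at least 5 requests, at least 80% mismatches) are the ones quoted for `view_alpn_mismatch_detected` in this manual. Accepting both `HTTP/2` and `HTTP/2.0` as HTTP/2 is an assumption about how the version is logged.

```go
package main

import (
	"fmt"
	"strings"
)

// request carries the two fields compared by this sketch (hypothetical struct).
type request struct {
	ALPN        string // value of tls_alpn
	HTTPVersion string // value of http_version
}

// isALPNMismatch is true when h2 was negotiated but the request is not HTTP/2.
func isALPNMismatch(r request) bool {
	return strings.EqualFold(r.ALPN, "h2") &&
		!strings.HasPrefix(strings.ToUpper(r.HTTPVersion), "HTTP/2")
}

// alpnMismatchPct returns the mismatch count and its percentage of all requests.
func alpnMismatchPct(reqs []request) (int, float64) {
	if len(reqs) == 0 {
		return 0, 0
	}
	mismatches := 0
	for _, r := range reqs {
		if isALPNMismatch(r) {
			mismatches++
		}
	}
	return mismatches, 100 * float64(mismatches) / float64(len(reqs))
}

// flagClient applies the request-count and percentage thresholds.
func flagClient(reqs []request, minRequests int, minPct float64) bool {
	_, pct := alpnMismatchPct(reqs)
	return len(reqs) >= minRequests && pct >= minPct
}

func main() {
	reqs := []request{
		{"h2", "HTTP/1.1"}, {"h2", "HTTP/1.1"}, {"h2", "HTTP/1.1"},
		{"h2", "HTTP/1.1"}, {"h2", "HTTP/2"},
	}
	fmt.Println(alpnMismatchPct(reqs))
	fmt.Println(flagClient(reqs, 5, 80))
}
```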
### Detection View
**`view_alpn_mismatch_detected`**
```sql
-- Clients declaring h2 but speaking HTTP/1.1
SELECT src_ip, ja4, declared_alpn, actual_http_version, mismatches, mismatch_pct
FROM mabase_prod.view_alpn_mismatch_detected
WHERE mismatch_pct >= 80;
-- Threshold: ≥5 requests with ≥80% mismatch
-- Cause: misconfigured curl, Python requests, bots spoofing ALPN
```
---
## 10. Rate Limiting & Burst Detection (sql/hosts.sql)
### A. Rate Limiting (1 minute)
**Table:** `agg_rate_limit_1m`
**View:** `view_rate_limit_exceeded`
```sql
-- IPs exceeding 50 requests/minute
SELECT minute, src_ip, ja4, requests_per_min, unique_paths
FROM mabase_prod.view_rate_limit_exceeded
ORDER BY requests_per_min DESC;
-- Threshold: > 50 req/min = automated traffic
-- A human cannot sustain 50+ req/min consistently
```
### B. Burst Detection (10 seconds)
**Table:** `agg_burst_10s`
**View:** `view_burst_detected`
```sql
-- Sudden traffic spikes
SELECT window, src_ip, ja4, burst_count
FROM mabase_prod.view_burst_detected
WHERE burst_count > 20;
-- Threshold: > 20 requests in 10 seconds = suspicious burst
-- Useful for detecting attacks that come in waves
```
---
## 11. Path Enumeration / Scanning (sql/hosts.sql)
### Detection View
**`view_path_scan_detected`**
```sql
-- Detect scanning of sensitive paths
SELECT window, src_ip, ja4, host, sensitive_hits, sensitive_ratio
FROM mabase_prod.view_path_scan_detected
WHERE sensitive_hits >= 5;
-- Monitored paths: admin, backup, config, .env, .git, wp-admin,
-- phpinfo, test, debug, log, sql, dump, passwd, shadow, htaccess, etc.
-- Threshold: ≥5 sensitive paths in 5 minutes = scanning
```
### Sample Result
| src_ip | ja4 | host | sensitive_hits | sensitive_ratio |
|--------|-----|------|----------------|-----------------|
| 1.2.3.4 | t13d... | api.example.com | 47 | 94.00 |
| 5.6.7.8 | t13d... | www.example.com | 12 | 80.00 |
**Interpretation:** These IPs systematically probe sensitive paths, which is typical of tools such as Nikto, Dirb or Gobuster.
---
## 12. Payload Attack Detection (sql/hosts.sql)
### A. Attack Types Detected
| Type | Patterns Detected |
|------|-------------------|
| **SQL Injection** | `UNION SELECT`, `OR 1=1`, `DROP TABLE`, `; --`, `/* */`, `WAITFOR DELAY`, `SLEEP()` |
| **XSS** | `<script>`, `javascript:`, `onerror=`, `onload=`, `<img src=data:`, `<svg onload>` |
| **Path Traversal** | `../`, `..\\`, `%2e%2e%2f`, `%252e%252e`, `%%32%65%%32%65` |
### Detection View
**`view_payload_attacks_detected`**
```sql
-- All injection attempts
SELECT window, src_ip, ja4, host, path,
sqli_attempts, xss_attempts, traversal_attempts
FROM mabase_prod.view_payload_attacks_detected
ORDER BY sqli_attempts DESC, xss_attempts DESC, traversal_attempts DESC;
-- Threshold: ≥1 attempt = alert (zero tolerance)
```
---
## 13. JA4 Botnet Detection (sql/hosts.sql)
### Principle
A JA4 identifies a TLS stack, not a single machine. A bot deployed on 100 machines will therefore show the **same JA4** on 100 IPs; the signal is strongest when that JA4 is not a mainstream browser fingerprint.
### Detection View
**`view_ja4_botnet_suspected`**
```sql
-- JA4 shared by 20+ different IPs
SELECT ja4, ja3_hash, unique_ips, unique_asns, unique_countries, total_hits
FROM mabase_prod.view_ja4_botnet_suspected
WHERE unique_ips >= 20;
-- Threshold: ≥20 IPs with the same JA4 = cloned botnet
```
### Sample Result
| ja4 | ja3_hash | unique_ips | unique_asns | unique_countries |
|-----|----------|------------|-------------|------------------|
| t13d1512... | a3b5c7... | 147 | 12 | 8 |
| t13d0918... | f1e2d3... | 52 | 3 | 2 |
**Interpretation:** 147 different IPs with the same fingerprint = cluster of cloned bots.
---
## 14. Correlation Quality (sql/hosts.sql)
### Principle
Measures the ratio of uncorrelated (orphan) events. Legitimate traffic correlates well between HTTP and TCP.
### Detection View
**`view_high_orphan_ratio`**
```sql
-- Traffic with >80% uncorrelated events
SELECT hour, src_ip, ja4, host, correlated, orphans, orphan_pct
FROM mabase_prod.view_high_orphan_ratio
ORDER BY orphan_pct DESC;
-- Threshold: orphan_pct > 80% = suspicious traffic
-- May indicate artificially generated traffic
```
---
## 15. Maintenance and False Positives
### Known Exceptions
| Source | False Positive | Solution |
|--------|----------------|----------|
| **Googlebot/Bingbot** | Aggressive but legitimate crawling | Filter by ASN + reverse DNS |
| **Internal monitoring** | High request rate | Whitelist by IP/ASN |
| **CDN/Proxy** | Shared JA4 (clients behind a proxy) | Check the ASN (Cloudflare, Akamai) |
| **Old browsers** | Missing modern headers | Check the UA version |
### Score Reset
Aggregates are purged automatically by TTL:
- `agg_*_1h`: TTL 7 days
- `agg_*_5m`: TTL 1 day
- `agg_*_1m`: TTL 1 day
An IP blocked by mistake returns to a normal score once the TTL expires.
---
## 16. Detection View Summary
| View | Detection | Threshold | Impact |
|------|-----------|-----------|--------|
| `view_bruteforce_post_detected` | POST on sensitive endpoints | ≥10 in 5 min | 🔴 High |
| `view_form_bruteforce_detected` | Variable query params | ≥20 req, ≥10 patterns | 🔴 High |
| `view_header_missing_modern_headers` | Missing modern headers | score < 70 | 🔴 High |
| `view_header_ua_order_mismatch` | UA spoofing (order) | >1 hash | 🔴 High |
| `view_header_minimalist_count` | Minimalist bot | < 6 headers | 🔴 High |
| `view_header_sec_ch_missing` | Chrome without Sec-CH | absent | 🟡 Medium |
| `view_header_known_bot_signature` | Known signature (botnet) | 10+ IPs | 🔴 High |
| `view_alpn_mismatch_detected` | h2 declared, HTTP/1.1 spoken | 80% mismatch | 🔴 High |
| `view_rate_limit_exceeded` | Rate limit exceeded | >50 req/min | 🔴 High |
| `view_burst_detected` | Sudden burst | >20 req/10s | 🟡 Medium |
| `view_path_scan_detected` | Path scanning | ≥5 sensitive | 🔴 High |
| `view_payload_attacks_detected` | SQLi/XSS injections | ≥1 attempt | 🔴 Critical |
| `view_ja4_botnet_suspected` | Shared JA4 (botnet) | ≥20 IPs | 🔴 High |
| `view_high_orphan_ratio` | Uncorrelated traffic | >80% orphans | 🟡 Medium |
| `view_host_ja4_anomalies` | JA4 on several hosts | ≥3 hosts | 🟡 Medium |
| `view_host_ip_ja4_rotation` | IP rotating JA4 | ≥5 JA4 | 🟡 Medium |
---
## 17. Sample Investigation Queries
### Top 10 most suspicious IPs (cumulative score)
```sql
WITH threats AS (
SELECT src_ip, ja4, 'bruteforce' AS type, sum(attempts) AS score
FROM mabase_prod.view_bruteforce_post_detected GROUP BY src_ip, ja4
UNION ALL
SELECT src_ip, ja4, 'path_scan', sum(sensitive_hits)
FROM mabase_prod.view_path_scan_detected GROUP BY src_ip, ja4
UNION ALL
SELECT src_ip, ja4, 'payload', sum(sqli_attempts + xss_attempts)
FROM mabase_prod.view_payload_attacks_detected GROUP BY src_ip, ja4
)
SELECT src_ip, ja4, sum(score) AS total_score, groupArray(type) AS threat_types
FROM threats
GROUP BY src_ip, ja4
ORDER BY total_score DESC
LIMIT 10;
```
### History of a suspicious IP
```sql
SELECT
hour,
host,
countMerge(hits) AS requests,
uniqMerge(uniq_paths) AS unique_paths
FROM mabase_prod.agg_host_ip_ja4_1h
WHERE src_ip = '1.2.3.4'
AND hour >= now() - INTERVAL 24 HOUR
GROUP BY hour, host
ORDER BY hour DESC;
```
### JA4 → User-Agent → Hosts correlation
```sql
SELECT
ja4,
any(first_ua) AS user_agent,
groupUniqArray(host) AS hosts,
-- countMerge is already an aggregate; wrapping it in sum() would be rejected
countMerge(hits) AS total_requests
FROM mabase_prod.agg_host_ip_ja4_1h
WHERE hour >= now() - INTERVAL 1 HOUR
GROUP BY ja4
ORDER BY total_requests DESC
LIMIT 20;
```
---
## 18. Installation and Maintenance
### Installation
```bash
# Run after init.sql
clickhouse-client --multiquery < sql/hosts.sql
```
### Verification
```sql
-- Count the records
SELECT count(*) FROM mabase_prod.agg_host_ip_ja4_1h;
SELECT count(*) FROM mabase_prod.agg_header_fingerprint_1h;
-- Test the views
SELECT * FROM mabase_prod.view_host_identification LIMIT 10;
SELECT * FROM mabase_prod.view_bruteforce_post_detected LIMIT 10;
SELECT * FROM mabase_prod.view_payload_attacks_detected LIMIT 10;
```
### Monitoring
```sql
-- Most active views (last hour)
SELECT
'bruteforce_post' AS view_name, count() AS alerts
FROM mabase_prod.view_bruteforce_post_detected
UNION ALL
SELECT 'path_scan', count() FROM mabase_prod.view_path_scan_detected
UNION ALL
SELECT 'payload_attacks', count() FROM mabase_prod.view_payload_attacks_detected
UNION ALL
SELECT 'ja4_botnet', count() FROM mabase_prod.view_ja4_botnet_suspected
ORDER BY alerts DESC;
```


@ -0,0 +1,376 @@
package unixsocket
import (
"context"
"encoding/json"
"fmt"
"math"
"net"
"os"
"path/filepath"
"strconv"
"strings"
"sync"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
"github.com/antitbone/ja4/correlator/internal/observability"
)
const (
// Maximum datagram size for JSON logs (64KB - Unix datagram limit)
MaxDatagramSize = 65535
// Rate limit: max events per second
MaxEventsPerSecond = 10000
)
// Config holds the Unix socket source configuration.
type Config struct {
Name string
Path string
SourceType string // "A" for Apache/HTTP, "B" for Network, "" for auto-detect
SocketPermissions os.FileMode
}
// UnixSocketSource reads JSON events from a Unix datagram socket.
type UnixSocketSource struct {
config Config
mu sync.Mutex
conn *net.UnixConn
done chan struct{}
wg sync.WaitGroup
stopOnce sync.Once
logger *observability.Logger
}
// NewUnixSocketSource creates a new Unix socket source.
func NewUnixSocketSource(config Config) *UnixSocketSource {
return &UnixSocketSource{
config: config,
done: make(chan struct{}),
logger: observability.NewLogger("unixsocket:" + config.Name),
}
}
// SetLogger sets the logger for the source (for debug mode).
func (s *UnixSocketSource) SetLogger(logger *observability.Logger) {
s.logger = logger.WithFields(map[string]any{"source": s.config.Name})
}
// Name returns the source name.
func (s *UnixSocketSource) Name() string {
return s.config.Name
}
// Start begins listening on the Unix datagram socket.
func (s *UnixSocketSource) Start(ctx context.Context, eventChan chan<- *domain.NormalizedEvent) error {
if strings.TrimSpace(s.config.Path) == "" {
return fmt.Errorf("socket path cannot be empty")
}
// Create parent directory if it doesn't exist
socketDir := filepath.Dir(s.config.Path)
if err := os.MkdirAll(socketDir, 0755); err != nil {
return fmt.Errorf("failed to create socket directory %s: %w", socketDir, err)
}
// Remove existing socket file if present
if info, err := os.Stat(s.config.Path); err == nil {
if info.Mode()&os.ModeSocket != 0 {
if err := os.Remove(s.config.Path); err != nil {
return fmt.Errorf("failed to remove existing socket: %w", err)
}
} else {
return fmt.Errorf("path exists but is not a socket: %s", s.config.Path)
}
}
// Create Unix datagram socket
addr, err := net.ResolveUnixAddr("unixgram", s.config.Path)
if err != nil {
return fmt.Errorf("failed to resolve unix socket address: %w", err)
}
conn, err := net.ListenUnixgram("unixgram", addr)
if err != nil {
return fmt.Errorf("failed to create unix datagram socket: %w", err)
}
s.conn = conn
// Set permissions - fail if we can't
permissions := s.config.SocketPermissions
if permissions == 0 {
permissions = 0666 // default
}
if err := os.Chmod(s.config.Path, permissions); err != nil {
_ = conn.Close()
_ = os.Remove(s.config.Path)
return fmt.Errorf("failed to set socket permissions: %w", err)
}
s.wg.Add(1)
go func() {
defer s.wg.Done()
s.readDatagrams(ctx, eventChan)
}()
return nil
}
func (s *UnixSocketSource) readDatagrams(ctx context.Context, eventChan chan<- *domain.NormalizedEvent) {
buf := make([]byte, MaxDatagramSize)
for {
select {
case <-s.done:
return
case <-ctx.Done():
return
default:
}
// Set read deadline to allow periodic context checks
_ = s.conn.SetReadDeadline(time.Now().Add(100 * time.Millisecond))
n, _, err := s.conn.ReadFromUnix(buf)
if err != nil {
if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
// Read timeout, continue to check context
continue
}
// Other errors (e.g., closed socket)
select {
case <-s.done:
return
case <-ctx.Done():
return
default:
s.logger.Warnf("read error: %v", err)
continue
}
}
if n == 0 {
continue
}
data := make([]byte, n)
copy(data, buf[:n])
event, err := parseJSONEvent(data, s.config.SourceType)
if err != nil {
// Log parse errors with the raw data for debugging
s.logger.Warnf("parse error: %v | raw: %s", err, string(data))
continue
}
// Debug: log raw events with all key details
s.logger.Debugf("event received: source=%s src_ip=%s src_port=%d timestamp=%v raw_timestamp=%v",
event.Source, event.SrcIP, event.SrcPort, event.Timestamp, event.Raw["timestamp"])
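// Also watch s.done: without it, Stop() would block in wg.Wait()
// whenever eventChan is full and ctx has not been cancelled.
select {
case eventChan <- event:
case <-s.done:
return
case <-ctx.Done():
return
}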
}
}
func resolveSource(sourceType string, headers map[string]string) domain.EventSource {
switch strings.ToLower(strings.TrimSpace(sourceType)) {
case "a", "apache", "http":
return domain.SourceA
case "b", "network", "net":
return domain.SourceB
default:
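// Backward-compatible fallback: events carrying header_* fields are
// treated as source A (HTTP), everything else as source B (network).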
if len(headers) > 0 {
return domain.SourceA
}
return domain.SourceB
}
}
func parseJSONEvent(data []byte, sourceType string) (*domain.NormalizedEvent, error) {
var raw map[string]any
if err := json.Unmarshal(data, &raw); err != nil {
return nil, fmt.Errorf("invalid JSON: %w", err)
}
event := &domain.NormalizedEvent{
Raw: raw,
Extra: make(map[string]any),
Headers: make(map[string]string),
}
// Extract headers (header_* fields) first
for k, v := range raw {
if strings.HasPrefix(k, "header_") {
if sv, ok := v.(string); ok {
event.Headers[strings.TrimPrefix(k, "header_")] = sv
}
}
}
// Resolve source first (strict timestamp logic depends on source)
event.Source = resolveSource(sourceType, event.Headers)
// Extract and validate src_ip
if v, ok := getString(raw, "src_ip"); ok {
v = strings.TrimSpace(v)
if v == "" {
return nil, fmt.Errorf("src_ip cannot be empty")
}
event.SrcIP = v
} else {
return nil, fmt.Errorf("missing required field: src_ip")
}
// Extract and validate src_port
if v, ok := getInt(raw, "src_port"); ok {
if v < 1 || v > 65535 {
return nil, fmt.Errorf("src_port must be between 1 and 65535, got %d", v)
}
event.SrcPort = v
} else {
return nil, fmt.Errorf("missing required field: src_port")
}
// Extract dst_ip (optional)
if v, ok := getString(raw, "dst_ip"); ok {
event.DstIP = strings.TrimSpace(v)
}
// Extract dst_port (optional)
if v, ok := getInt(raw, "dst_port"); ok {
if v < 0 || v > 65535 {
return nil, fmt.Errorf("dst_port must be between 0 and 65535, got %d", v)
}
event.DstPort = v
}
// Extract timestamp based on source contract
switch event.Source {
case domain.SourceA:
ts, ok := getInt64(raw, "timestamp")
if !ok {
return nil, fmt.Errorf("missing required numeric field: timestamp for source A")
}
// Assume nanoseconds
event.Timestamp = time.Unix(0, ts)
case domain.SourceB:
// For network source, try to use event timestamp if available,
// fallback to reception time. This improves correlation accuracy
// when network logs include their own timestamp (e.g., from packet capture).
if ts, ok := getInt64(raw, "timestamp"); ok {
event.Timestamp = time.Unix(0, ts)
} else if timeStr, ok := getString(raw, "time"); ok {
// Try RFC3339 format
if t, err := time.Parse(time.RFC3339, timeStr); err == nil {
event.Timestamp = t
} else if t, err := time.Parse(time.RFC3339Nano, timeStr); err == nil {
event.Timestamp = t
} else {
event.Timestamp = time.Now()
}
} else {
event.Timestamp = time.Now()
}
default:
return nil, fmt.Errorf("unsupported source type: %s", event.Source)
}
// Extra fields
knownFields := map[string]bool{
"src_ip": true, "src_port": true, "dst_ip": true, "dst_port": true,
"timestamp": true, "time": true,
}
for k, v := range raw {
if knownFields[k] {
continue
}
if strings.HasPrefix(k, "header_") {
continue
}
event.Extra[k] = v
}
return event, nil
}
func getString(m map[string]any, key string) (string, bool) {
if v, ok := m[key]; ok {
if s, ok := v.(string); ok {
return s, true
}
}
return "", false
}
func getInt(m map[string]any, key string) (int, bool) {
if v, ok := m[key]; ok {
switch val := v.(type) {
case float64:
if math.Trunc(val) != val {
return 0, false
}
return int(val), true
case int:
return val, true
case int64:
return int(val), true
case string:
if i, err := strconv.Atoi(val); err == nil {
return i, true
}
}
}
return 0, false
}
func getInt64(m map[string]any, key string) (int64, bool) {
if v, ok := m[key]; ok {
switch val := v.(type) {
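case float64:
// encoding/json decodes every JSON number into float64, so integers
// above 2^53 (nanosecond epoch timestamps included) may already have
// lost precision by the time they reach this conversion.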
if math.Trunc(val) != val {
return 0, false
}
return int64(val), true
case int:
return int64(val), true
case int64:
return val, true
case string:
if i, err := strconv.ParseInt(val, 10, 64); err == nil {
return i, true
}
}
}
return 0, false
}
// Stop gracefully stops the source.
func (s *UnixSocketSource) Stop() error {
var stopErr error
s.stopOnce.Do(func() {
s.mu.Lock()
defer s.mu.Unlock()
close(s.done)
if s.conn != nil {
_ = s.conn.Close()
}
s.wg.Wait()
// Clean up socket file
if err := os.Remove(s.config.Path); err != nil && !os.IsNotExist(err) {
stopErr = fmt.Errorf("failed to remove socket file: %w", err)
return
}
})
return stopErr
}


@ -0,0 +1,596 @@
package unixsocket
import (
"context"
"fmt"
"net"
"os"
"testing"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
)
func TestParseJSONEvent_Apache(t *testing.T) {
data := []byte(`{
"src_ip": "192.168.1.1",
"src_port": 8080,
"dst_ip": "10.0.0.1",
"dst_port": 80,
"timestamp": 1704110400000000000,
"method": "GET",
"path": "/api/test",
"header_host": "example.com",
"header_user_agent": "Mozilla/5.0"
}`)
event, err := parseJSONEvent(data, "A")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if event.SrcIP != "192.168.1.1" {
t.Errorf("expected src_ip 192.168.1.1, got %s", event.SrcIP)
}
if event.SrcPort != 8080 {
t.Errorf("expected src_port 8080, got %d", event.SrcPort)
}
if event.Headers["host"] != "example.com" {
t.Errorf("expected header host example.com, got %s", event.Headers["host"])
}
if event.Headers["user_agent"] != "Mozilla/5.0" {
t.Errorf("expected header_user_agent Mozilla/5.0, got %s", event.Headers["user_agent"])
}
if event.Source != domain.SourceA {
t.Errorf("expected source A, got %s", event.Source)
}
expectedTs := time.Unix(0, 1704110400000000000)
if !event.Timestamp.Equal(expectedTs) {
t.Errorf("expected timestamp %v, got %v", expectedTs, event.Timestamp)
}
}
func TestParseJSONEvent_Network(t *testing.T) {
data := []byte(`{
"src_ip": "192.168.1.1",
"src_port": 8080,
"dst_ip": "10.0.0.1",
"dst_port": 443,
"timestamp": 1704110400000000000,
"ja3": "abc123def456",
"ja4": "xyz789",
"tcp_meta_flags": "SYN"
}`)
event, err := parseJSONEvent(data, "B")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if event.SrcIP != "192.168.1.1" {
t.Errorf("expected src_ip 192.168.1.1, got %s", event.SrcIP)
}
if event.Extra["ja3"] != "abc123def456" {
t.Errorf("expected ja3 abc123def456, got %v", event.Extra["ja3"])
}
if event.Source != domain.SourceB {
t.Errorf("expected source B, got %s", event.Source)
}
// Source B (network) uses the payload timestamp when one is present
expectedTs := time.Unix(0, 1704110400000000000)
if !event.Timestamp.Equal(expectedTs) {
t.Errorf("expected network timestamp %v, got %v", expectedTs, event.Timestamp)
}
}
func TestParseJSONEvent_InvalidJSON(t *testing.T) {
data := []byte(`{invalid json}`)
_, err := parseJSONEvent(data, "")
if err == nil {
t.Error("expected error for invalid JSON")
}
}
func TestParseJSONEvent_MissingFields(t *testing.T) {
data := []byte(`{"other_field": "value"}`)
_, err := parseJSONEvent(data, "")
if err == nil {
t.Error("expected error for missing src_ip/src_port")
}
}
func TestParseJSONEvent_SourceARequiresNumericTimestamp(t *testing.T) {
data := []byte(`{
"src_ip": "192.168.1.1",
"src_port": 8080,
"time": "2024-01-01T12:00:00Z"
}`)
_, err := parseJSONEvent(data, "A")
if err == nil {
t.Fatal("expected error for source A without numeric timestamp")
}
}
func TestParseJSONEvent_SourceBUsesPayloadTimestamp(t *testing.T) {
expectedTs := int64(1704110400000000000)
data := []byte(`{
"src_ip": "192.168.1.1",
"src_port": 8080,
"timestamp": 1704110400000000000
}`)
event, err := parseJSONEvent(data, "B")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
expectedTime := time.Unix(0, expectedTs)
if !event.Timestamp.Equal(expectedTime) {
t.Errorf("expected source B to use payload timestamp %v, got %v", expectedTime, event.Timestamp)
}
}
func TestParseJSONEvent_SourceBUsesTimeField(t *testing.T) {
data := []byte(`{
"src_ip": "192.168.1.1",
"src_port": 8080,
"time": "2024-01-01T12:00:00Z"
}`)
event, err := parseJSONEvent(data, "B")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
expectedTime := time.Unix(0, 1704110400000000000)
if !event.Timestamp.Equal(expectedTime) {
t.Errorf("expected source B to use time field %v, got %v", expectedTime, event.Timestamp)
}
}
func TestParseJSONEvent_SourceBFallbackToNow(t *testing.T) {
data := []byte(`{
"src_ip": "192.168.1.1",
"src_port": 8080
}`)
before := time.Now()
event, err := parseJSONEvent(data, "B")
after := time.Now()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if event.Timestamp.Before(before.Add(-2*time.Second)) || event.Timestamp.After(after.Add(2*time.Second)) {
t.Errorf("expected source B timestamp near now, got %v", event.Timestamp)
}
}
func TestParseJSONEvent_ExplicitSourceType(t *testing.T) {
tests := []struct {
name string
data string
sourceType string
expected domain.EventSource
}{
{
name: "explicit A",
data: `{"src_ip": "192.168.1.1", "src_port": 8080, "timestamp": 1704110400000000000}`,
sourceType: "A",
expected: domain.SourceA,
},
{
name: "explicit B",
data: `{"src_ip": "192.168.1.1", "src_port": 8080}`,
sourceType: "B",
expected: domain.SourceB,
},
{
name: "explicit apache",
data: `{"src_ip": "192.168.1.1", "src_port": 8080, "timestamp": 1704110400000000000}`,
sourceType: "apache",
expected: domain.SourceA,
},
{
name: "explicit network",
data: `{"src_ip": "192.168.1.1", "src_port": 8080}`,
sourceType: "network",
expected: domain.SourceB,
},
{
name: "auto-detect A with headers",
data: `{"src_ip": "192.168.1.1", "src_port": 8080, "timestamp": 1704110400000000000, "header_host": "example.com"}`,
sourceType: "",
expected: domain.SourceA,
},
{
name: "auto-detect B without headers",
data: `{"src_ip": "192.168.1.1", "src_port": 8080, "ja3": "abc"}`,
sourceType: "",
expected: domain.SourceB,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
event, err := parseJSONEvent([]byte(tt.data), tt.sourceType)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if event.Source != tt.expected {
t.Errorf("expected source %s, got %s", tt.expected, event.Source)
}
})
}
}
func TestUnixSocketSource_Name(t *testing.T) {
source := NewUnixSocketSource(Config{
Name: "test_source",
Path: "/tmp/test.sock",
})
if source.Name() != "test_source" {
t.Errorf("expected name 'test_source', got %s", source.Name())
}
}
func TestUnixSocketSource_StopWithoutStart(t *testing.T) {
source := NewUnixSocketSource(Config{
Name: "test_source",
Path: "/tmp/test.sock",
})
// Should not panic
err := source.Stop()
if err != nil {
t.Errorf("expected no error on stop without start, got %v", err)
}
}
func TestUnixSocketSource_EmptyPath(t *testing.T) {
source := NewUnixSocketSource(Config{
Name: "test_source",
Path: "",
})
ctx := context.Background()
eventChan := make(chan *domain.NormalizedEvent, 10)
err := source.Start(ctx, eventChan)
if err == nil {
t.Error("expected error for empty path")
}
}
func TestGetString(t *testing.T) {
m := map[string]any{
"string": "hello",
"int": 42,
"nil": nil,
}
v, ok := getString(m, "string")
if !ok || v != "hello" {
t.Errorf("expected 'hello', got %v, %v", v, ok)
}
_, ok = getString(m, "int")
if ok {
t.Error("expected false for int")
}
_, ok = getString(m, "missing")
if ok {
t.Error("expected false for missing key")
}
}
func TestGetInt(t *testing.T) {
m := map[string]any{
"float": 42.5,
"int": 42,
"int64": int64(42),
"string": "42",
"bad": "not a number",
"nil": nil,
}
tests := []struct {
key string
expected int
ok bool
}{
{"float", 0, false},
{"int", 42, true},
{"int64", 42, true},
{"string", 42, true},
{"bad", 0, false},
{"nil", 0, false},
{"missing", 0, false},
}
for _, tt := range tests {
t.Run(tt.key, func(t *testing.T) {
v, ok := getInt(m, tt.key)
if ok != tt.ok {
t.Errorf("getInt(%q) ok = %v, want %v", tt.key, ok, tt.ok)
}
if v != tt.expected {
t.Errorf("getInt(%q) = %v, want %v", tt.key, v, tt.expected)
}
})
}
}
func TestGetInt64(t *testing.T) {
m := map[string]any{
"float": 42.5,
"int": 42,
"int64": int64(42),
"string": "42",
"bad": "not a number",
"nil": nil,
}
tests := []struct {
key string
expected int64
ok bool
}{
{"float", 0, false},
{"int", 42, true},
{"int64", 42, true},
{"string", 42, true},
{"bad", 0, false},
{"nil", 0, false},
{"missing", 0, false},
}
for _, tt := range tests {
t.Run(tt.key, func(t *testing.T) {
v, ok := getInt64(m, tt.key)
if ok != tt.ok {
t.Errorf("getInt64(%q) ok = %v, want %v", tt.key, ok, tt.ok)
}
if v != tt.expected {
t.Errorf("getInt64(%q) = %v, want %v", tt.key, v, tt.expected)
}
})
}
}
func TestParseJSONEvent_PortValidation(t *testing.T) {
tests := []struct {
name string
data string
sourceType string
wantErr bool
}{
{
name: "valid src_port",
data: `{"src_ip": "192.168.1.1", "src_port": 8080}`,
sourceType: "B",
wantErr: false,
},
{
name: "src_port zero",
data: `{"src_ip": "192.168.1.1", "src_port": 0}`,
sourceType: "B",
wantErr: true,
},
{
name: "src_port negative",
data: `{"src_ip": "192.168.1.1", "src_port": -1}`,
sourceType: "B",
wantErr: true,
},
{
name: "src_port too high",
data: `{"src_ip": "192.168.1.1", "src_port": 70000}`,
sourceType: "B",
wantErr: true,
},
{
name: "valid dst_port zero",
data: `{"src_ip": "192.168.1.1", "src_port": 8080, "dst_port": 0}`,
sourceType: "B",
wantErr: false,
},
{
name: "dst_port too high",
data: `{"src_ip": "192.168.1.1", "src_port": 8080, "dst_port": 70000}`,
sourceType: "B",
wantErr: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
_, err := parseJSONEvent([]byte(tt.data), tt.sourceType)
if (err != nil) != tt.wantErr {
t.Errorf("parseJSONEvent() error = %v, wantErr %v", err, tt.wantErr)
}
})
}
}
func TestParseJSONEvent_TimestampFallback(t *testing.T) {
data := []byte(`{"src_ip": "192.168.1.1", "src_port": 8080}`)
event, err := parseJSONEvent(data, "B")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
// With no timestamp or time field, source B falls back to the reception time
if event.Timestamp.IsZero() {
t.Error("expected non-zero timestamp")
}
}
func TestUnixSocketSource_StartStopDatagram(t *testing.T) {
tmpPath := "/tmp/test_logcorrelator_datagram.sock"
// Clean up any existing socket
os.Remove(tmpPath)
source := NewUnixSocketSource(Config{
Name: "test_datagram",
Path: tmpPath,
SourceType: "B",
SocketPermissions: 0666,
})
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
eventChan := make(chan *domain.NormalizedEvent, 10)
err := source.Start(ctx, eventChan)
if err != nil {
t.Fatalf("failed to start source: %v", err)
}
// Give socket time to start
time.Sleep(100 * time.Millisecond)
// Verify socket file exists
if _, err := os.Stat(tmpPath); os.IsNotExist(err) {
t.Error("socket file should exist")
}
// Stop the source
err = source.Stop()
if err != nil {
t.Errorf("failed to stop source: %v", err)
}
// Socket file should be cleaned up
time.Sleep(100 * time.Millisecond)
if _, err := os.Stat(tmpPath); !os.IsNotExist(err) {
t.Error("socket file should be removed after stop")
}
}
func TestUnixSocketSource_SendDatagram(t *testing.T) {
tmpPath := "/tmp/test_logcorrelator_send.sock"
os.Remove(tmpPath)
source := NewUnixSocketSource(Config{
Name: "test_send",
Path: tmpPath,
SourceType: "B",
SocketPermissions: 0666,
})
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
eventChan := make(chan *domain.NormalizedEvent, 10)
err := source.Start(ctx, eventChan)
if err != nil {
t.Fatalf("failed to start source: %v", err)
}
// Give socket time to start
time.Sleep(100 * time.Millisecond)
// Connect and send a datagram
conn, err := net.Dial("unixgram", tmpPath)
if err != nil {
t.Fatalf("failed to dial socket: %v", err)
}
defer conn.Close()
data := []byte(`{"src_ip": "192.168.1.1", "src_port": 8080, "ja3": "test"}`)
_, err = conn.Write(data)
if err != nil {
t.Fatalf("failed to write: %v", err)
}
// Wait for event
select {
case event := <-eventChan:
if event.SrcIP != "192.168.1.1" {
t.Errorf("expected src_ip 192.168.1.1, got %s", event.SrcIP)
}
if event.SrcPort != 8080 {
t.Errorf("expected src_port 8080, got %d", event.SrcPort)
}
case <-time.After(2 * time.Second):
t.Error("timeout waiting for event")
case <-ctx.Done():
t.Error("context cancelled")
}
err = source.Stop()
if err != nil {
t.Errorf("failed to stop source: %v", err)
}
}
func TestUnixSocketSource_MultipleDatagrams(t *testing.T) {
tmpPath := "/tmp/test_logcorrelator_multi.sock"
os.Remove(tmpPath)
source := NewUnixSocketSource(Config{
Name: "test_multi",
Path: tmpPath,
SourceType: "B",
SocketPermissions: 0666,
})
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
eventChan := make(chan *domain.NormalizedEvent, 100)
err := source.Start(ctx, eventChan)
if err != nil {
t.Fatalf("failed to start source: %v", err)
}
// Give socket time to start
time.Sleep(100 * time.Millisecond)
// Connect and send multiple datagrams
conn, err := net.Dial("unixgram", tmpPath)
if err != nil {
t.Fatalf("failed to dial socket: %v", err)
}
defer conn.Close()
for i := 0; i < 5; i++ {
data := []byte(fmt.Sprintf(`{"src_ip": "192.168.1.%d", "src_port": %d, "ja3": "test%d"}`, i+1, 8080+i, i))
_, err = conn.Write(data)
if err != nil {
t.Fatalf("failed to write datagram %d: %v", i, err)
}
}
// Wait for all events
received := 0
timeout := time.After(3 * time.Second)
for received < 5 {
select {
case event := <-eventChan:
received++
t.Logf("received event %d: src_ip=%s", received, event.SrcIP)
case <-timeout:
t.Errorf("timeout waiting for events, received %d/5", received)
goto done
case <-ctx.Done():
t.Error("context cancelled")
goto done
}
}
done:
err = source.Stop()
if err != nil {
t.Errorf("failed to stop source: %v", err)
}
}
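
The datagram tests above dial a unixgram socket and write one JSON object per datagram. As a minimal sketch of that mechanism using only the standard net package (independent of UnixSocketSource, whose internals are not shown here), the helper below binds a socket in a fresh temporary directory, sends one payload, and reads it back. The name roundTrip and the socket file name are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// roundTrip is an illustrative helper (not from the repository). It binds a
// unixgram socket in a fresh temporary directory, sends payload to it as a
// single datagram, and returns what the listener received.
func roundTrip(payload []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "sock")
	if err != nil {
		return nil, fmt.Errorf("temp dir: %w", err)
	}
	defer os.RemoveAll(dir) // also removes the socket file

	path := filepath.Join(dir, "demo.sock")
	listener, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: path, Net: "unixgram"})
	if err != nil {
		return nil, fmt.Errorf("listen: %w", err)
	}
	defer listener.Close()

	// The socket is bound once ListenUnixgram returns, so the client can dial
	// right away; datagrams queue in the kernel until they are read.
	client, err := net.Dial("unixgram", path)
	if err != nil {
		return nil, fmt.Errorf("dial: %w", err)
	}
	defer client.Close()

	if _, err := client.Write(payload); err != nil {
		return nil, fmt.Errorf("write: %w", err)
	}

	// Bound the read so a lost datagram cannot hang the caller.
	if err := listener.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return nil, fmt.Errorf("deadline: %w", err)
	}
	buf := make([]byte, 64*1024)
	n, _, err := listener.ReadFromUnix(buf)
	if err != nil {
		return nil, fmt.Errorf("read: %w", err)
	}
	return buf[:n], nil
}
```

A per-run temporary directory avoids collisions between parallel test runs that a fixed path under /tmp can cause. Socket paths are length-limited (roughly 100 bytes on common platforms), so the directory and file names are kept short.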

@ -0,0 +1,391 @@
package clickhouse
import (
"context"
"encoding/json"
"errors"
"fmt"
"net"
"strings"
"sync"
"time"
"github.com/ClickHouse/clickhouse-go/v2"
"github.com/antitbone/ja4/correlator/internal/domain"
"github.com/antitbone/ja4/correlator/internal/observability"
)
const (
// DefaultBatchSize is the default number of records per batch
DefaultBatchSize = 500
// DefaultFlushIntervalMs is the default flush interval in milliseconds
DefaultFlushIntervalMs = 200
// DefaultMaxBufferSize is the default maximum buffer size
DefaultMaxBufferSize = 5000
// DefaultTimeoutMs is the default timeout for operations in milliseconds
DefaultTimeoutMs = 1000
// DefaultPingTimeoutMs is the timeout for initial connection ping
DefaultPingTimeoutMs = 5000
// MaxRetries is the maximum number of insert attempts per batch, including the first
MaxRetries = 3
// RetryBaseDelay is the base delay between retries
RetryBaseDelay = 100 * time.Millisecond
)
// Config holds the ClickHouse sink configuration.
type Config struct {
DSN string
Table string
BatchSize int
FlushIntervalMs int
MaxBufferSize int
DropOnOverflow bool
AsyncInsert bool
TimeoutMs int
}
// ClickHouseSink writes correlated logs to ClickHouse.
type ClickHouseSink struct {
config Config
conn clickhouse.Conn
mu sync.Mutex
buffer []domain.CorrelatedLog
flushChan chan struct{}
done chan struct{}
wg sync.WaitGroup
closeOnce sync.Once
logger *observability.Logger
}
// SetLogger sets the logger used by the sink.
func (s *ClickHouseSink) SetLogger(logger *observability.Logger) {
s.logger = logger.WithFields(map[string]any{"sink": "clickhouse"})
}
// NewClickHouseSink creates a new ClickHouse sink.
func NewClickHouseSink(config Config) (*ClickHouseSink, error) {
if strings.TrimSpace(config.DSN) == "" {
return nil, fmt.Errorf("clickhouse DSN is required")
}
if strings.TrimSpace(config.Table) == "" {
return nil, fmt.Errorf("clickhouse table is required")
}
// Apply defaults
if config.BatchSize <= 0 {
config.BatchSize = DefaultBatchSize
}
if config.FlushIntervalMs <= 0 {
config.FlushIntervalMs = DefaultFlushIntervalMs
}
if config.MaxBufferSize <= 0 {
config.MaxBufferSize = DefaultMaxBufferSize
}
if config.TimeoutMs <= 0 {
config.TimeoutMs = DefaultTimeoutMs
}
s := &ClickHouseSink{
config: config,
buffer: make([]domain.CorrelatedLog, 0, config.BatchSize),
flushChan: make(chan struct{}, 1),
done: make(chan struct{}),
logger: observability.NewLogger("clickhouse"),
}
// Parse DSN and create options
options, err := clickhouse.ParseDSN(config.DSN)
if err != nil {
return nil, fmt.Errorf("failed to parse ClickHouse DSN: %w", err)
}
// Connect to ClickHouse using native API
conn, err := clickhouse.Open(options)
if err != nil {
return nil, fmt.Errorf("failed to connect to ClickHouse: %w", err)
}
// Ping with timeout to verify connection
pingCtx, pingCancel := context.WithTimeout(context.Background(), time.Duration(DefaultPingTimeoutMs)*time.Millisecond)
defer pingCancel()
if err := conn.Ping(pingCtx); err != nil {
_ = conn.Close()
return nil, fmt.Errorf("failed to ping ClickHouse: %w", err)
}
s.conn = conn
s.log().Infof("connected to ClickHouse: table=%s batch_size=%d flush_interval_ms=%d",
config.Table, config.BatchSize, config.FlushIntervalMs)
// Start flush goroutine
s.wg.Add(1)
go s.flushLoop()
return s, nil
}
// Name returns the sink name.
func (s *ClickHouseSink) Name() string {
return "clickhouse"
}
// log returns the logger, initializing a default one if not set (e.g. in tests).
func (s *ClickHouseSink) log() *observability.Logger {
if s.logger == nil {
s.logger = observability.NewLogger("clickhouse")
}
return s.logger
}
// Reopen is a no-op for ClickHouse (connection is managed internally).
func (s *ClickHouseSink) Reopen() error {
return nil
}
// Write adds a log to the buffer.
func (s *ClickHouseSink) Write(ctx context.Context, log domain.CorrelatedLog) error {
deadline := time.Now().Add(time.Duration(s.config.TimeoutMs) * time.Millisecond)
for {
s.mu.Lock()
if len(s.buffer) < s.config.MaxBufferSize {
s.buffer = append(s.buffer, log)
if len(s.buffer) >= s.config.BatchSize {
select {
case s.flushChan <- struct{}{}:
default:
}
}
s.mu.Unlock()
return nil
}
drop := s.config.DropOnOverflow
s.mu.Unlock()
if drop {
s.log().Warnf("buffer full, dropping log: table=%s buffer_size=%d", s.config.Table, s.config.MaxBufferSize)
return nil
}
if time.Now().After(deadline) {
return fmt.Errorf("buffer full, timeout exceeded")
}
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(10 * time.Millisecond):
}
}
}
// Flush flushes the buffer to ClickHouse.
func (s *ClickHouseSink) Flush(ctx context.Context) error {
return s.doFlush(ctx)
}
// Close closes the sink.
func (s *ClickHouseSink) Close() error {
var closeErr error
s.closeOnce.Do(func() {
if s.done != nil {
close(s.done)
}
s.wg.Wait()
flushCtx, cancel := context.WithTimeout(context.Background(), time.Duration(s.config.TimeoutMs)*time.Millisecond)
defer cancel()
if err := s.doFlush(flushCtx); err != nil {
closeErr = err
}
if s.conn != nil {
if err := s.conn.Close(); err != nil && closeErr == nil {
closeErr = err
}
}
})
return closeErr
}
func (s *ClickHouseSink) flushLoop() {
defer s.wg.Done()
ticker := time.NewTicker(time.Duration(s.config.FlushIntervalMs) * time.Millisecond)
defer ticker.Stop()
for {
select {
case <-s.done:
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(s.config.TimeoutMs)*time.Millisecond)
if err := s.doFlush(ctx); err != nil {
s.log().Error("final flush on close failed", err)
}
cancel()
return
case <-ticker.C:
s.mu.Lock()
needsFlush := len(s.buffer) > 0
s.mu.Unlock()
if needsFlush {
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(s.config.TimeoutMs)*time.Millisecond)
if err := s.doFlush(ctx); err != nil {
s.log().Error("periodic flush failed", err)
}
cancel()
}
case <-s.flushChan:
s.mu.Lock()
needsFlush := len(s.buffer) >= s.config.BatchSize
s.mu.Unlock()
if needsFlush {
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(s.config.TimeoutMs)*time.Millisecond)
if err := s.doFlush(ctx); err != nil {
s.log().Error("batch flush failed", err)
}
cancel()
}
}
}
}
func (s *ClickHouseSink) doFlush(ctx context.Context) error {
s.mu.Lock()
if len(s.buffer) == 0 {
s.mu.Unlock()
return nil
}
// Check the connection before draining the buffer so that buffered logs
// are not discarded when there is nowhere to send them.
if s.conn == nil {
s.mu.Unlock()
return fmt.Errorf("clickhouse connection is not initialized")
}
// Copy buffer to flush
buffer := make([]domain.CorrelatedLog, len(s.buffer))
copy(buffer, s.buffer)
s.buffer = make([]domain.CorrelatedLog, 0, s.config.BatchSize)
s.mu.Unlock()
batchSize := len(buffer)
// Retry with exponential backoff: the wait starts at RetryBaseDelay and doubles after each failed attempt
var lastErr error
for attempt := 0; attempt < MaxRetries; attempt++ {
if attempt > 0 {
delay := RetryBaseDelay * time.Duration(1<<uint(attempt-1))
s.log().Warnf("retrying batch insert: attempt=%d/%d delay=%s rows=%d err=%v",
attempt+1, MaxRetries, delay, batchSize, lastErr)
select {
case <-time.After(delay):
case <-ctx.Done():
return ctx.Err()
}
}
lastErr = s.executeBatch(ctx, buffer)
if lastErr == nil {
s.log().Debugf("batch sent: rows=%d table=%s", batchSize, s.config.Table)
return nil
}
if !isRetryableError(lastErr) {
return fmt.Errorf("non-retryable error: %w", lastErr)
}
}
return fmt.Errorf("failed after %d retries (batch size: %d): %w", MaxRetries, batchSize, lastErr)
}
func (s *ClickHouseSink) executeBatch(ctx context.Context, buffer []domain.CorrelatedLog) error {
if s.conn == nil {
return fmt.Errorf("clickhouse connection is not initialized")
}
// Table schema: http_logs_raw (raw_json String)
// Single column insert - the entire log is serialized as JSON string
query := fmt.Sprintf(`INSERT INTO %s (raw_json)`, s.config.Table)
// Prepare batch using native clickhouse-go/v2 API
batch, err := s.conn.PrepareBatch(ctx, query)
if err != nil {
return fmt.Errorf("failed to prepare batch: %w", err)
}
for i, log := range buffer {
// Marshal the entire CorrelatedLog to JSON
logJSON, marshalErr := json.Marshal(log)
if marshalErr != nil {
// Abort releases the connection held by the unsent batch.
_ = batch.Abort()
return fmt.Errorf("failed to marshal log %d to JSON: %w", i, marshalErr)
}
// Append the JSON string as the raw_json column value
appendErr := batch.Append(string(logJSON))
if appendErr != nil {
_ = batch.Abort()
return fmt.Errorf("failed to append log %d to batch: %w", i, appendErr)
}
}
// Send the batch; no rows reach ClickHouse until Send is called
sendErr := batch.Send()
if sendErr != nil {
return fmt.Errorf("failed to send batch (%d rows): %w", len(buffer), sendErr)
}
return nil
}
// isRetryableError checks if an error is retryable.
func isRetryableError(err error) bool {
if err == nil {
return false
}
if errors.Is(err, context.DeadlineExceeded) {
return true
}
if errors.Is(err, context.Canceled) {
return false
}
var netErr net.Error
if errors.As(err, &netErr) {
if netErr.Timeout() {
return true
}
}
errStr := strings.ToLower(err.Error())
// Explicit non-retryable SQL/schema errors
if strings.Contains(errStr, "syntax error") ||
strings.Contains(errStr, "unknown table") ||
strings.Contains(errStr, "unknown column") ||
(strings.Contains(errStr, "table") && strings.Contains(errStr, "not found")) {
return false
}
// Fallback network/transient errors
retryableErrors := []string{
"connection refused",
"connection reset",
"timeout",
"temporary failure",
"network is unreachable",
"broken pipe",
"no route to host",
}
for _, re := range retryableErrors {
if strings.Contains(errStr, re) {
return true
}
}
return false
}
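
doFlush above waits RetryBaseDelay * 2^(attempt-1) before every attempt after the first. As a rough sketch of that schedule (the helper name backoffDelays is an assumption, and the function is not part of the sink), the delays can be listed like this:

```go
package main

import "time"

// backoffDelays is an illustrative helper (not part of the sink). It lists the
// wait before each attempt when the delay doubles after every failure:
// no wait before the first attempt, then base, 2*base, 4*base, and so on.
func backoffDelays(base time.Duration, attempts int) []time.Duration {
	if attempts <= 0 {
		return nil
	}
	delays := make([]time.Duration, 0, attempts)
	for attempt := 0; attempt < attempts; attempt++ {
		if attempt == 0 {
			delays = append(delays, 0) // the first attempt runs immediately
			continue
		}
		delays = append(delays, base*time.Duration(1<<uint(attempt-1)))
	}
	return delays
}
```

With the constants in this file (MaxRetries = 3, RetryBaseDelay = 100 ms) the waits are 0, 100 ms and 200 ms, so a batch spends at most 300 ms sleeping between attempts. That time comes out of the same context deadline as the inserts themselves, which is worth keeping in mind when TimeoutMs is set low.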

@ -0,0 +1,538 @@
package clickhouse
import (
"context"
"testing"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
"github.com/antitbone/ja4/correlator/internal/observability"
)
func TestClickHouseSink_Name(t *testing.T) {
sink := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
},
}
if sink.Name() != "clickhouse" {
t.Errorf("expected name 'clickhouse', got %s", sink.Name())
}
}
func TestClickHouseSink_ConfigDefaults(t *testing.T) {
// Test that defaults are applied correctly
config := Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
// Other fields are zero, should get defaults
}
// Mirror the defaulting done in NewClickHouseSink (the constructor pings the server, so it cannot be called here)
if config.BatchSize <= 0 {
config.BatchSize = DefaultBatchSize
}
if config.FlushIntervalMs <= 0 {
config.FlushIntervalMs = DefaultFlushIntervalMs
}
if config.MaxBufferSize <= 0 {
config.MaxBufferSize = DefaultMaxBufferSize
}
if config.TimeoutMs <= 0 {
config.TimeoutMs = DefaultTimeoutMs
}
if config.BatchSize != DefaultBatchSize {
t.Errorf("expected BatchSize %d, got %d", DefaultBatchSize, config.BatchSize)
}
if config.FlushIntervalMs != DefaultFlushIntervalMs {
t.Errorf("expected FlushIntervalMs %d, got %d", DefaultFlushIntervalMs, config.FlushIntervalMs)
}
if config.MaxBufferSize != DefaultMaxBufferSize {
t.Errorf("expected MaxBufferSize %d, got %d", DefaultMaxBufferSize, config.MaxBufferSize)
}
if config.TimeoutMs != DefaultTimeoutMs {
t.Errorf("expected TimeoutMs %d, got %d", DefaultTimeoutMs, config.TimeoutMs)
}
}
func TestClickHouseSink_Write_BufferOverflow(t *testing.T) {
// This test verifies the buffer overflow logic without actually connecting
config := Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
BatchSize: 10,
MaxBufferSize: 10,
DropOnOverflow: true,
TimeoutMs: 100,
FlushIntervalMs: 1000,
}
// We can't test actual writes without a ClickHouse instance,
// but we can verify the config is valid
if config.BatchSize > config.MaxBufferSize {
t.Error("BatchSize should not exceed MaxBufferSize")
}
}
func TestClickHouseSink_IsRetryableError(t *testing.T) {
tests := []struct {
name string
err error
expected bool
}{
{"nil error", nil, false},
{"connection refused", &mockError{"connection refused"}, true},
{"connection reset", &mockError{"connection reset by peer"}, true},
{"timeout", &mockError{"timeout waiting for response"}, true},
{"network unreachable", &mockError{"network is unreachable"}, true},
{"broken pipe", &mockError{"broken pipe"}, true},
{"syntax error", &mockError{"syntax error in SQL"}, false},
{"table not found", &mockError{"table test not found"}, false},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result := isRetryableError(tt.err)
if result != tt.expected {
t.Errorf("expected %v, got %v", tt.expected, result)
}
})
}
}
func TestClickHouseSink_FlushEmpty(t *testing.T) {
// Test that flushing an empty buffer doesn't cause issues
// (We can't test actual ClickHouse operations without a real instance)
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
},
buffer: make([]domain.CorrelatedLog, 0),
}
// Should not panic or error on empty flush
ctx := context.Background()
err := s.Flush(ctx)
if err != nil {
t.Errorf("expected no error on empty flush, got %v", err)
}
}
func TestClickHouseSink_CloseWithoutConnect(t *testing.T) {
// Test that closing without connecting doesn't panic
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
},
buffer: make([]domain.CorrelatedLog, 0),
done: make(chan struct{}),
}
err := s.Close()
if err != nil {
t.Errorf("expected no error on close without connect, got %v", err)
}
}
func TestClickHouseSink_Constants(t *testing.T) {
// Verify constants have reasonable values
if DefaultBatchSize <= 0 {
t.Error("DefaultBatchSize should be positive")
}
if DefaultFlushIntervalMs <= 0 {
t.Error("DefaultFlushIntervalMs should be positive")
}
if DefaultMaxBufferSize <= 0 {
t.Error("DefaultMaxBufferSize should be positive")
}
if DefaultTimeoutMs <= 0 {
t.Error("DefaultTimeoutMs should be positive")
}
if DefaultPingTimeoutMs <= 0 {
t.Error("DefaultPingTimeoutMs should be positive")
}
if MaxRetries <= 0 {
t.Error("MaxRetries should be positive")
}
if RetryBaseDelay <= 0 {
t.Error("RetryBaseDelay should be positive")
}
}
// mockError implements error for testing
type mockError struct {
msg string
}
func (e *mockError) Error() string {
return e.msg
}
// Test the doFlush function with empty buffer (no actual DB connection)
func TestClickHouseSink_DoFlushEmpty(t *testing.T) {
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
},
buffer: make([]domain.CorrelatedLog, 0),
}
ctx := context.Background()
err := s.doFlush(ctx)
if err != nil {
t.Errorf("expected no error when flushing empty buffer, got %v", err)
}
}
// Test that buffer is properly managed (without actual DB operations)
func TestClickHouseSink_BufferManagement(t *testing.T) {
log := domain.CorrelatedLog{
SrcIP: "192.168.1.1",
SrcPort: 8080,
Correlated: true,
}
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
MaxBufferSize: 100, // Allow more than 1 element
DropOnOverflow: false,
TimeoutMs: 1000,
},
buffer: []domain.CorrelatedLog{log},
}
// Verify buffer has data
if len(s.buffer) != 1 {
t.Fatalf("expected buffer length 1, got %d", len(s.buffer))
}
// Test that Write properly adds to buffer
ctx := context.Background()
err := s.Write(ctx, log)
if err != nil {
t.Errorf("unexpected error on Write: %v", err)
}
if len(s.buffer) != 2 {
t.Errorf("expected buffer length 2 after Write, got %d", len(s.buffer))
}
}
// Test Write with context cancellation
func TestClickHouseSink_Write_ContextCancel(t *testing.T) {
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
MaxBufferSize: 1,
DropOnOverflow: false,
TimeoutMs: 10,
},
buffer: make([]domain.CorrelatedLog, 0, 1),
}
// Fill the buffer
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080}
s.buffer = append(s.buffer, log)
// Try to write with cancelled context
ctx, cancel := context.WithCancel(context.Background())
cancel() // Cancel immediately
err := s.Write(ctx, log)
if err == nil {
t.Error("expected error when writing with cancelled context")
}
}
// Test DropOnOverflow behavior
func TestClickHouseSink_Write_DropOnOverflow(t *testing.T) {
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
MaxBufferSize: 1,
DropOnOverflow: true,
TimeoutMs: 10,
},
buffer: make([]domain.CorrelatedLog, 0, 1),
}
// Fill the buffer
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080}
s.buffer = append(s.buffer, log)
// Try to write when buffer is full - should drop the log and return nil (a warning is logged)
ctx := context.Background()
err := s.Write(ctx, log)
if err != nil {
t.Errorf("expected no error when DropOnOverflow is true, got %v", err)
}
}
// TestIsRetryableError_ContextDeadlineExceeded tests context.DeadlineExceeded is retryable.
func TestIsRetryableError_ContextDeadlineExceeded(t *testing.T) {
if !isRetryableError(context.DeadlineExceeded) {
t.Error("context.DeadlineExceeded should be retryable")
}
}
// TestIsRetryableError_ContextCanceled tests context.Canceled is NOT retryable.
func TestIsRetryableError_ContextCanceled(t *testing.T) {
if isRetryableError(context.Canceled) {
t.Error("context.Canceled should not be retryable")
}
}
// TestIsRetryableError_NetTimeout tests net.Error with Timeout() = true is retryable.
func TestIsRetryableError_NetTimeout(t *testing.T) {
err := &mockNetError{timeout: true, temporary: false}
if !isRetryableError(err) {
t.Error("net.Error with Timeout()=true should be retryable")
}
}
// TestIsRetryableError_NetNoTimeout tests net.Error with Timeout() = false is NOT retryable.
func TestIsRetryableError_NetNoTimeout(t *testing.T) {
err := &mockNetError{timeout: false, temporary: false}
if isRetryableError(err) {
t.Error("net.Error with Timeout()=false should not be retryable (unless msg matches)")
}
}
// TestIsRetryableError_UnknownTable tests "unknown table" is NOT retryable.
func TestIsRetryableError_UnknownTable(t *testing.T) {
if isRetryableError(&mockError{"unknown table users"}) {
t.Error("unknown table error should not be retryable")
}
}
// TestIsRetryableError_UnknownColumn tests "unknown column" is NOT retryable.
func TestIsRetryableError_UnknownColumn(t *testing.T) {
if isRetryableError(&mockError{"unknown column foo"}) {
t.Error("unknown column error should not be retryable")
}
}
// TestIsRetryableError_RandomError tests a random error is NOT retryable.
func TestIsRetryableError_RandomError(t *testing.T) {
if isRetryableError(&mockError{"some random unrecognized error"}) {
t.Error("random error should not be retryable")
}
}
// TestIsRetryableError_NoRouteToHost tests "no route to host" is retryable.
func TestIsRetryableError_NoRouteToHost(t *testing.T) {
if !isRetryableError(&mockError{"no route to host"}) {
t.Error("'no route to host' should be retryable")
}
}
// TestIsRetryableError_TemporaryFailure tests "temporary failure" is retryable.
func TestIsRetryableError_TemporaryFailure(t *testing.T) {
if !isRetryableError(&mockError{"temporary failure in name resolution"}) {
t.Error("'temporary failure' should be retryable")
}
}
// mockNetError implements net.Error for testing.
type mockNetError struct {
timeout bool
temporary bool
msg string
}
func (e *mockNetError) Error() string { return e.msg }
func (e *mockNetError) Timeout() bool { return e.timeout }
func (e *mockNetError) Temporary() bool { return e.temporary }
// TestNewClickHouseSink_EmptyDSN tests that empty DSN returns error.
func TestNewClickHouseSink_EmptyDSN(t *testing.T) {
_, err := NewClickHouseSink(Config{
DSN: "",
Table: "test_table",
})
if err == nil {
t.Error("expected error for empty DSN")
}
}
// TestNewClickHouseSink_WhitespaceDSN tests that whitespace DSN returns error.
func TestNewClickHouseSink_WhitespaceDSN(t *testing.T) {
_, err := NewClickHouseSink(Config{
DSN: " ",
Table: "test_table",
})
if err == nil {
t.Error("expected error for whitespace-only DSN")
}
}
// TestNewClickHouseSink_EmptyTable tests that empty Table returns error.
func TestNewClickHouseSink_EmptyTable(t *testing.T) {
_, err := NewClickHouseSink(Config{
DSN: "clickhouse://localhost:9000/test",
Table: "",
})
if err == nil {
t.Error("expected error for empty Table")
}
}
// TestNewClickHouseSink_WhitespaceTable tests that whitespace Table returns error.
func TestNewClickHouseSink_WhitespaceTable(t *testing.T) {
_, err := NewClickHouseSink(Config{
DSN: "clickhouse://localhost:9000/test",
Table: " ",
})
if err == nil {
t.Error("expected error for whitespace-only Table")
}
}
// TestNewClickHouseSink_InvalidDSN tests that an invalid DSN (no real connection) returns error.
func TestNewClickHouseSink_InvalidDSN(t *testing.T) {
_, err := NewClickHouseSink(Config{
DSN: "not-a-valid-dsn",
Table: "test_table",
})
if err == nil {
t.Error("expected error for invalid DSN")
}
}
// TestClickHouseSink_SetLogger tests that SetLogger sets a logger.
func TestClickHouseSink_SetLogger(t *testing.T) {
s := &ClickHouseSink{
config: Config{Table: "test_table"},
buffer: make([]domain.CorrelatedLog, 0),
}
testLogger := observability.NewLogger("test")
s.SetLogger(testLogger)
if s.logger == nil {
t.Error("expected logger to be set")
}
}
// TestClickHouseSink_LogNilLogger tests that log() returns a logger even when s.logger is nil.
func TestClickHouseSink_LogNilLogger(t *testing.T) {
s := &ClickHouseSink{
config: Config{Table: "test_table"},
buffer: make([]domain.CorrelatedLog, 0),
}
s.logger = nil
// log() should auto-initialize
logger := s.log()
if logger == nil {
t.Error("expected non-nil logger from log()")
}
}
// TestClickHouseSink_Reopen tests that Reopen is a no-op and returns nil.
func TestClickHouseSink_Reopen(t *testing.T) {
s := &ClickHouseSink{
config: Config{Table: "test_table"},
buffer: make([]domain.CorrelatedLog, 0),
}
if err := s.Reopen(); err != nil {
t.Errorf("Reopen() should return nil, got: %v", err)
}
}
// TestClickHouseSink_DoFlushNilConn tests doFlush returns error when conn is nil and buffer non-empty.
func TestClickHouseSink_DoFlushNilConn(t *testing.T) {
log := domain.CorrelatedLog{SrcIP: "1.2.3.4", SrcPort: 1234}
s := &ClickHouseSink{
config: Config{
Table: "test_table",
BatchSize: DefaultBatchSize,
},
buffer: []domain.CorrelatedLog{log},
conn: nil,
}
err := s.doFlush(context.Background())
if err == nil {
t.Error("expected error from doFlush when conn is nil")
}
}
// TestClickHouseSink_CloseTwice tests that calling Close() twice does not panic or error.
func TestClickHouseSink_CloseTwice(t *testing.T) {
s := &ClickHouseSink{
config: Config{
Table: "test_table",
TimeoutMs: DefaultTimeoutMs,
},
buffer: make([]domain.CorrelatedLog, 0),
done: make(chan struct{}),
}
if err := s.Close(); err != nil {
t.Errorf("first Close() should not error, got: %v", err)
}
if err := s.Close(); err != nil {
t.Errorf("second Close() should not error (closeOnce), got: %v", err)
}
}
// TestClickHouseSink_Write_Timeout tests that Write returns an error when the buffer is full and the timeout is exceeded.
func TestClickHouseSink_Write_Timeout(t *testing.T) {
s := &ClickHouseSink{
config: Config{
Table: "test_table",
MaxBufferSize: 1,
DropOnOverflow: false,
TimeoutMs: 1, // 1ms timeout
},
buffer: make([]domain.CorrelatedLog, 0, 1),
}
log := domain.CorrelatedLog{SrcIP: "1.2.3.4", SrcPort: 1234}
// Fill the buffer
s.buffer = append(s.buffer, log)
ctx := context.Background()
err := s.Write(ctx, log)
if err == nil {
t.Error("expected error when buffer full and timeout exceeded")
}
}
// Benchmark Write operation (without actual DB)
func BenchmarkClickHouseSink_Write(b *testing.B) {
s := &ClickHouseSink{
config: Config{
DSN: "clickhouse://test:test@localhost:9000/test",
Table: "test_table",
MaxBufferSize: 10000,
DropOnOverflow: true,
},
buffer: make([]domain.CorrelatedLog, 0, 10000),
}
log := domain.CorrelatedLog{
Timestamp: time.Now(),
SrcIP: "192.168.1.1",
SrcPort: 8080,
Correlated: true,
}
ctx := context.Background()
b.ResetTimer()
for i := 0; i < b.N; i++ {
s.Write(ctx, log)
}
}


@ -0,0 +1,191 @@
package file
import (
"context"
"encoding/json"
"fmt"
"os"
"path/filepath"
"strings"
"sync"
"github.com/antitbone/ja4/correlator/internal/domain"
)
const (
// DefaultFilePermissions for output files
DefaultFilePermissions os.FileMode = 0644
// DefaultDirPermissions for output directories
DefaultDirPermissions os.FileMode = 0750
)
// Config holds the file sink configuration.
type Config struct {
Path string
}
// FileSink writes correlated logs to a file as JSON lines.
type FileSink struct {
config Config
mu sync.Mutex
file *os.File
}
// NewFileSink creates a new file sink.
func NewFileSink(config Config) (*FileSink, error) {
// Validate path
if err := validateFilePath(config.Path); err != nil {
return nil, fmt.Errorf("invalid file path: %w", err)
}
s := &FileSink{
config: config,
}
// Open file on creation
if err := s.openFile(); err != nil {
return nil, err
}
return s, nil
}
// Name returns the sink name.
func (s *FileSink) Name() string {
return "file"
}
// Reopen closes and reopens the file (for log rotation on SIGHUP).
func (s *FileSink) Reopen() error {
s.mu.Lock()
defer s.mu.Unlock()
if s.file != nil {
if err := s.file.Close(); err != nil {
return fmt.Errorf("failed to close file: %w", err)
}
}
return s.openFile()
}
// Write writes a correlated log to the file.
func (s *FileSink) Write(ctx context.Context, log domain.CorrelatedLog) error {
s.mu.Lock()
defer s.mu.Unlock()
if s.file == nil {
if err := s.openFile(); err != nil {
return err
}
}
data, err := json.Marshal(log)
if err != nil {
return fmt.Errorf("failed to marshal log: %w", err)
}
line := append(data, '\n')
if _, err := s.file.Write(line); err != nil {
return fmt.Errorf("failed to write log line: %w", err)
}
if err := s.file.Sync(); err != nil {
return fmt.Errorf("failed to sync log line: %w", err)
}
return nil
}
// Flush flushes any buffered data.
func (s *FileSink) Flush(ctx context.Context) error {
s.mu.Lock()
defer s.mu.Unlock()
if s.file != nil {
return s.file.Sync()
}
return nil
}
// Close closes the sink.
func (s *FileSink) Close() error {
s.mu.Lock()
defer s.mu.Unlock()
if s.file != nil {
err := s.file.Close()
s.file = nil
return err
}
return nil
}
func (s *FileSink) openFile() error {
// Validate path again before opening
if err := validateFilePath(s.config.Path); err != nil {
return fmt.Errorf("invalid file path: %w", err)
}
// Ensure directory exists
dir := filepath.Dir(s.config.Path)
if err := os.MkdirAll(dir, DefaultDirPermissions); err != nil {
return fmt.Errorf("failed to create directory: %w", err)
}
file, err := os.OpenFile(s.config.Path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, DefaultFilePermissions)
if err != nil {
return fmt.Errorf("failed to open file: %w", err)
}
s.file = file
return nil
}
// validateFilePath validates that the file path is safe and allowed.
func validateFilePath(path string) error {
if strings.TrimSpace(path) == "" {
return fmt.Errorf("path cannot be empty")
}
cleanPath := filepath.Clean(path)
// Allow relative paths for testing/dev
if !filepath.IsAbs(cleanPath) {
return nil
}
absPath, err := filepath.Abs(cleanPath)
if err != nil {
return fmt.Errorf("failed to resolve absolute path: %w", err)
}
allowedRoots := []string{
"/var/log/logcorrelator",
"/var/log",
"/tmp",
}
for _, root := range allowedRoots {
absRoot, err := filepath.Abs(filepath.Clean(root))
if err != nil {
continue
}
rel, err := filepath.Rel(absRoot, absPath)
if err != nil {
continue
}
if rel == "." {
return nil
}
if rel == ".." {
continue
}
if !strings.HasPrefix(rel, ".."+string(os.PathSeparator)) {
return nil
}
}
return fmt.Errorf("path must be under allowed directories: %v", allowedRoots)
}
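The allow-list check in validateFilePath relies on filepath.Rel to tell a real subdirectory apart from a lookalike prefix such as /var/logevil. A minimal sketch of that containment test on its own, assuming a hypothetical helper named isUnderRoot (not part of this package) and a Unix-style path separator:

```go
package main

import (
	"path/filepath"
	"strings"
)

// isUnderRoot reports whether path equals root or lies beneath it.
// Hypothetical helper for illustration only. The check is purely
// lexical, so symbolic links are not resolved.
func isUnderRoot(root, path string) bool {
	rel, err := filepath.Rel(filepath.Clean(root), filepath.Clean(path))
	if err != nil {
		return false
	}
	if rel == ".." {
		// path is the parent of root.
		return false
	}
	// A relative path that climbs out of root starts with "../".
	return !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}
```

Comparing the relative path instead of using a plain string prefix is what rejects /var/logevil/test.log for the root /var/log. Because the comparison is lexical, a symlink inside an allowed root could still point elsewhere.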


@ -0,0 +1,524 @@
package file
import (
"context"
"os"
"path/filepath"
"testing"
"github.com/antitbone/ja4/correlator/internal/domain"
)
func TestFileSink_Write(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
log := domain.CorrelatedLog{
SrcIP: "192.168.1.1",
SrcPort: 8080,
Correlated: true,
}
if err := sink.Write(context.Background(), log); err != nil {
t.Fatalf("failed to write: %v", err)
}
if err := sink.Flush(context.Background()); err != nil {
t.Fatalf("failed to flush: %v", err)
}
// Verify file exists and contains data
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
if len(data) == 0 {
t.Error("expected non-empty file")
}
}
func TestFileSink_WriteImmediatePersist_NoFlushNeeded(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
log := domain.CorrelatedLog{
SrcIP: "192.168.1.1",
SrcPort: 8080,
Correlated: true,
}
if err := sink.Write(context.Background(), log); err != nil {
t.Fatalf("failed to write: %v", err)
}
// Must be visible immediately without Flush()
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
if len(data) == 0 {
t.Error("expected data to be present immediately after Write without Flush")
}
}
func TestFileSink_MultipleWrites(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
for i := 0; i < 5; i++ {
log := domain.CorrelatedLog{
SrcIP: "192.168.1.1",
SrcPort: 8080 + i,
}
if err := sink.Write(context.Background(), log); err != nil {
t.Fatalf("failed to write: %v", err)
}
}
sink.Close()
// Verify file has 5 lines
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
lines := 0
for _, b := range data {
if b == '\n' {
lines++
}
}
if lines != 5 {
t.Errorf("expected 5 lines, got %d", lines)
}
}
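func TestFileSink_Name(t *testing.T) {
// Use a per-test temp dir rather than a shared /tmp path, and close the sink when done.
sink, err := NewFileSink(Config{Path: filepath.Join(t.TempDir(), "test.log")})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
if sink.Name() != "file" {
t.Errorf("expected name 'file', got %s", sink.Name())
}
}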
func TestFileSink_ValidateFilePath(t *testing.T) {
tests := []struct {
name string
path string
wantErr bool
}{
{"empty path", "", true},
{"valid /var/log/logcorrelator", "/var/log/logcorrelator/test.log", false},
{"valid /var/log", "/var/log/test.log", false},
{"valid /tmp", "/tmp/test.log", false},
{"reject lookalike /var/logevil", "/var/logevil/test.log", true},
{"invalid directory", "/etc/logcorrelator/test.log", true},
{"relative path", "test.log", false}, // Allowed for testing
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
err := validateFilePath(tt.path)
if (err != nil) != tt.wantErr {
t.Errorf("validateFilePath(%q) error = %v, wantErr %v", tt.path, err, tt.wantErr)
}
})
}
}
func TestFileSink_OpenFile(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "subdir", "test.log")
sink := &FileSink{
config: Config{Path: testPath},
}
err := sink.openFile()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
defer sink.Close()
if sink.file == nil {
t.Error("expected file to be opened")
}
}
func TestFileSink_WriteBeforeOpen(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
// Write should open file automatically
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080}
err = sink.Write(context.Background(), log)
if err != nil {
t.Fatalf("failed to write: %v", err)
}
// Verify file was created
if _, err := os.Stat(testPath); os.IsNotExist(err) {
t.Error("expected file to be created")
}
}
func TestFileSink_FlushBeforeOpen(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
// Flush before any write should not error
err = sink.Flush(context.Background())
if err != nil {
t.Errorf("expected no error on flush before open, got %v", err)
}
}
func TestFileSink_InvalidPath(t *testing.T) {
// Test with invalid path (outside allowed directories)
_, err := NewFileSink(Config{Path: "/etc/../passwd"})
if err == nil {
t.Error("expected error for invalid path")
}
}
func TestFileSink_Reopen(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
// Write initial data
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080}
if err := sink.Write(context.Background(), log); err != nil {
t.Fatalf("failed to write: %v", err)
}
// Reopen should close and reopen the file
err = sink.Reopen()
if err != nil {
t.Errorf("expected no error on Reopen, got %v", err)
}
// Write after reopen
log2 := domain.CorrelatedLog{SrcIP: "192.168.1.2", SrcPort: 8081}
if err := sink.Write(context.Background(), log2); err != nil {
t.Fatalf("failed to write after reopen: %v", err)
}
sink.Close()
// Verify both writes are present
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
lines := 0
for _, b := range data {
if b == '\n' {
lines++
}
}
if lines != 2 {
t.Errorf("expected 2 lines after reopen, got %d", lines)
}
}
func TestFileSink_Close(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
// Close should succeed
err = sink.Close()
if err != nil {
t.Errorf("expected no error on Close, got %v", err)
}
// Write after Close reopens the file (Write calls openFile when s.file is nil),
// so an error here is unexpected; this test only logs it.
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080}
err = sink.Write(context.Background(), log)
if err != nil {
t.Logf("write after close returned error: %v", err)
}
// Close again so the reopened file handle is not leaked.
sink.Close()
}
func TestFileSink_EmptyPath(t *testing.T) {
_, err := NewFileSink(Config{Path: ""})
if err == nil {
t.Error("expected error for empty path")
}
}
func TestFileSink_WhitespacePath(t *testing.T) {
_, err := NewFileSink(Config{Path: " "})
if err == nil {
t.Error("expected error for whitespace-only path")
}
}
func TestFileSink_ValidateFilePath_AllowedRoots(t *testing.T) {
// Test paths under allowed roots
allowedPaths := []string{
"/var/log/logcorrelator/correlated.log",
"/var/log/test.log",
"/tmp/test.log",
"/tmp/subdir/test.log",
"relative/path/test.log",
"./test.log",
}
for _, path := range allowedPaths {
err := validateFilePath(path)
if err != nil {
t.Errorf("validateFilePath(%q) unexpected error: %v", path, err)
}
}
}
func TestFileSink_ValidateFilePath_RejectedPaths(t *testing.T) {
// Test paths that should be rejected
rejectedPaths := []string{
"",
" ",
"/etc/passwd",
"/etc/logcorrelator/test.log",
"/root/test.log",
"/home/user/test.log",
"/var/logevil/test.log",
}
for _, path := range rejectedPaths {
err := validateFilePath(path)
if err == nil {
t.Errorf("validateFilePath(%q) should have been rejected", path)
}
}
}
func TestFileSink_ConcurrentWrites(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
done := make(chan bool)
for i := 0; i < 10; i++ {
go func(n int) {
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080 + n}
sink.Write(context.Background(), log)
done <- true
}(i)
}
for i := 0; i < 10; i++ {
<-done
}
// Verify all writes completed
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
lines := 0
for _, b := range data {
if b == '\n' {
lines++
}
}
if lines != 10 {
t.Errorf("expected 10 lines from concurrent writes, got %d", lines)
}
}
func TestFileSink_Flush(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
log := domain.CorrelatedLog{SrcIP: "192.168.1.1", SrcPort: 8080}
if err := sink.Write(context.Background(), log); err != nil {
t.Fatalf("failed to write: %v", err)
}
// Flush should succeed
err = sink.Flush(context.Background())
if err != nil {
t.Errorf("expected no error on Flush, got %v", err)
}
}
func TestFileSink_MarshalError(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
// Create a log with unmarshalable data (channel)
log := domain.CorrelatedLog{
SrcIP: "192.168.1.1",
SrcPort: 8080,
Fields: map[string]any{"chan": make(chan int)},
}
err = sink.Write(context.Background(), log)
if err == nil {
t.Error("expected error when marshaling unmarshalable data")
}
}
// TestFileSink_CloseTwice tests that closing an already-closed sink does not error.
func TestFileSink_CloseTwice(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
if err := sink.Close(); err != nil {
t.Errorf("first Close() should not error, got: %v", err)
}
// After close, file is nil, so second close should return nil
if err := sink.Close(); err != nil {
t.Errorf("second Close() on already-closed sink should not error, got: %v", err)
}
}
// TestFileSink_WriteAfterCloseReopens tests that Write after Close reopens the file.
func TestFileSink_WriteAfterCloseReopens(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
if err := sink.Close(); err != nil {
t.Fatalf("Close() failed: %v", err)
}
// Write after close: FileSink.Write reopens the file when file == nil
log := domain.CorrelatedLog{SrcIP: "1.2.3.4", SrcPort: 80}
if err := sink.Write(context.Background(), log); err != nil {
t.Errorf("Write after close should succeed (auto-reopen), got: %v", err)
}
// Verify data was written
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
if len(data) == 0 {
t.Error("expected data to be present after write on re-opened file")
}
}
// TestFileSink_ReopenThenWrite tests that a Reopen followed by a write produces the correct output.
func TestFileSink_ReopenThenWrite(t *testing.T) {
tmpDir := t.TempDir()
testPath := filepath.Join(tmpDir, "test.log")
sink, err := NewFileSink(Config{Path: testPath})
if err != nil {
t.Fatalf("failed to create sink: %v", err)
}
defer sink.Close()
// Write before reopen
log1 := domain.CorrelatedLog{SrcIP: "1.1.1.1", SrcPort: 80}
if err := sink.Write(context.Background(), log1); err != nil {
t.Fatalf("first Write failed: %v", err)
}
// Simulate log rotation
if err := sink.Reopen(); err != nil {
t.Fatalf("Reopen failed: %v", err)
}
// Write after reopen
log2 := domain.CorrelatedLog{SrcIP: "2.2.2.2", SrcPort: 443}
if err := sink.Write(context.Background(), log2); err != nil {
t.Fatalf("second Write failed: %v", err)
}
sink.Close()
data, err := os.ReadFile(testPath)
if err != nil {
t.Fatalf("failed to read file: %v", err)
}
lines := 0
for _, b := range data {
if b == '\n' {
lines++
}
}
if lines != 2 {
t.Errorf("expected 2 lines after reopen+write, got %d", lines)
}
}
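Several tests in this file count newline bytes with the same hand-written loop. A minimal sketch of a shared helper, assuming a hypothetical name countLines that does not exist in the package, using bytes.Count from the standard library:

```go
package main

import "bytes"

// countLines returns the number of newline-terminated lines in data.
// Hypothetical test helper for illustration only.
func countLines(data []byte) int {
	return bytes.Count(data, []byte("\n"))
}
```

With such a helper, each assertion could read `if got := countLines(data); got != 2 { ... }` instead of repeating the loop.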


@ -0,0 +1,137 @@
package multi
import (
"context"
"sync"
"github.com/antitbone/ja4/correlator/internal/domain"
"github.com/antitbone/ja4/correlator/internal/ports"
)
// MultiSink fans out correlated logs to multiple sinks.
type MultiSink struct {
mu sync.RWMutex
sinks []ports.CorrelatedLogSink
}
// NewMultiSink creates a new multi-sink.
func NewMultiSink(sinks ...ports.CorrelatedLogSink) *MultiSink {
return &MultiSink{
sinks: sinks,
}
}
// Name returns the sink name.
func (s *MultiSink) Name() string {
return "multi"
}
// AddSink adds a sink to the fan-out.
func (s *MultiSink) AddSink(sink ports.CorrelatedLogSink) {
s.mu.Lock()
defer s.mu.Unlock()
s.sinks = append(s.sinks, sink)
}
// Write writes a correlated log to all sinks concurrently.
// It returns the first error reported once every sink has finished, or
// ctx.Err() if the context is cancelled first (sinks still running are not waited for).
func (s *MultiSink) Write(ctx context.Context, log domain.CorrelatedLog) error {
s.mu.RLock()
sinks := make([]ports.CorrelatedLogSink, len(s.sinks))
copy(sinks, s.sinks)
s.mu.RUnlock()
if len(sinks) == 0 {
return nil
}
var wg sync.WaitGroup
// Buffered to len(sinks): each goroutine sends at most once, so a send never blocks.
errChan := make(chan error, len(sinks))
for _, sink := range sinks {
wg.Add(1)
go func(sk ports.CorrelatedLogSink) {
defer wg.Done()
if err := sk.Write(ctx, log); err != nil {
errChan <- err
}
}(sink)
}
// Wait for all writes to complete in a separate goroutine
done := make(chan struct{})
go func() {
wg.Wait()
close(done)
}()
// Wait for every sink, or give up early if the context is cancelled.
select {
case <-done:
close(errChan)
// Receiving from the closed channel yields the first buffered error, or nil if none was sent.
return <-errChan
case <-ctx.Done():
return ctx.Err()
}
}
// Flush flushes the sinks in order and returns at the first error; later sinks are not flushed.
func (s *MultiSink) Flush(ctx context.Context) error {
s.mu.RLock()
defer s.mu.RUnlock()
for _, sink := range s.sinks {
if err := sink.Flush(ctx); err != nil {
return err
}
}
return nil
}
// Close closes all sinks.
func (s *MultiSink) Close() error {
s.mu.RLock()
defer s.mu.RUnlock()
var firstErr error
for _, sink := range s.sinks {
if err := sink.Close(); err != nil && firstErr == nil {
firstErr = err
}
}
return firstErr
}
// Reopen reopens all sinks (for log rotation on SIGHUP).
func (s *MultiSink) Reopen() error {
s.mu.RLock()
defer s.mu.RUnlock()
var firstErr error
for _, sink := range s.sinks {
if err := sink.Reopen(); err != nil && firstErr == nil {
firstErr = err
}
}
return firstErr
}
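MultiSink.Write reports only the first error, so failures from the other sinks are lost. A minimal sketch of an alternative fan-out that keeps every error by using errors.Join (Go 1.20 and later). The names writeAll and errSinkFailed are hypothetical, and unlike MultiSink.Write this version has no context cancellation path:

```go
package main

import (
	"errors"
	"sync"
)

// errSinkFailed is a placeholder error used only in this sketch.
var errSinkFailed = errors.New("sink failed")

// writeAll runs every writer concurrently, waits for all of them,
// and joins the errors they return (nil if none failed).
func writeAll(writers ...func() error) error {
	errs := make([]error, len(writers))
	var wg sync.WaitGroup
	for i, w := range writers {
		wg.Add(1)
		go func(i int, w func() error) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			errs[i] = w()
		}(i, w)
	}
	wg.Wait()
	// errors.Join drops nil entries and returns nil when all are nil.
	return errors.Join(errs...)
}
```

A caller can still test for a specific failure with errors.Is on the joined error. Whether reporting all errors is preferable to the first one depends on how the orchestrator is meant to react, which this commit does not specify.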


@ -0,0 +1,233 @@
package multi
import (
"context"
"sync"
"testing"
"github.com/antitbone/ja4/correlator/internal/domain"
)
type mockSink struct {
name string
mu sync.Mutex
writeFunc func(domain.CorrelatedLog) error
flushFunc func() error
closeFunc func() error
reopenFunc func() error
}
func (m *mockSink) Name() string { return m.name }
func (m *mockSink) Write(ctx context.Context, log domain.CorrelatedLog) error {
m.mu.Lock()
defer m.mu.Unlock()
return m.writeFunc(log)
}
func (m *mockSink) Flush(ctx context.Context) error { return m.flushFunc() }
func (m *mockSink) Close() error { return m.closeFunc() }
func (m *mockSink) Reopen() error {
if m.reopenFunc != nil {
return m.reopenFunc()
}
return nil
}
func TestMultiSink_Write(t *testing.T) {
var mu sync.Mutex
writeCount := 0
sink1 := &mockSink{
name: "sink1",
writeFunc: func(log domain.CorrelatedLog) error {
mu.Lock()
writeCount++
mu.Unlock()
return nil
},
flushFunc: func() error { return nil },
closeFunc: func() error { return nil },
}
sink2 := &mockSink{
name: "sink2",
writeFunc: func(log domain.CorrelatedLog) error {
mu.Lock()
writeCount++
mu.Unlock()
return nil
},
flushFunc: func() error { return nil },
closeFunc: func() error { return nil },
}
ms := NewMultiSink(sink1, sink2)
log := domain.CorrelatedLog{SrcIP: "192.168.1.1"}
err := ms.Write(context.Background(), log)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if writeCount != 2 {
t.Errorf("expected 2 writes, got %d", writeCount)
}
}
func TestMultiSink_Write_OneFails(t *testing.T) {
sink1 := &mockSink{
name: "sink1",
writeFunc: func(log domain.CorrelatedLog) error {
return nil
},
flushFunc: func() error { return nil },
closeFunc: func() error { return nil },
}
sink2 := &mockSink{
name: "sink2",
writeFunc: func(log domain.CorrelatedLog) error {
return context.Canceled
},
flushFunc: func() error { return nil },
closeFunc: func() error { return nil },
}
ms := NewMultiSink(sink1, sink2)
log := domain.CorrelatedLog{SrcIP: "192.168.1.1"}
err := ms.Write(context.Background(), log)
if err == nil {
t.Error("expected error when one sink fails")
}
}
func TestMultiSink_AddSink(t *testing.T) {
ms := NewMultiSink()
sink := &mockSink{
name: "dynamic",
writeFunc: func(log domain.CorrelatedLog) error { return nil },
flushFunc: func() error { return nil },
closeFunc: func() error { return nil },
}
ms.AddSink(sink)
log := domain.CorrelatedLog{SrcIP: "192.168.1.1"}
err := ms.Write(context.Background(), log)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
}
func TestMultiSink_Name(t *testing.T) {
ms := NewMultiSink()
if ms.Name() != "multi" {
t.Errorf("expected name 'multi', got %s", ms.Name())
}
}
func TestMultiSink_Flush(t *testing.T) {
flushed := false
sink := &mockSink{
name: "test",
writeFunc: func(log domain.CorrelatedLog) error { return nil },
flushFunc: func() error {
flushed = true
return nil
},
closeFunc: func() error { return nil },
}
ms := NewMultiSink(sink)
err := ms.Flush(context.Background())
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !flushed {
t.Error("expected sink to be flushed")
}
}
func TestMultiSink_Flush_Error(t *testing.T) {
sink := &mockSink{
name: "test",
writeFunc: func(log domain.CorrelatedLog) error { return nil },
flushFunc: func() error { return context.Canceled },
closeFunc: func() error { return nil },
}
ms := NewMultiSink(sink)
err := ms.Flush(context.Background())
if err != context.Canceled {
t.Errorf("expected context.Canceled error, got %v", err)
}
}
func TestMultiSink_Close(t *testing.T) {
closed := false
sink := &mockSink{
name: "test",
writeFunc: func(log domain.CorrelatedLog) error { return nil },
flushFunc: func() error { return nil },
closeFunc: func() error {
closed = true
return nil
},
}
ms := NewMultiSink(sink)
err := ms.Close()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !closed {
t.Error("expected sink to be closed")
}
}
func TestMultiSink_Close_Error(t *testing.T) {
sink := &mockSink{
name: "test",
writeFunc: func(log domain.CorrelatedLog) error { return nil },
flushFunc: func() error { return nil },
closeFunc: func() error { return context.Canceled },
}
ms := NewMultiSink(sink)
err := ms.Close()
if err != context.Canceled {
t.Errorf("expected context.Canceled error, got %v", err)
}
}
func TestMultiSink_Write_EmptySinks(t *testing.T) {
ms := NewMultiSink()
log := domain.CorrelatedLog{SrcIP: "192.168.1.1"}
err := ms.Write(context.Background(), log)
if err != nil {
t.Fatalf("unexpected error with empty sinks: %v", err)
}
}
func TestMultiSink_Write_ContextCancelled(t *testing.T) {
sink := &mockSink{
name: "test",
writeFunc: func(log domain.CorrelatedLog) error {
// Blocks forever: the background context is never cancelled, which simulates a hung sink.
<-context.Background().Done()
return nil
},
flushFunc: func() error { return nil },
closeFunc: func() error { return nil },
}
ms := NewMultiSink(sink)
ctx, cancel := context.WithCancel(context.Background())
cancel()
log := domain.CorrelatedLog{SrcIP: "192.168.1.1"}
err := ms.Write(ctx, log)
if err != context.Canceled {
t.Errorf("expected context.Canceled error, got %v", err)
}
}


@ -0,0 +1,46 @@
package stdout
import (
"context"
"github.com/antitbone/ja4/correlator/internal/domain"
)
// Config holds the stdout sink configuration.
type Config struct {
Enabled bool
}
// StdoutSink is a no-op data sink. Operational logs are written to stderr
// by the observability.Logger; correlated data must never appear on stdout.
type StdoutSink struct{}
// NewStdoutSink creates a new stdout sink.
func NewStdoutSink(config Config) *StdoutSink {
return &StdoutSink{}
}
// Name returns the sink name.
func (s *StdoutSink) Name() string {
return "stdout"
}
// Reopen is a no-op for stdout.
func (s *StdoutSink) Reopen() error {
return nil
}
// Write is a no-op: correlated data must never be written to stdout.
func (s *StdoutSink) Write(_ context.Context, _ domain.CorrelatedLog) error {
return nil
}
// Flush is a no-op for stdout.
func (s *StdoutSink) Flush(_ context.Context) error {
return nil
}
// Close is a no-op for stdout.
func (s *StdoutSink) Close() error {
return nil
}


@ -0,0 +1,81 @@
package stdout
import (
"bytes"
"context"
"os"
"testing"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
)
func makeLog(correlated bool) domain.CorrelatedLog {
return domain.CorrelatedLog{
Timestamp: time.Unix(1700000000, 0),
SrcIP: "1.2.3.4",
SrcPort: 12345,
Correlated: correlated,
}
}
// captureStdout replaces os.Stdout temporarily and returns what was written.
func captureStdout(t *testing.T, fn func()) string {
t.Helper()
r, w, err := os.Pipe()
if err != nil {
t.Fatalf("os.Pipe: %v", err)
}
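old := os.Stdout
os.Stdout = w
// Restore stdout even if fn exits early (for example via t.Fatalf).
defer func() { os.Stdout = old }()
fn()
w.Close()
var buf bytes.Buffer
if _, err := buf.ReadFrom(r); err != nil {
t.Fatalf("reading captured stdout: %v", err)
}
r.Close()
return buf.String()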
}
func TestStdoutSink_Name(t *testing.T) {
s := NewStdoutSink(Config{Enabled: true})
if s.Name() != "stdout" {
t.Errorf("expected name 'stdout', got %q", s.Name())
}
}
// TestStdoutSink_WriteDoesNotProduceOutput verifies that no JSON data
// (correlated or not) is ever written to stdout.
func TestStdoutSink_WriteDoesNotProduceOutput(t *testing.T) {
s := NewStdoutSink(Config{Enabled: true})
got := captureStdout(t, func() {
if err := s.Write(context.Background(), makeLog(true)); err != nil {
t.Fatalf("Write(correlated) returned error: %v", err)
}
if err := s.Write(context.Background(), makeLog(false)); err != nil {
t.Fatalf("Write(orphan) returned error: %v", err)
}
})
if got != "" {
t.Errorf("stdout must be empty but got: %q", got)
}
}
func TestStdoutSink_NoopMethods(t *testing.T) {
s := NewStdoutSink(Config{Enabled: true})
if err := s.Flush(context.Background()); err != nil {
t.Errorf("Flush returned error: %v", err)
}
if err := s.Close(); err != nil {
t.Errorf("Close returned error: %v", err)
}
if err := s.Reopen(); err != nil {
t.Errorf("Reopen returned error: %v", err)
}
}


@ -0,0 +1,160 @@
package app
import (
"context"
"sync"
"sync/atomic"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
"github.com/antitbone/ja4/correlator/internal/ports"
)
const (
// DefaultEventChannelBufferSize is the default size for event channels
DefaultEventChannelBufferSize = 1000
// OrphanTickInterval is how often the orchestrator drains pending orphans.
// Set to half the default emit delay (500ms/2) so orphans are emitted promptly
// even when no new events arrive.
OrphanTickInterval = 250 * time.Millisecond
)
// OrchestratorConfig holds the orchestrator configuration.
type OrchestratorConfig struct {
Sources []ports.EventSource
Sink ports.CorrelatedLogSink
}
// Orchestrator connects sources to the correlation service and sinks.
type Orchestrator struct {
config OrchestratorConfig
correlationSvc ports.CorrelationProcessor
ctx context.Context
cancel context.CancelFunc
wg sync.WaitGroup
running atomic.Bool
}
// NewOrchestrator creates a new orchestrator.
func NewOrchestrator(config OrchestratorConfig, correlationSvc ports.CorrelationProcessor) *Orchestrator {
ctx, cancel := context.WithCancel(context.Background())
return &Orchestrator{
config: config,
correlationSvc: correlationSvc,
ctx: ctx,
cancel: cancel,
}
}
// Start begins the orchestration.
func (o *Orchestrator) Start() error {
if !o.running.CompareAndSwap(false, true) {
return nil // Already running
}
// Start each source
for _, source := range o.config.Sources {
eventChan := make(chan *domain.NormalizedEvent, DefaultEventChannelBufferSize)
o.wg.Add(1)
go func(src ports.EventSource, evChan chan *domain.NormalizedEvent) {
defer o.wg.Done()
// Start the source in a separate goroutine
sourceErr := make(chan error, 1)
go func() {
// Always send the result, nil included, so the receive below cannot block forever.
sourceErr <- src.Start(o.ctx, evChan)
}()
// Process events in the current goroutine
o.processEvents(evChan)
// Wait for the source to return. Its error is discarded here because no logger is wired in.
<-sourceErr
}(source, eventChan)
}
// Start a periodic ticker to drain pending orphan A events independently of the
// event flow. Without this, orphans are only emitted when a new event arrives,
// causing them to accumulate silently when the source goes quiet.
o.wg.Add(1)
go func() {
defer o.wg.Done()
ticker := time.NewTicker(OrphanTickInterval)
defer ticker.Stop()
for {
select {
case <-o.ctx.Done():
return
case <-ticker.C:
logs := o.correlationSvc.EmitPendingOrphans()
for _, log := range logs {
o.config.Sink.Write(o.ctx, log) //nolint:errcheck
}
}
}
}()
return nil
}
func (o *Orchestrator) processEvents(eventChan <-chan *domain.NormalizedEvent) {
for {
select {
case <-o.ctx.Done():
// Drain remaining events before exiting
for {
select {
case event, ok := <-eventChan:
if !ok {
return
}
logs := o.correlationSvc.ProcessEvent(event)
for _, log := range logs {
o.config.Sink.Write(o.ctx, log)
}
default:
return
}
}
case event, ok := <-eventChan:
if !ok {
return
}
// Process through correlation service
logs := o.correlationSvc.ProcessEvent(event)
// Write correlated logs to sink
for _, log := range logs {
if err := o.config.Sink.Write(o.ctx, log); err != nil {
// Log error but continue processing
}
}
}
}
}
// Stop gracefully stops the orchestrator.
// It stops all sources and closes sinks immediately without waiting for queue drainage.
// systemd TimeoutStopSec handles forced termination if needed.
func (o *Orchestrator) Stop() error {
if !o.running.CompareAndSwap(true, false) {
return nil // Not running
}
// Cancel context to stop accepting new events immediately
o.cancel()
// Close sink (flush skipped - in-flight events are dropped)
if err := o.config.Sink.Close(); err != nil {
// Log error
}
return nil
}
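
Start and Stop above rely on `atomic.Bool.CompareAndSwap` so that a repeated call becomes a no-op. The following is a minimal sketch of that guard in isolation; the `lifecycle` type is hypothetical and stands in for the orchestrator's `running` field only.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lifecycle is a hypothetical type that mirrors only the running flag.
type lifecycle struct {
	running atomic.Bool
}

// Start reports true only for the call that flips running from false to true.
func (l *lifecycle) Start() bool {
	return l.running.CompareAndSwap(false, true)
}

// Stop reports true only for the call that flips running from true to false.
func (l *lifecycle) Stop() bool {
	return l.running.CompareAndSwap(true, false)
}

func main() {
	var l lifecycle
	fmt.Println(l.Start(), l.Start()) // true false
	fmt.Println(l.Stop(), l.Stop())   // true false
}
```

Because the swap is a single atomic operation, two goroutines calling Start at the same time cannot both observe `false`, which is what makes the early `return nil` safe.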


@ -0,0 +1,300 @@
package app
import (
"context"
"sync"
"testing"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
"github.com/antitbone/ja4/correlator/internal/ports"
)
type mockEventSource struct {
name string
mu sync.RWMutex
eventChan chan<- *domain.NormalizedEvent
started bool
stopped bool
}
func (m *mockEventSource) Name() string { return m.name }
func (m *mockEventSource) Start(ctx context.Context, eventChan chan<- *domain.NormalizedEvent) error {
m.mu.Lock()
m.started = true
m.eventChan = eventChan
m.mu.Unlock()
<-ctx.Done()
m.mu.Lock()
m.stopped = true
m.mu.Unlock()
return nil
}
func (m *mockEventSource) Stop() error { return nil }
func (m *mockEventSource) getEventChan() chan<- *domain.NormalizedEvent {
m.mu.RLock()
defer m.mu.RUnlock()
return m.eventChan
}
func (m *mockEventSource) isStarted() bool {
m.mu.RLock()
defer m.mu.RUnlock()
return m.started
}
type mockSink struct {
mu sync.Mutex
written []domain.CorrelatedLog
}
func (m *mockSink) Name() string { return "mock" }
func (m *mockSink) Write(ctx context.Context, log domain.CorrelatedLog) error {
m.mu.Lock()
defer m.mu.Unlock()
m.written = append(m.written, log)
return nil
}
func (m *mockSink) Flush(ctx context.Context) error { return nil }
func (m *mockSink) Close() error { return nil }
func (m *mockSink) Reopen() error { return nil }
func (m *mockSink) getWritten() []domain.CorrelatedLog {
m.mu.Lock()
defer m.mu.Unlock()
result := make([]domain.CorrelatedLog, len(m.written))
copy(result, m.written)
return result
}
func TestOrchestrator_StartStop(t *testing.T) {
source := &mockEventSource{name: "test"}
sink := &mockSink{}
corrConfig := domain.CorrelationConfig{
TimeWindow: time.Second,
ApacheAlwaysEmit: true,
NetworkEmit: false,
}
correlationSvc := domain.NewCorrelationService(corrConfig, &domain.RealTimeProvider{})
orchestrator := NewOrchestrator(OrchestratorConfig{
Sources: []ports.EventSource{source},
Sink: sink,
}, correlationSvc)
if err := orchestrator.Start(); err != nil {
t.Fatalf("failed to start: %v", err)
}
// Let it run briefly
time.Sleep(100 * time.Millisecond)
if err := orchestrator.Stop(); err != nil {
t.Fatalf("failed to stop: %v", err)
}
if !source.isStarted() {
t.Error("expected source to be started")
}
}
func TestOrchestrator_ProcessEvent(t *testing.T) {
source := &mockEventSource{name: "test"}
sink := &mockSink{}
corrConfig := domain.CorrelationConfig{
TimeWindow: time.Second,
ApacheAlwaysEmit: true,
NetworkEmit: false,
}
correlationSvc := domain.NewCorrelationService(corrConfig, &domain.RealTimeProvider{})
orchestrator := NewOrchestrator(OrchestratorConfig{
Sources: []ports.EventSource{source},
Sink: sink,
}, correlationSvc)
if err := orchestrator.Start(); err != nil {
t.Fatalf("failed to start: %v", err)
}
// Wait for source to start and get the channel
var eventChan chan<- *domain.NormalizedEvent
for i := 0; i < 50; i++ {
eventChan = source.getEventChan()
if eventChan != nil {
break
}
time.Sleep(10 * time.Millisecond)
}
if eventChan == nil {
t.Fatal("source did not start properly")
}
// Send an event through the source
event := &domain.NormalizedEvent{
Source: domain.SourceA,
Timestamp: time.Now(),
SrcIP: "192.168.1.1",
SrcPort: 8080,
Raw: map[string]any{"method": "GET"},
}
// Send event
eventChan <- event
// Give it time to process
time.Sleep(100 * time.Millisecond)
if err := orchestrator.Stop(); err != nil {
t.Fatalf("failed to stop: %v", err)
}
// Should have written at least one log (the orphan A)
written := sink.getWritten()
if len(written) == 0 {
t.Error("expected at least one log to be written")
}
}
// TestOrchestrator_StartTwice tests that calling Start() twice is a no-op (already running).
func TestOrchestrator_StartTwice(t *testing.T) {
source := &mockEventSource{name: "test"}
sink := &mockSink{}
corrConfig := domain.CorrelationConfig{
TimeWindow: time.Second,
ApacheAlwaysEmit: true,
}
correlationSvc := domain.NewCorrelationService(corrConfig, &domain.RealTimeProvider{})
o := NewOrchestrator(OrchestratorConfig{
Sources: []ports.EventSource{source},
Sink: sink,
}, correlationSvc)
if err := o.Start(); err != nil {
t.Fatalf("first Start() failed: %v", err)
}
if err := o.Start(); err != nil {
t.Errorf("second Start() should be no-op, got: %v", err)
}
o.Stop()
}
// TestOrchestrator_StopTwice tests that calling Stop() twice is a no-op.
func TestOrchestrator_StopTwice(t *testing.T) {
source := &mockEventSource{name: "test"}
sink := &mockSink{}
corrConfig := domain.CorrelationConfig{
TimeWindow: time.Second,
ApacheAlwaysEmit: true,
}
correlationSvc := domain.NewCorrelationService(corrConfig, &domain.RealTimeProvider{})
o := NewOrchestrator(OrchestratorConfig{
Sources: []ports.EventSource{source},
Sink: sink,
}, correlationSvc)
o.Start()
if err := o.Stop(); err != nil {
t.Errorf("first Stop() failed: %v", err)
}
if err := o.Stop(); err != nil {
t.Errorf("second Stop() should be no-op, got: %v", err)
}
}
// TestOrchestrator_NoSources tests that Orchestrator works with no sources.
func TestOrchestrator_NoSources(t *testing.T) {
sink := &mockSink{}
corrConfig := domain.CorrelationConfig{TimeWindow: time.Second}
correlationSvc := domain.NewCorrelationService(corrConfig, &domain.RealTimeProvider{})
o := NewOrchestrator(OrchestratorConfig{
Sources: []ports.EventSource{},
Sink: sink,
}, correlationSvc)
if err := o.Start(); err != nil {
t.Fatalf("Start() with no sources failed: %v", err)
}
time.Sleep(50 * time.Millisecond)
if err := o.Stop(); err != nil {
t.Errorf("Stop() failed: %v", err)
}
}
// TestOrchestrator_OrphanEmission tests that orphan A events are emitted via tick.
func TestOrchestrator_OrphanEmission(t *testing.T) {
source := &mockEventSource{name: "test"}
sink := &mockSink{}
corrConfig := domain.CorrelationConfig{
TimeWindow: 50 * time.Millisecond,
ApacheAlwaysEmit: true,
ApacheEmitDelayMs: 10, // Very short delay so orphans emit quickly
}
correlationSvc := domain.NewCorrelationService(corrConfig, &domain.RealTimeProvider{})
o := NewOrchestrator(OrchestratorConfig{
Sources: []ports.EventSource{source},
Sink: sink,
}, correlationSvc)
if err := o.Start(); err != nil {
t.Fatalf("Start() failed: %v", err)
}
// Wait for source to be ready
var eventChan chan<- *domain.NormalizedEvent
for i := 0; i < 50; i++ {
eventChan = source.getEventChan()
if eventChan != nil {
break
}
time.Sleep(5 * time.Millisecond)
}
if eventChan == nil {
t.Fatal("source did not start")
}
// Send a source A event (Apache/HTTP)
eventChan <- &domain.NormalizedEvent{
Source: domain.SourceA,
Timestamp: time.Now(),
SrcIP: "10.0.0.1",
SrcPort: 12345,
Raw: map[string]any{"method": "GET"},
}
// Allow time for orphan ticker to fire (OrphanTickInterval = 250ms, but emit delay is 10ms)
time.Sleep(600 * time.Millisecond)
o.Stop()
written := sink.getWritten()
if len(written) == 0 {
t.Error("expected at least one orphan log to be emitted")
}
}
// TestOrchestrator_Constants tests that constants have reasonable values.
func TestOrchestrator_Constants(t *testing.T) {
if DefaultEventChannelBufferSize <= 0 {
t.Error("DefaultEventChannelBufferSize should be positive")
}
if OrphanTickInterval <= 0 {
t.Error("OrphanTickInterval should be positive")
}
}
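
Several tests above poll `getEventChan` in a hand-written loop and then sleep for a fixed time. One possible way to factor that out is a small polling helper; this is a sketch only, and `eventually` and `demo` are hypothetical names that do not exist in the repository.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every interval until it returns true or the timeout
// elapses. It reports whether cond became true in time.
func eventually(timeout, interval time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// demo flips a flag from another goroutine and waits for it, then waits on a
// condition that never holds to show the timeout path.
func demo() (becameTrue, neverTrue bool) {
	var ready atomic.Bool
	go func() {
		time.Sleep(20 * time.Millisecond)
		ready.Store(true)
	}()
	becameTrue = eventually(time.Second, 5*time.Millisecond, ready.Load)
	neverTrue = eventually(30*time.Millisecond, 5*time.Millisecond, func() bool { return false })
	return becameTrue, neverTrue
}

func main() {
	fmt.Println(demo()) // true false
}
```

In a test, the condition could be something like `func() bool { return len(sink.getWritten()) > 0 }`, which returns as soon as the sink has data instead of always waiting the full sleep.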


@ -0,0 +1,406 @@
package config
import (
"fmt"
"net"
"os"
"strconv"
"strings"
"time"
"github.com/antitbone/ja4/correlator/internal/domain"
"gopkg.in/yaml.v3"
)
// Config holds the complete application configuration.
type Config struct {
Log LogConfig `yaml:"log"`
Inputs InputsConfig `yaml:"inputs"`
Outputs OutputsConfig `yaml:"outputs"`
Correlation CorrelationConfig `yaml:"correlation"`
Metrics MetricsConfig `yaml:"metrics"`
}
// MetricsConfig holds metrics server configuration.
type MetricsConfig struct {
Enabled bool `yaml:"enabled"`
Addr string `yaml:"addr"` // e.g., ":8080", "localhost:8080"
}
// LogConfig holds logging configuration.
type LogConfig struct {
Level string `yaml:"level"` // DEBUG, INFO, WARN, ERROR
}
// GetLevel returns the log level in upper case, defaulting to INFO if not set.
func (c *LogConfig) GetLevel() string {
if c.Level == "" {
return "INFO"
}
return strings.ToUpper(c.Level)
}
// ServiceConfig holds service-level configuration.
type ServiceConfig struct {
Name string `yaml:"name"`
Language string `yaml:"language"`
}
// InputsConfig holds input sources configuration.
type InputsConfig struct {
UnixSockets []UnixSocketConfig `yaml:"unix_sockets"`
}
// UnixSocketConfig holds a Unix socket source configuration.
type UnixSocketConfig struct {
Name string `yaml:"name"`
Path string `yaml:"path"`
Format string `yaml:"format"`
SourceType string `yaml:"source_type"` // "A" for Apache/HTTP, "B" for Network
SocketPermissions string `yaml:"socket_permissions"` // octal string, e.g., "0660", "0666"
}
// OutputsConfig holds output sinks configuration.
type OutputsConfig struct {
File FileOutputConfig `yaml:"file"`
ClickHouse ClickHouseOutputConfig `yaml:"clickhouse"`
Stdout StdoutOutputConfig `yaml:"stdout"`
}
// FileOutputConfig holds file sink configuration.
type FileOutputConfig struct {
Enabled bool `yaml:"enabled"`
Path string `yaml:"path"`
}
// ClickHouseOutputConfig holds ClickHouse sink configuration.
type ClickHouseOutputConfig struct {
Enabled bool `yaml:"enabled"`
DSN string `yaml:"dsn"`
Table string `yaml:"table"`
BatchSize int `yaml:"batch_size"`
FlushIntervalMs int `yaml:"flush_interval_ms"`
MaxBufferSize int `yaml:"max_buffer_size"`
DropOnOverflow bool `yaml:"drop_on_overflow"`
AsyncInsert bool `yaml:"async_insert"`
TimeoutMs int `yaml:"timeout_ms"`
}
// StdoutOutputConfig holds stdout sink configuration.
type StdoutOutputConfig struct {
Enabled bool `yaml:"enabled"`
Level string `yaml:"level"` // DEBUG, INFO, WARN, ERROR - filters output verbosity
}
// CorrelationConfig holds correlation configuration.
type CorrelationConfig struct {
TimeWindow TimeWindowConfig `yaml:"time_window"`
OrphanPolicy OrphanPolicyConfig `yaml:"orphan_policy"`
Matching MatchingConfig `yaml:"matching"`
Buffers BuffersConfig `yaml:"buffers"`
TTL TTLConfig `yaml:"ttl"`
ExcludeSourceIPs []string `yaml:"exclude_source_ips"` // List of source IPs or CIDR ranges to exclude
IncludeDestPorts []int `yaml:"include_dest_ports"` // If non-empty, only correlate events matching these destination ports
// Deprecated: Use TimeWindow.Value instead
TimeWindowS int `yaml:"time_window_s"`
// Deprecated: Use OrphanPolicy.ApacheAlwaysEmit instead
EmitOrphans bool `yaml:"emit_orphans"`
}
// TimeWindowConfig holds time window configuration.
type TimeWindowConfig struct {
Value int `yaml:"value"`
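Unit string `yaml:"unit"` // "ms" or "s" (long forms accepted); any other value means seconds
}
// GetDuration returns the time window as a duration.
// A non-positive Value falls back to 1, and an unrecognized Unit to seconds.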
func (c *TimeWindowConfig) GetDuration() time.Duration {
value := c.Value
if value <= 0 {
value = 1
}
switch c.Unit {
case "ms", "millisecond", "milliseconds":
return time.Duration(value) * time.Millisecond
case "s", "sec", "second", "seconds":
fallthrough
default:
return time.Duration(value) * time.Second
}
}
// OrphanPolicyConfig holds orphan event policy configuration.
type OrphanPolicyConfig struct {
ApacheAlwaysEmit bool `yaml:"apache_always_emit"`
ApacheEmitDelayMs int `yaml:"apache_emit_delay_ms"` // Delay in ms before emitting orphan A
NetworkEmit bool `yaml:"network_emit"`
}
// MatchingConfig holds matching mode configuration.
type MatchingConfig struct {
Mode string `yaml:"mode"` // one_to_one or one_to_many
}
// BuffersConfig holds buffer size configuration.
type BuffersConfig struct {
MaxHTTPItems int `yaml:"max_http_items"`
MaxNetworkItems int `yaml:"max_network_items"`
}
// TTLConfig holds TTL configuration.
type TTLConfig struct {
NetworkTTLS int `yaml:"network_ttl_s"`
}
// Load loads configuration from a YAML file.
func Load(path string) (*Config, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, fmt.Errorf("failed to read config file: %w", err)
}
cfg := defaultConfig()
if err := yaml.Unmarshal(data, cfg); err != nil {
return nil, fmt.Errorf("failed to parse config file: %w", err)
}
if err := cfg.Validate(); err != nil {
return nil, fmt.Errorf("invalid config: %w", err)
}
return cfg, nil
}
// defaultConfig returns a Config with default values.
func defaultConfig() *Config {
return &Config{
Log: LogConfig{
Level: "INFO",
},
Inputs: InputsConfig{
UnixSockets: make([]UnixSocketConfig, 0),
},
Outputs: OutputsConfig{
File: FileOutputConfig{
Enabled: true,
Path: "/var/log/logcorrelator/correlated.log",
},
ClickHouse: ClickHouseOutputConfig{
Enabled: false,
BatchSize: 500,
FlushIntervalMs: 200,
MaxBufferSize: 5000,
DropOnOverflow: true,
AsyncInsert: true,
TimeoutMs: 1000,
},
Stdout: StdoutOutputConfig{Enabled: false},
},
Correlation: CorrelationConfig{
TimeWindowS: 1,
EmitOrphans: true,
},
}
}
// Validate validates the configuration.
func (c *Config) Validate() error {
if len(c.Inputs.UnixSockets) < 2 {
return fmt.Errorf("at least two unix socket inputs are required")
}
seenNames := make(map[string]struct{}, len(c.Inputs.UnixSockets))
seenPaths := make(map[string]struct{}, len(c.Inputs.UnixSockets))
for i, input := range c.Inputs.UnixSockets {
if strings.TrimSpace(input.Name) == "" {
return fmt.Errorf("inputs.unix_sockets[%d].name is required", i)
}
if strings.TrimSpace(input.Path) == "" {
return fmt.Errorf("inputs.unix_sockets[%d].path is required", i)
}
if _, exists := seenNames[input.Name]; exists {
return fmt.Errorf("duplicate unix socket input name: %s", input.Name)
}
seenNames[input.Name] = struct{}{}
if _, exists := seenPaths[input.Path]; exists {
return fmt.Errorf("duplicate unix socket input path: %s", input.Path)
}
seenPaths[input.Path] = struct{}{}
}
// At least one output must be enabled
hasOutput := false
if c.Outputs.File.Enabled && c.Outputs.File.Path != "" {
hasOutput = true
}
if c.Outputs.ClickHouse.Enabled {
hasOutput = true
}
if c.Outputs.Stdout.Enabled {
hasOutput = true
}
if !hasOutput {
return fmt.Errorf("at least one output must be enabled (file, clickhouse, or stdout)")
}
if c.Outputs.ClickHouse.Enabled {
if strings.TrimSpace(c.Outputs.ClickHouse.DSN) == "" {
return fmt.Errorf("clickhouse DSN is required when enabled")
}
if strings.TrimSpace(c.Outputs.ClickHouse.Table) == "" {
return fmt.Errorf("clickhouse table is required when enabled")
}
if c.Outputs.ClickHouse.BatchSize <= 0 {
return fmt.Errorf("clickhouse batch_size must be > 0")
}
if c.Outputs.ClickHouse.MaxBufferSize <= 0 {
return fmt.Errorf("clickhouse max_buffer_size must be > 0")
}
if c.Outputs.ClickHouse.TimeoutMs <= 0 {
return fmt.Errorf("clickhouse timeout_ms must be > 0")
}
}
if c.Correlation.TimeWindowS <= 0 {
return fmt.Errorf("correlation.time_window_s must be > 0")
}
return nil
}
// GetTimeWindow returns the time window as a duration.
// Deprecated: Use TimeWindow.GetDuration() instead.
func (c *CorrelationConfig) GetTimeWindow() time.Duration {
// New config takes precedence
if c.TimeWindow.Value > 0 {
return c.TimeWindow.GetDuration()
}
// Fallback to deprecated field
value := c.TimeWindowS
if value <= 0 {
value = 1
}
return time.Duration(value) * time.Second
}
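// GetApacheAlwaysEmit returns whether to always emit Apache events.
// Note: the deprecated emit_orphans field defaults to true in defaultConfig, so
// setting apache_always_emit to false has no effect unless emit_orphans is also false.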
func (c *CorrelationConfig) GetApacheAlwaysEmit() bool {
if c.OrphanPolicy.ApacheAlwaysEmit {
return true
}
// Fallback to deprecated field
return c.EmitOrphans
}
// GetApacheEmitDelayMs returns the delay in milliseconds before emitting orphan A events.
func (c *CorrelationConfig) GetApacheEmitDelayMs() int {
if c.OrphanPolicy.ApacheEmitDelayMs > 0 {
return c.OrphanPolicy.ApacheEmitDelayMs
}
return domain.DefaultApacheEmitDelayMs // Default: 500ms
}
// GetMatchingMode returns the matching mode.
func (c *CorrelationConfig) GetMatchingMode() string {
if c.Matching.Mode != "" {
return c.Matching.Mode
}
return "one_to_many" // Default to Keep-Alive
}
// GetMaxHTTPBufferSize returns the max HTTP buffer size.
func (c *CorrelationConfig) GetMaxHTTPBufferSize() int {
if c.Buffers.MaxHTTPItems > 0 {
return c.Buffers.MaxHTTPItems
}
return domain.DefaultMaxHTTPBufferSize
}
// GetMaxNetworkBufferSize returns the max network buffer size.
func (c *CorrelationConfig) GetMaxNetworkBufferSize() int {
if c.Buffers.MaxNetworkItems > 0 {
return c.Buffers.MaxNetworkItems
}
return domain.DefaultMaxNetworkBufferSize
}
// GetNetworkTTLS returns the network TTL in seconds.
func (c *CorrelationConfig) GetNetworkTTLS() int {
if c.TTL.NetworkTTLS > 0 {
return c.TTL.NetworkTTLS
}
return domain.DefaultNetworkTTLS
}
// GetSocketPermissions returns the socket permissions as os.FileMode.
// Default is 0666 (world read/write).
func (c *UnixSocketConfig) GetSocketPermissions() os.FileMode {
trimmed := strings.TrimSpace(c.SocketPermissions)
if trimmed == "" {
return 0666
}
// Parse octal string (e.g., "0660", "660", "0666")
perms, err := strconv.ParseUint(trimmed, 8, 32)
if err != nil {
return 0666
}
return os.FileMode(perms)
}
// GetIncludeDestPorts returns the list of destination ports allowed for correlation.
// An empty list means all ports are allowed.
func (c *CorrelationConfig) GetIncludeDestPorts() []int {
return c.IncludeDestPorts
}
// GetExcludeSourceIPs returns the list of excluded source IPs or CIDR ranges.
func (c *CorrelationConfig) GetExcludeSourceIPs() []string {
return c.ExcludeSourceIPs
}
// IsSourceIPExcluded checks if a source IP should be excluded.
// Supports both exact IP matches and CIDR ranges.
func (c *CorrelationConfig) IsSourceIPExcluded(ip string) bool {
if len(c.ExcludeSourceIPs) == 0 {
return false
}
// Parse the IP once
parsedIP := net.ParseIP(ip)
if parsedIP == nil {
return false // Invalid IP
}
for _, exclude := range c.ExcludeSourceIPs {
// Try CIDR first
if strings.Contains(exclude, "/") {
_, cidr, err := net.ParseCIDR(exclude)
if err != nil {
continue // Invalid CIDR, skip
}
if cidr.Contains(parsedIP) {
return true
}
} else {
// Exact IP match
if exclude == ip {
return true
}
// Also compare parsed forms (handles equivalent spellings such as ::1 vs 0:0:0:0:0:0:0:1;
// IPv4 octets with leading zeros are rejected by net.ParseIP since Go 1.17)
if excludeIP := net.ParseIP(exclude); excludeIP != nil {
if excludeIP.Equal(parsedIP) {
return true
}
}
}
}
return false
}
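
The matching rules of `IsSourceIPExcluded` can be tried outside the config package. The sketch below re-implements them as a free function under the assumption that entries containing `/` are CIDR ranges and all others are single addresses; `isExcluded` is a hypothetical name, and only standard-library calls are used.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// isExcluded reports whether ip matches any rule. Rules containing "/" are
// parsed as CIDR ranges; all other rules are compared as parsed addresses.
// Invalid rules and invalid input addresses never match.
func isExcluded(ip string, rules []string) bool {
	parsed := net.ParseIP(ip)
	if parsed == nil {
		return false
	}
	for _, rule := range rules {
		if strings.Contains(rule, "/") {
			if _, cidr, err := net.ParseCIDR(rule); err == nil && cidr.Contains(parsed) {
				return true
			}
			continue
		}
		if other := net.ParseIP(rule); other != nil && other.Equal(parsed) {
			return true
		}
	}
	return false
}

func main() {
	rules := []string{"10.0.0.0/8", "192.168.1.1", "0:0:0:0:0:0:0:1"}
	fmt.Println(isExcluded("10.1.2.3", rules))   // true (inside the CIDR range)
	fmt.Println(isExcluded("::1", rules))        // true (same address, different spelling)
	fmt.Println(isExcluded("172.16.0.1", rules)) // false
	fmt.Println(isExcluded("not-an-ip", rules))  // false
}
```

Note that the original method re-parses every CIDR rule on each call; if the exclusion list is long and the method sits on a hot path, parsing the rules once at load time may be worth considering.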

File diff suppressed because it is too large


@ -0,0 +1,151 @@
package domain
import (
"encoding/json"
"reflect"
"time"
)
// CorrelatedLog represents the output correlated log entry.
// All fields are flattened into a single-level structure.
type CorrelatedLog struct {
Timestamp time.Time `json:"timestamp"`
SrcIP string `json:"src_ip"`
SrcPort int `json:"src_port"`
DstIP string `json:"dst_ip,omitempty"`
DstPort int `json:"dst_port,omitempty"`
Correlated bool `json:"correlated"`
OrphanSide string `json:"orphan_side,omitempty"`
Fields map[string]any `json:"-"` // Additional fields, merged at marshal time
}
// MarshalJSON implements custom JSON marshaling to flatten the structure.
func (c CorrelatedLog) MarshalJSON() ([]byte, error) {
// Create a flat map with all fields
flat := make(map[string]any)
// Add core fields
flat["timestamp"] = c.Timestamp
flat["src_ip"] = c.SrcIP
flat["src_port"] = c.SrcPort
if c.DstIP != "" {
flat["dst_ip"] = c.DstIP
}
if c.DstPort != 0 {
flat["dst_port"] = c.DstPort
}
flat["correlated"] = c.Correlated
if c.OrphanSide != "" {
flat["orphan_side"] = c.OrphanSide
}
// Merge additional fields while preserving reserved keys
reservedKeys := map[string]struct{}{
"timestamp": {},
"src_ip": {},
"src_port": {},
"dst_ip": {},
"dst_port": {},
"correlated": {},
"orphan_side": {},
}
for k, v := range c.Fields {
if _, reserved := reservedKeys[k]; reserved {
continue
}
flat[k] = v
}
return json.Marshal(flat)
}
// NewCorrelatedLogFromEvent creates a correlated log from a single event (orphan).
func NewCorrelatedLogFromEvent(event *NormalizedEvent, orphanSide string) CorrelatedLog {
fields := extractFields(event)
if event.KeepAliveSeq > 0 {
fields["keepalives"] = event.KeepAliveSeq
}
return CorrelatedLog{
Timestamp: event.Timestamp,
SrcIP: event.SrcIP,
SrcPort: event.SrcPort,
DstIP: event.DstIP,
DstPort: event.DstPort,
Correlated: false,
OrphanSide: orphanSide,
Fields: fields,
}
}
// NewCorrelatedLog creates a correlated log from two matched events.
func NewCorrelatedLog(apacheEvent, networkEvent *NormalizedEvent) CorrelatedLog {
ts := apacheEvent.Timestamp
if networkEvent.Timestamp.After(ts) {
ts = networkEvent.Timestamp
}
fields := mergeFields(apacheEvent, networkEvent)
if apacheEvent.KeepAliveSeq > 0 {
fields["keepalives"] = apacheEvent.KeepAliveSeq
}
return CorrelatedLog{
Timestamp: ts,
SrcIP: apacheEvent.SrcIP,
SrcPort: apacheEvent.SrcPort,
DstIP: coalesceString(apacheEvent.DstIP, networkEvent.DstIP),
DstPort: coalesceInt(apacheEvent.DstPort, networkEvent.DstPort),
Correlated: true,
OrphanSide: "",
Fields: fields,
}
}
func extractFields(e *NormalizedEvent) map[string]any {
result := make(map[string]any)
for k, v := range e.Raw {
result[k] = v
}
return result
}
func mergeFields(a, b *NormalizedEvent) map[string]any {
result := make(map[string]any)
// Start with A fields
for k, v := range a.Raw {
result[k] = v
}
// Merge B fields with collision handling
for k, v := range b.Raw {
if existing, exists := result[k]; exists {
if reflect.DeepEqual(existing, v) {
continue
}
// Collision with different values: keep both with prefixes
delete(result, k)
result["a_"+k] = existing
result["b_"+k] = v
continue
}
result[k] = v
}
return result
}
func coalesceString(a, b string) string {
if a != "" {
return a
}
return b
}
func coalesceInt(a, b int) int {
if a != 0 {
return a
}
return b
}
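
To see what the flattened output of `MarshalJSON` looks like, here is a reduced sketch with a hypothetical `record` type that keeps three core fields. Unlike the original, it treats the keys already present in the flat map as the reserved set, which is only equivalent when no optional field is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record is a hypothetical, reduced stand-in for CorrelatedLog.
type record struct {
	SrcIP      string
	SrcPort    int
	Correlated bool
	Fields     map[string]any
}

// MarshalJSON emits core fields and extra fields at the same level.
// Extra fields never overwrite a core field.
func (r record) MarshalJSON() ([]byte, error) {
	flat := map[string]any{
		"src_ip":     r.SrcIP,
		"src_port":   r.SrcPort,
		"correlated": r.Correlated,
	}
	for k, v := range r.Fields {
		if _, reserved := flat[k]; reserved {
			continue
		}
		flat[k] = v
	}
	return json.Marshal(flat)
}

// render returns the JSON encoding of r as a string.
func render(r record) string {
	out, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	r := record{
		SrcIP:      "1.2.3.4",
		SrcPort:    1234,
		Correlated: true,
		Fields:     map[string]any{"src_ip": "spoofed", "method": "GET"},
	}
	// encoding/json sorts map keys, so the output order is stable.
	fmt.Println(render(r))
	// {"correlated":true,"method":"GET","src_ip":"1.2.3.4","src_port":1234}
}
```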


@ -0,0 +1,365 @@
package domain
import (
"encoding/json"
"testing"
"time"
)
func TestNormalizedEvent_CorrelationKey(t *testing.T) {
tests := []struct {
name string
event *NormalizedEvent
expected string
}{
{
name: "basic key",
event: &NormalizedEvent{
SrcIP: "192.168.1.1",
SrcPort: 8080,
},
expected: "192.168.1.1:8080",
},
{
name: "different port",
event: &NormalizedEvent{
SrcIP: "10.0.0.1",
SrcPort: 443,
},
expected: "10.0.0.1:443",
},
{
name: "port zero",
event: &NormalizedEvent{
SrcIP: "127.0.0.1",
SrcPort: 0,
},
expected: "127.0.0.1:0",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
key := tt.event.CorrelationKey()
if key != tt.expected {
t.Errorf("expected %s, got %s", tt.expected, key)
}
})
}
}
func TestNewCorrelatedLogFromEvent(t *testing.T) {
event := &NormalizedEvent{
Source: SourceA,
Timestamp: time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC),
SrcIP: "192.168.1.1",
SrcPort: 8080,
DstIP: "10.0.0.1",
DstPort: 80,
Raw: map[string]any{
"method": "GET",
"path": "/api/test",
},
}
log := NewCorrelatedLogFromEvent(event, "A")
if log.Correlated {
t.Error("expected correlated to be false")
}
if log.OrphanSide != "A" {
t.Errorf("expected orphan_side A, got %s", log.OrphanSide)
}
if log.SrcIP != "192.168.1.1" {
t.Errorf("expected src_ip 192.168.1.1, got %s", log.SrcIP)
}
if log.Fields == nil {
t.Error("expected fields to be non-nil")
}
}
func TestNewCorrelatedLog(t *testing.T) {
apacheEvent := &NormalizedEvent{
Source: SourceA,
Timestamp: time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC),
SrcIP: "192.168.1.1",
SrcPort: 8080,
DstIP: "10.0.0.1",
DstPort: 80,
Raw: map[string]any{"method": "GET"},
}
networkEvent := &NormalizedEvent{
Source: SourceB,
Timestamp: time.Date(2024, 1, 1, 12, 0, 0, 500000000, time.UTC),
SrcIP: "192.168.1.1",
SrcPort: 8080,
DstIP: "10.0.0.1",
DstPort: 80,
Raw: map[string]any{"ja3": "abc123"},
}
log := NewCorrelatedLog(apacheEvent, networkEvent)
if !log.Correlated {
t.Error("expected correlated to be true")
}
if log.OrphanSide != "" {
t.Errorf("expected orphan_side to be empty, got %s", log.OrphanSide)
}
if log.Fields == nil {
t.Error("expected fields to be non-nil")
}
}
// TestNewCorrelatedLog_TimestampSelectionAEarlier verifies that when A is earlier, B's (later) timestamp is used.
func TestNewCorrelatedLog_TimestampSelectionAEarlier(t *testing.T) {
tsA := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
tsB := time.Date(2024, 1, 1, 12, 0, 1, 0, time.UTC) // B is later
a := &NormalizedEvent{Source: SourceA, Timestamp: tsA, SrcIP: "1.1.1.1", SrcPort: 100, Raw: map[string]any{}}
b := &NormalizedEvent{Source: SourceB, Timestamp: tsB, SrcIP: "1.1.1.1", SrcPort: 100, Raw: map[string]any{}}
log := NewCorrelatedLog(a, b)
if !log.Timestamp.Equal(tsB) {
t.Errorf("expected timestamp to be B's (later), got %v", log.Timestamp)
}
}
// TestNewCorrelatedLog_TimestampSelectionBEarlier verifies that when B is earlier, A's timestamp is used.
func TestNewCorrelatedLog_TimestampSelectionBEarlier(t *testing.T) {
tsA := time.Date(2024, 1, 1, 12, 0, 1, 0, time.UTC) // A is later
tsB := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
a := &NormalizedEvent{Source: SourceA, Timestamp: tsA, SrcIP: "1.1.1.1", SrcPort: 100, Raw: map[string]any{}}
b := &NormalizedEvent{Source: SourceB, Timestamp: tsB, SrcIP: "1.1.1.1", SrcPort: 100, Raw: map[string]any{}}
log := NewCorrelatedLog(a, b)
// The later timestamp wins. Since B is not After A, ts stays as A's timestamp.
if !log.Timestamp.Equal(tsA) {
t.Errorf("expected timestamp to be A's (later), got %v", log.Timestamp)
}
}
// TestNewCorrelatedLog_TimestampEqual verifies equal timestamps yield A's timestamp.
func TestNewCorrelatedLog_TimestampEqual(t *testing.T) {
ts := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
a := &NormalizedEvent{Source: SourceA, Timestamp: ts, SrcIP: "1.1.1.1", SrcPort: 100, Raw: map[string]any{}}
b := &NormalizedEvent{Source: SourceB, Timestamp: ts, SrcIP: "1.1.1.1", SrcPort: 100, Raw: map[string]any{}}
log := NewCorrelatedLog(a, b)
if !log.Timestamp.Equal(ts) {
t.Errorf("expected timestamp to be equal to both events' timestamp, got %v", log.Timestamp)
}
}
// TestNewCorrelatedLogFromEvent_WithKeepAlive verifies keepalives field is added when KeepAliveSeq > 0.
func TestNewCorrelatedLogFromEvent_WithKeepAlive(t *testing.T) {
event := &NormalizedEvent{
Source: SourceA,
Timestamp: time.Now(),
SrcIP: "1.1.1.1",
SrcPort: 9999,
KeepAliveSeq: 3,
Raw: map[string]any{"method": "GET"},
}
log := NewCorrelatedLogFromEvent(event, "A")
if log.Fields["keepalives"] != 3 {
t.Errorf("expected keepalives=3, got %v", log.Fields["keepalives"])
}
}
// TestNewCorrelatedLogFromEvent_NoKeepAlive verifies keepalives field is absent when KeepAliveSeq == 0.
func TestNewCorrelatedLogFromEvent_NoKeepAlive(t *testing.T) {
event := &NormalizedEvent{
Source: SourceA,
Timestamp: time.Now(),
SrcIP: "1.1.1.1",
SrcPort: 9999,
KeepAliveSeq: 0,
Raw: map[string]any{"method": "GET"},
}
log := NewCorrelatedLogFromEvent(event, "A")
if _, ok := log.Fields["keepalives"]; ok {
t.Error("keepalives field should not be present when KeepAliveSeq == 0")
}
}
// TestMergeFields_NoCollision verifies fields from A and B are merged without conflict.
func TestMergeFields_NoCollision(t *testing.T) {
a := &NormalizedEvent{Raw: map[string]any{"method": "GET", "path": "/foo"}}
b := &NormalizedEvent{Raw: map[string]any{"ja4": "abc123", "proto": "TLS"}}
fields := mergeFields(a, b)
if fields["method"] != "GET" {
t.Errorf("expected method=GET, got %v", fields["method"])
}
if fields["ja4"] != "abc123" {
t.Errorf("expected ja4=abc123, got %v", fields["ja4"])
}
}
// TestMergeFields_SameValueNoPrefix verifies same-value fields are not prefixed.
func TestMergeFields_SameValueNoPrefix(t *testing.T) {
a := &NormalizedEvent{Raw: map[string]any{"proto": "TCP"}}
b := &NormalizedEvent{Raw: map[string]any{"proto": "TCP"}}
fields := mergeFields(a, b)
if fields["proto"] != "TCP" {
t.Errorf("expected proto=TCP (no prefix), got %v", fields["proto"])
}
if _, ok := fields["a_proto"]; ok {
t.Error("a_proto should not exist for same-value collision")
}
if _, ok := fields["b_proto"]; ok {
t.Error("b_proto should not exist for same-value collision")
}
}
// TestMergeFields_DifferentValuePrefix verifies different-value fields get a_/b_ prefix.
func TestMergeFields_DifferentValuePrefix(t *testing.T) {
a := &NormalizedEvent{Raw: map[string]any{"port": 80}}
b := &NormalizedEvent{Raw: map[string]any{"port": 443}}
fields := mergeFields(a, b)
if fields["a_port"] != 80 {
t.Errorf("expected a_port=80, got %v", fields["a_port"])
}
if fields["b_port"] != 443 {
t.Errorf("expected b_port=443, got %v", fields["b_port"])
}
if _, ok := fields["port"]; ok {
t.Error("original 'port' key should be removed on collision")
}
}
// TestCoalesceString_EmptyA tests that when a is empty, b is returned.
func TestCoalesceString_EmptyA(t *testing.T) {
result := coalesceString("", "fallback")
if result != "fallback" {
t.Errorf("expected 'fallback', got %q", result)
}
}
// TestCoalesceString_NonEmptyA tests that when a is non-empty, a is returned.
func TestCoalesceString_NonEmptyA(t *testing.T) {
result := coalesceString("primary", "fallback")
if result != "primary" {
t.Errorf("expected 'primary', got %q", result)
}
}
// TestCoalesceInt_ZeroA tests that when a is zero, b is returned.
func TestCoalesceInt_ZeroA(t *testing.T) {
result := coalesceInt(0, 443)
if result != 443 {
t.Errorf("expected 443, got %d", result)
}
}
// TestCoalesceInt_NonZeroA tests that when a is non-zero, a is returned.
func TestCoalesceInt_NonZeroA(t *testing.T) {
result := coalesceInt(80, 443)
if result != 80 {
t.Errorf("expected 80, got %d", result)
}
}
// TestMarshalJSON_ReservedKeyProtection verifies reserved keys in Fields are not overwritten.
func TestMarshalJSON_ReservedKeyProtection(t *testing.T) {
log := CorrelatedLog{
Timestamp: time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC),
SrcIP: "1.2.3.4",
SrcPort: 1234,
Correlated: true,
Fields: map[string]any{
"src_ip": "EVIL_OVERRIDE", // should be ignored
"correlated": false, // should be ignored
"extra": "value",
},
}
data, err := json.Marshal(log)
if err != nil {
t.Fatalf("MarshalJSON failed: %v", err)
}
var flat map[string]any
if err := json.Unmarshal(data, &flat); err != nil {
t.Fatalf("Unmarshal failed: %v", err)
}
if flat["src_ip"] != "1.2.3.4" {
t.Errorf("reserved key src_ip should not be overwritten, got %v", flat["src_ip"])
}
if flat["correlated"] != true {
t.Errorf("reserved key correlated should not be overwritten, got %v", flat["correlated"])
}
if flat["extra"] != "value" {
t.Errorf("non-reserved key extra should be present, got %v", flat["extra"])
}
}
// TestMarshalJSON_OptionalFieldsOmittedWhenZero verifies DstIP/DstPort are omitted when zero.
func TestMarshalJSON_OptionalFieldsOmittedWhenZero(t *testing.T) {
log := CorrelatedLog{
Timestamp: time.Now(),
SrcIP: "1.2.3.4",
SrcPort: 1234,
Correlated: false,
}
data, err := json.Marshal(log)
if err != nil {
t.Fatalf("MarshalJSON failed: %v", err)
}
var flat map[string]any
if err := json.Unmarshal(data, &flat); err != nil {
t.Fatalf("Unmarshal failed: %v", err)
}
if _, ok := flat["dst_ip"]; ok {
t.Error("dst_ip should be omitted when empty")
}
if _, ok := flat["dst_port"]; ok {
t.Error("dst_port should be omitted when zero")
}
if _, ok := flat["orphan_side"]; ok {
t.Error("orphan_side should be omitted when empty")
}
}
// TestExtractFields_Basic verifies extractFields copies Raw fields.
func TestExtractFields_Basic(t *testing.T) {
e := &NormalizedEvent{
Raw: map[string]any{"key1": "val1", "key2": 42},
}
fields := extractFields(e)
if fields["key1"] != "val1" {
t.Errorf("expected key1=val1, got %v", fields["key1"])
}
if fields["key2"] != 42 {
t.Errorf("expected key2=42, got %v", fields["key2"])
}
}
// TestNewCorrelatedLog_KeepAliveSeq verifies keepalives is set from apache event.
func TestNewCorrelatedLog_KeepAliveSeq(t *testing.T) {
a := &NormalizedEvent{
Source: SourceA, Timestamp: time.Now(), SrcIP: "1.1.1.1", SrcPort: 100,
KeepAliveSeq: 5,
Raw: map[string]any{},
}
b := &NormalizedEvent{
Source: SourceB, Timestamp: time.Now(), SrcIP: "1.1.1.1", SrcPort: 100,
Raw: map[string]any{},
}
log := NewCorrelatedLog(a, b)
if log.Fields["keepalives"] != 5 {
t.Errorf("expected keepalives=5, got %v", log.Fields["keepalives"])
}
}
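
The reserved-key tests above imply a marshaller that flattens `Fields` into the top-level JSON object without letting them shadow the struct's own keys. The following is a minimal sketch of one way to get that behavior; `flatLog` and `marshalFlat` are hypothetical names on a trimmed-down stand-in type, not the correlator's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatLog is a hypothetical, reduced stand-in for CorrelatedLog.
type flatLog struct {
	SrcIP      string
	SrcPort    int
	Correlated bool
	Fields     map[string]any
}

// marshalFlat copies Fields first and writes the reserved keys last,
// so a reserved key present in Fields can never replace the struct value.
func marshalFlat(l flatLog) ([]byte, error) {
	out := make(map[string]any, len(l.Fields)+3)
	for k, v := range l.Fields {
		out[k] = v
	}
	out["src_ip"] = l.SrcIP
	out["src_port"] = l.SrcPort
	out["correlated"] = l.Correlated
	return json.Marshal(out)
}

func main() {
	data, err := marshalFlat(flatLog{
		SrcIP: "1.2.3.4", SrcPort: 1234, Correlated: true,
		Fields: map[string]any{"src_ip": "EVIL_OVERRIDE", "extra": "value"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

Writing the reserved keys after the loop keeps the protection independent of map iteration order.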

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -0,0 +1,33 @@
package domain
import (
"strconv"
"time"
)
// EventSource identifies the source of an event.
type EventSource string
const (
SourceA EventSource = "A" // Apache/HTTP source
SourceB EventSource = "B" // Network source
)
// NormalizedEvent represents a unified internal event from either source.
type NormalizedEvent struct {
Source EventSource
Timestamp time.Time
SrcIP string
SrcPort int
DstIP string
DstPort int
Headers map[string]string
Extra map[string]any
Raw map[string]any // Original raw data
KeepAliveSeq int // Request sequence number within the Keep-Alive connection (1-based)
}
// CorrelationKey returns the key used for correlation (src_ip + src_port).
func (e *NormalizedEvent) CorrelationKey() string {
return e.SrcIP + ":" + strconv.Itoa(e.SrcPort)
}
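
As an illustration of how the correlation key might be used, the sketch below groups events from both sources by source IP and source port. The `event` type and `groupByKey` helper are hypothetical and only mirror the relevant fields of `NormalizedEvent`.

```go
package main

import (
	"fmt"
	"strconv"
)

// event is a hypothetical, reduced form of NormalizedEvent.
type event struct {
	Source  string
	SrcIP   string
	SrcPort int
}

// key mirrors CorrelationKey: source IP and source port joined by a colon.
func key(e event) string {
	return e.SrcIP + ":" + strconv.Itoa(e.SrcPort)
}

// groupByKey buckets events so that A and B events from the same
// client connection end up in the same slice.
func groupByKey(events []event) map[string][]event {
	groups := make(map[string][]event)
	for _, e := range events {
		k := key(e)
		groups[k] = append(groups[k], e)
	}
	return groups
}

func main() {
	groups := groupByKey([]event{
		{Source: "A", SrcIP: "1.1.1.1", SrcPort: 100},
		{Source: "B", SrcIP: "1.1.1.1", SrcPort: 100},
		{Source: "A", SrcIP: "2.2.2.2", SrcPort: 200},
	})
	fmt.Println(len(groups["1.1.1.1:100"]), len(groups["2.2.2.2:200"]))
}
```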

@ -0,0 +1,25 @@
// Package observability provides structured logging for the correlator service.
// Implementation is delegated to shared/go/ja4common/logger to avoid duplication.
package observability
import jalogger "github.com/antitbone/ja4/ja4common/logger"
// Type aliases — all existing correlator code compiles unchanged.
type Logger = jalogger.Logger
type LogLevel = jalogger.LogLevel
const (
DEBUG LogLevel = jalogger.DEBUG
INFO LogLevel = jalogger.INFO
WARN LogLevel = jalogger.WARN
ERROR LogLevel = jalogger.ERROR
)
// NewLogger creates a new Logger with INFO level.
func NewLogger(prefix string) *Logger { return jalogger.New(prefix) }
// NewLoggerWithLevel creates a new Logger with the specified minimum level.
func NewLoggerWithLevel(prefix, level string) *Logger { return jalogger.NewWithLevel(prefix, level) }
// ParseLogLevel converts a string to LogLevel.
func ParseLogLevel(level string) LogLevel { return jalogger.ParseLogLevel(level) }
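
The wrapper above relies on Go type aliases (`type Logger = jalogger.Logger`) rather than defined types. The sketch below, using a hypothetical `baseLogger` in place of the shared logger, shows why that choice lets existing callers compile unchanged: an alias keeps the method set, while a defined type does not.

```go
package main

import "fmt"

// baseLogger is a hypothetical stand-in for the shared logger type.
type baseLogger struct{ prefix string }

func (l *baseLogger) Tag() string { return "[" + l.prefix + "]" }

// Alias: Logger and baseLogger are the same type, so methods carry over.
type Logger = baseLogger

// Defined type: same layout, but a distinct type without the Tag method.
type Wrapped baseLogger

func newLogger(prefix string) *Logger { return &baseLogger{prefix: prefix} }

func main() {
	l := newLogger("correlator")
	fmt.Println(l.Tag())

	// A *Logger is assignable to *baseLogger without conversion.
	var b *baseLogger = l
	fmt.Println(b == l)

	// Wrapped needs an explicit conversion and has no Tag method.
	w := Wrapped(*l)
	fmt.Println(w.prefix)
}
```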

@ -0,0 +1,296 @@
// Package observability tests — behavioral tests for the Logger type alias.
// Since Logger = jalogger.Logger, we test the observable API only.
package observability_test
import (
"testing"
"github.com/antitbone/ja4/correlator/internal/observability"
)
func TestNewLogger_NonNil(t *testing.T) {
logger := observability.NewLogger("test")
if logger == nil {
t.Fatal("expected non-nil logger")
}
}
func TestLogger_DefaultLevel_IsInfo(t *testing.T) {
logger := observability.NewLogger("test")
if !logger.ShouldLog(observability.INFO) {
t.Error("INFO should be enabled by default")
}
if logger.ShouldLog(observability.DEBUG) {
t.Error("DEBUG should be disabled by default")
}
}
func TestLogger_Info_NoPanic(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "INFO")
if !logger.ShouldLog(observability.INFO) {
t.Error("INFO should be enabled")
}
logger.Info("test message")
}
func TestLogger_Error_NoPanic(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "ERROR")
if !logger.ShouldLog(observability.ERROR) {
t.Error("ERROR should be enabled")
}
logger.Error("error message", nil)
}
func TestLogger_Debug_NoPanic(t *testing.T) {
logger := observability.NewLogger("test")
logger.SetLevel("DEBUG")
if !logger.ShouldLog(observability.DEBUG) {
t.Error("DEBUG should be enabled after SetLevel(DEBUG)")
}
logger.Debug("test message")
}
func TestLogger_SetLevel(t *testing.T) {
logger := observability.NewLogger("test")
logger.SetLevel("DEBUG")
if !logger.ShouldLog(observability.DEBUG) {
t.Error("DEBUG should be enabled after SetLevel(DEBUG)")
}
logger.SetLevel("INFO")
if logger.ShouldLog(observability.DEBUG) {
t.Error("DEBUG should be disabled after SetLevel(INFO)")
}
logger.SetLevel("WARN")
if logger.ShouldLog(observability.INFO) {
t.Error("INFO should be disabled after SetLevel(WARN)")
}
if !logger.ShouldLog(observability.WARN) {
t.Error("WARN should be enabled after SetLevel(WARN)")
}
logger.SetLevel("ERROR")
if logger.ShouldLog(observability.WARN) {
t.Error("WARN should be disabled after SetLevel(ERROR)")
}
if !logger.ShouldLog(observability.ERROR) {
t.Error("ERROR should be enabled after SetLevel(ERROR)")
}
}
func TestParseLogLevel(t *testing.T) {
cases := []struct {
input string
expected observability.LogLevel
}{
{"DEBUG", observability.DEBUG},
{"debug", observability.DEBUG},
{"INFO", observability.INFO},
{"info", observability.INFO},
{"WARN", observability.WARN},
{"warn", observability.WARN},
{"WARNING", observability.WARN},
{"ERROR", observability.ERROR},
{"error", observability.ERROR},
{"", observability.INFO},
{"invalid", observability.INFO},
}
for _, tt := range cases {
t.Run(tt.input, func(t *testing.T) {
result := observability.ParseLogLevel(tt.input)
if result != tt.expected {
t.Errorf("ParseLogLevel(%q) = %v, want %v", tt.input, result, tt.expected)
}
})
}
}
func TestLogger_WithFields_NoPanic(t *testing.T) {
logger := observability.NewLogger("test")
child := logger.WithFields(map[string]any{"key1": "value1", "key2": 42})
if child == logger {
t.Error("expected different logger instance")
}
child.Info("message with fields")
}
func TestLogLevel_String(t *testing.T) {
cases := []struct {
level observability.LogLevel
expected string
}{
{observability.DEBUG, "DEBUG"},
{observability.INFO, "INFO"},
{observability.WARN, "WARN"},
{observability.ERROR, "ERROR"},
}
for _, tt := range cases {
t.Run(tt.expected, func(t *testing.T) {
if got := tt.level.String(); got != tt.expected {
t.Errorf("LogLevel(%d).String() = %q, want %q", tt.level, got, tt.expected)
}
})
}
}
func TestLogger_Warn_NoPanic(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "WARN")
if !logger.ShouldLog(observability.WARN) {
t.Error("WARN should be enabled")
}
logger.Warn("warning message")
}
func TestLogger_Formatted_NoPanic(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "DEBUG")
logger.Warnf("formatted %s %d", "message", 42)
logger.Infof("formatted %s %d", "message", 42)
logger.Debugf("formatted %s %d", "message", 42)
}
func TestLogger_Error_WithError(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "ERROR")
logger.Error("error occurred", &testErr{"test error"})
}
func TestLogger_ShouldLog_Concurrent(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "DEBUG")
done := make(chan bool)
for i := 0; i < 10; i++ {
go func() {
_ = logger.ShouldLog(observability.DEBUG)
done <- true
}()
}
for i := 0; i < 10; i++ {
<-done
}
}
func TestLogger_Log_Concurrent(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "DEBUG")
done := make(chan bool)
for i := 0; i < 10; i++ {
go func(n int) {
logger.Debugf("message %d", n)
done <- true
}(i)
}
for i := 0; i < 10; i++ {
<-done
}
}
func TestLogger_WithFields_Concurrent(t *testing.T) {
logger := observability.NewLogger("test")
done := make(chan bool)
for i := 0; i < 10; i++ {
go func(n int) {
_ = logger.WithFields(map[string]any{"key": n})
done <- true
}(i)
}
for i := 0; i < 10; i++ {
<-done
}
}
func TestLogger_SetLevel_Concurrent(t *testing.T) {
logger := observability.NewLogger("test")
done := make(chan bool)
for i := 0; i < 10; i++ {
go func() {
logger.SetLevel("DEBUG")
logger.SetLevel("INFO")
done <- true
}()
}
for i := 0; i < 10; i++ {
<-done
}
}
type testErr struct{ msg string }
func (e *testErr) Error() string { return e.msg }
func TestNewLoggerWithLevel_AllLevels(t *testing.T) {
levels := []string{"DEBUG", "INFO", "WARN", "WARNING", "ERROR", "invalid", ""}
for _, level := range levels {
t.Run(level, func(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", level)
if logger == nil {
t.Errorf("NewLoggerWithLevel(%q) returned nil", level)
}
})
}
}
func TestLogLevel_Constants(t *testing.T) {
if observability.DEBUG >= observability.INFO {
t.Error("DEBUG should be less than INFO")
}
if observability.INFO >= observability.WARN {
t.Error("INFO should be less than WARN")
}
if observability.WARN >= observability.ERROR {
t.Error("WARN should be less than ERROR")
}
}
func TestLogger_ShouldLog_AllLevels(t *testing.T) {
cases := []struct {
minLevel string
level observability.LogLevel
want bool
}{
{"DEBUG", observability.DEBUG, true},
{"DEBUG", observability.INFO, true},
{"DEBUG", observability.WARN, true},
{"DEBUG", observability.ERROR, true},
{"INFO", observability.DEBUG, false},
{"INFO", observability.INFO, true},
{"INFO", observability.WARN, true},
{"WARN", observability.INFO, false},
{"WARN", observability.WARN, true},
{"WARN", observability.ERROR, true},
{"ERROR", observability.WARN, false},
{"ERROR", observability.ERROR, true},
}
for _, tc := range cases {
t.Run(tc.minLevel+"_"+tc.level.String(), func(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", tc.minLevel)
got := logger.ShouldLog(tc.level)
if got != tc.want {
t.Errorf("ShouldLog(%v) with min=%s: expected %v, got %v",
tc.level, tc.minLevel, tc.want, got)
}
})
}
}
func TestParseLogLevel_WarningAlias(t *testing.T) {
got := observability.ParseLogLevel("WARNING")
if got != observability.WARN {
t.Errorf("ParseLogLevel(WARNING) = %v, want WARN", got)
}
}
func TestLogger_Errorf_NoPanic(t *testing.T) {
logger := observability.NewLoggerWithLevel("test", "DEBUG")
// The logger exposes no Errorf; Warnf, Infof and Debugf are covered by
// TestLogger_Formatted_NoPanic. This test only checks that Error with a
// non-nil error value does not panic.
logger.Error("formatted error", &testErr{"err detail"})
}
func TestNewLogger_PrefixIsUsed(t *testing.T) {
logger := observability.NewLogger("my-prefix")
if logger == nil {
t.Fatal("expected non-nil logger")
}
// The logger should be usable
logger.Infof("hello from %s", "my-prefix")
}
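
The tables in `TestParseLogLevel` and `TestLogger_ShouldLog_AllLevels` pin down the expected behavior fairly precisely. The sketch below is a hypothetical re-implementation that satisfies those tables; the shared logger's real code is not part of this diff and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

type level int

const (
	levelDebug level = iota
	levelInfo
	levelWarn
	levelError
)

// parseLevel is case-insensitive, treats WARNING as an alias for WARN,
// and falls back to INFO for empty or unrecognised input.
func parseLevel(s string) level {
	switch strings.ToUpper(s) {
	case "DEBUG":
		return levelDebug
	case "WARN", "WARNING":
		return levelWarn
	case "ERROR":
		return levelError
	default:
		return levelInfo
	}
}

// shouldLog reports whether a message at msg passes a logger whose
// minimum level is threshold.
func shouldLog(threshold, msg level) bool { return msg >= threshold }

func main() {
	fmt.Println(parseLevel("warning") == levelWarn)
	fmt.Println(shouldLog(parseLevel("WARN"), levelInfo))
}
```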

@ -0,0 +1,176 @@
package observability
import (
"encoding/json"
"fmt"
"strings"
"sync"
"sync/atomic"
)
// CorrelationMetrics tracks correlation statistics for debugging and monitoring.
type CorrelationMetrics struct {
mu sync.RWMutex
// Events received
eventsReceivedA atomic.Int64
eventsReceivedB atomic.Int64
// Correlation results
correlationsSuccess atomic.Int64
correlationsFailed atomic.Int64
// Failure reasons
failedNoMatchKey atomic.Int64 // No event with same key in buffer
failedTimeWindow atomic.Int64 // Key found but outside time window
failedBufferEviction atomic.Int64 // Event evicted due to buffer full
failedTTLExpired atomic.Int64 // B event TTL expired before match
failedIPExcluded atomic.Int64 // Event excluded by IP filter
// Buffer stats
bufferASize atomic.Int64
bufferBSize atomic.Int64
// Orphan stats
orphansEmittedA atomic.Int64
orphansEmittedB atomic.Int64
orphansPendingA atomic.Int64
pendingOrphanMatch atomic.Int64 // B matched with pending orphan A
// Keep-Alive stats
keepAliveResets atomic.Int64 // Number of TTL resets (one-to-many mode)
}
// NewCorrelationMetrics creates a new metrics tracker.
func NewCorrelationMetrics() *CorrelationMetrics {
return &CorrelationMetrics{}
}
// RecordEventReceived records an event received from a source ("A" or "B").
// Unknown sources are ignored.
func (m *CorrelationMetrics) RecordEventReceived(source string) {
switch source {
case "A":
m.eventsReceivedA.Add(1)
case "B":
m.eventsReceivedB.Add(1)
}
}
// RecordCorrelationSuccess records a successful correlation.
func (m *CorrelationMetrics) RecordCorrelationSuccess() {
m.correlationsSuccess.Add(1)
}
// RecordCorrelationFailed records a failed correlation attempt with the reason.
func (m *CorrelationMetrics) RecordCorrelationFailed(reason string) {
m.correlationsFailed.Add(1)
switch reason {
case "no_match_key":
m.failedNoMatchKey.Add(1)
case "time_window":
m.failedTimeWindow.Add(1)
case "buffer_eviction":
m.failedBufferEviction.Add(1)
case "ttl_expired":
m.failedTTLExpired.Add(1)
case "ip_excluded":
m.failedIPExcluded.Add(1)
}
}
// RecordBufferEviction is currently a no-op hook for per-source eviction tracking.
// The failed_buffer_eviction counter is incremented by
// RecordCorrelationFailed("buffer_eviction"), not here.
func (m *CorrelationMetrics) RecordBufferEviction(source string) {
}
// RecordOrphanEmitted records an orphan event emitted for a source ("A" or "B").
// Unknown sources are ignored.
func (m *CorrelationMetrics) RecordOrphanEmitted(source string) {
switch source {
case "A":
m.orphansEmittedA.Add(1)
case "B":
m.orphansEmittedB.Add(1)
}
}
// RecordPendingOrphan records an A event added to pending orphans.
func (m *CorrelationMetrics) RecordPendingOrphan() {
m.orphansPendingA.Add(1)
}
// RecordPendingOrphanMatch records a B event matching a pending orphan A.
func (m *CorrelationMetrics) RecordPendingOrphanMatch() {
m.pendingOrphanMatch.Add(1)
}
// RecordKeepAliveReset records a TTL reset for Keep-Alive.
func (m *CorrelationMetrics) RecordKeepAliveReset() {
m.keepAliveResets.Add(1)
}
// UpdateBufferSizes updates the current buffer sizes.
func (m *CorrelationMetrics) UpdateBufferSizes(sizeA, sizeB int64) {
m.bufferASize.Store(sizeA)
m.bufferBSize.Store(sizeB)
}
// Snapshot returns a point-in-time snapshot of all metrics.
func (m *CorrelationMetrics) Snapshot() MetricsSnapshot {
return MetricsSnapshot{
EventsReceivedA: m.eventsReceivedA.Load(),
EventsReceivedB: m.eventsReceivedB.Load(),
CorrelationsSuccess: m.correlationsSuccess.Load(),
CorrelationsFailed: m.correlationsFailed.Load(),
FailedNoMatchKey: m.failedNoMatchKey.Load(),
FailedTimeWindow: m.failedTimeWindow.Load(),
FailedBufferEviction: m.failedBufferEviction.Load(),
FailedTTLExpired: m.failedTTLExpired.Load(),
FailedIPExcluded: m.failedIPExcluded.Load(),
BufferASize: m.bufferASize.Load(),
BufferBSize: m.bufferBSize.Load(),
OrphansEmittedA: m.orphansEmittedA.Load(),
OrphansEmittedB: m.orphansEmittedB.Load(),
OrphansPendingA: m.orphansPendingA.Load(),
PendingOrphanMatch: m.pendingOrphanMatch.Load(),
KeepAliveResets: m.keepAliveResets.Load(),
}
}
// MetricsSnapshot is a point-in-time snapshot of metrics.
type MetricsSnapshot struct {
EventsReceivedA int64 `json:"events_received_a"`
EventsReceivedB int64 `json:"events_received_b"`
CorrelationsSuccess int64 `json:"correlations_success"`
CorrelationsFailed int64 `json:"correlations_failed"`
FailedNoMatchKey int64 `json:"failed_no_match_key"`
FailedTimeWindow int64 `json:"failed_time_window"`
FailedBufferEviction int64 `json:"failed_buffer_eviction"`
FailedTTLExpired int64 `json:"failed_ttl_expired"`
FailedIPExcluded int64 `json:"failed_ip_excluded"`
BufferASize int64 `json:"buffer_a_size"`
BufferBSize int64 `json:"buffer_b_size"`
OrphansEmittedA int64 `json:"orphans_emitted_a"`
OrphansEmittedB int64 `json:"orphans_emitted_b"`
OrphansPendingA int64 `json:"orphans_pending_a"`
PendingOrphanMatch int64 `json:"pending_orphan_match"`
KeepAliveResets int64 `json:"keepalive_resets"`
}
// MarshalJSON implements json.Marshaler.
func (m *CorrelationMetrics) MarshalJSON() ([]byte, error) {
return json.Marshal(m.Snapshot())
}
// String returns a human-readable string of metrics.
func (m *CorrelationMetrics) String() string {
s := m.Snapshot()
var b strings.Builder
b.WriteString("Correlation Metrics:\n")
fmt.Fprintf(&b, " Events Received: A=%d B=%d Total=%d\n", s.EventsReceivedA, s.EventsReceivedB, s.EventsReceivedA+s.EventsReceivedB)
fmt.Fprintf(&b, " Correlations: Success=%d Failed=%d\n", s.CorrelationsSuccess, s.CorrelationsFailed)
fmt.Fprintf(&b, " Failure Reasons: no_match_key=%d time_window=%d buffer_eviction=%d ttl_expired=%d ip_excluded=%d\n",
s.FailedNoMatchKey, s.FailedTimeWindow, s.FailedBufferEviction, s.FailedTTLExpired, s.FailedIPExcluded)
fmt.Fprintf(&b, " Buffer Sizes: A=%d B=%d\n", s.BufferASize, s.BufferBSize)
fmt.Fprintf(&b, " Orphans: Emitted A=%d B=%d Pending A=%d\n", s.OrphansEmittedA, s.OrphansEmittedB, s.OrphansPendingA)
fmt.Fprintf(&b, " Pending Orphan Match: %d\n", s.PendingOrphanMatch)
fmt.Fprintf(&b, " Keep-Alive Resets: %d\n", s.KeepAliveResets)
return b.String()
}
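
The metrics type above keeps each counter in its own `atomic.Int64`, so writers never block each other. The sketch below, with hypothetical `counters` and `view` types, shows the pattern under concurrent updates. Each field is loaded independently, so a snapshot taken while writers are still running is not guaranteed to be consistent across fields.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counters is a hypothetical two-field version of CorrelationMetrics.
type counters struct {
	success atomic.Int64
	failed  atomic.Int64
}

// view is a plain-value copy that is safe to pass around or marshal.
type view struct {
	Success int64
	Failed  int64
}

// Snapshot loads every counter once; no lock is needed.
func (c *counters) Snapshot() view {
	return view{Success: c.success.Load(), Failed: c.failed.Load()}
}

// record increments the counters from n goroutines and returns the
// snapshot taken after all of them have finished.
func record(c *counters, n int) view {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%2 == 0 {
				c.success.Add(1)
			} else {
				c.failed.Add(1)
			}
		}(i)
	}
	wg.Wait()
	return c.Snapshot()
}

func main() {
	var c counters
	fmt.Printf("%+v\n", record(&c, 100))
}
```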

@ -0,0 +1,128 @@
package observability
import (
"context"
"encoding/json"
"fmt"
"net"
"net/http"
"sync"
"time"
)
// MetricsServer exposes correlation metrics via HTTP.
type MetricsServer struct {
mu sync.Mutex
server *http.Server
listener net.Listener
metricsFunc func() MetricsSnapshot
running bool
}
// NewMetricsServer creates a new metrics HTTP server.
func NewMetricsServer(addr string, metricsFunc func() MetricsSnapshot) (*MetricsServer, error) {
if metricsFunc == nil {
return nil, fmt.Errorf("metricsFunc cannot be nil")
}
ms := &MetricsServer{
metricsFunc: metricsFunc,
}
mux := http.NewServeMux()
mux.HandleFunc("/metrics", ms.handleMetrics)
mux.HandleFunc("/health", ms.handleHealth)
ms.server = &http.Server{
Addr: addr,
Handler: mux,
ReadTimeout: 5 * time.Second,
WriteTimeout: 10 * time.Second,
}
return ms, nil
}
// Start begins listening on the configured address.
func (ms *MetricsServer) Start() error {
ms.mu.Lock()
defer ms.mu.Unlock()
if ms.running {
return nil
}
listener, err := net.Listen("tcp", ms.server.Addr)
if err != nil {
return fmt.Errorf("failed to start metrics server: %w", err)
}
ms.listener = listener
ms.running = true
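go func() {
// Serve blocks until Shutdown or a listener failure. http.ErrServerClosed
// is the expected result of Stop; any other error is dropped because
// MetricsServer has no logger to report it to.
_ = ms.server.Serve(listener)
}()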
return nil
}
// Stop gracefully stops the metrics server.
func (ms *MetricsServer) Stop(ctx context.Context) error {
ms.mu.Lock()
defer ms.mu.Unlock()
if !ms.running {
return nil
}
ms.running = false
return ms.server.Shutdown(ctx)
}
// handleMetrics returns the correlation metrics as JSON.
func (ms *MetricsServer) handleMetrics(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
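metrics := ms.metricsFunc()
// Marshal before touching the response so that an encoding failure can
// still be reported with a 500 status.
data, err := json.Marshal(metrics)
if err != nil {
http.Error(w, "Failed to encode metrics", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write(append(data, '\n'))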
}
// handleHealth returns a simple health check response.
func (ms *MetricsServer) handleHealth(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
fmt.Fprint(w, `{"status":"healthy"}`)
}
// IsRunning returns true if the server is running.
func (ms *MetricsServer) IsRunning() bool {
ms.mu.Lock()
defer ms.mu.Unlock()
return ms.running
}
// Addr returns the listening address.
func (ms *MetricsServer) Addr() string {
ms.mu.Lock()
defer ms.mu.Unlock()
if ms.listener == nil {
return ""
}
return ms.listener.Addr().String()
}
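
Handlers like the ones above can be exercised without opening a port by using `net/http/httptest`. The sketch below works on a hypothetical copy of the health handler (`healthHandler`) and a small `probe` helper; neither is part of the package shown in this diff.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// healthHandler is a hypothetical copy of the GET-only health endpoint.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	fmt.Fprint(w, `{"status":"healthy"}`)
}

// probe sends one in-memory request to the handler and returns the
// status code and response body.
func probe(method string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, "/health", nil)
	healthHandler(rec, req)
	body, _ := io.ReadAll(rec.Result().Body)
	return rec.Code, string(body)
}

func main() {
	code, body := probe("GET")
	fmt.Println(code, body)
}
```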

@ -0,0 +1,57 @@
package ports
import (
"context"
"github.com/antitbone/ja4/correlator/internal/domain"
)
// EventSource defines the interface for log sources.
type EventSource interface {
// Start begins reading events and sending them to the channel.
// Returns an error if the source cannot be started.
Start(ctx context.Context, eventChan chan<- *domain.NormalizedEvent) error
// Stop gracefully stops the source.
Stop() error
// Name returns the source name.
Name() string
}
// CorrelatedLogSink defines the interface for correlated log destinations.
type CorrelatedLogSink interface {
// Write sends a correlated log to the sink.
Write(ctx context.Context, log domain.CorrelatedLog) error
// Flush flushes any buffered logs.
Flush(ctx context.Context) error
// Close closes the sink.
Close() error
// Name returns the sink name.
Name() string
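// Reopen closes and reopens the sink (for log rotation on SIGHUP).
// Because it is part of this interface, every sink must provide the method;
// per the original note, only FileSink is expected to do real work in it.
Reopen() error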
}
// CorrelationProcessor defines the interface for the correlation service.
// This allows for easier testing and alternative implementations.
type CorrelationProcessor interface {
// ProcessEvent processes an incoming event and returns correlated logs.
ProcessEvent(event *domain.NormalizedEvent) []domain.CorrelatedLog
// Flush forces emission of remaining buffered events.
Flush() []domain.CorrelatedLog
// EmitPendingOrphans emits orphan A events whose delay has expired.
// Called periodically by the Orchestrator ticker so orphans are not blocked
// waiting for the next incoming event.
EmitPendingOrphans() []domain.CorrelatedLog
// GetBufferSizes returns the current buffer sizes for monitoring.
GetBufferSizes() (int, int)
}
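
To illustrate why the sink is defined as an interface, the sketch below implements an in-memory sink suitable for tests. `correlatedLog`, `sink`, and `memorySink` are hypothetical stand-ins that repeat the method set of `CorrelatedLogSink` so the example compiles on its own.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// correlatedLog is a hypothetical stand-in for domain.CorrelatedLog.
type correlatedLog struct{ SrcIP string }

// sink repeats the method set of CorrelatedLogSink for the stand-in type.
type sink interface {
	Write(ctx context.Context, log correlatedLog) error
	Flush(ctx context.Context) error
	Close() error
	Name() string
	Reopen() error
}

// memorySink collects logs in a slice guarded by a mutex.
type memorySink struct {
	mu   sync.Mutex
	logs []correlatedLog
}

func (s *memorySink) Write(ctx context.Context, log correlatedLog) error {
	if err := ctx.Err(); err != nil {
		return err // honour cancellation before buffering
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.logs = append(s.logs, log)
	return nil
}

func (s *memorySink) Flush(context.Context) error { return nil }
func (s *memorySink) Close() error                { return nil }
func (s *memorySink) Name() string                { return "memory" }
func (s *memorySink) Reopen() error               { return nil } // nothing to reopen

// Len returns the number of buffered logs.
func (s *memorySink) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.logs)
}

// Compile-time check that memorySink satisfies the interface.
var _ sink = (*memorySink)(nil)

// writeN writes n logs to a fresh sink and reports how many were stored.
func writeN(n int) int {
	s := &memorySink{}
	for i := 0; i < n; i++ {
		if err := s.Write(context.Background(), correlatedLog{SrcIP: "1.2.3.4"}); err != nil {
			panic(err)
		}
	}
	return s.Len()
}

func main() { fmt.Println(writeN(3)) }
```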

@ -0,0 +1,34 @@
[Unit]
Description=logcorrelator service
After=network.target
[Service]
Type=simple
User=logcorrelator
Group=logcorrelator
ExecStart=/usr/bin/logcorrelator -config /etc/logcorrelator/logcorrelator.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
# Runtime directory: systemd creates /run/logcorrelator (= /var/run/logcorrelator)
# with the correct owner (logcorrelator:logcorrelator) on every start/restart,
# which prevents the sockets from ending up owned by root:root after a reboot (tmpfs wiped).
RuntimeDirectory=logcorrelator
RuntimeDirectoryMode=0755
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/logcorrelator /etc/logcorrelator
# Resource limits
LimitNOFILE=65536
# Systemd timeouts
TimeoutStartSec=10
TimeoutStopSec=30
[Install]
WantedBy=multi-user.target
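
The unit's `ExecReload=/bin/kill -HUP $MAINPID` only has an effect if the daemon handles SIGHUP. The sketch below shows one possible shape for that handling; `reloadLoop`, `simulate`, and the `reopen` callback are hypothetical, and the real service's signal wiring is not shown in this diff.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// reloadLoop calls reopen once per received signal until sigs is closed
// and returns how many reopen calls succeeded.
func reloadLoop(sigs <-chan os.Signal, reopen func() error) int {
	ok := 0
	for range sigs {
		if err := reopen(); err != nil {
			fmt.Fprintln(os.Stderr, "reopen failed:", err)
			continue
		}
		ok++
	}
	return ok
}

// simulate feeds n SIGHUP values through the loop by hand, standing in
// for n invocations of `systemctl reload`.
func simulate(n int) int {
	sigs := make(chan os.Signal, n)
	for i := 0; i < n; i++ {
		sigs <- syscall.SIGHUP
	}
	close(sigs)
	return reloadLoop(sigs, func() error { return nil })
}

func main() {
	// In a real daemon the channel would be fed by the kernel.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP)
	signal.Stop(sigs) // the demo does not wait for a real signal

	fmt.Println(simulate(2))
}
```

In a long-running service, `reopen` would typically call each sink's `Reopen` method so that log rotation picks up a fresh file handle.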

@ -0,0 +1,383 @@
# logcorrelator RPM spec file
# Compatible with CentOS 7, Rocky Linux 8, 9, 10
# Built with rpmbuild (not FPM)
Name: logcorrelator
Version: %{version}
Release: 1%{?dist}
Summary: Log correlation service for HTTP and network events
License: MIT
URL: https://github.com/logcorrelator/logcorrelator
Vendor: logcorrelator <dev@example.com>
Packager: logcorrelator <dev@example.com>
BuildArch: x86_64
# Dependencies
Requires: systemd
Requires(post): systemd
Requires(preun): systemd
Requires(postun): systemd
%description
logcorrelator is a system service written in Go that receives two JSON log streams
over Unix sockets, correlates application-level HTTP events with network
events, and writes correlated logs in real time to ClickHouse and/or a local file.
Security notes:
- The service runs as the logcorrelator user (non-root)
- Unix sockets are created with 0666 permissions (world read/write)
- Critical directories are protected: /var/log (750), /var/lib (750), /etc (750)
- /var/run/logcorrelator is 755 to allow socket creation
%prep
# Files are already in BUILD directory (copied by build-rpm.sh)
# No extraction needed
echo "Files available in BUILD directory:"
ls -la %{_builddir}/
%install
# Create directory structure in buildroot
mkdir -p %{buildroot}/usr/bin
mkdir -p %{buildroot}/etc/logcorrelator
mkdir -p %{buildroot}/var/log/logcorrelator
mkdir -p %{buildroot}/var/run/logcorrelator
mkdir -p %{buildroot}/var/lib/logcorrelator
mkdir -p %{buildroot}/etc/systemd/system
mkdir -p %{buildroot}/etc/logrotate.d
mkdir -p %{buildroot}/usr/lib/tmpfiles.d
# Install binary (from BUILD directory)
install -m 0755 %{_builddir}/usr/bin/logcorrelator %{buildroot}/usr/bin/logcorrelator
# Install config files
install -m 0640 %{_builddir}/etc/logcorrelator/logcorrelator.yml %{buildroot}/etc/logcorrelator/logcorrelator.yml
install -m 0640 %{_builddir}/etc/logcorrelator/logcorrelator.yml.example %{buildroot}/etc/logcorrelator/logcorrelator.yml.example
# Install systemd service
install -m 0644 %{_builddir}/etc/systemd/system/logcorrelator.service %{buildroot}/etc/systemd/system/logcorrelator.service
# Install logrotate config
install -m 0644 %{_builddir}/etc/logrotate.d/logcorrelator %{buildroot}/etc/logrotate.d/logcorrelator
%post
# Create logcorrelator user and group
if ! getent group logcorrelator >/dev/null 2>&1; then
groupadd --system logcorrelator
fi
if ! getent passwd logcorrelator >/dev/null 2>&1; then
useradd --system \
--gid logcorrelator \
--home-dir /var/lib/logcorrelator \
--no-create-home \
--shell /usr/sbin/nologin \
logcorrelator
fi
# Create directories
mkdir -p /var/lib/logcorrelator
mkdir -p /var/log/logcorrelator
# Note: /var/run/logcorrelator is managed by RuntimeDirectory= (systemd) and tmpfiles.d
# Set ownership
chown -R logcorrelator:logcorrelator /var/lib/logcorrelator
chown -R logcorrelator:logcorrelator /var/log/logcorrelator
chown -R logcorrelator:logcorrelator /etc/logcorrelator
# Set permissions
chmod 750 /var/lib/logcorrelator
chmod 750 /var/log/logcorrelator
chmod 750 /etc/logcorrelator
# Copy default config if not exists
if [ ! -f /etc/logcorrelator/logcorrelator.yml ]; then
cp /etc/logcorrelator/logcorrelator.yml.example /etc/logcorrelator/logcorrelator.yml
chown logcorrelator:logcorrelator /etc/logcorrelator/logcorrelator.yml
chmod 640 /etc/logcorrelator/logcorrelator.yml
fi
# Reload systemd and start service
if [ -x /bin/systemctl ]; then
systemctl daemon-reload
systemctl enable logcorrelator.service
systemctl start logcorrelator.service
fi
exit 0
%preun
if [ $1 -eq 0 ]; then
# Package removal, not upgrade
if [ -x /bin/systemctl ]; then
systemctl stop logcorrelator.service
systemctl disable logcorrelator.service
fi
fi
exit 0
%postun
if [ -x /bin/systemctl ]; then
systemctl daemon-reload
if [ $1 -ge 1 ]; then
# Package upgrade, restart service
systemctl try-restart logcorrelator.service
fi
fi
exit 0
%files
/usr/bin/logcorrelator
%config(noreplace) /etc/logcorrelator/logcorrelator.yml
/etc/logcorrelator/logcorrelator.yml.example
/var/log/logcorrelator
/var/lib/logcorrelator
/etc/systemd/system/logcorrelator.service
%config(noreplace) /etc/logrotate.d/logcorrelator
%changelog
* Wed Mar 11 2026 logcorrelator <dev@example.com> - 1.1.22-1
- Feat(outputs): file output enabled/disabled toggle
Added the enabled: true/false field to outputs.file in the configuration.
The file sink is created only if enabled: true AND path: is set.
Allows the file output to be disabled entirely while keeping stdout/clickhouse.
Tests: TestValidate_FileOutputDisabled, TestLoadConfig_FileOutputDisabled
- Fix(systemd): immediate stop without draining the queue
orchestrator.Stop() no longer drains the buffers (in-flight events are lost).
Removed ShutdownTimeout and the flush/wait logic.
systemd TimeoutStopSec=30 handles a forced stop if needed.
Simplification: cancel() + Close() only.
- Feat(sql): TTL and ZSTD compression on ClickHouse tables
http_logs_raw: 1-day TTL, ZSTD compression on raw_json
http_logs: 7-day TTL, ZSTD compression on large text fields
Setting ttl_only_drop_parts = 1 to optimize deletions
* Mon Mar 09 2026 logcorrelator <dev@example.com> - 1.1.21-1
- Update: ClickHouse views and SQL schema
Added bots.sql for bot identification (User-Agent parsing)
Added tables.sql for the reference tables
Updated mv1.sql (materialized view) with the new correlation structure
Expanded the views.md documentation with query examples and the full schema
* Mon Mar 09 2026 logcorrelator <dev@example.com> - 1.1.20-1
- Fix(rpm): removed the redundant systemd-tmpfiles.conf
RuntimeDirectory=logcorrelator in the systemd service already manages /run/logcorrelator
automatically. The systemd-tmpfiles --create command caused errors on
systems with an existing /var/lib/mysql (a file instead of a directory).
Removed /usr/lib/tmpfiles.d/logcorrelator.conf and the systemd-tmpfiles --create call.
* Mon Mar 09 2026 logcorrelator <dev@example.com> - 1.1.19-1
- Fix(systemd): immediate stop/restart without waiting for the queue to drain
Stopping the service no longer drains the buffers (in-flight events are lost).
systemd TimeoutStopSec=30 already handles a forced stop if needed.
Simplified orchestrator.Stop(): cancel() + Close() only.
Removed ShutdownTimeout, which is no longer needed.
* Mon Mar 09 2026 logcorrelator <dev@example.com> - 1.1.18-1
- Fix(outputs): file output enabled: false did not stop writes to the file
The Enabled field was missing from FileOutputConfig. The file sink was created
even with enabled: false as long as path was set. The condition now
explicitly checks enabled && path != "" in main.go and Validate().
Test: added TestValidate_FileOutputDisabled and TestLoadConfig_FileOutputDisabled.
* Fri Mar 06 2026 logcorrelator <dev@example.com> - 1.1.17-1
- Fix(correlation): keepalives field not populated in ClickHouse
The KeepAliveSeq field of NormalizedEvent was not copied into the Fields
of CorrelatedLog. The ClickHouse materialized view extracted keepalives from the JSON
but always found 0. NewCorrelatedLog and NewCorrelatedLogFromEvent now
explicitly add keepalives = KeepAliveSeq to the Fields.
* Fri Mar 06 2026 logcorrelator <dev@example.com> - 1.1.16-1
- Feat(correlation): emit A events filtered by include_dest_ports to ClickHouse
When an A (HTTP) event was excluded by the include_dest_ports filter, it was
silently ignored. Now, if ApacheAlwaysEmit=true, the event is emitted as
uncorrelated (orphan_side=A) so that it appears in ClickHouse. B events are still
ignored. Test: TestCorrelationService_IncludeDestPorts_FilteredPort updated +
TestCorrelationService_IncludeDestPorts_FilteredPort_NoAlwaysEmit added.
* Thu Mar 05 2026 logcorrelator <dev@example.com> - 1.1.15-1
- Fix(correlation/bug3): data loss when B expires with pending orphans
cleanNetworkBufferByTTL deleted the pendingOrphans without emitting them (silent loss).
Orphan A events are now returned to the caller immediately when B expires,
and cleanExpired/ProcessEvent propagate these results to the sink.
Test: TestBTTLExpiry_PurgesPendingOrphans extended to verify that they are actually emitted.
* Thu Mar 05 2026 logcorrelator <dev@example.com> - 1.1.14-1
- Fix(correlation/bug1): Keep-Alive sessions longer than TimeWindow no longer end up as orphans
The matcher in processSourceA used eventsMatch (timestamp comparison) in
one_to_many mode. After ~10s, B.Timestamp_original falls outside the TimeWindow and all subsequent
requests became orphans. New matcher bEventHasValidTTL: a B event is valid
as long as its TTL has not expired (the TTL is reset on every Keep-Alive correlation).
- Fix(correlation/bug4): checkPendingOrphansForCorrelation used eventsMatch (same bug)
In one_to_many mode, a B arriving with an old timestamp no longer matched the pending orphans
for the same key. Replaced with a key-only check (same key = same connection).
- Fix(correlation/bug3): pendingOrphans not purged when the B expires (cleanNetworkBufferByTTL)
When a B event expired (TTL), the associated pending orphan A events were stuck indefinitely.
They are now emitted immediately when the corresponding B expires.
- Fix(correlation/bug2): orphans emitted only when an event was received (no dedicated timer)
EmitPendingOrphans() is now a public, thread-safe method. The Orchestrator
starts a ticker goroutine (250ms) that calls EmitPendingOrphans() independently of the event flow,
guaranteeing emission even when no new events arrive.
- Feat(ports): added EmitPendingOrphans() to the CorrelationProcessor interface
- Test: 4 new regression tests (Bug #1, #2, #3, #4)
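
The bug1 entry for 1.1.14 contrasts two matchers: a timestamp-window comparison (`eventsMatch`) and a TTL check (`bEventHasValidTTL`). The sketch below illustrates the difference on a long Keep-Alive connection. The `bufferedB` type, the helper names, and the 10-second window and TTL values are assumptions for illustration; the correlator's real code is not part of this section.

```go
package main

import (
	"fmt"
	"time"
)

// bufferedB is a hypothetical buffered network (B) event.
type bufferedB struct {
	timestamp time.Time // original capture time, never changes
	expiresAt time.Time // pushed forward on every Keep-Alive correlation
}

// withinWindow is the timestamp-based check: it fails once the request
// is further than window away from the original B timestamp.
func withinWindow(aTime time.Time, b bufferedB, window time.Duration) bool {
	d := aTime.Sub(b.timestamp)
	if d < 0 {
		d = -d
	}
	return d <= window
}

// hasValidTTL is the TTL-based check: B stays usable until it expires.
func hasValidTTL(b bufferedB, now time.Time) bool {
	return now.Before(b.expiresAt)
}

// refreshTTL models the TTL reset performed on each correlation.
func refreshTTL(b *bufferedB, now time.Time, ttl time.Duration) {
	b.expiresAt = now.Add(ttl)
}

// lateRequest simulates a request arriving 30s after the handshake on a
// connection whose TTL was refreshed 5s earlier.
func lateRequest() (byWindow, byTTL bool) {
	start := time.Date(2026, 3, 5, 12, 0, 0, 0, time.UTC)
	b := bufferedB{timestamp: start, expiresAt: start.Add(10 * time.Second)}
	refreshTTL(&b, start.Add(25*time.Second), 10*time.Second)
	now := start.Add(30 * time.Second)
	return withinWindow(now, b, 10*time.Second), hasValidTTL(b, now)
}

func main() {
	byWindow, byTTL := lateRequest()
	fmt.Println(byWindow, byTTL)
}
```

With these numbers the window check rejects the late request while the TTL check still accepts it, which matches the orphaning behavior the entry describes.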
* Thu Mar 05 2026 logcorrelator <dev@example.com> - 1.1.13-1
- Fix: Unix sockets no longer become root:root when the service restarts
- Fix: Added RuntimeDirectory=logcorrelator to the systemd service (systemd manages /run/logcorrelator with the correct owner on every start/restart)
- Fix: Added /usr/lib/tmpfiles.d/logcorrelator.conf to recreate /run/logcorrelator at boot
- Chore: Removed /var/run/logcorrelator from the RPM %files (managed by tmpfiles.d)
- Fix(correlation): emitPendingOrphans - slice corruption when several orphans for the same key expire simultaneously (slice aliasing bug, duplicate emissions)
- Fix(correlation): rotateOldestA - the rotated event was silently lost even with ApacheAlwaysEmit=true (now returns the CorrelatedLog)
- Fix(correlation): Keep-Alive broken in the pending-orphan-then-B path - the B event was not buffered in one_to_many mode, blocking correlation of requests A2+ on the same Keep-Alive connection
- Chore(correlation): removed the dead field timer *time.Timer from pendingOrphan
- Feat(correlation): added keepalive_seq to orphan logs to ease debugging (request number within the Keep-Alive connection, 1-based)
- Test: 4 new regression tests for the correlation bugs
* Thu Mar 05 2026 logcorrelator <dev@example.com> - 1.1.12-1
- Feat: New config directive include_dest_ports - restrict correlation to specific destination ports
- Feat: If include_dest_ports is non-empty, events on unlisted ports are silently ignored (not correlated, not emitted as orphan)
- Feat: New metric failed_dest_port_filtered for monitoring filtered traffic
- Feat: Debug log for filtered events: "event excluded by dest port filter: source=A dst_port=22"
- Test: New unit tests for include_dest_ports (allowed port, filtered port, empty=all)
- Docs: README.md updated with include_dest_ports section and current version references
- Docs: architecture.yml updated with include_dest_ports
- Fix: config.example.yml - removed obsolete stdout.level field
* Thu Mar 05 2026 logcorrelator <dev@example.com> - 1.1.11-1
- Fix: StdoutSink no longer writes correlated/orphan JSON to stdout
- Fix: stdout sink is now a no-op for data; operational logs go to stderr via logger
- Fix: ClickHouse sink had no logger - all flush errors were silently discarded
- Fix: Periodic, batch and final-close flush errors are now logged at ERROR level
- Fix: Buffer overflow with DropOnOverflow=true is now logged at WARN level
- Fix: Retry attempts are now logged at WARN level with attempt number, delay and error
- Feat: ClickHouse connection success logged at INFO (table, batch_size, flush_interval_ms)
- Feat: Successful batch sends logged at DEBUG (rows count, table)
- Feat: SetLogger() method added to ClickHouseSink for external logger injection
- Test: New unit tests for StdoutSink asserting stdout remains empty for all log types
* Wed Mar 04 2026 logcorrelator <dev@example.com> - 1.1.10-1
- Feat: IP exclusion filter - exclude specific source IPs or CIDR ranges
- Feat: Configuration exclude_source_ips supports single IPs and CIDR notation
- Feat: Debug logging for excluded IPs
- Feat: New metric failed_ip_excluded for monitoring filtered traffic
- Feat: Architecture documentation updated with observability section
- Use cases: exclude health checks, internal traffic, known bad actors
- Docs: README.md updated with IP exclusion documentation
- Docs: architecture.yml updated with metrics and troubleshooting guide
* Wed Mar 04 2026 logcorrelator <dev@example.com> - 1.1.9-1
- Feat: Debug logging - detailed DEBUG logs for correlation troubleshooting
- Feat: Correlation metrics server (HTTP endpoint /metrics and /health)
- Feat: New metrics: events_received, correlations_success/failed, failure reasons
- Feat: Failure reason tracking: no_match_key, time_window, buffer_eviction, ttl_expired
- Feat: Buffer size monitoring (buffer_a_size, buffer_b_size)
- Feat: Orphan tracking (orphans_emitted, orphans_pending, pending_orphan_match)
- Feat: Keep-Alive reset counter for connection tracking
- Feat: Test scripts added (test-correlation.sh, test-correlation-advanced.py)
- Change: Config example updated with metrics section
- Docs: README.md updated with debugging guide and troubleshooting table
* Tue Mar 03 2026 logcorrelator <dev@example.com> - 1.1.8-1
- Migrated from FPM to rpmbuild (native RPM build)
- Reduced build image size by 200MB (-40%)
- Removed FPM gem dependency (use rpmbuild directly)
- Scripts post/preun/postun now inline in spec file
- Build image: rockylinux:8 instead of ruby:3.2-bookworm
* Tue Mar 03 2026 logcorrelator <dev@example.com> - 1.1.7-1
- Fix: Critical Keep-Alive bug - network events evicted based on original timestamp instead of reset TTL
- Fix: Correlation time window increased from 1s to 10s for HTTP Keep-Alive support
- Fix: Network source now uses payload timestamp if available (fallback to reception time)
- Change: Default network TTL increased from 30s to 120s for long Keep-Alive sessions
- Test: Added comprehensive Keep-Alive tests (TTL reset, long session scenarios)
* Tue Mar 03 2026 logcorrelator <dev@example.com> - 1.1.6-1
- Docs: Update ClickHouse schema documentation (http_logs_raw + http_logs tables)
- Fix: ClickHouse insertion uses single raw_json column (FORMAT JSONEachRow)
- Fix: ClickHouse native API (clickhouse-go/v2 PrepareBatch + Append + Send)
* Tue Mar 03 2026 logcorrelator <dev@example.com> - 1.1.5-1
- Fix: ClickHouse insertion using native clickhouse-go/v2 API (PrepareBatch + Append + Send)
- Fix: Replaced database/sql wrapper with clickhouse.Open() and clickhouse.Conn
- Fix: Proper batch sending to avoid ATTEMPT_TO_READ_AFTER_EOF errors
- Fix: Set correct permissions (755) on /var/run/logcorrelator in RPM post-install
* Mon Mar 02 2026 logcorrelator <dev@example.com> - 1.1.4-1
- Fix: Log raw JSON data on parse errors for debugging
* Mon Mar 02 2026 logcorrelator <dev@example.com> - 1.1.3-1
- Refactor: Switch Unix sockets from STREAM to DGRAM mode (SOCK_DGRAM)
- Test: Comprehensive tests added - coverage improved to 74.4%
- Fix: Example config file installed to /etc/logcorrelator/logcorrelator.yml.example
- Change: Default socket permissions from 0660 to 0666 (world read/write)
* Mon Mar 02 2026 logcorrelator <dev@example.com> - 1.1.2-1
- Fix: Example config file installed to /etc/logcorrelator/logcorrelator.yml.example
- Change: Default socket permissions from 0660 to 0666 (world read/write)
* Mon Mar 02 2026 logcorrelator <dev@example.com> - 1.1.1-1
- Fix: Move logcorrelator.yml.example from /usr/share/logcorrelator/ to /etc/logcorrelator/
* Mon Mar 02 2026 logcorrelator <dev@example.com> - 1.1.0-1
- Feat: Keep-Alive support (one-to-many correlation mode)
- Feat: Dynamic TTL for network events (source B)
- Feat: Separate buffer sizes for HTTP and network events
- Feat: SIGHUP signal handling for log rotation
- Feat: File sink Reopen() method for log rotation
- Feat: logrotate configuration included
- Feat: ExecReload added to systemd service
- Feat: New YAML config structure (time_window, orphan_policy, matching, buffers, ttl)
- Docs: Updated architecture.yml and config.example.yml
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.7-1
- Added: Log levels DEBUG, INFO, WARN, ERROR configurable via log.level
- Added: Warn and Warnf methods for warning messages
- Added: Debug logs for events received from sockets and correlations
- Added: Warning logs for orphan events and buffer overflow
- Changed: Configuration log.enabled replaced by log.level
- Changed: Orphan events and buffer overflow now logged as WARN instead of DEBUG
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.6-1
- Changed: Configuration YAML simplified, removed service.name, service.language
- Changed: Correlation config simplified, time_window_s instead of nested object
- Changed: Orphan policy simplified to emit_orphans boolean
- Changed: Apache socket renamed to http.socket
- Added: socket_permissions option on unix sockets
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.5-1
- Added: Systemd service auto-start after RPM installation
- Added: Systemd service hardening (TimeoutStartSec, TimeoutStopSec, ReadWritePaths)
- Fixed: Systemd service unit correct config path (.yml instead of .conf)
- Fixed: CI workflow branch name main to master
- Changed: RPM packaging generic el8/el9/el10 directory naming
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.4-1
- Breaking: Flattened JSON output structure - removed apache and network subdivisions
- All log fields now merged into single-level JSON structure
- ClickHouse schema: replaced apache JSON and network JSON columns with fields JSON column
- Custom MarshalJSON() implementation for flat output
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.3-1
- Fix: Added missing ClickHouse driver dependency
- Fix: Fixed race condition in orchestrator
- Security: Added explicit source_type configuration for Unix socket sources
- Added: Comprehensive test suite improvements
- Added: Test coverage improved from 50.6% to 62.0%
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.2-1
- Added: Initial RPM packaging support for Rocky Linux 8/9 and AlmaLinux 10
- Added: Docker multi-stage build pipeline
- Added: Hexagonal architecture implementation
- Added: Unix socket input sources (JSON line protocol)
- Added: File output sink (JSON lines)
- Added: ClickHouse output sink with batching and retry logic
- Added: Time-window based correlation on src_ip + src_port
- Added: Graceful shutdown with signal handling (SIGINT, SIGTERM)
* Sat Feb 28 2026 logcorrelator <dev@example.com> - 1.0.1-1
- Initial package for CentOS 7, Rocky Linux 8, 9, 10
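
The two filters described in 1.1.10 (exclude_source_ips) and 1.1.12 (include_dest_ports) can be illustrated with a small sketch. It is written in Python for brevity and is not the service's Go code. The helper names `parse_exclusions` and `should_process` are hypothetical, and the order of the two checks is an assumption. Only the config keys, the metric names, and the "empty list means all ports" rule come from the changelog.

```python
import ipaddress
from typing import Iterable, List, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_exclusions(entries: Iterable[str]) -> List[Network]:
    """Parse single IPs and CIDR ranges into network objects.

    A bare IP such as "10.0.0.5" becomes a single-address network.
    """
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def should_process(
    src_ip: str,
    dst_port: int,
    excluded: List[Network],
    include_dest_ports: List[int],
) -> Tuple[bool, str]:
    """Return (keep, metric_name). Hypothetical helper, not the real API."""
    addr = ipaddress.ip_address(src_ip)
    # Assumption: the source IP exclusion is evaluated before the port filter.
    if any(addr in net for net in excluded):
        return False, "failed_ip_excluded"
    # An empty include_dest_ports list means every port is accepted.
    if include_dest_ports and dst_port not in include_dest_ports:
        return False, "failed_dest_port_filtered"
    return True, ""
```

With `excluded = parse_exclusions(["10.0.0.5", "192.168.0.0/16"])`, an event from 192.168.3.4 is dropped regardless of port, while an event from another address on port 22 is dropped only when `include_dest_ports` is non-empty and does not list 22.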

@ -0,0 +1,13 @@
/var/log/logcorrelator/correlated.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
create 0640 logcorrelator logcorrelator
sharedscripts
postrotate
/bin/systemctl reload logcorrelator > /dev/null 2>&1 || true
endscript
}
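
The `create` directive together with the `postrotate` reload only works if the service reopens its output file after rotation, which the changelog describes as SIGHUP handling plus a file sink Reopen() method. The sketch below shows that pattern in Python rather than Go. The class `ReopenableFile` and the function `install_sighup_handler` are hypothetical names, and the sketch assumes a single writer thread.

```python
import os
import signal


class ReopenableFile:
    """Append-only log file that reopens its path after a rotation request."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._fh = open(path, "a", encoding="utf-8")
        self._reopen_requested = False

    def request_reopen(self, signum=None, frame=None) -> None:
        """Safe to call from a signal handler: it only sets a flag."""
        self._reopen_requested = True

    def write_line(self, line: str) -> None:
        # Reopen lazily, just before the next write, so the handle is never
        # swapped in the middle of a write.
        if self._reopen_requested:
            self._reopen_requested = False
            self._fh.close()
            self._fh = open(self._path, "a", encoding="utf-8")
        self._fh.write(line + "\n")
        self._fh.flush()

    def close(self) -> None:
        self._fh.close()


def install_sighup_handler(sink: ReopenableFile) -> None:
    """Request a reopen on SIGHUP (must be called from the main thread)."""
    signal.signal(signal.SIGHUP, sink.request_reopen)
```

Until the reopen is requested, writes keep going to the renamed file because the open handle follows the inode, not the path. A consequence of the lazy design is that the old file stays open until the next write.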

@ -0,0 +1,258 @@
#!/bin/bash
# Test script for logcorrelator RPM package
# Verifies installation, permissions, and service status
#
# Usage: ./packaging/test/test-rpm.sh [el8|el9|el10]
#
# This script tests the RPM package in a Docker container to ensure:
# - Installation succeeds
# - File permissions are correct
# - Service starts properly
# - Sockets are created with correct ownership
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$(dirname "$SCRIPT_DIR")")"
RPM_DIR="${PROJECT_ROOT}/dist/rpm"
# Default to el8 if no argument provided
DISTRO="${1:-el8}"
echo "========================================="
echo "Testing logcorrelator RPM for ${DISTRO}"
echo "========================================="
# Find the RPM file
case "${DISTRO}" in
el8|rocky8)
RPM_PATH="${RPM_DIR}/el8"
BASE_IMAGE="rockylinux:8"
;;
el9|rocky9)
RPM_PATH="${RPM_DIR}/el9"
BASE_IMAGE="rockylinux:9"
;;
el10|alma10)
RPM_PATH="${RPM_DIR}/el10"
BASE_IMAGE="almalinux:10"
;;
*)
echo "Unknown distribution: ${DISTRO}"
echo "Valid options: el8, el9, el10"
exit 1
;;
esac
# Find the latest RPM file
RPM_FILE=$(ls -t "${RPM_PATH}"/logcorrelator-*.rpm 2>/dev/null | head -n 1)
if [ -z "${RPM_FILE}" ]; then
echo "ERROR: No RPM file found in ${RPM_PATH}"
echo "Please run 'make package-rpm' first"
exit 1
fi
echo "Testing RPM: ${RPM_FILE}"
echo "Base image: ${BASE_IMAGE}"
echo ""
# Create test script
TEST_SCRIPT=$(cat <<'EOF'
#!/bin/bash
set -e
echo "=== Installing logcorrelator RPM ==="
rpm -ivh /tmp/logcorrelator.rpm
echo ""
echo "=== Checking user and group ==="
if ! getent group logcorrelator >/dev/null; then
echo "FAIL: logcorrelator group not created"
exit 1
fi
echo "OK: logcorrelator group exists"
if ! getent passwd logcorrelator >/dev/null; then
echo "FAIL: logcorrelator user not created"
exit 1
fi
echo "OK: logcorrelator user exists"
echo ""
echo "=== Checking directory permissions ==="
# Check /var/run/logcorrelator
DIR="/var/run/logcorrelator"
if [ ! -d "$DIR" ]; then
echo "FAIL: $DIR does not exist"
exit 1
fi
OWNER=$(stat -c '%U:%G' "$DIR")
PERMS=$(stat -c '%a' "$DIR")
if [ "$OWNER" != "logcorrelator:logcorrelator" ]; then
echo "FAIL: $DIR owner is $OWNER (expected logcorrelator:logcorrelator)"
exit 1
fi
if [ "$PERMS" != "755" ]; then
echo "FAIL: $DIR permissions are $PERMS (expected 755)"
exit 1
fi
echo "OK: $DIR - owner=$OWNER, permissions=$PERMS"
# Check /var/log/logcorrelator
DIR="/var/log/logcorrelator"
if [ ! -d "$DIR" ]; then
echo "FAIL: $DIR does not exist"
exit 1
fi
OWNER=$(stat -c '%U:%G' "$DIR")
PERMS=$(stat -c '%a' "$DIR")
if [ "$OWNER" != "logcorrelator:logcorrelator" ]; then
echo "FAIL: $DIR owner is $OWNER (expected logcorrelator:logcorrelator)"
exit 1
fi
if [ "$PERMS" != "750" ]; then
echo "FAIL: $DIR permissions are $PERMS (expected 750)"
exit 1
fi
echo "OK: $DIR - owner=$OWNER, permissions=$PERMS"
# Check /var/lib/logcorrelator
DIR="/var/lib/logcorrelator"
if [ ! -d "$DIR" ]; then
echo "FAIL: $DIR does not exist"
exit 1
fi
OWNER=$(stat -c '%U:%G' "$DIR")
PERMS=$(stat -c '%a' "$DIR")
if [ "$OWNER" != "logcorrelator:logcorrelator" ]; then
echo "FAIL: $DIR owner is $OWNER (expected logcorrelator:logcorrelator)"
exit 1
fi
if [ "$PERMS" != "750" ]; then
echo "FAIL: $DIR permissions are $PERMS (expected 750)"
exit 1
fi
echo "OK: $DIR - owner=$OWNER, permissions=$PERMS"
echo ""
echo "=== Checking config files ==="
# Check config file exists and has correct permissions
CONFIG="/etc/logcorrelator/logcorrelator.yml"
if [ ! -f "$CONFIG" ]; then
echo "FAIL: $CONFIG does not exist"
exit 1
fi
OWNER=$(stat -c '%U:%G' "$CONFIG")
PERMS=$(stat -c '%a' "$CONFIG")
if [ "$OWNER" != "logcorrelator:logcorrelator" ]; then
echo "FAIL: $CONFIG owner is $OWNER (expected logcorrelator:logcorrelator)"
exit 1
fi
if [ "$PERMS" != "640" ]; then
echo "FAIL: $CONFIG permissions are $PERMS (expected 640)"
exit 1
fi
echo "OK: $CONFIG - owner=$OWNER, permissions=$PERMS"
# Check example config file
EXAMPLE_CONFIG="/etc/logcorrelator/logcorrelator.yml.example"
if [ ! -f "$EXAMPLE_CONFIG" ]; then
echo "FAIL: $EXAMPLE_CONFIG does not exist"
exit 1
fi
OWNER=$(stat -c '%U:%G' "$EXAMPLE_CONFIG")
PERMS=$(stat -c '%a' "$EXAMPLE_CONFIG")
if [ "$OWNER" != "logcorrelator:logcorrelator" ]; then
echo "FAIL: $EXAMPLE_CONFIG owner is $OWNER (expected logcorrelator:logcorrelator)"
exit 1
fi
if [ "$PERMS" != "640" ]; then
echo "FAIL: $EXAMPLE_CONFIG permissions are $PERMS (expected 640)"
exit 1
fi
echo "OK: $EXAMPLE_CONFIG - owner=$OWNER, permissions=$PERMS"
echo ""
echo "=== Checking systemd service ==="
if [ ! -f /etc/systemd/system/logcorrelator.service ]; then
echo "FAIL: systemd service file not found"
exit 1
fi
echo "OK: systemd service file exists"
echo ""
echo "=== Checking logrotate config ==="
if [ ! -f /etc/logrotate.d/logcorrelator ]; then
echo "FAIL: logrotate config not found"
exit 1
fi
echo "OK: logrotate config exists"
echo ""
echo "=== Testing service start ==="
# Try to start the service (may fail in container without full systemd)
if command -v systemctl >/dev/null 2>&1; then
systemctl daemon-reload || true
if systemctl start logcorrelator.service 2>/dev/null; then
echo "OK: service started successfully"
# Wait for sockets to be created
sleep 2
echo ""
echo "=== Checking sockets ==="
HTTP_SOCKET="/var/run/logcorrelator/http.socket"
NETWORK_SOCKET="/var/run/logcorrelator/network.socket"
if [ -S "$HTTP_SOCKET" ]; then
OWNER=$(stat -c '%U:%G' "$HTTP_SOCKET")
PERMS=$(stat -c '%a' "$HTTP_SOCKET")
echo "OK: $HTTP_SOCKET exists - owner=$OWNER, permissions=$PERMS"
if [ "$PERMS" != "666" ]; then
echo "WARN: socket permissions are $PERMS (expected 666)"
fi
else
echo "WARN: $HTTP_SOCKET not found (service may not have started)"
fi
if [ -S "$NETWORK_SOCKET" ]; then
OWNER=$(stat -c '%U:%G' "$NETWORK_SOCKET")
PERMS=$(stat -c '%a' "$NETWORK_SOCKET")
echo "OK: $NETWORK_SOCKET exists - owner=$OWNER, permissions=$PERMS"
if [ "$PERMS" != "666" ]; then
echo "WARN: socket permissions are $PERMS (expected 666)"
fi
else
echo "WARN: $NETWORK_SOCKET not found (service may not have started)"
fi
systemctl stop logcorrelator.service || true
else
echo "WARN: service failed to start (expected in minimal container)"
fi
else
echo "WARN: systemctl not available (minimal container)"
fi
echo ""
echo "========================================="
echo "All tests passed!"
echo "========================================="
EOF
)
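# Run test in Docker container
echo "Running tests in Docker container..."
echo ""
# TEST_SCRIPT holds the script text, not a path, so write it to a temp file
# before bind-mounting it into the container.
TEST_SCRIPT_FILE="$(mktemp)"
trap 'rm -f "${TEST_SCRIPT_FILE}"' EXIT
printf '%s\n' "${TEST_SCRIPT}" > "${TEST_SCRIPT_FILE}"
chmod 644 "${TEST_SCRIPT_FILE}"
docker run --rm \
    -v "${RPM_FILE}:/tmp/logcorrelator.rpm:ro" \
    -v "${TEST_SCRIPT_FILE}:/test.sh:ro" \
    "${BASE_IMAGE}" \
    bash /test.sh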
echo ""
echo "Test completed successfully for ${DISTRO}"
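
The owner and permission checks in the embedded test script repeat the same three steps for each path. As an illustration only, the same checks can be expressed as a table-driven helper. This is a Python sketch, not part of the packaging tests. `owner_of` and `check_path` are hypothetical names, and the expected modes are the ones the script above asserts.

```python
import grp
import os
import pwd
import stat
from typing import List, Optional


def owner_of(path: str) -> str:
    """Return 'user:group' for path, similar to `stat -c '%U:%G'`."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)  # uid without a passwd entry
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid without a group entry
    return f"{user}:{group}"


def check_path(path: str, mode: int, owner: Optional[str] = None) -> List[str]:
    """Return failure messages for path; an empty list means OK."""
    if not os.path.exists(path):
        return [f"FAIL: {path} does not exist"]
    failures = []
    actual_mode = stat.S_IMODE(os.stat(path).st_mode)
    if actual_mode != mode:
        failures.append(
            f"FAIL: {path} permissions are {actual_mode:o} (expected {mode:o})"
        )
    if owner is not None and owner_of(path) != owner:
        failures.append(f"FAIL: {path} owner is {owner_of(path)} (expected {owner})")
    return failures


# Expected modes, taken from the checks in the shell script above.
EXPECTED = [
    ("/var/run/logcorrelator", 0o755),
    ("/var/log/logcorrelator", 0o750),
    ("/var/lib/logcorrelator", 0o750),
    ("/etc/logcorrelator/logcorrelator.yml", 0o640),
    ("/etc/logcorrelator/logcorrelator.yml.example", 0o640),
]
```

Looping over `EXPECTED` and calling `check_path(path, mode, "logcorrelator:logcorrelator")` would reproduce the directory and config checks with one code path instead of five.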

@ -0,0 +1,101 @@
#!/bin/bash
set -e
echo "=== AUDIT ARCHITECTURE COMPLIANCE ==="
echo ""
# 1. Runtime - systemd service
echo "1. RUNTIME - SYSTEMD SERVICE"
if [ -f /src/logcorrelator.service ]; then
echo "✅ logcorrelator.service exists"
grep -q "ExecStart=/usr/bin/logcorrelator" /src/logcorrelator.service && echo " ✅ ExecStart correct" || echo " ❌ ExecStart incorrect"
grep -q "ExecReload=" /src/logcorrelator.service && echo " ✅ ExecReload present" || echo " ❌ ExecReload missing"
grep -q "Restart=on-failure" /src/logcorrelator.service && echo " ✅ Restart policy correct" || echo " ❌ Restart policy incorrect"
else
echo "❌ logcorrelator.service missing"
fi
# Check signal handling in code
echo ""
grep -r "SIGINT\|SIGTERM\|SIGHUP" /src/cmd/logcorrelator/main.go > /dev/null && echo "✅ Signal handling (SIGINT/SIGTERM/SIGHUP) implemented" || echo "❌ Signal handling missing"
# 2. Packaging - RPM
echo ""
echo "2. PACKAGING - RPM"
[ -f /src/packaging/rpm/logcorrelator.spec ] && echo "✅ RPM spec file exists" || echo "❌ RPM spec missing"
grep -q "rpmbuild" /src/Dockerfile.package && echo "✅ rpmbuild used for packaging" || echo "❌ rpmbuild not found"
# 3. Config - YAML
echo ""
echo "3. CONFIG - YAML"
[ -f /src/config.example.yml ] && echo "✅ config.example.yml exists" || echo "❌ config.example.yml missing"
grep -q "log:" /src/config.example.yml && echo " ✅ log section present" || echo " ❌ log section missing"
grep -q "inputs:" /src/config.example.yml && echo " ✅ inputs section present" || echo " ❌ inputs section missing"
grep -q "outputs:" /src/config.example.yml && echo " ✅ outputs section present" || echo " ❌ outputs section missing"
grep -q "correlation:" /src/config.example.yml && echo " ✅ correlation section present" || echo " ❌ correlation section missing"
# 4. Inputs - Unix datagram sockets
echo ""
echo "4. INPUTS - UNIX DATAGRAM SOCKETS"
grep -q "ListenUnixgram" /src/internal/adapters/inbound/unixsocket/source.go && echo "✅ Using ListenUnixgram (SOCK_DGRAM)" || echo "❌ Not using SOCK_DGRAM"
grep -q "ReadFromUnix" /src/internal/adapters/inbound/unixsocket/source.go && echo "✅ Using ReadFromUnix for datagrams" || echo "❌ Not using ReadFromUnix"
grep -q "MaxDatagramSize = 65535" /src/internal/adapters/inbound/unixsocket/source.go && echo "✅ max_datagram_bytes = 65535" || echo "❌ max_datagram_bytes incorrect"
grep -q "0666" /src/internal/adapters/inbound/unixsocket/source.go && echo "✅ Default socket permissions 0666" || echo "❌ Socket permissions not 0666"
# Check socket paths in config
grep -q "http.socket" /src/config.example.yml && echo " ✅ http.socket path configured" || echo " ❌ http.socket path missing"
grep -q "network.socket" /src/config.example.yml && echo " ✅ network.socket path configured" || echo " ❌ network.socket path missing"
# 5. Outputs - Sinks
echo ""
echo "5. OUTPUTS - SINKS"
[ -f /src/internal/adapters/outbound/file/sink.go ] && echo "✅ File sink exists" || echo "❌ File sink missing"
[ -f /src/internal/adapters/outbound/clickhouse/sink.go ] && echo "✅ ClickHouse sink exists" || echo "❌ ClickHouse sink missing"
[ -f /src/internal/adapters/outbound/multi/sink.go ] && echo "✅ MultiSink exists" || echo "❌ MultiSink missing"
# Check SIGHUP reopen in file sink
grep -q "Reopen" /src/internal/adapters/outbound/file/sink.go && echo " ✅ FileSink.Reopen() for SIGHUP" || echo " ❌ FileSink.Reopen() missing"
# Check ClickHouse batching
grep -q "batch" /src/internal/adapters/outbound/clickhouse/sink.go && echo " ✅ ClickHouse batching implemented" || echo " ❌ ClickHouse batching missing"
grep -q "drop_on_overflow\|DropOnOverflow" /src/internal/adapters/outbound/clickhouse/sink.go && echo " ✅ drop_on_overflow implemented" || echo " ❌ drop_on_overflow missing"
# 6. Correlation
echo ""
echo "6. CORRELATION"
grep -q "src_ip" /src/internal/domain/correlation_service.go && echo "✅ src_ip in correlation key" || echo "❌ src_ip missing"
grep -q "src_port" /src/internal/domain/correlation_service.go && echo "✅ src_port in correlation key" || echo "❌ src_port missing"
grep -q "MatchingMode" /src/internal/domain/correlation_service.go && echo "✅ MatchingMode (one_to_one/one_to_many) implemented" || echo "❌ MatchingMode missing"
grep -q "ApacheAlwaysEmit" /src/internal/domain/correlation_service.go && echo "✅ apache_always_emit orphan policy" || echo "❌ apache_always_emit missing"
grep -q "network_ttl\|NetworkTTLS" /src/internal/domain/correlation_service.go && echo "✅ TTL management for network events" || echo "❌ TTL management missing"
grep -q "max_http_items\|maxHttpItems\|MaxHTTPItems" /src/internal/domain/correlation_service.go && echo "✅ Buffer limit max_http_items" || echo " ⚠️ Buffer limit naming may differ"
grep -q "max_network_items\|maxNetworkItems\|MaxNetworkItems" /src/internal/domain/correlation_service.go && echo "✅ Buffer limit max_network_items" || echo " ⚠️ Buffer limit naming may differ"
# 7. Schema - Source A and B
echo ""
echo "7. SCHEMA - SOURCE A AND B"
grep -q "timestamp" /src/internal/adapters/inbound/unixsocket/source.go && echo "✅ timestamp field for Source A" || echo "❌ timestamp missing for Source A"
grep -q "SourceA\|SourceB" /src/internal/domain/event.go && echo "✅ EventSource enum (A/B)" || echo "❌ EventSource enum missing"
grep -q "header_" /src/internal/adapters/inbound/unixsocket/source.go && echo "✅ header_* dynamic fields" || echo "❌ header_* fields missing"
grep -q "Extra" /src/internal/domain/event.go && echo "✅ Extra fields map" || echo "❌ Extra fields missing"
# 8. Architecture modules
echo ""
echo "8. ARCHITECTURE MODULES"
[ -d /src/internal/domain ] && echo "✅ internal/domain" || echo "❌ internal/domain missing"
[ -d /src/internal/ports ] && echo "✅ internal/ports" || echo "❌ internal/ports missing"
[ -d /src/internal/app ] && echo "✅ internal/app" || echo "❌ internal/app missing"
[ -d /src/internal/adapters/inbound ] && echo "✅ internal/adapters/inbound" || echo "❌ internal/adapters/inbound missing"
[ -d /src/internal/adapters/outbound ] && echo "✅ internal/adapters/outbound" || echo "❌ internal/adapters/outbound missing"
[ -d /src/internal/config ] && echo "✅ internal/config" || echo "❌ internal/config missing"
[ -d /src/internal/observability ] && echo "✅ internal/observability" || echo "❌ internal/observability missing"
[ -d /src/cmd/logcorrelator ] && echo "✅ cmd/logcorrelator" || echo "❌ cmd/logcorrelator missing"
# 9. Testing
echo ""
echo "9. TESTING"
echo "Running tests with coverage..."
cd /src && go test ./... -cover 2>&1 | grep -E "^(ok|FAIL|\?)" || true
echo ""
echo "=== AUDIT COMPLETE ==="
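
Section 4 of the audit checks that the inputs are Unix datagram sockets with a 65535-byte read size and 0666 permissions. The receiving side of that design can be sketched as follows. This is a Python illustration, not the Go adapter in source.go, and `bind_datagram_socket` and `receive_event` are hypothetical names.

```python
import json
import os
import socket
from typing import Any, Dict

# Mirrors the MaxDatagramSize value the audit script greps for.
MAX_DATAGRAM_SIZE = 65535


def bind_datagram_socket(path: str, mode: int = 0o666) -> socket.socket:
    """Bind an AF_UNIX/SOCK_DGRAM socket at path and set its permissions."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket left by a previous run
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    os.chmod(path, mode)
    return sock


def receive_event(sock: socket.socket) -> Dict[str, Any]:
    """Read one datagram and decode it as a JSON object."""
    data = sock.recv(MAX_DATAGRAM_SIZE)
    return json.loads(data.decode("utf-8"))
```

Because each datagram carries exactly one JSON document, the receiver needs no line framing, which is one plausible reason for the move from STREAM to DGRAM noted in the 1.1.3 changelog entry.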

@ -0,0 +1,582 @@
#!/usr/bin/env python3
"""
test-correlation-advanced.py - Advanced correlation testing tool

This script provides comprehensive testing for the logcorrelator service,
including various scenarios to debug correlation issues.

Usage:
    python3 test-correlation-advanced.py [options]

Requirements:
    - Python 3.7+ (time.time_ns is used for timestamps)
    - requests library (for metrics): pip install requests
"""

import argparse
import json
import socket
import sys
import time
from datetime import datetime
from typing import Dict, Any, Optional, Tuple

try:
    import requests
    HAS_REQUESTS = True
except ImportError:
    HAS_REQUESTS = False

class Colors:
    """ANSI color codes for terminal output."""
    BLUE = '\033[0;34m'
    GREEN = '\033[0;32m'
    YELLOW = '\033[1;33m'
    RED = '\033[0;31m'
    NC = '\033[0m'  # No Color
    BOLD = '\033[1m'


def colorize(text: str, color: str) -> str:
    """Wrap text with ANSI color codes."""
    return f"{color}{text}{Colors.NC}"


def info(text: str):
    print(colorize("[INFO] ", Colors.BLUE) + text)


def success(text: str):
    print(colorize("[OK] ", Colors.GREEN) + text)


def warn(text: str):
    print(colorize("[WARN] ", Colors.YELLOW) + text)


def error(text: str):
    print(colorize("[ERROR] ", Colors.RED) + text)


def debug(text: str, verbose: bool = False):
    if verbose:
        print(colorize("[DEBUG] ", Colors.BLUE) + text)

class CorrelationTester:
    """Main test class for correlation testing."""

    def __init__(
        self,
        http_socket: str = "/var/run/logcorrelator/http.socket",
        network_socket: str = "/var/run/logcorrelator/network.socket",
        metrics_url: str = "http://localhost:8080/metrics",
        verbose: bool = False,
        skip_metrics: bool = False
    ):
        self.http_socket = http_socket
        self.network_socket = network_socket
        self.metrics_url = metrics_url
        self.verbose = verbose
        self.skip_metrics = skip_metrics
        self.http_sock: Optional[socket.socket] = None
        self.network_sock: Optional[socket.socket] = None

    def connect(self) -> bool:
        """Connect to Unix sockets."""
        try:
            # HTTP socket
            self.http_sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
            self.http_sock.connect(self.http_socket)
            debug(f"Connected to HTTP socket: {self.http_socket}", self.verbose)
            # Network socket
            self.network_sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
            self.network_sock.connect(self.network_socket)
            debug(f"Connected to Network socket: {self.network_socket}", self.verbose)
            return True
        except FileNotFoundError as e:
            error(f"Socket not found: {e}")
            return False
        except Exception as e:
            error(f"Connection error: {e}")
            return False

    def close(self):
        """Close socket connections."""
        if self.http_sock:
            self.http_sock.close()
        if self.network_sock:
            self.network_sock.close()

    def send_http_event(
        self,
        src_ip: str,
        src_port: int,
        timestamp: int,
        method: str = "GET",
        path: str = "/test",
        host: str = "example.com",
        extra_headers: Optional[Dict[str, str]] = None
    ) -> Dict[str, Any]:
        """Send an HTTP (source A) event."""
        event = {
            "src_ip": src_ip,
            "src_port": src_port,
            "dst_ip": "10.0.0.1",
            "dst_port": 443,
            "timestamp": timestamp,
            "method": method,
            "path": path,
            "host": host,
            "http_version": "HTTP/1.1",
            "header_user_agent": "TestAgent/1.0",
            "header_accept": "*/*"
        }
        if extra_headers:
            for key, value in extra_headers.items():
                event[f"header_{key}"] = value
        json_data = json.dumps(event)
        if self.http_sock:
            self.http_sock.sendall(json_data.encode())
            debug(f"Sent HTTP event: {src_ip}:{src_port} ts={timestamp}", self.verbose)
        return event

    def send_network_event(
        self,
        src_ip: str,
        src_port: int,
        timestamp: int,
        ja3: str = "abc123",
        ja4: str = "def456",
        tls_version: str = "TLS1.3",
        tls_sni: str = "example.com"
    ) -> Dict[str, Any]:
        """Send a Network (source B) event."""
        event = {
            "src_ip": src_ip,
            "src_port": src_port,
            "dst_ip": "10.0.0.1",
            "dst_port": 443,
            "timestamp": timestamp,
            "ja3": ja3,
            "ja4": ja4,
            "tls_version": tls_version,
            "tls_sni": tls_sni
        }
        json_data = json.dumps(event)
        if self.network_sock:
            self.network_sock.sendall(json_data.encode())
            debug(f"Sent Network event: {src_ip}:{src_port} ts={timestamp}", self.verbose)
        return event

    def get_metrics(self) -> Dict[str, Any]:
        """Fetch metrics from the metrics server."""
        if self.skip_metrics:
            return {}
        if not HAS_REQUESTS:
            warn("requests library not installed, skipping metrics")
            return {}
        try:
            response = requests.get(self.metrics_url, timeout=5)
            response.raise_for_status()
            return response.json()
        except Exception as e:
            warn(f"Failed to fetch metrics: {e}")
            return {}

    def print_metrics(self, metrics: Dict[str, Any], title: str = "Metrics"):
        """Print metrics in a formatted way."""
        if not metrics:
            return
        print(f"\n{colorize(f'=== {title} ===', Colors.BOLD)}")
        keys_to_show = [
            ("events_received_a", "Events A"),
            ("events_received_b", "Events B"),
            ("correlations_success", "Correlations"),
            ("correlations_failed", "Failures"),
            ("failed_no_match_key", " - No match key"),
            ("failed_time_window", " - Time window"),
            ("failed_buffer_eviction", " - Buffer eviction"),
            ("failed_ttl_expired", " - TTL expired"),
            ("buffer_a_size", "Buffer A size"),
            ("buffer_b_size", "Buffer B size"),
            ("orphans_emitted_a", "Orphans A"),
            ("orphans_emitted_b", "Orphans B"),
            ("pending_orphan_match", "Pending orphan matches"),
            ("keepalive_resets", "Keep-Alive resets"),
        ]
        for key, label in keys_to_show:
            if key in metrics:
                print(f" {label}: {metrics[key]}")

    def check_sockets(self) -> bool:
        """Check that both socket paths exist and are Unix sockets."""
        import os
        import stat

        errors = 0
        for name, path in [("HTTP", self.http_socket), ("Network", self.network_socket)]:
            if not os.path.exists(path):
                error(f"{name} socket not found: {path}")
                errors += 1
            elif not stat.S_ISSOCK(os.stat(path).st_mode):
                # The path exists but its file type is not a socket
                error(f"{name} path exists but is not a socket: {path}")
                errors += 1
            else:
                debug(f"{name} socket found: {path}", self.verbose)
        return errors == 0

    def run_basic_test(self, count: int = 10, delay_ms: int = 100) -> Tuple[bool, Dict[str, int]]:
        """
        Run basic correlation test.

        Sends N pairs of A+B events with matching src_ip:src_port and timestamps.
        All should correlate successfully.
        """
        info(f"Running basic correlation test with {count} pairs...")
        # Get initial metrics
        initial_metrics = self.get_metrics()
        self.print_metrics(initial_metrics, "Initial Metrics")
        initial_success = initial_metrics.get("correlations_success", 0)
        initial_failed = initial_metrics.get("correlations_failed", 0)
        initial_a = initial_metrics.get("events_received_a", 0)
        initial_b = initial_metrics.get("events_received_b", 0)
        # Send test events
        print(f"\nSending {count} event pairs...")
        base_timestamp = time.time_ns()
        sent = 0
        for i in range(1, count + 1):
            src_ip = f"192.168.1.{(i % 254) + 1}"
            src_port = 8000 + i
            # Same timestamp for perfect correlation
            timestamp = base_timestamp + (i * 1_000_000)
            self.send_http_event(src_ip, src_port, timestamp)
            self.send_network_event(src_ip, src_port, timestamp)
            sent += 1
            if delay_ms > 0:
                time.sleep(delay_ms / 1000.0)
        success(f"Sent {sent} event pairs")
        # Wait for processing
        info("Waiting for processing (2 seconds)...")
        time.sleep(2)
        # Get final metrics
        final_metrics = self.get_metrics()
        self.print_metrics(final_metrics, "Final Metrics")
        # Calculate deltas
        delta_success = final_metrics.get("correlations_success", 0) - initial_success
        delta_failed = final_metrics.get("correlations_failed", 0) - initial_failed
        delta_a = final_metrics.get("events_received_a", 0) - initial_a
        delta_b = final_metrics.get("events_received_b", 0) - initial_b
        results = {
            "sent": sent,
            "received_a": delta_a,
            "received_b": delta_b,
            "correlations": delta_success,
            "failures": delta_failed
        }
        # Print results (the deltas are counts received by the service)
        print(f"\n{colorize('=== Results ===', Colors.BOLD)}")
        print(f" Events A received: {delta_a} (expected: {sent})")
        print(f" Events B received: {delta_b} (expected: {sent})")
        print(f" Correlations: {delta_success}")
        print(f" Failures: {delta_failed}")
        # Validation
        test_passed = True
        if delta_a != sent:
            error(f"Event A count mismatch: got {delta_a}, expected {sent}")
            test_passed = False
        if delta_b != sent:
            error(f"Event B count mismatch: got {delta_b}, expected {sent}")
            test_passed = False
        if delta_success != sent:
            error(f"Correlation count mismatch: got {delta_success}, expected {sent}")
            test_passed = False
        if delta_failed > 0:
            warn(f"Unexpected correlation failures: {delta_failed}")
        if test_passed:
            success("All tests passed! Correlation is working correctly.")
        else:
            error("Some tests failed. Check logs for details.")
        return test_passed, results

    def run_time_window_test(self) -> bool:
        """Test time window expiration."""
        info("Running time window test...")
        src_ip = "192.168.100.1"
        src_port = 9999
        # Send A event
        ts_a = time.time_ns()
        self.send_http_event(src_ip, src_port, ts_a)
        info(f"Sent A event at {ts_a}")
        # Wait for time window to expire (default 10s)
        info("Waiting 11 seconds (time window should expire)...")
        time.sleep(11)
        # Send B event
        ts_b = time.time_ns()
        self.send_network_event(src_ip, src_port, ts_b)
        info(f"Sent B event at {ts_b}")
        time_diff_sec = (ts_b - ts_a) / 1_000_000_000
        info(f"Time difference: {time_diff_sec:.1f} seconds")
        info("Expected: time_window failure (check metrics)")
        return True

    def run_different_ip_test(self) -> bool:
        """Test different IP (should not correlate)."""
        info("Running different IP test...")
        ts = time.time_ns()
        # Send A with IP 192.168.200.1
        self.send_http_event("192.168.200.1", 7777, ts)
        info("Sent A event from 192.168.200.1:7777")
        # Send B with different IP
        self.send_network_event("192.168.200.2", 7777, ts)
        info("Sent B event from 192.168.200.2:7777 (different IP)")
        info("Expected: no_match_key failure (different src_ip)")
        return True

def run_keepalive_test(self, count: int = 5) -> bool:
"""Test Keep-Alive mode (one B correlates with multiple A)."""
info(f"Running Keep-Alive test with {count} HTTP requests on same connection...")
src_ip = "192.168.50.1"
src_port = 6000
# Send one B event first (network/TCP connection)
ts_b = time.time_ns()
self.send_network_event(src_ip, src_port, ts_b)
info(f"Sent B event (connection): {src_ip}:{src_port}")
# Send multiple A events (HTTP requests) on same connection
for i in range(count):
ts_a = time.time_ns() + (i * 100_000_000) # 100ms apart
self.send_http_event(src_ip, src_port, ts_a, path=f"/request{i}")
info(f"Sent A event (request {i}): {src_ip}:{src_port}")
time.sleep(0.05) # 50ms delay
time.sleep(2) # Wait for processing
# Check metrics
metrics = self.get_metrics()
keepalive_resets = metrics.get("keepalive_resets", 0)
info(f"Keep-Alive resets: {keepalive_resets} (expected: {count - 1})")
if keepalive_resets >= count - 1:
success("Keep-Alive test passed!")
return True
else:
warn("Keep-Alive resets lower than expected. This may be normal depending on timing.")
return True
def run_all_tests(self) -> bool:
"""Run all test scenarios."""
results = []
# Basic test
passed, _ = self.run_basic_test(count=10)
results.append(("Basic correlation", passed))
print("\n" + "=" * 50 + "\n")
# Time window test
self.run_time_window_test()
results.append(("Time window", True)) # Informational
print("\n" + "=" * 50 + "\n")
# Different IP test
self.run_different_ip_test()
results.append(("Different IP", True)) # Informational
print("\n" + "=" * 50 + "\n")
# Keep-Alive test
self.run_keepalive_test()
results.append(("Keep-Alive", True))
# Summary
print(f"\n{colorize('=== Test Summary ===', Colors.BOLD)}")
for name, passed in results:
status = colorize("PASS", Colors.GREEN) if passed else colorize("FAIL", Colors.RED)
print(f" {name}: {status}")
return all(r[1] for r in results)
def main():
parser = argparse.ArgumentParser(
description="Advanced correlation testing tool for logcorrelator",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run basic test with 20 pairs
python3 test-correlation-advanced.py -c 20
# Run all tests with verbose output
python3 test-correlation-advanced.py --all -v
# Test with custom socket paths
python3 test-correlation-advanced.py -H /tmp/http.sock -N /tmp/network.sock
# Skip metrics check
python3 test-correlation-advanced.py --skip-metrics
"""
)
parser.add_argument(
"-H", "--http-socket",
default="/var/run/logcorrelator/http.socket",
help="Path to HTTP Unix socket (default: /var/run/logcorrelator/http.socket)"
)
parser.add_argument(
"-N", "--network-socket",
default="/var/run/logcorrelator/network.socket",
help="Path to Network Unix socket (default: /var/run/logcorrelator/network.socket)"
)
parser.add_argument(
"-m", "--metrics-url",
default="http://localhost:8080/metrics",
help="Metrics server URL (default: http://localhost:8080/metrics)"
)
parser.add_argument(
"-c", "--count",
type=int,
default=10,
help="Number of test pairs to send (default: 10)"
)
parser.add_argument(
"-d", "--delay",
type=int,
default=100,
help="Delay between pairs in milliseconds (default: 100)"
)
parser.add_argument(
"-v", "--verbose",
action="store_true",
help="Enable verbose output"
)
parser.add_argument(
"--skip-metrics",
action="store_true",
help="Skip metrics check"
)
parser.add_argument(
"--all",
action="store_true",
help="Run all test scenarios"
)
parser.add_argument(
"--time-window",
action="store_true",
help="Run time window test only"
)
parser.add_argument(
"--different-ip",
action="store_true",
help="Run different IP test only"
)
parser.add_argument(
"--keepalive",
action="store_true",
help="Run Keep-Alive test only"
)
args = parser.parse_args()
# Create tester
tester = CorrelationTester(
http_socket=args.http_socket,
network_socket=args.network_socket,
metrics_url=args.metrics_url,
verbose=args.verbose,
skip_metrics=args.skip_metrics
)
# Check sockets
if not tester.check_sockets():
error("Socket check failed. Is logcorrelator running?")
sys.exit(1)
success("Socket check passed")
# Connect
if not tester.connect():
error("Failed to connect to sockets")
sys.exit(1)
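try:
# Use `passed`, not `success`: assigning to `success` inside main() would make it
# a local name and break the earlier success("Socket check passed") call.
if args.all:
passed = tester.run_all_tests()
elif args.time_window:
passed = tester.run_time_window_test()
elif args.different_ip:
passed = tester.run_different_ip_test()
elif args.keepalive:
passed = tester.run_keepalive_test()
else:
# Propagate the basic test result instead of always reporting success.
passed, _ = tester.run_basic_test(count=args.count, delay_ms=args.delay)
sys.exit(0 if passed else 1)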
finally:
tester.close()
if __name__ == "__main__":
main()


@ -0,0 +1,404 @@
#!/bin/bash
#
# test-correlation.sh - Test script for log correlation debugging
#
# This script sends test HTTP (A) and Network (B) events to the logcorrelator
# Unix sockets and verifies that correlation is working correctly.
#
# Usage:
# ./test-correlation.sh [options]
#
# Options:
# -h, --http-socket PATH Path to HTTP socket (default: /var/run/logcorrelator/http.socket)
# -n, --network-socket PATH Path to Network socket (default: /var/run/logcorrelator/network.socket)
# -c, --count NUM Number of test pairs to send (default: 10)
# -d, --delay MS Delay between pairs in milliseconds (default: 100)
# -v, --verbose Enable verbose output
# -m, --metrics-url URL Metrics server URL (default: http://localhost:8080/metrics)
# --skip-metrics Skip metrics check
# --help Show this help message
#
set -e
# Default values
HTTP_SOCKET="/var/run/logcorrelator/http.socket"
NETWORK_SOCKET="/var/run/logcorrelator/network.socket"
COUNT=10
DELAY_MS=100
VERBOSE=false
METRICS_URL="http://localhost:8080/metrics"
SKIP_METRICS=false
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Print functions
info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
success() {
echo -e "${GREEN}[OK]${NC} $1"
}
warn() {
echo -e "${YELLOW}[WARN]${NC} $1"
}
error() {
echo -e "${RED}[ERROR]${NC} $1"
}
verbose() {
if [ "$VERBOSE" = true ]; then
echo -e "${BLUE}[DEBUG]${NC} $1"
fi
}
# Show help
show_help() {
head -20 "$0" | tail -17 | sed 's/^#//' | sed 's/^ //'
exit 0
}
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
-h|--http-socket)
HTTP_SOCKET="$2"
shift 2
;;
-n|--network-socket)
NETWORK_SOCKET="$2"
shift 2
;;
-c|--count)
COUNT="$2"
shift 2
;;
-d|--delay)
DELAY_MS="$2"
shift 2
;;
-v|--verbose)
VERBOSE=true
shift
;;
-m|--metrics-url)
METRICS_URL="$2"
shift 2
;;
--skip-metrics)
SKIP_METRICS=true
shift
;;
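--test-time-window)
# Optional test selector (off by default)
RUN_TIME_WINDOW_TEST=true
shift
;;
--test-different-ip)
# Optional test selector (off by default)
RUN_DIFFERENT_IP_TEST=true
shift
;;
--help)
show_help
;;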
*)
error "Unknown option: $1"
echo "Use --help for usage information"
exit 1
;;
esac
done
# Check if socat or netcat is available
if command -v socat &> /dev/null; then
SEND_CMD="socat"
elif command -v nc &> /dev/null; then
SEND_CMD="nc"
else
error "Neither socat nor nc (netcat) found. Please install one of them."
echo " Ubuntu/Debian: apt-get install socat OR apt-get install netcat-openbsd"
echo " RHEL/CentOS: yum install socat OR yum install nc"
exit 1
fi
# Function to send data to Unix socket
send_to_socket() {
local socket="$1"
local data="$2"
if [ "$SEND_CMD" = "socat" ]; then
echo "$data" | socat - "UNIX-SENDTO:$socket" 2>/dev/null
else
echo "$data" | nc -U -u "$socket" 2>/dev/null
fi
}
# Function to generate timestamp in nanoseconds
get_timestamp_ns() {
date +%s%N
}
# Function to send HTTP (A) event
send_http_event() {
local src_ip="$1"
local src_port="$2"
local timestamp="$3"
local method="${4:-GET}"
local path="${5:-/test}"
local host="${6:-example.com}"
local json=$(cat <<EOF
{"src_ip":"$src_ip","src_port":$src_port,"dst_ip":"10.0.0.1","dst_port":443,"timestamp":$timestamp,"method":"$method","path":"$path","host":"$host","http_version":"HTTP/1.1","header_user_agent":"TestAgent/1.0","header_accept":"*/*"}
EOF
)
verbose "Sending HTTP event: $json"
send_to_socket "$HTTP_SOCKET" "$json"
}
# Function to send Network (B) event
send_network_event() {
local src_ip="$1"
local src_port="$2"
local timestamp="$3"
local ja3="${4:-abc123}"
local ja4="${5:-def456}"
local json=$(cat <<EOF
{"src_ip":"$src_ip","src_port":$src_port,"dst_ip":"10.0.0.1","dst_port":443,"timestamp":$timestamp,"ja3":"$ja3","ja4":"$ja4","tls_version":"TLS1.3","tls_sni":"example.com"}
EOF
)
verbose "Sending Network event: $json"
send_to_socket "$NETWORK_SOCKET" "$json"
}
# Check sockets exist
check_sockets() {
local errors=0
if [ ! -S "$HTTP_SOCKET" ]; then
error "HTTP socket not found: $HTTP_SOCKET"
errors=$((errors + 1))
else
verbose "HTTP socket found: $HTTP_SOCKET"
fi
if [ ! -S "$NETWORK_SOCKET" ]; then
error "Network socket not found: $NETWORK_SOCKET"
errors=$((errors + 1))
else
verbose "Network socket found: $NETWORK_SOCKET"
fi
if [ $errors -gt 0 ]; then
error "$errors socket(s) not found. Is logcorrelator running?"
exit 1
fi
success "Sockets check passed"
}
# Get metrics from server
get_metrics() {
if [ "$SKIP_METRICS" = true ]; then
# Emit an empty JSON object so callers always receive parseable output.
echo "{}"
return 0
fi
if command -v curl &> /dev/null; then
curl -s "$METRICS_URL" 2>/dev/null || echo "{}"
elif command -v wget &> /dev/null; then
wget -qO- "$METRICS_URL" 2>/dev/null || echo "{}"
else
warn "Neither curl nor wget found. Skipping metrics check."
echo "{}"
fi
}
# Extract value from JSON (simple grep-based, requires jq for complex queries)
get_json_value() {
local json="$1"
local key="$2"
if command -v jq &> /dev/null; then
echo "$json" | jq -r ".$key // 0"
else
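# Fallback: simple grep (works for flat JSON with integer values).
# The exit status of a pipeline is that of its last command, so "|| echo 0"
# never fired on a missing key; default the empty result explicitly instead.
local value
value=$(echo "$json" | grep -o "\"$key\":[0-9]*" | head -n 1 | cut -d: -f2) || true
echo "${value:-0}"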
fi
}
# Main test function
run_test() {
info "Starting correlation test..."
info "Configuration:"
echo " HTTP Socket: $HTTP_SOCKET"
echo " Network Socket: $NETWORK_SOCKET"
echo " Test pairs: $COUNT"
echo " Delay between: ${DELAY_MS}ms"
echo " Metrics URL: $METRICS_URL"
echo " Send command: $SEND_CMD"
echo ""
# Get initial metrics
info "Fetching initial metrics..."
local initial_metrics=$(get_metrics)
local initial_success=$(get_json_value "$initial_metrics" "correlations_success")
local initial_failed=$(get_json_value "$initial_metrics" "correlations_failed")
local initial_a=$(get_json_value "$initial_metrics" "events_received_a")
local initial_b=$(get_json_value "$initial_metrics" "events_received_b")
info "Initial metrics:"
echo " Events A: $initial_a"
echo " Events B: $initial_b"
echo " Success: $initial_success"
echo " Failed: $initial_failed"
echo ""
# Send test events
info "Sending $COUNT test event pairs..."
local base_timestamp=$(get_timestamp_ns)
local sent=0
local correlated=0
for i in $(seq 1 $COUNT); do
local src_ip="192.168.1.$((i % 254 + 1))"
local src_port=$((8000 + i))
# Send A and B with same timestamp (should correlate)
local ts_a=$((base_timestamp + i * 1000000))
local ts_b=$ts_a # Same timestamp for perfect correlation
send_http_event "$src_ip" "$src_port" "$ts_a"
send_network_event "$src_ip" "$src_port" "$ts_b"
sent=$((sent + 1))
verbose "Sent pair $i: $src_ip:$src_port"
if [ $DELAY_MS -gt 0 ]; then
sleep $(echo "scale=3; $DELAY_MS / 1000" | bc)
fi
done
success "Sent $sent event pairs"
echo ""
# Wait for processing
info "Waiting for processing (2 seconds)..."
sleep 2
# Get final metrics
info "Fetching final metrics..."
local final_metrics=$(get_metrics)
local final_success=$(get_json_value "$final_metrics" "correlations_success")
local final_failed=$(get_json_value "$final_metrics" "correlations_failed")
local final_a=$(get_json_value "$final_metrics" "events_received_a")
local final_b=$(get_json_value "$final_metrics" "events_received_b")
# Calculate deltas
local delta_success=$((final_success - initial_success))
local delta_failed=$((final_failed - initial_failed))
local delta_a=$((final_a - initial_a))
local delta_b=$((final_b - initial_b))
echo ""
info "Results:"
echo " Events A sent: $delta_a (expected: $sent)"
echo " Events B sent: $delta_b (expected: $sent)"
echo " Correlations: $delta_success"
echo " Failures: $delta_failed"
echo ""
# Validation
if [ "$SKIP_METRICS" = true ]; then
warn "Metrics check skipped; event and correlation counts were not validated."
return 0
fi
local test_passed=true
if [ "$delta_a" -ne "$sent" ]; then
error "Event A count mismatch: got $delta_a, expected $sent"
test_passed=false
fi
if [ "$delta_b" -ne "$sent" ]; then
error "Event B count mismatch: got $delta_b, expected $sent"
test_passed=false
fi
if [ "$delta_success" -ne "$sent" ]; then
error "Correlation count mismatch: got $delta_success, expected $sent"
test_passed=false
fi
if [ "$delta_failed" -ne 0 ]; then
warn "Unexpected correlation failures: $delta_failed"
fi
# Return instead of exiting so that code after run_test can still execute.
# With "set -e", a non-zero return still aborts the script with that status.
if [ "$test_passed" = true ]; then
success "All tests passed! Correlation is working correctly."
return 0
else
error "Some tests failed. Check the logs for details."
return 1
fi
}
# Test with time window exceeded
run_time_window_test() {
info "Running time window test (B arrives after time window)..."
local src_ip="192.168.100.1"
local src_port="9999"
# Send A event
local ts_a=$(get_timestamp_ns)
send_http_event "$src_ip" "$src_port" "$ts_a"
info "Sent A event at timestamp $ts_a"
# Wait for time window to expire (default is 10s, we wait 11s)
info "Waiting 11 seconds (time window should expire)..."
sleep 11
# Send B event
local ts_b=$(get_timestamp_ns)
send_network_event "$src_ip" "$src_port" "$ts_b"
info "Sent B event at timestamp $ts_b"
info "This should result in a time_window failure (check metrics)"
}
# Test with different src_ip
run_different_ip_test() {
info "Running different IP test (should NOT correlate)..."
# Send A with IP 192.168.200.1
local ts=$(get_timestamp_ns)
send_http_event "192.168.200.1" "7777" "$ts"
info "Sent A event from 192.168.200.1:7777"
# Send B with different IP
send_network_event "192.168.200.2" "7777" "$ts"
info "Sent B event from 192.168.200.2:7777 (different IP)"
info "These should NOT correlate (different src_ip)"
}
# Run tests
check_sockets
echo ""
# Run main test
run_test
echo ""
info "Additional tests available:"
echo " --test-time-window Test time window expiration"
echo " --test-different-ip Test different IP (no correlation)"
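# Optional tests, enabled through the RUN_*_TEST variables (default: off).
# "$@" is empty once argument parsing has shifted every option away, so
# matching flags against it here can never succeed.
if [ "${RUN_TIME_WINDOW_TEST:-false}" = true ]; then
echo ""
run_time_window_test
fi
if [ "${RUN_DIFFERENT_IP_TEST:-false}" = true ]; then
echo ""
run_different_ip_test
fi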


@ -0,0 +1,21 @@
DROP TABLE IF EXISTS mabase_prod.ref_bot_networks;
CREATE TABLE mabase_prod.ref_bot_networks (
-- IPv6CIDR is used because it also accepts IPv4 addresses in the ::ffff:1.2.3.4/120 form
network IPv6CIDR,
bot_name LowCardinality(String),
is_legitimate UInt8,
last_update DateTime
) ENGINE = ReplacingMergeTree(last_update)
ORDER BY (network, bot_name);
-- Table that reads the bot IP file
CREATE TABLE mabase_prod.bot_ip (
ip String
) ENGINE = File(CSV, 'bot_ip.csv');
-- Table that reads the JA4 signature file
CREATE TABLE mabase_prod.bot_ja4 (
ja4 String
) ENGINE = File(CSV, 'bot_ja4.csv');


@ -0,0 +1,234 @@
-- =============================================================================
-- logcorrelator - ClickHouse initialization
-- =============================================================================
-- This file creates the database, tables, materialized view,
-- and users that logcorrelator needs to run.
--
-- Usage:
-- clickhouse-client --multiquery < sql/init.sql
-- =============================================================================
-- -----------------------------------------------------------------------------
-- Database
-- -----------------------------------------------------------------------------
CREATE DATABASE IF NOT EXISTS mabase_prod;
-- -----------------------------------------------------------------------------
-- Raw table: direct target of the service's inserts
-- The service only inserts into this table (raw_json column).
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS mabase_prod.http_logs_raw
(
`raw_json` String CODEC(ZSTD(3)),
`ingest_time` DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toDate(ingest_time)
ORDER BY ingest_time
TTL ingest_time + INTERVAL 1 DAY
SETTINGS
index_granularity = 8192,
ttl_only_drop_parts = 1;
-- -----------------------------------------------------------------------------
-- Parsed table: populated automatically by the materialized view
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS mabase_prod.http_logs
(
-- Time
`time` DateTime,
`log_date` Date DEFAULT toDate(time),
-- Network
`src_ip` IPv4,
`src_port` UInt16,
`dst_ip` IPv4,
`dst_port` UInt16,
-- IPLocate enrichment
`src_asn` UInt32,
`src_country_code` LowCardinality(String),
`src_as_name` LowCardinality(String),
`src_org` LowCardinality(String),
`src_domain` LowCardinality(String),
-- HTTP
`method` LowCardinality(String),
`scheme` LowCardinality(String),
`host` LowCardinality(String),
`path` String CODEC(ZSTD(3)),
`query` String CODEC(ZSTD(3)),
`http_version` LowCardinality(String),
-- Correlation
`orphan_side` LowCardinality(String),
`correlated` UInt8,
`keepalives` UInt16,
`a_timestamp` UInt64,
`b_timestamp` UInt64,
`conn_id` String CODEC(ZSTD(3)),
-- IP metadata
`ip_meta_df` UInt8,
`ip_meta_id` UInt16,
`ip_meta_total_length` UInt16,
`ip_meta_ttl` UInt8,
-- TCP metadata
`tcp_meta_options` LowCardinality(String),
`tcp_meta_window_size` UInt32,
`tcp_meta_mss` UInt16,
`tcp_meta_window_scale` UInt8,
`syn_to_clienthello_ms` Int32,
-- TLS / fingerprint
`tls_version` LowCardinality(String),
`tls_sni` LowCardinality(String),
`tls_alpn` LowCardinality(String),
`ja3` String CODEC(ZSTD(3)),
`ja3_hash` String CODEC(ZSTD(3)),
`ja4` String CODEC(ZSTD(3)),
-- HTTP headers
`client_headers` String CODEC(ZSTD(3)),
`header_user_agent` String CODEC(ZSTD(3)),
`header_accept` String CODEC(ZSTD(3)),
`header_accept_encoding` String CODEC(ZSTD(3)),
`header_accept_language` String CODEC(ZSTD(3)),
`header_content_type` String CODEC(ZSTD(3)),
`header_x_request_id` String CODEC(ZSTD(3)),
`header_x_trace_id` String CODEC(ZSTD(3)),
`header_x_forwarded_for` String CODEC(ZSTD(3)),
`header_sec_ch_ua` String CODEC(ZSTD(3)),
`header_sec_ch_ua_mobile` String CODEC(ZSTD(3)),
`header_sec_ch_ua_platform` String CODEC(ZSTD(3)),
`header_sec_fetch_dest` String CODEC(ZSTD(3)),
`header_sec_fetch_mode` String CODEC(ZSTD(3)),
`header_sec_fetch_site` String CODEC(ZSTD(3))
)
ENGINE = MergeTree
PARTITION BY log_date
ORDER BY (time, src_ip, dst_ip, ja4)
TTL log_date + INTERVAL 7 DAY
SETTINGS
index_granularity = 8192,
ttl_only_drop_parts = 1;
-- -----------------------------------------------------------------------------
-- Materialized view: parses the JSON from http_logs_raw into http_logs
-- -----------------------------------------------------------------------------
DROP VIEW IF EXISTS mabase_prod.mv_http_logs;
CREATE MATERIALIZED VIEW IF NOT EXISTS mabase_prod.mv_http_logs
TO mabase_prod.http_logs
AS
SELECT
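-- JSONExtractString returns '' (not NULL) for a missing key, so coalesce never
-- applied its default; the OrZero variant falls back to the epoch instead of raising.
parseDateTimeBestEffortOrZero(JSONExtractString(raw_json, 'time')) AS time,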
toDate(time) AS log_date,
toIPv4(coalesce(JSONExtractString(raw_json, 'src_ip'), '0.0.0.0')) AS src_ip,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'src_port'), 0)) AS src_port,
toIPv4(coalesce(JSONExtractString(raw_json, 'dst_ip'), '0.0.0.0')) AS dst_ip,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'dst_port'), 0)) AS dst_port,
dictGetOrDefault(
'mabase_prod.dict_iplocate_asn',
'asn',
IPv4ToIPv6(IPv4StringToNum(toString(src_ip))),
toUInt32(0)
) AS src_asn,
dictGetOrDefault(
'mabase_prod.dict_iplocate_asn',
'country_code',
IPv4ToIPv6(IPv4StringToNum(toString(src_ip))),
''
) AS src_country_code,
dictGetOrDefault(
'mabase_prod.dict_iplocate_asn',
'name',
IPv4ToIPv6(IPv4StringToNum(toString(src_ip))),
''
) AS src_as_name,
dictGetOrDefault(
'mabase_prod.dict_iplocate_asn',
'org',
IPv4ToIPv6(IPv4StringToNum(toString(src_ip))),
''
) AS src_org,
dictGetOrDefault(
'mabase_prod.dict_iplocate_asn',
'domain',
IPv4ToIPv6(IPv4StringToNum(toString(src_ip))),
''
) AS src_domain,
coalesce(JSONExtractString(raw_json, 'method'), '') AS method,
coalesce(JSONExtractString(raw_json, 'scheme'), '') AS scheme,
coalesce(JSONExtractString(raw_json, 'host'), '') AS host,
coalesce(JSONExtractString(raw_json, 'path'), '') AS path,
coalesce(JSONExtractString(raw_json, 'query'), '') AS query,
coalesce(JSONExtractString(raw_json, 'http_version'), '') AS http_version,
coalesce(JSONExtractString(raw_json, 'orphan_side'), '') AS orphan_side,
toUInt8(coalesce(JSONExtractBool(raw_json, 'correlated'), 0)) AS correlated,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'keepalives'), 0)) AS keepalives,
coalesce(JSONExtractUInt(raw_json, 'a_timestamp'), 0) AS a_timestamp,
coalesce(JSONExtractUInt(raw_json, 'b_timestamp'), 0) AS b_timestamp,
coalesce(JSONExtractString(raw_json, 'conn_id'), '') AS conn_id,
toUInt8(coalesce(JSONExtractBool(raw_json, 'ip_meta_df'), 0)) AS ip_meta_df,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'ip_meta_id'), 0)) AS ip_meta_id,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'ip_meta_total_length'), 0)) AS ip_meta_total_length,
toUInt8(coalesce(JSONExtractUInt(raw_json, 'ip_meta_ttl'), 0)) AS ip_meta_ttl,
coalesce(JSONExtractString(raw_json, 'tcp_meta_options'), '') AS tcp_meta_options,
toUInt32(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_window_size'), 0)) AS tcp_meta_window_size,
toUInt16(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_mss'), 0)) AS tcp_meta_mss,
toUInt8(coalesce(JSONExtractUInt(raw_json, 'tcp_meta_window_scale'), 0)) AS tcp_meta_window_scale,
toInt32(coalesce(JSONExtractInt(raw_json, 'syn_to_clienthello_ms'), 0)) AS syn_to_clienthello_ms,
coalesce(JSONExtractString(raw_json, 'tls_version'), '') AS tls_version,
coalesce(JSONExtractString(raw_json, 'tls_sni'), '') AS tls_sni,
coalesce(JSONExtractString(raw_json, 'tls_alpn'), '') AS tls_alpn,
coalesce(JSONExtractString(raw_json, 'ja3'), '') AS ja3,
coalesce(JSONExtractString(raw_json, 'ja3_hash'), '') AS ja3_hash,
coalesce(JSONExtractString(raw_json, 'ja4'), '') AS ja4,
coalesce(JSONExtractString(raw_json, 'client_headers'), '') AS client_headers,
coalesce(JSONExtractString(raw_json, 'header_User-Agent'), '') AS header_user_agent,
coalesce(JSONExtractString(raw_json, 'header_Accept'), '') AS header_accept,
coalesce(JSONExtractString(raw_json, 'header_Accept-Encoding'), '') AS header_accept_encoding,
coalesce(JSONExtractString(raw_json, 'header_Accept-Language'), '') AS header_accept_language,
coalesce(JSONExtractString(raw_json, 'header_Content-Type'), '') AS header_content_type,
coalesce(JSONExtractString(raw_json, 'header_X-Request-Id'), '') AS header_x_request_id,
coalesce(JSONExtractString(raw_json, 'header_X-Trace-Id'), '') AS header_x_trace_id,
coalesce(JSONExtractString(raw_json, 'header_X-Forwarded-For'), '') AS header_x_forwarded_for,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA'), '') AS header_sec_ch_ua,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA-Mobile'), '') AS header_sec_ch_ua_mobile,
coalesce(JSONExtractString(raw_json, 'header_Sec-CH-UA-Platform'), '') AS header_sec_ch_ua_platform,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Dest'), '') AS header_sec_fetch_dest,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Mode'), '') AS header_sec_fetch_mode,
coalesce(JSONExtractString(raw_json, 'header_Sec-Fetch-Site'), '') AS header_sec_fetch_site
FROM mabase_prod.http_logs_raw;
-- -----------------------------------------------------------------------------
-- Users and permissions
-- -----------------------------------------------------------------------------
-- 'ChangeMe' is a placeholder: set real passwords before using this outside a test setup.
CREATE USER IF NOT EXISTS data_writer IDENTIFIED WITH plaintext_password BY 'ChangeMe';
CREATE USER IF NOT EXISTS analyst IDENTIFIED WITH plaintext_password BY 'ChangeMe';
-- data_writer: INSERT (plus SELECT) on the raw table only
GRANT INSERT ON mabase_prod.http_logs_raw TO data_writer;
GRANT SELECT ON mabase_prod.http_logs_raw TO data_writer;
-- analyst: read access to the parsed table
GRANT SELECT ON mabase_prod.http_logs TO analyst;
-- -----------------------------------------------------------------------------
-- Post-installation checks
-- -----------------------------------------------------------------------------
-- SELECT count(*), min(ingest_time), max(ingest_time) FROM mabase_prod.http_logs_raw;
-- SELECT count(*), min(time), max(time) FROM mabase_prod.http_logs;
-- SELECT time, src_ip, dst_ip, method, host, path, ja4 FROM mabase_prod.http_logs ORDER BY time DESC LIMIT 10;


@ -0,0 +1,29 @@
DROP DICTIONARY IF EXISTS mabase_prod.dict_iplocate_asn;
CREATE DICTIONARY IF NOT EXISTS mabase_prod.dict_iplocate_asn
(
network String,
asn UInt32,
country_code String,
name String,
org String,
domain String
)
PRIMARY KEY network
SOURCE(FILE(path '/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv' format 'CSVWithNames'))
LAYOUT(IP_TRIE())
LIFETIME(MIN 3600 MAX 7200);
-- Drop the table if it exists so it can be reconfigured
DROP TABLE IF EXISTS mabase_prod.ref_bot_networks;
-- Table optimized for binary CIDR filtering
CREATE TABLE mabase_prod.ref_bot_networks (
network IPv6CIDR, -- Natively handles '1.2.3.0/24' and '2001:db8::/32'
bot_name LowCardinality(String),
is_legitimate UInt8, -- 1 = Whitelist, 0 = Blacklist
last_update DateTime
) ENGINE = ReplacingMergeTree(last_update)
ORDER BY (network, bot_name);


@ -0,0 +1,8 @@
# dashboard configuration — DO NOT COMMIT real values
CLICKHOUSE_HOST=clickhouse
CLICKHOUSE_PORT=8123
CLICKHOUSE_DB=mabase_prod
CLICKHOUSE_USER=analyst
CLICKHOUSE_PASSWORD=
API_HOST=0.0.0.0
CORS_ORIGINS=["http://localhost:3000"]


@ -0,0 +1,114 @@
# Copilot Instructions — Bot Detector Dashboard
## Architecture Overview
This is a **SOC (Security Operations Center) dashboard** for visualizing bot detections from an upstream `bot_detector_ai` service. It is a **single-service, full-stack app**: the FastAPI backend serves the built React frontend as static files *and* exposes a REST API, all on port 8000. There is no separate frontend server in production and **no authentication**.
**Data source:** ClickHouse database (`mabase_prod`), primarily the `ml_detected_anomalies` table and the `view_dashboard_entities` view.
```
dashboard/
├── backend/ # Python 3.11 + FastAPI — REST API + static file serving
│ ├── main.py # App entry point: CORS, router registration, SPA catch-all
│ ├── config.py # pydantic-settings Settings, reads .env
│ ├── database.py # ClickHouseClient singleton (db)
│ ├── models.py # All Pydantic v2 response models
│ ├── routes/ # One module per domain: metrics, detections, variability,
│ │ # attributes, analysis, entities, incidents, audit, reputation
│ └── services/
│ └── reputation_ip.py # Async httpx → ip-api.com + ipinfo.io (no API keys)
└── frontend/ # React 18 + TypeScript 5 + Vite 5 + Tailwind CSS 3
└── src/
├── App.tsx # BrowserRouter + Sidebar + TopHeader + all Routes
├── ThemeContext.tsx # dark/light/auto, persisted to localStorage (key: soc_theme)
├── api/client.ts # Axios instance (baseURL: /api) + all TS interfaces
├── components/ # One component per route view + shared panels + ui/
├── hooks/ # useMetrics, useDetections, useVariability (polling wrappers)
└── utils/STIXExporter.ts
```
## Dev Commands
```bash
# Backend (run from repo root)
pip install -r requirements.txt
python -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
# Frontend (separate terminal)
cd frontend && npm install
npm run dev # :3000 with HMR, proxies /api → localhost:8000
npm run build # tsc type-check + vite build → frontend/dist/
npm run preview # preview the production build
# Docker (production)
docker compose up -d dashboard_web
docker compose build dashboard_web && docker compose up -d dashboard_web
docker compose logs -f dashboard_web
```
There is no test suite or linter configured (no pytest, vitest, ESLint, Black, etc.).
```bash
# Manual smoke tests
curl http://localhost:8000/health
curl http://localhost:8000/api/metrics | jq '.summary'
curl "http://localhost:8000/api/detections?page=1&page_size=5" | jq '.items | length'
```
## Key Conventions
### Backend
- **All routes are raw SQL** — no ORM. Results are accessed by positional index: `result.result_rows[0][n]`. Column order is determined by the `SELECT` statement.
- **Query parameters** use `%(name)s` dict syntax: `db.query(sql, {"param": value})`.
- **Every router module** defines `router = APIRouter(prefix="/api/<domain>", tags=["..."])` and is registered in `main.py` via `app.include_router(...)`.
- **SPA catch-all** (`/{full_path:path}`) **must remain the last registered route** in `main.py`. New routers must be added with `app.include_router()` before it.
- **IPv4 IPs** are stored as IPv6-mapped (`::ffff:x.x.x.x`) in `src_ip`; queries normalize with `replaceRegexpAll(toString(src_ip), '^::ffff:', '')`.
- **NULL guards** — all row fields are coalesced: `row[n] or ""`, `row[n] or 0`, `row[n] or "LOW"`.
- **`anomaly_score`** can be negative in the DB; always normalize with `abs()` for display.
- **`analysis.py`** stores SOC classifications in a `classifications` ClickHouse table. The `audit_logs` table is optional — routes silently return empty results if absent.
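
The backend conventions above (positional row access, NULL guards, `abs()` on `anomaly_score`, stripping the IPv6-mapped prefix) can be sketched in Python as follows. This is an illustration only, not code from the repository: the column order, the `normalize_ip` and `row_to_detection` helpers, and the field names are assumptions made for the example, and the prefix is stripped in Python here whereas the real queries do it in SQL with `replaceRegexpAll`.

```python
import re
from typing import Any, Sequence

# Hypothetical column order, fixed by an assumed SELECT statement:
#   SELECT src_ip, anomaly_score, threat_level, ja4 FROM ml_detected_anomalies
_IPV4_MAPPED_PREFIX = re.compile(r"^::ffff:", re.IGNORECASE)


def normalize_ip(value: Any) -> str:
    """Strip the IPv6-mapped prefix so '::ffff:1.2.3.4' becomes '1.2.3.4'."""
    return _IPV4_MAPPED_PREFIX.sub("", str(value or ""))


def row_to_detection(row: Sequence[Any]) -> dict:
    """Map one positional result row to a dict, applying the NULL guards."""
    return {
        "src_ip": normalize_ip(row[0]),
        "anomaly_score": abs(row[1] or 0),  # scores may be negative in the DB
        "threat_level": row[2] or "LOW",    # default threat level for NULL
        "ja4": row[3] or "",
    }
```

Because rows are read by index, reordering the `SELECT` list silently changes which value lands in which field, so a helper like this would need to change together with the query.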
### Frontend
- **API calls** use the axios instance from `src/api/client.ts` (baseURL `/api`) or direct `fetch('/api/...')`. There is **no global state manager** — components use `useState`/`useEffect` or custom hooks directly.
- **TypeScript interfaces** in `client.ts` mirror the Pydantic models in `backend/models.py`. Both must be kept in sync when changing data shapes.
- **Tailwind uses semantic CSS-variable tokens** — always use `bg-background`, `bg-background-secondary`, `bg-background-card`, `text-text-primary`, `text-text-secondary`, `text-text-disabled`, `bg-accent-primary`, `threat-critical/high/medium/low` rather than raw Tailwind color classes (e.g., `slate-800`). This ensures dark/light theme compatibility.
- **Threat level taxonomy**: `CRITICAL` > `HIGH` > `MEDIUM` > `LOW` — always uppercase strings; colors: red / orange / yellow / green.
- **URL encoding**: entity values with special characters (JA4 fingerprints, subnets) are `encodeURIComponent`-encoded. Subnets use `_24` in place of `/24` (e.g., `/entities/subnet/141.98.11.0_24`).
- **Recent investigations** are stored in `localStorage` under `soc_recent_investigations` (max 8). Tracked by `RouteTracker` component. Only types `ip`, `ja4`, `subnet` are tracked.
- **Auto-refresh**: metrics every 30 s, incidents every 60 s.
- **French UI text** — all user-facing strings and log messages are in French; code identifiers are in English.
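
The URL-encoding rule for entity values can be sketched in Python as follows. The real encoding happens in the TypeScript frontend with `encodeURIComponent`; the `encode_entity_value` and `decode_entity_value` helpers below are hypothetical and only approximate that behavior with `urllib.parse.quote`, which escapes a few more characters than `encodeURIComponent` does.

```python
from urllib.parse import quote, unquote


def encode_entity_value(entity_type: str, value: str) -> str:
    """Encode an entity value for use as a single URL path segment."""
    if entity_type == "subnet":
        # '/24' would split the path, so the dashboard writes '_24' instead.
        value = value.replace("/", "_")
    return quote(value, safe="")


def decode_entity_value(entity_type: str, segment: str) -> str:
    """Reverse encode_entity_value for one path segment."""
    value = unquote(segment)
    if entity_type == "subnet":
        # Only a trailing '_<digits>' is treated as the prefix length.
        head, sep, prefix = value.rpartition("_")
        if sep and prefix.isdigit():
            value = f"{head}/{prefix}"
    return value
```

The subnet substitution is applied only when the entity type is `subnet`, since JA4 fingerprints legitimately contain underscores.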
### Frontend → Backend in Dev vs Production
- **Dev**: Vite dev server on `:3000` proxies `/api/*` to `http://localhost:8000` (see `vite.config.ts`).
- **Production**: React SPA is served by FastAPI from `frontend/dist/`. API calls hit the same origin at `:8000` — no proxy needed.
### Docker
- Single service using `network_mode: "host"` — no port mapping; the container shares the host network stack.
- Multi-stage Dockerfile: `node:20-alpine` builds the frontend → `python:3.11-slim` installs deps → final image copies both.
## Environment Variables (`.env`)
| Variable | Default | Description |
|---|---|---|
| `CLICKHOUSE_HOST` | `clickhouse` | ClickHouse hostname |
| `CLICKHOUSE_PORT` | `8123` | ClickHouse HTTP port (set in code) |
| `CLICKHOUSE_DB` | `mabase_prod` | Database name |
| `CLICKHOUSE_USER` | `admin` | ClickHouse user name |
| `CLICKHOUSE_PASSWORD` | (empty) | ClickHouse password |
| `API_HOST` | `0.0.0.0` | Uvicorn bind host |
| `API_PORT` | `8000` | Uvicorn bind port |
| `CORS_ORIGINS` | `["http://localhost:3000", ...]` | Allowed origins |
> ⚠️ The `.env` file contains real credentials — never commit it to public repos.
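
As a rough illustration of how the variables in the table map to typed settings, here is a standard-library sketch. It is not the project's `config.py`, which uses pydantic-settings; the `DashboardSettings` class and `load_settings` function are hypothetical names, and only the first `CORS_ORIGINS` default is included because the table elides the rest.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass(frozen=True)
class DashboardSettings:
    """Defaults copied from the environment variable table."""
    clickhouse_host: str = "clickhouse"
    clickhouse_port: int = 8123
    clickhouse_db: str = "mabase_prod"
    clickhouse_user: str = "admin"
    clickhouse_password: str = ""
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    cors_origins: List[str] = field(default_factory=lambda: ["http://localhost:3000"])


def load_settings(env: Mapping[str, str] = os.environ) -> DashboardSettings:
    """Build settings from an environment mapping, falling back to the defaults."""
    d = DashboardSettings()
    return DashboardSettings(
        clickhouse_host=env.get("CLICKHOUSE_HOST", d.clickhouse_host),
        clickhouse_port=int(env.get("CLICKHOUSE_PORT", d.clickhouse_port)),
        clickhouse_db=env.get("CLICKHOUSE_DB", d.clickhouse_db),
        clickhouse_user=env.get("CLICKHOUSE_USER", d.clickhouse_user),
        clickhouse_password=env.get("CLICKHOUSE_PASSWORD", d.clickhouse_password),
        api_host=env.get("API_HOST", d.api_host),
        api_port=int(env.get("API_PORT", d.api_port)),
        # CORS_ORIGINS is a JSON list, e.g. ["http://localhost:3000"]
        cors_origins=json.loads(env["CORS_ORIGINS"]) if "CORS_ORIGINS" in env else d.cors_origins,
    )
```

Passing the mapping explicitly, for example `load_settings({"CLICKHOUSE_PORT": "9000"})`, keeps the function easy to exercise without touching the real environment.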
## ClickHouse Tables
| Table / View | Used by |
|---|---|
| `ml_detected_anomalies` | Primary source for detections, metrics, variability, analysis |
| `view_dashboard_entities` | User agents, client headers, paths, query params (entities routes) |
| `classifications` | SOC analyst classifications (created by `analysis.py`) |
| `mabase_prod.audit_logs` | Audit trail (optional — missing table is handled silently) |
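
The "optional table" behavior noted for `audit_logs` could look roughly like the sketch below. The `query_optional_table` helper and its `run_query` callable are hypothetical; the sketch also assumes that any query failure is treated as "table absent", which is broader than what a real implementation would probably want.

```python
from typing import Any, Callable, List, Sequence

Row = Sequence[Any]


def query_optional_table(run_query: Callable[[str], List[Row]], sql: str) -> List[Row]:
    """Run a query against an optional table, returning no rows if it fails."""
    try:
        return run_query(sql)
    except Exception:
        # Assumption: a failure here means the optional table does not exist.
        return []
```

A stricter variant would inspect the error and swallow only the "unknown table" case, so that connection or syntax errors still surface.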

services/dashboard/.gitignore vendored Normal file

@ -0,0 +1,86 @@
# ═══════════════════════════════════════════════════════════════════════════════
# GITIGNORE - Bot Detector Dashboard
# ═══════════════════════════════════════════════════════════════════════════════
# ───────────────────────────────────────────────────────────────────────────────
# SECURITY - Never commit
# ───────────────────────────────────────────────────────────────────────────────
.env
.env.local
.env.production
*.pem
*.key
secrets/
credentials/
# ───────────────────────────────────────────────────────────────────────────────
# Python
# ───────────────────────────────────────────────────────────────────────────────
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
.pytest_cache/
.coverage
htmlcov/
*.manifest
*.spec
# ───────────────────────────────────────────────────────────────────────────────
# Node.js / Frontend
# ───────────────────────────────────────────────────────────────────────────────
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*
frontend/node_modules/
frontend/dist/
frontend/build/
package-lock.json
yarn.lock
# ───────────────────────────────────────────────────────────────────────────────
# IDE / Editors
# ───────────────────────────────────────────────────────────────────────────────
.idea/
.vscode/
*.swp
*.swo
*~
.DS_Store
Thumbs.db
# ───────────────────────────────────────────────────────────────────────────────
# Logs
# ───────────────────────────────────────────────────────────────────────────────
*.log
logs/
test_output.log
# ───────────────────────────────────────────────────────────────────────────────
# Docker
# ───────────────────────────────────────────────────────────────────────────────
docker-compose.override.yml
*.tar
# ───────────────────────────────────────────────────────────────────────────────
# Temporary documentation
# ───────────────────────────────────────────────────────────────────────────────
# *.md.tmp
# *.md.bak

View File

@ -0,0 +1,203 @@
# SOC audit of the dashboard
## Executive summary
The dashboard is feature-rich (incidents, IP/JA4 investigation, threat intel), but it is **not ready for production SOC use** without hardening.
Key findings:
- **Insufficient access security**: no authentication or RBAC.
- **Inconsistent navigation**: several links point to routes that do not exist.
- **Partial traceability/auditing**: audit logging can be bypassed and sometimes reports "success" even on failure.
- **UX organization needs work** for fast SOC triage (prioritization, workflow, "next actions").
## Audit scope
- React frontend (`frontend/src/App.tsx` plus the navigation and investigation components).
- FastAPI backend (`backend/main.py` plus the `incidents`, `audit`, `entities`, `analysis`, `detections`, and `reputation` routes).
- Project documentation (`README.md`).
## Page map and navigation
### Declared frontend routes
- `/` → `IncidentsView`
- `/threat-intel` → `ThreatIntelView`
- `/detections` → `DetectionsList`
- `/detections/:type/:value` → `DetailsView`
- `/investigation/:ip` → `InvestigationView`
- `/investigation/ja4/:ja4` → `JA4InvestigationView`
- `/entities/subnet/:subnet` → `SubnetInvestigation`
- `/entities/:type/:value` → `EntityInvestigationView`
- `/tools/correlation-graph/:ip` → `CorrelationGraph`
- `/tools/timeline/:ip?` → `InteractiveTimeline`
### Navigation graph (pages)
```mermaid
flowchart LR
A["/ (Incidents)"] --> B["/investigation/:ip"]
A --> C["/entities/subnet/:subnet"]
A --> X["/bulk-classify?ips=... (missing route)"]
A --> T["/threat-intel"]
D["/detections"] --> E["/detections/:type/:value"]
D --> B
E --> B
E --> F["/investigation/ja4/:ja4"]
C --> B
C --> G["/entities/ip/:ip"]
G --> B
G --> F
F --> B
B --> H["/tools/correlation-graph/:ip"]
B --> I["/tools/timeline/:ip?"]
Q["QuickSearch (global + local)"] --> Y["/investigate/... (missing route)"]
Q --> Z["/incidents?threat_level=CRITICAL (missing route)"]
```
### Navigation inconsistencies identified
- `QuickSearch` navigates to `/investigate/...` and `/incidents...`, but these routes do not exist.
- `IncidentsView` links to `/bulk-classify?...`, which has no declared route.
- `DetectionsList` uses `window.location.href` (full page reload) instead of the router.
- Top-level navigation is limited to 2 entries ("Incidents", "Threat Intel"), even though "Detections" is a central SOC view.
- `App.tsx` uses `window.location.pathname` to read `:ip` on some tool routes (fragile, and not idiomatic React Router).
## Security / robustness findings (SOC use)
## Critical
- **No authentication or RBAC** (also confirmed by the README, which describes "local use").
  - SOC impact: analyst actions cannot be attributed correctly, and access is uncontrolled.
- **Potential injection in `entities.py`**:
  - A SQL `IN (...)` clause is built by concatenating values (`ip_values`) instead of binding parameters.
  - Impact: injection surface on the backend.
- **Unreliable audit log**:
  - `/api/audit/logs` accepts a `user` supplied by the request (default `soc_user`).
  - If the audit insert fails, the code still returns `status: success`.
  - Impact: weak non-repudiation, compromised traceability.
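The injection finding above can be illustrated with a minimal sketch. It assumes the ClickHouse client in use supports server-side parameters of the form `{name:Type}` (clickhouse-connect does; whether the shared `ClickHouseClient` wrapper exposes them is an assumption). The function name `build_ip_filter` and the query text are hypothetical, not the code in `entities.py`.

```python
import ipaddress


def build_ip_filter(ip_values):
    """Return (sql, params) with the IP list bound as one array parameter.

    Hypothetical helper: the real query in entities.py is not shown in the audit.
    """
    clean = []
    for raw in ip_values:
        # ip_address() raises ValueError for anything that is not a literal IP,
        # so quotes, parentheses, or SQL keywords never reach the query text.
        clean.append(str(ipaddress.ip_address(raw.strip())))
    if not clean:
        raise ValueError("at least one IP address is required")
    # The values travel separately from the SQL text; only a placeholder is embedded.
    sql = (
        "SELECT src_ip, count() AS hits "
        "FROM ml_detected_anomalies "
        "WHERE toString(src_ip) IN {ips:Array(String)} "
        "GROUP BY src_ip"
    )
    return sql, {"ips": clean}
```

With clickhouse-connect, the pair would typically be passed as `client.query(sql, parameters=params)`. Validating with `ipaddress` first is a second line of defense, not a substitute for binding.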
## High
- **Rate limiting not enforced**:
  - The `RATE_LIMIT_PER_MINUTE` variable exists, but no middleware applies it.
  - Impact: exposure to abuse/DoS and mass scraping.
- **Internal error leakage**:
  - Several endpoints return `detail=f"Erreur: {str(e)}"`.
  - Impact: disclosure of technical details.
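As a rough illustration of what enforcing `RATE_LIMIT_PER_MINUTE` could look like, here is a minimal in-memory sliding-window limiter. The class name and its wiring into a FastAPI middleware are assumptions; a multi-process deployment would need shared state (for example Redis) rather than a per-process dictionary.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within a rolling window."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A middleware would call `allow(client_ip)` per request and answer HTTP 429 when it returns `False`.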
## Medium
- **External dependency for IP reputation** (`ip-api` over HTTP, plus `ipinfo`) without advanced resilience controls (limited operational fallback).
- **Components calling `console.error`/`console.log`** in the production frontend.
- **Partially mocked incident endpoints** (`Implementation en cours`) that can mislead the analyst.
## Page format: what needs to improve
## 1) Visual SOC prioritization
- Standardize severity conventions (color, wording, position).
- Add an "Incidents requiring immediate action" banner at the top of `/`.
- Always display: **level, confidence, impact, last activity, recommended action**.
## 2) Density and readability
- Reduce non-essential emojis in decision areas.
- Switch large tables to a "triage" mode:
  - minimal default columns,
  - sorting by criticality/recency,
  - compact tags with tooltips.
## 3) Explicit analyst workflow
- Introduce standardized CTAs:
  - `Investiguer`, `Escalader`, `Classer`, `Créer IOC`, `Exporter`.
- Add a timeline of SOC actions (who did what, when, and why) directly on the incident/investigation views.
## 4) Operational accessibility
- Consistent keyboard shortcuts (navigation, filters, next incident).
- Explicit empty states with suggested actions.
- Uniform breadcrumbs across all views.
## Information organization: recommendations
## IA) Rethink the navigation information architecture (menu)
Proposed structure:
- **Triage**
  - Incidents (default)
  - Detections
- **Investigation**
  - Entity search
  - IP view
  - JA4 view
  - Subnet
- **Knowledge**
  - Threat Intel
  - Tags/Patterns
- **Administration**
  - Audit logs
  - Platform health
## IB) Normalize routes
- Replace the dead routes (`/investigate`, `/incidents`, and the undeclared `/bulk-classify`) with existing routes, or implement them.
- Avoid `window.location.*` in routed components.
- Centralize paths in a single module (e.g. `routes.ts`) to prevent drift.
## IC) Standardize the page template
Every SOC page should share the same skeleton:
1. Context (title, scope, timestamp).
2. Critical KPIs.
3. Main triage table.
4. Actions panel.
5. Activity log tied to the page.
## Prioritized improvement plan
## Phase 1 (blocking for SOC production)
- Add SSO/OIDC authentication and RBAC (viewer/analyst/admin).
- Fix dead routes and broken navigation.
- Fix the unparameterized SQL query in `entities.py`.
- Make the audit log reliable (identity derived from authentication, explicit failure when the log entry cannot be written).
## Phase 2 (reliability)
- Put effective rate limiting in place.
- Clean up error handling (user-facing messages plus structured server logs).
- Remove `window.location.href` and unify SPA navigation.
## Phase 3 (SOC UX)
- "Triage-first" redesign of the screens (priority, next action, handling time).
- Standardize design tokens and visual hierarchy.
- Add "analyst queue" and "handover" (shift change) views.
## Verdict
The foundation is promising for technical investigation, but an operational SOC first needs to:
1. **Secure access and traceability**.
2. **Make navigation and routes reliable**.
3. **Refocus pages on the SOC triage flow**.
Without these fixes, the main risks are **operational debt** (time lost in triage) and **compliance debt** (insufficient auditability).

View File

@ -0,0 +1,22 @@
FROM node:20-alpine AS frontend-builder
WORKDIR /app/frontend
COPY services/dashboard/frontend/package*.json ./
RUN npm install
COPY services/dashboard/frontend/ ./
RUN npm run build
FROM python:3.11-slim AS backend
WORKDIR /app
COPY shared/python/ja4_common/ /app/shared/ja4_common/
RUN pip install --no-cache-dir /app/shared/ja4_common/
COPY services/dashboard/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY services/dashboard/backend/ ./backend/
FROM python:3.11-slim
WORKDIR /app
COPY --from=backend /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=backend /app/backend ./backend
COPY --from=frontend-builder /app/frontend/dist ./frontend/dist
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]

View File

@ -0,0 +1,10 @@
FROM python:3.11-slim
WORKDIR /app
COPY shared/python/ja4_common/ /app/shared/ja4_common/
RUN pip install --no-cache-dir /app/shared/ja4_common/
COPY services/dashboard/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir pytest pytest-mock httpx
COPY services/dashboard/backend/ ./backend/
COPY services/dashboard/backend/tests/ ./backend/tests/
CMD ["pytest", "backend/tests/", "-v"]

View File

@ -0,0 +1,242 @@
# Final Report: SOC Bot Detector Dashboard
**Date:** 2026-03-16
**Commits:** `8032eba` (bug fixes), `d4c3512` (improvements)
---
## 1. Bug fixes (commit 8032eba)
| Bug | Cause | Fix |
|-----|-------|-----|
| Brute Force > Attaquants: IPs displayed as `::ffff:x.x.x.x` | No IPv6 normalization in the SQL query | Added `replaceRegexpAll(toString(src_ip), '^::ffff:', '')` |
| Brute Force > Cibles: "Voir détails" link led to a nonexistent page | Navigation to `/investigation/{host}` (a hostname) instead of an IP | Replaced with a `TargetRow` component that expands the attackers per host inline |
| Header Fingerprint: detail table always empty | Frontend read `data.ips` instead of `data.items` | Corrected the key |
| Heatmap Temporelle: "Top hosts ciblés" empty | Frontend read `data.hosts`, plus a TypeScript type error `{ hosts: TopHost[] }` | Corrected the key to `data.items` and fixed the type annotation |
| Botnets Distribués: clicking a row showed nothing | Frontend read `data.countries` instead of `data.items` | Corrected the key |
| Rotation & Persistance: IPs shown as `::ffff:` and history always empty | No normalization, and the frontend read `data.history` instead of `data.ja4_history` | SQL normalization and corrected key |
| TCP Spoofing: spoofing detected without TTL correlation | Python-side filter applied to data already filtered on TTL=3031 | `spoof_only` SQL filter moved to the ClickHouse side |
---
## 2. Improvements implemented (commit d4c3512)
### J: Multi-source IP summary
- **Endpoint:** `GET /api/investigation/{ip}/summary`
- **Widget:** `IPActivitySummary` at the top of every IP investigation page
- **Data:** ML + bruteforce + TCP spoofing + JA4 rotation + persistence + 24h timeline
- **Risk score:** 0–100 (colored SVG gauge)
- **Result:** Immediate context at a glance, without navigating across 6 pages
### I: Baseline comparison, last 24h vs. previous day
- **Endpoint:** `GET /api/metrics/baseline`
- **Widget:** 3 cards (Détections 24h, IPs uniques, CRITICAL) with ▲▼ variation in %
- **Impact:** Surfaces abnormal spikes immediately (e.g. +246% detections observed)
### M-4: Adversary sophistication score
- **Endpoint:** `GET /api/rotation/sophistication`
- **Computation:** 3-table JOIN (JA4 rotation × 10 + recurrence × 20 + log(bruteforce+1) × 5)
- **Tiers:** APT-like / Advanced / Automated / Basic
- **Result:** Prioritizes the most urgent investigations
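The M-4 formula can be sketched as follows. The weights come from the report; the cap at 100, the use of the natural logarithm, and the tier thresholds are assumptions made for illustration, since the report does not state them.

```python
import math

# Hypothetical thresholds: the report names the tiers but not their boundaries.
TIERS = [(80, "APT-like"), (50, "Advanced"), (20, "Automated")]


def sophistication_score(ja4_rotation, recurrence, bruteforce_hits):
    """JA4 rotation x 10 + recurrence x 20 + log(bruteforce + 1) x 5."""
    raw = ja4_rotation * 10 + recurrence * 20 + math.log(bruteforce_hits + 1) * 5
    return min(100.0, raw)  # assumed cap, consistent with the 0-100 scores reported


def sophistication_tier(score):
    for threshold, name in TIERS:
        if score >= threshold:
            return name
    return "Basic"
```

With these weights, ten JA4 rotations alone already reach the cap, so the rotation term dominates unless the inputs are normalized first.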
### M-7: Proactive hunting (low-and-slow)
- **Endpoint:** `GET /api/rotation/proactive-hunt`
- **Logic:** Recurring IPs with `abs(anomaly_score) < 0.5`, i.e. flying under the ML radar
- **Assessment:** "Évadeur potentiel" (recurrence/score ratio > 10) or "Persistant modéré"
- **Impact:** Detects low-and-slow botnets that the ML model under-scores
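A minimal sketch of the assessment rule, under stated assumptions: the ratio is taken as recurrence divided by the absolute score, "recurring" is read as at least 2 occurrences, and a zero score is treated as the strongest evasion signal. None of these details are confirmed by the report.

```python
def hunt_assessment(recurrence, anomaly_score, max_abs_score=0.5, evader_ratio=10.0):
    """Return the assessment label, or None if the IP is out of scope for the hunt."""
    magnitude = abs(anomaly_score)
    # Assumed scope: at least two occurrences and a score the model considers weak.
    if recurrence < 2 or magnitude >= max_abs_score:
        return None
    # Avoid dividing by zero: a zero score with recurrence is the extreme case (assumption).
    ratio = float("inf") if magnitude == 0 else recurrence / magnitude
    return "Évadeur potentiel" if ratio > evader_ratio else "Persistant modéré"
```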
### M-2: Inline ASN reputation badge
- **Change:** LEFT JOIN on `asn_reputation` in the detections query
- **Badge:** Red (malicious/bot/scanner), orange (proxy/vpn), green (human)
- **Limitation:** The `asn_reputation` table holds 36 French ASNs (legitimate ISPs); known malicious ASNs are not cataloged yet
---
## 3. Exhaustive Playwright tests
| Page | Result | Notes |
|------|--------|-------|
| Main dashboard | ✅ | Baseline ▲ +246.5% detections, ▲ +11.6% IPs, = CRITICAL |
| Détections | ✅ | ASN badge displayed (null for ASNs missing from the reputation table) |
| Investigation IP (162.55.94.175) | ✅ | Score 38, TCP Spoof TTL 59, JA4 Rotation 9 sig |
| Rotation > Sophistication | ✅ | APT-like: 162.55.94.175 (score 100), 46.4.81.149 (score 100) |
| Rotation > Chasse proactive | ✅ | IPs with negative scores under the ML radar |
| Brute Force > Attaquants | ✅ | Clean IPs (no `::ffff:`) |
| Brute Force > Cibles | ✅ | Inline expansion of attackers per host |
| Header Fingerprint | ✅ | Detail table populated on click |
| Heatmap Temporelle | ✅ | "Top hosts ciblés" displayed |
| Botnets Distribués | ✅ | Country detail on click |
| TCP Spoofing | ✅ | `spoof_only` filter works |
---
## 4. Problem areas and improvement directions
### 🔴 Critical
1. **Incomplete `asn_reputation` table**: only 36 entries (French ISPs). To be useful, it should include the ASNs of known datacenters, VPS providers, and proxies (OVH, DigitalOcean, AWS, Linode, etc.). Suggested sources: AbuseIPDB ASN database, IPInfo, Maxmind.
2. **Proactive hunting and negative scores**: `view_ip_recurrence.worst_score` stores the raw score, which can be negative. The condition `abs(score) < 0.5` captures HIGH IPs with a score of -0.18 that ML has already detected. Filtering on threat level (`worst_threat_level NOT IN ('HIGH', 'CRITICAL')`) would be needed to isolate the cases that are truly under the radar.
3. **No persistence for SOC classifications**: manual classifications (`/api/analysis/classify`) only last for the session if the `classifications` table has not been created. A DB init script would help.
### 🟡 Medium
4. **Biased sophistication score**: IPs with heavy JA4 rotation but `recurrence=0` in `view_ip_recurrence` (i.e. absent from it) still reach a score of 100. The two views do not always cover the same time period.
5. **24h timeline in the IP summary**: uses `window_start >= now() - INTERVAL 24 HOUR` on `agg_host_ip_ja4_1h`. With less than 24h of history, the chart will be partial or empty. The window should adapt dynamically to the available data.
6. **Heatmap Temporelle**: the endpoint aggregates `agg_host_ip_ja4_1h` over the last 24h only. A time-range selector (7d, 30d) would reveal cyclic wave patterns (weekly botnets).
7. **No export of results**: SOC analysts cannot export lists of malicious IPs (CSV, STIX). An endpoint such as `GET /api/rotation/sophistication?format=csv` would be useful for IOC sharing.
### 🟢 Minor
8. **"Investiguer" in RotationView does not carry context**: clicking "Investiguer" from the Sophistication tab navigates to `/investigation/{ip}` without preloading the context of the source tab. Adding `?source=sophistication&score=100` to the URL would allow a contextual banner.
9. **Tabs missing from the sidebar**: the 7 advanced-analysis dashboards are not organized into submenus. With the Sophistication and Chasse proactive tabs added under Rotation, the sidebar is getting long.
10. **ASN badge does not filter detections**: there is no "show only malicious ASNs" filter in the detections list yet.
---
## 5. Architecture: points to watch
- The **SPA catch-all** (`/{full_path:path}`) must remain **the last router** in `main.py`
- The `/api/investigation/{ip}/summary` endpoint uses the `/api/investigation` prefix, which is distinct from, and therefore compatible with, the SPA route `/investigation/:ip`
- **Negative scores** in `anomaly_score` and `worst_score` are normal; always use `abs()` for display
- **IPv4-mapped IPv6 addresses** (`::ffff:x.x.x.x`) appear in all aggregated views; always apply `replaceRegexpAll(toString(src_ip), '^::ffff:', '')`
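When the same normalization is needed on the Python side (for example on values that did not pass through the SQL expression), the standard library can do it without a regex. This is a sketch; the helper name `normalize_ip` is hypothetical.

```python
import ipaddress


def normalize_ip(value):
    """Return the IPv4 form of an IPv4-mapped IPv6 address, else the canonical address."""
    text = str(value).strip()
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return text  # not an IP literal: leave untouched rather than guess
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    return str(addr)
```

Unlike the `'^::ffff:'` regex, this only strips the prefix when the remainder really is an embedded IPv4 address.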
---
# Report: v2.0.0, Multi-Signal TCP Fingerprinting + IP Clustering
**Date:** 2026-03-19
**Commit:** `e2db8ca`
---
## 1. Improved TCP OS fingerprinting
### Initial problem
The old `tcp_spoofing.py` relied on TTL alone, with 3 coarse ranges (≤64 = Linux, ≤128 = Windows, otherwise = Network). The result: false positives and no detection of scanner bots.
### Implemented solution
**`backend/services/tcp_fingerprint.py`**: 20 OS signatures, multi-signal scoring:
| Signal | Weight | ClickHouse source |
|--------|--------|-------------------|
| Initial TTL (estimated) | 40% | `tcp_ttl_raw` |
| MSS | 30% | `tcp_mss_raw` |
| TCP window | 20% | `tcp_win_raw` |
| Scale factor | 10% | `tcp_scale_raw` |
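The weighting in the table can be sketched as below. Only the four weights come from the report; the `Signature` type, the exact-match rule, and the initial-TTL buckets (32/64/128/255) are assumptions. The real `tcp_fingerprint.py` evidently allows partial matches, since it reports 97% for Masscan where this sketch would yield 100.

```python
from dataclasses import dataclass

# Weights from the report, in percentage points.
WEIGHTS = {"ttl": 40, "mss": 30, "win": 20, "scale": 10}


@dataclass(frozen=True)
class Signature:  # hypothetical shape of one OS signature
    name: str
    initial_ttl: int
    mss: int
    win: int
    scale: int


def estimate_initial_ttl(observed_ttl):
    """Round the observed TTL up to the nearest common initial value (assumed buckets)."""
    for initial in (32, 64, 128, 255):
        if observed_ttl <= initial:
            return initial
    return 255


def hop_count(observed_ttl):
    return estimate_initial_ttl(observed_ttl) - observed_ttl


def confidence(sig, ttl_raw, mss_raw, win_raw, scale_raw):
    """Sum the weights of the signals that match the signature exactly (0-100)."""
    score = 0
    if estimate_initial_ttl(ttl_raw) == sig.initial_ttl:
        score += WEIGHTS["ttl"]
    if mss_raw == sig.mss:
        score += WEIGHTS["mss"]
    if win_raw == sig.win:
        score += WEIGHTS["win"]
    if scale_raw == sig.scale:
        score += WEIGHTS["scale"]
    return score
```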
**Detections validated in production:**
- **Masscan**: `win=5808, mss=1452, scale=4, TTL 48–57` → **97%** confidence
- **Googlebot**: Windows stack detected with an Android UA → **confirmed spoof**
- **Bot-tool**: `risk_score += 30` (vs. +15 for a simple spoof)
**MSS → network path:**
- 1460 → standard Ethernet
- 1452 → PPPoE / DSL (Masscan pattern)
- 1420–1452 → probable VPN
- < 1420 → tunnel / double encapsulation
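The MSS mapping reads naturally as a small lookup. In this sketch the overlap at 1452 is resolved in favor of PPPoE (the exact values are checked first), and MSS values above 1452 other than 1460 are labeled unknown; both choices are assumptions, as the report does not say how the service handles them.

```python
def mss_network_path(mss):
    """Map a TCP MSS value to the probable network path."""
    if mss == 1460:
        return "Ethernet"
    if mss == 1452:
        return "PPPoE / DSL"
    if 1420 <= mss < 1452:
        return "VPN (probable)"
    if mss < 1420:
        return "Tunnel / double encapsulation"
    return "Unknown"  # assumption: values not covered by the report
```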
**Files changed:**
- `backend/services/tcp_fingerprint.py` (new)
- `backend/routes/tcp_spoofing.py` (complete rewrite, queries `agg_host_ip_ja4_1h`)
- `backend/routes/investigation_summary.py` (uses the tcp_fingerprint service)
- `frontend/src/components/TcpSpoofingView.tsx` (new MSS/scale/confidence columns, MSS distribution chart)
---
## 2. Multi-metric IP clustering
### Initial problem
The first version of the clustering relied only on rules over TCP properties. The user asked for **all available metrics** to be used.
### Implemented solution
**`backend/services/clustering_engine.py`**: pure-Python K-means++ (no ML dependencies):
**21 features normalized to [0,1]:**
| Category | Features |
|-----------|----------|
| TCP stack (4) | initial TTL, MSS, scale, window |
| ML anomaly (6) | score, velocity, fuzzing, headless, POST ratio, zero IP-ID |
| TLS/Protocol (5) | ALPN mismatch, missing ALPN, H2 efficiency, header order, UA-CH mismatch |
| Browser (1) | modern-browser score (normalized /50) |
| Temporal (3) | entropy, JA4 diversity (log1p), rotating UA |
| Behavior (2) | asset ratio, direct-access ratio |
**Algorithm:**
```
K-means++  : O(k·n) init, n_init=3, best inertia kept
Power iter : X^T(Xv) trick, O(n·d) per iteration, no n×n matrix
Deflation  : Hotelling for PC2 after extracting PC1
```
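The power-iteration and deflation steps summarized above can be sketched in pure Python. This is an illustrative reconstruction, not the code in `clustering_engine.py`: function names, the iteration count, and the seeding are assumptions. Deflation is applied to the data rows (removing each row's component along PC1), which has the same effect on the covariance as Hotelling deflation once PC1 has converged.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v] if norm > 0 else v


def _cov_times(rows, v):
    """Compute X^T (X v) row by row, without building a covariance matrix."""
    out = [0.0] * len(v)
    for row in rows:
        proj = _dot(row, v)
        for j, a in enumerate(row):
            out[j] += proj * a
    return out


def top_component(rows, iters=200, seed=0):
    """Power iteration: repeated multiplication converges to the dominant direction."""
    rng = random.Random(seed)
    v = _normalize([rng.uniform(-1, 1) for _ in rows[0]])
    for _ in range(iters):
        v = _normalize(_cov_times(rows, v))
    return v


def pca_2d(points, iters=200):
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    pc1 = top_component(centered, iters, seed=0)
    # Deflation: subtract each row's projection on PC1 so PC2 becomes dominant.
    deflated = []
    for row in centered:
        proj = _dot(row, pc1)
        deflated.append([a - proj * b for a, b in zip(row, pc1)])
    pc2 = top_component(deflated, iters, seed=1)
    coords = [(_dot(row, pc1), _dot(row, pc2)) for row in centered]
    return coords, pc1, pc2
```

Each iteration costs O(n·d), which is why no large intermediate matrix is needed.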
**Sampling strategy:** `ORDER BY avg(abs(anomaly_score)) DESC`, so that bots (high score) are included first even when their individual hit counts are low (the Masscan case).
**Production results (k=14, 3000 IPs):**
- **289 confirmed bots**: rotating-UA + UA-CH mismatch clusters (cloud providers: Microsoft, Google, Akamai)
- **655 suspect IPs**: moderate ML anomaly or inconsistent UA-CH
- **Dominant ASNs**: MICROSOFT-CORP-MSN-AS-BLOCK, GOOGLE-CLOUD-PLATFORM, OVH, AMAZON
- **Compute time**: ~59 seconds (pure Python, 3000 points × 21 features)
---
## 3. Redesigned clustering visualization
### Initial problem
The first version used ReactFlow bubbles positioned by PCA. The user reported that **"the graph display is unreadable"**.
### Implemented solution
**Two distinct views, accessible through tabs:**
#### ⊞ Tableau de bord (default, always readable)
- Grid of cards grouped by risk level
- **Bots & Menaces confirmées** (red) → **Suspects** (orange) → **Légitimes** (green)
- Each card: label + IP count + hits + CRITIQUE/ÉLEVÉ/MODÉRÉ/SAIN badge + 4 mini-bars + TCP stack + country + ASN
#### ⬡ Graphe de relations
- ReactFlow card nodes (220px, fully readable text)
- **Columns by threat level** (deterministic layout, not PCA)
- Colored edges: orange=similar, gray=distant, animated=very strong
- Built-in legend, minimap, zoom controls
#### Detail sidebar
- Behavioral RadarChart (10 axes)
- All metrics with progress bars
- List of IPs with threat/country badges
- Export: **Copier IPs** + **CSV**
- Integrated into the flex flow (no longer blocks the control bar)
**Files changed:**
- `backend/routes/clustering.py` (complete rewrite)
- `backend/services/clustering_engine.py` (new, thresholds calibrated on real data)
- `frontend/src/components/ClusteringView.tsx` (complete rewrite)
- `frontend/src/App.tsx` (`/clustering` route + "🔬 Clustering IPs" nav entry)
---
## 4. Points of attention
### Performance
- K-means++ on 3000 × 21: **59s** (acceptable, but no result cache is implemented)
- The in-memory drill-down cache (`_cache["cluster_ips"]`) is volatile: a reload triggers a recomputation
- To improve: a Redis cache or a 5-minute TTL cache (note that `functools.lru_cache` has no built-in expiry, so a TTL needs a time-bucket argument or a small wrapper)
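One way to get the 5-minute TTL mentioned above without Redis is a small wrapper around a dictionary. This is a single-process sketch; the class name and the `compute` callback are hypothetical.

```python
import time


class TTLCache:
    """Cache values per key and recompute them once they are older than the TTL."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for tests
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = compute()  # e.g. the ~59 s clustering run
        self._store[key] = (now, value)
        return value
```

A natural cache key for the clustering endpoint would be the `(k, n_samples)` pair, so that different parameter sets do not overwrite each other.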
### Threshold calibration
The thresholds in `name_cluster()` and `risk_score_from_centroid()` are calibrated on observed data:
- `anomaly_score` in production: range 0.2–0.35 (not 0–1 as expected)
- Displayed normalized score: `min(1, score / 0.5)` to stretch the useful range
- UA-CH mismatch = 1.0 on bot clusters is a **very strong** signal (cloud providers simulating a browser)
### Missing data in the LEFT JOIN
Some IPs do not appear in `ml_detected_anomalies` (score=0, fuzz=0). These are legitimate IPs that the ML model did not flag. They naturally form the "Trafic Légitime" clusters.
### Fuzzing_index = 100% in many clusters
After analysis: the log-normalized `fuzzing_index` often exceeds the 100% threshold because the raw values vary widely (0 to 229+). This is not a bug; it reflects modern web traffic (many requests with varied paths).

View File

@ -0,0 +1,672 @@
# 🛡️ Bot Detector Dashboard
Interactive web dashboard for visualizing and investigating the classification decisions of the Bot Detector AI.
**Version:** 2.0.0 - Multi-Signal TCP Fingerprinting + Multi-Metric IP Clustering
## 🚀 Quick Start
### Prerequisites
- Docker and Docker Compose
- The `clickhouse` service already deployed
- Data in the `ml_detected_anomalies` table
- Data in the `http_logs` table (for user agents)
> **Note:** The dashboard can run independently of `bot_detector_ai`. It reads detections already stored in ClickHouse.
### Launch
```bash
# 1. Make sure .env exists
cp .env.example .env # If not already done
# 2. Start the dashboard (with Docker Compose v2)
docker compose up -d dashboard_web
# Or with the legacy syntax
docker-compose up -d dashboard_web
# 3. Open the dashboard (FastAPI serves the built frontend on port 8000)
# http://localhost:8000
```
### Stop
```bash
docker compose stop dashboard_web
```
### Check status
```bash
# List running services
docker compose ps
# Follow logs in real time
docker compose logs -f dashboard_web
```
## 📊 Features
### Main dashboard
- **Real-time metrics**: total detections, threats, known bots, unique IPs
- **D-1 baseline comparison**: ▲▼ variation vs. yesterday (detections, unique IPs, CRITICAL)
- **Breakdown by threat level**: CRITICAL/HIGH/MEDIUM/LOW visualization
- **Trend over time**: chart of detections over 24h
- **Clustered incidents**: automatic grouping by /24 subnet
- **Top active threats**: top 10 most dangerous IPs
### 🧬 TCP Spoofing & OS Fingerprinting (improved in v2.0)
- **Multi-signal detection**: initial TTL + MSS + scale + TCP window (p0f-style)
- **20 OS signatures**: Linux, Windows, macOS, Android, iOS, Masscan, ZMap, Shodan, Googlebot…
- **Hop-count estimate**: initial TTL (rounded) minus observed TTL
- **Network detection**: MSS → Ethernet (1460) / PPPoE (1452) / VPN (1420) / Tunnel (<1420)
- **Confidence 0–100%**: weighted score (TTL 40% + MSS 30% + window 20% + scale 10%)
- **Bot-tool badge**: Masscan detected at 97% (win=5808, mss=1452, scale=4)
- **MSS distribution**: histogram of observed MSS values per cluster
### 🔬 Multi-Metric IP Clustering (new in v2.0)
- **URL:** `/clustering`
- **Algorithm:** K-means with k-means++ seeding (Arthur & Vassilvitskii, 2007), 3 runs
- **21 features normalized to [0,1]:**
  - TCP stack: initial TTL, MSS, scale, TCP window
  - ML anomaly: score, velocity, fuzzing, headless, POST ratio, zero IP-ID
  - TLS/Protocol: ALPN mismatch, missing ALPN, H2 efficiency (multiplexing)
  - Browser: modern-browser score, header order, UA-CH mismatch
  - Temporal: entropy, JA4 diversity, rotating UA
  - Behavior: asset ratio, direct-access ratio
- **2D positioning:** PCA by power iteration, with Hotelling deflation
- **Automatic naming:** Masscan / Bot UA Rotatif / Bot Fuzzer / Anomalie ML / Linux / Windows / VPN
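The k-means++ seeding step named above can be sketched in a few lines of pure Python. This is an illustration of the published technique, not the project's `clustering_engine.py`; the function names are hypothetical, and the subsequent Lloyd iterations and the 3-run restart are omitted.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, seed=0):
    """Pick k initial centers; each new one is drawn with probability proportional to D^2."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center: fall back to a uniform draw.
            chosen = rng.choice(points)
        else:
            chosen = rng.choices(points, weights=d2, k=1)[0]
        centers.append(list(chosen))
        d2 = [min(old, sq_dist(p, chosen)) for p, old in zip(points, d2)]
    return centers
```

Points far from every existing center are the most likely to be chosen next, which spreads the initial centers across the data and usually reduces the number of iterations needed afterwards.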
**Tableau de bord view (default):**
- Grid of grouped cards: confirmed bots → suspects → legitimate
- Each card: label, IP count, hits, CRITIQUE/ÉLEVÉ/MODÉRÉ/SAIN badge
- 4 mini-bars: anomaly, UA-CH mismatch, fuzzing, rotating UA
- TCP stack (TTL, MSS, Scale), top countries, ASN
**Graphe de relations view:**
- ReactFlow card nodes (220px, readable text)
- Columns by threat level: Bots | Suspects | Légitimes
- Edges colored by similarity (orange=strong, animated=very strong)
- Built-in legend, minimap, zoom controls
**Detail sidebar:**
- Behavioral RadarChart (10 axes: anomaly, UA-CH, fuzzing, headless…)
- All metrics with colored progress bars
- List of IPs with threat/country/ASN badges
- Export: **Copier IPs** + **CSV**
### /24 Subnet Investigation
- **URL:** `/entities/subnet/x.x.x.x_24`
- Global stats, table of IPs, per-IP actions
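The `x.x.x.x_24` URL form replaces the `/` of CIDR notation with `_` so the subnet fits in a single path segment. A sketch of converting in both directions with the standard library follows; the helper names are hypothetical, and rejecting addresses with host bits set is a design choice of this sketch rather than documented backend behavior.

```python
import ipaddress


def parse_subnet_param(value):
    """Turn '141.98.11.0_24' into an ip_network, rejecting malformed input."""
    address, sep, prefix = value.rpartition("_")
    if not sep:
        raise ValueError(f"expected '<address>_<prefix>', got {value!r}")
    # strict=True raises ValueError when host bits are set (e.g. 141.98.11.7_24).
    return ipaddress.ip_network(f"{address}/{prefix}", strict=True)


def to_subnet_param(ip, prefix=24):
    """Build the URL segment for the subnet that contains `ip`."""
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return f"{network.network_address}_{network.prefixlen}"
```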
### IP Investigation + Reputation
- **URL:** `/investigation/:ip`
- Multi-source summary (ML + bruteforce + TCP + JA4 + timeline)
- Risk score 0–100, IP-API + IPinfo reputation
### Investigation (Variability)
- User agents, JA4 fingerprints, countries, ASN, hosts, threat levels
- Automatic insights, chainable navigation
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Docker Compose │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ ClickHouse │ │ bot_detector│ │ dashboard_web │ │
│ │ :8123 │ │ (existant) │ │ :8000 (web+API)│ │
│ │ :9000 │ │ │ │ network=host │ │
│ └──────┬──────┘ └──────┬──────┘ └────────┬────────┘ │
│ └────────────────┴───────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
> The container uses `network_mode: "host"`: the built frontend is served by FastAPI
> on **port 8000 only** (no port 3000 in production).
### Components
| Component | Technology | Description |
|-----------|-------------|-------------|
| **Frontend** | React 18 + TypeScript 5 + Vite 5 + Tailwind CSS 3 | User interface (SPA) |
| **Backend API** | FastAPI 0.111 + Python 3.11 | REST API + static SPA server |
| **Database** | ClickHouse (existing), port 8123 | Primary database |
| **Clustering** | Pure-Python K-means++ + power-iteration PCA | Embedded algorithms, no ML dependency |
## 📁 Structure
```
dashboard/
├── Dockerfile                        # Multi-stage: node:20-alpine → python:3.11-slim
├── docker-compose.yaml
├── requirements.txt
├── backend/
│   ├── main.py                       # FastAPI: CORS, routers, SPA catch-all (must be LAST)
│   ├── config.py                     # pydantic-settings, reads .env
│   ├── database.py                   # ClickHouseClient singleton (db)
│   ├── models.py                     # Pydantic v2 models
│   ├── routes/
│   │   ├── metrics.py                # GET /api/metrics, /api/metrics/baseline
│   │   ├── detections.py             # GET /api/detections
│   │   ├── variability.py            # GET /api/variability
│   │   ├── attributes.py             # GET /api/attributes
│   │   ├── incidents.py              # GET /api/incidents/clusters
│   │   ├── entities.py               # GET /api/entities
│   │   ├── analysis.py               # GET/POST /api/analysis (SOC classifications)
│   │   ├── reputation.py             # GET /api/reputation/ip/{ip}
│   │   ├── tcp_spoofing.py           # GET /api/tcp-spoofing (multi-signal OS fingerprinting)
│   │   ├── clustering.py             # GET /api/clustering/clusters + /cluster/{id}/ips
│   │   └── investigation_summary.py  # GET /api/investigation/{ip}/summary
│   └── services/
│       ├── tcp_fingerprint.py        # 20 OS signatures, scoring, hop count, network path
│       ├── clustering_engine.py      # K-means++, 2D PCA, naming, risk score (pure Python)
│       └── reputation_ip.py          # httpx → ip-api.com + ipinfo.io (async, no API key)
└── frontend/
    ├── package.json
    ├── vite.config.ts                # Proxies /api → :8000 in dev
    └── src/
        ├── App.tsx                   # BrowserRouter + Sidebar + TopHeader + Routes
        ├── ThemeContext.tsx          # dark/light/auto, localStorage: soc_theme
        ├── api/client.ts             # Axios baseURL=/api + all TypeScript interfaces
        ├── components/
        │   ├── ClusteringView.tsx    # K-means++ clustering (2 views)
        │   ├── TcpSpoofingView.tsx   # TCP OS fingerprinting
        │   ├── InvestigationView.tsx # Full IP investigation
        │   └── ...                   # Other views
        ├── hooks/                    # useMetrics, useDetections, useVariability (polling)
        └── utils/STIXExporter.ts
```
## 🔌 API
### Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/metrics` | Global metrics |
| GET | `/api/metrics/baseline` | D-1 comparison (detections, IPs, CRITICAL) |
| GET | `/api/metrics/threats` | Distribution by threat level |
| GET | `/api/detections` | Paginated list of detections |
| GET | `/api/detections/{id}` | Details of one detection |
| GET | `/api/variability/{type}/{value}` | Variability of an attribute |
| GET | `/api/attributes/{type}` | Unique values of an attribute |
| GET | `/api/incidents/clusters` | Incidents clustered by /24 subnet |
| GET | `/api/entities/subnet/{subnet}` | Subnet investigation (e.g. `141.98.11.0_24`) |
| GET | `/api/entities/{type}/{value}` | Entity investigation (IP, JA4, UA…) |
| GET | `/api/reputation/ip/{ip}` | IP reputation (IP-API + IPinfo) |
| GET | `/api/investigation/{ip}/summary` | Multi-source IP summary (ML + TCP + JA4) |
| GET | `/api/analysis/{ip}/subnet` | Subnet / ASN analysis |
| GET | `/api/analysis/{ip}/recommendation` | Classification recommendation |
| POST | `/api/analysis/classifications` | Save a SOC classification |
| GET | `/api/tcp-spoofing/overview` | TCP spoofing + OS overview |
| GET | `/api/tcp-spoofing/list` | List of TCP spoofing detections |
| GET | `/api/tcp-spoofing/matrix` | Declared OS vs. actual OS matrix |
| GET | `/api/clustering/clusters` | K-means++ clustering (`?k=14&n_samples=3000`) |
| GET | `/api/clustering/cluster/{id}/ips` | IPs in a cluster (drill-down) |
| GET | `/health` | Health check |
### Examples
```bash
# Health check
curl http://localhost:8000/health
# Global metrics + baseline
curl http://localhost:8000/api/metrics | jq '.summary'
curl http://localhost:8000/api/metrics/baseline | jq
# CRITICAL detections
curl "http://localhost:8000/api/detections?threat_level=CRITICAL&page=1" | jq '.items | length'
# TCP Spoofing overview
curl http://localhost:8000/api/tcp-spoofing/overview | jq
# IP clustering (14 clusters over 3000 samples)
curl "http://localhost:8000/api/clustering/clusters?k=14&n_samples=3000" | jq '.stats'
# Drill down into one cluster
curl "http://localhost:8000/api/clustering/cluster/c0_k14/ips?limit=20" | jq '.ips[].ip'
# IP reputation
curl http://localhost:8000/api/reputation/ip/162.55.94.175 | jq
```
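When calling these endpoints from a script rather than from curl, the URLs can be built programmatically. The sketch below is only an illustration: `BASE_URL`, `detections_url`, and `subnet_url` are hypothetical names, and the `_` separator for subnets is inferred from the `141.98.11.0_24` example in the endpoint table.

```python
from ipaddress import ip_network
from urllib.parse import urlencode

# Assumption: the API is reachable on the port used in the curl examples.
BASE_URL = "http://localhost:8000"


def detections_url(threat_level: str, page: int = 1, page_size: int = 25) -> str:
    """Build the paginated detections URL with a threat-level filter."""
    query = urlencode({"threat_level": threat_level, "page": page, "page_size": page_size})
    return f"{BASE_URL}/api/detections?{query}"


def subnet_url(cidr: str) -> str:
    """Build the subnet investigation URL, replacing '/' with '_' in the CIDR."""
    net = ip_network(cidr, strict=False)  # strict=False accepts a host address
    return f"{BASE_URL}/api/entities/subnet/{net.network_address}_{net.prefixlen}"
```

Passing any address inside the subnet is enough: `subnet_url("141.98.11.77/24")` normalizes it to the network address before building the path.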
## ⚙️ Configuration
### Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `CLICKHOUSE_HOST` | `clickhouse` | ClickHouse host |
| `CLICKHOUSE_PORT` | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | `mabase_prod` | Database |
| `CLICKHOUSE_USER` | `admin` | User |
| `CLICKHOUSE_PASSWORD` | (empty) | Password |
| `API_HOST` | `0.0.0.0` | Uvicorn bind address |
| `API_PORT` | `8000` | API + frontend port |
| `CORS_ORIGINS` | `["http://localhost:3000", ...]` | Allowed CORS origins |
These variables are read from the `.env` file at the project root.
> ⚠️ The `.env` file contains real credentials: never commit it.
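To make the defaults concrete, here is a minimal stdlib-only sketch of how the ClickHouse variables resolve. It is not the project's loader: `load_clickhouse_settings` and `clickhouse_http_url` are hypothetical helpers, and the sketch reads the process environment only (it does not parse a `.env` file).

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "CLICKHOUSE_HOST": "clickhouse",
    "CLICKHOUSE_PORT": "8123",
    "CLICKHOUSE_DB": "mabase_prod",
    "CLICKHOUSE_USER": "admin",
    "CLICKHOUSE_PASSWORD": "",
}


def load_clickhouse_settings(env=None) -> dict:
    """Return the ClickHouse settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["CLICKHOUSE_PORT"] = int(settings["CLICKHOUSE_PORT"])
    return settings


def clickhouse_http_url(settings: dict) -> str:
    """Base URL of the ClickHouse HTTP interface for these settings."""
    return f"http://{settings['CLICKHOUSE_HOST']}:{settings['CLICKHOUSE_PORT']}"
```

Passing an explicit mapping instead of `os.environ` keeps the helper easy to test without touching real credentials.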
## 🔍 Investigation workflows
### Example 1: Identify a Masscan bot
1. **🔬 IP clustering** → The "🤖 Masscan / Scanner IP" cluster is shown in red
2. **Click the card** → Sidebar: TTL=52, MSS=1452, Scale=4, which is the Masscan pattern
3. **Copy the IPs** → List ready for blocking
4. **CSV export** → Import into the SIEM or firewall
### Example 2: Analyze UA-rotating bots (cloud)
1. **Clustering** → "🤖 Bot UA Rotatif + CH Mismatch" cluster (50% risk)
2. **RadarChart** → UA-CH=100%, rotating UA=100%, anomaly=59%
3. **Top ASNs** → Microsoft, Google, Akamai (cloud providers)
4. **🧬 TCP Spoofing** → Confirm: these IPs declare a Windows UA but have a Linux TTL
5. **IP investigation** → Full detail with a 24h timeline
### Example 3: Detect OS spoofing
1. **🧬 TCP Spoofing** → List of IPs with an OS mismatch
2. **UA×OS matrix** → Android User-Agent but Windows TCP stack = spoof
3. **85% confidence** → MSS=1460 (Ethernet), scale=7, TTL≈64 → the real OS is Linux
4. **Action** → Classify as a bot behind a proxy IP
### Example 4: Investigate a suspicious IP
1. **🎯 Detections** → IP classified 🔴 CRITICAL
2. **Click the IP** → Summary: ML + TCP + JA4 + brute force + timeline
3. **Risk score**: 85/100
4. **User-Agents** → 3 different UAs in 24h (rotation)
5. **TCP** → Initial TTL 128 (Windows) but Linux UA → spoof
6. **Action** → Immediate blacklisting
## 🧬 Technical services (v2.0)
### `backend/services/tcp_fingerprint.py`
Multi-signal detection of the real OS, based on the TCP stack:
```python
from backend.services.tcp_fingerprint import fingerprint_os, detect_spoof
result = fingerprint_os(ttl=52, win=5808, scale=4, mss=1452)
# → OSFingerprint(os_family="Masscan/Scanner", confidence=0.97, is_bot_tool=True)
spoof = detect_spoof(declared_ua="Chrome/Windows", fingerprint=result)
# → SpoofResult(is_spoof=True, reason="UA Windows mais stack Masscan", risk_score=30)
```
**Scoring weights:** initial TTL 40% + MSS 30% + window 20% + scale 10%
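One way to read these weights is as a weighted sum of per-signal match strengths. The sketch below assumes each signal yields a value in [0, 1]; `WEIGHTS` and `weighted_confidence` are hypothetical names, and how `tcp_fingerprint.py` actually combines the signals is not shown here.

```python
# Weights taken from the line above; the signal names are illustrative.
WEIGHTS = {"ttl": 0.40, "mss": 0.30, "window": 0.20, "scale": 0.10}


def weighted_confidence(matches: dict) -> float:
    """Combine per-signal match strengths (each clamped to [0, 1]) into one score."""
    total = 0.0
    for signal, weight in WEIGHTS.items():
        strength = min(max(matches.get(signal, 0.0), 0.0), 1.0)
        total += weight * strength
    return total
```

With this reading, a candidate OS that matches on TTL and MSS only cannot score above 0.7, so the window and scale signals act as tie-breakers.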
**Hop-count estimation:**
- Observed TTL 52 → initial TTL rounded up to 64 → hops = 64 − 52 = **12**
- Observed TTL 119 → initial TTL = 128 → hops = 128 − 119 = **9**
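The rounding step can be sketched as follows. The candidate list `COMMON_INITIAL_TTLS` is an assumption (the text only mentions 64 and 128), and `estimate_hops` is a hypothetical helper, not the function from `tcp_fingerprint.py`.

```python
# Assumption: common OS default TTLs; only 64 and 128 appear in the text.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)


def estimate_hops(observed_ttl: int) -> tuple:
    """Return (initial_ttl, hops): round the observed TTL up to the next common default."""
    if not 0 < observed_ttl <= 255:
        raise ValueError("TTL must be in 1..255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl
    raise ValueError("unreachable for valid TTLs")
```

For the two cases above, `estimate_hops(52)` gives `(64, 12)` and `estimate_hops(119)` gives `(128, 9)`.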
**MSS → network path:**
| MSS | Detected network |
|-----|------------------|
| 1460 | Standard Ethernet |
| 1452 | PPPoE / DSL |
| 1420–1452 | Probable VPN |
| < 1420 | Tunnel / double encapsulation |
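The table can be expressed as a small lookup. This is a sketch under one assumption: since 1452 appears both as an exact value and as the upper bound of the VPN range, the exact match is checked first. `classify_mss` is a hypothetical name.

```python
def classify_mss(mss: int) -> str:
    """Map a TCP MSS value to the network path suggested by the table."""
    if mss == 1460:
        return "Standard Ethernet"
    if mss == 1452:  # exact match takes precedence over the VPN range
        return "PPPoE / DSL"
    if 1420 <= mss < 1452:
        return "Probable VPN"
    if mss < 1420:
        return "Tunnel / double encapsulation"
    return "Unknown"  # values the table does not cover (1453-1459, > 1460)
```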
### `backend/services/clustering_engine.py`
K-means++ and 2D PCA implemented in pure Python (no numpy/sklearn):
```
K-means++ init      : O(k·n) distances, n_init=3 runs → best inertia kept
Power iteration     : X^T(Xv) trick → O(n·d) per iteration, no n×n matrix
Hotelling deflation : removes PC1 from X before computing PC2
```
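The power-iteration and deflation steps can be illustrated in a few lines of pure Python. This is a sketch of the general technique, not the code from `clustering_engine.py`: all function names are hypothetical, the iteration count and seed are arbitrary, and K-means++ itself is left out.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(X):
    """Subtract the per-column mean from every row."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def top_component(X, iters=200, seed=0):
    """Leading principal direction of centered X by power iteration.

    Each step computes X^T (X v) as two matrix-vector products: O(n*d) work,
    and no covariance or Gram matrix is ever materialized.
    """
    rng = random.Random(seed)
    d = len(X[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        Xv = [dot(row, v) for row in X]  # n projections
        w = [sum(row[j] * p for row, p in zip(X, Xv)) for j in range(d)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in X
            break
        v = [c / norm for c in w]
    return v


def deflate(X, v):
    """Hotelling-style deflation: remove each row's projection on v."""
    out = []
    for row in X:
        p = dot(row, v)
        out.append([x - p * c for x, c in zip(row, v)])
    return out


def pca_2d(X):
    """Return (pc1, pc2, coords), where coords are the 2D projections of the rows."""
    Xc = center(X)
    pc1 = top_component(Xc)
    pc2 = top_component(deflate(Xc, pc1))
    coords = [(dot(row, pc1), dot(row, pc2)) for row in Xc]
    return pc1, pc2, coords
```

The sign of each component is arbitrary, so two runs with different seeds may return mirrored coordinates; the cluster geometry is unaffected.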
**21 features normalized to [0,1]**: see `FEATURES` in the file.
**Automatic naming**, in decreasing order of priority:
1. Masscan pattern (MSS 1440–1460, scale 3–5, TTL < 60)
2. Aggressive fuzzing (normalized fuzzing_index > 0.35 ≈ raw value > 100)
3. Rotating UA and UA-CH mismatch at the same time
4. UA-CH mismatch alone > 80%
5. ML anomaly score > 20% + behavioral signal
6. Network / OS classification by TTL/MSS
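The priority order above amounts to a first-match rule chain. The sketch below is illustrative only: the dictionary keys, the returned labels, and the 0.5 thresholds in rule 3 are assumptions (the list gives no threshold for that rule), and TCP values are taken as raw while the other features are taken as fractions in [0, 1].

```python
def name_cluster(c: dict) -> str:
    """Return the label of the first rule that matches the cluster centroid."""
    g = lambda key: c.get(key, 0)  # missing features count as 0
    rules = [
        ("Masscan / IP scanner",
         1440 <= g("mss") <= 1460 and 3 <= g("scale") <= 5 and 0 < g("ttl") < 60),
        ("Aggressive fuzzing", g("fuzzing_index") > 0.35),
        # Assumption: 0.5 thresholds; the source only says "at the same time".
        ("Rotating UA + UA-CH mismatch",
         g("ua_rotating") > 0.5 and g("ua_ch_mismatch") > 0.5),
        ("UA-CH mismatch", g("ua_ch_mismatch") > 0.8),
        ("ML anomaly", g("anomaly") > 0.2 and bool(g("behavioral_signal"))),
    ]
    for label, matched in rules:
        if matched:
            return label
    return "Network / OS class (from TTL/MSS)"  # rule 6: fallback
```

Because the rules are evaluated in order, a centroid that matches both the Masscan pattern and the fuzzing threshold is labeled as Masscan.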
## 🗄️ ClickHouse tables used
| Table / View | Routes |
|---|---|
| `mabase_prod.ml_detected_anomalies` | metrics, detections, variability, analysis, clustering |
| `mabase_prod.agg_host_ip_ja4_1h` | tcp_spoofing, clustering, investigation_summary |
| `mabase_prod.view_dashboard_entities` | entities (UA, JA4, paths, query params) |
| `mabase_prod.classifications` | analysis (manual SOC classifications) |
| `mabase_prod.audit_logs` | audit (optional, silently skipped if absent) |
**SQL conventions:**
- IPs are stored as IPv4-mapped IPv6; strip the prefix with `replaceRegexpAll(toString(src_ip), '^::ffff:', '')`
- `anomaly_score` can be negative: always use `abs()`
- `fuzzing_index` can exceed 200: normalize with `log1p`
- `multiplexing_efficiency` can exceed 1: normalize with `log1p`
- SQL parameters: `%(name)s` syntax (ClickHouse dict)
- **The SPA catch-all MUST be the last router in `main.py`**
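The data conventions above translate to a few small Python helpers. This is a sketch with hypothetical names; in particular, the `cap` parameter and the clipping to 1.0 are assumptions, since the project's actual scaling constants are not given here.

```python
import math
import re


def strip_ipv4_mapped(ip: str) -> str:
    """Python equivalent of replaceRegexpAll(toString(src_ip), '^::ffff:', '')."""
    return re.sub(r"^::ffff:", "", ip)


def anomaly_magnitude(score: float) -> float:
    """anomaly_score can be negative, so compare on its absolute value."""
    return abs(score)


def normalize_log1p(value: float, cap: float) -> float:
    """Compress a heavy-tailed, non-negative value into [0, 1] with log1p.

    Assumption: `cap` is the raw value that should map to 1.0; larger values are clipped.
    """
    v = max(value, 0.0)
    return min(math.log1p(v) / math.log1p(cap), 1.0)
```

Using `log1p` rather than a linear scale keeps small values distinguishable while preventing a few extreme `fuzzing_index` or `multiplexing_efficiency` values from dominating a distance-based clustering.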
## 🎨 Theme
The dashboard uses a **dark theme** tuned for SOC use (dark by default; light and auto are also available):
- **Semantic CSS tokens**: `bg-background`, `bg-background-card`, `text-text-primary`, `text-text-secondary`…
- **Threat taxonomy**: red CRITICAL / orange HIGH / yellow MEDIUM / green LOW
- **Persistence**: `localStorage` key `soc_theme`
- **Never use** raw Tailwind classes (`slate-800`); always use the semantic tokens
## 📝 Logs
The dashboard logs are available through Docker:
```bash
# Container logs
docker logs dashboard_web
# Follow the logs in real time
docker logs -f dashboard_web
```
## 🧪 Tests and validation
### Quick test script
Create a file named `test_dashboard.sh`:
```bash
#!/bin/bash
echo "=== Test Dashboard Bot Detector ==="
# 1. Health check (-f makes curl fail on HTTP error codes)
echo -n "1. Health check... "
curl -sf http://localhost:3000/health > /dev/null && echo "✅ OK" || echo "❌ FAILED"
# 2. API Metrics
echo -n "2. API Metrics... "
curl -s http://localhost:3000/api/metrics | jq -e '.summary' > /dev/null && echo "✅ OK" || echo "❌ FAILED"
# 3. API Detections
echo -n "3. API Detections... "
curl -s http://localhost:3000/api/detections | jq -e '.items' > /dev/null && echo "✅ OK" || echo "❌ FAILED"
# 4. Frontend
echo -n "4. Frontend HTML... "
curl -s http://localhost:3000 | grep -q "Bot Detector" && echo "✅ OK" || echo "❌ FAILED"
echo "=== Tests complete ==="
```
Make it executable and run it:
```bash
chmod +x test_dashboard.sh
./test_dashboard.sh
```
### Manual API tests
```bash
# 1. Health check
curl http://localhost:3000/health
# 2. Global metrics
curl http://localhost:3000/api/metrics | jq
# 3. List of detections (page 1, 25 items)
curl "http://localhost:3000/api/detections?page=1&page_size=25" | jq
# 4. Filter by CRITICAL threat level
curl "http://localhost:3000/api/detections?threat_level=CRITICAL" | jq '.items[].src_ip'
# 5. Distribution by threat level
curl http://localhost:3000/api/metrics/threats | jq
# 6. List of unique IPs (top 10)
curl "http://localhost:3000/api/attributes/ip?limit=10" | jq
# 7. Variability of an IP (replace with a real IP)
curl http://localhost:3000/api/variability/ip/192.168.1.100 | jq
# 8. Variability of a country
curl http://localhost:3000/api/variability/country/FR | jq
# 9. Variability of an ASN
curl http://localhost:3000/api/variability/asn/16276 | jq
```
### Frontend test
```bash
# Check that the HTML is served
curl -s http://localhost:3000 | head -20
# Or open it in the browser
# http://localhost:3000
```
### User test scenarios
1. **Basic navigation**
   - Open http://localhost:3000
   - Check that the metrics are displayed
   - Click "📋 Détections"
2. **Search and filters**
   - Search for an IP: `192.168`
   - Filter by threat level: CRITICAL
   - Change page
3. **Investigation (variability)**
   - Click an IP in the table
   - Check the "User-Agents" section (several values?)
   - Click a User-Agent to investigate it
   - Use the breadcrumb to go back
4. **Insights**
   - Find an IP with several User-Agents
   - Check that the "Possible rotation/obfuscation" insight is displayed
### Check the ClickHouse data
```bash
# Count detections (last 24h)
docker compose exec clickhouse clickhouse-client -d mabase_prod -q \
"SELECT count() FROM ml_detected_anomalies WHERE detected_at >= now() - INTERVAL 24 HOUR"
# View a sample
docker compose exec clickhouse clickhouse-client -d mabase_prod -q \
"SELECT src_ip, threat_level, model_name, detected_at FROM ml_detected_anomalies ORDER BY detected_at DESC LIMIT 5"
# Check the dashboard views
docker compose exec clickhouse clickhouse-client -d mabase_prod -q \
"SELECT * FROM view_dashboard_summary"
```
---
## 🐛 Troubleshooting
### Quick diagnosis
```bash
# 1. Check that the services are running
docker compose ps
# 2. Check the dashboard logs
docker compose logs dashboard_web | tail -50
# 3. Test the ClickHouse connection from the dashboard
docker compose exec dashboard_web curl -v http://clickhouse:8123/ping
```
### The dashboard does not start
```bash
# Check the logs
docker compose logs dashboard_web
# Common error: port already in use
# Fix: change the port in docker-compose.yml
# Common error: image not built
docker compose build dashboard_web
docker compose up -d dashboard_web
```
### No data displayed (empty dashboard)
```bash
# 1. Check that ClickHouse contains data
docker compose exec clickhouse clickhouse-client -d mabase_prod -q \
"SELECT count() FROM ml_detected_anomalies WHERE detected_at >= now() - INTERVAL 24 HOUR"
# If the result is 0:
# - Start bot_detector_ai to generate data
docker compose up -d bot_detector_ai
docker compose logs -f bot_detector_ai
# - Or import data manually
```
### Error "Connexion ClickHouse échoué" (ClickHouse connection failed)
```bash
# 1. Check that ClickHouse is running
docker compose ps clickhouse
# 2. Test the connection
docker compose exec clickhouse clickhouse-client -q "SELECT 1"
# 3. Check the credentials in .env
grep CLICKHOUSE .env
# 4. Restart the dashboard
docker compose restart dashboard_web
# 5. Check the error logs
docker compose logs dashboard_web | grep -i error
```
### 404 errors on API routes
```bash
# Check that the API responds
curl http://localhost:3000/health
curl http://localhost:3000/api/metrics
# If you get a 404, restart the dashboard
docker compose restart dashboard_web
```
### Port 3000 already in use
```bash
# Option 1: change the port in docker-compose.yml
# Replace: - "3000:8000"
# With:    - "8080:8000"
# Option 2: find and kill the process
lsof -i :3000
kill <PID>
# Then restart
docker compose up -d dashboard_web
```
### Frontend does not load (blank page)
```bash
# 1. Check the browser console (F12)
# 2. Check that the frontend build exists
docker compose exec dashboard_web ls -la /app/frontend/dist
# 3. If it is empty, rebuild the image
docker compose build --no-cache dashboard_web
docker compose up -d dashboard_web
```
### Common error logs
| Error | Cause | Fix |
|-------|-------|-----|
| `Connection refused` | ClickHouse not started | `docker compose up -d clickhouse` |
| `Authentication failed` | Wrong credentials | Check `.env` |
| `Table doesn't exist` | Views not created | Run `deploy_views.sql` |
| `No data available` | No data | Start `bot_detector_ai` |
---
## 🔒 Security
- **No authentication**: the dashboard is designed for local use
- **Restricted CORS**: localhost:3000 only
- **Rate limiting**: 100 requests/minute
- **Credentials**: passed via environment variables (never hard-coded)
## 📊 Performance
- **Load time**: < 2s (with data)
- **ClickHouse queries**: optimized with aggregations
- **Auto-refresh**: every 30 seconds (metrics)
## 🧪 Development
### Local build (without Docker)
```bash
# Backend
cd dashboard
pip install -r requirements.txt
python -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
# Frontend (in another terminal)
cd dashboard/frontend
npm install
npm run dev # http://localhost:5173
```
### Interactive API documentation
The API includes interactive Swagger documentation:
```bash
# Open in the browser
http://localhost:3000/docs
# Or directly on the API port
http://localhost:8000/docs
```
### Unit tests (coming soon)
```bash
# Backend (pytest)
cd dashboard
pytest backend/tests/
# Frontend (jest)
cd dashboard/frontend
npm test
```
## 📄 License
Same license as the main Bot Detector project.
---
## 📞 Support
For any question or problem:
1. Check the troubleshooting section (🐛) above
2. Check the logs: `docker compose logs dashboard_web`
3. Check that ClickHouse contains data
4. Open an issue on the repository


@ -0,0 +1,57 @@
# Execution plan: Routes & Navigation
## Context
- Application-level authentication is **out of scope** (handled by `htaccess`).
- Goal: make routes and navigation consistent, with no broken links.
## Steps and progress
| Step | Description | Status | Notes |
|---|---|---|---|
| 1 | Prepare this tracking document | ✅ Done | Document created and used as the source of progress. |
| 2 | Run a baseline (existing checks) | ✅ Done | `docker compose build dashboard_web` run (OK). |
| 3 | Fix the declared routes (aliases + missing routes) | ✅ Done | Added `/incidents`, `/investigate`, `/investigate/:type/:value`, `/bulk-classify` + route-param wrappers for the tools. |
| 4 | Fix navigation (links/buttons/quick search) | ✅ Done | Top navigation extended, quick actions fixed, `window.location.href` removed. |
| 5 | Validate after the changes (build/checks) | ✅ Done | `docker compose build dashboard_web` OK after the changes. |
| 6 | Finalize this document with the results | ✅ Done | Summary and final status completed. |
| 7 | Rewrite the correlation graph | ✅ Done | Custom node types, radial layout, fitView, fetch/filter separation, error handling, 700px height. |
## Progress log
### Step 1: Prepare the document
- Status: ✅ Done
- Action: created the tracking document with steps and statuses.
### Step 2: Docker baseline
- Status: ✅ Done
- Action: ran `docker compose build dashboard_web`.
- Result: build OK (exit code 0), with a non-blocking warning about the obsolete `version` key in the compose file.
### Step 3: Route fixes
- Status: ✅ Done
- Actions:
  - added the alias route `/incidents` pointing to the incidents view;
  - added the routes `/investigate` and `/investigate/:type/:value` with smart redirection;
  - added the route `/bulk-classify` with an integration wrapper;
  - replaced uses of `window.location.pathname` with route wrappers based on `useParams`.
### Step 4: Navigation fixes
- Status: ✅ Done
- Actions:
  - added a `Détections` navigation tab;
  - fixed active-menu highlighting (handles aliases and sub-routes);
  - replaced `window.location.href` in `DetectionsList` with `navigate(...)`;
  - pointed the "Investigation avancée" quick action to `/detections`.
### Step 5: Docker validation after the changes
- Status: ✅ Done
- Action: ran `docker compose build dashboard_web`.
- Result: build OK (exit code 0), with a non-blocking warning about the obsolete compose `version` key.
### Step 6: Closure
- Status: ✅ Done
- Overall result:
  - invalid routes covered via aliases/wrappers;
  - internal navigation consistent as an SPA;
  - Docker build validated before and after.


@ -0,0 +1 @@
# Backend package


@ -0,0 +1,27 @@
"""
Configuration for the Bot Detector dashboard
"""
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # ClickHouse
    CLICKHOUSE_HOST: str = "clickhouse"
    CLICKHOUSE_PORT: int = 8123
    CLICKHOUSE_DB: str = "mabase_prod"
    CLICKHOUSE_USER: str = "admin"
    CLICKHOUSE_PASSWORD: str = ""
    # API
    API_HOST: str = "0.0.0.0"
    API_PORT: int = 8000
    # CORS
    CORS_ORIGINS: list = ["http://localhost:3000", "http://127.0.0.1:3000"]

    class Config:
        # Values are loaded from .env; variable names are matched case-sensitively
        env_file = ".env"
        case_sensitive = True


settings = Settings()
