ja4-platform/.github/copilot-instructions.md
toto 77c0450a22 docs: update copilot-instructions.md for dashboard rewrite and ML upgrades
- Dashboard: FastAPI+React → FastAPI+Jinja2+htmx+Chart.js (2 route modules)
- Bot-detector: IsolationForest → triple-voice EIF+Autoencoder+XGBoost ensemble
- SQL schema: 10 → 13 files (added thesis features, perf indexes, views)
- Added ClickHouse 24.8 gotchas (projections, nested aggregates, let bindings)
- Added IPv4/IPv6 duality pattern, bot-detector test patterns
- Updated data retention table with 4 new thesis aggregation tables
- Fixed single-test commands to reference existing files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 07:31:10 +02:00


Copilot Instructions — ja4-platform

What is this?

A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All of them are backed by ClickHouse.

Data flow: mod-reqin-log (Apache HTTP logs) → unix socket → correlator ← unix socket ← sentinel (TLS/TCP capture); correlator → ClickHouse → bot-detector (ML scoring) → dashboard (FastAPI SOC UI)

Build, test, lint

All builds run in Docker — no native Go/Python/C toolchain required on the host.

# Full suite
make test-all        # run all tests (Docker)
make build-all       # build all service images
make rpm-all         # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10

# Per-service tests
make test-sentinel          # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator        # Go tests with 80% coverage gate
make test-bot-detector      # Python pytest
make test-dashboard         # Python pytest
make test-ja4common-python  # Python pytest (shared lib)
make test-mod-reqin-log     # C cmocka tests

# Single Go test (from service dir, or via Docker):
docker run --rm -v "$(pwd)":/build -w /build/services/correlator golang:1.24 \
  go test -v -run TestConfigLoad ./internal/config/

# Single Python test (bot-detector, from repo root):
docker build -f services/bot-detector/bot_detector/Dockerfile.tests -t bd-tests .
docker run --rm bd-tests pytest bot_detector/tests/test_detector.py -v -k test_ae_class_train_and_score

# Local Python tests (needs deps installed):
cd services/bot-detector && pip install -r bot_detector/requirements.txt pytest pytest-mock
pytest bot_detector/tests/test_detector.py -v -k "test_benford"

# Faster correlator build (skip tests):
docker build --target builder --build-arg SKIP_TESTS=true -f services/correlator/Dockerfile .

# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .

# Full-stack integration tests (Docker Compose, resets DB each run)
make test-integration          # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep     # same but leaves stack running after
make test-integration-down     # tear down integration stack
# run-tests.sh also accepts: --build-only (build images without running tests)

Architecture

Go workspace (go.work, Go 1.24.6)

Three modules in the workspace:

  • services/sentinel — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
  • services/correlator — log correlation engine, hexagonal architecture
  • shared/go/ja4common — shared logger, config, shutdown, ipfilter

Both services have a replace directive in their go.mod pointing to ../../shared/go/ja4common. The workspace takes precedence for local dev; the replace is needed for Docker builds.

Correlator hexagonal architecture

ports/source.go        → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/      → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/     → clickhouse, file, stdout, multi (fan-out wrapper)
domain/                → CorrelationService, CorrelatedLog, NormalizedEvent
app/                   → Orchestrator (wires everything together)
config/                → YAML config loader

Python services

  • bot-detector: triple-voice ensemble ML pipeline. Single monolithic module (bot_detector.py, ~1550 lines). Runs two parallel models per cycle:

    • Complet (~45 features L3→L7) on correlated traffic (correlated=1, TCP+TLS+HTTP)
    • Applicatif (~35 features L7-only) on uncorrelated traffic (correlated=0)
    • Optional 24h variants when ENABLE_MULTIWINDOW=true

    ML ensemble:

    • Extended Isolation Forest (isotree, primary unsupervised scorer, falls back to sklearn)
    • Autoencoder (PyTorch, architecture n→64→32→16→32→64→n, reconstruction error scorer)
    • XGBoost (supervised, trained on SOC analyst labels from soc_feedback)
    • Score: final = (1-XGB_WEIGHT) * ((1-AE_WEIGHT)*eif_norm + AE_WEIGHT*ae_norm) + XGB_WEIGHT*xgb_prob
    • Default weights: AE_WEIGHT=0.30, XGB_WEIGHT=0.20

    Also: adaptive thresholds, SHAP explainability, HDBSCAN campaign clustering, Benford deviation, lag-1 autocorrelation, quantile drift detection, feature pruning, validation gate.

    model_name is part of the ORDER BY key in both ml_detected_anomalies and ml_all_scores. Uses os.getenv() directly for config, NOT pydantic-settings.

  • dashboard — FastAPI + Jinja2 server-side templates + htmx + Chart.js + Tailwind CSS (CDN). Two route modules: backend/routes/api.py (9 JSON endpoints) and backend/routes/pages.py (8 HTML pages). Uses os.getenv() with safe_identifier() for DB name validation.

  • shared/python/ja4_common: ClickHouseClient singleton + ClickHouseSettings (pydantic-settings). Installed as a local package in each Python Dockerfile. Note: dashboard does NOT use ja4_common; it has its own lightweight clickhouse-connect client in backend/database.py.
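
The bot-detector score formula listed above can be written out as a small function. This is a minimal sketch, not the code in bot_detector.py: the name blend_scores and its argument names are hypothetical, and it assumes all three model outputs are already normalized to [0, 1].

```python
def blend_scores(eif_norm, ae_norm, xgb_prob, ae_weight=0.30, xgb_weight=0.20):
    """Blend the three ensemble voices into one score (hypothetical helper).

    Mirrors the documented formula:
    final = (1-XGB_WEIGHT) * ((1-AE_WEIGHT)*eif_norm + AE_WEIGHT*ae_norm)
            + XGB_WEIGHT*xgb_prob
    """
    # Unsupervised part: EIF and autoencoder share one weight budget.
    unsupervised = (1.0 - ae_weight) * eif_norm + ae_weight * ae_norm
    # The supervised XGBoost probability is mixed in on top.
    return (1.0 - xgb_weight) * unsupervised + xgb_weight * xgb_prob


# Example inputs are made up for illustration.
example = blend_scores(eif_norm=0.8, ae_norm=0.5, xgb_prob=0.9)
```

With the default weights, the effective contribution of each voice to the final score is 0.56 for EIF, 0.24 for the autoencoder, and 0.20 for XGBoost.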

C module

  • mod-reqin-log — Apache HTTPD module (C11, built with apxs). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.

ClickHouse dual-database pattern

Two configurable databases (env vars with defaults):

| Env var | Default | Contains |
|---|---|---|
| CLICKHOUSE_DB_LOGS | ja4_logs | http_logs_raw, http_logs, mv_http_logs |
| CLICKHOUSE_DB_PROCESSING | ja4_processing | Aggregations, ML tables, views, dicts, audit |

Cross-database references exist — materialized views in one DB read from the other:

  • ja4_logs.mv_http_logs references ja4_processing.dict_anubis_* and ja4_processing.dict_iplocate_asn
  • ja4_processing.mv_agg_* reads FROM ja4_logs.http_logs

In Python code, always use fully qualified table names:

from ..config import settings
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."

Never hardcode database names in queries.

In Go (correlator), the database is part of the ClickHouse DSN (clickhouse://user:pass@host:9000/ja4_logs). The target table is configurable via YAML (outputs.clickhouse.table).

SQL schema has two deployment paths:

  • Base schema: shared/clickhouse/ (13 ordered files, 00_database.sql through 12_thesis_features.sql). Deploy with shared/clickhouse/deploy_schema.sh which substitutes DB names from env vars.
  • Post-deploy migrations: services/correlator/sql/migrations/ — ALTER TABLE statements for existing deployments. Run manually: clickhouse-client --multiquery < file.sql.

Data retention (TTL)

| Table | TTL | Partition |
|---|---|---|
| http_logs_raw | 2 hours | toStartOfHour(ingest_time) |
| http_logs | 30 days | toDate(log_date) |
| agg_host_ip_ja4_1h | 7 days | toDate(window_start) |
| agg_header_fingerprint_1h | 7 days | toDate(window_start) |
| agg_ip_behavior_1h | 7 days | toDate(window_start) |
| agg_request_timing_1h | 7 days | toDate(window_start) |
| agg_path_sequences_1h | 7 days | toDate(window_start) |
| agg_resource_cascade_1h | 7 days | toDate(window_start) |
| ml_detected_anomalies | 7 days | toDate(detected_at) |
| ml_all_scores | 7 days | toDate(window_start) |

All aggregation/ML tables use ttl_only_drop_parts=1 for efficient partition-level expiry.

Key conventions

Docker-first builds

Every service has a Dockerfile (prod) and either a Dockerfile.dev or a Dockerfile.tests (tests). Go and C services also have a Dockerfile.package for RPM packaging, built in 3 stages: builder → rpmbuild × 3 distros → alpine output.

Go config: YAML + env vars

  • Sentinel: config.yml, env prefix JA4SENTINEL_
  • Correlator: config.yml, env prefix LOGCORRELATOR_
  • Both support SIGHUP for log rotation

Python config

  • ja4_common: ClickHouseSettings(BaseSettings) — pydantic-settings singleton
  • bot-detector: raw os.getenv() with defaults, NOT pydantic-settings
  • dashboard: backend/config.py reads CLICKHOUSE_DB_PROCESSING/CLICKHOUSE_DB_LOGS from env, validates identifiers with safe_identifier() regex
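
For illustration, an identifier check in the spirit of safe_identifier() might look like the sketch below. The real regex in backend/config.py is not shown in this document, so the pattern here (ASCII letters, digits, underscore, not starting with a digit) is an assumption.

```python
import re

# Assumed pattern: the actual rule used by the dashboard may differ.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it looks like a plain SQL identifier.

    Raises ValueError otherwise, so an env var can never smuggle SQL
    into an f-string that builds a fully qualified table name.
    """
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe ClickHouse identifier: {name!r}")
    return name
```

Validation is needed because database names cannot be passed as bound query parameters in the same way as values, so they end up interpolated into the SQL text.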

Dashboard route structure

Two route modules: backend/routes/api.py (JSON API) and backend/routes/pages.py (HTML). API endpoints use parameterized ClickHouse queries with sort/order whitelists:

from backend.config import DB_PROCESSING, DB_LOGS, safe_identifier
from backend.database import query, query_scalar, execute

_DB = safe_identifier(DB_PROCESSING)
rows = query(
    f"SELECT ... FROM {_DB}.ml_detected_anomalies "
    "WHERE src_ip = toIPv6({ip:String})",
    {"ip": ip_value},
)
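
The snippet above covers parameterized values; the sort/order whitelist is not shown. A minimal sketch of that idea follows. The helper name build_order_by and the allowed column set are hypothetical (only detected_at and src_ip appear elsewhere in this document; anomaly_score is a placeholder).

```python
# Hypothetical whitelist: replace with the columns the endpoint really exposes.
ALLOWED_SORT_COLUMNS = {"detected_at", "anomaly_score", "src_ip"}
ALLOWED_ORDERS = {"asc", "desc"}


def build_order_by(sort: str, order: str,
                   default_sort: str = "detected_at",
                   default_order: str = "desc") -> str:
    """Build an ORDER BY clause from user input using whitelists only.

    Column names and directions cannot be bound as query parameters,
    so anything not in the whitelist falls back to a safe default.
    """
    column = sort if sort in ALLOWED_SORT_COLUMNS else default_sort
    direction = order.lower() if order.lower() in ALLOWED_ORDERS else default_order
    return f"ORDER BY {column} {direction.upper()}"
```

Falling back to a default instead of raising keeps a malformed query string from producing an error page; returning HTTP 400 would be an equally valid design choice.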

IPv4/IPv6 duality: http_logs.src_ip is IPv4, but ML tables (ml_all_scores, ml_detected_anomalies) store IPv6 (mapped as ::ffff:x.x.x.x). Use toIPv6() for ML table queries, toIPv4OrZero() for http_logs queries. Strip ::ffff: prefix when displaying IPs.
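
On the Python side, the display rule can be handled with the standard ipaddress module. The two helper names below are hypothetical; this is a sketch of the conversion, not the dashboard's actual code.

```python
import ipaddress


def to_mapped_ipv6(ip: str) -> str:
    """Return the IPv4-mapped IPv6 text form (::ffff:x.x.x.x) for an IPv4 input.

    IPv6 inputs are returned in their normalized form.
    """
    addr = ipaddress.ip_address(ip)
    if isinstance(addr, ipaddress.IPv4Address):
        return f"::ffff:{addr}"
    return str(addr)


def display_ip(ip: str) -> str:
    """Strip the ::ffff: prefix from IPv4-mapped addresses for display."""
    addr = ipaddress.ip_address(ip)
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    return str(addr)
```

Using ipv4_mapped rather than a string replace also handles mapped addresses written in hex form, and leaves genuine IPv6 addresses untouched.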

RPM spec files

Located at services/<name>/packaging/rpm/<name>.spec. Version injected via --define "build_version X.Y.Z" at build time.

Inter-service communication

Services communicate via Unix sockets, not HTTP:

  • sentinel → /var/run/logcorrelator/network.socket → correlator (source B: TLS/TCP data)
  • mod-reqin-log → /var/run/logcorrelator/http.socket → correlator (source A: HTTP data)
  • correlator → ClickHouse (batch inserts into ja4_logs.http_logs_raw)

Sentinel requires elevated privileges

Tests need --cap-add=NET_RAW --cap-add=NET_ADMIN for packet capture (pcap).

Comments standard

All code is commented in French (identifiers stay in English). Standard defined in docs/commenting-standard.md:

  • Go: godoc // FuncName does X, package-level // Package foo fournit...
  • Python: PEP-257 triple-quoted French docstrings on all functions/classes/modules
  • C: Doxygen /** @brief ... @param ... @return ... */ before every function, /* ====== Section ====== */ banners
  • Bash: standardized header block with Usage: and Variables d'environnement:
  • SQL: -- === filename.sql — description === banner + -- --- Table --- section headers

Known gotchas

go.work and Docker build contexts

When building either sentinel or correlator in Docker, the build context must include both service directories because go.work references them both. The root-level Makefiles always use . (repo root) as context — don't change this.

Correlator YAML does not expand env vars

Go's YAML parser does not perform shell-style expansion, so ${VAR:-default} is read as a literal string. Write concrete values directly in the YAML file. This is why tests/integration/platform/correlator.yml has a hardcoded DSN.
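
A quick pre-flight check can catch this mistake before the correlator starts. The sketch below scans config text for leftover ${VAR} or ${VAR:-default} placeholders. The function name is hypothetical, and the dsn key in the sample is an assumption (only outputs.clickhouse.table is named in this document).

```python
import re

# Matches ${NAME} and ${NAME:-default}; neither is expanded by the correlator.
_PLACEHOLDER_RE = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*(?::-[^}]*)?\}")


def find_unexpanded_placeholders(yaml_text: str) -> list[str]:
    """Return every shell-style placeholder found in the YAML text."""
    return _PLACEHOLDER_RE.findall(yaml_text)


# Sample config for illustration only; the key layout is assumed.
sample = """\
outputs:
  clickhouse:
    dsn: clickhouse://user:pass@${CH_HOST:-localhost}:9000/ja4_logs
    table: http_logs_raw
"""
leftovers = find_unexpanded_placeholders(sample)
```

A non-empty result means the file still contains text that would be passed to the correlator verbatim.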

REGEXP_TREE dictionary requires ≥1 rule

dict_anubis_ua uses LAYOUT(REGEXP_TREE). If anubis_ua_rules is empty, every INSERT into http_logs_raw fails because the materialized view mv_http_logs calls dictGet() on it. The integration test init script seeds a catch-all rule.

TLS/pcap capture needs non-loopback traffic

sentinel listens on a network interface (e.g., eth0), not loopback. Traffic sent to localhost or 127.0.0.1 from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.

ClickHouse initialization timing

ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health checks use a 120s timeout (not the default 60s).

ClickHouse 24.8 SQL gotchas

  • Projections in AggregatingMergeTree require SETTINGS deduplicate_merge_projection_mode = 'drop'
  • No let bindings in SQL expressions — inline everything or use CTEs
  • Nested aggregates are illegal: argMax(any(col), sum(hits)) fails. Use a subquery to pre-aggregate, then apply the outer aggregate
  • groupArrayIf column type must match the State/Merge functions: if the MV uses groupArrayIfState(), the column must be AggregateFunction(groupArrayIf(...), Type, UInt8), not plain groupArrayState
  • SimpleAggregateFunction columns are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So argMax(ja4, hits) where hits is SimpleAggregateFunction(sum, UInt64) becomes argMax(ja4, sum(hits)) — a nested aggregate error
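
The pre-aggregation workaround for nested aggregates can be sketched as a Python query builder that also follows the fully-qualified-name rule. The function name is hypothetical, and the column names (src_ip, ja4, hits) on agg_host_ip_ja4_1h are assumptions made for illustration.

```python
def top_ja4_per_ip_query(db_processing: str) -> str:
    """Build a query that picks the most-hit JA4 per source IP.

    The inner subquery sums hits per (src_ip, ja4) first, so the outer
    argMax() receives a plain column instead of another aggregate.
    db_processing is expected to be validated by the caller.
    """
    return (
        "SELECT src_ip, argMax(ja4, total_hits) AS top_ja4 "
        "FROM ("
        "SELECT src_ip, ja4, sum(hits) AS total_hits "
        f"FROM {db_processing}.agg_host_ip_ja4_1h "
        "GROUP BY src_ip, ja4"
        ") "
        "GROUP BY src_ip"
    )


sql = top_ja4_per_ip_query("ja4_processing")
```

Writing argMax(ja4, hits) directly against the aggregate table is the form the gotcha above warns about; the subquery makes the sum explicit and keeps each aggregate at its own level.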

Bot-detector test patterns

Tests in bot_detector/tests/test_detector.py are self-contained — they re-implement key logic rather than importing from the main module (avoids heavy joblib/sklearn import chains). Autoencoder tests use a local _make_ae() helper. XGBoost tests create in-memory models. Tests skip gracefully if torch or xgboost are not installed (pytest.skip).

RPM builds must use Rocky Linux

All Dockerfile.package files use rockylinux:9 (or rockylinux:8/almalinux:10) as the build base — never Debian-based images. Reason: Rocky provides libpcap.so.1; Debian provides libpcap.so.0.8. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.

ClickHouse FLAT() layout requires numeric keys

If adding a new dictionary with a String primary key, use COMPLEX_KEY_HASHED() not FLAT().