ja4-platform/.github/copilot-instructions.md
toto d8dbd4e706 docs: add .github/copilot-instructions.md for Copilot context
Covers: build/test/lint commands, architecture overview, ClickHouse
dual-DB pattern, inter-service communication, key conventions for
Go (hexagonal, YAML config), Python (pydantic-settings, FastAPI routes),
C (Apache module), Docker-first builds, and RPM packaging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:21:32 +02:00


Copilot Instructions — ja4-platform

What is this?

A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All of them are backed by ClickHouse.

Data flow: mod-reqin-log (Apache HTTP logs) → unix socket → correlator ← unix socket ← sentinel (TLS/TCP capture) → ClickHouse → bot-detector (ML scoring) → dashboard (FastAPI SOC UI)

Build, test, lint

All builds run in Docker — no native Go/Python/C toolchain required on the host.

# Full suite
make test-all        # run all tests (Docker)
make build-all       # build all service images
make rpm-all         # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10

# Per-service tests
make test-sentinel          # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator        # Go tests with 60% coverage gate
make test-bot-detector      # Python pytest
make test-dashboard         # Python pytest
make test-ja4common-python  # Python pytest (shared lib)
make test-mod-reqin-log     # C cmocka tests

# Single Go test (from service dir, or via Docker):
docker run --rm -v $(pwd):/build -w /build/services/correlator golang:1.24 \
  go test -v -run TestConfigLoad ./internal/config/

# Single Python test (from repo root):
docker build -f services/dashboard/Dockerfile.tests -t dash-tests .
docker run --rm dash-tests pytest backend/tests/test_metrics.py -v -k test_health

# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .

Architecture

Go workspace (go.work, Go 1.24.6)

Three modules in the workspace:

  • services/sentinel — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
  • services/correlator — log correlation engine, hexagonal architecture
  • shared/go/ja4common — shared logger, config, shutdown, ipfilter

Both services have a replace directive in their go.mod pointing to ../../shared/go/ja4common. The workspace takes precedence for local dev; the replace is needed for Docker builds.

Correlator hexagonal architecture

ports/source.go        → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/      → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/     → clickhouse, file, stdout, multi (fan-out wrapper)
domain/                → CorrelationService, CorrelatedLog, NormalizedEvent
app/                   → Orchestrator (wires everything together)
config/                → YAML config loader
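
The correlator itself is written in Go. The following Python sketch only illustrates the port/adapter split and the idea behind the multi fan-out wrapper listed above. The method name `write` and the classes `ListSink` and `MultiSink` are hypothetical and are not taken from the repository.

```python
from typing import Iterable, List, Protocol


class CorrelatedLogSink(Protocol):
    """Port: anything that can receive a correlated log (method name is assumed)."""

    def write(self, log: dict) -> None: ...


class ListSink:
    """Hypothetical outbound adapter that keeps logs in memory (stand-in for file/stdout)."""

    def __init__(self) -> None:
        self.items: List[dict] = []

    def write(self, log: dict) -> None:
        self.items.append(log)


class MultiSink:
    """Fan-out wrapper: forwards every log to each wrapped sink, in order."""

    def __init__(self, sinks: Iterable[CorrelatedLogSink]) -> None:
        self._sinks = list(sinks)

    def write(self, log: dict) -> None:
        for sink in self._sinks:
            sink.write(log)
```

Because the domain depends only on the sink interface, an orchestrator can hand it a single `MultiSink` and stay unaware of how many outputs are configured.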

Python services

  • bot-detector — scikit-learn IsolationForest + DBSCAN. Single monolithic module (bot_detector.py). Uses os.getenv() directly for config, NOT pydantic-settings.
  • dashboard — FastAPI + React SPA. 20 route modules in backend/routes/. Uses pydantic-settings (backend/config.py).
  • shared/python/ja4_common: ClickHouseClient singleton + ClickHouseSettings (pydantic-settings). Installed as a local package in each Python Dockerfile.

C module

  • mod-reqin-log — Apache HTTPD module (C11, built with apxs). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.

ClickHouse dual-database pattern

Two configurable databases (env vars with defaults):

| Env var | Default | Contains |
| --- | --- | --- |
| CLICKHOUSE_DB_LOGS | ja4_logs | http_logs_raw, http_logs, mv_http_logs |
| CLICKHOUSE_DB_PROCESSING | ja4_processing | Aggregations, ML tables, views, dicts, audit |

Cross-database references exist — materialized views in one DB read from the other:

  • ja4_logs.mv_http_logs references ja4_processing.dict_anubis_* and ja4_processing.dict_iplocate_asn
  • ja4_processing.mv_agg_* reads FROM ja4_logs.http_logs

In Python code, always use fully qualified table names:

from ..config import settings
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."

Never hardcode database names in queries.
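
A minimal, standard-library sketch of the same rule, for code that cannot import the dashboard settings. `DbSettings` and `qualified` are hypothetical helpers, not part of the repository. Only the two environment variable names and their defaults come from the table above.

```python
import os
import re
from dataclasses import dataclass, field

# Conservative identifier check so a bad env value cannot inject SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


@dataclass(frozen=True)
class DbSettings:
    """Hypothetical stand-in for the real settings object; reads env at construction."""

    CLICKHOUSE_DB_LOGS: str = field(
        default_factory=lambda: os.getenv("CLICKHOUSE_DB_LOGS", "ja4_logs")
    )
    CLICKHOUSE_DB_PROCESSING: str = field(
        default_factory=lambda: os.getenv("CLICKHOUSE_DB_PROCESSING", "ja4_processing")
    )


def qualified(database: str, table: str) -> str:
    """Return 'database.table', rejecting anything that is not a plain identifier."""
    for part in (database, table):
        if not _IDENT.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{database}.{table}"
```

With this in place a query reads `f"SELECT ... FROM {qualified(s.CLICKHOUSE_DB_LOGS, 'http_logs')}"`, so renaming a database through the environment requires no code change.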

In Go (correlator), the database is part of the ClickHouse DSN (clickhouse://user:pass@host:9000/ja4_logs). The target table is configurable via YAML (outputs.clickhouse.table).
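
To make the DSN layout concrete, here is a small Python sketch that extracts the database name from a DSN of the shape shown above. The correlator does this in Go through its ClickHouse driver. `database_from_dsn` is a hypothetical helper used only for illustration.

```python
from urllib.parse import urlsplit


def database_from_dsn(dsn: str) -> str:
    """Return the database segment of a clickhouse:// DSN (the URL path)."""
    parts = urlsplit(dsn)
    if parts.scheme != "clickhouse":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not database:
        raise ValueError("DSN does not name a database")
    return database
```

For example, `database_from_dsn("clickhouse://user:pass@host:9000/ja4_logs")` yields `ja4_logs`, which is why the Go side never needs a separate database setting.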

SQL migrations live in shared/clickhouse/ (10 ordered files). Deploy with shared/clickhouse/deploy_schema.sh which substitutes DB names from env vars.

Key conventions

Docker-first builds

Every service has a Dockerfile (prod) and either Dockerfile.dev or Dockerfile.tests (tests). Go/C services also have Dockerfile.package for RPM packaging, a 3-stage build: builder → rpmbuild × 3 distros → alpine output.

Go config: YAML + env vars

  • Sentinel: config.yml, env prefix JA4SENTINEL_
  • Correlator: config.yml, env prefix LOGCORRELATOR_
  • Both support SIGHUP for log rotation
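
The interaction between the YAML file and prefixed environment variables can be sketched as follows. This is an illustration in Python, not the Go implementation. It assumes flat keys, a `PREFIX + KEY.upper()` naming scheme, and that environment variables win over file values. None of these details are confirmed by this document.

```python
import os
from typing import Mapping, Optional


def apply_env_overrides(
    config: dict,
    prefix: str,
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of config where PREFIX+KEY env vars replace file values.

    Assumed mapping: key 'log_level' with prefix 'LOGCORRELATOR_' is
    overridden by the env var 'LOGCORRELATOR_LOG_LEVEL'.
    """
    env = os.environ if environ is None else environ
    merged = dict(config)
    for key in config:
        env_name = prefix + key.upper()
        if env_name in env:
            merged[key] = env[env_name]  # env values are always strings
    return merged
```

Passing `environ` explicitly keeps the function easy to test without mutating the process environment.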

Python config: pydantic-settings

  • Dashboard: backend/config.py → Settings(BaseSettings) with .env file
  • ja4_common: ClickHouseSettings(BaseSettings) — singleton at settings
  • bot-detector: exception — uses raw os.getenv(), not pydantic-settings

Dashboard route structure

Every route file follows this pattern:

from fastapi import APIRouter, HTTPException, Query
from ..config import settings
from ..database import db

router = APIRouter()

@router.get("/api/something")
async def get_something():
    query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.table_name ..."
    result = db.query(query)
    ...

RPM spec files

Located at services/<name>/packaging/rpm/<name>.spec. Version injected via --define "build_version X.Y.Z" at build time.

Inter-service communication

Services communicate via Unix sockets, not HTTP:

  • sentinel → /var/run/logcorrelator/network.socket → correlator (source B: TLS/TCP data)
  • mod-reqin-log → /var/run/logcorrelator/http.socket → correlator (source A: HTTP data)
  • correlator → ClickHouse (batch inserts into ja4_logs.http_logs_raw)
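
The socket hand-off above can be sketched in Python as a producer writing one JSON event and a consumer reading it. The real producers are written in C and Go. This sketch assumes a stream socket with newline-delimited JSON, which this document does not specify, and the helper names and event fields are hypothetical.

```python
import json
import os
import socket
import tempfile


def send_event(socket_path: str, event: dict) -> None:
    """Producer side: connect, send one newline-terminated JSON event, close."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(socket_path)
        client.sendall(json.dumps(event).encode("utf-8") + b"\n")


def read_event(server: socket.socket) -> dict:
    """Consumer side: accept one connection and read until newline or EOF."""
    conn, _ = server.accept()
    with conn:
        buf = b""
        while not buf.endswith(b"\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            buf += chunk
    return json.loads(buf)


def roundtrip(event: dict) -> dict:
    """Demo: bind a throwaway socket in a temp dir and pass one event through it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "http.socket")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)
            server.listen(1)
            send_event(path, event)  # queued in the listen backlog
            return read_event(server)
```

The demo uses a temporary path instead of /var/run/logcorrelator/ so it can run without root and without touching a live correlator.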

Sentinel requires elevated privileges

Tests need --cap-add=NET_RAW --cap-add=NET_ADMIN for packet capture (pcap).