feat(tests): realistic traffic seeder + IP diversity via mod_remoteip

Option A - X-Forwarded-For + mod_remoteip:
- httpd-integration.conf: load mod_remoteip, trust all Docker RFC-1918
  subnets (172/192.168/10). mod_reqin_log uses r->useragent_ip which
  mod_remoteip updates from XFF → each request logged with distinct src_ip
- generate_traffic.py: XFF always set (was 30% only); human scenarios
  use 91.121/78.41/90.x ranges, bot scenarios use 185.220/45.155/193.32;
  pool of 1168 human IPs and 180 bot IPs; default --requests 500
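
A minimal sketch of the XFF selection described in Option A, assuming the generator draws a random address from a per-scenario prefix list. The names `pick_xff`, `build_headers`, `HUMAN_PREFIXES` and `BOT_PREFIXES` are hypothetical and are not taken from generate_traffic.py; only the prefixes come from this message.

```python
import ipaddress
import random

# Two-octet prefixes quoted in the commit message. The "90.x" human range is
# left out because its second octet is not specified there.
HUMAN_PREFIXES = ("91.121", "78.41")
BOT_PREFIXES = ("185.220", "45.155", "193.32")


def pick_xff(scenario: str, rng: random.Random) -> str:
    """Return a synthetic client IP for the X-Forwarded-For header."""
    prefixes = BOT_PREFIXES if scenario == "bot" else HUMAN_PREFIXES
    prefix = rng.choice(prefixes)
    # Keep the last octet in 1..254 so the address looks like a host address.
    return f"{prefix}.{rng.randint(0, 255)}.{rng.randint(1, 254)}"


def build_headers(scenario: str, rng: random.Random) -> dict:
    """Always attach X-Forwarded-For, mirroring the 'XFF always set' change."""
    return {"X-Forwarded-For": pick_xff(scenario, rng)}


rng = random.Random(42)  # seeded only to make the sketch reproducible
headers = build_headers("human", rng)
ipaddress.ip_address(headers["X-Forwarded-For"])  # raises ValueError if malformed
```

With mod_remoteip trusting the sending proxy, the header value is what ends up in r->useragent_ip, so each request can be logged with a different src_ip.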

Option D - Direct ClickHouse seeder (seed_clickhouse.py, stdlib only):
- Inserts ~4000 rows into http_logs_raw triggering full MV chain:
    http_logs_raw → mv_http_logs → http_logs
      → mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h
  • 720 human sessions: IPs in OVH/SFR/Orange ASN ranges (16276/15557/3215)
    → dict_asn_reputation maps these to asn_label='human'
    → satisfies bot_detector human_baseline >= 500 threshold
  • 150 scanner sessions: datacenter IPs, attack paths (/.env, wp-login,
    SQLi, path traversal), scanner UAs, minimal TCP fingerprints
  • 100 known-bot sessions: IPs matching bot_ip.csv entries
  • 20 brute-force clusters: 20-50 POST /login per IP
  All TCP/TLS metadata is profile-realistic (window, MSS, TTL, JA4, JA3)
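
A stdlib-only seeder of the kind described in Option D could talk to ClickHouse over its HTTP interface. The sketch below only builds the request and does not send it. The function name `build_insert_request` and the row columns (`src_ip`, `method`, `path`) are hypothetical; the real http_logs_raw schema and the seeder's actual code are not shown in this commit page.

```python
import json
import urllib.parse
import urllib.request


def build_insert_request(host, port, table, rows, user="default", password=""):
    """Build (but do not send) an INSERT for ClickHouse's HTTP interface."""
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    url = f"http://{host}:{port}/?" + urllib.parse.urlencode({"query": query})
    # JSONEachRow expects one JSON object per line in the request body.
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("X-ClickHouse-User", user)
    req.add_header("X-ClickHouse-Key", password)
    return req


# Hypothetical rows: one human-like request and one scanner-like request.
rows = [
    {"src_ip": "91.121.10.20", "method": "GET", "path": "/"},
    {"src_ip": "185.220.101.5", "method": "GET", "path": "/.env"},
]
req = build_insert_request("clickhouse", 8123, "ja4_logs.http_logs_raw", rows)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

Because the rows land in the raw table, the materialized views fire as they would for correlator output, which is why the seeder does not need to write to the aggregate tables itself.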

CSV stubs (mounted at /var/lib/clickhouse/user_files/):
- iplocate-ip-to-asn.csv: 13 CIDR→ASN mappings (OVH/SFR/Orange/Tor/Contabo)
- asn_reputation.csv: 13 ASN→label (8 'human', 3 'datacenter'/'hosting')
- bot_ip.csv: 14 known scanner/Tor IPs (Shodan, Censys, Tor exits)
- bot_ja4.csv: 5 bot JA4 fingerprints (curl, python-requests, masscan, zgrab)

run-tests.sh:
- Phase 4a: seeder runs before live traffic (ensures bot_detector baseline)
- Phase 4b: live traffic gen at 500 requests (up from 200)
- Phase 5f: new assertions (agg_host_ip_ja4_1h populated, ≥500 human
  rows in view_ai_features_1h, known-bot labels present)
- Phase 7: verifies ml_all_scores populated (bot_detector ran a cycle)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -115,10 +115,6 @@ wait_for_service clickhouse 120
 wait_for_service platform 120
 wait_for_service dashboard 60
 
-# Give bot-detector time to start (it's expected to fail initially — no data yet)
-log "Waiting 10s for bot-detector to initialize..."
-sleep 10
-
 # =============================================================================
 # Phase 3: Verify ClickHouse schema
 # =============================================================================
@@ -157,28 +153,46 @@ for user in data_writer analyst; do
 done
 
 # =============================================================================
-# Phase 4: Generate test traffic
+# Phase 4: Seed ClickHouse + Generate test traffic
 # =============================================================================
 log "============================================"
-log "Phase 4: Generating test traffic"
+log "Phase 4a: Seeding ClickHouse with synthetic data"
 log "============================================"
 
-# Traffic comes from traffic-gen container (crosses Docker network eth0)
-# so sentinel's pcap capture sees TLS ClientHello packets.
-# Python generator uses multiple SSL contexts → varied JA4/JA3 fingerprints.
-# Both HTTP (port 80) and HTTPS (port 443) requests are sent.
-log "Starting Python traffic generator (200 requests, 10 workers)..."
+# The seeder inserts directly into http_logs_raw, triggering all MVs:
+# http_logs_raw → mv_http_logs → http_logs → mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h
+# This pre-populates:
+# - 720 human sessions (IPs in residential ASN ranges → asn_label='human')
+# - 150 scanner/anomaly sessions (IPs in datacenter ASN → ML anomaly candidates)
+# - 100 known-bot sessions (IPs/JA4 matching bot_ip.csv / bot_ja4.csv)
+# - 20 brute-force clusters (many POST /login per IP)
+# After seeding, bot_detector has ≥500 human rows → can train and run.
+log "Running seed_clickhouse.py..."
+if docker compose exec -T traffic-gen python /app/seed_clickhouse.py \
+        --host clickhouse --port 8123 --user default --password ""; then
+    pass "ClickHouse seeded (700+ human + 150 scanner + 100 known-bot rows)"
+else
+    warn "Seeder reported errors (pipeline verification will show impact)"
+fi
+
+log "============================================"
+log "Phase 4b: Generating live test traffic via Apache"
+log "============================================"
+
+# Live traffic crosses the Docker network so sentinel can capture TLS handshakes.
+# X-Forwarded-For is always set — mod_remoteip updates r->useragent_ip → diverse src_ips.
+log "Starting traffic generator (500 requests, 10 workers)..."
 if docker compose exec -T traffic-gen python /app/generate_traffic.py \
         --host platform --http-port 80 --https-port 443 \
-        --requests 200 --workers 10; then
-    pass "Traffic generation complete (200 requests: browsers, bots, GET/POST/HEAD/PUT/DELETE/OPTIONS)"
+        --requests 500 --workers 10; then
+    pass "Traffic generation complete (500 requests with diverse XFF IPs: browsers, bots)"
 else
     warn "Traffic generator reported some errors (>80% success still passes)"
 fi
 
 # Wait for correlator to flush all batches to ClickHouse
-log "Waiting 15s for correlator to flush..."
-sleep 15
+log "Waiting 20s for correlator to flush and bot-detector first cycle..."
+sleep 20
 
 # =============================================================================
 # Phase 5: Verify data pipeline
@@ -190,7 +204,7 @@ log "============================================"
 # 5a. Raw logs ingested
 RAW_COUNT=$(ch_query "SELECT count() FROM ja4_logs.http_logs_raw")
 if [ "$RAW_COUNT" -gt 0 ] 2>/dev/null; then
-    pass "Raw logs ingested: $RAW_COUNT rows in http_logs_raw"
+    pass "Raw logs ingested: $RAW_COUNT rows in http_logs_raw (seeder + live traffic)"
 else
     fail "No raw logs in http_logs_raw (correlator → ClickHouse failed)"
     # Debug
@@ -252,6 +266,35 @@ else
     warn "Correlator file output empty"
 fi
 
+# 5f. Verify seeder data reached agg table and AI features view
+AGG_COUNT=$(ch_query "SELECT count() FROM ja4_processing.agg_host_ip_ja4_1h")
+HUMAN_COUNT=$(ch_query "SELECT count() FROM ja4_processing.view_ai_features_1h WHERE asn_label='human'")
+BOT_LABEL_COUNT=$(ch_query "SELECT count() FROM ja4_processing.view_ai_features_1h WHERE bot_name != ''")
+UNIQ_SRC_IPS=$(ch_query "SELECT count(DISTINCT src_ip) FROM ja4_processing.view_ai_features_1h")
+UNIQ_JA4=$(ch_query "SELECT count(DISTINCT ja4) FROM ja4_processing.view_ai_features_1h")
+
+if [ "$AGG_COUNT" -gt 0 ] 2>/dev/null; then
+    pass "Aggregation table populated: $AGG_COUNT sessions in agg_host_ip_ja4_1h"
+else
+    fail "agg_host_ip_ja4_1h empty (MV chain broken)"
+fi
+
+if [ "$HUMAN_COUNT" -ge 500 ] 2>/dev/null; then
+    pass "Bot-detector baseline: $HUMAN_COUNT human sessions (≥500 threshold met)"
+elif [ "$HUMAN_COUNT" -gt 0 ] 2>/dev/null; then
+    warn "Human sessions below threshold: $HUMAN_COUNT < 500 (bot_detector will skip cycle)"
+else
+    fail "No human sessions in view_ai_features_1h (asn_reputation CSV not loaded?)"
+fi
+
+if [ "$BOT_LABEL_COUNT" -gt 0 ] 2>/dev/null; then
+    pass "Known bots labeled: $BOT_LABEL_COUNT sessions with bot_name (bot_ip/bot_ja4 dicts working)"
+else
+    warn "No known-bot labels in view_ai_features_1h (bot_ip.csv / bot_ja4.csv empty?)"
+fi
+
+log " Unique src_ips: $UNIQ_SRC_IPS | Unique JA4: $UNIQ_JA4"
+
 # =============================================================================
 # Phase 6: Verify dashboard API
 # =============================================================================
@@ -305,7 +348,17 @@ for line in sys.stdin:
 if [ "$BOT_STATUS" = "running" ]; then
     pass "Bot-detector is running"
 else
-    warn "Bot-detector state: $BOT_STATUS (may need more data to start properly)"
+    warn "Bot-detector state: $BOT_STATUS"
 fi
 
+# Check if bot-detector successfully ran a detection cycle (not just SKIPPED_LOW_DATA)
+BD_SCORES=$(ch_query "SELECT count() FROM ja4_processing.ml_all_scores" 2>/dev/null || echo 0)
+BD_ANOMALIES=$(ch_query "SELECT count() FROM ja4_processing.ml_detected_anomalies" 2>/dev/null || echo 0)
+if [ "$BD_SCORES" -gt 0 ] 2>/dev/null; then
+    pass "Bot-detector scored traffic: $BD_SCORES rows in ml_all_scores, $BD_ANOMALIES anomalies detected"
+else
+    warn "ml_all_scores is empty — bot-detector may not have completed a cycle yet"
+    warn " (check: docker compose logs bot-detector | grep -E 'CYCLE|SKIP|train')"
+fi
+
 # =============================================================================
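
The Phase 5f human-baseline check in the diff maps a query result to pass, warn, or fail. Because the `[ ... ]` tests redirect stderr to /dev/null, a non-numeric result (for example an error string from the query) fails both integer comparisons and falls through to the fail branch. A Python sketch of the same decision, written only to illustrate the logic; the function name `classify_human_baseline` is hypothetical and does not exist in run-tests.sh:

```python
def classify_human_baseline(raw_count: str, threshold: int = 500) -> str:
    """Mirror the pass/warn/fail tiers of the Phase 5f human-session check."""
    try:
        count = int(raw_count.strip())
    except ValueError:
        # Empty or non-numeric output: the shell comparisons both fail,
        # so the script reaches its final else branch.
        return "fail"
    if count >= threshold:
        return "pass"  # enough human sessions for the bot-detector baseline
    if count > 0:
        return "warn"  # some data, but the detector is expected to skip its cycle
    return "fail"      # no human sessions at all


for sample in ("720", "499", "0", ""):
    print(repr(sample), "->", classify_human_baseline(sample))
```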