feat(scripts): complete stack init + prod data import with date shift

Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules

New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
  + validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
  and shifts timestamps so that max(time) maps to now. Supports --shift,
  --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports
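
The date shift described above can be sketched as follows. This is only an
illustration of the max(time) → now idea, not the logic of
import-prod-data.sh itself: the helper names compute_shift and shift_rows and
the in-memory row format are hypothetical, and how the real script applies
--shift is not shown in this commit message.

```python
from datetime import datetime, timedelta, timezone


def compute_shift(timestamps, now=None):
    """Return the offset that moves the newest timestamp onto `now`.

    Hypothetical helper: an empty input yields a zero shift.
    """
    if not timestamps:
        return timedelta(0)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - max(timestamps)


def shift_rows(rows, shift):
    """Return copies of the rows with the same offset added to each `time`.

    Applying one constant offset preserves the gaps between events.
    """
    return [{**row, "time": row["time"] + shift} for row in rows]
```

Because every row receives the same offset, the relative spacing of events is
unchanged; only the whole window moves so that its newest row sits at the
reference time.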

Makefile targets: init-stack, import-prod-data, init-and-import

Tested: init-stack.sh passes all 13 SQL files + 7 critical tables + 7 dictionaries
        import-prod-data.sh: 3M rows in ~37s with auto date shift
        Dashboard: 55 routes OK; bot-detector: 36/36 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-09 21:40:05 +02:00
parent d8ca804a55
commit 9ea36ad22e
8 changed files with 437 additions and 54 deletions

@@ -1,7 +1,6 @@
 -- ============================================================================
 -- ANUBIS CRAWLER RULES: http_logs labeling + ML pipeline
 -- Simplified architecture (IP/CIDR and ASN only):
--- anubis_ua_rules (stub table) → dict_anubis_ua (REGEXP_TREE, catch-all)
 -- anubis_ip_rules (table) → dict_anubis_ip (IP_TRIE)
 -- anubis_asn_rules (table) → dict_anubis_asn (FLAT)
 -- http_logs : +anubis_bot_name, +anubis_bot_action, +anubis_bot_category
@@ -11,23 +10,7 @@
 -- ============================================================================
 -- ----------------------------------------------------------------------------
--- 1. SOURCE TABLE: User-Agent rules (REGEXP_TREE stub)
--- REGEXP_TREE requires ≥1 rule; the catch-all is injected at init.
--- This table is NOT populated by fetch_rules.py.
--- ----------------------------------------------------------------------------
-CREATE TABLE IF NOT EXISTS ja4_processing.anubis_ua_rules
-(
-    id UInt64,
-    parent_id UInt64,
-    regexp String,
-    keys Array(String),
-    values Array(String)
-)
-ENGINE = ReplacingMergeTree()
-ORDER BY id;
--- ----------------------------------------------------------------------------
--- 2. SOURCE TABLE: IP/CIDR rules (for the IP_TRIE dictionary)
+-- 1. SOURCE TABLE: IP/CIDR rules (for the IP_TRIE dictionary)
 -- Populated by fetch_rules.py from the Anubis YAML files.
 -- ----------------------------------------------------------------------------
 CREATE TABLE IF NOT EXISTS ja4_processing.anubis_ip_rules