feat(scripts): complete stack init + prod data import with date shift

Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules

New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
  + validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
  with dynamic date shift (max(time) → now). Supports --shift, --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports

Makefile targets: init-stack, import-prod-data, init-and-import

Tested:
- init-stack.sh: passes all 13 SQL files + 7 critical tables + 7 dicts
- import-prod-data.sh: 3M rows in ~37s with auto date shift
- Dashboard: 55 routes OK; bot-detector: 36/36 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
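
The dynamic date shift mentioned above (max(time) → now) can be illustrated with a minimal Python sketch. This is not the code in scripts/import-prod-data.sh, which is not shown here; the function names, the row layout, and the sample timestamps are all hypothetical. It only shows the arithmetic: compute one offset from the newest timestamp, then add that offset to every row so that relative spacing is preserved.

```python
from datetime import datetime, timedelta, timezone


def compute_shift(timestamps, now):
    """Return the offset that moves the newest timestamp onto `now`."""
    return now - max(timestamps)


def shift_rows(rows, offset):
    """Add the same offset to every (timestamp, payload) row."""
    return [(ts + offset, payload) for ts, payload in rows]


# Hypothetical sample data standing in for an exported table.
rows = [
    (datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "a"),
    (datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc), "b"),
]

# Fixed "now" so the example is deterministic; a real import would use the current time.
now = datetime(2024, 3, 1, 0, 0, tzinfo=timezone.utc)

offset = compute_shift([ts for ts, _ in rows], now)
shifted = shift_rows(rows, offset)
```

Because a single offset is applied to all rows, the newest row lands exactly on `now` and the gaps between rows are unchanged, which is what keeps time-windowed queries meaningful on imported data.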
@@ -1,29 +1,10 @@
 -- =============================================================================
 -- 03_anubis_tables.sql — Anubis crawler rule tables and dictionaries
--- Only IP/CIDR and ASN rules are populated by fetch_rules.py.
--- UA and Country dictionaries are kept as stubs (required by MV references)
--- but are never populated with real data.
+-- Only IP/CIDR and ASN rules are used. UA and Country have been removed.
 -- =============================================================================
 
 -- -----------------------------------------------------------------------------
--- 1. TABLE SOURCE — User-Agent rules (REGEXP_TREE stub)
--- REGEXP_TREE requires ≥1 rule; the catch-all is seeded at init time.
--- This table is NOT populated by fetch_rules.py.
--- -----------------------------------------------------------------------------
-CREATE TABLE IF NOT EXISTS ja4_processing.anubis_ua_rules
-(
-    id UInt64,
-    parent_id UInt64,
-    regexp String,
-    keys Array(String),
-    values Array(String)
-)
-ENGINE = ReplacingMergeTree()
-ORDER BY id;
-
-
--- -----------------------------------------------------------------------------
--- 2. TABLE SOURCE — IP/CIDR rules (for IP_TRIE dictionary)
+-- 1. TABLE SOURCE — IP/CIDR rules (for IP_TRIE dictionary)
 -- Populated by fetch_rules.py from Anubis GitHub data.
 -- -----------------------------------------------------------------------------
 CREATE TABLE IF NOT EXISTS ja4_processing.anubis_ip_rules