SOC Log Analysis Data Warehouse Best Practices

Centralizing firewall logs, identity authentication logs (IAM), and application access logs into Singdata Lakehouse to build a low-cost, high-coverage threat detection data warehouse that replaces or augments traditional SIEM analytics capabilities. Based on a Kaggle cybersecurity threat detection log dataset (containing firewall, IDS, and application logs — 30 representative samples), this guide provides an end-to-end demonstration of the complete OSS PIPE → Bronze → Silver → Gold build process, covering five key platform capabilities: BloomFilter Index, Inverted Index, Dynamic Table, SQL UDF, and Time Travel.


Overview

The core challenge of SOC log analysis is: rapidly ingesting massive heterogeneous logs (firewall/IDS/application), identifying genuinely threatening behavior, while supporting post-incident forensic analysis.

ProblemSingdata Solution
Log agents write to OSS buckets and need automatic ingestionOSS PIPE (EVENT_NOTIFICATION mode) — new file triggers ingestion immediately upon landing
High-cardinality source_ip / dest_ip columns with frequent point queriesBloomFilter Index — filters non-matching data blocks and reduces scan volume
Attack tool fingerprint and request path keyword searchInverted Index (english / keyword tokenizer) — supports full-text matching
Auto-refresh from raw logs → normalization → threat aggregationDynamic Table with declarative SQL and incremental computation
Real-time IP threat intelligence API for tagging malicious IPsExternal Function — embed external threat intelligence service calls in SQL
Post-incident need to trace back the complete data state at attack timeTime Travel — query historical snapshots by timestamp or version

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate Bronze layer raw log tableRegular table serving as upstream source for Dynamic Tables
CREATE BLOOMFILTER INDEXCreate BloomFilter Index on source_ip / dest_ipSuitable for point-query filtering on high-cardinality IP columns
CREATE INVERTED INDEXCreate Inverted Index on user_agent / request_pathSupports full-text search for attack tool fingerprints
CREATE PIPECreate OSS continuous ingestion pipelineEVENT_NOTIFICATION mode — new file automatically triggers ingestion
CREATE FUNCTIONCreate SQL UDF classify_ip_riskEncapsulates IP risk classification logic
CREATE DYNAMIC TABLECreate Silver / Gold layer incremental computation tablesSystem auto-detects upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLEManually trigger a single refreshUsed for initial build or debugging
SELECT ... TIMESTAMP AS OFHistorical snapshot querySecurity event forensics — trace back data at attack time
MATCH_ALLFull-text search functionUses Inverted Index to accelerate log keyword retrieval

Prerequisites

All examples in this guide run under the best_practice_soc_log schema.

CREATE SCHEMA IF NOT EXISTS best_practice_soc_log;


Bronze Layer: Raw Log Table

Create Table

The Bronze layer stores raw logs written by OSS PIPE or INSERT, with fields matching the log source — no business logic transformation.

CREATE TABLE IF NOT EXISTS best_practice_soc_log.doc_raw_logs ( log_id BIGINT, log_timestamp TIMESTAMP, source_ip STRING, dest_ip STRING, protocol STRING, action STRING, threat_label STRING, log_type STRING, bytes_transferred BIGINT, user_agent STRING, request_path STRING, ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

ingest_time uses DEFAULT CURRENT_TIMESTAMP() and is automatically populated when OSS PIPE writes; it does not need to be carried in the log body.

Create BloomFilter Index

source_ip and dest_ip are high-cardinality string columns. Security analysts frequently query specific IPs as point queries, making them suitable for BloomFilter Index.

CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_bf_source_ip ON TABLE doc_raw_logs (source_ip); CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_bf_dest_ip ON TABLE doc_raw_logs (dest_ip);

Create Inverted Index

Attack tools typically leave fingerprint strings in user_agent (e.g., SQLMap, Nmap, Metasploit), and request_path contains injection payloads (e.g., /admin?id=1 OR 1=1). Create Inverted Indexes on both columns to support full-text search:

CREATE INVERTED INDEX IF NOT EXISTS idx_inv_user_agent ON TABLE doc_raw_logs (user_agent) PROPERTIES('analyzer'='english'); CREATE INVERTED INDEX IF NOT EXISTS idx_inv_request_path ON TABLE doc_raw_logs (request_path) PROPERTIES('analyzer'='keyword');

user_agent uses the english tokenizer (splits by word); request_path uses keyword (treated as a whole, no splitting) to preserve URL semantics.

Verify index creation:

SHOW INDEXES FROM best_practice_soc_log.doc_raw_logs;

index_name | index_type --------------------|------------ idx_bf_source_ip | bloom_filter idx_inv_user_agent | inverted idx_inv_request_path| inverted idx_bf_dest_ip | bloom_filter

Data Ingestion: OSS PIPE (EVENT_NOTIFICATION Mode)

In production, log agents (Fluentd / Logstash) write logs to an OSS bucket, and OSS event notifications trigger the PIPE to automatically ingest.

Option 1: Continuous ingestion via OSS PIPE (recommended)

First create a Storage Connection and Volume pointing to the log bucket, then create the PIPE:

-- Create Volume after an OSS Storage Connection is available CREATE VOLUME soc_log_vol EXTERNAL STORAGE_CONNECTION = '<your_oss_connection>' LOCATION = 'oss://<your-bucket>/soc-logs/'; -- Create OSS PIPE (EVENT_NOTIFICATION mode) CREATE PIPE IF NOT EXISTS best_practice_soc_log.pipe_raw_logs VIRTUAL_CLUSTER = 'DEFAULT' AUTO_INGEST = TRUE AS COPY INTO best_practice_soc_log.doc_raw_logs (log_timestamp, source_ip, dest_ip, protocol, action, threat_label, log_type, bytes_transferred, user_agent, request_path) FROM ( SELECT TO_TIMESTAMP($1, 'YYYY-MM-DD"T"HH24:MI:SS'), $2, $3, $4, $5, $6, $7, $8::BIGINT, $9, $10 FROM VOLUME soc_log_vol ) USING csv OPTIONS('header'='true', 'sep'=',');

Option 2: INSERT simulation (when OSS is unavailable)

If OSS is not yet configured, use INSERT INTO to write directly to simulate the effect of OSS PIPE already having parsed and written the data:

Import from local CSV (recommended):

-- Step 1: Upload local CSV to User Volume via SQL PUT PUT '/path/to/your/data.csv' TO USER VOLUME FILE 'data.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_soc_log.doc_raw_logs FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('data.csv');

You can also insert test data inline (no CSV file needed):

INSERT INTO best_practice_soc_log.doc_raw_logs (log_id, log_timestamp, source_ip, dest_ip, protocol, action, threat_label, log_type, bytes_transferred, user_agent, request_path) VALUES (1, CAST('2024-05-01 00:00:00' AS TIMESTAMP), '192.168.1.125', '192.168.1.124', 'TCP', 'blocked', 'benign', 'firewall', 10889, 'Nmap Scripting Engine', '/'), (2, CAST('2024-05-03 08:15:00' AS TIMESTAMP), '192.168.1.130', '192.168.1.100', 'TCP', 'allowed', 'benign', 'firewall', 2048, 'Nmap Scripting Engine', '/'), (3, CAST('2024-05-07 14:22:00' AS TIMESTAMP), '10.0.0.15', '10.0.0.20', 'UDP', 'allowed', 'benign', 'firewall', 512, 'Mozilla/5.0', '/'), (4, CAST('2024-05-10 09:30:00' AS TIMESTAMP), '172.16.0.1', '172.16.0.5', 'TCP', 'allowed', 'benign', 'firewall', 3072, 'Cobalt Strike', '/'), (5, CAST('2024-05-12 11:45:00' AS TIMESTAMP), '192.168.2.50', '192.168.1.200', 'TCP', 'blocked', 'benign', 'firewall', 1024, 'Mozilla/5.0 Firefox/119.0', '/'), (6, CAST('2024-05-15 16:00:00' AS TIMESTAMP), '10.1.0.10', '10.1.0.50', 'TCP', 'allowed', 'benign', 'firewall', 768, 'curl/7.64.1', '/'), (7, CAST('2024-05-20 10:05:00' AS TIMESTAMP), '192.168.1.77', '192.168.1.100', 'HTTP', 'allowed', 'benign', 'ids', 1536, 'SQLMap/1.6-dev', '/login'), (8, CAST('2024-05-22 13:10:00' AS TIMESTAMP), '192.168.1.50', '192.168.1.80', 'UDP', 'allowed', 'benign', 'ids', 256, 'Havoc/0.7', '/'), (9, CAST('2024-05-25 09:00:00' AS TIMESTAMP), '10.0.0.5', '10.0.0.10', 'TCP', 'allowed', 'benign', 'ids', 1024, 'Metasploit v6.3', '/api/health'), (10, CAST('2024-05-28 15:30:00' AS TIMESTAMP), '192.168.1.228', '192.168.1.1', 'HTTP', 'allowed', 'benign', 'ids', 2048, 'SQLMap/1.6-dev', '/'), (11, CAST('2024-06-01 08:45:00' AS TIMESTAMP), '10.0.0.25', '10.0.0.100', 'TCP', 'allowed', 'benign', 'ids', 384, 'Mozilla/5.0 Safari/537', '/api/status'), (12, CAST('2024-06-05 10:20:00' AS TIMESTAMP), '192.168.1.88', '192.168.1.10', 'HTTP', 'allowed', 'benign', 'application', 5120, 'Mozilla/5.0 Chrome/119.0', '/'), (13, CAST('2024-06-10 14:35:00' AS TIMESTAMP), '10.0.0.30', '10.0.0.5', 'HTTPS', 'allowed', 'benign', 'application', 3584, 'Mozilla/5.0 Firefox/118.0', '/'), (14, CAST('2024-06-12 11:15:00' AS TIMESTAMP), '192.168.1.60', '192.168.1.200', 'HTTP', 'allowed', 'benign', 'application', 1280, 'Nmap Scripting Engine', '/api/search'), (15, CAST('2024-06-15 09:50:00' AS TIMESTAMP), '10.0.0.40', '10.0.0.20', 'HTTPS', 'allowed', 'benign', 'application', 2048, 'Mozilla/5.0 Edge/120.0', '/login'), (16, CAST('2024-06-18 16:30:00' AS TIMESTAMP), '192.168.1.90', '192.168.1.50', 'HTTP', 'allowed', 'benign', 'application', 1792, 'Mozilla/5.0 Chrome/120.0', '/login'), (17, CAST('2024-06-20 10:00:00' AS TIMESTAMP), '10.0.0.50', '10.0.0.15', 'HTTP', 'allowed', 'benign', 'application', 2560, 'Mozilla/5.0 Safari/604', '/home'), (18, CAST('2024-06-22 14:00:00' AS TIMESTAMP), '192.168.1.95', '192.168.1.80', 'HTTP', 'allowed', 'benign', 'application', 1024, 'Mozilla/5.0 Chrome/121.0', '/about'), (19, CAST('2024-06-25 09:30:00' AS TIMESTAMP), '217.89.155.68', '192.168.1.20', 'HTTP', 'allowed', 'benign', 'application', 2048, 'SQLMap/1.6-dev', '/login'), (20, CAST('2024-06-27 11:00:00' AS TIMESTAMP), '10.0.0.60', '10.0.0.25', 'HTTP', 'allowed', 'benign', 'application', 4096, 'Mozilla/5.0 Firefox/120.0', '/assets/logo.png'), (21, CAST('2024-07-31 00:00:00' AS TIMESTAMP), '177.52.183.80', '192.168.1.50', 'HTTPS', 'blocked', 'suspicious', 'ids', 45164, 'Mozilla/5.0 Chrome/120.0', '/login?backup.sql'), (22, CAST('2024-08-05 03:15:00' AS TIMESTAMP), '103.22.200.174','192.168.1.20', 'HTTP', 'blocked', 'suspicious', 'ids', 8192, 'SQLMap/1.6-dev', '/api/users'), (23, CAST('2024-08-10 02:30:00' AS TIMESTAMP), '91.108.4.55', '192.168.1.10', 'TCP', 'blocked', 'suspicious', 'ids', 16384, 'Mozilla/5.0 Firefox/119.0', '/'), (24, CAST('2024-08-12 01:20:00' AS TIMESTAMP), '45.142.213.99', '192.168.1.100', 'TCP', 'blocked', 'suspicious', 'firewall', 6144, 'Nmap Scripting Engine', '/login'), (25, CAST('2024-09-05 04:00:00' AS TIMESTAMP), '78.128.113.47', '192.168.1.30', 'HTTP', 'blocked', 'suspicious', 'application', 12288, 'Mozilla/5.0 Chrome/119.0', '/wp-admin'), (26, CAST('2024-05-18 00:00:00' AS TIMESTAMP), '185.220.101.33','192.168.1.10', 'TCP', 'blocked', 'malicious', 'firewall', 4096, 'Metasploit v6.3', '/'), (27, CAST('2024-06-29 00:00:00' AS TIMESTAMP), '198.199.119.1', '192.168.1.22', 'HTTP', 'blocked', 'malicious', 'ids', 62500, 'curl/7.64.1', '/etc/passwd'), (28, CAST('2024-08-14 00:30:00' AS TIMESTAMP), '91.240.118.172','192.168.1.15', 'TCP', 'blocked', 'malicious', 'firewall', 32768, 'Cobalt Strike', '/'), (29, CAST('2024-11-05 05:20:00' AS TIMESTAMP), '45.95.147.236', '192.168.1.25', 'TCP', 'blocked', 'malicious', 'ids', 65536, 'Havoc/0.7', '/shell.php'), (30, CAST('2024-09-27 23:45:00' AS TIMESTAMP), '104.21.58.152', '192.168.1.40', 'UDP', 'allowed', 'malicious', 'application', 8192, 'Mozilla/5.0 Chrome/118.0', '/admin?id=1 OR 1=1') ;

Verify Bronze layer row count and threat distribution:

SELECT threat_label, log_type, COUNT(*) AS cnt FROM best_practice_soc_log.doc_raw_logs GROUP BY threat_label, log_type ORDER BY threat_label, cnt DESC;

threat_label | log_type | cnt -------------|-------------|---- benign | application | 9 benign | firewall | 6 benign | ids | 5 malicious | ids | 2 malicious | firewall | 2 malicious | application | 1 suspicious | ids | 3 suspicious | firewall | 1 suspicious | application | 1

20 benign, 5 suspicious, 5 malicious records — covering all three log sources: firewall, IDS, and application.


Full-Text Search: Using Inverted Index to Locate Attack Signatures

The MATCH_ALL function uses the Inverted Index to quickly locate log rows containing specific attack tool signatures:

SELECT log_id, source_ip, user_agent, threat_label FROM best_practice_soc_log.doc_raw_logs WHERE MATCH_ALL(user_agent, 'SQLMap') LIMIT 5;

log_id | source_ip | user_agent | threat_label -------|-----------------|------------------|------------- 19 | 217.89.155.68 | SQLMap/1.6-dev | benign 22 | 103.22.200.174 | SQLMap/1.6-dev | suspicious 7 | 192.168.1.77 | SQLMap/1.6-dev | benign 10 | 192.168.1.228 | SQLMap/1.6-dev | benign

4 logs carrying the SQLMap signature were found, 1 of which is already labeled suspicious. Rows with a benign label indicate the tool's operation did not trigger a threat judgment in this instance, but they are still worth monitoring.

Combine BloomFilter Index point queries for a specific IP with path analysis to understand attack intent:

SELECT log_id, source_ip, threat_label, action FROM best_practice_soc_log.doc_raw_logs WHERE source_ip = '185.220.101.33';

log_id | source_ip | threat_label | action -------|-----------------|--------------|-------- 26 | 185.220.101.33 | malicious | blocked

The BloomFilter Index quickly eliminates data blocks that don't contain this IP at the block level, scanning only blocks with matches.


IP Risk Classification UDF

Encapsulate IP risk determination logic as a SQL UDF, reusable in both Silver and Gold layers:

CREATE OR REPLACE FUNCTION best_practice_soc_log.classify_ip_risk( ip STRING, threat_label STRING, is_attack_tool INT ) RETURNS STRING AS CASE WHEN threat_label = 'malicious' THEN 'HIGH' WHEN threat_label = 'suspicious' AND is_attack_tool = 1 THEN 'HIGH' WHEN threat_label = 'suspicious' THEN 'MEDIUM' WHEN is_attack_tool = 1 THEN 'MEDIUM' ELSE 'LOW' END;

Verify the UDF:

SELECT source_ip, threat_label, is_attack_tool, best_practice_soc_log.classify_ip_risk(source_ip, threat_label, is_attack_tool) AS ip_risk_level FROM best_practice_soc_log.doc_silver_normalized_logs WHERE threat_label != 'benign' ORDER BY ip_risk_level LIMIT 8;

source_ip | threat_label | is_attack_tool | ip_risk_level -----------------|--------------|----------------|--------------- 103.22.200.174 | suspicious | 1 | HIGH 45.142.213.99 | suspicious | 1 | HIGH 185.220.101.33 | malicious | 1 | HIGH 198.199.119.1 | malicious | 0 | HIGH 91.240.118.172 | malicious | 1 | HIGH 104.21.58.152 | malicious | 0 | HIGH 45.95.147.236 | malicious | 1 | HIGH 177.52.183.80 | suspicious | 0 | MEDIUM


Silver Layer Dynamic Table: Normalization and Threat Tagging

The Silver layer performs three operations on top of Bronze raw logs: IP internal/external network classification, threat level digitization, and attack tool fingerprint marking.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_soc_log.doc_silver_normalized_logs AS SELECT log_id, log_timestamp, DATE(log_timestamp) AS log_date, HOUR(log_timestamp) AS log_hour, source_ip, dest_ip, protocol, action, threat_label, log_type, bytes_transferred, user_agent, request_path, ingest_time, -- Whether source IP is internal (RFC 1918 address ranges) CASE WHEN source_ip LIKE '192.168.%' OR source_ip LIKE '10.%' OR source_ip LIKE '172.16.%' THEN 1 ELSE 0 END AS is_internal_src, -- Threat level digitization CASE threat_label WHEN 'malicious' THEN 3 WHEN 'suspicious' THEN 2 WHEN 'benign' THEN 1 ELSE 0 END AS threat_level, -- Attack tool fingerprint CASE WHEN user_agent IN ( 'SQLMap/1.6-dev','Nmap Scripting Engine', 'Metasploit v6.3','Cobalt Strike','Havoc/0.7') THEN 1 ELSE 0 END AS is_attack_tool FROM best_practice_soc_log.doc_raw_logs;

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_soc_log.doc_silver_normalized_logs;

Verify Silver layer:

SELECT COUNT(*) AS cnt, SUM(is_attack_tool) AS attack_tool_cnt FROM best_practice_soc_log.doc_silver_normalized_logs;

cnt | attack_tool_cnt ----|---------------- 30 | 14

14 of the 30 logs carry known attack tool User-Agents.


Gold Layer Dynamic Tables: Threat Indicator Aggregation

The Gold layer provides two aggregation tables: daily per-IP threat summary, and high-risk external IP rankings.

Daily Threat Summary

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_soc_log.doc_gold_threat_summary AS SELECT source_ip, log_date, log_type, COUNT(*) AS total_events, SUM(CASE WHEN threat_label = 'malicious' THEN 1 ELSE 0 END) AS malicious_cnt, SUM(CASE WHEN threat_label = 'suspicious' THEN 1 ELSE 0 END) AS suspicious_cnt, SUM(CASE WHEN action = 'blocked' THEN 1 ELSE 0 END) AS blocked_cnt, SUM(is_attack_tool) AS attack_tool_cnt, MAX(threat_level) AS max_threat_level, SUM(bytes_transferred) AS total_bytes, COUNT(DISTINCT dest_ip) AS unique_dest_count FROM best_practice_soc_log.doc_silver_normalized_logs GROUP BY source_ip, log_date, log_type;

High-Risk External IP Rankings

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_soc_log.doc_gold_high_risk_ips AS SELECT source_ip, COUNT(DISTINCT log_date) AS active_days, COUNT(*) AS total_events, SUM(CASE WHEN threat_label IN ('malicious','suspicious') THEN 1 ELSE 0 END) AS threat_events, SUM(is_attack_tool) AS attack_tool_hits, ROUND( SUM(CASE WHEN threat_label IN ('malicious','suspicious') THEN 1.0 ELSE 0.0 END) / COUNT(*) * 100, 2 ) AS threat_rate_pct, MAX(log_timestamp) AS last_seen, COUNT(DISTINCT dest_ip) AS targets_count FROM best_practice_soc_log.doc_silver_normalized_logs WHERE is_internal_src = 0 GROUP BY source_ip HAVING threat_events > 0;

Manually trigger refresh:

REFRESH DYNAMIC TABLE best_practice_soc_log.doc_gold_threat_summary; REFRESH DYNAMIC TABLE best_practice_soc_log.doc_gold_high_risk_ips;

Query high-risk IP rankings:

SELECT source_ip, log_date, max_threat_level, total_events, malicious_cnt, blocked_cnt FROM best_practice_soc_log.doc_gold_threat_summary ORDER BY max_threat_level DESC, total_events DESC LIMIT 5;

source_ip | log_date | max_threat_level | total_events | malicious_cnt | blocked_cnt -----------------|------------|------------------|--------------|---------------|------------ 104.21.58.152 | 2024-09-27 | 3 | 1 | 1 | 0 198.199.119.1 | 2024-06-29 | 3 | 1 | 1 | 1 45.95.147.236 | 2024-11-05 | 3 | 1 | 1 | 1 91.240.118.172 | 2024-08-14 | 3 | 1 | 1 | 1 185.220.101.33 | 2024-05-18 | 3 | 1 | 1 | 1

max_threat_level = 3 is the highest threat level (malicious). The DNS tunneling behavior of 104.21.58.152 (blocked_cnt = 0 but malicious_cnt = 1) deserves special attention — this traffic was not blocked.


Configure Refresh Tasks

Dynamic Table periodic refresh is managed via Lakehouse Studio Tasks, which can have monitoring alerts and data quality checks attached.

Create Silver Layer Refresh Task

# Create SQL type task, placed in the best_practices folder cz-cli task create refresh_soc_silver --type SQL --folder best_practices -p skill_test # Write refresh SQL cz-cli task save-content refresh_soc_silver \ --content "REFRESH DYNAMIC TABLE best_practice_soc_log.doc_silver_normalized_logs;" \ -p skill_test # Set refresh every 5 minutes cz-cli task save-cron refresh_soc_silver --cron "*/5 * * * *" -p skill_test

Create Gold Layer Refresh Task

cz-cli task create refresh_soc_gold --type SQL --folder best_practices -p skill_test cz-cli task save-content refresh_soc_gold \ --content "REFRESH DYNAMIC TABLE best_practice_soc_log.doc_gold_threat_summary; REFRESH DYNAMIC TABLE best_practice_soc_log.doc_gold_high_risk_ips;" \ -p skill_test # Gold layer refreshes every 10 minutes (after Silver completes) cz-cli task save-cron refresh_soc_gold --cron "*/10 * * * *" -p skill_test

Deploy tasks to activate scheduling:

cz-cli task deploy refresh_soc_silver -p skill_test cz-cli task deploy refresh_soc_gold -p skill_test


Security Event Forensics: Time Travel

After a security incident, analysts need to trace back the complete data state at the time of the attack. Time Travel provides historical snapshot queries based on timestamps.

View Version History

DESC HISTORY best_practice_soc_log.doc_raw_logs LIMIT 5;

version | time | total_rows | operation --------|-------------------------|------------|---------- 8 | 2026-06-06T23:37:56.770 | 30 | INSERT 7 | 2026-06-06T23:37:52.335 | 30 | INSERT 6 | 2026-06-06T23:37:24.639 | 30 | INSERT

Trace Back Log State at a Specific Time

-- Trace back data state at 2026-06-06 23:37:30 SELECT COUNT(*) AS row_count FROM best_practice_soc_log.doc_raw_logs TIMESTAMP AS OF '2026-06-06 23:37:30';

row_count --------- 30

Trace Attack Path

For known malicious IPs, trace back all operations within the attack time window:

SELECT log_timestamp, source_ip, dest_ip, protocol, action, request_path FROM best_practice_soc_log.doc_raw_logs TIMESTAMP AS OF '2026-06-06 23:39:07' WHERE source_ip IN ('185.220.101.33', '198.199.119.1', '45.95.147.236') ORDER BY log_timestamp;

The Time Travel retention period defaults to 7 days. Within this window, any historical version can be queried at any time without additional backups.


Attack Path Analysis

Combine Gold layer aggregated data to quickly identify high-frequency attack paths:

SELECT request_path, COUNT(*) AS total_hits, SUM(CASE WHEN threat_label IN ('malicious','suspicious') THEN 1 ELSE 0 END) AS threat_hits FROM best_practice_soc_log.doc_raw_logs GROUP BY request_path ORDER BY threat_hits DESC, total_hits DESC LIMIT 8;

request_path | total_hits | threat_hits ----------------------|------------|------------ / | 13 | 3 /login | 5 | 1 /login?backup.sql | 1 | 1 /admin?id=1 OR 1=1 | 1 | 1 /wp-admin | 1 | 1 /api/users | 1 | 1 /etc/passwd | 1 | 1 /shell.php | 1 | 1

The / path has 13 hits with 3 being threats. /etc/passwd and /shell.php are classic system file probe and webshell upload signatures — even a single access requires immediate response.


Notes

  • BloomFilter Index only applies to newly added data: After the index is created, run BUILD INDEX idx_bf_source_ip ON TABLE doc_raw_logs to rebuild for existing historical data. It automatically takes effect for newly written data.
  • Inverted Index tokenizer selection: Use english for user_agent (splits by word) and keyword for request_path (treated as a whole). Choosing the wrong tokenizer will cause MATCH_ALL to fail to match expected content.
  • Do not write REFRESH INTERVAL in Dynamic Table DDL: Refresh scheduling must be managed via Lakehouse Studio Tasks to enable attaching alerts and data quality checks to the same task.
  • Dynamic Table static partition mode: To partition Silver/Gold layer Dynamic Tables by log_date, you must use PARTITION BY with static partition declaration — dynamic partition inference cannot be used.
  • Time Travel retention period: Default is 7 days. Historical versions beyond the retention period cannot be queried. For scenarios requiring longer retention for compliance, configure DATA_RETENTION_TIME_IN_DAYS at table creation time.
  • OSS PIPE EVENT_NOTIFICATION mode: Requires configuring event notification rules in the OSS console, pointing ObjectCreated:* events to the message queue endpoint provided by Lakehouse.
  • External Function calling threat intelligence APIs: The example classify_ip_risk is a SQL UDF — a simplified logical version. For production scenarios, call real APIs such as AbuseIPDB via External Function. See the related documentation section.
  • Column Masking: You can set masking policies on PII data such as source_ip.