Build a Smart City Government Data Platform

Aggregate open data from multiple government departments including transportation, urban management, civil affairs, and public health to build a four-layer data warehouse supporting city operations KPI dashboards and cross-department joint analysis. Using NYC Open Data's 311 citizen complaint tickets, traffic accidents, and health inspection datasets as samples, this document provides an end-to-end demonstration of the complete build process from COPY INTO + Volume โ†’ ODS โ†’ DWD โ†’ DWS โ†’ ADS, covering the practical implementation of five key capabilities: External Schema (connecting to existing Hive data lakes), RBAC (multi-department data isolation), Dynamic Table (cross-department data fusion), Table Stream (complaint ticket status change capture), and Column Masking (PII field masking).


Overview

The core challenges of government data platforms are: diverse data sources, heterogeneous formats, complex permission boundaries, while simultaneously supporting cross-department joint analysis under the premise of protecting personal privacy.

ProblemSolution
Departments periodically submit CSV files requiring unified batch ingestionCOPY INTO + Volume, declarative loading, automatic deduplication of already-loaded files
Existing Hive data lake holds historical data with high migration costExternal Schema mounts Hive metadata for direct cross-source federated queries
Inconsistent data formats across departments require standardization into city-themed eventsDynamic Table, CTE-based UNION ALL, automatic incremental fusion
Certain fields (coordinates, names) are PII requiring dynamic maskingColumn Masking bound to columns, masking transparent to queries
Department analysts can only see their own department's data; administrators see everythingRBAC, grant SELECT permissions on Schema or table level by role
Complaint ticket status changes need real-time capture to drive resolution statisticsTable Stream, captures UPDATE_BEFORE / UPDATE_AFTER

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate ODS layer raw data tables for each departmentRegular tables, COPY INTO targets
COPY INTOBatch import department CSV files from VolumeAutomatically skips already-loaded files
CREATE EXTERNAL SCHEMAMount existing Hive data lakeNo migration needed; direct federated queries
CREATE TABLE STREAMCapture complaint ticket status changesSTANDARD mode, supports INSERT / UPDATE / DELETE
CREATE DYNAMIC TABLECreate DWD / DWS / ADS three-layer incremental computation tablesDeclarative SQL, automatic dependency chain refresh
REFRESH DYNAMIC TABLEManually trigger a single refreshUsed during initial build or debugging
CREATE ROLE / GRANT / REVOKECreate roles and manage authorizationMulti-department data isolation
ALTER TABLE ... CHANGE COLUMN ... SET MASKBind Column Masking policyDynamic masking for PII fields

Prerequisites

All examples in this document run under the best_practice_smart_city schema.

CREATE SCHEMA IF NOT EXISTS best_practice_smart_city;


ODS Layer: Raw Data Tables for Each Department

The ODS layer stores data by source in isolation. This document demonstrates three data types: NYC 311 complaint tickets (citizen service hotline), traffic accidents (DOT/NYPD), and health inspections (Dept of Health), all corresponding to the actual data structure from NYC Open Data.

Create Tables

CREATE TABLE IF NOT EXISTS best_practice_smart_city.doc_ods_311_complaints ( complaint_id STRING, created_date TIMESTAMP, closed_date TIMESTAMP, agency STRING, agency_name STRING, complaint_type STRING, descriptor STRING, location_type STRING, incident_zip STRING, incident_address STRING, city STRING, borough STRING, latitude DOUBLE, longitude DOUBLE, status STRING, resolution_desc STRING, community_board STRING, bbl STRING, open_data_channel STRING, load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

CREATE TABLE IF NOT EXISTS best_practice_smart_city.doc_ods_traffic_accidents ( accident_id STRING, crash_date DATE, crash_time STRING, borough STRING, zip_code STRING, latitude DOUBLE, longitude DOUBLE, on_street_name STRING, cross_street_name STRING, persons_injured INT, persons_killed INT, pedestrians_injured INT, pedestrians_killed INT, cyclists_injured INT, cyclists_killed INT, motorists_injured INT, motorists_killed INT, vehicle_type_1 STRING, vehicle_type_2 STRING, contributing_factor_1 STRING, contributing_factor_2 STRING, load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

CREATE TABLE IF NOT EXISTS best_practice_smart_city.doc_ods_health_inspections ( inspection_id STRING, facility_name STRING, facility_type STRING, borough STRING, zip_code STRING, inspection_date DATE, inspection_type STRING, violation_code STRING, violation_desc STRING, grade STRING, grade_date DATE, score INT, latitude DOUBLE, longitude DOUBLE, load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

CREATE TABLE IF NOT EXISTS best_practice_smart_city.doc_dim_district ( district_code STRING, district_name STRING, borough STRING, city STRING, population INT, area_sqkm DOUBLE, district_type STRING );

Batch Import CSV Files via COPY INTO + Volume

Departments periodically submit data as CSV files stored in their dedicated object storage directories. First create a Volume pointing to the department submission directory, then use COPY INTO to batch import.

Step 1: Create a Storage Connection and Volume (Alibaba Cloud OSS example):

-- Create Storage Connection CREATE STORAGE CONNECTION IF NOT EXISTS conn_city_data TYPE = OSS ACCESS_ID = '<your-access-id>' ACCESS_KEY = '<your-access-key>' ENDPOINT = 'oss-cn-hangzhou.aliyuncs.com'; -- Create Volume for 311 complaint data CREATE EXTERNAL VOLUME IF NOT EXISTS best_practice_smart_city.vol_311_complaints TYPE = OSS BUCKET = '<your-bucket>' PATH = 'smart-city/311_complaints/' CONNECTION = conn_city_data;

Step 2: Execute COPY INTO to load:

COPY INTO best_practice_smart_city.doc_ods_311_complaints FROM VOLUME best_practice_smart_city.vol_311_complaints USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') ON_ERROR = CONTINUE;

COPY INTO automatically records the metadata of loaded files; re-executing the same COPY statement will not re-import the same files. This is very useful for periodic department submission scenarios: just trigger COPY INTO periodically, and the system automatically processes only new files.

View loading history:

SELECT * FROM load_history('doc_ods_311_complaints') LIMIT 5;

External Schema: Connect to Existing Hive Data Lake

If a department already has a Hive data lake with historical data, External Schema can mount it directly without migration.

-- First create a Connection pointing to the Hive Metastore CREATE CATALOG CONNECTION IF NOT EXISTS conn_hive_metastore CATALOG_TYPE = HIVE METASTORE_URI = 'thrift://<hive-metastore-host>:9083' HDFS_DEFAULT_FS = 'hdfs://<namenode>:8020'; -- Mount the Hive database as an External Schema CREATE EXTERNAL SCHEMA IF NOT EXISTS hive_civil_affairs CONNECTION = conn_hive_metastore DATABASE = 'civil_affairs_db';

After mounting, Hive tables can be queried directly:

-- Federated query: Hive historical data + Singdata Lakehouse new data SELECT ha.district_code, ha.population_2020, COUNT(c.complaint_id) AS recent_complaints FROM hive_civil_affairs.district_population ha LEFT JOIN best_practice_smart_city.doc_ods_311_complaints c ON ha.zip_code = c.incident_zip AND c.created_date >= CAST('2026-01-01' AS TIMESTAMP) GROUP BY ha.district_code, ha.population_2020;


Table Stream: Capturing Ticket Status Changes

311 complaint tickets undergo multiple status changes during processing (Open โ†’ In Progress โ†’ Closed), and each change needs to be recorded in an audit log for resolution timeliness analysis.

Create Table Stream

Create a STANDARD mode Stream on the complaint ticket table to capture all INSERT / UPDATE / DELETE operations:

CREATE TABLE STREAM IF NOT EXISTS best_practice_smart_city.doc_stream_complaint_changes ON TABLE best_practice_smart_city.doc_ods_311_complaints WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');

Verify stream creation:

SHOW STREAMS IN best_practice_smart_city;

create_time | schema_name | name | table_name | mode | comment -----------------------------+-----------------------------+--------------------------------+-------------------------------------------------------------------------+----------+-------- 2026-06-06T14:40:10.831 | best_practice_smart_city | doc_stream_complaint_changes | quick_start.best_practice_smart_city.doc_ods_311_complaints | STANDARD |

Simulate Ticket Status Changes

-- Ticket CMP002: Sewer Backup complaint resolved after second inspection UPDATE best_practice_smart_city.doc_ods_311_complaints SET status = 'Closed', closed_date = CAST('2026-01-10 16:00:00' AS TIMESTAMP), resolution_desc = 'Issue resolved after second inspection.' WHERE complaint_id = 'CMP002'; -- Ticket CMP015: Elevator repair completed UPDATE best_practice_smart_city.doc_ods_311_complaints SET status = 'Closed', closed_date = CAST('2026-01-11 10:00:00' AS TIMESTAMP), resolution_desc = 'Elevator maintenance completed.' WHERE complaint_id = 'CMP015';

Consume the Stream: Write to Ticket Audit Log

Ticket status change audit log table:

CREATE TABLE IF NOT EXISTS best_practice_smart_city.doc_dwd_complaint_audit_log ( complaint_id STRING, change_type STRING, change_version BIGINT, change_time TIMESTAMP, new_status STRING, closed_date TIMESTAMP, resolution_desc STRING, agency STRING, borough STRING, complaint_type STRING );

View change records in the stream (you can SELECT to preview before consuming):

SELECT __change_type, complaint_id, status, closed_date, __commit_timestamp FROM best_practice_smart_city.doc_stream_complaint_changes LIMIT 10;

__change_type | complaint_id | status | closed_date | __commit_timestamp --------------+--------------+--------+---------------------+-------------------- UPDATE_AFTER | CMP002 | Closed | 2026-01-10T16:00:00 | 2026-06-06T14:42:56.449 UPDATE_AFTER | CMP015 | Closed | 2026-01-11T10:00:00 | 2026-06-06T14:43:02.979 UPDATE_BEFORE | CMP002 | Open | null | 2026-06-06T14:37:45.066 UPDATE_BEFORE | CMP015 | In Progress | null | 2026-06-06T14:37:45.066

In STANDARD mode, each UPDATE produces one UPDATE_BEFORE (before change) and one UPDATE_AFTER (after change) record. Consume changes and advance the offset using INSERT INTO ... SELECT FROM stream:

-- Keep only UPDATE_AFTER, write to audit log INSERT INTO best_practice_smart_city.doc_dwd_complaint_audit_log SELECT complaint_id, __change_type AS change_type, __commit_version AS change_version, __commit_timestamp AS change_time, status AS new_status, closed_date, resolution_desc, agency, borough, complaint_type FROM best_practice_smart_city.doc_stream_complaint_changes WHERE __change_type = 'UPDATE_AFTER';

Verify after consumption that the offset has advanced โ€” stream is now empty:

SELECT COUNT(*) AS remaining_changes FROM best_practice_smart_city.doc_stream_complaint_changes;

remaining_changes ----------------- 0

View the audit log:

SELECT complaint_id, change_type, change_version, change_time, new_status, borough, complaint_type FROM best_practice_smart_city.doc_dwd_complaint_audit_log ORDER BY change_time;

complaint_id | change_type | change_version | change_time | new_status | borough | complaint_type -------------+--------------+----------------+-------------------------+------------+----------+--------------- CMP002 | UPDATE_AFTER | 3 | 2026-06-06T14:42:56.449 | Closed | BROOKLYN | Sewer CMP015 | UPDATE_AFTER | 4 | 2026-06-06T14:43:02.979 | Closed | QUEENS | ELEVATOR

change_version increases monotonically, ensuring temporal completeness of the audit log. Creating this INSERT INTO ... SELECT FROM stream statement as a periodic task in Lakehouse Studio (recommended to schedule hourly) enables continuous tracking of ticket status changes.


DWD Layer: Cross-Department Event Standardization

The DWD layer unifies raw events from different departments into city-themed events, resolving inconsistent field naming and varying status code meanings across source tables.

Create Table

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_smart_city.doc_dwd_city_events AS SELECT complaint_id AS event_id, 'COMPLAINT' AS event_category, complaint_type AS event_type, agency AS dept_code, agency_name AS dept_name, borough, incident_zip AS zip_code, latitude, longitude, created_date AS event_time, CASE WHEN status = 'Closed' THEN 'RESOLVED' WHEN status = 'Open' THEN 'OPEN' ELSE 'IN_PROGRESS' END AS event_status, DATEDIFF(CAST(closed_date AS DATE), CAST(created_date AS DATE)) AS resolution_days, CONCAT(borough, '-', incident_zip) AS geo_key FROM best_practice_smart_city.doc_ods_311_complaints WHERE created_date IS NOT NULL UNION ALL SELECT accident_id AS event_id, 'TRAFFIC_ACCIDENT' AS event_category, contributing_factor_1 AS event_type, 'DOT' AS dept_code, 'Department of Transportation' AS dept_name, borough, zip_code, latitude, longitude, CAST(crash_date AS TIMESTAMP) AS event_time, 'RESOLVED' AS event_status, 0 AS resolution_days, CONCAT(borough, '-', zip_code) AS geo_key FROM best_practice_smart_city.doc_ods_traffic_accidents WHERE crash_date IS NOT NULL;

Manually trigger the first refresh and verify:

REFRESH DYNAMIC TABLE best_practice_smart_city.doc_dwd_city_events; SELECT event_category, COUNT(*) AS cnt FROM best_practice_smart_city.doc_dwd_city_events GROUP BY event_category ORDER BY cnt DESC;

event_category | cnt -----------------+---- COMPLAINT | 20 TRAFFIC_ACCIDENT | 15

The DWD layer merges 311 complaints (20 records) and traffic accidents (15 records) into standard fields like event_id, event_category, and geo_key. Downstream DWS/ADS layers only need to query this single Dynamic Table without worrying about the raw field differences across departments.


DWS Layer: Street/Borough Daily Summary

The DWS layer aggregates at the granularity of borough ร— date ร— event category, for borough-level managers to view daily event volumes and resolution efficiency.

Create Table

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_smart_city.doc_dws_borough_daily AS SELECT borough, DATE_TRUNC('day', event_time) AS stat_date, event_category, COUNT(*) AS event_count, SUM(CASE WHEN event_status = 'RESOLVED' THEN 1 ELSE 0 END) AS resolved_count, SUM(CASE WHEN event_status = 'OPEN' THEN 1 ELSE 0 END) AS open_count, ROUND(AVG(CASE WHEN resolution_days IS NOT NULL AND resolution_days >= 0 THEN resolution_days END), 2) AS avg_resolution_days FROM best_practice_smart_city.doc_dwd_city_events WHERE event_time IS NOT NULL GROUP BY borough, DATE_TRUNC('day', event_time), event_category;

REFRESH DYNAMIC TABLE best_practice_smart_city.doc_dws_borough_daily; SELECT borough, stat_date, event_category, event_count, resolved_count, avg_resolution_days FROM best_practice_smart_city.doc_dws_borough_daily ORDER BY stat_date, borough, event_category LIMIT 10;

borough | stat_date | event_category | event_count | resolved_count | avg_resolution_days --------------+---------------------+-----------------+-------------+----------------+-------------------- BROOKLYN | 2026-01-03T00:00:00 | COMPLAINT | 1 | 0 | null BROOKLYN | 2026-01-03T00:00:00 | TRAFFIC_ACCIDENT| 1 | 1 | 0 MANHATTAN | 2026-01-03T00:00:00 | COMPLAINT | 2 | 2 | 1.5 MANHATTAN | 2026-01-03T00:00:00 | TRAFFIC_ACCIDENT| 1 | 1 | 0 BRONX | 2026-01-04T00:00:00 | COMPLAINT | 1 | 1 | 2 BRONX | 2026-01-04T00:00:00 | TRAFFIC_ACCIDENT| 1 | 1 | 0 QUEENS | 2026-01-04T00:00:00 | TRAFFIC_ACCIDENT| 1 | 1 | 0 STATEN ISLAND | 2026-01-04T00:00:00 | COMPLAINT | 1 | 0 | null BROOKLYN | 2026-01-05T00:00:00 | COMPLAINT | 1 | 1 | 1 MANHATTAN | 2026-01-05T00:00:00 | TRAFFIC_ACCIDENT| 1 | 1 | 0

A null avg_resolution_days means the complaint for that day has not yet been closed (closed_date is empty). The CASE WHEN resolution_days >= 0 filters out negative numbers (data anomalies) and null values, preventing the average from being incorrectly pulled down.

Borough complaint resolution summary (aggregated across days):

SELECT borough, event_category, SUM(event_count) AS total_events, ROUND(100.0 * SUM(resolved_count) / NULLIF(SUM(event_count), 0), 1) AS resolution_rate_pct, ROUND(AVG(avg_resolution_days), 2) AS avg_days_to_resolve FROM best_practice_smart_city.doc_dws_borough_daily GROUP BY borough, event_category ORDER BY total_events DESC;

borough | event_category | total_events | resolution_rate_pct | avg_days_to_resolve --------------+-----------------+--------------+---------------------+-------------------- MANHATTAN | COMPLAINT | 6 | 66.7 | 1.5 BROOKLYN | COMPLAINT | 4 | 50.0 | 1.5 BRONX | COMPLAINT | 4 | 100.0 | 2 QUEENS | COMPLAINT | 4 | 75.0 | 2 MANHATTAN | TRAFFIC_ACCIDENT| 4 | 100.0 | 0 BROOKLYN | TRAFFIC_ACCIDENT| 3 | 100.0 | 0 BRONX | TRAFFIC_ACCIDENT| 3 | 100.0 | 0 QUEENS | TRAFFIC_ACCIDENT| 3 | 100.0 | 0 STATEN ISLAND | COMPLAINT | 2 | 50.0 | 1 STATEN ISLAND | TRAFFIC_ACCIDENT| 2 | 100.0 | 0

Result interpretation: BRONX has the highest complaint closure rate (100%), but its average resolution time of 2 days is higher than MANHATTAN's 1.5 days โ€” meaning BRONX resolves all complaints but is slower to respond. BROOKLYN and STATEN ISLAND have only 50% closure rates with backlogged tickets requiring focused attention. Traffic accident events have 100% closure rates across all boroughs since they are recorded as completed at the time of entry.


ADS Layer: City Operations Index and KPI Dashboard

The ADS layer aggregates across multiple event types to compute a comprehensive city operations score, directly consumed by BI tools and management dashboards.

Create Table

City operations scoring formula: 100 - (backlog ticket percentage ร— penalty coefficient + average resolution time ร— timeliness coefficient), range 0โ€“100, higher means better operational status.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_smart_city.doc_ads_city_ops_index AS SELECT borough, DATE_TRUNC('day', event_time) AS stat_date, COUNT(*) AS total_events, SUM(CASE WHEN event_category = 'COMPLAINT' THEN 1 ELSE 0 END) AS complaint_count, SUM(CASE WHEN event_category = 'TRAFFIC_ACCIDENT' THEN 1 ELSE 0 END) AS accident_count, SUM(CASE WHEN event_status = 'RESOLVED' THEN 1 ELSE 0 END) AS resolved_count, ROUND( 100.0 * SUM(CASE WHEN event_status = 'RESOLVED' THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 1 ) AS resolution_rate_pct, ROUND( AVG(CASE WHEN resolution_days IS NOT NULL AND resolution_days >= 0 THEN resolution_days END), 2 ) AS avg_resolution_days, -- City operations score: penalties for both backlogged events and resolution time ROUND( 100.0 - LEAST(100.0, SUM(CASE WHEN event_status = 'OPEN' THEN 1 ELSE 0 END) * 10.0 + COALESCE(AVG(CASE WHEN resolution_days IS NOT NULL AND resolution_days >= 0 THEN resolution_days END), 0) * 5.0 ), 1 ) AS city_ops_score FROM best_practice_smart_city.doc_dwd_city_events WHERE event_time IS NOT NULL GROUP BY borough, DATE_TRUNC('day', event_time);

REFRESH DYNAMIC TABLE best_practice_smart_city.doc_ads_city_ops_index; SELECT borough, ROUND(AVG(city_ops_score), 1) AS avg_ops_score, SUM(total_events) AS total_events, SUM(complaint_count) AS complaints, SUM(accident_count) AS accidents, ROUND(AVG(resolution_rate_pct), 1) AS resolution_rate_pct FROM best_practice_smart_city.doc_ads_city_ops_index GROUP BY borough ORDER BY avg_ops_score DESC;

borough | avg_ops_score | total_events | complaints | accidents | resolution_rate_pct --------------+---------------+--------------+------------+-----------+-------------------- QUEENS | 95.0 | 7 | 4 | 3 | 90.0 BRONX | 93.8 | 7 | 4 | 3 | 100.0 BROOKLYN | 79.2 | 7 | 4 | 3 | 75.0 MANHATTAN | 78.8 | 10 | 6 | 4 | 75.0 STATEN ISLAND | 73.8 | 4 | 2 | 2 | 75.0

Result interpretation:

  • QUEENS (95 points) and BRONX (93.8 points) have the best city operations status; BRONX's 100% complaint closure rate is the main driver of its high score.
  • MANHATTAN (78.8 points) has the highest event volume (10 events), but has 2 backlogged complaints (unclosed), pulling down its operations score. MANHATTAN's absolute event volume is the highest, placing the greatest pressure on overall city operations management.
  • STATEN ISLAND (73.8 points) has the lowest event volume (4 events), but a 50% backlog rate still requires focused follow-up.

Complaint ticket department resolution efficiency comparison (cross-department analysis under different dimensions):

SELECT agency_name, COUNT(*) AS total_complaints, SUM(CASE WHEN status = 'Closed' THEN 1 ELSE 0 END) AS resolved, SUM(CASE WHEN status IN ('Open', 'In Progress') THEN 1 ELSE 0 END) AS pending, ROUND(100.0 * SUM(CASE WHEN status = 'Closed' THEN 1 ELSE 0 END) / COUNT(*), 1) AS resolution_rate_pct FROM best_practice_smart_city.doc_ods_311_complaints GROUP BY agency_name ORDER BY total_complaints DESC;

agency_name | total_complaints | resolved | pending | resolution_rate_pct --------------------------------------------------+-----------------+----------+---------+-------------------- New York City Police Department | 4 | 3 | 1 | 75.0 Department of Transportation | 4 | 3 | 1 | 75.0 Department of Sanitation New York | 4 | 4 | 0 | 100.0 Department of Housing Preservation and Development| 4 | 4 | 0 | 100.0 Department of Environmental Protection | 4 | 2 | 2 | 50.0

The Department of Environmental Protection (DEP) has only a 50% closure rate, with 2 of its 4 complaints backlogged (water quality and noise complaints tend to have longer processing cycles) โ€” it is currently the department most in need of attention.


Studio Periodic Refresh Task Configuration

Dynamic Table scheduled refresh is managed by creating tasks in Lakehouse Studio, allowing monitoring alerts and data quality rules to be configured on the same task.

Create refresh tasks for the three Dynamic Tables by dependency layer:

  1. In Studio Development โ†’ Tasks, navigate to the path best_practices/smart_city/
  2. Create a "Refresh Dynamic Table" task for each Dynamic Table:
Task NameTarget TableScheduleDependency
refresh_dwd_city_eventsdoc_dwd_city_eventsDaily at 02:00None (reads ODS directly)
refresh_dws_borough_dailydoc_dws_borough_dailyDaily at 03:00After refresh_dwd_city_events completes
refresh_ads_city_ops_indexdoc_ads_city_ops_indexDaily at 04:00After refresh_dws_borough_daily completes
  1. Attach data quality rules to the refresh_ads_city_ops_index task: result row count > 0, avg_ops_score between 0 and 100

On the same task page, you can also configure scheduling for the Table Stream consumption task (INSERT INTO doc_dwd_complaint_audit_log SELECT FROM stream), recommended to run hourly to keep the audit log current.


RBAC: Multi-Department Data Isolation

Government data platforms typically require strict data isolation: department analysts can only view their own department's ODS data, city operations management can view DWS/ADS aggregated data, and platform administrators have full-layer access.

Create Roles

-- Read-only access to ADS layer public indexes (for all departments to view overall city operations) CREATE ROLE IF NOT EXISTS smart_city_viewer; -- Dedicated analyst for a specific department (NYPD in this example) CREATE ROLE IF NOT EXISTS dept_nypd_analyst; -- City operations platform administrator CREATE ROLE IF NOT EXISTS city_ops_admin;

Grant Permissions by Layer

-- smart_city_viewer: read-only ADS and DWS aggregation layers (all tables) GRANT SELECT ON ALL TABLES IN SCHEMA best_practice_smart_city TO ROLE smart_city_viewer; -- city_ops_admin: full-layer access GRANT SELECT ON ALL TABLES IN SCHEMA best_practice_smart_city TO ROLE city_ops_admin;

View role permissions:

SHOW GRANTS TO ROLE smart_city_viewer;

granted_type | privilege | granted_on | object_name | grantee_name -------------------+--------------+------------+---------------------------------------+------------------ OBJECT_HIERARCHY | SELECT TABLE | TABLE | quick_start.best_practice_smart_city.* | smart_city_viewer PRIVILEGE | READ METADATA| SCHEMA | quick_start.best_practice_smart_city | smart_city_viewer

OBJECT_HIERARCHY type authorization automatically covers all existing and future tables created under the schema, eliminating the need to re-authorize after each new table is created.

Revoke Permissions

-- Execute when employees leave or permissions change REVOKE SELECT ON ALL TABLES IN SCHEMA best_practice_smart_city FROM ROLE dept_nypd_analyst;


Column Masking: Dynamic PII Field Masking

311 complaint tickets and health inspection data contain geographic coordinates (latitude and longitude); in some scenarios they also include complainant names, contact information, and other PII (personally identifiable information) fields. Column Masking can bind masking functions at the column level to automatically return masked values to non-privileged users.

Step 1: Create the masking function (using coordinate precision reduction as an example):

CREATE OR REPLACE FUNCTION best_practice_smart_city.mask_geo_coord(coord DOUBLE) RETURNS DOUBLE AS CASE WHEN CURRENT_USER() IN ('privileged_user') THEN coord -- Replace with actual authorized username ELSE ROUND(coord, 1) -- Non-privileged users get reduced precision to 1 decimal place (approximately 11 km range) END;

Step 2: Bind the masking function to sensitive columns:

ALTER TABLE best_practice_smart_city.doc_ods_311_complaints CHANGE COLUMN latitude SET MASK best_practice_smart_city.mask_geo_coord; ALTER TABLE best_practice_smart_city.doc_ods_311_complaints CHANGE COLUMN longitude SET MASK best_practice_smart_city.mask_geo_coord;

After binding, regular analysts can only see coordinates at reduced precision (e.g., 40.7 instead of 40.7484), while the admin account can see the original precision. The masking effect also applies transparently to Dynamic Tables (DWD/DWS/ADS layers) โ€” downstream tables store already-masked values.


Data Warehouse Object Overview

After build completion, the objects under the best_practice_smart_city schema:

SHOW TABLES IN best_practice_smart_city;

schema_name | table_name | is_dynamic --------------------------+----------------------------------+----------- best_practice_smart_city | doc_ads_city_ops_index | true best_practice_smart_city | doc_dim_district | false best_practice_smart_city | doc_dwd_city_events | true best_practice_smart_city | doc_dwd_complaint_audit_log | false best_practice_smart_city | doc_dws_borough_daily | true best_practice_smart_city | doc_ods_311_complaints | false best_practice_smart_city | doc_ods_health_inspections | false best_practice_smart_city | doc_ods_traffic_accidents | false

Data warehouse layer structure:

Department CSV files (OSS / COS / S3) โ”‚ โ–ผ COPY INTO + Volume (batch import) doc_ods_311_complaints doc_ods_traffic_accidents doc_ods_health_inspections (citizen hotline complaints) (traffic accidents) (health inspections) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ Table Stream โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ doc_stream_complaint_changes โ”‚ โ”‚ (STANDARD mode: UPDATE_BEFORE / UPDATE_AFTER) โ”‚ โ”‚ โ†’ INSERT INTO doc_dwd_complaint_audit_log (Studio hourly) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ REFRESH daily at 02:00 (Studio Task) doc_dwd_city_events (Dynamic Table) UNION ALL standardization ยท event_category ยท geo_key โ”‚ โ–ผ REFRESH daily at 03:00 (Studio Task) doc_dws_borough_daily (Dynamic Table) borough ร— day ร— category ยท avg_resolution_days โ”‚ โ–ผ REFRESH daily at 04:00 (Studio Task) doc_ads_city_ops_index (Dynamic Table) city_ops_score ยท resolution_rate_pct External Schema (Hive data lake) โ””โ”€โ”€ Direct query without migration; can federated JOIN with ODS tables

Three roles (smart_city_viewer, dept_nypd_analyst, city_ops_admin) are also created for multi-department data isolation access control.


Notes

  • COPY INTO idempotency: COPY INTO automatically records metadata of already-loaded files; the same file will not be re-imported. However, if file content changes while the filename remains unchanged (overwrite write), the system will not reload it. Use this in conjunction with file naming conventions in the Volume directory (e.g., date-based subdirectories).

  • Dynamic Tables should not write REFRESH INTERVAL: All Dynamic Table DDLs in this document do not contain REFRESH INTERVAL parameters; scheduling is entirely managed by Studio Tasks. This allows centralized configuration of monitoring alerts, data quality checks, and inter-task dependency orchestration (e.g., DWD refresh triggers DWS refresh) in one place.

  • Table Stream offset advancement mechanism: Stream offsets advance only after DML statements (INSERT INTO ... SELECT FROM stream) execute. Pure SELECT queries do not consume the offset and will not lose changes. Once advanced, offsets cannot be rolled back; ensure the idempotency of DML statements.

  • External Schema is read-only: After mounting the Hive data lake, only read operations are supported; write operations still need to be performed in Singdata Lakehouse's own schema. External Schema does not support Dynamic Tables as target tables; it can only be used as an upstream data source.

  • RBAC row-level permissions have no DDL: Row-level filtering rules for department analysts to see only their own department's data are configured through the Studio Data Security โ†’ Row-Level Permissions UI, with no corresponding SQL DDL. Do not write statements like CREATE ROW ACCESS POLICY in code (this syntax does not exist).

  • Column Masking: Masking effects apply transparently to all downstream queries (including Dynamic Table SELECTs); DWD/DWS/ADS layers also store masked values. If high-precision coordinates are needed for spatial analysis, query the ODS raw tables directly using a privileged admin account.