A Comprehensive Guide to Importing Data into Singdata Lakehouse

Data Ingestion: Loading Data via Zettapark using SAVE_AS_TABLE

Overview

Use Cases

The SAVE_AS_TABLE method creates the target table automatically. This makes loading data via Zettapark simpler than using SQL INSERT, which requires you to create the table manually first. SAVE_AS_TABLE also optimizes the underlying INSERT INTO by inserting multiple records at once instead of one at a time.
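To make the difference between row-by-row and batched inserts concrete, here is a minimal, purely illustrative sketch that builds both kinds of statements as strings. It does not show how Zettapark generates SQL internally; the helper names, the table name demo_lift_tickets, and the column names are all hypothetical.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (illustrative only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape single quotes by doubling them
    return "'" + str(value).replace("'", "''") + "'"


def row_by_row_inserts(table, columns, rows):
    """One INSERT statement per record: N records means N statements."""
    cols = ", ".join(columns)
    statements = []
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        statements.append(f"INSERT INTO {table} ({cols}) VALUES ({values})")
    return statements


def batched_insert(table, columns, rows):
    """A single multi-row INSERT statement covering all records."""
    cols = ", ".join(columns)
    tuples = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({cols}) VALUES {tuples}"


if __name__ == "__main__":
    # Hypothetical table, columns, and rows, for illustration only
    columns = ["resort", "days", "price"]
    rows = [("Demo Resort", 2, 49.5), ("Other Resort", 1, None)]
    print(len(row_by_row_inserts("demo_lift_tickets", columns, rows)), "statements")
    print(batched_insert("demo_lift_tickets", columns, rows))
```

With many records, the batched form sends one statement instead of one per record, which is the kind of saving the paragraph above describes.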

Implementation Steps

Open VS Code, create a file named py_zettapark_save_as_table.py, and copy the following code into it.

import gzip
import json

from clickzetta.zettapark.session import Session

# Read parameters from the configuration file
with open('config-ingest.json', 'r') as config_file:
    config = json.load(config_file)

print("Connecting to Singdata Lakehouse.....\n")

# Create session
session = Session.builder.configs(config).create()

print("Connection successful!...\n")

target_table_name = "lift_tuckets_import_by_py_save_as_table"

def save_as_table_to_clickzetta(session, schema, data):
    print('Saving data to Clickzetta Lakehouse')

    # Convert data to dataframe
    df = session.create_dataframe(data, schema=schema)
    
    # Save dataframe as table
    df.write.save_as_table(target_table_name, mode="overwrite", table_type="transient")
    print(f"Data saved to table {target_table_name}")

if __name__ == "__main__":
    schema = None
    data = []

    # Open the compressed JSON file and read the content
    with gzip.open('lift_tickets_data.json.gz', 'rt', encoding='utf-8') as file:
        for message in file:
            if message.strip():  # Ensure it's not an empty line
                record = json.loads(message)
                if 'schema' in record:
                    schema = record['schema']
                else:
                    data.append(record)
    
    try:
        save_as_table_to_clickzetta(session, schema, data)
    finally:
        # Close the session even if the save fails
        session.close()
    print("Ingest complete")
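The script expects a gzip-compressed file with one JSON object per line, where a line containing a "schema" key is treated as the schema and every other line as a data record. The following sketch writes a small sample file in that shape and reads it back with the same loop, which can help when testing without the real lift_tickets_data.json.gz. The helper names, the sample file name, and the field names (txid, resort, days) are hypothetical.

```python
import gzip
import json
import os
import tempfile


def write_sample_file(path, records, schema=None):
    """Write one JSON object per line to a gzip file, with an optional schema line first."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        if schema is not None:
            f.write(json.dumps({"schema": schema}) + "\n")
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path):
    """Mirror the script's parsing loop and return (schema, data)."""
    schema, data = None, []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():  # Skip empty lines
                continue
            record = json.loads(line)
            if "schema" in record:
                schema = record["schema"]
            else:
                data.append(record)
    return schema, data


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only
    sample = [
        {"txid": "demo-1", "resort": "Demo Resort", "days": 2},
        {"txid": "demo-2", "resort": "Demo Resort", "days": 1},
    ]
    path = os.path.join(tempfile.mkdtemp(), "sample_lift_tickets.json.gz")
    write_sample_file(path, sample)
    schema, data = read_records(path)
    print(f"schema={schema}, records={len(data)}")
```

If the file has no schema line, schema stays None, and the script passes that value through to create_dataframe unchanged.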

In VS Code, open a new terminal and run the following command to activate the Python environment created in the "Environment Setup" step. Skip this step if the cz-ingest-examples environment is already active.

conda activate cz-ingest-examples

Then run the following command in the same terminal:

python py_zettapark_save_as_table.py
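After the script finishes, you may want to confirm that the table holds the expected number of rows. The sketch below is a small helper for that check. It assumes that a Zettapark session exposes table() and that the returned DataFrame exposes count(), as in similar Snowpark-style DataFrame APIs; confirm this against the Zettapark documentation. The helper name count_rows is hypothetical.

```python
def count_rows(session, table_name):
    """Return the row count of table_name.

    Assumption: session.table(name) returns a DataFrame with a count() method.
    """
    return session.table(table_name).count()


# Example use inside py_zettapark_save_as_table.py, before session.close():
#     loaded = count_rows(session, target_table_name)
#     print(f"{loaded} rows in {target_table_name}, {len(data)} records read")
```

Comparing the returned count with len(data) gives a quick sanity check, since the script writes with mode="overwrite" and the table should contain only the records from this run.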

Recommended Next Steps

Resources

Zettapark Quick Start