Building an Analytics-Oriented Modern Data Stack with Singdata Lakehouse
This document describes how to build an analytics-oriented Modern Data Stack based on Singdata Lakehouse, Metabase, and MindsDB.
Solution Architecture
Features of the Modern Data Stack solution based on Singdata Lakehouse:
- Evolve from an AWS data warehouse to a data lake, unifying lake and warehouse through Singdata Lakehouse and significantly reducing data storage, compute, and operations costs.
- Unlimited Storage and Efficient Migration: The end-to-end data pipeline uses cloud object storage to separate compute from storage, avoiding the bandwidth and storage capacity bottlenecks of server nodes in traditional solutions.
- Singdata Lakehouse + Metabase: Achieve ultra-simple BI data analysis. Visual data exploration and analysis takes just two mouse clicks, greatly lowering the barrier for business users to analyze data.
- Singdata Lakehouse + MindsDB: Achieve 100% SQL-based AI and LLM-enhanced analysis. Data engineers and BI analysts can implement AI and LLM-enhanced analysis in SQL, without needing to master other, more complex languages.
- Lower the requirements for technical personnel across the full data stack (cloud infrastructure, data lake, data warehouse, BI, AI), reducing enterprise hiring thresholds and improving talent availability.
This solution emphasizes simplicity and ease of use, aiming to help enterprises shift their focus from data infrastructure to data analysis, achieving modernization of data analysis.

The above diagram illustrates the architecture for migrating to and building a Modern Data Stack based on Singdata Lakehouse, summarized as follows:
- Use the Redshift UNLOAD command to unload data to Parquet files in an S3 bucket.
- Through Singdata Lakehouse's SELECT * FROM VOLUME statement, directly load data from Parquet files in the AWS S3 bucket into Singdata Lakehouse tables, achieving rapid data ingestion (in this example, loading a table with over 20 million rows into the Lakehouse took only 30 seconds).
- BI Application: Explore and analyze data through Metabase (from table to dashboard in just two mouse clicks -- yes, just two).
- AI Application: Predict house prices through MindsDB (100% implemented using SQL for model prediction).
Solution Components
- AWS:
  - Redshift
  - S3
- Singdata:
  - Singdata Lakehouse, a multi-cloud and unified data platform. Adopts a SaaS fully managed service model, providing enterprises with an ultra-simple data architecture.
  - Singdata Lakehouse Driver for Metabase
  - Singdata Lakehouse Connector for MindsDB
- Data Analysis:
  - Metabase with Lakehouse Driver on Docker: Metabase is a comprehensive BI platform, but its design philosophy is very different from Superset. Metabase places great emphasis on the user experience for business personnel (such as product managers, marketing operations staff), allowing them to freely explore data and answer their own questions.
  - MindsDB with Lakehouse Connector on Docker: MindsDB can model directly in Singdata Lakehouse, eliminating professional steps such as data processing and model building. Data analysts and BI analysts can use it out of the box without needing to be familiar with data engineering and modeling knowledge, lowering the modeling barrier so that everyone can be a data analyst and everyone can apply algorithms.
  - Zeppelin with Lakehouse JDBC Interpreter on Docker
  - Zeppelin with MySQL JDBC Interpreter on Docker (connecting to MindsDB's MySQL interface)
Why Choose Singdata Lakehouse?
- Fully Managed: Singdata Lakehouse provides a fully managed, cloud-based Lakehouse service that is easy to use and scale. This means you don't have to worry about managing and maintaining your own hardware and infrastructure, avoiding time-consuming and costly investments, achieving peace of mind.
- Cost Savings: Compared to Redshift, Singdata Lakehouse's total cost of ownership (TCO) is typically lower because it charges based on usage without requiring upfront commitments. Singdata Lakehouse's highly flexible pricing model means you only pay for the resources you actually use, without being locked into a fixed cost model.
- Scalability: Singdata Lakehouse is designed to handle large amounts of data and can scale up or down as needed, making it a great choice for enterprises with fluctuating compute loads. Singdata Lakehouse stores data on cloud object storage services, achieving "unlimited scaling" in data scale.
- Performance: Singdata Lakehouse adopts a Single Engine All Data architecture, achieving compute-storage separation, enabling it to process queries faster than Redshift.
- Ease of Use: Singdata Lakehouse provides a unified data integration, development, operations, and governance platform, making development and management much easier without complex solution integration.
- Data Source Support: Singdata Lakehouse supports a variety of data sources and formats, including structured and semi-structured data. In most cases, BI and AI applications can be developed using only SQL.
- Data Integration: Singdata Lakehouse's built-in data integration features support a wide range of data sources, making data loading and preparation easier for analysis.

Overall, migrating to Singdata Lakehouse can help you save time and money, and enable you to process and analyze data more easily and effectively.
Implementation Steps
Data Extraction (E)
Unload House Price Sales Data from Redshift to S3
Redshift UNLOAD command: Unloads query results to one or more text, JSON, or Apache Parquet files on Amazon S3, encrypting them with Amazon S3 server-side encryption (SSE-S3).
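The unload step can be sketched as a small Python helper that renders the UNLOAD statement to run in Redshift. This is a minimal sketch rather than the original statement: the table name, S3 prefix, and IAM role ARN below are hypothetical placeholders.

```python
def build_unload_sql(select_sql: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Render a Redshift UNLOAD statement that writes Parquet files to S3."""
    # UNLOAD takes the query as a quoted string, so any single quote inside
    # the query has to be doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical table, bucket, and role names, for illustration only.
unload_sql = build_unload_sql(
    "SELECT * FROM house_prices WHERE county = 'ESSEX'",
    "s3://example-bucket/house_prices/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftUnloadRole",
)
print(unload_sql)
```

No explicit encryption clause is included here because UNLOAD applies SSE-S3 by default.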
Data Lake Data Exploration: Explore Parquet Data on AWS S3 through Singdata Lakehouse
View the total number of rows (requires creating Singdata Lakehouse's STORAGE CONNECTION and EXTERNAL VOLUME in advance):

Preview data
Execute the above query in Singdata Lakehouse, with the following results:

Data Loading: Load (L) Data from S3 into Singdata Lakehouse and Perform Data Transformation (T)
Verify the number of rows loaded:

Explore data in the Lakehouse using SQL:

BI Application: Explore and Analyze Data in Singdata Lakehouse through Metabase
Create a Database Connection to Singdata Lakehouse in Metabase

Explore and Analyze Data through Metabase (Just Two Clicks -- Yes, Just Two!)
Select Database and Table:


Browse and Analyze Data through Metabase

Explore and Analyze Data through Metabase:

AI Application: Predict House Prices through MindsDB (Only SQL)
This section's data flow: Zeppelin -> MindsDB -> Singdata Lakehouse.
- Zeppelin creates a new Interpreter via MySQL JDBC Driver to connect to MindsDB
- MindsDB connects to Singdata Lakehouse via clickzetta handler (based on Python SQLAlchemy)
Build Model Training Data in Singdata Lakehouse
Create Zeppelin Interpreter, Connect to MindsDB via MySQL JDBC

Create a New Notebook in Zeppelin
MindsDB connects to Singdata Lakehouse, using Singdata Lakehouse as a data source
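MindsDB registers an external data source with a CREATE DATABASE ... WITH ENGINE ... PARAMETERS statement. The helper below is a minimal sketch that renders such a statement for the clickzetta handler. The data source name and all parameter values are placeholders, and the parameter names themselves are an assumption that should be checked against the handler's documentation.

```python
import json


def build_create_database_sql(name: str, engine: str, parameters: dict) -> str:
    """Render a MindsDB CREATE DATABASE statement for an external data source."""
    # The connection parameters are passed as a JSON object.
    params = json.dumps(parameters, indent=4)
    return (
        f"CREATE DATABASE {name}\n"
        f"WITH ENGINE = '{engine}',\n"
        f"PARAMETERS = {params};"
    )


# Placeholder values. The parameter names are an assumption, not taken from
# this document; verify them against the clickzetta handler's README.
connection = {
    "service": "<service>",
    "workspace": "<workspace>",
    "instance": "<instance>",
    "vcluster": "<vcluster>",
    "username": "<username>",
    "password": "<password>",
    "schema": "<schema>",
}
create_db_sql = build_create_database_sql("singdata_lakehouse", "clickzetta", connection)
print(create_db_sql)
```

The printed statement is what would be run in a Zeppelin paragraph that uses the MindsDB (MySQL JDBC) interpreter.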

Create Model
Create a prediction model to predict paid_times, i.e., the number of times a house has been sold.
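In MindsDB, a model is trained with CREATE MODEL ... FROM ... PREDICT. The sketch below renders that statement with paid_times as the target. The model name, data source name, and training table are hypothetical, since the original statement is not reproduced here.

```python
def build_create_model_sql(model: str, datasource: str, training_query: str, target: str) -> str:
    """Render a MindsDB CREATE MODEL statement trained on a data source query."""
    return (
        f"CREATE MODEL mindsdb.{model}\n"
        f"FROM {datasource}\n"
        f"  ({training_query})\n"
        f"PREDICT {target};"
    )


# Hypothetical model, data source, and training table names.
create_model_sql = build_create_model_sql(
    model="house_sold_times_model",
    datasource="singdata_lakehouse",
    training_query="SELECT * FROM house_prices_training",
    target="paid_times",
)
print(create_model_sql)
```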

House Price Prediction
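A single prediction in MindsDB is a SELECT against the model, with the input features supplied in the WHERE clause. The following sketch renders such a query. The model name and the feature columns (town, property_type, price) are hypothetical stand-ins for the columns of the actual training data.

```python
def sql_literal(value) -> str:
    """Render a Python value as a SQL literal: numbers bare, text quoted."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any embedded single quote so the text literal stays valid.
    return "'" + str(value).replace("'", "''") + "'"


def build_predict_sql(model: str, target: str, features: dict) -> str:
    """Render a MindsDB single-row prediction query."""
    conditions = "\n  AND ".join(
        f"{column} = {sql_literal(value)}" for column, value in features.items()
    )
    return f"SELECT {target}\nFROM mindsdb.{model}\nWHERE {conditions};"


# Hypothetical model name and feature columns.
predict_sql = build_predict_sql(
    "house_sold_times_model",
    "paid_times",
    {"town": "LONDON", "property_type": "flat", "price": 450000},
)
print(predict_sql)
```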

Prediction result:
Batch House Price Prediction
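Batch prediction in MindsDB is expressed by joining a table from the data source with the model, so that every input row gets a predicted value. The sketch below renders that join. The data source, table, model, and column names are hypothetical.

```python
def build_batch_predict_sql(datasource, table, model, target, columns, limit=None):
    """Render a MindsDB batch prediction query (input table JOIN model)."""
    selected = ", ".join(f"t.{column}" for column in columns)
    sql = (
        f"SELECT {selected}, m.{target} AS predicted_{target}\n"
        f"FROM {datasource}.{table} AS t\n"
        f"JOIN mindsdb.{model} AS m"
    )
    # An optional LIMIT keeps exploratory runs small.
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql + ";"


# Hypothetical names, for illustration only.
batch_sql = build_batch_predict_sql(
    datasource="singdata_lakehouse",
    table="house_prices_to_score",
    model="house_sold_times_model",
    target="paid_times",
    columns=["town", "property_type", "price"],
    limit=100,
)
print(batch_sql)
```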

Appendix
Metabase, MindsDB, Zeppelin Environment Installation and Deployment Guide
- Metabase with Lakehouse Driver on Docker
- MindsDB with Lakehouse Connector on Docker
- Zeppelin with Lakehouse JDBC Driver
Preview Parquet File Schema and Data via Python Code, and Generate SQL Code for Singdata Lakehouse
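The idea of turning a Parquet schema into Lakehouse SQL can be sketched in plain Python as follows. This is not the original script. The type mapping is an assumption (Spark-style type names, with STRING as the fallback) and should be checked against the Singdata Lakehouse data type reference. The table and column names are hypothetical.

```python
# Assumed mapping from Arrow/Parquet type names to SQL column types.
ARROW_TO_SQL = {
    "bool": "BOOLEAN",
    "int8": "TINYINT",
    "int16": "SMALLINT",
    "int32": "INT",
    "int64": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "STRING",
    "date32": "DATE",
    "timestamp": "TIMESTAMP",
}


def to_sql_type(arrow_type: str) -> str:
    """Map an Arrow type name such as 'timestamp[us]' to a SQL type."""
    base = arrow_type.split("[", 1)[0]  # drop unit suffixes like '[us]'
    return ARROW_TO_SQL.get(base, "STRING")  # unknown types fall back to STRING


def build_create_table_sql(table: str, fields) -> str:
    """Render CREATE TABLE from (column name, Arrow type name) pairs."""
    columns = ",\n".join(f"  {name} {to_sql_type(kind)}" for name, kind in fields)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{columns}\n);"


# With pyarrow installed, the pairs could be read from a file, for example:
#   [(f.name, str(f.type)) for f in pyarrow.parquet.read_schema(path)]
# Here they are written out by hand with hypothetical column names.
fields = [("price", "int64"), ("transfer_date", "date32[day]"), ("town", "string")]
ddl = build_create_table_sql("house_prices", fields)
print(ddl)
```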
Sample input:

