Map Join Optimization
Overview
Map Join is an efficient JOIN optimization technique in Lakehouse, especially suitable for small table to large table JOIN scenarios. Map Join broadcasts the small table to each compute node and completes the JOIN directly in the Map phase, avoiding expensive Shuffle and Reduce processes, thereby saving resources and improving query performance.
Syntax
Add the /*+ MAPJOIN(table_alias) */ hint to the query, where table_alias is the alias of the small table to broadcast into memory:
Advantages
- Eliminates the Shuffle phase, reducing network transfer and disk I/O overhead.
- Avoids data skew issues since data does not need to be distributed by the JOIN column.
- Significantly improves query performance for small-table-to-large-table JOIN scenarios.
Limitations
- The small table must fit entirely in memory; otherwise, an out-of-memory error may occur. Currently, Lakehouse limits the small table size to 1 GB.
- Only applicable to small-table-to-large-table JOINs, not suitable for large-table-to-large-table scenarios.
Examples
Example 1: Employee and department association query
The departments table (3 rows) is a small table, and the employees table (5 rows) is a large table. Broadcast departments to each node to complete the JOIN.
Result:
Example 2: Order and product association query
The products table (5 rows) is a small table, and the orders table is a large table.
Example 3: Multi-table JOIN with multiple broadcast tables
You can specify multiple small tables in a single hint:
Notes
- The table name in the
MAPJOINhint is the query alias, not the original table name. - If the small table exceeds 1 GB, Lakehouse ignores the hint and automatically falls back to a regular JOIN.
- Map Join does not support FULL OUTER JOIN and RIGHT OUTER JOIN (the small table must be on the right side).
- In Lakehouse, the optimizer can also automatically select Map Join based on statistics, so manual hints are not always required.
