Lakehouse Map Join Optimization Operations
Description
Map Join is an efficient JOIN operation method in Lakehouse, particularly suitable for JOIN scenarios involving small tables and large tables. Map Join broadcasts the small table to each node and completes the JOIN operation directly in the Map phase, thereby avoiding the expensive Shuffle and Reduce processes. This optimization method saves resources to a certain extent and improves query performance.
Syntax
To use Map Join, simply add a hint /*+ MAPJOIN (table) */
in the query statement, where table
represents the name of the small table that needs to be loaded into memory. For example:
In this example, table2
is a small table. Through the Map Join method, it will be loaded into memory and the JOIN operation with table1
will be completed in the Map phase.
Advantages
Map Join has the following advantages:
- Eliminates the Shuffle phase, reducing network transmission and disk IO overhead.
- Avoids data skew issues, as there is no need to distribute data according to the JOIN column.
- Improves query execution speed, especially in scenarios where a small table is joined with a large table.
Disadvantages
Map Join has the following limitations:
- The small table must be fully loaded into memory, otherwise it may cause memory overflow or Map Join failure. Currently, Lakehouse limits the size of the small table to 1GB.
- Map Join is only suitable for JOIN operations between a small table and a large table. For JOIN scenarios between large tables, it may not be advantageous.
Example
Here are some examples of using Map Join:
Example 1: Association query between employee information and department information
Example 2: Association Query of Order Information and Customer Information
3: Associative Query of Sales Records and Employee Information
Through the above examples, you can see the application of Map Join in different scenarios. In actual use, please ensure that the small table can be fully loaded into memory to fully utilize the advantages of Map Join.