Lakehouse Map Join Optimization Operations

Description

Map Join is an efficient JOIN operation method in Lakehouse, particularly suitable for JOIN scenarios involving small tables and large tables. Map Join broadcasts the small table to each node and completes the JOIN operation directly in the Map phase, thereby avoiding the expensive Shuffle and Reduce processes. This optimization method saves resources to a certain extent and improves query performance.

Syntax

To use Map Join, simply add a hint /*+ MAPJOIN (table) */ in the query statement, where table represents the name of the small table that needs to be loaded into memory. For example:

SELECT /*+ MAPJOIN (t2) */ * FROM table1 t1
JOIN table2 t2
ON (t1.emp_id = t2.emp_id);

In this example, table2 is a small table. Through the Map Join method, it will be loaded into memory and the JOIN operation with table1 will be completed in the Map phase.

Advantages

Map Join has the following advantages:

  1. Eliminates the Shuffle phase, reducing network transmission and disk IO overhead.
  2. Avoids data skew issues, as there is no need to distribute data according to the JOIN column.
  3. Improves query execution speed, especially in scenarios where a small table is joined with a large table.

Disadvantages

Map Join has the following limitations:

  1. The small table must be fully loaded into memory, otherwise it may cause memory overflow or Map Join failure. Currently, Lakehouse limits the size of the small table to 1GB.
  2. Map Join is only suitable for JOIN operations between a small table and a large table. For JOIN scenarios between large tables, it may not be advantageous.

Example

Here are some examples of using Map Join:

Example 1: Association query between employee information and department information

SELECT /*+ MAPJOIN (dept) */ * FROM employees emp
JOIN departments dept
ON (emp.dept_id = dept.dept_id);

Example 2: Association Query of Order Information and Customer Information

SELECT /*+ MAPJOIN (customer) */ * FROM orders order_
JOIN customers customer
ON (order_.customer_id = customer.customer_id);

3: Associative Query of Sales Records and Employee Information

SELECT /*+ MAPJOIN (employee) */ * FROM sales sales_
JOIN employees employee
ON (sales_.employee_id = employee.employee_id);

Through the above examples, you can see the application of Map Join in different scenarios. In actual use, please ensure that the small table can be fully loaded into memory to fully utilize the advantages of Map Join.