BloomFilter Index
Introduction to the Principle of Bloom Filter Index:
-
Data Structure: A Bloom Filter consists of a bit array (BitSet) and multiple hash functions. Initially, all bits in the bit array are set to 0.
-
Element Addition: When adding an element to the set, multiple positions in the bit array are calculated using the hash functions, and the bits at these positions are set to 1.
-
Element Query: To check if an element exists, the hash functions are used to calculate the possible positions of the element in the bit array, and these positions are checked to see if all bits are 1. If all bits are 1, the element is considered to possibly exist in the set; if any bit is not 1, the element is definitely not in the set.
-
Advantages:
- Space efficiency and query time are far superior to general algorithms, with storage space and insertion/query time being constant O(k).
- Hash functions are independent of each other, facilitating hardware parallel implementation.
- There is no need to store the elements themselves, which is advantageous in scenarios with strict confidentiality requirements.
-
Disadvantages:
- As the number of elements increases, the false positive rate increases, i.e., the probability of false positives increases.
- Generally, elements cannot be deleted from a Bloom Filter.
-
Application Scenarios:
- Bloom Filter indexes are suitable for columns with high cardinality, such as ID columns, to quickly determine whether the data file of a table might contain the queried data, and if not, skip it, thereby reducing the amount of data scanned.
- Suitable for equality queries (including = and IN), and effective for high cardinality fields such as USER-ID and other unique ID fields. The acceleration effect is limited for low cardinality fields, such as the "gender" field.
-
False Positive Rate: A Bloom Filter may falsely indicate that an element is in the set when it is not, but it will never falsely indicate that an element is not in the set when it is.
Application of Bloom Filter in Lakehouse
A BloomFilter index is an efficient data structure used to quickly determine whether a data file contains specific data. By creating a BloomFilter index for a column, the system generates a BloomFilter for that column. During query execution, this index can be used to skip unnecessary data files, thereby reducing the amount of data scanned and improving query performance. BloomFilter is particularly suitable for columns with high cardinality.
For example, when executing a query such as SELECT * FROM table WHERE col_name = 'xxx'
, the BloomFilter will first determine whether the target data file contains the value. If it does not, the system will skip the file, avoiding unnecessary reads. If the index indicates that the file might contain the target data, the file will be read for further querying.
The BloomFilter index is effective in the following operations: AND
, OR
, IN
, =
. It is important to note that after creating a BloomFilter index, only newly written data will include the index. For old data, it is recommended to use the INSERT OVERWRITE
statement to rewrite the data and add the index.
Specific Use Cases for Bloom Filter
Scenario One: Equality Queries with Large Data Volumes
Case:
Suppose there is a user table users
containing millions of user records, with the following table structure:
In this table, the email
field has a high cardinality, meaning most users have a unique email address. If we often need to query user information based on email addresses, we can create a Bloom Filter index for the email
field.
Create Index:
Query:
When we need to query users with a specific email, we can use the following query:
- Quick Exclusion: Bloom Filter can help the database quickly determine whether
user@example.com
might be in the table. If the Bloom Filter determines that the email is not in the table, the related data files will be skipped, thereby reducing I/O operations and query time. - Performance Improvement: For tables with large amounts of data, this quick exclusion can significantly improve query performance.
Scenario Two: Query Optimization for High Cardinality Columns
Example:
Suppose there is a product table products
, which contains various attributes of the products. The table structure is as follows:
In this table, the cardinality of the category
field is very high because products can belong to different categories. If we often need to query products based on categories, we can create a Bloom Filter index for the category
field.
Create Index:
Query:
When we need to query a specific category of products, we can use the following query:
- Reduce Scanning: Bloom Filter can reduce the amount of data that needs to be scanned, as it can help skip data files that do not contain the target category.
- Improve Efficiency: For high cardinality columns, Bloom Filter can improve query efficiency, especially in cases with large amounts of data.
Notes
- False Positive Probability: Bloom Filter has a defined false positive probability (FPP), which means it may incorrectly indicate that a value exists in the table. Therefore, even if the Bloom Filter suggests that a value might exist, the actual query may not find the value.
- Write Operation Cost: Bloom Filter may have higher costs during write operations (such as insert, update, delete) because it needs to update the Bloom Filter data structure.
- Unsupported Data Types: Bloom Filter does not support complex data types such as interval, struct, map, array, etc. Therefore, attention should be paid to the data type of the column when creating the index.
Create BloomFilter Index
To create a BloomFilter index for a column, you can use the following statement:
For example, create an index named email_bloomfilter
on the email
column in the employees
table:
View BloomFilter Index Details
To view the BloomFilter index details on a table, you can use the DESCRIBE INDEX
statement:
For example, view the details of the index named email_bloomfilter
on the employees
table:
Delete BloomFilter Index
To delete a BloomFilter index, you can use the DROP INDEX
statement:
For example, delete the index named email_bloomfilter
on the employees
table:
List BloomFilter Indexes on the Table
List BloomFilter Indexes on the Table
To view all BloomFilter indexes on the table, you can use the SHOW INDEXES
statement:
For example, view all indexes on the employees
table: