Generated Columns

Generated Columns are columns in a database table whose values are automatically calculated from the values of other columns in the table through an expression. When such a column is created, a calculation rule is defined, and the database management system automatically fills in the column's values based on this rule, without requiring the user to explicitly insert or update these values. An application scenario could be when data integration synchronization does not support transformation, and Generated Columns can be used for transformation.

Syntax

CREATE TABLE [ IF NOT EXISTS ] table_name
(
    column_definition GENERATED ALWAYS AS ( expr ), [column_definition,...]
)
[ PARTITIONED BY (column_name column_type | column_name ) ];
  • GENERATED ALWAYS AS (expr): Automatically generates the value of the column through the expression expr. The expression can include constants and built-in scalar deterministic SQL functions. Non-deterministic functions such as (current_date, random, current_timestamp, context functions) or operators are not supported. Aggregate functions, window functions, or table functions are also not supported. Partition columns using generated columns are supported.
  • Example:
-- Correct usage
CREATE TABLE t_genet (
    col1 TIMESTAMP,
    hour int GENERATED ALWAYS AS (hour(col1)),
    pt STRING GENERATED ALWAYS AS (date_format(col1, 'yyyy-MM-dd'))
) PARTITIONED BY (pt);

Difference Between Generated Columns and Default Values

  1. Partition Column Support:

    1. Default Values: Currently, setting default values for partition columns is not supported.
    2. Generated Columns: Supports generating values for partition columns.
  2. Function Support:

    1. Default Values: Supports using non-deterministic functions, such as current_timestamp().
    2. Generated Columns: Does not support non-deterministic functions, only deterministic scalar functions can be used.
  3. Value Specification in Insert Operations:

    1. Default Values: When inserting data, a static value can be specified for the column. If not specified, the default value is used.
    2. Generated Columns: When inserting data, values cannot be specified for generated columns; their values are entirely determined by the generation expression.
  4. Source of Column Values:

    1. Default Values: Does not support column values derived from transformations of other columns.
    2. Generated Columns: Supports column values derived from transformations of other columns, i.e., values can be computed based on other columns.
  5. Handling Existing Data Rows:

    1. Default Values: When adding a column with a default value to an existing table, the column in existing data rows will be filled with null.
    2. Generated Columns: For existing data rows, the values of generated columns will be transformed according to the generation expression and display the transformed data.

Usage Restrictions

  • When using generated columns, you cannot explicitly specify the value of the column in insert operations; it will be automatically generated by the expression. However, to be compatible with Hive syntax, Lakehouse allows you to specify a static value for partition fields. Nevertheless, the specified static value will not take effect, and the query result will still be determined by the specified generation expression.
  • Writing through real-time interfaces and batch interfaces is not supported, including batch import and real-time writing in Studio data integration.
  • Generated columns do not support non-deterministic functions such as (current_date\random\current_timestamp\context functions) or operators, and do not support aggregate functions, window functions, or table functions.

Inserting Data

  • When using generated columns, you cannot specify a constant value for the column. For example, if the hour column is generated by col1, you cannot specify the value of hour during insertion. Doing so will result in an error.

Error Example:

-- The specified 2024-09-26 will not take effect because pt is generated from col1. Although it will not report an error, the value will not be written.
CREATE TABLE t_genet (
    col1 TIMESTAMP,
    pt STRING GENERATED ALWAYS AS (date_format(col1, 'yyyy-MM-dd'))
) PARTITIONED BY (pt);
INSERT INTO t_genet (col1,pt) VALUES (current_timestamp, '2024-09-26');

Correct Example

-- Can be executed, only insert col1, pt will be automatically calculated by the generation rule
INSERT INTO t_genet (col1) VALUES (current_timestamp);

Specify Generated Column When Adding Field

Syntax

-- Add column
ALTER TABLE table_name ADD COLUMN
      column1_name_identifier data_type [column_properties]
      [FIRST | AFTER column1_name_identifier]  ,....

column_properties:==
    GENERATED ALWAYS AS ( expr ) |
    COMMENT column_comment
  • GENERATED ALWAYS AS (expr): Specifies an expression used to automatically generate the value of a newly added column. For existing rows in the table, the column will be filled with the result of the expression. The expression can include constants and built-in scalar deterministic SQL functions. Non-deterministic functions such as (current_date, random, current_timestamp, etc.) or operators are not supported. Aggregate functions, window functions, or table functions are also not supported.

Usage Restrictions

  • Adding generated columns to existing columns is not supported

Example

Adding a Generated Column

ALTER TABLE my_table ADD COLUMN
    generated_col TIMESTAMP GENERATED ALWAYS AS (date_format(col1, 'yyyy-MM-dd')) FIRST;

In this example, generated_col will be added as the first column in the table, and the generated_col for all existing data rows will be generated based on the current timestamp.