Shared entities

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section instead.


Often, the same entity—such as a product—can appear in multiple data sources, each capturing different aspects of it. With the shared_entities configuration block, BaseModel allows you to unify these representations across various data sources, combining their attributes into a single entity.

Benefits:

Enriching the representation of an entity can significantly enhance model performance. However, this improvement depends on the characteristics of your data and the specific task.


Unifying Representation Across Multiple Sources

Example:
Suppose you have two data sources, page_visit and product_buy, both containing a column named article_id that identifies products. Normally, products in page_visit would be separate from those in product_buy. Using shared_entities, you can create a single, unified representation called product that encompasses both sources.


Configuration

data_sources:
  - type: event
    main_entity_column: use_id
    name: product_buy
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_buy
    shared_entities:
      - name: product
        columns: [article_id]
        id_column: article_id

  - type: event
    main_entity_column: use_id
    name: page_visit
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: page_visit
    shared_entities:
      - name: product
        columns: [article_id]
        id_column: article_id
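
The effect of this configuration can be pictured with a toy example. The sketch below is plain Python, not BaseModel code; the sample rows and the helper `entity_ids` are hypothetical. It contrasts keeping article_id values per source with pooling them under one shared entity named product.

```python
# Toy rows standing in for the two event tables (hypothetical sample data).
product_buy = [
    {"use_id": 1, "article_id": "A1"},
    {"use_id": 2, "article_id": "A2"},
]
page_visit = [
    {"use_id": 1, "article_id": "A2"},
    {"use_id": 3, "article_id": "A3"},
]


def entity_ids(rows, id_column):
    """Collect the distinct entity identifiers found in one source."""
    return {row[id_column] for row in rows}


# Without sharing: each source keeps its own, unrelated set of products.
separate = {
    "product_buy": entity_ids(product_buy, "article_id"),
    "page_visit": entity_ids(page_visit, "article_id"),
}

# With sharing: both sources contribute to one entity called "product",
# so "A2" in a purchase and "A2" in a page visit refer to the same item.
shared_product = entity_ids(product_buy, "article_id") | entity_ids(page_visit, "article_id")

print(sorted(shared_product))
```

In this picture, the shared entity is simply the union of identifiers across sources, with overlapping identifiers treated as one item.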

Enriching Entity Representation with Additional Columns

You can further enrich the unified entity by including additional columns from each data source. For instance, the product_buy source may have a name column containing the product's name, while page_visit may have a product_description column.

Using shared_entities, both columns can be included to create a more comprehensive entity.

❗️

Watch Out

Currently, only Text and CategoricalCompressed (Cleora-based) columns are supported.

Configuration:

data_sources:
  - type: event
    main_entity_column: use_id
    name: product_buy
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_buy
    shared_entities:
      - name: product
        columns: [article_id, name]
        id_column: article_id

  - type: event
    main_entity_column: use_id
    name: page_visit
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: page_visit
    shared_entities:
      - name: product
        columns: [article_id, product_description]
        id_column: article_id
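
As a rough picture of what enrichment means, the sketch below merges the name column from product_buy and the product_description column from page_visit into one record per article_id. It is a plain-Python illustration with made-up rows and a hypothetical helper `merge_attributes`; how BaseModel actually combines and encodes these columns is not shown here.

```python
from collections import defaultdict

# Hypothetical sample rows; column names follow the configuration above.
product_buy = [
    {"article_id": "A1", "name": "Trail shoe"},
    {"article_id": "A2", "name": "Rain jacket"},
]
page_visit = [
    {"article_id": "A2", "product_description": "Waterproof shell"},
    {"article_id": "A3", "product_description": "Wool socks"},
]


def merge_attributes(sources, id_column):
    """Combine attribute columns from several sources into one record per id.

    `sources` is a list of (rows, columns) pairs, where `columns` names the
    attribute columns that each source contributes.
    """
    merged = defaultdict(dict)
    for rows, columns in sources:
        for row in rows:
            for column in columns:
                if column in row:
                    merged[row[id_column]][column] = row[column]
    return dict(merged)


products = merge_attributes(
    [(product_buy, ["name"]), (page_visit, ["product_description"])],
    id_column="article_id",
)

for article_id, attributes in sorted(products.items()):
    print(article_id, attributes)
```

A product that appears in both sources (A2 here) ends up with both attributes, while products seen in only one source keep whatever that source provides.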

Combining Attributes from Joined Sources

Suppose we have one event data source, product_buy, with the column article_id, and an attribute data source, product_attributes, containing columns article_id, name, and description.

You can join the product_attributes data source with the product_buy event source following the process described in the Joining Additional Attributes section. Normally, the attributes would be joined as separate product representations.

Using shared_entities, you can merge all these attributes into a single entity representation.

🚧

Remember

Shared entities should be defined in the data source to which the attributes are being joined. In this case, it is the event data source product_buy.

Configuration Format:

data_sources:
  - type: attribute
    name: product_attributes
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_attributes

  - type: event
    main_entity_column: use_id
    name: product_buy
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_buy
    shared_entities:
      - name: product
        columns: [article_id, [description, [product_attributes]], [name, [product_attributes]]]
        id_column: article_id

The syntax for defining columns is:
[column_from_source_to_which_we_join, [column_from_the_source_being_joined, [name_of_the_source_being_joined]]]
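
To make the nesting easier to read, the sketch below walks a columns list and separates plain columns from columns that come from a joined source. The helper `parse_columns` is hypothetical and written only for illustration; it is not part of BaseModel and performs no validation beyond the shape described above.

```python
def parse_columns(columns):
    """Split a `columns` list into (column_name, joined_sources) pairs.

    A plain string is a column of the data source that defines the shared
    entity, so its list of joined sources is empty. A nested item has the
    shape [column_name, [joined_source_name]].
    """
    parsed = []
    for item in columns:
        if isinstance(item, str):
            parsed.append((item, []))
            continue
        if not isinstance(item, (list, tuple)) or len(item) != 2:
            raise ValueError(f"unexpected column spec: {item!r}")
        column_name, joined_sources = item
        parsed.append((column_name, list(joined_sources)))
    return parsed


columns = [
    "article_id",
    ["description", ["product_attributes"]],
    ["name", ["product_attributes"]],
]

for column_name, joined_sources in parse_columns(columns):
    origin = ", ".join(joined_sources) if joined_sources else "the defining source"
    print(f"{column_name}: from {origin}")
```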


shared_entities Parameters

  • name (str)
    Example: "shared_entity_1"
    A string that specifies the name for the shared entity. This needs to match between event data sources to identify the same shared entity.

  • columns (list[str | tuple[str, list[str]]])
    Example: ["col1", ["col2", ["joined"]]]
    A list of columns that compose the shared entity. Columns from the data source where the shared entity is defined are provided as strings. Columns from the joined source are provided in the format [col_name, [joined_source_name]].

  • id_column (str | tuple[str, list[str]])
    Example: "article_id"
    The main column identifying the shared entity, usually the id or a property with the highest cardinality.
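
Because the name must match across data sources, a quick check of which sources declare each shared entity can help catch typos such as product versus products. The sketch below is a hypothetical pre-flight check in plain Python, run on a dictionary that mirrors the YAML structure with connection details omitted; the function `sources_by_shared_entity` is an assumption for illustration, not a BaseModel API.

```python
# A pared-down Python mirror of the YAML structure (connection details omitted).
config = {
    "data_sources": [
        {
            "type": "event",
            "name": "product_buy",
            "shared_entities": [
                {
                    "name": "product",
                    "columns": ["article_id", "name"],
                    "id_column": "article_id",
                }
            ],
        },
        {
            "type": "event",
            "name": "page_visit",
            "shared_entities": [
                {
                    "name": "product",
                    "columns": ["article_id", "product_description"],
                    "id_column": "article_id",
                }
            ],
        },
    ]
}


def sources_by_shared_entity(config):
    """Map each shared entity name to the data sources that declare it."""
    result = {}
    for source in config["data_sources"]:
        for entity in source.get("shared_entities", []):
            result.setdefault(entity["name"], []).append(source["name"])
    return result


mapping = sources_by_shared_entity(config)
print(mapping)
```

A shared entity name that you expected to span several sources but that lists only one is a hint that the names do not match.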