Shared entities
Check This First!
This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, refer to the Snowflake Native App section instead.
Often, the same entity, such as a product, can appear in multiple data sources, each capturing different aspects of it. With the shared_entities configuration block, BaseModel allows you to unify these representations across various data sources, combining their attributes into a single entity.
Benefits:
Enriching the representation of an entity can significantly enhance model performance. However, this improvement depends on the characteristics of your data and the specific task.
Unifying Representation Across Multiple Sources
Example:
Suppose you have two data sources, page_visit and product_buy, both containing a column named article_id that identifies products. Normally, products in page_visit would be separate from those in product_buy. Using shared_entities, you can create a single, unified representation called product that encompasses both sources.
Configuration
data_sources:
  - type: event
    main_entity_column: use_id
    name: product_buy
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_buy
    shared_entities:
      - name: product
        columns: [article_id]
        id_column: article_id
  - type: event
    main_entity_column: use_id
    name: page_visit
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: page_visit
    shared_entities:
      - name: product
        columns: [article_id]
        id_column: article_id
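To make the effect of sharing concrete, here is a minimal conceptual sketch in Python. It is not BaseModel's implementation: the toy rows and the build_vocabulary helper are hypothetical and only illustrate the difference between indexing article_id per source and indexing it once across both sources.

```python
# Hypothetical toy rows; real data would come from the configured tables.
page_visit = [{"use_id": 1, "article_id": "A1"}, {"use_id": 2, "article_id": "A2"}]
product_buy = [{"use_id": 1, "article_id": "A2"}, {"use_id": 3, "article_id": "A3"}]


def build_vocabulary(*sources, id_column="article_id"):
    """Map every distinct ID seen in the given sources to one integer index."""
    ids = sorted({row[id_column] for source in sources for row in source})
    return {entity_id: index for index, entity_id in enumerate(ids)}


# Without sharing: each source has its own, unrelated index space.
separate_visit = build_vocabulary(page_visit)
separate_buy = build_vocabulary(product_buy)

# With sharing: one index space covering both sources.
shared_product = build_vocabulary(page_visit, product_buy)

print(separate_visit, separate_buy, shared_product)
```

In the separate case, article A2 receives a different index in each source, so nothing ties the two occurrences together. In the shared case, A2 resolves to the same entry regardless of which source it came from.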
Enriching Entity Representation with Additional Columns
You can further enrich the unified entity by adding columns from each data source. For instance, the product_buy source may have a name column containing the product's name, while page_visit may have a product_description column.
Using shared_entities, both columns can be included to create a more comprehensive entity.
Watch Out
Currently, only Text and CategoricalCompressed (Cleora-based) columns are supported.
Configuration:
data_sources:
  - type: event
    main_entity_column: use_id
    name: product_buy
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_buy
    shared_entities:
      - name: product
        columns: [article_id, name]
        id_column: article_id
  - type: event
    main_entity_column: use_id
    name: page_visit
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: page_visit
    shared_entities:
      - name: product
        columns: [article_id, product_description]
        id_column: article_id
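As a rough illustration of what "combining attributes" means here, the following Python sketch merges the columns listed in each source into one record per article_id. The merge_entity_attributes helper and the sample rows are hypothetical, and the "first value wins" rule is an assumption made for the example; how BaseModel actually encodes these columns is not shown.

```python
def merge_entity_attributes(sources, id_column):
    """Combine per-source attribute columns into one record per entity ID."""
    merged = {}
    for rows, columns in sources:
        for row in rows:
            record = merged.setdefault(row[id_column], {})
            for column in columns:
                if column != id_column and column in row:
                    # Assumption for this sketch: keep the first value seen.
                    record.setdefault(column, row[column])
    return merged


# Hypothetical sample rows for the two event sources.
product_buy = [{"article_id": "A1", "name": "Running shoes"}]
page_visit = [
    {"article_id": "A1", "product_description": "Lightweight trail shoes"},
    {"article_id": "A2", "product_description": "Waterproof jacket"},
]

products = merge_entity_attributes(
    [
        (product_buy, ["article_id", "name"]),
        (page_visit, ["article_id", "product_description"]),
    ],
    id_column="article_id",
)
print(products)
```

Article A1 ends up with both a name and a product_description, while A2, seen only in page_visit, carries just the description.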
Combining Attributes from Joined Sources
Suppose we have one event data source, product_buy, with the column article_id, and an attribute data source, product_attributes, containing columns article_id, name, and description.
You can join the product_attributes data source with the product_buy event source following the process described in the Joining Additional Attributes section. Normally, the attributes would be joined as separate product representations. Using shared_entities, you can merge all these attributes into a single entity representation.
Remember
Shared entities should be defined in the data source to which the attributes are being joined. In this case, it is the event data source product_buy.
Configuration Format:
data_sources:
  - type: attribute
    name: product_attributes
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_attributes
  - type: event
    main_entity_column: use_id
    name: product_buy
    date_column:
      name: timestamp
    data_location:
      database_type: snowflake
      connection_params:
        user: username
        password: strongpassword123
        account: xy12345.west-europe.azure
        database: EXAMPLE_DB
        schema: EXAMPLE_SCHEMA
        role: ACCOUNT_ADMIN
      table_name: product_buy
    shared_entities:
      - name: product
        columns: [article_id, [description, [product_attributes]], [name, [product_attributes]]]
        id_column: article_id
The syntax for defining columns is:
[column_from_source_to_which_we_join, [column_from_the_source_being_joined, [name_of_the_source_being_joined]]]
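The nested list can be hard to read at a glance. The short Python sketch below walks a columns value, as it would look after the YAML is loaded, and separates plain columns from joined ones. The parse_columns helper is hypothetical and is not part of BaseModel; it only mirrors the syntax described above.

```python
def parse_columns(columns):
    """Split a `columns` value into (column_name, joined_sources) pairs.

    A plain string is a column of the source that defines the shared
    entity, so its list of joined sources is empty.
    """
    parsed = []
    for item in columns:
        if isinstance(item, str):
            parsed.append((item, []))
        else:
            column_name, joined_sources = item
            parsed.append((column_name, list(joined_sources)))
    return parsed


# The `columns` value from the configuration above, as a Python list.
columns = [
    "article_id",
    ["description", ["product_attributes"]],
    ["name", ["product_attributes"]],
]

for column_name, joined_sources in parse_columns(columns):
    origin = joined_sources[0] if joined_sources else "defining source"
    print(f"{column_name}: {origin}")
```

Here article_id comes from product_buy itself, while description and name are taken from the joined product_attributes source.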
shared_entities Parameters
- name (str)
  Example: "shared_entity_1"
  A string that specifies the name of the shared entity. It must match across event data sources to identify the same shared entity.
- columns (list[str | tuple[str, list[str]]])
  Example: ["col1", ["col2", ["joined"]]]
  A list of columns that compose the shared entity. Columns from the data source where the shared entity is defined are given as strings. Columns from a joined source are given in the format [col_name, [joined_source_name]].
- id_column (str | tuple[str, list[str]])
  Example: "article_id"
  The main column identifying the shared entity, usually the id or the property with the highest cardinality.
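Because the name must match exactly between data sources, a typo silently produces two unrelated entities. The following Python sketch, which assumes the configuration has already been loaded into plain dictionaries, lists which data sources reference each shared entity name so that mismatches are easy to spot. The sources_by_shared_entity helper is hypothetical and not a BaseModel function.

```python
from collections import defaultdict


def sources_by_shared_entity(data_sources):
    """Group data source names by the shared entity names they declare."""
    usage = defaultdict(list)
    for source in data_sources:
        for entity in source.get("shared_entities", []):
            usage[entity["name"]].append(source["name"])
    return dict(usage)


# Trimmed-down, hypothetical stand-in for a loaded configuration.
data_sources = [
    {
        "name": "product_buy",
        "shared_entities": [
            {"name": "product", "columns": ["article_id"], "id_column": "article_id"}
        ],
    },
    {
        "name": "page_visit",
        "shared_entities": [
            {"name": "products", "columns": ["article_id"], "id_column": "article_id"}
        ],
    },
]

for entity_name, source_names in sources_by_shared_entity(data_sources).items():
    print(entity_name, "->", source_names)
```

In this deliberately broken example, product and products appear as two separate entries, each used by a single source, which signals that the names do not line up.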