Apache Hive data sources

Connection parameters in YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.


Data sources are specified in the YAML file used by the pretrain function and are configured by the entries in the data_location section. Below is an example that should be adapted to your configuration.

data_location:
  database_type: hive
  connection_params:
    # Query string parameters
    hive_params:
      DSN: SmokeTests
    # Path to ini file (optional, can be set via env variable ODBCINI) 
    ini_file: "/PATH_TO_INI_FILE"
  table_name: some_table

Parameters

  • database_type : str, required
    No default value.
    Information about the database type or source file. All data tables should use the same type.
    Set to: hive.

  • connection_params : dict, required
    Configures the connection to the database.
    If the ODBCINI environment variable or the ini_file field in connection_params is set, the connection is configured by the .ini file together with the matching DSN parameter in the hive_params field.
    Any parameters specified in the hive_params field override those set in the .ini file.
    Authenticating with Kerberos requires specifying the kerberos_params field in connection_params. Parameters specified in kerberos_params override those set in the .ini file or in hive_params.

    Keyword arguments are:

    • hive_params : dict
      Specifies connection parameters to Hive. Defined with keywords:

      • DSN : str, optional
        DSN used for the connection.

      • Driver : str, optional
        Path to connection driver.

      • Port : int, optional
        Connection port.

      • HiveServerType : int, optional
        Type of Hive server.

      • Additional parameters can be found in Apache Hive Configuration Properties.

        Each parameter should have the prefix SSP_. For example, to set hive.test.mode.samplefreq=100, add SSP_hive.test.mode.samplefreq: 100 to hive_params.


    • ini_file : str, optional
      Path to the .ini file. If not provided, the path is taken from the ODBCINI environment variable.


    • kerberos_params : dict, optional
      Specifies authentication with Kerberos if needed. Defined with keywords:

      • user : str
        Kerberos principal name.

      • realm : str
        Kerberos realm name.

      • kerberos_host : str
        Kerberos service host IP.

      • kerberos_service_name : str
        Kerberos service name, e.g. 'hive'.

      • kerberos_fqdn : str
        Fully qualified domain name.

      • keytab_path : str
        Path to the keytab file.

      • krb5_config_path : str
        Path to the krb5.conf file. Defaults to "/etc/krb5.conf".

      • password : str, optional
        A password in plain text or a path to a file containing the password. Password file works only with Heimdal Kerberos client. Defaults to None.

      • kerberos_renewal_interval_minutes : int, optional
        Interval in minutes at which to renew the Kerberos ticket. Defaults to 540.

      • verbose : bool, optional
        Whether to print verbose output. Defaults to False.


  • table_name : str
    Specifies the table to use to create features. Example: customers.
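
The override behaviour of hive_params described above can be sketched as follows. This is a minimal illustration, not a recommended setup: the Port and HiveServerType values are assumptions chosen for the example and should be replaced with the values that match your server and ODBC driver. The SSP_ entry mirrors the hive.test.mode.samplefreq example from the parameter list.

```yaml
data_location:
  database_type: hive
  connection_params:
    hive_params:
      # DSN must match an entry defined in the .ini file
      DSN: SmokeTests
      # Values below override whatever the .ini file sets for this DSN
      # (10000 and 2 are illustrative assumptions)
      Port: 10000
      HiveServerType: 2
      # Hive configuration property passed with the SSP_ prefix
      SSP_hive.test.mode.samplefreq: 100
    ini_file: "/PATH_TO_INI_FILE"
  table_name: some_table
```

Keeping the shared settings in the .ini file and only the per-source differences in hive_params tends to keep each data_location block short.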

The connection_params should be set separately in each data_location block, for each data source.

Example

The following example demonstrates the connection to Hive in the context of a simple configuration with two data sources.

data_sources:
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    data_location:
      database_type: hive
      connection_params:
        # Query string parameters
        hive_params:
          DSN: SmokeTests
        # Path to ini file (optional, can be set via env variable ODBCINI)
        ini_file: "/PATH_TO_INI_FILE"
      table_name: customers
    disallowed_columns: [CreatedAt]
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column: Timestamp
    data_location:
      database_type: hive
      connection_params:
        # Query string parameters
        hive_params:
          DSN: SmokeTests
        # Path to ini file (optional, can be set via env variable ODBCINI)
        ini_file: "/PATH_TO_INI_FILE"
      table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"
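
For a Kerberos-secured Hive server, a data_location block might look like the sketch below. It uses only the kerberos_params keywords listed in the Parameters section; every value (principal, realm, host, FQDN, and paths) is a hypothetical placeholder. The sketch assumes keytab-based authentication and therefore omits password; whether your setup needs password instead is not covered here.

```yaml
data_location:
  database_type: hive
  connection_params:
    hive_params:
      DSN: SmokeTests
    ini_file: "/PATH_TO_INI_FILE"
    kerberos_params:
      # All values below are hypothetical placeholders
      user: hive_user
      realm: EXAMPLE.COM
      kerberos_host: "10.0.0.1"
      kerberos_service_name: hive
      kerberos_fqdn: hive.example.com
      keytab_path: "/PATH_TO_KEYTAB_FILE"
      # Same as the documented default, shown for clarity
      krb5_config_path: "/etc/krb5.conf"
      # Same as the documented default of 540 minutes
      kerberos_renewal_interval_minutes: 540
      verbose: false
  table_name: some_table
```

Because connection_params is set per data source, the kerberos_params block would need to be repeated in each data_location that connects to the secured server.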

📘

Learn More

The detailed description of optional fields such as disallowed_columns, where_condition, sql_lambdas, and many others is provided here.