Multi-Table Concepts
This section introduces the use of Khiops dictionaries for managing data preparation with multi-table datasets, a frequent scenario in real-world business applications. Khiops eliminates the need for labor-intensive preprocessing and manual handling of relationships between tables, offering a scalable and automated solution for relational data. For full documentation, please refer to the dictionaries reference page.
Relational Data Description
When working with relational (multi-table) data, a key step is defining relationships between tables and their respective variables. Traditional libraries require manual table joins, ad hoc data management, and domain expertise to prepare data for analysis. This is time-consuming, prone to errors, and becomes infeasible for large scale datasets (which occurs more rapidly with such multi-table datasets).
Khiops uses a dictionary-based language that naturally describes the relational schema without flattening the data. Relationships between tables are encoded directly in the dictionary, enabling efficient and interpretable processing. Key benefits include:
- Scalability: Process large datasets without loading tables into memory, leveraging I/O optimizations, out-of-core and distributed processing.
- Streamlined Versioning: End-to-end data transformations are recorded in the dictionary, eliminating the need for versioning large intermediary datasets.
Simple Star Relational Schema
Let's start with a example of a star schema describing a company’s customers, their addresses, and the services they use:
Customer
designates the data type of the main statistical units studied, Address
and Services
correspond to the types of their secondary records.
The corresponding Khiops dictionary is:
Example: Dictionary for a star schema
Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Entity(Address) customerAddress; // 0-1 relationship
Table(Services) customerServices; // 0-N relationship
};
Dictionary Address (customer_id)
{
Categorical customer_id;
Categorical streetNumber;
Categorical streetName;
Categorical city;
Categorical zipcode;
Categorical State;
};
Dictionary Services (customer_id)
{
Categorical customer_id;
Categorical name;
Numerical cost;
Date purchaseDate;
};
In this example:
- Each
Dictionary
corresponds to a user-defined data structure (Customer
,Address
,Services
), effectively serving as a data type; - The categorical variable
customer_id
acts as the key, uniquely identifying the statistical units in the main table (i.e.,Customer
) and linking their associated records in the secondary tables (i.e.,Services
andAddress
). Entity(Address)
designates a 0-1 relationship (e.g., each customer has one address or none).Table(Services)
designates a 0-N relationship (e.g., each customer can have multiple services). Each statistical unit in the main tableCustomer
refers to a table of records in the secondary tableServices
.
Snowflake Relational Schema
This example extends the previous star schema by introducing a snowflake relational schema, where secondary tables are expanded with additional hierarchical levels. In this case, the schema includes a new Usages
table, which describes how customers use specific Services
.
Note that a customer can use each service several times (relation 0-N). This additional complexity reflects real-world scenarios, such as tracking multiple transactions or interactions linked to a single entity.
Here's the dictionary file describing this relational data:
Example: Dictionary file of a snowflake relational schema
Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Entity(Address) customerAddress; // 0-1 relationship
Table(Services) customerServices; // 0-N relationship
};
Dictionary Address (customer_id)
{
Categorical customer_id;
Categorical streetNumber;
Categorical streetName;
Categorical city;
Categorical zipcode;
Categorical State;
};
Dictionary Services (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical name;
Numerical cost;
Date purchaseDate;
Table(Usages) serviceUsages; // 0-n relationship
};
Dictionary Usages (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical usageType;
Date date;
Time time;
Numerical duration;
};
The only new syntax feature in this example is the use of a multiple-field key in the Services
and Usages
dictionaries. These keys allow Khiops to efficiently associate multiple levels of data, such as linking usage records to specific services used by specific customers. As the number of table levels increases in a snowflake schema, the identification key becomes longer, consisting of a concatenation of multiple identifier variables (e.g., customer_id and service_id).
This structure enables hierarchical feature engineering. For example, Khiops can calculate aggregates like “Total usage duration per customer” by traversing multiple levels.
Snowflake Schema with External Tables
This example introduces the concept of an external table. External tables are used to enrich descriptive variables in other tables without duplicating information. Unlike standard secondary tables, external tables are not directly linked to the main statistical units (e.g., customer_id
) but instead provide additional descriptive information for a specific variable.
For instance, the City
table in the following schema adds information such as the city name, country, and time zone to the Address
table, without repeating the same city details across multiple rows. Instead, the Address
table references the City
table using the zipcode
key. This approach ensures efficiency and consistency, especially in large scale datasets.
The relational schema is structured as follows:
Here’s the dictionary file that defines this relational schema:
Example: Dictionary file for a snowflake schema with an external table
Root Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Entity(Address) customerAddress; // 0-1 relationship
Table(Services) services; // 0-n relationship
};
Dictionary Address (customer_id)
{
Categorical customer_id;
Categorical streetNumber;
Categorical streetName;
Categorical zipcode;
Entity(City) city[zipcode];
};
Root Dictionary City (zipcode)
{
Categorical zipcode;
Categorical name;
Categorical country;
Categorical timeZone;
};
Dictionary Services (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical name;
Numerical cost;
Date purchaseDate;
Table(Usage) Usages; // 0-n relationship
};
Dictionary Usage (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical usageType;
Date date;
Time time;
Numerical duration;
};
An external table behaves like an Entity, creating a 0-1 relationship. The syntax starts with Entity(City)
. And because it provides additional information for a specific descriptive variable rather than statistical units, square brackets [zipcode]
are used to specify the linking key.
The Root
keyword is required for external tables, as it indicates that the key variable (e.g., zipcode) uniquely identifies the records in that table. For the Customer
table, the Root
keyword is optional and simply tags the main dictionary representing the statistical units under study.
Filtering Out-of-scope Tables and Variables
In data management, defining the scope of an analysis is crucial to focus on the most relevant information while avoiding the overhead of processing unnecessary data. For multi-table datasets, this involves filtering out tables and variables that do not contribute to the analysis.
With conventional tools, data scientists often load entire datasets into memory before manually discarding irrelevant variables or tables. This approach is manageable for small datasets but becomes inefficient and costly for larger, multi-table datasets. It can also lead to excessive trial-and-error runs, creating versioning challenges and increasing storage costs, especially in cloud environments.
Khiops dictionaries provide a more efficient solution by enabling on-the-fly filtering. Only the relevant columns and tables are loaded into memory during processing. This eliminates unnecessary overhead and makes dictionaries lightweight and easy to version, offering a scalable alternative for industrial use cases.
The Unused
keyword in Khiops dictionaries allows you to specify variables and tables that should be excluded from analysis. Here’s an example:
Example: Filtering with the Unused keyword
Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Unused Entity(Address) customerAddress; // Unused entity
Table(Services) customerServices;
};
Dictionary Address (customer_id)
{
Categorical customer_id;
Categorical streetNumber;
Categorical streetName;
Categorical city;
Categorical zipcode;
Categorical State;
};
Dictionary Services (customer_id)
{
Categorical customer_id;
Categorical name;
Unused Numerical cost; // Unused variable
Date purchaseDate;
};
User-defined Relational Schema
In many cases, databases are designed with technical constraints in mind (e.g., optimizing query speed or storage efficiency), often at the expense of usability for business experts. As a result, the stored data can be difficult to interpret or align with the experts’ knowledge of the problem. Reorganizing this data to better reflect the business context is a critical step in data management.
Traditionally, this process involves manually coding project-specific data management workflows—a time-consuming and error-prone task, compounded by the challenges of versioning multiple scripts and iterations.
Khiops dictionaries offer a powerful alternative. By using their built-in data manipulation language (see the reference page), you can streamline this process and encode business knowledge directly into a reusable, versionable format. This not only simplifies data management but also ensures alignment between technical implementation and business understanding.
The next three subsections illustrate how dictionaries can facilitate data manipulation through:
- Redefining the scope of statistical units;
- Concatenating tables;
- Advanced selection of training examples.
Redefining the scope of statistical units
In many real-world applications, the statistical units of interest are not directly stored in the database and need to be derived from raw data. For instance, it may be necessary to eliminate noisy or insignificant records to focus on meaningful information. This process is essential for reducing complexity, saving computational resources, and ensuring the analysis is targeted.
Khiops makes this process seamless using the TableSelection
function, which allows you to filter secondary records based on specified criteria, directly within the dictionary. This eliminates the need for pre-processing steps in Python or SQL, enabling scalable and efficient data management.
Example 1: Filter by Duration of Use
The following example redefines the scope of statistical units by selecting only the records in the Usage
table where the duration exceeds ten minutes (GE(duration,10))
:
The TableSelection function applied to secondary records
Root Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Table(Services) services;
};
Dictionary Services (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical name;
Numerical cost;
Date purchaseDate;
Unused Table(Usage) allServiceUsages;
// selection of usages > 10 minutes
Table(Usage) studiedUsages = TableSelection(allServiceUsages, GE(duration,10));
};
Dictionary Usage (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical usageType;
Date date;
Time time;
Numerical duration;
};
This command identifies and retains only the relevant records from the Usage table, streamlining the analysis process by discarding unnecessary data.
Notice that in the rule TableSelection(.,.)
, the first operand designates the name of a table and the second operand is a selection rule applied in the scope of this table.
Example 2: Filter by Time Relative to a Parent Variable
In more advanced scenarios, filtering may require conditions that depend on variables from a parent table. For instance, the following example filters records from the Usage
table to include only those occurring within 30 days of the purchaseDate
from the Services
table:
The . scope operator
Root Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Table(Services) services;
};
Dictionary Services (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical name;
Numerical cost;
Date purchaseDate;
Unused Table(Usage) allServiceUsages;
// selection of usages within 30 days after subscription
Table(Usage) studiedUsages = TableSelection(allServiceUsages, LE(DiffDate(date, .purchaseDate),30));
};
Dictionary Usage (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical usageType;
Date date;
Time time;
Numerical duration;
};
In this example, the second operand in the rule LE(DiffDate(date, .purchaseDate), 30)
operates within the scope of the Usage
dictionary, where it directly uses the date
variable. It also accesses the purchaseDate
variable from the parent Services
dictionary using the .
scope operator to establish the relationship between the two tables.
Table concatenation
In many production environments, large data tables are often divided into multiple chunks for easier storage and management. This is especially common for secondary tables with a high volume of records, such as log data or transactional details. While this practice improves storage efficiency and system performance, it introduces the challenge of reconstructing the original table for analysis or model training.
Khiops provides an efficient solution for this scenario through the TableUnion
rule, which allows you to seamlessly concatenate these chunks. In the example below, Usage
table is divided into quarterly chunks (usagesQuarter1
, usagesQuarter2
, etc.). These chunks are unified into a single table using TableUnion
. To prevent duplicates during analysis, the Unused
keyword is applied to each chunk sub-table, as the concatenated table already contains all the records.
The TableUnion function
Root Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Table(Services) services;
};
Dictionary Services (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical name;
Numerical cost;
Date purchaseDate;
Unused Table(Usage) usagesQuarter1;
Unused Table(Usage) usagesQuarter2;
Unused Table(Usage) usagesQuarter3;
Unused Table(Usage) usagesQuarter4;
// concatenation of 4 files, divided for volume purposes
Unused Table(Usage) allUsages = TableUnion(usagesQuarter1, usagesQuarter2, usagesQuarter3, usagesQuarter4);
// selection of usages > 10 minutes
Table(Usage) studiedUsages = TableSelection(allServiceUsages, GE(duration,10));
};
Dictionary Usage (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical usageType;
Date date;
Time time;
Numerical duration;
};
Advanced Example Selection
The selection of training examples is an important step in data management, enabling the scope of the analysis to be defined. In the case of multi-table data, this step can become complex, requiring laborious manual work and coding. Here again, dictionaries offer an effective alternative, and greatly facilitate versioning.
The following example shows the selection of a particular marketing segment consisting of housewives under 50 who have used the VOD service at least 10 times. As this selection criterion is complex, it is written on several lines:
- line 1: selection of customers < 50 years old
- line 2: selection of women
- line 3: selection of services that are both (i) VOD and (ii) with more than 10 uses
For full documentation on dictionary language, please refer to the reference page.
Complex criteria for selecting examples
Root Dictionary Customer (customer_id)
{
Categorical customer_id;
Numerical age;
Categorical sex;
Categorical marketingSegment;
Table(Services) services;
// Selection of housewives under 50 who have used VOD at least 10 times.
// This complex selection criterion is written on several lines:
// - line 1: selection of customers < 50 years old
// - line 2: selection of women
// - line 3: selection of services that are both
// -- VOD
// -- and with more than 10 uses
Unused Numerical selectionVariable = And(L(age,50),
EQc(sex,"F"),
EQ(TableCount(TableSelection(services,
And(EQc(name,"VOD"),
GE(TableCount(allServiceUsages),10))))
,1)
);
};
Dictionary Services (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical name;
Numerical cost;
Date purchaseDate;
Table(Usage) allServiceUsages;
};
Dictionary Usage (customer_id, service_id)
{
Categorical customer_id;
Categorical service_id;
Categorical usageType;
Date date;
Time time;
Numerical duration;
};