Auto Feature Engineering & more¶

Introduction¶

This Notebook presents two key components of the Auto ML pipeline provided by Khiops:

Auto Feature Engineering, which automatically generates a large number of informative aggregates from the secondary tables of a multi-table training set,
Parsimonious Training, which trains a model by selecting a small subset of independent and highly informative variables (native or aggregates).

The sequencing of these two steps greatly improves model interpretability. To be more precise, the data representation varies in size over the whole pipeline (see the figure below). On the one hand, Auto Feature Engineering explores a large number of aggregates, enriching the data representation with useful but possibly redundant information. On the other hand, Parsimonious Training reduces the data representation by selecting a few informative and independent variables. The contributions of the selected variables are almost additive, since their interactions are reduced to a minimum, making the model easy to interpret.

Combined with the fact that the aggregates generated in the Auto Feature Engineering step have explicit names, this makes the models produced by Khiops very easy to understand. A visualization tool is provided for this purpose, making it possible to understand and visualize the entire Auto ML pipeline, from optimal encoding to model evaluation.

In this notebook, we explore Khiops' Auto Feature Engineering capabilities, which stand out for their overfitting prevention, interpretability and scalability. We demonstrate that Khiops' Auto Feature Engineering algorithm can be coupled with any classifier (here, an LGBM classifier), sparing data scientists from engineering features by hand. Finally, we demonstrate the benefits of using the full pipeline provided by Khiops, which leverages Parsimonious Training to considerably increase model interpretability.

We will illustrate this using the "Accidents" dataset. This dataset describes the characteristics of road accidents that occurred in France in 2018. It has three tables with the following schema:

Accidents
|
| -- 1:n -- Vehicles
              |
              |-- 1:n -- Users

Installation and set up¶

If you do not use our official khiops-notebook Jupyter Docker image, you may have to install Khiops locally using conda:

In [3]:

#!conda install -y -c conda-forge -c khiops khiops

For the experiments, you also need some external libraries you can install via pip:

In [4]:

# Installation of external libraries
#!pip install matplotlib seaborn

# Note: Installing PyCaret can sometimes be complex due to its dependencies. 
# If you encounter any issues, please refer to the PyCaret documentation for detailed installation instructions:
# https://pycaret.gitbook.io/docs/get-started/installation
#!pip install pycaret

We now import all the dependencies here:

In [5]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from khiops.sklearn import KhiopsClassifier, KhiopsEncoder

# Import only the PyCaret functions used in this notebook
from pycaret.classification import compare_models, pull, setup

Import and preparation of data¶

For this notebook, we use the French "Accidents" dataset. More details are available on the French Government Open Data site.

This dataset is also available in our khiops-samples repository on GitHub.

It has three tables, Accident, Vehicle, and User, organized in the following relational schema.

Accident
|
| -- 1:n -- Vehicle
|             |
|             |-- 1:n -- User

Each accident is associated with one or more vehicles. Each vehicle involved in an accident is in turn associated with one or more road users (drivers, passengers and pedestrians).

The fields of each table are self-explanatory, and so are their values. The target in the Accident table is the constructed variable Gravity, which is set to Lethal if there was at least one casualty in the accident.

In [6]:

# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_accidents = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/10.2.4/Accidents/Accidents.txt"
accidents_df = pd.read_csv(url_accidents, delimiter='\t')

# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#If the samples have not been downloaded yet:
#from khiops.tools import download_datasets
#download_datasets() 
#
#from os import path
#from khiops import core as kh
#accidents_dataset_path = path.join(kh.get_samples_dir(), "AccidentsSummary")
#accidents_df = pd.read_csv(path.join(accidents_dataset_path, "Accidents.txt"),sep="\t")

# Display the first 10 records from the dataset
accidents_df.head(10)

Out[6]:

	AccidentId	Gravity	Date	Hour	Light	Department	Commune	InAgglomeration	IntersectionType	Weather	CollisionType	PostalAddress
0	201800000001	NonLethal	2018-01-24	15:05:00	Daylight	590	5	No	Y-type	Normal	2Vehicles-BehindVehicles-Frontal	route des Ansereuilles
1	201800000002	NonLethal	2018-02-12	10:15:00	Daylight	590	11	Yes	Square	VeryGood	NoCollision	Place du général de Gaul
2	201800000003	NonLethal	2018-03-04	11:35:00	Daylight	590	477	Yes	T-type	Normal	NoCollision	Rue nationale
3	201800000004	NonLethal	2018-05-05	17:35:00	Daylight	590	52	Yes	NoIntersection	VeryGood	2Vehicles-Side	30 rue Jules Guesde
4	201800000005	NonLethal	2018-06-26	16:05:00	Daylight	590	477	Yes	NoIntersection	Normal	2Vehicles-Side	72 rue Victor Hugo

In [7]:

# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_vehicule = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/Accidents/Vehicles.txt"
vehicles_df = pd.read_csv(url_vehicule, delimiter='\t')

# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#vehicles_df = pd.read_csv(path.join(accidents_dataset_path, "Vehicles.txt"), sep="\t")

# Display the first 10 records from the dataset
vehicles_df.head(10)

Out[7]:

	AccidentId	VehicleId	Direction	Category	FixedObstacle	MobileObstacle	ImpactPoint	Maneuver
0	201800000001	A01	Unknown	Car<=3.5T	NaN	Vehicle	RightFront	TurnToLeft
1	201800000001	B01	Unknown	Car<=3.5T	NaN	Vehicle	LeftFront	NoDirectionChange
2	201800000002	A01	Unknown	Car<=3.5T	NaN	Pedestrian	NaN	NoDirectionChange
3	201800000003	A01	Unknown	Motorbike>125cm3	StationaryVehicle	Vehicle	Front	NoDirectionChange
4	201800000003	B01	Unknown	Car<=3.5T	NaN	Vehicle	LeftSide	TurnToLeft

In [8]:

# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_user = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/Accidents/Users.txt"
users_df = pd.read_csv(url_user, delimiter='\t')

# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#users_df = pd.read_csv(path.join(accidents_dataset_path, "Users.txt"), sep="\t")

# Display the first 10 records from the dataset
users_df.head(10)

Out[8]:

	AccidentId	VehicleId	Seat	Category	Gravity	Gender	TripReason	SafetyDevice	SafetyDeviceUsed	PedestrianLocation	PedestrianAction	PedestrianCompany	BirthYear
0	201800000001	A01	1.0	Driver	Unscathed	Male	Leisure	SeatBelt	Yes	NaN	NaN	Unknown	1960.0
1	201800000001	B01	1.0	Driver	InjuredAndHospitalized	Male	NaN	SeatBelt	Yes	NaN	NaN	Unknown	1928.0
2	201800000002	A01	1.0	Driver	Unscathed	Male	NaN	SeatBelt	Yes	NaN	NaN	Unknown	1947.0
3	201800000002	A01	NaN	Pedestrian	MildlyInjured	Male	NaN	Helmet	NaN	OnLane<=OnSidewalk0mCrossing	Crossing	Alone	1959.0
4	201800000003	A01	1.0	Driver	InjuredAndHospitalized	Male	Leisure	Helmet	Yes	NaN	NaN	Unknown	1987.0
5	201800000003	C01	1.0	Driver	Unscathed	Male	NaN	ChildrenDevice	NaN	NaN	NaN	Unknown	1977.0
6	201800000004	A01	1.0	Driver	Unscathed	Male	Leisure	SeatBelt	Yes	NaN	NaN	Unknown	1982.0
7	201800000004	B01	1.0	Driver	InjuredAndHospitalized	Male	Leisure	Helmet	NaN	NaN	NaN	Unknown	2013.0
8	201800000005	A01	1.0	Driver	MildlyInjured	Male	Leisure	Helmet	Yes	NaN	NaN	Unknown	2001.0
9	201800000005	B01	1.0	Driver	Unscathed	Male	Leisure	SeatBelt	Yes	NaN	NaN	Unknown	1946.0

As a final step, we separate the target from the main table:

In [9]:

# Split the main table (loaded above as accidents_df) into features and a 0/1 target
df_accident_train = accidents_df.drop("Gravity", axis=1)
y_accident_train = accidents_df["Gravity"].map({'NonLethal': 0, 'Lethal': 1})

Auto Feature Engineering Overview¶

Just describe multi-table data¶

When the input data is multi-table, Khiops expects a dictionary. Khiops provides a simple language for describing multi-table data. Let's create this special X input:

In [10]:

X_accidents_train = {
    "main_table": "Accidents",
    "tables": {
        # (DataFrame, key column or list of key columns)
        "Accidents": (df_accident_train, "AccidentId"),
        "Vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
    },
    "relations": [
        ("Accidents", "Vehicles"),
    ],
}

This dictionary has three entries:

main_table indicating the name of the main table,
tables describing all tables,
relations describing the links between tables.

tables is itself a dictionary with one record per table. For each record, the key is the table name and the value is a tuple pairing a pandas DataFrame with the table's key: a single column name, or a list of column names (first the main key, then the secondary keys). relations is a list of tuples, each indicating a link from a parent table to a child table.
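
The same structure extends to deeper schemas. The sketch below is illustrative and not part of the original notebook: it builds a three-table specification on tiny hand-made DataFrames and checks it with `check_spec`, a hypothetical helper written for this example (it is not a Khiops function). It assumes that a child table's key starts with the key of its parent table, and that Users rows are keyed by `AccidentId` and `VehicleId`.

```python
import pandas as pd

# Tiny hand-made tables standing in for the real data (illustrative only)
accidents = pd.DataFrame({"AccidentId": [1, 2], "Light": ["Daylight", "Night"]})
vehicles = pd.DataFrame(
    {"AccidentId": [1, 1, 2], "VehicleId": ["A01", "B01", "A01"]}
)
users = pd.DataFrame(
    {
        "AccidentId": [1, 1, 2],
        "VehicleId": ["A01", "B01", "A01"],
        "Gender": ["Male", "Female", "Male"],
    }
)

spec = {
    "main_table": "Accidents",
    "tables": {
        "Accidents": (accidents, "AccidentId"),
        "Vehicles": (vehicles, ["AccidentId", "VehicleId"]),
        "Users": (users, ["AccidentId", "VehicleId"]),  # assumed key
    },
    "relations": [("Accidents", "Vehicles"), ("Vehicles", "Users")],
}


def as_list(key):
    """Normalize a key given as a single column name or a list of names."""
    return [key] if isinstance(key, str) else list(key)


def check_spec(spec):
    """Hypothetical helper: return the list of problems found in a spec."""
    problems = []
    tables = spec["tables"]
    if spec["main_table"] not in tables:
        problems.append("main_table is not listed in tables")
    for name, (df, key) in tables.items():
        missing = [col for col in as_list(key) if col not in df.columns]
        if missing:
            problems.append(f"{name}: key columns {missing} not found")
    for relation in spec["relations"]:
        parent, child = relation[:2]
        if parent not in tables or child not in tables:
            problems.append(f"relation ({parent}, {child}) uses an unknown table")
            continue
        parent_key = as_list(tables[parent][1])
        child_key = as_list(tables[child][1])
        if child_key[: len(parent_key)] != parent_key:
            problems.append(f"{child}: key does not start with the key of {parent}")
    return problems


print(check_spec(spec))
```

Running such a check before calling fit is a cheap way to catch a misspelled table name or a missing key column.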

Now, let Khiops do the work¶

In this section, we use a KhiopsEncoder to build a flat table containing the aggregates generated by Khiops. In the next section, we'll use this table to train an LGBM classifier. Fitting this encoder follows the standard syntax; the only difference is that X is the special dictionary defined above. To use this encoder:

You can set the maximum number of features to be generated by Khiops with n_features. The default is 100.
You can set the number of trees to zero (n_trees=0). By default, Khiops builds 10 decision trees to enrich the main table with categorical variables (each one identifying the leaf reached in a tree). This is not necessary for this tutorial.
No other parameter is required. However, to display the encoded dataset with intervals and groups of values instead of part identifiers, you can replace part_id with part_label in the two parameters transform_type_numerical and transform_type_categorical.
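
To make the difference between a part identifier and a part label concrete, here is a minimal sketch of how a numerical value can be mapped to an interval. `interval_label` is a hypothetical helper, the bounds are made up, and the 0-based part index is an assumption of this example (the identifiers Khiops itself emits may be numbered differently). The label format mimics the interval notation printed by Khiops, such as `]-inf;54.5]`.

```python
from bisect import bisect_left


def interval_label(value, bounds):
    """Return (part index, label) of the left-open interval containing value.

    bounds must be sorted; the parts are ]-inf;b0], ]b0;b1], ..., ]bn;+inf[.
    """
    i = bisect_left(bounds, value)
    low = "-inf" if i == 0 else f"{bounds[i - 1]:g}"
    if i == len(bounds):
        return i, f"]{low};+inf["
    return i, f"]{low};{bounds[i]:g}]"


# Illustrative bounds, not the discretization actually learned by Khiops
commune_bounds = [54.5, 454.5, 577.5]

for commune in (5, 477, 900):
    print(commune, interval_label(commune, commune_bounds))
```

The index is compact and convenient for a downstream model, while the label is what a human reader wants to see, which is why part_label is used in this tutorial.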

Note that features are created in a supervised way, taking into account the target variable. This algorithm is intrinsically regularized, i.e. it avoids the risk of over-fitting due to the generation of over-complex aggregates.

In [11]:

pke = KhiopsEncoder(transform_type_categorical='part_label', transform_type_numerical='part_label',n_trees=0, n_features=1000)
pke.fit(X_accidents_train, y_accident_train)

Out[11]:

KhiopsEncoder(n_features=1000, transform_type_categorical='part_label',
              transform_type_numerical='part_label')

Training is over. Let's deploy the obtained encoder.

In [12]:

X_transformed = pke.transform(X_accidents_train)
X_transformed = pd.DataFrame(X_transformed, columns = pke.feature_names_out_)
print('\n Encoded features of the first 5 rows: \n')
X_transformed[:5]

 Encoded features of the first 5 rows:

Out[12]:

	LabelPLight	LabelPDepartment	LabelPCommune	LabelPInAgglomeration	LabelPIntersectionType	LabelPWeather	LabelPCollisionType	LabelPPostalAddress	LabelPCount(Vehicles)	LabelPCountDistinct(Vehicles.Category)	...	LabelPSum(Vehicles.PassengerNumber) where Direction not in {Increasing, Decreasing} and PassengerNumber <= 0.5	LabelPSum(Vehicles.PassengerNumber) where ImpactPoint = Front and MobileObstacle = Vehicle	LabelPSum(Vehicles.PassengerNumber) where ImpactPoint <> Front and MobileObstacle = Vehicle	LabelPSum(Vehicles.PassengerNumber) where ImpactPoint = Front and MobileObstacle <> Vehicle	LabelPSum(Vehicles.PassengerNumber) where ImpactPoint <> Front and MobileObstacle <> Vehicle	LabelPSum(Vehicles.PassengerNumber) where Maneuver <> NoDirectionChange and MobileObstacle = Vehicle	LabelPSum(Vehicles.PassengerNumber) where Maneuver = NoDirectionChange and MobileObstacle <> Vehicle	LabelPSum(Vehicles.PassengerNumber) where Maneuver <> NoDirectionChange and MobileObstacle <> Vehicle	LabelPSum(Vehicles.PassengerNumber) where Maneuver = NoDirectionChange and PassengerNumber <= 0.5	LabelPSum(Vehicles.PassengerNumber) where Maneuver <> NoDirectionChange and PassengerNumber <= 0.5
0	{Daylight}	]545;635]	]-inf;54.5]	{No}	{X-type}	{Normal, Overcast, HeavyRain}	{2Vehicles-BehindVehicles-Frontal, }	{A4, A13, AUTOROUTE A1, ...}	]1.5;+inf[	]-inf;1.5]	...	]-inf;+inf[	Missing	]-inf;+inf[	Missing	Missing	]-inf;+inf[	Missing	Missing	]-inf;+inf[	]-inf;+inf[
1	{Daylight}	]545;635]	]-inf;54.5]	{Yes}	{X-type}	{VeryGood, FogOrSmoke}	{Other, NoCollision, 3+Vehicles-Multiple}	{A4, A13, AUTOROUTE A1, ...}	]-inf;1.5]	]-inf;1.5]	...	]-inf;+inf[	Missing	Missing	Missing	]-inf;+inf[	Missing	]-inf;+inf[	Missing	]-inf;+inf[	Missing
2	{Daylight}	]545;635]	]454.5;577.5]	{Yes}	{X-type}	{Normal, Overcast, HeavyRain}	{Other, NoCollision, 3+Vehicles-Multiple}	{A4, A13, AUTOROUTE A1, ...}	]1.5;+inf[	]1.5;+inf[	...	]-inf;+inf[	]-inf;+inf[	]-inf;+inf[	Missing	]-inf;+inf[	]-inf;+inf[	Missing	]-inf;+inf[	]-inf;+inf[	]-inf;+inf[
3	{Daylight}	]545;635]	]-inf;54.5]	{Yes}	{NoIntersection}	{VeryGood, FogOrSmoke}	{2Vehicles-Side, 2Vehicles-Behind, 3+Vehicles-...	{A4, A13, AUTOROUTE A1, ...}	]1.5;+inf[	]1.5;+inf[	...	]-inf;+inf[	Missing	]-inf;+inf[	Missing	]-inf;+inf[	]-inf;+inf[	Missing	]-inf;+inf[	Missing	]-inf;+inf[
4	{Daylight}	]545;635]	]454.5;577.5]	{Yes}	{NoIntersection}	{Normal, Overcast, HeavyRain}	{2Vehicles-Side, 2Vehicles-Behind, 3+Vehicles-...	{A4, A13, AUTOROUTE A1, ...}	]1.5;+inf[	]1.5;+inf[	...	]-inf;+inf[	Missing	]-inf;+inf[	Missing	Missing	]-inf;+inf[	Missing	Missing	Missing	]-inf;+inf[

5 rows × 619 columns

Take a look at the features of the encoded table and note the following:

8 of the original 11 features were selected by Khiops. The remaining features were detected as uninformative: their compression gain (named Level in Khiops output) is not positive, meaning that they carry no detectable information about the target.
611 new aggregates were automatically created by Khiops, which saves a large amount of time for a data scientist who would otherwise define and evaluate aggregates manually.
The aggregates created by Khiops are labelled with their mathematical formula, which makes them easy to interpret. For example, the aggregate "Count(Users) where PedestrianCompany <> Unknown" is simply the number of users for whom PedestrianCompany is known.
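
To show what such a formula amounts to, the sketch below recomputes "Count(Users) where PedestrianCompany <> Unknown" by hand with pandas on a made-up Users table. It is only an illustration of the aggregate's meaning, not the way Khiops computes it. Filling accidents without any matching user with 0 is a choice of this example.

```python
import pandas as pd

# Toy Users table (made-up rows, same column names as the real data)
users = pd.DataFrame(
    {
        "AccidentId": [1, 1, 1, 2, 2, 3],
        "PedestrianCompany": ["Unknown", "Alone", "Group", "Unknown", "Unknown", "Alone"],
    }
)
# Accident 4 has no user at all
accident_ids = pd.Index([1, 2, 3, 4], name="AccidentId")

# Count(Users) where PedestrianCompany <> Unknown
known = users[users["PedestrianCompany"] != "Unknown"]
count_known = known.groupby("AccidentId").size().reindex(accident_ids, fill_value=0)

print(count_known.to_dict())
```

Writing even this single aggregate by hand takes a filter, a group-by and a re-alignment on the main table; Khiops explores hundreds of such formulas automatically.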

In [13]:

kpis = {"Feature" : [], "Level": []}
variables = pke.model_report_.preparation_report.get_variable_names()
for var in variables:
    kpis["Feature"].append(var)
    level = pke.model_report_.preparation_report.get_variable_statistics(var).level
    kpis["Level"].append(level)
    
df_kpis = pd.DataFrame(kpis).sort_values(by = 'Level', ascending=False)
df_kpis.head(10).style.set_properties(subset=['Feature'], **{'width': '400px'})

Out[13]:

	Feature	Level
0	InAgglomeration	0.062196
1	Department	0.050247
2	CollisionType	0.037656
3	CountDistinct(Vehicles.Direction) where FixedObstacle is empty	0.031881
4	Mode(Vehicles.FixedObstacle) where FixedObstacle not empty	0.031605
5	Mode(Vehicles.Maneuver) where Maneuver <> NoDirectionChange	0.030864
6	Mode(Vehicles.FixedObstacle)	0.029517
7	Mode(Vehicles.FixedObstacle) where PassengerNumber <= 0.5	0.029394
8	Mode(Vehicles.FixedObstacle) where MobileObstacle <> Vehicle	0.028928
9	Mode(Vehicles.FixedObstacle) where MobileObstacle is empty	0.028779

Using Khiops Auto Feature Engineering in your pipeline¶

In this section, we use the Khiops encoder within a complete pipeline built with the PyCaret library. We only consider an LGBM classifier.

In [14]:

# the pyCaret setup for the standard models:
setup(pd.concat([df_accident_train, y_accident_train], axis=1), target = 'Gravity', session_id=123, verbose=False)
compare_models(include=["lightgbm"])

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lightgbm	Light Gradient Boosting Machine	0.9354	0.6418	0.0251	0.1158	0.0411	0.0216	0.0289	0.4160

Out[14]:

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [15]:

results_lgbm = pull()
# Assign the result back: calling replace(inplace=True) on a selected column is unreliable in recent pandas versions
results_lgbm['Model'] = results_lgbm['Model'].replace({
    'Light Gradient Boosting Machine': 'LGBM with root table only',
})

In [16]:

# the PyCaret setup on the features encoded by Khiops (PyCaret preprocessing disabled):
encoded_train = pd.concat(
    [X_transformed.reset_index(drop=True), y_accident_train.reset_index(drop=True)],
    axis=1,
)
setup(encoded_train, target='Gravity', session_id=123, verbose=False, preprocess=False)
compare_models(include=["lightgbm"])

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lightgbm	Light Gradient Boosting Machine	0.9451	0.8295	0.0489	0.5249	0.0891	0.0805	0.1470	0.7760

Out[16]:

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [17]:

results_lgbm = pd.concat([results_lgbm, pull()], ignore_index=True)
# Only the newly added row still carries the default model name
results_lgbm['Model'] = results_lgbm['Model'].replace({
    'Light Gradient Boosting Machine': 'LGBM with Khiops AutoFeature Engineering',
})
results_lgbm.sort_values(by="Accuracy", ascending=False)

Out[17]:

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
1	LGBM with Khiops AutoFeature Engineering	0.9451	0.8295	0.0489	0.5249	0.0891	0.0805	0.1470	0.776
0	LGBM with root table only	0.9354	0.6418	0.0251	0.1158	0.0411	0.0216	0.0289	0.416

In [18]:

# Reshape to long format: one row per (model, metric) pair
df_plot = results_lgbm.drop("TT (Sec)", axis=1).melt(id_vars=['Model'], var_name='Metric', value_name='Value')

plt.figure(figsize=(14, 6))

# Create a bar plot with Seaborn
sns.barplot(x='Metric', y='Value', hue='Model', data=df_plot, palette="Set3")

plt.title("LGBM performance improvement with Auto Feature Engineering")
plt.ylabel('Value')
plt.xlabel('Metric')

plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

[Figure: bar plot of the evaluation metrics of the two LGBM models, with and without Khiops Auto Feature Engineering]

Finally, comparing the LGBM classifier with and without the aggregates generated by Khiops (AUC of 0.83 versus 0.64, precision of 0.52 versus 0.12) shows how much useful information the Auto Feature Engineering algorithm extracts from the secondary tables. This work would have been painstaking to do by hand, requiring numerous interactions between the data scientist and the business experts, and a great deal of trial and error.
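
The comparison can be quantified directly from the two result tables. The short sketch below simply copies the reported metrics into dictionaries and prints the difference for each one; no model is retrained.

```python
# Metrics copied from the two PyCaret result tables above
root_only = {"Accuracy": 0.9354, "AUC": 0.6418, "Recall": 0.0251, "Prec.": 0.1158, "F1": 0.0411}
with_khiops = {"Accuracy": 0.9451, "AUC": 0.8295, "Recall": 0.0489, "Prec.": 0.5249, "F1": 0.0891}

# Absolute gain brought by the Khiops aggregates, metric by metric
gains = {metric: round(with_khiops[metric] - root_only[metric], 4) for metric in root_only}

for metric, gain in gains.items():
    print(f"{metric:<8} {root_only[metric]:.4f} -> {with_khiops[metric]:.4f} ({gain:+.4f})")
```

Accuracy barely moves, while AUC and precision change the most, which is why the bar plot is easier to read metric by metric than as a single score.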

Boosting model interpretability with Khiops' end-to-end pipeline¶

In this section, we use the complete pipeline provided by Khiops, which chains Auto Feature Engineering and Parsimonious Training behind a simple syntax (a single call to fit). Using this pipeline has several advantages: it considerably improves model interpretability, it is very easy to use, and it scales very well.

In [19]:

# we use a KhiopsClassifier to leverage the full pipeline
pkc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)

# Just fit it ! 
pkc_accidents.fit(X_accidents_train, y_accident_train)

Out[19]:

KhiopsClassifier(auto_sort=True, internal_sort=None, key=None, n_features=1000,
                 n_pairs=0, n_trees=0, output_dir=None, verbose=False)

In [20]:

# Let's take a look at the classifier's performance on the training set
train_eval = pkc_accidents.model_report_.train_evaluation_report.get_snb_performance()

In [21]:

print(f"Accident train accuracy: {train_eval.accuracy}")
print(f"Accident train AUC     : {train_eval.auc}")

classes = train_eval.confusion_matrix.values
confusion_matrix = pd.DataFrame(
    train_eval.confusion_matrix.matrix,
    columns=classes,
    index=classes,
)
print("Accident train confusion matrix:")
confusion_matrix

Accident train accuracy: 0.945036
Accident train AUC     : 0.818541
Accident train confusion matrix:

Out[21]:

	0	1
0	54573	3152
1	24	34
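
As a sanity check, the reported accuracy can be recomputed from the confusion matrix alone: it is the sum of the diagonal divided by the total count, whichever axis holds the predictions. The precision and recall lines below additionally assume that rows are predicted classes and columns are actual classes; that orientation is an assumption of this sketch, so check it against the Khiops documentation before relying on those two figures.

```python
# Confusion matrix printed above, as a list of rows
matrix = [
    [54573, 3152],
    [24, 34],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total
print(f"accuracy = {accuracy:.6f}")

# ASSUMPTION: rows are predicted classes, columns are actual classes
predicted_lethal = sum(matrix[1])
actual_lethal = matrix[0][1] + matrix[1][1]
precision = matrix[1][1] / predicted_lethal
recall = matrix[1][1] / actual_lethal
print(f"precision = {precision:.4f}, recall = {recall:.4f}")
```

The recomputed accuracy agrees with the value of 0.945036 printed by the report.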

Now let's take a closer look at the trained model by examining the variables (native or aggregates generated by the Auto Feature Engineering step) selected by Parsimonious Training. The next cell shows the names of the selected variables, sorted by decreasing weight, together with their level:

the level measures the extent to which the variable is correlated with the target, reflecting the importance of the variable on its own (more details here).
the weight measures the importance of the variable within the learned classifier, reflecting the information provided by this variable orthogonally to the others (more details here).

In [22]:

kpis = {"Feature" : [], "Level": [], "Weight": []}

for var in pkc_accidents.model_report_.modeling_report.get_snb_predictor().selected_variables:
    kpis["Feature"].append(var.name)
    kpis["Level"].append(var.level)
    kpis["Weight"].append(var.weight)

df_kpis = pd.DataFrame(kpis).sort_values(by = 'Weight', ascending=False)
df_kpis.head(10).style.set_properties(subset=['Feature'], **{'width': '400px'})

Out[22]:

	Feature	Level	Weight
27	Count(Vehicles) where Category = Car<=3.5T	0.000940	0.792969
10	CountDistinct(Vehicles.Direction)	0.003901	0.754883
11	Mode(Vehicles.Category) where MobileObstacle not in {Vehicle, }	0.004293	0.655273
20	Mode(Vehicles.ImpactPoint) where Category <> Car<=3.5T	0.002214	0.541992
0	InAgglomeration	0.062196	0.539062
24	CountDistinct(Vehicles.Category) where ImpactPoint = Front	0.001865	0.534180
1	Department	0.050247	0.504395
13	PostalAddress	0.005056	0.487305
28	Weather	0.001443	0.428711
3	Light	0.025051	0.319336
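
Level and weight rank the variables differently, which the table above makes visible. The sketch below copies five of its rows and sorts them both ways; it only re-reads the reported numbers and does not query the model.

```python
# (feature, level, weight) rows copied from the table above
rows = [
    ("Count(Vehicles) where Category = Car<=3.5T", 0.000940, 0.792969),
    ("CountDistinct(Vehicles.Direction)", 0.003901, 0.754883),
    ("InAgglomeration", 0.062196, 0.539062),
    ("Department", 0.050247, 0.504395),
    ("Light", 0.025051, 0.319336),
]

by_weight = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]
by_level = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]

print("Top by weight:", by_weight[0])
print("Top by level :", by_level[0])
```

Among these rows, the variable with the highest weight has the lowest level: taken alone it says little about the target, but it contributes information that the other selected variables do not provide.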

As the feature table above shows, the aggregates generated by Khiops are identified by their calculation formula, which greatly facilitates their interpretation. For example, here's what the first three aggregates mean:

"Count(Vehicles) where Category = Car<=3.5T" is the number of light cars involved in the accident;
"CountDistinct(Vehicles.Direction)" is the number of different directions taken by the vehicles involved;
"Mode(Vehicles.Category) where MobileObstacle not in {Vehicle, }" is the most frequent category among the vehicles that struck a mobile obstacle other than a vehicle.

In practice, the models trained by Khiops are easy to understand, which makes it easier to discuss them with technical experts. To make the link with the figure in the introduction to this notebook, let's now look at how the size of the data representation evolves throughout the pipeline:

In [23]:

# Number of original, generated and selected features
n_original_features = len(df_accident_train.columns)  # columns of the main table, key included
n_selected_features = len(df_kpis)

print("Nb original features  = " + str(n_original_features))
print("Nb generated features = 1000")
print("Nb selected features  = " + str(n_selected_features))

Nb original features  = 11
Nb generated features = 1000
Nb selected features  = 47

This training pipeline starts with the 11 columns of the main table (the AccidentId key included). The Auto Feature Engineering step generates up to 1000 aggregates (a maximum set by the user). Then, the Parsimonious Training step selects 47 features out of all those available, looking for the smallest subset of variables that are collectively the most informative and as independent of each other as possible. In the end, the resulting model is both accurate and easy to interpret. For more in-depth model interpretation, please consult the page describing the visualization tool supplied with Khiops.