Interpretable Variable Encoding¶
Introduction¶
Variable encoding is an essential step in feeding learning algorithms with raw data, particularly for categorical variables, which usually must be represented numerically. Making this encoding useful for training, safe from over-fitting, and interpretable improves both the quality of models and their relevance in an industrial context. Having explored Khiops' aptitude for raw data, we now delve deeper into its unique approach to variable encoding, which ensures that the transformed data retains its interpretability, a trait often sacrificed in the race for higher performance.
Unlike many traditional encoding techniques that demand tricky setup and hyperparameter fine-tuning, Khiops delivers an optimal encoding effortlessly thanks to its supervised approach, MODL. The encoding directly aligns with the predictive objective, enhancing accuracy and relevance. Moreover, Khiops' encoding is systematically refined as your data grows, gaining granularity and precision.
In this notebook, we explore Khiops' encoding capabilities, drawing contrasts with conventional methods and highlighting its interpretability and target-driven optimization. We demonstrate that this optimal encoding slots seamlessly into any machine learning pipeline (here, a PyCaret pipeline), removing the need for predetermined encoding strategies and extensive hyperparameter optimization.
Installation and set up¶
If you do not use our official khiops-notebook Jupyter Docker image, you may have to install Khiops locally using conda:
#!conda install -y -c conda-forge -c khiops khiops
For the experiments, you also need some external libraries, which you can install via pip:
# Installation of external libraries
#!pip install matplotlib seaborn lightgbm
# Note: Installing PyCaret can sometimes be complex due to its dependencies.
# If you encounter any issues, please refer to the PyCaret documentation for detailed installation instructions:
# https://pycaret.gitbook.io/docs/get-started/installation
#!pip install pycaret
We now import all the dependencies here:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from random import randint
import seaborn as sns
from sklearn.model_selection import train_test_split
from khiops.sklearn.estimators import KhiopsEncoder
from khiops.sklearn import KhiopsClassifier
from pycaret.classification import *
Import and Preparation of Data¶
For this notebook, we use the standard "Adult" dataset from the UCI repository. More details are available here.
This dataset is also available in our khiops-samples repository on GitHub.
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/Adult/Adult.txt"
df = pd.read_csv(url, delimiter='\t',index_col="Label")
# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#If the samples have not been downloaded yet:
#from khiops.tools import download_datasets
#download_datasets()
#
#from os import path
#from khiops import core as kh
#adult_path = path.join(kh.get_samples_dir(), "Adult", "Adult.txt")
#df = pd.read_csv(adult_path, sep="\t",index_col="Label")
# Display the first 10 records from the dataset
df.head(10)
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Label | |||||||||||||||
1 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | less |
2 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | less |
3 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | less |
4 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | less |
5 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | less |
6 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | less |
7 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | less |
8 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | more |
9 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | more |
10 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | more |
# Drop the "class" column to create the feature set (X).
X = df.drop("class", axis=1)
# Extract the "class" column to create the target labels (y).
y = df["class"].map({'less': 0, 'more': 1})
Overview of Khiops Encoding¶
In this section, we'll explore how Khiops encodes the Adult dataset.
In our previous notebook, we highlighted Khiops' strengths: its robustness, ability to handle raw data, and the absence of any need for parameter tuning. Due to these characteristics, cross-validation is unlikely to change the model or the variable encoding. Consequently, for this experiment, we'll adopt a straightforward train-test split.
# Splitting the dataset into training and test sets. 10% of the data is reserved for testing, and we set a random seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Initialize a KhiopsEncoder with specified transformation types for both categorical and numerical data.
# Here, "part_label" means that the variables will be encoded using labels derived from the Khiops discretization.
encoder = KhiopsEncoder(transform_type_categorical="part_label", transform_type_numerical="part_label")
# Fit the encoder to the training data and simultaneously transform it.
# The encoder learns the optimal encoding based on the provided features and target.
X_encoded = encoder.fit_transform(X_train, y_train)
# Convert the transformed (encoded) training data into a DataFrame.
# The column names for this DataFrame are fetched from the encoder's 'feature_names_out_' attribute.
X_transformed = pd.DataFrame(X_encoded, columns = encoder.feature_names_out_)
# Display the first 5 rows of the transformed training data to get a quick overview of the encoding.
X_transformed.head(5)
LabelPage | LabelPworkclass | LabelPfnlwgt | LabelPeducation | LabelPeducation_num | LabelPmarital_status | LabelPoccupation | LabelPrelationship | LabelPrace | LabelPsex | LabelPcapital_gain | LabelPcapital_loss | LabelPhours_per_week | LabelPnative_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ]21.5;23.5] | {Private, Without-pay, Never-worked} | ]-inf;80534] | {Bachelors} | ]12.5;13.5] | {Never-married, Separated} | {Other-service, Priv-house-serv} | {Own-child} | {White, Asian-Pac-Islander} | {Male} | ]-inf;57] | ]-inf;70] | ]39.5;40.5] | {United-States, Germany, Cuba, ...} |
1 | ]42.5;55.5] | {Private, Without-pay, Never-worked} | ]80534;201108.5] | {11th, 10th, 7th-8th, ...} | ]-inf;8.5] | {Married-civ-spouse, Married-AF-spouse} | {Prof-specialty, Armed-Forces} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Female} | ]-inf;57] | ]-inf;70] | ]43.5;49.5] | {United-States, Germany, Cuba, ...} |
2 | ]42.5;55.5] | {Private, Without-pay, Never-worked} | ]80534;201108.5] | {HS-grad} | ]8.5;9.5] | {Married-civ-spouse, Married-AF-spouse} | {Exec-managerial} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Female} | ]-inf;57] | ]-inf;70] | ]-inf;34.5] | {United-States, Germany, Cuba, ...} |
3 | ]23.5;25.5] | {Private, Without-pay, Never-worked} | ]80534;201108.5] | {Assoc-voc, Assoc-acdm} | ]10.5;12.5] | {Married-civ-spouse, Married-AF-spouse} | {Craft-repair, Transport-moving} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Male} | ]-inf;57] | ]-inf;70] | ]43.5;49.5] | {United-States, Germany, Cuba, ...} |
4 | ]23.5;25.5] | {Private, Without-pay, Never-worked} | ]80534;201108.5] | {HS-grad} | ]8.5;9.5] | {Married-civ-spouse, Married-AF-spouse} | {Other-service, Priv-house-serv} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Male} | ]-inf;57] | ]-inf;70] | ]39.5;40.5] | {Mexico, Puerto-Rico, El-Salvador, ...} |
Let's compare the number of unique values between the original and encoded datasets. This will give us insights into Khiops' encoding strategy and its granularity, highlighting how it captures patterns relative to the target variable.
# Retrieve the number of unique values for each column
unique_vals_orig = X_train.nunique()
unique_vals_encoded = X_transformed.nunique()
# Create a comparison dataframe with the same index
comparison = pd.DataFrame(index=unique_vals_orig.index)
comparison['Original'] = unique_vals_orig
comparison['Encoded'] = unique_vals_encoded.values # Use ".values" to ignore the index and align the values by order
# Display the result to see the reduction of distinct values for each feature
comparison
Original | Encoded | |
---|---|---|
age | 74 | 12 |
workclass | 8 | 4 |
fnlwgt | 26655 | 3 |
education | 16 | 7 |
education_num | 16 | 7 |
marital_status | 7 | 3 |
occupation | 14 | 7 |
relationship | 6 | 5 |
race | 5 | 2 |
sex | 2 | 2 |
capital_gain | 121 | 23 |
capital_loss | 99 | 17 |
hours_per_week | 96 | 9 |
native_country | 41 | 3 |
We can make several remarks here:
- For many variables, Khiops has reduced the granularity. Look at age, which originally had 74 unique values but has been discretized into 12 categories. Similarly, fnlwgt had 26,655 unique values, but post-encoding, it's represented with just 3 categories.
- For variables with fewer unique values like sex, the encoding hasn't changed the cardinality.
Essentially, this encoded representation captures useful patterns in the data, since each variable's encoding is optimized independently for predicting the target.
Encoding of numerical variables¶
The discretization in Khiops is rooted in the MODL approach, which partitions numerical variables into meaningful intervals. At its core, MODL seeks a balance between two competing objectives:
- The prior encourages the selection of a simple discretization model, with few intervals (the most probable model has a single interval);
- The likelihood favors complex discretization models, i.e. models composed of many intervals (the most likely model has as many intervals as there are changes in the sequence of class labels).
Thus, optimizing the prior alone leads to under-fitting, while optimizing the likelihood alone leads to over-fitting. Both terms are derived from Bayes' formula and are mathematically consistent. Khiops therefore optimizes between these two extremes, achieving a balance between under-fitting and over-fitting and ensuring that discretization models are relevant, robust (not too many intervals), and interpretable, capturing the underlying patterns in the data related to the target variable.
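For reference, here is a common statement of the MODL discretization criterion, following Boullé's formulation (a sketch with our own notation; the exact criterion implemented in the current Khiops version may differ in details). For $n$ instances, $J$ target classes, and a discretization into $I$ intervals, with $n_i$ instances in interval $i$, of which $n_{ij}$ belong to class $j$, the cost to be minimized is:

$$ \underbrace{\log n + \log \binom{n+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{n_i+J-1}{J-1}}_{\text{prior}} \;+\; \underbrace{\sum_{i=1}^{I} \log \frac{n_i!}{n_{i1}!\, n_{i2}! \cdots n_{iJ}!}}_{\text{likelihood}} $$

The prior terms grow with the number of intervals $I$, while the likelihood term shrinks as intervals become purer: exactly the trade-off described above.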
With this understanding, let's see how the variable `age` is transformed by this method.
# Reset indices to ensure alignment
X_reset = X_train.reset_index(drop=True)
X_transformed_reset = X_transformed.reset_index(drop=True)
# Combine original 'age' values and their transformed values into a single dataframe
combined_age_df = pd.DataFrame({
'Original_age': X_reset['age'],
'Encoded_age': X_transformed_reset['LabelPage']
})
# Display the first few rows to see the transformation
combined_age_df.head(10)
Original_age | Encoded_age | |
---|---|---|
0 | 23 | ]21.5;23.5] |
1 | 46 | ]42.5;55.5] |
2 | 46 | ]42.5;55.5] |
3 | 24 | ]23.5;25.5] |
4 | 24 | ]23.5;25.5] |
5 | 43 | ]42.5;55.5] |
6 | 33 | ]30.5;33.5] |
7 | 24 | ]23.5;25.5] |
8 | 37 | ]35.5;42.5] |
9 | 53 | ]42.5;55.5] |
From the original 74 unique values of the "age" feature, we've reduced it to 12 encoded intervals. These interval bounds are optimized by Khiops in relation to the target variable.
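To make the encoding concrete, the "part_label" transform of a numerical variable is simple interval membership. The following sketch mimics it with pandas.cut, using only the handful of bounds visible in the output above (an illustrative subset, not the full set learned by Khiops):
# Illustration only: reproduce the interval labels with pandas.cut.
# 'example_bounds' is a subset of the learned bounds shown above.
example_bounds = [-np.inf, 21.5, 23.5, 25.5, 30.5, 33.5, 35.5, 42.5, 55.5, np.inf]
example_ages = pd.Series([23, 46, 24, 37, 53])
print(pd.cut(example_ages, bins=example_bounds))  # right-closed ]a;b] intervals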
Now, let's first examine the distribution of the original "age" values. Following that, we'll delve into Khiops' discretization intervals.
# Displaying the original age distribution
plt.figure(figsize=(12, 7))
X_train['age'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Original Age Values')
plt.xlabel('Age')
plt.ylabel('Number of Individuals')
plt.show()
The above discretization is unsupervised, i.e. the target is not used in its construction. This histogram approximates the distribution of the variable `age` without taking the learning task into account.
Let's now take a look at the encoding provided by Khiops, which corresponds to another type of discretization, this time supervised. A discretization model trained by Khiops is equivalent to a univariate classifier whose aim is to estimate the class probability as a function of a single input variable (or more precisely, the interval in which the numerical value falls). The aim is therefore to identify the interval bounds that optimize probabilistic inference of the class.
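Formally, denoting by $i(x)$ the interval containing the value $x$ and reusing the notation of the MODL criterion above, the induced univariate classifier estimates (up to regularization details):

$$ P(Y = j \mid X = x) \approx \frac{n_{i(x)j}}{n_{i(x)}} $$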
Now, let's visualize the class distribution across these intervals.
# Merging the encoded ages, original ages, and target classes
combined_age_class_df = pd.DataFrame({
'Original_age': X_train['age'].reset_index(drop=True),
'Encoded_age': X_transformed['LabelPage'].reset_index(drop=True),
'Class': y_train.reset_index(drop=True)
})
# Grouping by encoded age and by class, then counting the occurrences
grouped = combined_age_class_df.groupby(['Encoded_age', 'Class']).size().unstack()
# Sorting the bins
sorted_bins = combined_age_df.sort_values("Original_age").Encoded_age.drop_duplicates().tolist()
grouped = grouped.reindex(sorted_bins)
# Displaying the stacked bar chart
grouped.plot(kind='bar', stacked=True, figsize=(12, 7))
plt.title('Distribution of Instances and Class Proportions within Encoded Age Intervals')
plt.ylabel('Number of Observations')
plt.xlabel('Encoded Age Interval')
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
As shown in the figure above, the Khiops discretization doesn't try to account for the original age distribution but focuses on the distribution of classes within each interval (the intervals have neither equal width nor equal frequency). The intervals are constructed precisely for this purpose: the class proportions in each interval differ from those of its neighbors. The probability of belonging to the richest group evolves smoothly with age: it first increases, peaks around age 50, and then decreases.
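We can check this trend numerically by computing the proportion of the 'more' class in each interval, in ascending age order:
# Proportion of class 1 ('more') within each encoded age interval
prop_more = combined_age_class_df.groupby('Encoded_age')['Class'].mean()
print(prop_more.reindex(sorted_bins))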
Khiops effectively captures the most relevant nuances in the data to estimate the class distribution, which conventional discretization methods cannot do.
Encoding of categorical variables¶
As with numerical variables, Khiops applies the MODL approach to transform categorical variables into groups of modalities, an encoding that is informative for predicting the target variable. During model selection, a balance is achieved between:
- The Prior term, which encourages the selection of a simple model, with few groups of modalities (the most probable being composed of a single group);
- The Likelihood term, which favors complex models that accurately describe the training data (possibly one group per modality).
As before, optimizing the prior alone leads to under-fitting, while optimizing the likelihood alone leads to over-fitting. It is the balance between these two terms, both derived from Bayes' formula and mathematically consistent, that allows under-fitting and over-fitting to be avoided. Ultimately, the selected models accurately and robustly describe the patterns underlying the training data with respect to the target variable.
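As a quick intuition for the grouping (a sketch of the idea, not how Khiops actually computes it): modalities whose empirical target rates are close are natural candidates to end up in the same group.
# Per-modality rate of the 'more' class for workclass, with group sizes.
# Modalities with similar rates tend to be merged into one MODL group.
workclass_rates = (
    pd.DataFrame({'workclass': X_train['workclass'].reset_index(drop=True),
                  'Class': y_train.reset_index(drop=True)})
    .groupby('workclass')['Class']
    .agg(rate='mean', count='size')
)
print(workclass_rates.sort_values('rate'))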
MODL's grouping approach having been explained in detail on the dedicated page, we now turn to the encoding of the categorical variable `workclass`.
# Combine original 'workclass' values and their transformed values into a single dataframe
combined_workclass_df = pd.DataFrame({
'Original_workclass': X_reset['workclass'],
'Encoded_workclass': X_transformed_reset['LabelPworkclass']
})
# Display the first few rows to see the transformation
combined_workclass_df.drop_duplicates().head(10)
Original_workclass | Encoded_workclass | |
---|---|---|
0 | Private | {Private, Without-pay, Never-worked} |
5 | Self-emp-inc | {Self-emp-inc} |
9 | Self-emp-not-inc | {Self-emp-not-inc, Local-gov, State-gov} |
10 | State-gov | {Self-emp-not-inc, Local-gov, State-gov} |
19 | Local-gov | {Self-emp-not-inc, Local-gov, State-gov} |
36 | Federal-gov | {Federal-gov} |
978 | Without-pay | {Private, Without-pay, Never-worked} |
4494 | Never-worked | {Private, Without-pay, Never-worked} |
# Merging the encoded workclass, original workclass, and target classes
combined_workclass_class_df = pd.DataFrame({
'Original_workclass': X_train['workclass'].reset_index(drop=True),
'Encoded_workclass': X_transformed['LabelPworkclass'].reset_index(drop=True),
'Class': y_train.reset_index(drop=True)
})
# Grouping by encoded workclass and by class, then counting the occurrences
grouped = combined_workclass_class_df.groupby(['Encoded_workclass', 'Class']).size().unstack()
# Sorting the categories might be unnecessary as categorical values don't have a natural ordering like numerical ones. However, if you'd like to order them by frequency or another metric, you'd do it here.
# Displaying the stacked bar chart
grouped.plot(kind='bar', stacked=True, figsize=(12, 7))
plt.title('Distribution of Instances and Class Proportions within Encoded Workclass Categories')
plt.ylabel('Number of Observations')
plt.xlabel('Encoded Workclass Category')
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
Just as with numerical variable encoding, Khiops groups categorical values so as to differentiate class probabilities between groups. This encoding aims to estimate the conditional probabilities of the classes, as a univariate classifier would.
How Khiops Encoder handles missing values and outliers¶
This section revisits in more detail a point from the first notebook: training data does not need to be cleaned or prepared before being given to Khiops. We'll inject missing values, noise, and outliers into our dataset to understand how the Khiops encoder manages these common challenges. The objective is to showcase the robustness and adaptability of the Khiops encoder in the face of imperfect data. In the next section, we'll leverage this robust encoding to train a LightGBM classifier, demonstrating the potential benefits of using the Khiops encoding in a broader context.
X_train_noisy = X_train.copy()
# Percentage of missing values depending on the class
missing_rate_class_0 = 0.2 # 20% missing when class is 0
missing_rate_class_1 = 0.05 # 5% missing when class is 1
# Identify indices where class is 0
indices_class_0 = X_train_noisy.index[y.loc[X_train_noisy.index] == 0]
# Identify indices where class is 1
indices_class_1 = X_train_noisy.index[y.loc[X_train_noisy.index] == 1]
# Number of missing values for each class
N_0 = int(len(indices_class_0) * missing_rate_class_0)
N_1 = int(len(indices_class_1) * missing_rate_class_1)
# Iterate over feature columns
for col in X_train_noisy.columns:
    # The column index (used to vary the seed)
    col_index = X_train_noisy.columns.get_loc(col)
    # Varying the seed by column index to distribute NaNs across different rows
    np.random.seed(42 + col_index)
    random_indices_0 = np.random.choice(indices_class_0, N_0, replace=False)
    np.random.seed(42 + col_index)
    random_indices_1 = np.random.choice(indices_class_1, N_1, replace=False)
    # Introduce missing values
    X_train_noisy.loc[random_indices_0, col] = np.nan
    X_train_noisy.loc[random_indices_1, col] = np.nan
# Introduce noise to 'age'
np.random.seed(42)
noise_age = np.random.normal(0, 3, len(X_train_noisy))
X_train_noisy['age'] = X_train_noisy['age'] + np.round(noise_age).astype(int)
# Introduce noise to 'hours_per_week'
np.random.seed(43)
noise_hours_per_week = np.random.normal(0, 3, len(X_train_noisy))
X_train_noisy['hours_per_week'] = X_train_noisy['hours_per_week'] + np.round(noise_hours_per_week).astype(int)
# Introduce outliers to 'age'
np.random.seed(42)
outlier_indices_age = np.random.choice(X_train_noisy.index, int(0.1 * len(X_train_noisy)), replace=False)
X_train_noisy.loc[outlier_indices_age, 'age'] *= np.random.randint(20, 40, size=len(outlier_indices_age))
# Introduce outliers to 'hours_per_week'
np.random.seed(43)
outlier_indices_hours_per_week = np.random.choice(X_train_noisy.index, int(0.1 * len(X_train_noisy)), replace=False)
X_train_noisy.loc[outlier_indices_hours_per_week, 'hours_per_week'] *= np.random.randint(20, 40, size=len(outlier_indices_hours_per_week))
# Introduce outliers to 'capital_gain'
np.random.seed(44)
outlier_indices_capital_gain = np.random.choice(X_train_noisy.index, int(0.1 * len(X_train_noisy)), replace=False)
X_train_noisy.loc[outlier_indices_capital_gain, 'capital_gain'] *= np.random.randint(20, 40, size=len(outlier_indices_capital_gain))
X_train_noisy.head(10)
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Label | ||||||||||||||
21729 | 24.0 | NaN | 38238.0 | Bachelors | 13.0 | Never-married | Other-service | NaN | White | Male | 0.0 | 0.0 | 41.0 | United-States |
11809 | 46.0 | NaN | 116952.0 | 7th-8th | 4.0 | Married-civ-spouse | NaN | Wife | White | Female | 0.0 | 0.0 | 1302.0 | United-States |
23968 | 48.0 | Private | 191389.0 | HS-grad | 9.0 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0.0 | 0.0 | 27.0 | United-States |
13836 | 29.0 | NaN | 190290.0 | NaN | 11.0 | Married-civ-spouse | NaN | NaN | White | NaN | 0.0 | 0.0 | 43.0 | United-States |
40272 | 759.0 | Private | 200089.0 | NaN | NaN | Married-civ-spouse | Other-service | Husband | White | Male | 0.0 | 0.0 | NaN | Guatemala |
3190 | 42.0 | Self-emp-inc | NaN | Some-college | 10.0 | NaN | Machine-op-inspct | Husband | White | Male | 0.0 | 0.0 | 49.0 | United-States |
6189 | 38.0 | Private | 195488.0 | NaN | 9.0 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0.0 | 0.0 | 53.0 | United-States |
38671 | NaN | NaN | 166297.0 | Bachelors | NaN | Never-married | NaN | Own-child | Asian-Pac-Islander | Male | 0.0 | 0.0 | NaN | United-States |
32547 | NaN | Private | NaN | Assoc-acdm | 12.0 | NaN | NaN | NaN | White | Female | 0.0 | 0.0 | 44.0 | NaN |
35160 | NaN | Self-emp-not-inc | NaN | 7th-8th | 4.0 | Married-civ-spouse | Machine-op-inspct | Husband | White | NaN | 0.0 | NaN | 39.0 | NaN |
# Initialize a KhiopsEncoder with specified transformation types for both categorical and numerical data.
# Here, "part_label" means that the variables will be encoded using labels derived from the Khiops discretization.
encoder_noisy = KhiopsEncoder(transform_type_categorical="part_label", transform_type_numerical="part_label")
# Fit the encoder to the training data and simultaneously transform it.
# The encoder learns the optimal encoding based on the provided features and target.
X_encoded_noisy = encoder_noisy.fit_transform(X_train_noisy, y_train)
# Convert the transformed (encoded) training data into a DataFrame.
# The column names for this DataFrame are fetched from the encoder's 'feature_names_out_' attribute.
X_transformed_noisy = pd.DataFrame(X_encoded_noisy, columns=encoder_noisy.feature_names_out_)
# Display the first 5 rows of the transformed training data to get a quick overview of the encoding.
X_transformed_noisy.head(5)
LabelPage | LabelPworkclass | LabelPfnlwgt | LabelPeducation | LabelPeducation_num | LabelPmarital_status | LabelPoccupation | LabelPrelationship | LabelPrace | LabelPsex | LabelPcapital_gain | LabelPcapital_loss | LabelPhours_per_week | LabelPnative_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ]22.5;24.5] | {, Never-worked} | ]-inf;68270.5] | {Bachelors} | ]12.5;13.5] | {Never-married} | {Other-service, Priv-house-serv} | {, Unmarried} | {White, Asian-Pac-Islander} | {Male} | ]-inf;57] | ]-inf;100] | ]34.5;42.5] | {United-States, Philippines, Germany, ...} |
1 | ]42.5;56.5] | {, Never-worked} | ]68270.5;187920] | {} | ]-inf;8.5] | {Married-civ-spouse, Married-AF-spouse} | {, Handlers-cleaners} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Female} | ]-inf;57] | ]-inf;100] | ]815;1552.5] | {United-States, Philippines, Germany, ...} |
2 | ]42.5;56.5] | {Private, Without-pay} | ]187920;+inf[ | {HS-grad} | ]8.5;9.5] | {Married-civ-spouse, Married-AF-spouse} | {Exec-managerial} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Female} | ]-inf;57] | ]-inf;100] | ]-inf;29.5] | {United-States, Philippines, Germany, ...} |
3 | ]28.5;32.5] | {, Never-worked} | ]187920;+inf[ | {} | ]10.5;12.5] | {Married-civ-spouse, Married-AF-spouse} | {, Handlers-cleaners} | {, Unmarried} | {White, Asian-Pac-Islander} | {} | ]-inf;57] | ]-inf;100] | ]42.5;45.5] | {United-States, Philippines, Germany, ...} |
4 | ]72.5;788] | {Private, Without-pay} | ]187920;+inf[ | {} | ]-inf;8.5] | {Married-civ-spouse, Married-AF-spouse} | {Other-service, Priv-house-serv} | {Husband, Wife} | {White, Asian-Pac-Islander} | {Male} | ]-inf;57] | ]-inf;100] | ]-inf;29.5] | {, Mexico, Puerto-Rico, ...} |
# Reset indices to ensure alignment
X_reset_noisy = X_train_noisy.reset_index(drop=True)
X_transformed_noisy = X_transformed_noisy.reset_index(drop=True)
# Combine original 'age' values and their transformed values into a single dataframe
combined_df_noisy = pd.DataFrame({
'Original_age': X_reset_noisy['age'],
'Encoded_age': X_transformed_noisy['LabelPage'],
'Original_workclass': X_reset_noisy['workclass'],
'Encoded_workclass': X_transformed_noisy['LabelPworkclass']
})
# Display the first few rows to see the transformation
combined_df_noisy.head(10)
Original_age | Encoded_age | Original_workclass | Encoded_workclass | |
---|---|---|---|---|
0 | 24.0 | ]22.5;24.5] | NaN | {, Never-worked} |
1 | 46.0 | ]42.5;56.5] | NaN | {, Never-worked} |
2 | 48.0 | ]42.5;56.5] | Private | {Private, Without-pay} |
3 | 29.0 | ]28.5;32.5] | NaN | {, Never-worked} |
4 | 759.0 | ]72.5;788] | Private | {Private, Without-pay} |
5 | 42.0 | ]37.5;42.5] | Self-emp-inc | {Self-emp-inc} |
6 | 38.0 | ]37.5;42.5] | Private | {Private, Without-pay} |
7 | NaN | Missing | NaN | {, Never-worked} |
8 | NaN | Missing | Private | {Private, Without-pay} |
9 | NaN | Missing | Self-emp-not-inc | {Self-emp-not-inc, Local-gov, State-gov} |
Depending on the type of variable, the Khiops encoder has a different way of dealing with outliers and missing values.
For numerical variables:
- Outliers are simply considered as the largest (or smallest) values, since Khiops is based on rank statistics.
- Missing values are internally coded as an extreme rank (i.e. equivalent to minus infinity) and labelled "Missing". If these missing values are not random and carry useful information for predicting the target, the model naturally includes a specific interval containing only the "Missing" value.
For categorical variables:
- Missing values constitute a modality in their own right, which is included in a group with modalities that have similar class correlations. For example, in the `workclass` variable, missing values can be grouped with "Never-worked" because of a common association with the target class.
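We can verify the numerical case directly on the noisy data: every row whose original `age` is missing should land in the dedicated "Missing" part.
# Check that missing 'age' values are all mapped to the "Missing" part
missing_age = X_reset_noisy['age'].isna()
print(X_transformed_noisy.loc[missing_age, 'LabelPage'].unique())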
Using the Khiops Encoder to improve LGBM performance¶
# PyCaret setup on the Khiops-encoded data (preprocessing disabled, since the data is already encoded):
setup(pd.concat([X_transformed_noisy.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1), target='class', session_id=123, verbose=False, preprocess=False)
compare_models(include=["lightgbm"])
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
lightgbm | Light Gradient Boosting Machine | 0.9226 | 0.9702 | 0.8077 | 0.8608 | 0.8333 | 0.7830 | 0.7838 | 0.3300 |
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn', subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
results_lgbm = pull()
results_lgbm['Model'] = results_lgbm['Model'].replace({
    'Light Gradient Boosting Machine': 'LGBM with Khiops Optimal Encoding',
})
# PyCaret setup on the raw noisy data, with PyCaret's default preprocessing:
setup(pd.concat([X_train_noisy, y_train], axis=1), target='class', session_id=123, verbose=False)
compare_models(include=["lightgbm"])
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
lightgbm | Light Gradient Boosting Machine | 0.8950 | 0.9491 | 0.7267 | 0.8148 | 0.7681 | 0.7005 | 0.7025 | 0.1890 |
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn', subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
results_lgbm = pd.concat([results_lgbm, pull()], ignore_index=True)
results_lgbm['Model'] = results_lgbm['Model'].replace({
    'Light Gradient Boosting Machine': 'LGBM with standard Encoding',
})
results_lgbm.sort_values(by="Accuracy",ascending=False)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
0 | LGBM with Khiops Optimal Encoding | 0.9226 | 0.9702 | 0.8077 | 0.8608 | 0.8333 | 0.7830 | 0.7838 | 0.330 |
1 | LGBM with standard Encoding | 0.8950 | 0.9491 | 0.7267 | 0.8148 | 0.7681 | 0.7005 | 0.7025 | 0.189 |
df_plot = results_lgbm.drop("TT (Sec)",axis=1).melt(id_vars=['Model'], var_name='Metric', value_name='Value')
plt.figure(figsize=(14, 6))
# Create a bar plot with Seaborn
sns.barplot(x='Metric', y='Value', hue='Model', data=df_plot, palette="Set3")
plt.title("Comparison of LGBM performances: Khiops Encoding vs Standard Encoding")
plt.ylabel('Value')
plt.xlabel('Metric')
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
In this section, the Khiops encoder has demonstrated its ability to significantly improve the performance of LGBM, especially on noisy datasets. We strongly encourage leveraging the Khiops encoder in your pipelines to benefit from its advanced encoding capabilities.
However, the true potential of this encoder is fully unlocked when coupled with the Selective Naive Bayes classifier of Khiops. Together, they ensure top-tier performance and deliver an end-to-end solution that is naturally robust, interpretable, and scalable.
Let's delve deeper into this in the following section, where we'll explore the combined strengths of the Khiops encoder and classifier.
Interpretability of supervised learning with end-to-end Khiops pipeline¶
In this section, we use the complete pipeline provided by Khiops, which chains Optimal Encoding and Parsimonious Training behind a simple syntax (a single fit call).
# we use a KhiopsClassifier to leverage the full pipeline
pkc = KhiopsClassifier(n_trees=0)
We use a KhiopsClassifier to leverage the full pipeline, and we set the number of trees to zero (n_trees=0). By default, Khiops builds 10 decision trees to enrich the training data with additional categorical variables (each corresponding to the leaves of a tree). This improves model performance, but it is not necessary for this simple tutorial.
# fit
pkc.fit(X_train, y_train)
KhiopsClassifier(auto_sort=True, internal_sort=None, key=None, n_features=100, n_pairs=0, n_trees=0, output_dir=None, verbose=False)
A single fit call thus chains optimal encoding and parsimonious learning. It is extremely handy because there's no need to prepare the training data: missing values, noise, outliers, and unbalanced classes are all handled for you, and categorical variables require no prior encoding.
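For instance, the fitted pipeline can be evaluated like any scikit-learn estimator (a quick sketch on the held-out test set):
# Predict on the raw, unprepared test set and measure accuracy
from sklearn.metrics import accuracy_score
y_pred = pkc.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))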
Now let's take a closer look at variable selection, which takes place in two successive steps: first removing variables that are not correlated with the target, then eliminating variables that are correlated with each other:
- During the Optimal Encoding step, the level measures the extent to which each variable is correlated with the target (more details here); variables with a negative compression gain are removed.
- Then, during the Parsimonious Training step, the weight measures the importance of each variable within the learned classifier, which is able to eliminate redundant variables (more details here). For instance, if two input variables are correlated, one of them can be discarded by being assigned a zero weight. In some cases, both variables are retained by the algorithm, and the weight of each is reduced because of this redundancy.
The next cell shows the names of the selected variables, sorted by decreasing weight, and the level is also displayed:
kpis = {"Feature" : [], "Level": [], "Weight": []}
for var in pkc.model_report_.modeling_report.get_snb_predictor().selected_variables:
kpis["Feature"].append(var.name)
kpis["Level"].append(var.level)
kpis["Weight"].append(var.weight)
df_kpis = pd.DataFrame(kpis).sort_values(by = 'Weight', ascending=False)
df_kpis.style.set_properties(subset=['Feature'], **{'width': '400px'})
Feature | Level | Weight | |
---|---|---|---|
0 | capital_gain | 0.139777 | 0.890656 |
5 | capital_loss | 0.055614 | 0.804718 |
3 | age | 0.117285 | 0.572296 |
1 | marital_status | 0.199115 | 0.535675 |
4 | education | 0.114525 | 0.500031 |
6 | occupation | 0.089358 | 0.471710 |
7 | hours_per_week | 0.069999 | 0.449249 |
9 | native_country | 0.006893 | 0.421906 |
2 | relationship | 0.209471 | 0.371124 |
8 | education_num | 0.114941 | 0.249054 |
11 | race | 0.009712 | 0.210968 |
10 | workclass | 0.022131 | 0.097687 |
12 | sex | 0.045482 | 0.005890 |
Note that the variable `fnlwgt` was removed in the first step, as it is not at all correlated with the target. Next, we can see that the variables `marital_status` and `relationship` are probably highly correlated with each other. Indeed, these two variables have among the highest Levels, so they are highly relevant for predicting the target. But their relatively modest Weights suggest that these two variables are correlated and therefore bring redundant information.
In addition, the `capital_loss` variable has a relatively low Level and a high Weight. This means that this variable is only weakly correlated with the target, but it provides information complementary to the other variables, which is very useful to the classifier in predicting the target.
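To summarize these observations visually, we can plot Level and Weight side by side for each selected variable:
# Bar chart comparing Level and Weight per selected variable
df_kpis.set_index('Feature')[['Level', 'Weight']].plot(kind='bar', figsize=(12, 5))
plt.ylabel('Value')
plt.tight_layout()
plt.show()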
For further details on model interpretation, please refer to the visualization tool page.