Single Table Classifier¶
This first tutorial trains a classifier on a single table dataset.
import pandas as pd
from khiops.sklearn import KhiopsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
Training a Classifier¶
We'll train a classifier on the Iris dataset. This is a classical dataset containing measurements of plants belonging to the genus Iris. It has 150 records, 50 for each of the three Iris variants: Setosa, Virginica and Versicolor. Each record contains the length and the width of both the petal and the sepal of the plant. The standard task for this dataset is to build a classifier that predicts the Iris variant from the petal and sepal characteristics.
To train a classifier with Khiops, we only need a dataframe, which we load here from a file.
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/Iris/Iris.txt"
iris_df = pd.read_csv(url, delimiter='\t')
# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
# If the samples have not been downloaded yet:
# from khiops.tools import download_datasets
# download_datasets()
#
# from os import path
# from khiops import core as kh
# iris_path = path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
# iris_df = pd.read_csv(iris_path, sep="\t")
# Display the first 10 records from the dataset
iris_df.head(10)
| | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
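The dataset is small and balanced. A quick sanity check with plain pandas (not part of the original tutorial) confirms the shape and the class distribution:

# Optional: check the dataset shape and the class distribution
print(iris_df.shape)                    # expected: (150, 5)
print(iris_df["Class"].value_counts())  # expected: 50 records per class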
Before training the classifier, we split the data into the feature matrix (sepal length, sepal width, etc.) and the target vector containing the labels (the Class column).
# Drop the "Class" column to create the feature matrix (X)
X_iris = iris_df.drop("Class", axis=1)
# Extract the "Class" column to create the target vector (y)
y_iris = iris_df["Class"]
Then we can construct our train and test datasets:
# Build the train and test datasets
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(X_iris, y_iris)
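Note that train_test_split shuffles the data randomly, so the exact rows and metrics shown below may vary between runs. A minimal sketch of a reproducible, class-stratified split (the random_state value is an arbitrary choice and is not the configuration used for the outputs in this tutorial):

# Optional: reproducible, class-stratified split
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=42, stratify=y_iris
)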
Let's check the contents of the feature matrix and the target vector:
# Features of the Iris training set
X_iris_train.head()
| | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| 93 | 5.0 | 2.3 | 3.3 | 1.0 |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 |
| 82 | 5.8 | 2.7 | 3.9 | 1.2 |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 |
# Labels of the Iris training set
y_iris_train.unique()
array(['Iris-versicolor', 'Iris-virginica', 'Iris-setosa'], dtype=object)
Let's now train the classifier with the Khiops estimator KhiopsClassifier. Its fit method returns a model ready to classify new Iris plants.
pkc_iris = KhiopsClassifier()
pkc_iris.fit(X_iris_train, y_iris_train)
KhiopsClassifier()
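Since KhiopsClassifier follows the scikit-learn estimator interface, the usual scikit-learn tooling should also apply to it. For example, a cross-validation sketch (the cv value is an arbitrary choice, not part of the original tutorial):

# Optional: cross-validate a fresh classifier with standard scikit-learn tooling
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(KhiopsClassifier(), X_iris, y_iris, cv=5)
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")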
Accessing the Classifier's Basic Train Evaluation Metrics¶
Khiops calculates evaluation metrics for the training dataset. We access them via the model's model_report_ attribute, which is an instance of the AnalysisResults class. Let's check this out:
iris_train_performance = pkc_iris.model_report_.train_evaluation_report.get_snb_performance()
This object iris_train_performance is of class PredictorPerformance and has accuracy and auc attributes:
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris train AUC : {iris_train_performance.auc}")
Iris train accuracy: 0.964286
Iris train AUC : 0.997464
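These are the metrics Khiops computed internally on the training data. As a sanity check (not part of the original tutorial), the train accuracy can be recomputed with scikit-learn from the model's predictions on the training set; the two figures should be very close, although Khiops reports the value from its own evaluation report:

# Optional sanity check: recompute the train accuracy with scikit-learn
train_predictions = pkc_iris.predict(X_iris_train)
print("Recomputed train accuracy:", accuracy_score(y_iris_train, train_predictions))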
The PredictorPerformance object also has a confusion_matrix attribute:
iris_classes = iris_train_performance.confusion_matrix.values
iris_confusion_matrix = pd.DataFrame(
iris_train_performance.confusion_matrix.matrix,
columns=iris_classes,
index=iris_classes,
)
print("Iris train confusion matrix:")
iris_confusion_matrix
Iris train confusion matrix:
| | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 36 | 0 | 0 |
| Iris-versicolor | 0 | 39 | 3 |
| Iris-virginica | 0 | 1 | 33 |
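To read the confusion matrix as per-class rates rather than raw counts, it can be normalized row-wise with plain pandas (a small sketch, not part of the original tutorial):

# Optional: normalize each row of the confusion matrix to rates
cm = iris_confusion_matrix.astype(int)  # ensure the values are numeric
cm.div(cm.sum(axis=1), axis=0).round(3)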
Deploying a Classifier¶
We are now going to deploy the Iris classifier pkc_iris that we have just trained.
The learned classifier can be deployed in two different ways:
- to predict a class, which can be obtained using the predict method of the model
- to predict class probabilities, which can be obtained using the predict_proba method of the model
Let's first predict the Iris labels:
iris_predictions = pkc_iris.predict(X_iris_test)
print("Iris model predictions (first 10 values):")
iris_predictions[:10]
Iris model predictions (first 10 values):
array(['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor'], dtype='<U15')
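To inspect the predictions next to the true labels, they can be placed side by side in a DataFrame (a small sketch, not part of the original tutorial):

# Optional: compare the predictions with the true test labels
pd.DataFrame({"true": y_iris_test.values, "predicted": iris_predictions}).head(10)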
From these predictions, we can compute the accuracy score using sklearn.metrics:
# accuracy_score was imported from sklearn.metrics at the top of this notebook
accuracy_score(y_iris_test, iris_predictions)
0.9473684210526315
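Beyond accuracy, per-class precision, recall and F1 scores can be obtained with scikit-learn's classification report (a sketch, not part of the original tutorial):

# Optional: per-class precision, recall and F1 on the test set
from sklearn.metrics import classification_report

print(classification_report(y_iris_test, iris_predictions))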
Let's now predict the probabilities for each Iris type.
Note that the column order of this matrix is given by the estimator attribute pkc_iris.classes_:
iris_probas = pkc_iris.predict_proba(X_iris_test)
print(f"Iris classes {pkc_iris.classes_}")
print("Iris model probabilities for each class (first 10 rows):")
iris_probas[:10]
Iris classes ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Iris model probabilities for each class (first 10 rows):
array([[0.99557689, 0.00221678, 0.00220633],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00237667, 0.9812181 , 0.01640523],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00218213, 0.08384243, 0.91397544],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00218213, 0.08384243, 0.91397544],
       [0.00218213, 0.08384243, 0.91397544],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00237667, 0.9812181 , 0.01640523]])
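To make the column-to-class mapping explicit, the probabilities can be wrapped in a DataFrame whose columns are the estimator's classes (a small sketch, not part of the original tutorial):

# Optional: label the probability columns with the class names
pd.DataFrame(iris_probas, columns=pkc_iris.classes_).head(10)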
Then, we can compute the ROC AUC score using sklearn.metrics, setting the multi_class parameter to 'ovr' (one-vs-rest):
# roc_auc_score was imported from sklearn.metrics at the top of this notebook
# Calculate the ROC-AUC score using the One-vs-Rest approach
roc_auc_score(y_iris_test, iris_probas, multi_class='ovr')
0.9890873015873015
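By default, roc_auc_score with multi_class='ovr' macro-averages the per-class AUCs; a class-frequency weighted average can be requested instead via the average parameter (a sketch, not part of the original tutorial):

# Optional: weight the per-class AUCs by class frequency
roc_auc_score(y_iris_test, iris_probas, multi_class='ovr', average='weighted')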