Single Table Classifier¶
In this tutorial we'll train a classifier on a single-table dataset.
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from khiops import core as kh
from khiops.sklearn import KhiopsClassifier
The Iris Dataset¶
We'll train a classifier for the Iris dataset, a classic dataset describing plants of the genus Iris. It contains 150 records, 50 for each of three Iris species: Setosa, Virginica and Versicolor. Each record contains the length and width of both the petal and the sepal of the plant. The standard task is to build a classifier that predicts the species from the petal and sepal measurements.
To train a classifier with Khiops, we only need a dataframe containing the Iris data:
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/10.2.4/Iris/Iris.txt"
iris_df = pd.read_csv(url, sep="\t")
# Method 2: Load data locally after downloading all Khiops samples (best for offline use)
# from khiops.tools import download_datasets
# download_datasets()
# iris_path = f"{kh.get_samples_dir()}/Iris/Iris.txt"
# iris_df = pd.read_csv(iris_path, sep="\t")
# Display the first 10 records from the dataset
iris_df[:10]
 | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
Training the Classifier¶
Before training the classifier, we split the data into the feature matrix (sepal length, sepal width, etc.) and the target vector containing the labels (the Class column).
# Drop the "Class" column to create the feature set (X).
X = iris_df.drop("Class", axis=1)
# Extract the "Class" column to create the target labels (y).
y = iris_df["Class"]
Then we split the features and labels into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
Let's check the contents of the feature matrix and the target vector:
print("Features:")
display(X_train)
print("Labels:")
display(y_train.unique())
Features:
 | SepalLength | SepalWidth | PetalLength | PetalWidth |
---|---|---|---|---|
16 | 5.4 | 3.9 | 1.3 | 0.4 |
82 | 5.8 | 2.7 | 3.9 | 1.2 |
60 | 5.0 | 2.0 | 3.5 | 1.0 |
35 | 5.0 | 3.2 | 1.2 | 0.2 |
143 | 6.8 | 3.2 | 5.9 | 2.3 |
... | ... | ... | ... | ... |
17 | 5.1 | 3.5 | 1.4 | 0.3 |
98 | 5.1 | 2.5 | 3.0 | 1.1 |
66 | 5.6 | 3.0 | 4.5 | 1.5 |
126 | 6.2 | 2.8 | 4.8 | 1.8 |
109 | 7.2 | 3.6 | 6.1 | 2.5 |
112 rows × 4 columns
Labels:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
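The 112 training rows shown above follow from the split sizes. As a minimal sketch, assuming train_test_split falls back to its default test fraction of 0.25 (the call above passes neither test_size nor train_size) and rounds the test size up, the arithmetic is:

```python
import math

n_samples = 150   # records in the Iris dataset
test_size = 0.25  # assumed default fraction used by train_test_split

# The test set size is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

Under these assumptions the test split holds 38 records, which is the denominator behind the test accuracy reported later.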
We are ready to train the KhiopsClassifier: we call the fit method on the training data. After it runs, the KhiopsClassifier instance is ready to classify new Iris plants:
khc = KhiopsClassifier()
khc.fit(X_train, y_train)
KhiopsClassifier()
Displaying the Classifier's Training Accuracy and AUC¶
The fit method calculates evaluation metrics on the training dataset. We access them through the estimator's model_report_ attribute, which is an instance of the AnalysisResults class:
train_performance = khc.model_report_.train_evaluation_report.get_snb_performance()
The train_performance object is of class PredictorPerformance and has accuracy and auc attributes:
print(f"Iris train accuracy: {train_performance.accuracy}")
print(f"Iris train AUC : {train_performance.auc}")
Iris train accuracy: 0.964286
Iris train AUC : 0.993257
The PredictorPerformance object also has a confusion_matrix attribute:
confusion_matrix = pd.DataFrame(
train_performance.confusion_matrix.matrix,
columns=train_performance.confusion_matrix.values,
index=train_performance.confusion_matrix.values,
)
print("Iris train confusion matrix:")
confusion_matrix
Iris train confusion matrix:
 | Iris-setosa | Iris-versicolor | Iris-virginica |
---|---|---|---|
Iris-setosa | 34 | 0 | 0 |
Iris-versicolor | 0 | 41 | 3 |
Iris-virginica | 0 | 1 | 33 |
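The confusion matrix and the reported training accuracy can be cross-checked by hand: accuracy is the sum of the diagonal (correct predictions) divided by the total count. The sketch below copies the counts from the table above into a plain list of lists; the variable names are illustrative and not part of the Khiops API.

```python
# Counts copied from the training confusion matrix shown above.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [34, 0, 0],
    [0, 41, 3],
    [0, 1, 33],
]

# Diagonal cells are the records whose predicted and actual classes agree.
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.6f}")
```

The result agrees with the accuracy value printed from train_performance, and the total matches the 112 training records.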
If you have installed the Khiops Visualization app, you can explore the full learning report by running the code below.
# Uncomment the lines below
# khc.export_report_file("./iris_report.khj")
# kh.visualize_report("./iris_report.khj")
Deploying the Classifier and Displaying Its Test Performance¶
Now that we have a fitted KhiopsClassifier, we can deploy it on the test split. This can be done in two ways:
- predict returns the predicted class of each record.
- predict_proba returns the predicted probability of each class for each record.
Let's predict both the Iris labels and their class probabilities:
y_pred_test = khc.predict(X_test)
y_probas_test = khc.predict_proba(X_test)
print("Classes:")
display(khc.classes_)
print()
print("Predictions (first 10 values):")
display(y_pred_test[:10])
print()
print("Probabilities (first 10 rows):")
display(y_probas_test[:10])
Classes:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype='<U15')
Predictions (first 10 values):
array(['Iris-versicolor', 'Iris-virginica', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa', 'Iris-versicolor'], dtype='<U15')
Probabilities (first 10 rows):
array([[0.00164867, 0.9298149 , 0.06853643],
       [0.00169434, 0.03944314, 0.95886252],
       [0.00169434, 0.03944314, 0.95886252],
       [0.00164867, 0.9298149 , 0.06853643],
       [0.99482629, 0.00347827, 0.00169544],
       [0.0019504 , 0.31942133, 0.67862827],
       [0.00164867, 0.9298149 , 0.06853643],
       [0.99482629, 0.00347827, 0.00169544],
       [0.99482629, 0.00347827, 0.00169544],
       [0.00164867, 0.9298149 , 0.06853643]])
From these predictions we compute the test accuracy and the One-vs-Rest AUC using sklearn.metrics:
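The columns of the probability array follow the order of khc.classes_, so each predicted label can be read off as the class with the highest probability in its row. The sketch below illustrates this on four rows copied from the output above. Whether predict is implemented internally as exactly this argmax is an assumption; it is consistent with the ten predictions displayed.

```python
# Class order as displayed by khc.classes_.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows 1, 2, 5 and 6 of the probabilities printed above.
probas = [
    [0.00164867, 0.9298149, 0.06853643],
    [0.00169434, 0.03944314, 0.95886252],
    [0.99482629, 0.00347827, 0.00169544],
    [0.0019504, 0.31942133, 0.67862827],
]

# For each row, pick the index of the largest probability and map it to a label.
predicted = [classes[max(range(len(row)), key=row.__getitem__)] for row in probas]

for row, label in zip(probas, predicted):
    print(f"{label:<16} (p = {max(row):.3f})")
```

The last row shows a less confident case: the record is labeled Iris-virginica with a probability of about 0.68, while Iris-versicolor still receives about 0.32.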
accuracy_test = metrics.accuracy_score(y_test, y_pred_test)
auc_test = metrics.roc_auc_score(y_test, y_probas_test, multi_class="ovr")
print(f"Iris test accuracy: {accuracy_test}")
print(f"Iris test AUC : {auc_test}")
Iris test accuracy: 0.9473684210526315
Iris test AUC : 1.0
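An AUC of 1.0 alongside an accuracy below 1.0 is not a contradiction: One-vs-Rest AUC only measures how well each class's probability ranks its own records above the others, whereas accuracy depends on which class wins the argmax. The following pure-Python sketch illustrates this on hypothetical toy data (the labels and probabilities are invented for illustration and are not taken from the Iris test set). It assumes the unweighted (macro) average over classes, which is the default of roc_auc_score.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, probas, classes):
    """Average the binary AUC of each class against all the others."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in probas]
        is_positive = [label == cls for label in y_true]
        aucs.append(binary_auc(scores, is_positive))
    return sum(aucs) / len(aucs)


# Hypothetical toy data: the fourth record is a "b" whose top probability is "c".
classes = ["a", "b", "c"]
y_true = ["a", "a", "b", "b", "c", "c"]
probas = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.40, 0.50],
    [0.10, 0.20, 0.70],
    [0.05, 0.35, 0.60],
]

y_pred = [classes[max(range(3), key=row.__getitem__)] for row in probas]
toy_accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
toy_auc = ovr_macro_auc(y_true, probas, classes)

print(f"toy accuracy: {toy_accuracy:.3f}")
print(f"toy OvR AUC : {toy_auc}")
```

In the toy data one record is misclassified, yet every class still ranks its own records above all others, so the AUC stays at 1.0. For the Iris results, an accuracy of 0.9473684210526315 corresponds to 36 correct predictions out of 38, assuming the test split holds 38 records.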