Single Table Classifier¶
This first tutorial trains a classifier on a single table dataset.
import pandas as pd
from khiops.sklearn import KhiopsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
Training a Classifier¶
We'll train a classifier on the Iris dataset. This is a classical dataset containing measurements of plants belonging to the genus Iris. It has 150 records, 50 for each of the three Iris variants: Setosa, Virginica and Versicolor. Each record contains the length and the width of both the petal and the sepal of the plant. The standard task for this dataset is to build a classifier that predicts the Iris variant from the petal and sepal characteristics.
To train a classifier with Khiops, we only need a dataframe, which we load here from a file.
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/Iris/Iris.txt"
iris_df = pd.read_csv(url, delimiter='\t')
# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
# If the samples have not been downloaded yet:
# from khiops.tools import download_datasets
# download_datasets()
#
# from os import path
# from khiops import core as kh
# iris_path = path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
# iris_df = pd.read_csv(iris_path, sep="\t")
# Display the first 10 records from the dataset
iris_df.head(10)
| | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
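The dataset is small and balanced. A quick sanity check with plain pandas (not part of the original tutorial) confirms the shape and the class distribution:

# Optional: check the dataset shape and the class distribution
print(iris_df.shape)                    # expected: (150, 5)
print(iris_df["Class"].value_counts())  # expected: 50 records per class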
Before training the classifier, we split the data into the feature matrix (sepal length, sepal width, etc.) and the target vector containing the labels (the Class column).
# Drop the "Class" column to create the feature matrix (X)
X_iris = iris_df.drop("Class", axis=1)
# Extract the "Class" column to create the target vector (y)
y_iris = iris_df["Class"]
Then we can construct our train and test datasets:
# Build the train and test datasets
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(X_iris, y_iris)
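Note that train_test_split shuffles the data randomly, so the exact rows and metrics shown below may vary between runs. A minimal sketch of a reproducible, class-stratified split (the random_state value is an arbitrary choice and is not the configuration used for the outputs in this tutorial):

# Optional: reproducible, class-stratified split
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=42, stratify=y_iris
)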
Let's check the contents of the feature matrix and the target vector:
# Features of the Iris training set
X_iris_train.head()
| | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| 93 | 5.0 | 2.3 | 3.3 | 1.0 |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 |
| 82 | 5.8 | 2.7 | 3.9 | 1.2 |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 |
# Labels of the Iris training set
y_iris_train.unique()
array(['Iris-versicolor', 'Iris-virginica', 'Iris-setosa'], dtype=object)
Let's now train the classifier with the Khiops estimator KhiopsClassifier. Its fit method returns a model ready to classify new Iris plants.
pkc_iris = KhiopsClassifier()
pkc_iris.fit(X_iris_train, y_iris_train)
KhiopsClassifier()
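Since KhiopsClassifier follows the scikit-learn estimator interface, the usual scikit-learn tooling should also apply to it. For example, a cross-validation sketch (the cv value is an arbitrary choice, not part of the original tutorial):

# Optional: cross-validate a fresh classifier with standard scikit-learn tooling
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(KhiopsClassifier(), X_iris, y_iris, cv=5)
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")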
Accessing the Classifier's Basic Train Evaluation Metrics¶
Khiops calculates evaluation metrics for the training dataset. We access them via the model's model_report_ attribute, which is an instance of the AnalysisResults class. Let's check this out:
iris_train_performance = pkc_iris.model_report_.train_evaluation_report.get_snb_performance()
This object iris_train_performance is of class PredictorPerformance and has accuracy and auc attributes:
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris train AUC : {iris_train_performance.auc}")
Iris train accuracy: 0.964286
Iris train AUC : 0.997464
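These are the metrics Khiops computed internally on the training data. As a sanity check (not part of the original tutorial), the train accuracy can be recomputed with scikit-learn from the model's predictions on the training set; the two figures should be very close, although Khiops reports the value from its own evaluation report:

# Optional sanity check: recompute the train accuracy with scikit-learn
train_predictions = pkc_iris.predict(X_iris_train)
print("Recomputed train accuracy:", accuracy_score(y_iris_train, train_predictions))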
The PredictorPerformance object also has a confusion_matrix attribute:
iris_classes = iris_train_performance.confusion_matrix.values
iris_confusion_matrix = pd.DataFrame(
iris_train_performance.confusion_matrix.matrix,
columns=iris_classes,
index=iris_classes,
)
print("Iris train confusion matrix:")
iris_confusion_matrix
Iris train confusion matrix:
| | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 36 | 0 | 0 |
| Iris-versicolor | 0 | 39 | 3 |
| Iris-virginica | 0 | 1 | 33 |
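To read the confusion matrix as per-class rates rather than raw counts, it can be normalized row-wise with plain pandas (a small sketch, not part of the original tutorial):

# Optional: normalize each row of the confusion matrix to rates
cm = iris_confusion_matrix.astype(int)  # ensure the values are numeric
cm.div(cm.sum(axis=1), axis=0).round(3)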
Deploying a Classifier¶
We are now going to deploy the Iris classifier pkc_iris that we have just trained.
The learned classifier can be deployed in two different ways:
- to predict a class, which can be obtained using the predict method of the model
- to predict class probabilities, which can be obtained using the predict_proba method of the model
Let's first predict the Iris labels:
iris_predictions = pkc_iris.predict(X_iris_test)
print("Iris model predictions (first 10 values):")
iris_predictions[:10]
Iris model predictions (first 10 values):
array(['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor'], dtype='<U15')
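To inspect the predictions next to the true labels, they can be placed side by side in a DataFrame (a small sketch, not part of the original tutorial):

# Optional: compare the predictions with the true test labels
pd.DataFrame({"true": y_iris_test.values, "predicted": iris_predictions}).head(10)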
From these predictions, we can compute the accuracy score using sklearn.metrics:
# accuracy_score was imported from sklearn.metrics at the top of this notebook
accuracy_score(y_iris_test, iris_predictions)
0.9473684210526315
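Beyond accuracy, per-class precision, recall and F1 scores can be obtained with scikit-learn's classification report (a sketch, not part of the original tutorial):

# Optional: per-class precision, recall and F1 on the test set
from sklearn.metrics import classification_report

print(classification_report(y_iris_test, iris_predictions))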
Let's now predict the probabilities for each Iris type.
Note that the column order of this matrix is given by the estimator attribute pkc_iris.classes_:
iris_probas = pkc_iris.predict_proba(X_iris_test)
print(f"Iris classes {pkc_iris.classes_}")
print("Iris model probabilities for each class (first 10 rows):")
iris_probas[:10]
Iris classes ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Iris model probabilities for each class (first 10 rows):
array([[0.99557689, 0.00221678, 0.00220633],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00237667, 0.9812181 , 0.01640523],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00218213, 0.08384243, 0.91397544],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00218213, 0.08384243, 0.91397544],
       [0.00218213, 0.08384243, 0.91397544],
       [0.99557689, 0.00221678, 0.00220633],
       [0.00237667, 0.9812181 , 0.01640523]])
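To make the column-to-class mapping explicit, the probabilities can be wrapped in a DataFrame whose columns are the estimator's classes (a small sketch, not part of the original tutorial):

# Optional: label the probability columns with the class names
pd.DataFrame(iris_probas, columns=pkc_iris.classes_).head(10)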
Then, we can compute the ROC AUC score using sklearn.metrics, setting the multi_class parameter to 'ovr' (one-vs-rest):
# roc_auc_score was imported from sklearn.metrics at the top of this notebook
# Calculate the ROC-AUC score using the One-vs-Rest approach
roc_auc_score(y_iris_test, iris_probas, multi_class='ovr')
0.9890873015873015
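By default, roc_auc_score with multi_class='ovr' macro-averages the per-class AUCs; a class-frequency weighted average can be requested instead via the average parameter (a sketch, not part of the original tutorial):

# Optional: weight the per-class AUCs by class frequency
roc_auc_score(y_iris_test, iris_probas, multi_class='ovr', average='weighted')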