Single Table Classifier with the core API
In this tutorial, we're going to create a classifier on a single-table dataset.
import warnings
import pandas as pd
from khiops import core as kh
from khiops.tools import download_datasets
# Download the sample datasets from GitHub if not available
warnings.filterwarnings("ignore", message="Download.*") # Ignore dataset download warning
download_datasets()
The Iris Dataset
We'll train a classifier for the Iris dataset. This is a classical dataset containing data on plants belonging to the genus Iris. It contains 150 records, 50 for each of the three Iris variants: Setosa, Virginica and Versicolor. Each record contains the length and the width of both the petal and the sepal of the plant. The standard task with this dataset is to build a classifier for the type of Iris based on the petal and sepal characteristics.
The Khiops core API is file-oriented: it reads and writes files.
In particular, to train the classifier on the Iris dataset we need two input files:
- A data table file: Usually a CSV or TSV file
- A Khiops dictionary file: Contains the data table schema in the KDIC format
The Iris sample dataset already contains these two files. We'll store their locations in variables and take a look at both files:
# Store the locations of the `Iris` dataset files
iris_table_path = f"{kh.get_samples_dir()}/Iris/Iris.txt"
iris_kdic_path = f"{kh.get_samples_dir()}/Iris/Iris.kdic"
# Print the first lines of the data file
print("Iris table file:")
display(pd.read_csv(iris_table_path, sep="\t"))
# Print the Khiops dictionary file
print("Iris dictionary file:", end="")
with open(iris_kdic_path) as iris_kdic_file:
    print(iris_kdic_file.read(), end="")
Iris table file:
| SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 5 columns
Iris dictionary file:
Dictionary Iris
{
    Numerical SepalLength ;
    Numerical SepalWidth ;
    Numerical PetalLength ;
    Numerical PetalWidth ;
    Categorical Class ;
};
Note that the columns described in the dictionary file are consistent with those in the data file. For this training task the features are the first four columns, which are all Numerical, whereas the target is the Categorical Class column.
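The consistency between the dictionary and the data table can also be checked programmatically. The following is a minimal sketch that uses only the Python standard library and inline copies of the two inputs shown above. The helpers `dictionary_variables` and `table_columns` are hypothetical names introduced for this illustration, and the regular expression only handles plain `Type Name ;` declarations; it is not a full KDIC parser.

```python
import csv
import io
import re

# Inline copies of the inputs shown above; a real run would read the two files.
KDIC_TEXT = """
Dictionary Iris
{
    Numerical SepalLength ;
    Numerical SepalWidth ;
    Numerical PetalLength ;
    Numerical PetalWidth ;
    Categorical Class ;
};
"""

TABLE_TEXT = (
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
)


def dictionary_variables(kdic_text):
    """Return (type, name) pairs for simple variable declarations.

    Hypothetical helper: it only recognizes `Numerical` and `Categorical`
    declarations of the form `Type Name ;`.
    """
    pattern = re.compile(r"^\s*(Numerical|Categorical)\s+(\S+)\s*;", re.MULTILINE)
    return pattern.findall(kdic_text)


def table_columns(table_text):
    """Return the column names from the header line of a tab-separated table."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    return next(reader)


variables = dictionary_variables(KDIC_TEXT)
columns = table_columns(TABLE_TEXT)

# The dictionary is consistent with the table if names match in order.
is_coherent = [name for _, name in variables] == columns
print(f"Dictionary variables: {variables}")
print(f"Table columns       : {columns}")
print(f"Coherent            : {is_coherent}")
```

With the real files, one would pass the contents of `iris_kdic_path` and `iris_table_path` instead of the inline strings.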
Training the Classifier
Let's now train the classifier with the train_predictor Khiops core API function:
report_path, model_kdic_path = kh.train_predictor(
    iris_kdic_path,   # Dictionary file path
    "Iris",           # Name of the data dictionary for the table
    iris_table_path,  # Data table file path
    "Class",          # Target column
    "./st_results",   # Directory to store the output files
)
The train_predictor function by default splits the data into 70% train and 30% test; it uses the test split to evaluate the model. The function returns the paths of its two output files:
- A report file containing the model's information (including evaluation metrics on the train/test split), which can be explored with the Khiops Visualization app or read with the Khiops core API.
- A Khiops dictionary file containing the classifier model
As you can see, Khiops dictionary files may be used to encode classifiers. In fact, they are a very powerful language to transform databases. You may learn more about them here.
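To make the default 70/30 split mentioned above concrete, here is a minimal sketch in plain Python. It only illustrates a random split of 150 records into train and test parts; it is not how Khiops samples internally, and the helper name `split_train_test` and the seed are assumptions made for this example.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of `records` and cut it into train and test parts.

    Hypothetical helper for illustration only.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 150 Iris records: we only track record indices here.
record_ids = range(150)
train_ids, test_ids = split_train_test(record_ids)

print(f"Train records: {len(train_ids)}")
print(f"Test records : {len(test_ids)}")
```

For 150 records this gives 105 train and 45 test records, and every record lands in exactly one of the two parts.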
Displaying the Classifier's Accuracy and AUC
Khiops calculates evaluation metrics for the train and test splits. We access them by loading the report file into an AnalysisResults object. Let's check this out:
model_report = kh.read_analysis_results_file(report_path)
train_performance = model_report.train_evaluation_report.get_snb_performance()
test_performance = model_report.test_evaluation_report.get_snb_performance()
The train_performance and test_performance objects are of class PredictorPerformance, which has accuracy and auc attributes:
print(f"Iris train accuracy: {train_performance.accuracy}")
print(f"Iris train AUC : {train_performance.auc}")
print(f"Iris test accuracy : {test_performance.accuracy}")
print(f"Iris test AUC : {test_performance.auc}")
Iris train accuracy: 0.980952
Iris train AUC : 0.997868
Iris test accuracy : 0.955556
Iris test AUC : 0.984362
The PredictorPerformance objects also have a confusion_matrix attribute:
iris_classes = train_performance.confusion_matrix.values
train_confusion_matrix = pd.DataFrame(
    train_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
test_confusion_matrix = pd.DataFrame(
    test_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
print("Iris train confusion matrix:")
display(train_confusion_matrix)
print("Iris test confusion matrix:")
display(test_confusion_matrix)
Iris train confusion matrix:
| Iris-setosa | Iris-versicolor | Iris-virginica |
---|---|---|---|
Iris-setosa | 38 | 0 | 0 |
Iris-versicolor | 0 | 31 | 1 |
Iris-virginica | 0 | 1 | 34 |
Iris test confusion matrix:
| Iris-setosa | Iris-versicolor | Iris-virginica |
---|---|---|---|
Iris-setosa | 12 | 0 | 0 |
Iris-versicolor | 0 | 18 | 2 |
Iris-virginica | 0 | 0 | 13 |
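As a sanity check, the accuracies reported earlier can be recomputed from these confusion matrices: accuracy is the sum of the diagonal (correct predictions) divided by the total number of records. The sketch below copies the matrix values shown above into plain Python lists; the `accuracy` helper is a hypothetical name introduced for this illustration.

```python
def accuracy(matrix):
    """Fraction of records on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Values copied from the confusion matrices displayed above.
train_matrix = [
    [38, 0, 0],
    [0, 31, 1],
    [0, 1, 34],
]
test_matrix = [
    [12, 0, 0],
    [0, 18, 2],
    [0, 0, 13],
]

train_total = sum(sum(row) for row in train_matrix)
test_total = sum(sum(row) for row in test_matrix)
train_accuracy = accuracy(train_matrix)
test_accuracy = accuracy(test_matrix)

print(f"Train records: {train_total}, accuracy: {train_accuracy:.6f}")
print(f"Test records : {test_total}, accuracy: {test_accuracy:.6f}")
```

The totals are 105 and 45 records, which matches the 70/30 split of the 150 records, and the recomputed accuracies (103/105 and 43/45) agree with the values printed from the report.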
Deploying the Classifier
We are now going to deploy the Iris classifier that we have just trained.
To this end we use the model dictionary file that the train_predictor function created, in conjunction with the deploy_model core API function. Note that the name of the dictionary for the model is SNB_Iris.
For simplicity, we'll just deploy on the whole data table file (one would usually do this on new data):
iris_deployed_path = "./st_results/iris_deployed.txt"
kh.deploy_model(
    model_kdic_path,     # Path of the model dictionary file
    "SNB_Iris",          # Name of the model dictionary
    iris_table_path,     # Path of the table to deploy the model
    iris_deployed_path,  # Path of the output (deployed) file
)
The deployed data table is stored at the path held in the variable iris_deployed_path. Let's have a look at it:
display(pd.read_csv(iris_deployed_path, sep="\t"))
| PredictedClass | ProbClassIris-setosa | ProbClassIris-versicolor | ProbClassIris-virginica |
---|---|---|---|---|
0 | Iris-setosa | 0.988190 | 0.008858 | 0.002951 |
1 | Iris-setosa | 0.988190 | 0.008858 | 0.002951 |
2 | Iris-setosa | 0.988190 | 0.008858 | 0.002951 |
3 | Iris-setosa | 0.988190 | 0.008858 | 0.002951 |
4 | Iris-setosa | 0.988190 | 0.008858 | 0.002951 |
... | ... | ... | ... | ... |
145 | Iris-virginica | 0.003303 | 0.014047 | 0.982650 |
146 | Iris-virginica | 0.003752 | 0.151320 | 0.844929 |
147 | Iris-virginica | 0.003303 | 0.014047 | 0.982650 |
148 | Iris-virginica | 0.003303 | 0.014047 | 0.982650 |
149 | Iris-virginica | 0.003752 | 0.151320 | 0.844929 |
150 rows × 4 columns
The deployed data table file contains four columns:
- PredictedClass: Contains the class prediction
- ProbClassIris-setosa, ProbClassIris-versicolor and ProbClassIris-virginica: Contain the probability of each class of Iris
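To illustrate how these columns relate, the sketch below checks two properties on a few rows copied from the output above: the three probabilities sum to roughly 1, and the predicted class is the one with the highest probability. Both properties hold for the rows shown; treating them as general, and stripping the `ProbClass` prefix to recover the class name, are assumptions of this example. The `check_row` helper is a hypothetical name, and only the standard library is used.

```python
import csv
import io

# Three rows copied from the deployed output shown above (tab-separated).
DEPLOYED_TEXT = (
    "PredictedClass\tProbClassIris-setosa\tProbClassIris-versicolor\tProbClassIris-virginica\n"
    "Iris-setosa\t0.988190\t0.008858\t0.002951\n"
    "Iris-virginica\t0.003303\t0.014047\t0.982650\n"
    "Iris-virginica\t0.003752\t0.151320\t0.844929\n"
)

PROB_PREFIX = "ProbClass"  # Assumed naming convention, taken from the column names


def check_row(row, tolerance=1e-4):
    """Return (prediction_is_most_probable, probabilities_sum_to_one) for a row."""
    probabilities = {
        name[len(PROB_PREFIX):]: float(value)
        for name, value in row.items()
        if name.startswith(PROB_PREFIX)
    }
    most_probable = max(probabilities, key=probabilities.get)
    sums_to_one = abs(sum(probabilities.values()) - 1.0) <= tolerance
    return most_probable == row["PredictedClass"], sums_to_one


rows = list(csv.DictReader(io.StringIO(DEPLOYED_TEXT), delimiter="\t"))
checks = [check_row(row) for row in rows]

for row, (is_argmax, sums_to_one) in zip(rows, checks):
    print(row["PredictedClass"], is_argmax, sums_to_one)
```

To run the same checks on the real output, one would open the file at `iris_deployed_path` instead of the inline string. The tolerance allows for the rounding of the probabilities to six decimals.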