Single-Table Tutorial with the core API

In this tutorial, we're going to create a classifier on a single-table dataset.

In [1]:

import warnings
import pandas as pd
from khiops import core as kh
from khiops.tools import download_datasets

# Download the sample datasets from GitHub if not available
warnings.filterwarnings("ignore", message="Download.*") # Ignore dataset download warning
download_datasets()

The Iris Dataset

We'll train a classifier for the Iris dataset. This is a classic dataset describing plants of the genus Iris. It contains 150 records, 50 for each of three Iris species: Setosa, Virginica and Versicolor. Each record contains the length and the width of both the petal and the sepal of the plant. The standard task with this dataset is to build a classifier for the type of Iris, based on the petal and sepal characteristics.

The Khiops core API is file-oriented: it reads and writes files. In particular, to train the classifier on the Iris dataset we need two input files:

A data table file: Usually a CSV or TSV file
A Khiops dictionary file: Contains the data table schema under the KDIC format

The Iris sample dataset already contains these two files. We'll store their locations in variables and take a look at both files:

In [2]:

# Store the locations of the `Iris` dataset files
iris_table_path = f"{kh.get_samples_dir()}/Iris/Iris.txt"
iris_kdic_path = f"{kh.get_samples_dir()}/Iris/Iris.kdic"

# Print the first lines of the data file
print("Iris table file:")
display(pd.read_csv(iris_table_path, sep="\t"))

# Print the Khiops dictionary file
print("Iris dictionary file:", end="")
with open(iris_kdic_path) as iris_kdic_file:
    print(iris_kdic_file.read(), end="")

Iris table file:

	SepalLength	SepalWidth	PetalLength	PetalWidth	Class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

150 rows × 5 columns

Iris dictionary file:
Dictionary	Iris
{
	Numerical	SepalLength	;	
	Numerical	SepalWidth	;	
	Numerical	PetalLength	;	
	Numerical	PetalWidth	;	
	Categorical	Class	;	
};

Note that the columns described in the dictionary file are consistent with those in the data file. For this training task the features are the first four columns, which are all numerical, whereas the target is the categorical Class column.
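The consistency between the dictionary and the data file can also be checked programmatically. The following is a minimal sketch that uses only the Python standard library. The helper name kdic_variable_names is hypothetical, and it only handles simple dictionaries like the one above (one "Type Name ;" declaration per line), not the full KDIC syntax.

```python
import csv
import io

# Contents copied from the outputs above (table truncated to two records)
IRIS_KDIC = """Dictionary\tIris
{
\tNumerical\tSepalLength\t;
\tNumerical\tSepalWidth\t;
\tNumerical\tPetalLength\t;
\tNumerical\tPetalWidth\t;
\tCategorical\tClass\t;
};
"""
IRIS_TABLE = (
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
)
KNOWN_TYPES = ("Numerical", "Categorical")


def kdic_variable_names(kdic_text):
    """Hypothetical helper: return (name, type) pairs of a simple dictionary."""
    variables = []
    for line in kdic_text.splitlines():
        tokens = line.split()
        # A variable declaration looks like: <Type> <Name> ;
        is_declaration = (
            len(tokens) >= 3 and tokens[0] in KNOWN_TYPES and tokens[2] == ";"
        )
        if is_declaration:
            variables.append((tokens[1], tokens[0]))
    return variables


variables = kdic_variable_names(IRIS_KDIC)
header = next(csv.reader(io.StringIO(IRIS_TABLE), delimiter="\t"))

# The dictionary variables should match the header columns, in order
names_match = [name for name, _ in variables] == header
print(f"Dictionary variables match the table header: {names_match}")
```

For real dictionary files it is probably safer to rely on the Khiops tooling itself, since this sketch ignores the rest of the dictionary language.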

Training the Classifier

Let's now train the classifier with the train_predictor function of the Khiops core API:

In [3]:

report_path, model_kdic_path = kh.train_predictor(
    iris_kdic_path,  # Dictionary file path
    "Iris",          # Name of the data dictionary for the table
    iris_table_path, # Data table file path
    "Class",         # Target column
    "./st_results"   # Directory to store the output files
)

By default, the train_predictor function splits the data into 70% train and 30% test, and uses the test split to evaluate the model. It returns the paths of its two output files:

A report file containing the model's information (including evaluation metrics on the train/test split), which can be explored with the Khiops Visualization app or loaded with the Khiops core API
A Khiops dictionary file containing the classifier model

As you can see, Khiops dictionary files may be used to encode classifiers. In fact, they are a very powerful language to transform databases. You may learn more about them here.

Displaying the Classifier's Accuracy and AUC

Khiops calculates evaluation metrics for the train/test split datasets. We access them by loading the report file into an AnalysisResults object. Let's check this out:

In [4]:

model_report = kh.read_analysis_results_file(report_path)
train_performance = model_report.train_evaluation_report.get_snb_performance()
test_performance = model_report.test_evaluation_report.get_snb_performance()

The train_performance and test_performance objects are of class PredictorPerformance, which has accuracy and auc attributes:

In [5]:

print(f"Iris train accuracy: {train_performance.accuracy}")
print(f"Iris train AUC     : {train_performance.auc}")
print(f"Iris test accuracy : {test_performance.accuracy}")
print(f"Iris test  AUC     : {test_performance.auc}")

Iris train accuracy: 0.980952
Iris train AUC     : 0.997868
Iris test accuracy : 0.955556
Iris test  AUC     : 0.984362

The PredictorPerformance objects also have a confusion_matrix attribute:

In [6]:

iris_classes = train_performance.confusion_matrix.values
train_confusion_matrix = pd.DataFrame(
    train_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
test_confusion_matrix = pd.DataFrame(
    test_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
print("Iris train confusion matrix:")
display(train_confusion_matrix)

print("Iris test confusion matrix:")
display(test_confusion_matrix)

Iris train confusion matrix:

	Iris-setosa	Iris-versicolor	Iris-virginica
Iris-setosa	38	0	0
Iris-versicolor	0	31	1
Iris-virginica	0	1	34

Iris test confusion matrix:

	Iris-setosa	Iris-versicolor	Iris-virginica
Iris-setosa	12	0	0
Iris-versicolor	0	18	2
Iris-virginica	0	0	13
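The accuracy values reported earlier can be recovered from these confusion matrices: the accuracy is the sum of the diagonal (correct predictions) divided by the total number of records. The sketch below works through this arithmetic with the matrices copied from the outputs above. The helper name accuracy_from_confusion_matrix is only illustrative and is not part of the Khiops API.

```python
# Confusion matrices copied from the outputs above
train_matrix = [
    [38, 0, 0],
    [0, 31, 1],
    [0, 1, 34],
]
test_matrix = [
    [12, 0, 0],
    [0, 18, 2],
    [0, 0, 13],
]


def accuracy_from_confusion_matrix(matrix):
    """Fraction of records on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


train_total = sum(sum(row) for row in train_matrix)
test_total = sum(sum(row) for row in test_matrix)
train_accuracy = accuracy_from_confusion_matrix(train_matrix)
test_accuracy = accuracy_from_confusion_matrix(test_matrix)

print(f"Train records : {train_total}")
print(f"Test records  : {test_total}")
print(f"Train accuracy: {train_accuracy:.6f}")
print(f"Test accuracy : {test_accuracy:.6f}")
```

This gives 103/105 ≈ 0.980952 on train and 43/45 ≈ 0.955556 on test, matching the reported accuracies. The record counts (105 and 45 out of 150) also reflect the 70%/30% train/test split.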

Deploying the Classifier

We are now going to deploy the Iris classifier that we have just trained.

To this end, we use the model dictionary file that the train_predictor function created, in conjunction with the deploy_model core API function. Note that the name of the dictionary for the model is SNB_Iris.

For simplicity, we'll just deploy on the whole data table file (one usually would do this on new data):

In [7]:

iris_deployed_path = "./st_results/iris_deployed.txt"
kh.deploy_model(
    model_kdic_path,     # Path of the model dictionary file
    "SNB_Iris",          # Name of the model dictionary
    iris_table_path,     # Path of the table to deploy the model
    iris_deployed_path,  # Path of the output (deployed) file
)

The deployed data table is stored at the path held in the variable iris_deployed_path. Let's have a look at it:

In [8]:

display(pd.read_csv(iris_deployed_path, sep="\t"))

	PredictedClass	ProbClassIris-setosa	ProbClassIris-versicolor	ProbClassIris-virginica
0	Iris-setosa	0.988190	0.008858	0.002951
1	Iris-setosa	0.988190	0.008858	0.002951
2	Iris-setosa	0.988190	0.008858	0.002951
3	Iris-setosa	0.988190	0.008858	0.002951
4	Iris-setosa	0.988190	0.008858	0.002951
...	...	...	...	...
145	Iris-virginica	0.003303	0.014047	0.982650
146	Iris-virginica	0.003752	0.151320	0.844929
147	Iris-virginica	0.003303	0.014047	0.982650
148	Iris-virginica	0.003303	0.014047	0.982650
149	Iris-virginica	0.003752	0.151320	0.844929

150 rows × 4 columns

The deployed data table file contains four columns:

PredictedClass: contains the class prediction
ProbClassIris-setosa, ProbClassIris-versicolor and ProbClassIris-virginica: contain the predicted probability of each Iris class
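To see how these columns relate to each other, the sketch below takes three rows copied from the deployed table shown above. For each row it checks that PredictedClass coincides with the class that has the highest probability, and that the three probabilities sum to approximately 1. It uses only the standard library. It is an observation about these sample rows, not a statement of how Khiops selects the predicted class in every configuration.

```python
import csv
import io

# A few rows copied from the deployed table shown above
DEPLOYED_SAMPLE = (
    "PredictedClass\tProbClassIris-setosa\t"
    "ProbClassIris-versicolor\tProbClassIris-virginica\n"
    "Iris-setosa\t0.988190\t0.008858\t0.002951\n"
    "Iris-virginica\t0.003303\t0.014047\t0.982650\n"
    "Iris-virginica\t0.003752\t0.151320\t0.844929\n"
)
PROB_PREFIX = "ProbClass"

checks = []
for row in csv.DictReader(io.StringIO(DEPLOYED_SAMPLE), delimiter="\t"):
    # Map each class name to its probability
    probs = {
        column[len(PROB_PREFIX):]: float(value)
        for column, value in row.items()
        if column.startswith(PROB_PREFIX)
    }
    most_probable = max(probs, key=probs.get)
    checks.append(
        (
            row["PredictedClass"] == most_probable,  # prediction is the argmax
            abs(sum(probs.values()) - 1.0) < 1e-4,   # sums to 1 up to rounding
        )
    )

print(checks)
```

The same loop could be run over the full file by opening the path in iris_deployed_path instead of the in-memory sample.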