Core Basics 1: Train, Evaluate and Deploy a Classifier
In this lesson we will learn how to train, evaluate and deploy classifiers with Khiops.
Make sure you have installed Khiops and Khiops Visualization.
We start by importing Khiops and defining some helper functions:
import os
import platform
import subprocess
from khiops import core as kh
# Define peek helper function
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        for line in file.readlines()[:n]:
            print(line, end="")
    print("")
# If there are any issues, you may print Khiops status with the following command:
# kh.get_runner().print_status()
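The peek helper reads the whole file before keeping the first n lines, which is fine for the small sample files used here. For larger data tables, a lazy variant could look like the sketch below. The names head_lines and peek_lazy are hypothetical and not part of Khiops; the sketch only relies on itertools.islice from the standard library.

```python
from itertools import islice


def head_lines(file_path, n=10):
    """Return the first n lines of a file without reading the rest."""
    with open(file_path, encoding="utf8", errors="replace") as file:
        # islice stops after n lines, so the remainder is never loaded
        return list(islice(file, n))


def peek_lazy(file_path, n=10):
    """Print the first n lines of a file (lazy counterpart of peek)."""
    for line in head_lines(file_path, n):
        print(line, end="")
    print("")
```

It can be used as a drop-in replacement for peek, for example peek_lazy(iris_data_file, 5).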
Training a Classifier
We’ll train a classifier for the Iris dataset. This is a classical dataset containing data on different plants belonging to the genus Iris. It contains 150 records, 50 for each of three variants of Iris: Setosa, Virginica and Versicolor. Each record contains the length and width of the plant's petal and sepal. The standard task for this dataset is to construct a classifier for the type of Iris, taking the length and width characteristics as inputs.
Now to train a classifier with Khiops, we use two types of files:
- A plain-text delimited data file (for example a csv file)
- A dictionary file which describes the schema of the above data table (.kdic file extension)
Let’s save the locations of these files for the Iris dataset into variables and then take a look at their contents:
iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
print(f"Iris dictionary file: {iris_kdic}")
peek(iris_kdic)
print(f"Iris data file: {iris_data_file}\n")
peek(iris_data_file)
Iris dictionary file: /github/home/khiops_data/samples/Iris/Iris.kdic
Dictionary Iris
{
Numerical SepalLength ;
Numerical SepalWidth ;
Numerical PetalLength ;
Numerical PetalWidth ;
Categorical Class ;
};
Iris data file: /github/home/khiops_data/samples/Iris/Iris.txt
SepalLength SepalWidth PetalLength PetalWidth Class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
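To make the link between the two files concrete, the sketch below extracts the variable names from a simple dictionary and compares them with the header of the data file. It is an illustration only: parse_simple_kdic and read_header are hypothetical helpers, the regular expression handles only plain "Numerical Name ;" and "Categorical Name ;" lines like those shown above (not the full .kdic syntax), and the data file is assumed to be tab-separated.

```python
import csv
import io
import re

# Matches simple declarations such as "Numerical SepalLength ;"
_VARIABLE_LINE = re.compile(r"^\s*(Numerical|Categorical)\s+(\w+)\s*;")

# Inline copies of the contents printed above (tab separator is an assumption)
IRIS_KDIC = """Dictionary Iris
{
Numerical SepalLength ;
Numerical SepalWidth ;
Numerical PetalLength ;
Numerical PetalWidth ;
Categorical Class ;
};
"""
IRIS_DATA = (
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
)


def parse_simple_kdic(text):
    """Return [(name, type), ...] for plain Numerical/Categorical variables."""
    variables = []
    for line in text.splitlines():
        match = _VARIABLE_LINE.match(line)
        if match:
            variables.append((match.group(2), match.group(1)))
    return variables


def read_header(data_text, delimiter="\t"):
    """Return the column names found on the first line of a delimited text."""
    return next(csv.reader(io.StringIO(data_text), delimiter=delimiter))


variables = parse_simple_kdic(IRIS_KDIC)
header = read_header(IRIS_DATA)
print("Header matches dictionary:", [name for name, _ in variables] == header)
```

For real files, the two strings would be replaced by the contents read from iris_kdic and iris_data_file.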
Note that the Iris variant information is in the column Class. Now let’s specify the path to the analysis report file.
analysis_report_file_path_Iris = os.path.join("exercises", "Iris", "AnalysisReport.khj")
print(f"Iris analysis report file path: {analysis_report_file_path_Iris}")
Iris analysis report file path: exercises/Iris/AnalysisReport.khj
We are now ready to train the classifier with the Khiops function train_predictor. This function returns a tuple containing the locations of two files:
- the modeling report (AnalysisReport.khj): A JSON file containing information such as the informativeness of each variable, the variables selected for the model and performance metrics. It is saved at the path stored in the analysis_report_file_path_Iris variable that we just defined.
- the model’s dictionary file (AnalysisReport.model.kdic): This file is an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data.
iris_report, iris_model_kdic = kh.train_predictor(
    iris_kdic,
    dictionary_name="Iris",
    data_table_path=iris_data_file,
    target_variable="Class",
    analysis_report_file_path=analysis_report_file_path_Iris,
    max_trees=0,  # by default Khiops constructs 10 decision tree variables
)
print(f"Iris report file: {iris_report}")
print(f"Iris modeling dictionary: {iris_model_kdic}")
Iris report file: exercises/Iris/AnalysisReport.khj
Iris modeling dictionary: exercises/Iris/AnalysisReport.model.kdic
Note that iris_report (the first element of the tuple returned by train_predictor) is identical to analysis_report_file_path_Iris.
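In the output above, the modeling dictionary sits next to the report and shares its base name, with the .khj extension replaced by .model.kdic. The sketch below reproduces that observed naming with pathlib. This is only an observation from this lesson's output, not a documented guarantee, and expected_model_kdic_path is a hypothetical helper: in practice the path returned by train_predictor is the one to use.

```python
from pathlib import Path


def expected_model_kdic_path(report_path):
    """Guess the modeling dictionary path from the report path.

    Assumption: mirrors the naming seen in this lesson's output
    (AnalysisReport.khj -> AnalysisReport.model.kdic).
    """
    return Path(report_path).with_suffix(".model.kdic")


guess = expected_model_kdic_path("exercises/Iris/AnalysisReport.khj")
print(guess.as_posix())
```

Such a helper can serve as a sanity check after training, for example by testing whether the guessed path exists on disk.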
In the next sections, we’ll use the file at iris_report to assess the model’s performance and the file at iris_model_kdic to deploy it. Now we can have a look at the report with the Khiops Visualization app:
# To visualize uncomment the line below
# kh.visualize_report(iris_report)
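Since the modeling report is a JSON file, it can also be opened with Python's standard json module when the Visualization app is not at hand. The sketch below lists only the top-level keys and makes no assumption about their names. top_level_keys is a hypothetical helper, and reading the file as UTF-8 with replacement of undecodable bytes is an assumption that mirrors the peek helper.

```python
import json


def top_level_keys(json_path):
    """Return the sorted top-level keys of a JSON file (empty if not an object)."""
    with open(json_path, encoding="utf8", errors="replace") as file:
        content = json.load(file)
    return sorted(content) if isinstance(content, dict) else []


# To list the sections of the report uncomment the line below
# print(top_level_keys(iris_report))
```

For regular use, the dedicated Khiops function for reading analysis results is the more convenient route, since it returns structured objects rather than raw dictionaries.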
Exercise
We’ll repeat the previous steps on the Adult dataset. This dataset contains characteristics of the adult population in the USA, such as age, gender and education. The task is to predict the variable class, which indicates whether the individual earns more or less than 50,000 dollars.
Let’s start by saving the paths for the Adult dataset into variables:
adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")
Print the file locations and use the function peek to list their contents
print(f"Adult dictionary file: {adult_kdic}")
peek(adult_kdic)
print(f"Adult data file: {adult_data_file}\n")
peek(adult_data_file)
Adult dictionary file: /github/home/khiops_data/samples/Adult/Adult.kdic
Dictionary Adult
{
Categorical Label ;
Numerical age ;
Categorical workclass ;
Numerical fnlwgt ;
Categorical education ;
Numerical education_num ;
Categorical marital_status ;
Adult data file: /github/home/khiops_data/samples/Adult/Adult.txt
Label age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country class
1 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States less
2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States less
3 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States less
4 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States less
5 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba less
6 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States less
7 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica less
8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States more
9 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States more
We now specify the path to the analysis report file for this exercise:
analysis_report_file_path_Adult = os.path.join(
    "exercises", "Adult", "AnalysisReport.khj"
)
print(f"Adult analysis report file path: {analysis_report_file_path_Adult}")
Adult analysis report file path: exercises/Adult/AnalysisReport.khj
Train a classifier for the Adult dataset
Note that the name of the target variable is class (in lower case!). Do not forget to set max_trees=0. Save the resulting file locations into the variables adult_report and adult_model_kdic and print them.
adult_report, adult_model_kdic = kh.train_predictor(
    adult_kdic,
    dictionary_name="Adult",
    data_table_path=adult_data_file,
    target_variable="class",
    analysis_report_file_path=analysis_report_file_path_Adult,
    max_trees=0,
)
print(f"Adult report file: {adult_report}")
print(f"Adult modeling dictionary file: {adult_model_kdic}")
Adult report file: exercises/Adult/AnalysisReport.khj
Adult modeling dictionary file: exercises/Adult/AnalysisReport.model.kdic
Inspect the results with the Khiops Visualization app
# To visualize uncomment the line below
# kh.visualize_report(adult_report)
Accessing a Classifier’s Basic Evaluation Metrics
We access the classifier’s evaluation metrics by loading the file at iris_report with the Khiops function read_analysis_results_file:
iris_results = kh.read_analysis_results_file(iris_report)
print(type(iris_results))
<class 'khiops.core.analysis_results.AnalysisResults'>
The resulting object is an instance of the AnalysisResults class. The model evaluation reports are stored in its train_evaluation_report and test_evaluation_report attributes, which are of class EvaluationReport.
iris_train_eval = iris_results.train_evaluation_report
iris_test_eval = iris_results.test_evaluation_report
print(type(iris_train_eval))
print(type(iris_test_eval))
<class 'khiops.core.analysis_results.EvaluationReport'>
<class 'khiops.core.analysis_results.EvaluationReport'>
We access the default predictor’s metrics with the get_snb_performance method of the evaluation report objects:
iris_train_performance = iris_train_eval.get_snb_performance()
iris_test_performance = iris_test_eval.get_snb_performance()
These objects are of class PredictorPerformance. They expose the accuracy and auc attributes:
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris test accuracy: {iris_test_performance.accuracy}")
print("")
print(f"Iris train AUC: {iris_train_performance.auc}")
print(f"Iris test AUC: {iris_test_performance.auc}")
Iris train accuracy: 0.980952
Iris test accuracy: 0.955556
Iris train AUC: 0.998134
Iris test AUC: 0.984362
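As a sanity check, the printed accuracies can be related to record counts. The sketch below assumes that the train set holds 105 of the 150 records, which is inferred from the class frequencies (38, 32, 35) that appear in the modeling dictionary printed in the deployment section, and that the remaining 45 records form the test set. The lesson does not state these counts, so they are an assumption, and correct_counts is a hypothetical helper.

```python
def correct_counts(accuracy, n_records, digits=6):
    """Return every count k such that k / n_records rounds to the given accuracy."""
    return [
        k for k in range(n_records + 1)
        if round(k / n_records, digits) == accuracy
    ]


n_total = 150
n_train = 38 + 32 + 35       # assumption: class frequencies of the train set
n_test = n_total - n_train   # assumption: the rest is the test set

print("Train:", correct_counts(0.980952, n_train), "correct out of", n_train)
print("Test:", correct_counts(0.955556, n_test), "correct out of", n_test)
```

Under these assumptions, each accuracy corresponds to a single whole number of correctly classified records, which suggests the split sizes are consistent with the reported metrics.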
Exercise
Read the contents of the file at adult_report for the Adult analysis and print its type
adult_results = kh.read_analysis_results_file(adult_report)
type(adult_results)
khiops.core.analysis_results.AnalysisResults
Save the evaluation reports of the Adult classification to the variables adult_train_eval and adult_test_eval
adult_train_eval = adult_results.train_evaluation_report
adult_test_eval = adult_results.test_evaluation_report
Show the model’s train and test accuracies and AUCs
adult_train_performance = adult_train_eval.get_snb_performance()
adult_test_performance = adult_test_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult test accuracy: {adult_test_performance.accuracy}")
print("")
print(f"Adult train AUC: {adult_train_performance.auc}")
print(f"Adult test AUC: {adult_test_performance.auc}")
Adult train accuracy: 0.86947
Adult test accuracy: 0.86592
Adult train AUC: 0.926153
Adult test AUC: 0.921511
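A common way to read these numbers is to compare train and test values: a large gap would hint at overfitting. The sketch below computes the gaps for any pair of objects exposing accuracy and auc attributes. generalization_gaps is a hypothetical helper, and the SimpleNamespace objects are stand-ins that carry the values printed above in place of real PredictorPerformance objects.

```python
from types import SimpleNamespace


def generalization_gaps(train_perf, test_perf, metrics=("accuracy", "auc")):
    """Return {metric: train value - test value} for the given metrics."""
    return {
        metric: getattr(train_perf, metric) - getattr(test_perf, metric)
        for metric in metrics
    }


# Stand-ins holding the Adult values printed above
train_stub = SimpleNamespace(accuracy=0.86947, auc=0.926153)
test_stub = SimpleNamespace(accuracy=0.86592, auc=0.921511)

for metric, gap in generalization_gaps(train_stub, test_stub).items():
    print(f"Adult {metric} gap (train - test): {gap:.6f}")
```

With the objects from the exercise, the call would be generalization_gaps(adult_train_performance, adult_test_performance).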
Deploying a Classifier
We are going to deploy the Iris classifier we have just trained, applying it to the same dataset it was trained on (normally we would deploy on new data). We saved the model in the file at iris_model_kdic. This file is usually large and hard to read, so you should know what you are doing before editing it. Let’s take a quick look at its contents:
peek(iris_model_kdic, 25)
#Khiops 11.0.0-b.0
Dictionary SNB_Iris
<InitialDictionary="Iris"> <PredictorLabel="Selective Naive Bayes"> <PredictorType="Classifier">
{
Unused Numerical SepalLength ; <Cost=1.38629> <Level=0.331855>
Unused Numerical SepalWidth ; <Cost=1.38629> <Level=0.116679>
Unused Numerical PetalLength ; <Cost=1.38629> <Importance=0.488024> <Level=0.621617> <Weight=0.453125>
Unused Numerical PetalWidth ; <Cost=1.38629> <Importance=0.511976> <Level=0.663031> <Weight=0.5>
Unused Categorical Class ; <TargetVariable>
Unused Structure(DataGrid) VClass = DataGrid(ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 32, 35)) ; <TargetValues>
Unused Structure(DataGrid) PPetalLength = DataGrid(IntervalBounds(3.15, 4.75, 5.15), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26)) ; <Level=0.621617> // DataGrid(PetalLength, Class)
Unused Structure(DataGrid) PPetalWidth = DataGrid(IntervalBounds(0.75, 1.75), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 31, 1, 0, 2, 33)) ; <Level=0.663031> // DataGrid(PetalWidth, Class)
Unused Structure(Classifier) SNBClass = SNBClassifier(Vector(0.453125, 0.5), DataGridStats(PPetalLength, PetalLength), DataGridStats(PPetalWidth, PetalWidth), VClass) ;
Categorical PredictedClass = TargetValue(SNBClass) ; <Prediction>
Unused Numerical ScoreClass = TargetProb(SNBClass) ; <Score>
Numerical `ProbClassIris-setosa` = TargetProbAt(SNBClass, "Iris-setosa") ; <TargetProb1="Iris-setosa">
Numerical `ProbClassIris-versicolor` = TargetProbAt(SNBClass, "Iris-versicolor") ; <TargetProb2="Iris-versicolor">
Numerical `ProbClassIris-virginica` = TargetProbAt(SNBClass, "Iris-virginica") ; <TargetProb3="Iris-virginica">
};
Note that the modeling dictionary contains 4 used variables:
- PredictedClass: The class with the highest probability according to the model
- ProbClassIris-setosa, ProbClassIris-versicolor, ProbClassIris-virginica: The probabilities of each class according to the model
These will be the columns of the table obtained after deploying the model. This table will be saved at iris_deployment_file.
iris_deployment_file = os.path.join("exercises", "Iris", "iris_deployment.txt")
kh.deploy_model(
    iris_model_kdic,
    dictionary_name="SNB_Iris",
    data_table_path=iris_data_file,
    output_data_table_path=iris_deployment_file,
)
peek(iris_deployment_file)
PredictedClass ProbClassIris-setosa ProbClassIris-versicolor ProbClassIris-virginica
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
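The deployed table can be checked against the description of its columns: the class probabilities of a row should sum to about 1, and PredictedClass should be the class with the highest probability. The sketch below runs both checks on an inline copy of the first row shown above. It assumes the deployment file is tab-separated, and check_deployment_rows is a hypothetical helper, not a Khiops function.

```python
import csv
import io

# Inline copy of the header and first row above (tab separator is an assumption)
SAMPLE_DEPLOYMENT = (
    "PredictedClass\tProbClassIris-setosa\tProbClassIris-versicolor\tProbClassIris-virginica\n"
    "Iris-setosa\t0.9935139877\t0.004559173379\t0.001926838879\n"
)


def check_deployment_rows(text, prob_prefix="ProbClass", predicted_column="PredictedClass"):
    """Return one (probabilities sum to 1, prediction is the most probable class) pair per row."""
    results = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # Map each class name to its probability by stripping the column prefix
        probs = {
            name[len(prob_prefix):]: float(value)
            for name, value in row.items()
            if name.startswith(prob_prefix)
        }
        sums_to_one = abs(sum(probs.values()) - 1.0) < 1e-6
        best_class = max(probs, key=probs.get)
        results.append((sums_to_one, best_class == row[predicted_column]))
    return results


print(check_deployment_rows(SAMPLE_DEPLOYMENT))
```

To run the check on the real output, the inline string would be replaced by the text read from iris_deployment_file.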
Exercise
Use the deploy_model function to deploy the model stored in the file at adult_model_kdic
Which columns are deployed?
adult_deployment_file = os.path.join("exercises", "Adult", "adult_deployment.txt")
kh.deploy_model(
    adult_model_kdic,
    dictionary_name="SNB_Adult",
    data_table_path=adult_data_file,
    output_data_table_path=adult_deployment_file,
)
peek(adult_deployment_file)
Predictedclass Probclassless Probclassmore
less 0.9999926806 7.319380182e-06
more 0.4107568382 0.5892431618
less 0.9622314248 0.03776857516
less 0.9172269213 0.08277307874
less 0.5833340928 0.4166659072
more 0.2619499457 0.7380500543
less 0.9940101932 0.005989806772
more 0.4199564537 0.5800435463
more 0.001247535351 0.9987524646
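Because the deployed table contains the class probabilities and not only the predicted label, a different decision rule can be applied afterwards. The sketch below applies a custom threshold to the Probclassmore values listed above. classify_with_threshold is a hypothetical helper and the stricter threshold of 0.7 is an arbitrary illustration, not a recommendation. For these nine rows, a threshold of 0.5 reproduces the Predictedclass column.

```python
# Probclassmore values of the nine rows shown above
PROB_MORE = [
    7.319380182e-06, 0.5892431618, 0.03776857516, 0.08277307874, 0.4166659072,
    0.7380500543, 0.005989806772, 0.5800435463, 0.9987524646,
]


def classify_with_threshold(prob_more, threshold=0.5):
    """Label a record 'more' when its probability reaches the threshold."""
    return "more" if prob_more >= threshold else "less"


default_labels = [classify_with_threshold(p) for p in PROB_MORE]
strict_labels = [classify_with_threshold(p, threshold=0.7) for p in PROB_MORE]
print("threshold 0.5:", default_labels)
print("threshold 0.7:", strict_labels)
```

Raising the threshold trades recall for precision on the "more" class, which can be useful when false positives are costly.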