Core Basics 1: Train, Evaluate and Deploy a Classifier¶

In this lesson we will learn how to train, evaluate and deploy classifiers with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops and defining some helper functions:

import os
import platform
import subprocess
from itertools import islice
from khiops import core as kh

# Define peek helper function
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        # islice stops after n lines instead of loading the whole file
        for line in islice(file, n):
            print(line, end="")
    print("")


# If there are any issues, you may print Khiops status with the following command:
# kh.get_runner().print_status()

Training a Classifier¶

We’ll train a classifier for the Iris dataset. This is a classical dataset containing the data of different plants belonging to the genus Iris. It contains 150 records, 50 for each of three variants of Iris: Setosa, Virginica and Versicolor. The records for each sample contain the length and width of its petal and sepal. The standard task for this dataset is to construct a classifier for the type of Iris taking as inputs the length and width characteristics.

To train a classifier with Khiops, we use two types of files:
- A plain-text delimited data file (for example a csv file)
- A dictionary file which describes the schema of the above data table (.kdic file extension)
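As a rough illustration of the first file type, the sketch below reads a small delimited table with Python's standard csv module. The rows are copied from the Iris sample; the tab delimiter is an assumption about how the sample file is laid out, and the temporary file and the read_delimited helper are hypothetical stand-ins, not part of Khiops.

```python
import csv
import os
import tempfile

# A few rows in the shape of the Iris sample (tab delimiter is an assumption)
sample_rows = [
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass",
    "5.1\t3.5\t1.4\t0.2\tIris-setosa",
    "4.9\t3.0\t1.4\t0.2\tIris-setosa",
]


def read_delimited(file_path, delimiter="\t"):
    """Returns the rows of a delimited text file as a list of dicts"""
    with open(file_path, encoding="utf8", newline="") as file:
        return list(csv.DictReader(file, delimiter=delimiter))


# Write the sample to a temporary file that stands in for the real data file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "iris_sample.txt")
    with open(sample_path, "w", encoding="utf8") as file:
        file.write("\n".join(sample_rows) + "\n")
    records = read_delimited(sample_path)

print(list(records[0].keys()))
print(records[0]["Class"])
```

Each record maps a column name from the header line to its value as a string. The dictionary file is what tells Khiops whether a column is Numerical or Categorical.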

Let’s save, into variables, the locations of these files for the Iris dataset and then take a look at their contents:

iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")

print(f"Iris dictionary file: {iris_kdic}")
peek(iris_kdic)
print(f"Iris data file: {iris_data_file}\n")
peek(iris_data_file)

Iris dictionary file: /github/home/khiops_data/samples/Iris/Iris.kdic

Dictionary  Iris
{
    Numerical       SepalLength     ;
    Numerical       SepalWidth      ;
    Numerical       PetalLength     ;
    Numerical       PetalWidth      ;
    Categorical     Class   ;
};

Iris data file: /github/home/khiops_data/samples/Iris/Iris.txt

SepalLength SepalWidth      PetalLength     PetalWidth      Class
5.1 3.5     1.4     0.2     Iris-setosa
4.9 3.0     1.4     0.2     Iris-setosa
4.7 3.2     1.3     0.2     Iris-setosa
4.6 3.1     1.5     0.2     Iris-setosa
5.0 3.6     1.4     0.2     Iris-setosa
5.4 3.9     1.7     0.4     Iris-setosa
4.6 3.4     1.4     0.3     Iris-setosa
5.0 3.4     1.5     0.2     Iris-setosa
4.4 2.9     1.4     0.2     Iris-setosa

Note that the Iris variant information is in the column Class. Now let’s specify the path to the analysis report file.

analysis_report_file_path_Iris = os.path.join("exercises", "Iris", "AnalysisReport.khj")

print(f"Iris analysis report file path: {analysis_report_file_path_Iris}")

Iris analysis report file path: exercises/Iris/AnalysisReport.khj

We are now ready to train the classifier with the Khiops function train_predictor. This function returns a tuple containing the locations of two files:
- the modeling report (AnalysisReport.khj): a JSON file containing information such as the informativeness of each variable, the variables selected for the model and performance metrics. It is saved at the path stored in the analysis_report_file_path_Iris variable that we just defined.
- the model’s dictionary file (AnalysisReport.model.kdic): an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data.

iris_report, iris_model_kdic = kh.train_predictor(
    iris_kdic,
    dictionary_name="Iris",
    data_table_path=iris_data_file,
    target_variable="Class",
    analysis_report_file_path=analysis_report_file_path_Iris,
    max_trees=0,  # disable tree variables (by default Khiops constructs 10)
)
print(f"Iris report file: {iris_report}")
print(f"Iris modeling dictionary: {iris_model_kdic}")

Iris report file: exercises/Iris/AnalysisReport.khj
Iris modeling dictionary: exercises/Iris/AnalysisReport.model.kdic

Note that iris_report (the first element of the tuple returned by train_predictor) is identical to analysis_report_file_path_Iris.
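Since the report is a JSON file, it can also be opened with Python's standard json module for a quick look. The sketch below only lists the top-level keys. The top_level_keys helper and the keys in the stand-in file are hypothetical, because the actual layout of a .khj report is not described in this lesson.

```python
import json
import os
import tempfile


def top_level_keys(json_file_path):
    """Returns the sorted top-level keys of a JSON file"""
    with open(json_file_path, encoding="utf8") as file:
        return sorted(json.load(file).keys())


# Stand-in file with hypothetical keys; with a real report you would call
# top_level_keys(iris_report) instead
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in_path = os.path.join(tmp_dir, "StandIn.khj")
    with open(stand_in_path, "w", encoding="utf8") as file:
        json.dump({"hypotheticalSectionB": {}, "hypotheticalSectionA": {}}, file)
    keys = top_level_keys(stand_in_path)

print(keys)
```

For anything beyond a quick inspection, the Khiops reading function presented later in this lesson is the more convenient route, because it returns typed objects rather than raw dictionaries.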

In the next sections, we’ll use the file at iris_report to assess the model’s performance and the file at iris_model_kdic to deploy it. Now we can have a look at the report with the Khiops Visualization app:

# To visualize uncomment the line below
# kh.visualize_report(iris_report)

Exercise¶

We’ll repeat the previous steps on the Adult dataset. This dataset contains characteristics of the adult population in the USA, such as age, gender and education. The task is to predict the variable class, which indicates whether the individual earns more or less than 50,000 dollars.

Let’s start by putting, into variables, the paths for the Adult dataset:

adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")

Print the file locations and use the function `peek` to list their contents¶

print(f"Adult dictionary file: {adult_kdic}")
peek(adult_kdic)
print(f"Adult data file: {adult_data_file}\n")
peek(adult_data_file)

Adult dictionary file: /github/home/khiops_data/samples/Adult/Adult.kdic

Dictionary  Adult
{
    Categorical     Label   ;
    Numerical       age     ;
    Categorical     workclass       ;
    Numerical       fnlwgt  ;
    Categorical     education       ;
    Numerical       education_num   ;
    Categorical     marital_status  ;

Adult data file: /github/home/khiops_data/samples/Adult/Adult.txt

Label       age     workclass       fnlwgt  education       education_num   marital_status  occupation      relationship    race    sex     capital_gain    capital_loss    hours_per_week  native_country  class
1   39      State-gov       77516   Bachelors       13      Never-married   Adm-clerical    Not-in-family   White   Male    2174    0       40      United-States   less
2   50      Self-emp-not-inc        83311   Bachelors       13      Married-civ-spouse      Exec-managerial Husband White   Male    0       0       13      United-States   less
3   38      Private 215646  HS-grad 9       Divorced        Handlers-cleaners       Not-in-family   White   Male    0       0       40      United-States   less
4   53      Private 234721  11th    7       Married-civ-spouse      Handlers-cleaners       Husband Black   Male    0       0       40      United-States   less
5   28      Private 338409  Bachelors       13      Married-civ-spouse      Prof-specialty  Wife    Black   Female  0       0       40      Cuba    less
6   37      Private 284582  Masters 14      Married-civ-spouse      Exec-managerial Wife    White   Female  0       0       40      United-States   less
7   49      Private 160187  9th     5       Married-spouse-absent   Other-service   Not-in-family   Black   Female  0       0       16      Jamaica less
8   52      Self-emp-not-inc        209642  HS-grad 9       Married-civ-spouse      Exec-managerial Husband White   Male    0       0       45      United-States   more
9   31      Private 45781   Masters 14      Never-married   Prof-specialty  Not-in-family   White   Female  14084   0       50      United-States   more

We now specify the path to the analysis report file for this exercise:

analysis_report_file_path_Adult = os.path.join(
    "exercises", "Adult", "AnalysisReport.khj"
)

print(f"Adult analysis report file path: {analysis_report_file_path_Adult}")

Adult analysis report file path: exercises/Adult/AnalysisReport.khj

Train a classifier for the `Adult` database¶

Note that the name of the target variable is class (in lower case!). Do not forget to set max_trees=0. Save the resulting file locations into the variables adult_report and adult_model_kdic and print them.

adult_report, adult_model_kdic = kh.train_predictor(
    adult_kdic,
    dictionary_name="Adult",
    data_table_path=adult_data_file,
    target_variable="class",
    analysis_report_file_path=analysis_report_file_path_Adult,
    max_trees=0,
)
print(f"Adult report file: {adult_report}")
print(f"Adult modeling dictionary file: {adult_model_kdic}")

Adult report file: exercises/Adult/AnalysisReport.khj
Adult modeling dictionary file: exercises/Adult/AnalysisReport.model.kdic

Inspect the results with the Khiops Visualization app¶

# To visualize uncomment the line below
# kh.visualize_report(adult_report)

Accessing a Classifier’s Basic Evaluation Metrics¶

We access the classifier’s evaluation metrics by loading the file at iris_report with the Khiops function read_analysis_results_file:

iris_results = kh.read_analysis_results_file(iris_report)
print(type(iris_results))

<class 'khiops.core.analysis_results.AnalysisResults'>

The resulting object is an instance of the AnalysisResults class. The model evaluation reports are stored in its train_evaluation_report and test_evaluation_report attributes, both of class EvaluationReport.

iris_train_eval = iris_results.train_evaluation_report
iris_test_eval = iris_results.test_evaluation_report
print(type(iris_train_eval))
print(type(iris_test_eval))

<class 'khiops.core.analysis_results.EvaluationReport'>
<class 'khiops.core.analysis_results.EvaluationReport'>

We access the default predictor’s metrics with the get_snb_performance method of the evaluation report objects:

iris_train_performance = iris_train_eval.get_snb_performance()
iris_test_performance = iris_test_eval.get_snb_performance()

These objects are of class PredictorPerformance. They expose the accuracy and auc attributes:

print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris test accuracy:  {iris_test_performance.accuracy}")
print("")
print(f"Iris train AUC: {iris_train_performance.auc}")
print(f"Iris test AUC:  {iris_test_performance.auc}")

Iris train accuracy: 0.980952
Iris test accuracy:  0.955556

Iris train AUC: 0.998134
Iris test AUC:  0.984362
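To make the two metrics concrete, here is a minimal sketch of their textbook definitions. Accuracy is the fraction of correctly predicted records. AUC, in the two-class case, is the probability that a randomly chosen positive record gets a higher score than a randomly chosen negative one. This is not how Khiops computes them internally, and the multi-class AUC reported for Iris is not covered. The helper names and the toy data are made up for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of records whose predicted label equals the true label"""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def binary_auc(true_labels, positive_scores, positive_label):
    """Probability that a random positive record outscores a random negative one

    Ties between a positive and a negative score count as half a win.
    """
    positives = [
        s for t, s in zip(true_labels, positive_scores) if t == positive_label
    ]
    negatives = [
        s for t, s in zip(true_labels, positive_scores) if t != positive_label
    ]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up toy labels and scores, for illustration only
toy_true = ["more", "less", "more", "less"]
toy_scores = [0.9, 0.5, 0.4, 0.1]  # score assigned to the "more" class
toy_predicted = ["more" if s >= 0.5 else "less" for s in toy_scores]

print(f"Toy accuracy: {accuracy(toy_true, toy_predicted)}")
print(f"Toy AUC:      {binary_auc(toy_true, toy_scores, 'more')}")
```

On this toy data the accuracy is 0.5 and the AUC is 0.75. Accuracy depends on the decision threshold, whereas AUC only depends on how the scores rank the records.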

Exercise¶

Read the contents of the file at `adult_report` for the Adult analysis and print its type¶

adult_results = kh.read_analysis_results_file(adult_report)
type(adult_results)

khiops.core.analysis_results.AnalysisResults

Save the evaluation reports of the `Adult` classification to the variables `adult_train_eval` and `adult_test_eval`¶

adult_train_eval = adult_results.train_evaluation_report
adult_test_eval = adult_results.test_evaluation_report

Show the model’s train and test accuracies and AUCs¶

adult_train_performance = adult_train_eval.get_snb_performance()
adult_test_performance = adult_test_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult test accuracy:  {adult_test_performance.accuracy}")
print("")
print(f"Adult train AUC: {adult_train_performance.auc}")
print(f"Adult test AUC:  {adult_test_performance.auc}")

Adult train accuracy: 0.86947
Adult test accuracy:  0.86592

Adult train AUC: 0.926153
Adult test AUC:  0.921511

Deploying a Classifier¶

We are going to deploy the Iris classifier we have just trained, applying it to the same dataset it was trained on (normally we would do this on new data). We saved the model in the file at iris_model_kdic. This file is usually large and hard to read, so you should know what you are doing before editing it. Let’s take a quick look at its contents:

peek(iris_model_kdic, 25)

#Khiops 11.0.0-b.0

Dictionary  SNB_Iris
<InitialDictionary="Iris"> <PredictorLabel="Selective Naive Bayes"> <PredictorType="Classifier">
{
Unused      Numerical       SepalLength             ; <Cost=1.38629> <Level=0.331855>
Unused      Numerical       SepalWidth              ; <Cost=1.38629> <Level=0.116679>
Unused      Numerical       PetalLength             ; <Cost=1.38629> <Importance=0.488024> <Level=0.621617> <Weight=0.453125>
Unused      Numerical       PetalWidth              ; <Cost=1.38629> <Importance=0.511976> <Level=0.663031> <Weight=0.5>
Unused      Categorical     Class           ; <TargetVariable>
Unused      Structure(DataGrid)     VClass   = DataGrid(ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 32, 35))     ; <TargetValues>
Unused      Structure(DataGrid)     PPetalLength     = DataGrid(IntervalBounds(3.15, 4.75, 5.15), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26))        ; <Level=0.621617>      // DataGrid(PetalLength, Class)
Unused      Structure(DataGrid)     PPetalWidth      = DataGrid(IntervalBounds(0.75, 1.75), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 31, 1, 0, 2, 33))       ; <Level=0.663031>      // DataGrid(PetalWidth, Class)
Unused      Structure(Classifier)   SNBClass         = SNBClassifier(Vector(0.453125, 0.5), DataGridStats(PPetalLength, PetalLength), DataGridStats(PPetalWidth, PetalWidth), VClass)       ;
    Categorical     PredictedClass   = TargetValue(SNBClass)        ; <Prediction>
Unused      Numerical       ScoreClass       = TargetProb(SNBClass) ; <Score>
    Numerical       ProbClassIris-setosa   = TargetProbAt(SNBClass, "Iris-setosa")        ; <TargetProb1="Iris-setosa">
    Numerical       ProbClassIris-versicolor       = TargetProbAt(SNBClass, "Iris-versicolor")    ; <TargetProb2="Iris-versicolor">
    Numerical       ProbClassIris-virginica        = TargetProbAt(SNBClass, "Iris-virginica")     ; <TargetProb3="Iris-virginica">
};

Note that the modeling dictionary contains four used variables (those not marked Unused):
- PredictedClass: The class with the highest probability according to the model
- ProbClassIris-setosa, ProbClassIris-versicolor, ProbClassIris-virginica: The probabilities of each class according to the model

These will be the columns of the table obtained after deploying the model. This table will be saved at iris_deployment_file.

iris_deployment_file = os.path.join("exercises", "Iris", "iris_deployment.txt")
kh.deploy_model(
    iris_model_kdic,
    dictionary_name="SNB_Iris",
    data_table_path=iris_data_file,
    output_data_table_path=iris_deployment_file,
)

peek(iris_deployment_file)

PredictedClass      ProbClassIris-setosa    ProbClassIris-versicolor        ProbClassIris-virginica
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
Iris-setosa 0.9935139877    0.004559173379  0.001926838879
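As a quick sanity check on how the deployed columns relate to each other, the sketch below takes the first deployed row shown above and confirms that PredictedClass is the class with the highest probability and that the probabilities sum to about 1. The values are copied from the output. Recovering class names by stripping the ProbClass prefix is an assumption based on the column names seen here.

```python
# Header and first row copied from the deployment output above
header = [
    "PredictedClass",
    "ProbClassIris-setosa",
    "ProbClassIris-versicolor",
    "ProbClassIris-virginica",
]
row = ["Iris-setosa", "0.9935139877", "0.004559173379", "0.001926838879"]
record = dict(zip(header, row))

# Recover class names by stripping the "ProbClass" prefix (assumed convention)
prob_prefix = "ProbClass"
class_probs = {
    name[len(prob_prefix):]: float(value)
    for name, value in record.items()
    if name.startswith(prob_prefix)
}

# The predicted class should be the one with the largest probability
most_probable = max(class_probs, key=class_probs.get)
print(f"Most probable class: {most_probable}")
print(f"Matches PredictedClass: {most_probable == record['PredictedClass']}")
print(f"Sum of probabilities: {round(sum(class_probs.values()), 6)}")
```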

Exercise¶

Use the `deploy_model` function to deploy the model stored in the file at `adult_model_kdic`¶

Which columns are deployed?

adult_deployment_file = os.path.join("exercises", "Adult", "adult_deployment.txt")
kh.deploy_model(
    adult_model_kdic,
    dictionary_name="SNB_Adult",
    data_table_path=adult_data_file,
    output_data_table_path=adult_deployment_file,
)
peek(adult_deployment_file)

Predictedclass      Probclassless   Probclassmore
less        0.9999926806    7.319380182e-06
more        0.4107568382    0.5892431618
less        0.9622314248    0.03776857516
less        0.9172269213    0.08277307874
less        0.5833340928    0.4166659072
more        0.2619499457    0.7380500543
less        0.9940101932    0.005989806772
more        0.4199564537    0.5800435463
more        0.001247535351  0.9987524646
