Multi-Table Classifier
In this notebook, we will learn how to train a classifier on a simple multi-table dataset. We recommend going through the single-table tutorial first.
import pandas as pd
from sklearn import metrics
from khiops import core as kh
from khiops.sklearn import KhiopsClassifier
from khiops.utils.helpers import train_test_split_dataset
The Accidents Dataset
We'll train a multi-table classifier on the Accidents dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has four tables with the following schema:
Accidents
|
+----0:n----Vehicles
| |
| +----0:n----Users
|
+----0:1----Places
- The main table Accidents
- The table Vehicles in a 0:n relationship with Accidents
- The table Users in a 0:n relationship with Vehicles
- The table Places in a 0:1 relationship with Accidents
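To make these cardinalities concrete, the sketch below counts child rows per accident on toy tables. The values are made up for illustration; only the key column names come from the dataset.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values, real key names)
accidents = pd.DataFrame({"AccidentId": [1, 2, 3]})
vehicles = pd.DataFrame(
    {"AccidentId": [1, 1, 2], "VehicleId": ["A01", "B01", "A01"]}
)
places = pd.DataFrame({"AccidentId": [1, 2]})

# Number of child rows per accident; accidents without children get 0
vehicles_per_accident = (
    vehicles.groupby("AccidentId").size().reindex(accidents["AccidentId"], fill_value=0)
)
places_per_accident = (
    places.groupby("AccidentId").size().reindex(accidents["AccidentId"], fill_value=0)
)

# 0:n relationship: any non-negative count is allowed
print(vehicles_per_accident.tolist())  # [2, 1, 0]
# 0:1 relationship: at most one child row per accident
print(bool(places_per_accident.max() <= 1))  # True
```

The same two checks can be run on the real tables once they are loaded.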
Let's first check the content of the tables:
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_datasets = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/10.2.4"
accidents_df = pd.read_csv(f"{url_datasets}/Accidents/Accidents.txt", delimiter='\t')
vehicles_df = pd.read_csv(f"{url_datasets}/Accidents/Vehicles.txt", delimiter='\t')
users_df = pd.read_csv(f"{url_datasets}/Accidents/Users.txt", delimiter='\t')
places_df = pd.read_csv(f"{url_datasets}/Accidents/Places.txt", delimiter='\t', low_memory=False)
# Method 2: Load data locally after downloading all Khiops samples (best for offline use)
# from khiops.tools import download_datasets
# download_datasets()
# accidents_dataset_dir = f"{kh.get_samples_dir()}/Accidents"
# accidents_df = pd.read_csv(f"{accidents_dataset_dir}/Accidents.txt", sep="\t")
# vehicles_df = pd.read_csv(f"{accidents_dataset_dir}/Vehicles.txt", sep="\t")
# users_df = pd.read_csv(f"{accidents_dataset_dir}/Users.txt", sep="\t")
# places_df = pd.read_csv(f"{accidents_dataset_dir}/Places.txt", sep="\t", low_memory=False)
# Display the first records from each table
print("Accidents table:")
display(accidents_df.head(5))
print("Vehicles table:")
display(vehicles_df.head(5))
print("Users table:")
display(users_df.head(5))
print("Places table:")
display(places_df.head(5))
Accidents table:
| | AccidentId | Gravity | Date | Hour | Light | Department | Commune | InAgglomeration | IntersectionType | Weather | CollisionType | PostalAddress | GPSCode | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | NonLethal | 2018-01-24 | 15:05:00 | Daylight | 590 | 5 | No | Y-type | Normal | 2Vehicles-BehindVehicles-Frontal | route des Ansereuilles | M | 50.55737 | 2.55737 |
1 | 201800000002 | NonLethal | 2018-02-12 | 10:15:00 | Daylight | 590 | 11 | Yes | Square | VeryGood | NoCollision | Place du général de Gaul | M | 50.52936 | 2.52936 |
2 | 201800000003 | NonLethal | 2018-03-04 | 11:35:00 | Daylight | 590 | 477 | Yes | T-type | Normal | NoCollision | Rue nationale | M | 50.51243 | 2.51243 |
3 | 201800000004 | NonLethal | 2018-05-05 | 17:35:00 | Daylight | 590 | 52 | Yes | NoIntersection | VeryGood | 2Vehicles-Side | 30 rue Jules Guesde | M | 50.51974 | 2.51974 |
4 | 201800000005 | NonLethal | 2018-06-26 | 16:05:00 | Daylight | 590 | 477 | Yes | NoIntersection | Normal | 2Vehicles-Side | 72 rue Victor Hugo | M | 50.51607 | 2.51607 |
Vehicles table:
| | AccidentId | VehicleId | Direction | Category | PassengerNumber | FixedObstacle | MobileObstacle | ImpactPoint | Maneuver |
---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | A01 | Unknown | Car<=3.5T | 0 | NaN | Vehicle | RightFront | TurnToLeft |
1 | 201800000001 | B01 | Unknown | Car<=3.5T | 0 | NaN | Vehicle | LeftFront | NoDirectionChange |
2 | 201800000002 | A01 | Unknown | Car<=3.5T | 0 | NaN | Pedestrian | NaN | NoDirectionChange |
3 | 201800000003 | A01 | Unknown | Motorbike>125cm3 | 0 | StationaryVehicle | Vehicle | Front | NoDirectionChange |
4 | 201800000003 | B01 | Unknown | Car<=3.5T | 0 | NaN | Vehicle | LeftSide | TurnToLeft |
Users table:
| | AccidentId | VehicleId | Seat | Category | Gender | TripReason | SafetyDevice | SafetyDeviceUsed | PedestrianLocation | PedestrianAction | PedestrianCompany | BirthYear |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | A01 | 1.0 | Driver | Male | Leisure | SeatBelt | Yes | NaN | NaN | Unknown | 1960.0 |
1 | 201800000001 | B01 | 1.0 | Driver | Male | NaN | SeatBelt | Yes | NaN | NaN | Unknown | 1928.0 |
2 | 201800000002 | A01 | 1.0 | Driver | Male | NaN | SeatBelt | Yes | NaN | NaN | Unknown | 1947.0 |
3 | 201800000002 | A01 | NaN | Pedestrian | Male | NaN | Helmet | NaN | OnLane<=OnSidewalk0mCrossing | Crossing | Alone | 1959.0 |
4 | 201800000003 | A01 | 1.0 | Driver | Male | Leisure | Helmet | Yes | NaN | NaN | Unknown | 1987.0 |
Places table:
| | AccidentId | RoadType | RoadNumber | RoadSecNumber | RoadLetter | Circulation | LaneNumber | SpecialLane | Slope | RoadMarkerId | RoadMarkerDistance | Layout | StripWidth | LaneWidth | SurfaceCondition | Infrastructure | Localization | SchoolNear |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | Departamental | 41 | NaN | C | TwoWay | 2.0 | 0 | Flat | NaN | NaN | RightCurve | NaN | NaN | Normal | Unknown | Lane | 0.0 |
1 | 201800000002 | Communal | 41 | NaN | D | TwoWay | 2.0 | 0 | Flat | NaN | NaN | LeftCurve | NaN | NaN | Normal | Unknown | Lane | 0.0 |
2 | 201800000003 | Departamental | 39 | NaN | D | TwoWay | 2.0 | 0 | Flat | NaN | NaN | Straight | NaN | NaN | Normal | Unknown | Lane | 0.0 |
3 | 201800000004 | Departamental | 39 | NaN | NaN | TwoWay | 2.0 | 0 | Flat | NaN | NaN | Straight | NaN | NaN | Normal | Unknown | Lane | 0.0 |
4 | 201800000005 | Communal | NaN | NaN | NaN | OneWay | 1.0 | 0 | Flat | NaN | NaN | Straight | NaN | NaN | Normal | Unknown | Lane | 0.0 |
Training the Classifier
We start by creating our X and y for the fit method. For multi-table tasks, KhiopsClassifier takes a multi-table dataset specification, which is a dictionary that describes the schema of the dataset:
X = {
    "main_table": "Accidents",
    "tables": {
        "Accidents": (accidents_df.drop("Gravity", axis=1), "AccidentId"),  # We drop the target column "Gravity"
        "Vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
        "Users": (users_df, ["AccidentId", "VehicleId"]),
        "Places": (places_df, "AccidentId"),
    },
    "relations": [
        ("Accidents", "Vehicles"),
        ("Vehicles", "Users"),
        ("Accidents", "Places", True),
    ],
}
y = accidents_df["Gravity"]
Note that the main table has one key (AccidentId) and the secondary table Vehicles has two (AccidentId and VehicleId).
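To describe relations between tables, the field relations must be added to the dictionary of table specifications. This field is a list of tuples of the form (<parent table name>, <child table name>). A third element set to True, as in the Places entry above, marks a 0:1 relationship instead of the default 0:n.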
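In the specification above, each child table's key starts with the key of its parent. The sketch below checks that property on the key names alone. It is a hand-written sanity check under that assumption, not the validation performed by khiops, and check_relations is a hypothetical helper.

```python
def as_list(key):
    """Normalize a key given as a single column name or a list of names."""
    return [key] if isinstance(key, str) else list(key)


def check_relations(keys, relations):
    """Return the (parent, child) pairs whose child key does not start with the parent key."""
    problems = []
    for relation in relations:
        parent, child = relation[0], relation[1]  # an optional third element is ignored
        parent_key = as_list(keys[parent])
        child_key = as_list(keys[child])
        if child_key[: len(parent_key)] != parent_key:
            problems.append((parent, child))
    return problems


# Key names and relations copied from the specification above
keys = {
    "Accidents": "AccidentId",
    "Vehicles": ["AccidentId", "VehicleId"],
    "Users": ["AccidentId", "VehicleId"],
    "Places": "AccidentId",
}
relations = [
    ("Accidents", "Vehicles"),
    ("Vehicles", "Users"),
    ("Accidents", "Places", True),
]
print(check_relations(keys, relations))  # []
```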
The khiops library provides the helper function train_test_split_dataset, which splits a multi-table specification into two specifications, one for train and one for test:
X_train, X_test, y_train, y_test = train_test_split_dataset(X, y, random_state=123)
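Conceptually, splitting a multi-table dataset means splitting the rows of the main table and then sending each secondary row to the side where its parent landed. The sketch below shows this idea with plain pandas on toy tables. It is not the implementation of train_test_split_dataset, and the toy values are hypothetical.

```python
import pandas as pd

# Toy main and secondary tables (hypothetical values)
main = pd.DataFrame({"AccidentId": [1, 2, 3, 4], "Light": ["Day", "Night", "Day", "Day"]})
vehicles = pd.DataFrame(
    {
        "AccidentId": [1, 1, 2, 3, 4, 4],
        "VehicleId": ["A01", "B01", "A01", "A01", "A01", "B01"],
    }
)

# Split the main table: 25% of the accidents go to the test side
test_main = main.sample(frac=0.25, random_state=123)
train_main = main.drop(test_main.index)

# Secondary rows follow their parent accident
train_vehicles = vehicles[vehicles["AccidentId"].isin(train_main["AccidentId"])]
test_vehicles = vehicles[vehicles["AccidentId"].isin(test_main["AccidentId"])]

print(len(train_main), len(test_main))
print(len(train_vehicles), len(test_vehicles))
```

Splitting on the main table key keeps all the vehicles of one accident on the same side, so no accident leaks between train and test.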
We now fit our classifier on the train split. By default, Khiops creates at most 100 multi-table variables (n_features) and 10 random decision trees (n_trees). We change these values for this example:
khc = KhiopsClassifier(n_features=1000, n_trees=0)
khc.fit(X_train, y_train)
KhiopsClassifier(n_features=1000, n_trees=0)
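A multi-table variable summarizes the secondary rows of each main-table record as a single value, for example a count or a mean. Khiops builds such aggregates automatically. The sketch below builds two of them by hand with pandas on hypothetical toy data, only to show the kind of column that n_features bounds.

```python
import pandas as pd

# Toy Vehicles table (hypothetical values, real column names)
vehicles = pd.DataFrame(
    {
        "AccidentId": [1, 1, 2],
        "VehicleId": ["A01", "B01", "A01"],
        "PassengerNumber": [0, 2, 1],
    }
)

# One row per accident, one column per hand-made aggregate
aggregates = (
    vehicles.groupby("AccidentId")
    .agg(
        VehicleCount=("VehicleId", "count"),
        MeanPassengerNumber=("PassengerNumber", "mean"),
    )
    .reset_index()
)
print(aggregates)
```

The resulting table has the same granularity as Accidents, which is what lets a classifier on the main table use information from the secondary tables.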
Displaying the Classifier's Training Accuracy and AUC
The fit method calculates evaluation metrics on the training dataset. We access them via the estimator's attribute model_report_, which is an instance of the AnalysisResults class. Let's check this out:
train_performance = khc.model_report_.train_evaluation_report.get_snb_performance()
print(f"Accidents train accuracy: {train_performance.accuracy}")
print(f"Accidents train auc : {train_performance.auc}")
Accidents train accuracy: 0.944205
Accidents train auc : 0.845932
The PredictorPerformance object also has a confusion matrix attribute:
confusion_matrix = pd.DataFrame(
train_performance.confusion_matrix.matrix,
columns=train_performance.confusion_matrix.values,
index=train_performance.confusion_matrix.values,
)
print("Accidents train confusion matrix:")
confusion_matrix
Accidents train confusion matrix:
| | Lethal | NonLethal |
---|---|---|
Lethal | 69 | 52 |
NonLethal | 2366 | 40850 |
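As a consistency check, the train accuracy reported earlier can be recovered from this matrix: the diagonal holds the correctly classified records, whichever axis holds the predictions. A minimal sketch, with the counts copied from the table:

```python
# Confusion matrix counts copied from the table above
matrix = [
    [69, 52],
    [2366, 40850],
]

# Correct predictions are on the diagonal
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(f"{accuracy:.6f}")  # 0.944205
```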
If you have installed the Khiops Visualization app you may explore the full learning report by executing the code below.
# Uncomment the lines below
# khc.export_report_file("./accidents_report.khj")
# kh.visualize_report("./accidents_report.khj")
Deploying the Classifier and Displaying Its Test Performance
Now that we have a fitted KhiopsClassifier, we deploy it on the test split. It offers two kinds of predictions:
- class labels, obtained with its predict method
- class probabilities, obtained with its predict_proba method
Let's predict the Gravity labels and their probabilities:
y_pred_test = khc.predict(X_test)
y_probas_test = khc.predict_proba(X_test)
print("Classes:")
display(khc.classes_)
print()
print("Predictions (first 10 values):")
display(y_pred_test[:10])
print()
print("Probabilities (first 10 rows):")
display(y_probas_test[:10,])
Classes:
array(['Lethal', 'NonLethal'], dtype='<U9')
Predictions (first 10 values):
array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'], dtype='<U9')
Probabilities (first 10 rows):
array([[3.09302957e-02, 9.69069704e-01],
       [1.56023274e-01, 8.43976726e-01],
       [3.33082936e-03, 9.96669171e-01],
       [8.29378593e-04, 9.99170621e-01],
       [5.21136609e-02, 9.47886339e-01],
       [4.55396978e-03, 9.95446030e-01],
       [1.46438522e-01, 8.53561478e-01],
       [4.71464454e-03, 9.95285355e-01],
       [5.65335988e-03, 9.94346640e-01],
       [4.26662644e-02, 9.57333736e-01]])
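The two outputs are consistent with each other: in the rows shown, the predicted label is the class with the highest probability. The sketch below reproduces this on the first three probability rows, assuming (as these outputs suggest, but the notebook does not state) that predict picks the most probable class.

```python
# Class order and probability rows copied from the outputs above
classes = ["Lethal", "NonLethal"]
probas = [
    [3.09302957e-02, 9.69069704e-01],
    [1.56023274e-01, 8.43976726e-01],
    [3.33082936e-03, 9.96669171e-01],
]

# For each row, take the class whose column holds the largest probability
labels = [classes[row.index(max(row))] for row in probas]
print(labels)  # ['NonLethal', 'NonLethal', 'NonLethal']
```

The column order of the probabilities follows the order of khc.classes_, which is why the second column corresponds to NonLethal.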
From these predictions we compute the test accuracy and AUC scores using sklearn.metrics. The AUC is computed from the second probability column, which corresponds to the class NonLethal:
accuracy_test = metrics.accuracy_score(y_test, y_pred_test)
auc_test = metrics.roc_auc_score(y_test, y_probas_test[:,1])
print(f"Accidents test accuracy: {accuracy_test}")
print(f"Accidents test auc : {auc_test}")
Accidents test accuracy: 0.9475979509898934
Accidents test auc : 0.831591515559879
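The AUC can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it that way on hypothetical toy labels and scores. The helper pairwise_auc is illustrative and is not part of sklearn or khiops.

```python
def pairwise_auc(labels, scores, positive):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == positive]
    neg = [s for label, s in zip(labels, scores) if label != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and P(NonLethal) scores, not taken from the dataset
toy_labels = ["NonLethal", "Lethal", "NonLethal", "NonLethal", "Lethal"]
toy_scores = [0.97, 0.84, 0.99, 0.60, 0.30]

auc = pairwise_auc(toy_labels, toy_scores, positive="NonLethal")
print(f"{auc:.4f}")  # 0.8333
```

Here five of the six (NonLethal, Lethal) pairs are ordered correctly; the only miss is the NonLethal record scored 0.60 against the Lethal record scored 0.84.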