Multi-Table Classifier

In this notebook, we learn how to train a classifier on a simple multi-table dataset. We recommend going through the single-table tutorial first.

In [1]:

import pandas as pd
from sklearn import metrics
from khiops import core as kh
from khiops.sklearn import KhiopsClassifier
from khiops.utils.helpers import train_test_split_dataset

The Accidents Dataset

We'll train a multi-table classifier on the Accidents dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has four tables with the following schema:

Accidents
|
+----0:n----Vehicles
|           |
|           +----0:n----Users 
|
+----0:1----Places

The main table Accidents
The table Vehicles in a 0:n relationship with Accidents
The table Users in a 0:n relationship with Vehicles
The table Places in a 0:1 relationship with Accidents
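
The cardinalities listed above can be checked with pandas by counting child rows per parent key. The sketch below uses small hypothetical tables (the values are made up, not taken from the Accidents data) and a hypothetical helper children_per_parent: a 0:n relation allows any count, while a 0:1 relation allows at most one child row per parent.

```python
import pandas as pd

# Hypothetical toy tables that mimic the schema; the values are made up.
accidents = pd.DataFrame({"AccidentId": [1, 2, 3]})
vehicles = pd.DataFrame(
    {"AccidentId": [1, 1, 2], "VehicleId": ["A01", "B01", "A01"]}
)
places = pd.DataFrame({"AccidentId": [1, 2], "RoadType": ["Communal", "Communal"]})


def children_per_parent(parent, child, key):
    """Count the child rows attached to each parent key (0 when there are none)."""
    counts = child.groupby(key).size()
    return counts.reindex(parent[key].tolist(), fill_value=0)


vehicles_per_accident = children_per_parent(accidents, vehicles, "AccidentId")
places_per_accident = children_per_parent(accidents, places, "AccidentId")

print(vehicles_per_accident.tolist())  # 0:n relation: any count is allowed
print(places_per_accident.tolist())    # 0:1 relation: every count must be 0 or 1
print(bool((places_per_accident <= 1).all()))
```

Running the same helper on the real dataframes would be a quick sanity check before building the dataset specification.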

Let's first check the content of the tables:

In [2]:

# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_datasets = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/10.2.4"
accidents_df = pd.read_csv(f"{url_datasets}/Accidents/Accidents.txt", delimiter='\t')
vehicles_df = pd.read_csv(f"{url_datasets}/Accidents/Vehicles.txt", delimiter='\t')
users_df = pd.read_csv(f"{url_datasets}/Accidents/Users.txt", delimiter='\t')
places_df = pd.read_csv(f"{url_datasets}/Accidents/Places.txt", delimiter='\t', low_memory=False)

# Method 2: Load data locally after downloading all Khiops samples (best for offline use)
# from khiops.tools import download_datasets
# download_datasets() 

# accidents_dataset_dir = f"{kh.get_samples_dir()}/Accidents"
# accidents_df = pd.read_csv(f"{accidents_dataset_dir}/Accidents.txt", sep="\t")
# vehicles_df = pd.read_csv(f"{accidents_dataset_dir}/Vehicles.txt", sep="\t")
# users_df = pd.read_csv(f"{accidents_dataset_dir}/Users.txt", sep="\t")
# places_df = pd.read_csv(f"{accidents_dataset_dir}/Places.txt", sep="\t", low_memory=False)

# Display the first records from each table
print("Accidents table:")
display(accidents_df.head(5))
print("Vehicles table:")
display(vehicles_df.head(5))
print("Users table:")
display(users_df.head(5))
print("Places table:")
display(places_df.head(5))

Accidents table:

	AccidentId	Gravity	Date	Hour	Light	Department	Commune	InAgglomeration	IntersectionType	Weather	CollisionType	PostalAddress	GPSCode	Latitude	Longitude
0	201800000001	NonLethal	2018-01-24	15:05:00	Daylight	590	5	No	Y-type	Normal	2Vehicles-BehindVehicles-Frontal	route des Ansereuilles	M	50.55737	2.55737
1	201800000002	NonLethal	2018-02-12	10:15:00	Daylight	590	11	Yes	Square	VeryGood	NoCollision	Place du général de Gaul	M	50.52936	2.52936
2	201800000003	NonLethal	2018-03-04	11:35:00	Daylight	590	477	Yes	T-type	Normal	NoCollision	Rue nationale	M	50.51243	2.51243
3	201800000004	NonLethal	2018-05-05	17:35:00	Daylight	590	52	Yes	NoIntersection	VeryGood	2Vehicles-Side	30 rue Jules Guesde	M	50.51974	2.51974
4	201800000005	NonLethal	2018-06-26	16:05:00	Daylight	590	477	Yes	NoIntersection	Normal	2Vehicles-Side	72 rue Victor Hugo	M	50.51607	2.51607

Vehicles table:

	AccidentId	VehicleId	Direction	Category	FixedObstacle	MobileObstacle	ImpactPoint	Maneuver
0	201800000001	A01	Unknown	Car<=3.5T	NaN	Vehicle	RightFront	TurnToLeft
1	201800000001	B01	Unknown	Car<=3.5T	NaN	Vehicle	LeftFront	NoDirectionChange
2	201800000002	A01	Unknown	Car<=3.5T	NaN	Pedestrian	NaN	NoDirectionChange
3	201800000003	A01	Unknown	Motorbike>125cm3	StationaryVehicle	Vehicle	Front	NoDirectionChange
4	201800000003	B01	Unknown	Car<=3.5T	NaN	Vehicle	LeftSide	TurnToLeft

Users table:

	AccidentId	VehicleId	Seat	Category	Gender	TripReason	SafetyDevice	SafetyDeviceUsed	PedestrianLocation	PedestrianAction	PedestrianCompany	BirthYear
0	201800000001	A01	1.0	Driver	Male	Leisure	SeatBelt	Yes	NaN	NaN	Unknown	1960.0
1	201800000001	B01	1.0	Driver	Male	NaN	SeatBelt	Yes	NaN	NaN	Unknown	1928.0
2	201800000002	A01	1.0	Driver	Male	NaN	SeatBelt	Yes	NaN	NaN	Unknown	1947.0
3	201800000002	A01	NaN	Pedestrian	Male	NaN	Helmet	NaN	OnLane<=OnSidewalk0mCrossing	Crossing	Alone	1959.0
4	201800000003	A01	1.0	Driver	Male	Leisure	Helmet	Yes	NaN	NaN	Unknown	1987.0

Places table:

	AccidentId	RoadType	RoadNumber	RoadSecNumber	RoadLetter	Circulation	LaneNumber	Slope	RoadMarkerId	RoadMarkerDistance	Layout	StripWidth	LaneWidth	SurfaceCondition	Infrastructure	Localization
0	201800000001	Departamental	41	NaN	C	TwoWay	2.0	Flat	NaN	NaN	RightCurve	NaN	NaN	Normal	Unknown	Lane
1	201800000002	Communal	41	NaN	D	TwoWay	2.0	Flat	NaN	NaN	LeftCurve	NaN	NaN	Normal	Unknown	Lane
2	201800000003	Departamental	39	NaN	D	TwoWay	2.0	Flat	NaN	NaN	Straight	NaN	NaN	Normal	Unknown	Lane
3	201800000004	Departamental	39	NaN	NaN	TwoWay	2.0	Flat	NaN	NaN	Straight	NaN	NaN	Normal	Unknown	Lane
4	201800000005	Communal	NaN	NaN	NaN	OneWay	1.0	Flat	NaN	NaN	Straight	NaN	NaN	Normal	Unknown	Lane

Training the Classifier

We start by creating the X and y inputs for the fit method. For multi-table tasks, KhiopsClassifier takes a multi-table dataset specification, which is a dictionary that describes the schema of the dataset:

In [3]:

X = {
    "main_table": "Accidents",
    "tables": {
        "Accidents": (accidents_df.drop("Gravity", axis=1), "AccidentId"), # We drop the target column "Gravity"
        "Vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
        "Users": (users_df, ["AccidentId", "VehicleId"]),
        "Places": (places_df, "AccidentId"),
    },
    "relations": [
        ("Accidents", "Vehicles"),
        ("Vehicles", "Users"),
        ("Accidents", "Places", True),
    ],
}
y = accidents_df["Gravity"]

Note that the main table has a single key column (AccidentId), while the secondary table Vehicles has two (AccidentId and VehicleId).

To describe the relations between tables, the field relations must be added to the specification dictionary. This field is a list of tuples of the form

(<parent table name>, <child table name>)

A third element set to True, as in ("Accidents", "Places", True), marks the relation as 0:1 instead of 0:n.
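
Before handing such a dictionary to the classifier, it can help to check that it is internally consistent. The helper below, check_dataset_spec, is hypothetical and not part of the khiops library. It assumes that every relation must name known tables and that a child table's key should start with its parent's key columns, which is our reading of the specification above rather than a documented rule.

```python
def check_dataset_spec(spec):
    """Return a list of problems found in a multi-table spec dictionary.

    Hypothetical helper for illustration; it is not part of the khiops library.
    """
    problems = []
    tables = spec.get("tables", {})
    if spec.get("main_table") not in tables:
        problems.append("main_table is not listed in tables")

    def key_columns(name):
        # Each table entry is (data, key), where key is a name or a list of names.
        key = tables[name][1]
        return [key] if isinstance(key, str) else list(key)

    for relation in spec.get("relations", []):
        parent, child = relation[0], relation[1]
        for name in (parent, child):
            if name not in tables:
                problems.append(f"unknown table in relation: {name}")
        if parent in tables and child in tables:
            parent_key = key_columns(parent)
            # Assumed rule: the child key starts with the parent key columns.
            if key_columns(child)[: len(parent_key)] != parent_key:
                problems.append(f"key of {child} does not extend key of {parent}")
    return problems


# Same shape as the Accidents specification, with None standing in for the dataframes.
spec = {
    "main_table": "Accidents",
    "tables": {
        "Accidents": (None, "AccidentId"),
        "Vehicles": (None, ["AccidentId", "VehicleId"]),
        "Users": (None, ["AccidentId", "VehicleId"]),
        "Places": (None, "AccidentId"),
    },
    "relations": [
        ("Accidents", "Vehicles"),
        ("Vehicles", "Users"),
        ("Accidents", "Places", True),
    ],
}
print(check_dataset_spec(spec))

# A relation that points to a table missing from "tables" is reported.
broken_spec = dict(spec, relations=[("Accidents", "Drivers")])
print(check_dataset_spec(broken_spec))
```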

The khiops library provides the helper function train_test_split_dataset that splits a multi-table specification into two specs for train and test:

In [4]:

X_train, X_test, y_train, y_test = train_test_split_dataset(X, y, random_state=123)
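
To build intuition for what splitting a multi-table dataset involves, here is a minimal pandas sketch on hypothetical toy tables. It is not the implementation of train_test_split_dataset; we only assume that the main table is split first and that each secondary row follows its parent key.

```python
import pandas as pd

# Hypothetical toy tables; the values are made up.
accidents = pd.DataFrame(
    {"AccidentId": [1, 2, 3, 4], "Light": ["Day", "Night", "Day", "Day"]}
)
vehicles = pd.DataFrame(
    {
        "AccidentId": [1, 1, 2, 4, 4],
        "VehicleId": ["A01", "B01", "A01", "A01", "B01"],
    }
)

# Split the main table first (fixed seed for reproducibility)...
test_main = accidents.sample(frac=0.25, random_state=123)
train_main = accidents.drop(test_main.index)

# ...then route each secondary row to the side that holds its parent key.
train_vehicles = vehicles[vehicles["AccidentId"].isin(train_main["AccidentId"])]
test_vehicles = vehicles[vehicles["AccidentId"].isin(test_main["AccidentId"])]

print(len(train_main), len(test_main))
print(len(train_vehicles), len(test_vehicles))
```

Splitting on the main-table key keeps every accident, with all of its vehicles, entirely on one side of the split, so no information about a test accident leaks into training.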

We now fit our classifier on the train split. By default, Khiops creates at most 100 multi-table variables (n_features) and 10 random decision trees (n_trees). For this example, we raise the first limit to 1000 and disable the trees:

In [5]:

khc = KhiopsClassifier(n_features=1000, n_trees=0)
khc.fit(X_train, y_train)

Out[5]:

KhiopsClassifier(n_features=1000, n_trees=0)
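
A multi-table variable summarizes the secondary rows of each main-table record as a single value. The pandas sketch below builds three such aggregates by hand on hypothetical toy tables. The column names and the choice of aggregates are ours, for illustration only; the variables Khiops actually constructs and selects are listed in its learning report.

```python
import pandas as pd

# Hypothetical toy secondary tables; the values are made up.
vehicles = pd.DataFrame(
    {"AccidentId": [1, 1, 2], "Category": ["Car", "Motorbike", "Car"]}
)
users = pd.DataFrame(
    {"AccidentId": [1, 1, 2, 2], "BirthYear": [1960, 1928, 1947, 1959]}
)

# One row per accident, one column per hand-made aggregate.
features = pd.DataFrame(
    {
        "VehicleCount": vehicles.groupby("AccidentId").size(),
        "DistinctCategories": vehicles.groupby("AccidentId")["Category"].nunique(),
        "MeanUserBirthYear": users.groupby("AccidentId")["BirthYear"].mean(),
    }
)
print(features)
```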


Displaying the Classifier's Training Accuracy and AUC

The fit method calculates evaluation metrics on the training dataset. We access them via the estimator's attribute model_report_, which is an instance of the AnalysisResults class:

In [6]:

train_performance = khc.model_report_.train_evaluation_report.get_snb_performance()
print(f"Accidents train accuracy: {train_performance.accuracy}")
print(f"Accidents train auc     : {train_performance.auc}")

Accidents train accuracy: 0.944205
Accidents train auc     : 0.845932

The PredictorPerformance object also has a confusion matrix attribute:

In [7]:

confusion_matrix = pd.DataFrame(
    train_performance.confusion_matrix.matrix,
    columns=train_performance.confusion_matrix.values,
    index=train_performance.confusion_matrix.values,
)
print("AccidentsSummary train confusion matrix:")
confusion_matrix

AccidentsSummary train confusion matrix:

Out[7]:

	Lethal	NonLethal
Lethal	69	52
NonLethal	2366	40850
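
As a quick consistency check, the train accuracy reported earlier can be recomputed from this matrix: accuracy is the sum of the diagonal divided by the total count, whichever axis holds the predictions. The sketch below copies the four counts by hand.

```python
# Counts copied from the confusion matrix displayed above.
matrix = [
    [69, 52],
    [2366, 40850],
]

correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.6f}")
```

The result, 40919 / 43337, rounds to 0.944205 and matches the train accuracy printed by the report.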

If you have installed the Khiops Visualization app you may explore the full learning report by executing the code below.

In [8]:

# Uncomment the lines below
# khc.export_report_file("./adult_report.khj")
# kh.visualize_report("./adult_report.khj")

Deploying the Classifier and Displaying Its Test Performance

Now that we have a fitted KhiopsClassifier, we deploy it on the test split.

It offers two kinds of predictions:

predict returns the predicted class of each accident.
predict_proba returns the predicted probability of each class.

Let's compute both for the Gravity target:

In [9]:

y_pred_test = khc.predict(X_test)
y_probas_test = khc.predict_proba(X_test)
print("Classes:")
display(khc.classes_)
print()
print("Predictions (first 10 values):")
display(y_pred_test[:10])
print()
print("Probabilities (first 10 rows):")
display(y_probas_test[:10,])

Classes:

array(['Lethal', 'NonLethal'], dtype='<U9')

Predictions (first 10 values):

array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal',
       'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'],
      dtype='<U9')

Probabilities (first 10 rows):

array([[3.09302957e-02, 9.69069704e-01],
       [1.56023274e-01, 8.43976726e-01],
       [3.33082936e-03, 9.96669171e-01],
       [8.29378593e-04, 9.99170621e-01],
       [5.21136609e-02, 9.47886339e-01],
       [4.55396978e-03, 9.95446030e-01],
       [1.46438522e-01, 8.53561478e-01],
       [4.71464454e-03, 9.95285355e-01],
       [5.65335988e-03, 9.94346640e-01],
       [4.26662644e-02, 9.57333736e-01]])
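
The two outputs are linked: the columns of the probability array follow the order of classes_, and under the usual scikit-learn convention (which we assume also holds here) the predicted label is the class with the highest probability. The NumPy sketch below applies that rule to the first three probability rows shown above.

```python
import numpy as np

# Class order and the first three probability rows, copied from the output above.
classes = np.array(["Lethal", "NonLethal"])
probas = np.array(
    [
        [3.09302957e-02, 9.69069704e-01],
        [1.56023274e-01, 8.43976726e-01],
        [3.33082936e-03, 9.96669171e-01],
    ]
)

# Pick, for each row, the class whose column holds the largest probability.
labels = classes[np.argmax(probas, axis=1)]
print(labels)

# Each row is a probability distribution over the classes.
print(np.allclose(probas.sum(axis=1), 1.0))
```

All three rows give NonLethal, in line with the predictions displayed above.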

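From these predictions we compute the test accuracy and AUC scores with sklearn.metrics. Since the target is binary, roc_auc_score takes the probabilities of a single class, here the second column (NonLethal):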

In [10]:

accuracy_test = metrics.accuracy_score(y_test, y_pred_test)
auc_test = metrics.roc_auc_score(y_test, y_probas_test[:,1])
print(f"Accidents test accuracy: {accuracy_test}")
print(f"Accidents test auc     : {auc_test}")

Accidents test accuracy: 0.9475979509898934
Accidents test auc     : 0.831591515559879