Multi-Table Classifier¶
In this notebook, we will learn how to train a classifier with complex relational data where a secondary table is itself a parent tables of another table (ie. snowflake schema). It is recommended to see the single table tutorial first.
import pandas as pd
from khiops.sklearn import KhiopsClassifier
from sklearn.metrics import accuracy_score
Training a Multi-Table Classifier¶
We'll train a multi-table classifier on a the dataset Accidents
organized in the following relational snowflake schema.
Accidents
|
| -- 1:n -- Vehicles
|
|-- 1:n -- Users
Note that the target variable is Gravity
.
To train the KhiopsClassifier for this setup, we must specify a multi-table dataset.
Let's first check the content of the tables:
- The main table
Accidents
- The secondary table
Vehicles
- The tertiary table
Users
Main table Accidents
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_accidents = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/AccidentsSummary/Accidents.txt"
accidents_df = pd.read_csv(url_accidents, delimiter='\t', encoding="latin1")
# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#If the samples have not been downloaded yet:
#from khiops.tools import download_datasets
#download_datasets()
#
#from os import path
#from khiops import core as kh
#accidents_dataset_path = path.join(kh.get_samples_dir(), "AccidentsSummary")
#accidents_df = pd.read_csv(path.join(accidents_dataset_path, "Accidents.txt"),sep="\t",encoding="latin1")
# Display the first 10 records from the dataset
accidents_df.head(10)
AccidentId | Gravity | Date | Hour | Light | Department | Commune | InAgglomeration | IntersectionType | Weather | CollisionType | PostalAddress | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | NonLethal | 2018-01-24 | 15:05:00 | Daylight | 590 | 5 | No | Y-type | Normal | 2Vehicles-BehindVehicles-Frontal | route des Ansereuilles |
1 | 201800000002 | NonLethal | 2018-02-12 | 10:15:00 | Daylight | 590 | 11 | Yes | Square | VeryGood | NoCollision | Place du général de Gaul |
2 | 201800000003 | NonLethal | 2018-03-04 | 11:35:00 | Daylight | 590 | 477 | Yes | T-type | Normal | NoCollision | Rue nationale |
3 | 201800000004 | NonLethal | 2018-05-05 | 17:35:00 | Daylight | 590 | 52 | Yes | NoIntersection | VeryGood | 2Vehicles-Side | 30 rue Jules Guesde |
4 | 201800000005 | NonLethal | 2018-06-26 | 16:05:00 | Daylight | 590 | 477 | Yes | NoIntersection | Normal | 2Vehicles-Side | 72 rue Victor Hugo |
5 | 201800000006 | NonLethal | 2018-09-23 | 06:30:00 | TwilightOrDawn | 590 | 52 | Yes | NoIntersection | LightRain | Other | D39 |
6 | 201800000007 | NonLethal | 2018-09-26 | 00:40:00 | NightStreelightsOn | 590 | 133 | Yes | NoIntersection | Normal | Other | 4 route de camphin |
7 | 201800000008 | Lethal | 2018-11-30 | 17:15:00 | NightStreelightsOn | 590 | 11 | Yes | NoIntersection | Normal | Other | rue saint exupéry |
8 | 201800000009 | NonLethal | 2018-02-18 | 15:57:00 | Daylight | 590 | 550 | No | NoIntersection | Normal | Other | rue de l'égalité |
9 | 201800000010 | NonLethal | 2018-03-19 | 15:30:00 | Daylight | 590 | 51 | Yes | X-type | Normal | 2Vehicles-BehindVehicles-Frontal | face au 59 rue de Lille |
Secondary table Vehicules
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url_vehicule = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/AccidentsSummary/Vehicles.txt"
vehicles_df = pd.read_csv(url_vehicule, delimiter='\t', encoding="latin1")
# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#vehicles_df = pd.read_csv(path.join(accidents_dataset_path, "Vehicles.txt"), sep="\t")
# Display the first 10 records from the dataset
vehicles_df.head(10)
AccidentId | VehicleId | Direction | Category | PassengerNumber | FixedObstacle | MobileObstacle | ImpactPoint | Maneuver | |
---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | A01 | Unknown | Car<=3.5T | 0 | None | Vehicle | RightFront | TurnToLeft |
1 | 201800000001 | B01 | Unknown | Car<=3.5T | 0 | None | Vehicle | LeftFront | NoDirectionChange |
2 | 201800000002 | A01 | Unknown | Car<=3.5T | 0 | None | Pedestrian | None | NoDirectionChange |
3 | 201800000003 | A01 | Unknown | Motorbike>125cm3 | 0 | StationaryVehicle | Vehicle | Front | NoDirectionChange |
4 | 201800000003 | B01 | Unknown | Car<=3.5T | 0 | None | Vehicle | LeftSide | TurnToLeft |
5 | 201800000003 | C01 | Unknown | Car<=3.5T | 0 | None | None | RightSide | Parked |
6 | 201800000004 | A01 | Unknown | Car<=3.5T | 0 | None | Other | RightFront | Avoidance |
7 | 201800000004 | B01 | Unknown | Bicycle | 0 | None | Vehicle | LeftSide | None |
8 | 201800000005 | A01 | Unknown | Moped | 0 | None | Vehicle | RightFront | PassLeft |
9 | 201800000005 | B01 | Unknown | Car<=3.5T | 0 | None | Vehicle | LeftFront | Park |
Tertiary table Users
# Method 1: Load data directly from GitHub (recommended for quick tests or small datasets)
url = "https://raw.githubusercontent.com/KhiopsML/khiops-samples/main/Accidents/Users.txt"
users_df = pd.read_csv(url, delimiter='\t', encoding="latin1").drop("Gravity", axis=1)
# Method 2: Load data locally after downloading all Khiops samples (best for offline use or large datasets)
#users_df = pd.read_csv(path.join(accidents_dataset_path, "Users.txt"), sep="\t", encoding="latin1").drop("Gravity", axis=1)
# Display the first 10 records from the dataset
users_df.head(10)
AccidentId | VehicleId | Seat | Category | Gender | TripReason | SafetyDevice | SafetyDeviceUsed | PedestrianLocation | PedestrianAction | PedestrianCompany | BirthYear | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | A01 | 1.0 | Driver | Male | Leisure | SeatBelt | Yes | None | None | Unknown | 1960.0 |
1 | 201800000001 | B01 | 1.0 | Driver | Male | None | SeatBelt | Yes | None | None | Unknown | 1928.0 |
2 | 201800000002 | A01 | 1.0 | Driver | Male | None | SeatBelt | Yes | None | None | Unknown | 1947.0 |
3 | 201800000002 | A01 | NaN | Pedestrian | Male | None | Helmet | NaN | OnLane<=OnSidewalk0mCrossing | Crossing | Alone | 1959.0 |
4 | 201800000003 | A01 | 1.0 | Driver | Male | Leisure | Helmet | Yes | None | None | Unknown | 1987.0 |
5 | 201800000003 | C01 | 1.0 | Driver | Male | None | ChildrenDevice | NaN | None | None | Unknown | 1977.0 |
6 | 201800000004 | A01 | 1.0 | Driver | Male | Leisure | SeatBelt | Yes | None | None | Unknown | 1982.0 |
7 | 201800000004 | B01 | 1.0 | Driver | Male | Leisure | Helmet | NaN | None | None | Unknown | 2013.0 |
8 | 201800000005 | A01 | 1.0 | Driver | Male | Leisure | Helmet | Yes | None | None | Unknown | 2001.0 |
9 | 201800000005 | B01 | 1.0 | Driver | Male | Leisure | SeatBelt | Yes | None | None | Unknown | 1946.0 |
Creating the train dataset¶
In the context of multi-table data, we cannot use the train_test_split
from sklearn.model_selection. This is because it fails to account for the interconnected nature of the data, specially it does not preserve the integrity of the AccitendId
identifiers across all the tables, risking inconsistent training and testing sets.
# we first split the features set and the targets
accidents_main_df = accidents_df.drop("Gravity", axis=1)
y_accidents_train = accidents_df["Gravity"]
Create the multi-table dataset specification¶
Note the main table has one key AccidentId
. Tables Vehicles
(the secondary table) and Users
(the tertiary table) have two keys AccidentId
and VehicleId
.
To describe relations between tables, the field relations
must be added to the dictionary of table specifications. This field
is a list of pairs of tables of the form
(<parent table name>, <child table name>)
X_accidents_train = {
"main_table": "Accidents",
"tables": {
"Accidents": (accidents_main_df, "AccidentId"),
"Vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
"Users": (users_df, ["AccidentId", "VehicleId"]),
},
"relations": [
("Accidents", "Vehicles"),
("Vehicles", "Users"),
],
}
Train a classifier with this dataset¶
By default, the number of features n_features
to be created by the Khiops AutoML engine is set to 100 and the number of n_trees
to 10. We can change it for this example:
pkc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
pkc_accidents.fit(X_accidents_train, y_accidents_train)
KhiopsClassifier(n_features=1000, n_trees=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KhiopsClassifier(n_features=1000, n_trees=0)
Print the accuracy and auc of the model¶
To get the train performances, we use the model's attribute model_report
introduced a the previous "Single Table Classifier" Tutorial.
accidents_train_performance = (
pkc_accidents.model_report_.train_evaluation_report.get_snb_performance()
)
print(f"Accidents train accuracy: {accidents_train_performance.accuracy}")
print(f"Accidents train auc : {accidents_train_performance.auc}")
Accidents train accuracy: 0.944949 Accidents train auc : 0.841914
Deploy the classifier to obtain predictions on the training data¶
Note that usually one deploys the model on new test data. But as we eplain above, we cannot easily use de split_train_test
method of sklearn with relational data. So, we deploy on the train dataset to keep the tutorial simple.
accidents_pred = pkc_accidents.predict(X_accidents_train)
accidents_pred[:10]
array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'], dtype='<U9')
From these predictions we can compute the accuracy score using sklearn.metrics
# from sklearn.metrics
accuracy_score(y_accidents_train, accidents_pred)
0.9449492065140266