Multi-Table Classifier with the core API
In this notebook, we will learn how to train a classifier for a simple multi-table dataset. We recommend going through the single-table tutorial first and understanding the basics of Khiops dictionary files.
import warnings
import pandas as pd
from khiops import core as kh
from khiops.tools import download_datasets
# Download the sample datasets from GitHub if not available
warnings.filterwarnings("ignore", message="Download.*") # Ignore dataset download warning
download_datasets()
The Accidents Dataset
We'll train a multi-table classifier on the `Accidents` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has four tables with the following schema:
    Accident
    |
    +----0:n----Vehicle
    |           |
    |           +----0:n----User
    |
    +----0:1----Place
- The main table `Accident`
- The table `Vehicle` in a `0:n` relationship with `Accident`
- The table `User` in a `0:n` relationship with `Vehicle`
- The table `Place` in a `0:1` relationship with `Accident`
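To make these relationship types concrete, here is a minimal sketch in plain Python (it does not use the Khiops API). The records are a hand-made toy sample with hypothetical identifiers, not rows of the real dataset; the point is only to show how secondary records attach to a main record through its key.

```python
from collections import defaultdict

# Toy records (hypothetical values), keyed like the Accidents tables
accidents = ["A1", "A2", "A3"]
vehicles = [("A1", "V1"), ("A1", "V2"), ("A2", "V1")]  # (AccidentId, VehicleId)
places = {"A1": "Communal", "A3": "Departamental"}  # AccidentId -> RoadType

# 0:n relationship: group the vehicles under their parent accident
vehicles_by_accident = defaultdict(list)
for accident_id, vehicle_id in vehicles:
    vehicles_by_accident[accident_id].append(vehicle_id)

for accident_id in accidents:
    # An accident may have zero, one or many vehicles ...
    accident_vehicles = vehicles_by_accident.get(accident_id, [])
    # ... but at most one place (0:1 relationship)
    place = places.get(accident_id)
    print(accident_id, accident_vehicles, place)
```

In this toy sample, `A3` has no vehicle and `A2` has no place, which is exactly what the `0` in `0:n` and `0:1` allows.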
Let's first check the content of the tables:
# Store the locations of the `Accidents` dataset tables
accidents_table_path = f"{kh.get_samples_dir()}/Accidents/Accidents.txt"
vehicles_table_path = f"{kh.get_samples_dir()}/Accidents/Vehicles.txt"
users_table_path = f"{kh.get_samples_dir()}/Accidents/Users.txt"
places_table_path = f"{kh.get_samples_dir()}/Accidents/Places.txt"
# Print the first lines of the data files
print("Accidents table:")
display(pd.read_csv(accidents_table_path, sep="\t").head(5))
print("Vehicles table:")
display(pd.read_csv(vehicles_table_path, sep="\t").head(5))
print("Users table:")
display(pd.read_csv(users_table_path, sep="\t").head(5))
print("Places table:")
display(pd.read_csv(places_table_path, sep="\t", low_memory=False).head(5))
Accidents table:
| | AccidentId | Gravity | Date | Hour | Light | Department | Commune | InAgglomeration | IntersectionType | Weather | CollisionType | PostalAddress | GPSCode | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | NonLethal | 2018-01-24 | 15:05:00 | Daylight | 590 | 5 | No | Y-type | Normal | 2Vehicles-BehindVehicles-Frontal | route des Ansereuilles | M | 50.55737 | 2.55737 |
1 | 201800000002 | NonLethal | 2018-02-12 | 10:15:00 | Daylight | 590 | 11 | Yes | Square | VeryGood | NoCollision | Place du général de Gaul | M | 50.52936 | 2.52936 |
2 | 201800000003 | NonLethal | 2018-03-04 | 11:35:00 | Daylight | 590 | 477 | Yes | T-type | Normal | NoCollision | Rue nationale | M | 50.51243 | 2.51243 |
3 | 201800000004 | NonLethal | 2018-05-05 | 17:35:00 | Daylight | 590 | 52 | Yes | NoIntersection | VeryGood | 2Vehicles-Side | 30 rue Jules Guesde | M | 50.51974 | 2.51974 |
4 | 201800000005 | NonLethal | 2018-06-26 | 16:05:00 | Daylight | 590 | 477 | Yes | NoIntersection | Normal | 2Vehicles-Side | 72 rue Victor Hugo | M | 50.51607 | 2.51607 |
Vehicles table:
| | AccidentId | VehicleId | Direction | Category | PassengerNumber | FixedObstacle | MobileObstacle | ImpactPoint | Maneuver |
---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | A01 | Unknown | Car<=3.5T | 0 | NaN | Vehicle | RightFront | TurnToLeft |
1 | 201800000001 | B01 | Unknown | Car<=3.5T | 0 | NaN | Vehicle | LeftFront | NoDirectionChange |
2 | 201800000002 | A01 | Unknown | Car<=3.5T | 0 | NaN | Pedestrian | NaN | NoDirectionChange |
3 | 201800000003 | A01 | Unknown | Motorbike>125cm3 | 0 | StationaryVehicle | Vehicle | Front | NoDirectionChange |
4 | 201800000003 | B01 | Unknown | Car<=3.5T | 0 | NaN | Vehicle | LeftSide | TurnToLeft |
Users table:
| | AccidentId | VehicleId | Seat | Category | Gender | TripReason | SafetyDevice | SafetyDeviceUsed | PedestrianLocation | PedestrianAction | PedestrianCompany | BirthYear |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | A01 | 1.0 | Driver | Male | Leisure | SeatBelt | Yes | NaN | NaN | Unknown | 1960.0 |
1 | 201800000001 | B01 | 1.0 | Driver | Male | NaN | SeatBelt | Yes | NaN | NaN | Unknown | 1928.0 |
2 | 201800000002 | A01 | 1.0 | Driver | Male | NaN | SeatBelt | Yes | NaN | NaN | Unknown | 1947.0 |
3 | 201800000002 | A01 | NaN | Pedestrian | Male | NaN | Helmet | NaN | OnLane<=OnSidewalk0mCrossing | Crossing | Alone | 1959.0 |
4 | 201800000003 | A01 | 1.0 | Driver | Male | Leisure | Helmet | Yes | NaN | NaN | Unknown | 1987.0 |
Places table:
| | AccidentId | RoadType | RoadNumber | RoadSecNumber | RoadLetter | Circulation | LaneNumber | SpecialLane | Slope | RoadMarkerId | RoadMarkerDistance | Layout | StripWidth | LaneWidth | SurfaceCondition | Infrastructure | Localization | SchoolNear |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201800000001 | Departamental | 41 | NaN | C | TwoWay | 2.0 | 0 | Flat | NaN | NaN | RightCurve | NaN | NaN | Normal | Unknown | Lane | 0.0 |
1 | 201800000002 | Communal | 41 | NaN | D | TwoWay | 2.0 | 0 | Flat | NaN | NaN | LeftCurve | NaN | NaN | Normal | Unknown | Lane | 0.0 |
2 | 201800000003 | Departamental | 39 | NaN | D | TwoWay | 2.0 | 0 | Flat | NaN | NaN | Straight | NaN | NaN | Normal | Unknown | Lane | 0.0 |
3 | 201800000004 | Departamental | 39 | NaN | NaN | TwoWay | 2.0 | 0 | Flat | NaN | NaN | Straight | NaN | NaN | Normal | Unknown | Lane | 0.0 |
4 | 201800000005 | Communal | NaN | NaN | NaN | OneWay | 1.0 | 0 | Flat | NaN | NaN | Straight | NaN | NaN | Normal | Unknown | Lane | 0.0 |
To train a classifier with the Khiops core API, we must describe the multi-table dataset.
Its schema is specified in a Khiops dictionary file. Let's look at its contents for the `Accidents` dataset:
accidents_kdic_path = f"{kh.get_samples_dir()}/Accidents/Accidents.kdic"
with open(accidents_kdic_path) as accidents_kdic_file:
    print(accidents_kdic_file.read())
Root Dictionary Accident(AccidentId)
{
    Categorical AccidentId;
    Categorical Gravity;
    Date Date;
    Time Hour;
    Categorical Light;
    Categorical Department;
    Categorical Commune;
    Categorical InAgglomeration;
    Categorical IntersectionType;
    Categorical Weather;
    Categorical CollisionType;
    Categorical PostalAddress;
    Categorical GPSCode;
    Numerical Latitude;
    Numerical Longitude;
    Entity(Place) Place;
    Table(Vehicle) Vehicles;
};

Dictionary Place(AccidentId)
{
    Categorical AccidentId;
    Categorical RoadType;
    Categorical RoadNumber;
    Categorical RoadSecNumber;
    Categorical RoadLetter;
    Categorical Circulation;
    Numerical LaneNumber;
    Categorical SpecialLane;
    Categorical Slope;
    Categorical RoadMarkerId;
    Numerical RoadMarkerDistance;
    Categorical Layout;
    Numerical StripWidth;
    Numerical LaneWidth;
    Categorical SurfaceCondition;
    Categorical Infrastructure;
    Categorical Localization;
    Categorical SchoolNear;
};

Dictionary Vehicle(AccidentId, VehicleId)
{
    Categorical AccidentId;
    Categorical VehicleId;
    Categorical Direction;
    Categorical Category;
    Numerical PassengerNumber;
    Categorical FixedObstacle;
    Categorical MobileObstacle;
    Categorical ImpactPoint;
    Categorical Maneuver;
    Table(User) Users;
};

Dictionary User(AccidentId, VehicleId)
{
    Categorical AccidentId;
    Categorical VehicleId;
    Categorical Seat;
    Categorical Category;
    Categorical Gender;
    Categorical TripReason;
    Categorical SafetyDevice;
    Categorical SafetyDeviceUsed;
    Categorical PedestrianLocation;
    Categorical PedestrianAction;
    Categorical PedestrianCompany;
    Numerical BirthYear;
};
Note that the `Accident` dictionary contains a special `Table` variable, which creates a `0:n` relation. The target table is given as its argument, between parentheses (`Vehicle`). Likewise, the `Entity` variable creates a `0:1` relation, here with `Place`.
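As an illustration of how these relation variables encode the schema, the following sketch extracts them from an excerpt of the dictionary text with regular expressions. This is a plain-text heuristic written for this tutorial, not the Khiops dictionary parser: the patterns only assume the simple layout printed above, and the `0:n`/`0:1` labels follow the schema of this dataset.

```python
import re

# Excerpt of the dictionary file shown above
kdic_text = """
Root Dictionary Accident(AccidentId)
{
    Categorical AccidentId;
    Entity(Place) Place;
    Table(Vehicle) Vehicles;
};

Dictionary Vehicle(AccidentId, VehicleId)
{
    Categorical AccidentId;
    Categorical VehicleId;
    Table(User) Users;
};
"""

# Capture "Dictionary <name>(<keys>) { <body> }" blocks
dictionary_pattern = re.compile(
    r"Dictionary\s+(\w+)\s*\(([^)]*)\)\s*\{(.*?)\}", re.S
)
# Capture "Table(<target>) <variable>;" and "Entity(<target>) <variable>;"
relation_pattern = re.compile(r"\b(Table|Entity)\((\w+)\)\s+(\w+)\s*;")

relations = []
for name, _keys, body in dictionary_pattern.findall(kdic_text):
    for kind, target, variable in relation_pattern.findall(body):
        cardinality = "0:n" if kind == "Table" else "0:1"
        relations.append((name, variable, target, cardinality))

for relation in relations:
    print(relation)
```

Each tuple reads as (owning dictionary, variable name, target dictionary, cardinality). The variable names (`Vehicles`, `Users`, `Place`) are the ones that appear in the data paths used later in this tutorial.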
Training the Classifier
While the dictionary file specifies the table schemas and their relations, it does not contain any information about the data files. In a single-table task, the third mandatory parameter of `train_predictor` specifies the data table file. For multi-table tasks, this parameter still specifies the main table; the remaining tables are specified with the optional parameter `additional_data_tables`.
The `additional_data_tables` parameter is a Python `dict` whose keys are the data paths of the secondary tables and whose values are their file paths (in our case three pairs, one per secondary table). For more information about data paths, see the basics of Khiops dictionary files.
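Judging from the keys used in the training call of this tutorial, a data path joins the main dictionary name and the names of the table variables leading to a secondary table, separated by a backtick. The sketch below builds these paths from a nested description of the schema. `build_data_paths` is a hypothetical helper written for illustration, not part of the Khiops API, and the separator is taken from this tutorial's examples.

```python
def build_data_paths(root_name, children, separator="`"):
    """Return the data path of every secondary table under `root_name`.

    `children` maps a table variable name to the mapping of its own
    sub-table variables (an empty dict for a leaf table).
    """
    data_paths = []
    for variable_name, sub_children in children.items():
        data_path = f"{root_name}{separator}{variable_name}"
        data_paths.append(data_path)
        # Recurse: the current path is the prefix of its sub-tables' paths
        data_paths.extend(build_data_paths(data_path, sub_children, separator))
    return data_paths


# Table variables of the Accidents schema, nested as in the dictionary file
accident_children = {"Vehicles": {"Users": {}}, "Place": {}}
print(build_data_paths("Accident", accident_children))
```

Note that the path components are variable names (`Vehicles`, `Users`), not dictionary names (`Vehicle`, `User`).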
By default, Khiops creates at most 100 multi-table variables (`max_constructed_variables`) and 10 random decision trees (`max_trees`). For this example, we raise the former to 1000 and disable the trees by setting the latter to 0:
model_report_path, model_kdic_path = kh.train_predictor(
    accidents_kdic_path,
    "Accident",
    accidents_table_path,
    "Gravity",
    "./mt_results",
    additional_data_tables={
        "Accident`Vehicles": vehicles_table_path,
        "Accident`Vehicles`Users": users_table_path,
        "Accident`Place": places_table_path,
    },
    max_constructed_variables=1000,
    max_trees=0,
)
Displaying the Classifier’s Accuracy and AUC
Khiops calculates evaluation metrics for the train/test split datasets. We access them by loading the report file into an `AnalysisResults` object. Let's check this out:
model_report = kh.read_analysis_results_file(model_report_path)
train_performance = model_report.train_evaluation_report.get_snb_performance()
test_performance = model_report.test_evaluation_report.get_snb_performance()
print(f"Accidents train accuracy: {train_performance.accuracy}")
print(f"Accidents train auc : {train_performance.auc}")
print(f"Accidents test accuracy : {test_performance.accuracy}")
print(f"Accidents test auc : {test_performance.auc}")
Accidents train accuracy: 0.94475
Accidents train auc : 0.844525
Accidents test accuracy : 0.945303
Accidents test auc : 0.839569
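For readers who want a reminder of what these two metrics measure, here is a minimal sketch of the usual binary definitions in plain Python. It is not Khiops's implementation, whose exact computation may differ, and the labels and scores are a hypothetical toy sample. Accuracy depends on the predicted class, whereas AUC only depends on how the scores rank the positive class against the negative one.

```python
def accuracy(labels, predictions):
    """Fraction of predictions equal to the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores, positive):
    """Probability that a random positive is scored above a random negative.

    Ties count for one half (pairwise Mann-Whitney formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == positive]
    negatives = [s for y, s in zip(labels, scores) if y != positive]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values): each score plays the role of P(Lethal)
labels = ["NonLethal", "NonLethal", "Lethal", "NonLethal", "Lethal"]
scores = [0.10, 0.40, 0.35, 0.20, 0.80]
predictions = ["Lethal" if s >= 0.5 else "NonLethal" for s in scores]

print(accuracy(labels, predictions))
print(auc(labels, scores, positive="Lethal"))
```

The pairwise loop is quadratic and only meant for small examples. It also shows why the two metrics can tell different stories: a classifier that always predicts the majority class can reach a high accuracy while ranking the classes poorly.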
Deploying the Classifier
We are now going to deploy the `Accidents` classifier that we have just trained.
To this end, we use the model dictionary file that the `train_predictor` function created, in conjunction with the `deploy_model` core API function. Note that the name of the dictionary for the model is `SNB_Accident`.
As in the model training, we must set the `additional_data_tables` parameter to take the secondary tables into account.
For simplicity, we'll just deploy on the whole data table file (one usually would do this on new data):
accidents_deployed_path = "./mt_results/accidents_deployed.txt"
kh.deploy_model(
    model_kdic_path,  # Path of the model dictionary file
    "SNB_Accident",  # Name of the model dictionary
    accidents_table_path,  # Path of the table to deploy the model on
    accidents_deployed_path,  # Path of the output (deployed) file
    additional_data_tables={  # Pairs of {"data-path": "file-path"} describing the other tables
        "SNB_Accident`Vehicles": vehicles_table_path,
        "SNB_Accident`Vehicles`Users": users_table_path,
        "SNB_Accident`Place": places_table_path,
    },
)
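The only difference between the data paths used for training and for deployment is the root dictionary name (`Accident` versus `SNB_Accident`). Rather than typing the mapping twice, one could derive the second from the first. The sketch below does so with `reroot_data_paths`, a hypothetical helper that is not part of the Khiops API; the file names are placeholders.

```python
def reroot_data_paths(data_tables, old_root, new_root, separator="`"):
    """Replace the leading dictionary name of each data path key."""
    rerooted = {}
    for data_path, file_path in data_tables.items():
        root, sep, rest = data_path.partition(separator)
        if root != old_root:
            raise ValueError(f"Unexpected root in data path: {data_path}")
        rerooted[f"{new_root}{sep}{rest}"] = file_path
    return rerooted


# Placeholder file names standing in for the table paths used above
training_tables = {
    "Accident`Vehicles": "Vehicles.txt",
    "Accident`Vehicles`Users": "Users.txt",
    "Accident`Place": "Places.txt",
}
deployment_tables = reroot_data_paths(training_tables, "Accident", "SNB_Accident")
print(deployment_tables)
```

Only the first path component is replaced, so the variable names further down the path are left untouched.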
The deployment output is written to the path stored in `accidents_deployed_path`. Let's have a look at it:
display(pd.read_csv(accidents_deployed_path, sep="\t").head(10))
| | AccidentId | PredictedGravity | ProbGravityLethal | ProbGravityNonLethal |
---|---|---|---|---|
0 | 201800000001 | NonLethal | 0.153842 | 0.846158 |
1 | 201800000002 | NonLethal | 0.121561 | 0.878439 |
2 | 201800000003 | NonLethal | 0.067390 | 0.932610 |
3 | 201800000004 | NonLethal | 0.025705 | 0.974295 |
4 | 201800000005 | NonLethal | 0.012496 | 0.987504 |
5 | 201800000006 | NonLethal | 0.121613 | 0.878387 |
6 | 201800000007 | NonLethal | 0.095323 | 0.904677 |
7 | 201800000008 | NonLethal | 0.096077 | 0.903923 |
8 | 201800000009 | NonLethal | 0.167294 | 0.832706 |
9 | 201800000010 | NonLethal | 0.055217 | 0.944783 |
Besides the key `AccidentId`, the deployed data table file contains three columns:
- `PredictedGravity`: the predicted class
- `ProbGravityLethal`, `ProbGravityNonLethal`: the predicted probability of each class of `Gravity`
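A quick sanity check on such an output is to verify that the class probabilities of each row sum to 1 and that the predicted class is the most probable one. The sketch below runs this check on the first three rows shown above, embedded as text so that it is self-contained. The second check is an assumption about how the prediction is derived; it holds for the rows displayed here.

```python
import csv
import io

# First rows of the deployed file shown above (tab-separated)
deployed_text = (
    "AccidentId\tPredictedGravity\tProbGravityLethal\tProbGravityNonLethal\n"
    "201800000001\tNonLethal\t0.153842\t0.846158\n"
    "201800000002\tNonLethal\t0.121561\t0.878439\n"
    "201800000003\tNonLethal\t0.067390\t0.932610\n"
)

prob_prefix = "ProbGravity"
checked_rows = 0
for row in csv.DictReader(io.StringIO(deployed_text), delimiter="\t"):
    # Collect the probability of each class from the Prob* columns
    probabilities = {
        name[len(prob_prefix):]: float(value)
        for name, value in row.items()
        if name.startswith(prob_prefix)
    }
    # The class probabilities should sum to 1 (up to rounding)
    assert abs(sum(probabilities.values()) - 1.0) < 1e-5
    # Assumption: the predicted class is the most probable one
    assert row["PredictedGravity"] == max(probabilities, key=probabilities.get)
    checked_rows += 1

print(f"{checked_rows} rows checked")
```

To run the same check on the real output, one would open the file at `accidents_deployed_path` instead of the embedded text.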