Multi-Table Tutorial with the core API

In this notebook, we will learn how to train a classifier for a simple multi-table dataset. We recommend going through the single-table tutorial first to understand the basics of Khiops dictionary files.

In [2]:

import warnings
import pandas as pd
from khiops import core as kh
from khiops.tools import download_datasets

# Download the sample datasets from GitHub if not available
warnings.filterwarnings("ignore", message="Download.*") # Ignore dataset download warning
download_datasets()

The Accidents Dataset

We'll train a multi-table classifier on the Accidents dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has four tables with the following schema:

Accident
|
+----0:n----Vehicle
|           |
|           +----0:n----User
|
+----0:1----Place

The main table Accident
The table Vehicle in a 0:n relationship with Accident
The table User in a 0:n relationship with Vehicle
The table Place in a 0:1 relationship with Accident

Let's first check the content of the tables:

In [6]:

# Store the locations of the `Accidents` dataset tables
accidents_table_path = f"{kh.get_samples_dir()}/Accidents/Accidents.txt"
vehicles_table_path = f"{kh.get_samples_dir()}/Accidents/Vehicles.txt"
users_table_path = f"{kh.get_samples_dir()}/Accidents/Users.txt"
places_table_path = f"{kh.get_samples_dir()}/Accidents/Places.txt"


# Print the first lines of the data files
print("Accidents table:")
display(pd.read_csv(accidents_table_path, sep="\t").head(5))
print("Vehicles table:")
display(pd.read_csv(vehicles_table_path, sep="\t").head(5))
print("Users table:")
display(pd.read_csv(users_table_path, sep="\t").head(5))
print("Places table:")
display(pd.read_csv(places_table_path, sep="\t", low_memory=False).head(5))

Accidents table:

	AccidentId	Gravity	Date	Hour	Light	Department	Commune	InAgglomeration	IntersectionType	Weather	CollisionType	PostalAddress	GPSCode	Latitude	Longitude
0	201800000001	NonLethal	2018-01-24	15:05:00	Daylight	590	5	No	Y-type	Normal	2Vehicles-BehindVehicles-Frontal	route des Ansereuilles	M	50.55737	2.55737
1	201800000002	NonLethal	2018-02-12	10:15:00	Daylight	590	11	Yes	Square	VeryGood	NoCollision	Place du général de Gaul	M	50.52936	2.52936
2	201800000003	NonLethal	2018-03-04	11:35:00	Daylight	590	477	Yes	T-type	Normal	NoCollision	Rue nationale	M	50.51243	2.51243
3	201800000004	NonLethal	2018-05-05	17:35:00	Daylight	590	52	Yes	NoIntersection	VeryGood	2Vehicles-Side	30 rue Jules Guesde	M	50.51974	2.51974
4	201800000005	NonLethal	2018-06-26	16:05:00	Daylight	590	477	Yes	NoIntersection	Normal	2Vehicles-Side	72 rue Victor Hugo	M	50.51607	2.51607

Vehicles table:

	AccidentId	VehicleId	Direction	Category	FixedObstacle	MobileObstacle	ImpactPoint	Maneuver
0	201800000001	A01	Unknown	Car<=3.5T	NaN	Vehicle	RightFront	TurnToLeft
1	201800000001	B01	Unknown	Car<=3.5T	NaN	Vehicle	LeftFront	NoDirectionChange
2	201800000002	A01	Unknown	Car<=3.5T	NaN	Pedestrian	NaN	NoDirectionChange
3	201800000003	A01	Unknown	Motorbike>125cm3	StationaryVehicle	Vehicle	Front	NoDirectionChange
4	201800000003	B01	Unknown	Car<=3.5T	NaN	Vehicle	LeftSide	TurnToLeft

Users table:

	AccidentId	VehicleId	Seat	Category	Gender	TripReason	SafetyDevice	SafetyDeviceUsed	PedestrianLocation	PedestrianAction	PedestrianCompany	BirthYear
0	201800000001	A01	1.0	Driver	Male	Leisure	SeatBelt	Yes	NaN	NaN	Unknown	1960.0
1	201800000001	B01	1.0	Driver	Male	NaN	SeatBelt	Yes	NaN	NaN	Unknown	1928.0
2	201800000002	A01	1.0	Driver	Male	NaN	SeatBelt	Yes	NaN	NaN	Unknown	1947.0
3	201800000002	A01	NaN	Pedestrian	Male	NaN	Helmet	NaN	OnLane<=OnSidewalk0mCrossing	Crossing	Alone	1959.0
4	201800000003	A01	1.0	Driver	Male	Leisure	Helmet	Yes	NaN	NaN	Unknown	1987.0

Places table:

	AccidentId	RoadType	RoadNumber	RoadSecNumber	RoadLetter	Circulation	LaneNumber	Slope	RoadMarkerId	RoadMarkerDistance	Layout	StripWidth	LaneWidth	SurfaceCondition	Infrastructure	Localization
0	201800000001	Departamental	41	NaN	C	TwoWay	2.0	Flat	NaN	NaN	RightCurve	NaN	NaN	Normal	Unknown	Lane
1	201800000002	Communal	41	NaN	D	TwoWay	2.0	Flat	NaN	NaN	LeftCurve	NaN	NaN	Normal	Unknown	Lane
2	201800000003	Departamental	39	NaN	D	TwoWay	2.0	Flat	NaN	NaN	Straight	NaN	NaN	Normal	Unknown	Lane
3	201800000004	Departamental	39	NaN	NaN	TwoWay	2.0	Flat	NaN	NaN	Straight	NaN	NaN	Normal	Unknown	Lane
4	201800000005	Communal	NaN	NaN	NaN	OneWay	1.0	Flat	NaN	NaN	Straight	NaN	NaN	Normal	Unknown	Lane
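
The 0:n relationship between Accident and Vehicle can be checked directly on the data: each AccidentId may appear in zero or more rows of the Vehicles table. Here is a minimal sketch that uses only the standard library and the five Vehicles rows displayed above. The inline sample and the variable names are illustrative and not part of the Khiops API.

```python
import csv
import io
from collections import Counter

# First rows of the Vehicles table shown above (only the key columns are kept)
vehicles_sample = (
    "AccidentId\tVehicleId\n"
    "201800000001\tA01\n"
    "201800000001\tB01\n"
    "201800000002\tA01\n"
    "201800000003\tA01\n"
    "201800000003\tB01\n"
)

# Count how many Vehicle rows share each AccidentId key
reader = csv.DictReader(io.StringIO(vehicles_sample), delimiter="\t")
vehicles_per_accident = Counter(row["AccidentId"] for row in reader)

for accident_id, count in sorted(vehicles_per_accident.items()):
    print(accident_id, count)
```

To run the same count on the full table, one could pass an open file handle on vehicles_table_path to csv.DictReader instead of the in-memory sample.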

To train a classifier with the Khiops core API, we must specify a multi-table dataset. The schema is specified via the Khiops dictionary file. Let's look at its contents for the Accidents dataset:

In [7]:

accidents_kdic_path = f"{kh.get_samples_dir()}/Accidents/Accidents.kdic"
with open(accidents_kdic_path) as accidents_kdic_file:
    print(accidents_kdic_file.read())

Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Categorical Gravity;
  Date Date;
  Time Hour;
  Categorical Light;
  Categorical Department;
  Categorical Commune;
  Categorical InAgglomeration;
  Categorical IntersectionType;
  Categorical Weather;
  Categorical CollisionType;
  Categorical PostalAddress;
  Categorical GPSCode;
  Numerical Latitude;
  Numerical Longitude;
  Entity(Place) Place;
  Table(Vehicle) Vehicles;
};

Dictionary Place(AccidentId)
{
  Categorical AccidentId;
  Categorical RoadType;
  Categorical RoadNumber;
  Categorical RoadSecNumber;
  Categorical RoadLetter;
  Categorical Circulation;
  Numerical LaneNumber;
  Categorical SpecialLane;
  Categorical Slope;
  Categorical RoadMarkerId;
  Numerical RoadMarkerDistance;
  Categorical Layout;
  Numerical StripWidth;
  Numerical LaneWidth;
  Categorical SurfaceCondition;
  Categorical Infrastructure;
  Categorical Localization;
  Categorical SchoolNear;
};


Dictionary Vehicle(AccidentId, VehicleId)
{
  Categorical AccidentId;
  Categorical VehicleId;
  Categorical Direction;
  Categorical Category;
  Numerical PassengerNumber;
  Categorical FixedObstacle;
  Categorical MobileObstacle;
  Categorical ImpactPoint;
  Categorical Maneuver;
  Table(User) Users;
};

Dictionary User(AccidentId, VehicleId) {
  Categorical AccidentId;
  Categorical VehicleId;
  Categorical Seat;
  Categorical Category;
  Categorical Gender;
  Categorical TripReason;
  Categorical SafetyDevice;
  Categorical SafetyDeviceUsed;
  Categorical PedestrianLocation;
  Categorical PedestrianAction;
  Categorical PedestrianCompany;
  Numerical BirthYear;
};

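We note that the Accident dictionary contains two special variables. The Table variable creates a 0:n relation and the Entity variable creates a 0:1 relation. In both cases the target dictionary is given as the argument in parentheses (Vehicle and Place, respectively). The Vehicle dictionary uses a Table variable in the same way to link to User.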
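
To make the role of these relation variables concrete, the sketch below scans a shortened copy of the dictionary file and lists every Table and Entity variable together with the dictionary that owns it. It is a naive line scanner written for illustration, assuming the simple layout printed above. It is not a full dictionary parser and not part of the Khiops API.

```python
import re

# Shortened copy of the dictionary file printed above (most variables omitted)
kdic_text = """
Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Entity(Place) Place;
  Table(Vehicle) Vehicles;
};

Dictionary Place(AccidentId)
{
  Categorical AccidentId;
};

Dictionary Vehicle(AccidentId, VehicleId)
{
  Categorical AccidentId;
  Categorical VehicleId;
  Table(User) Users;
};

Dictionary User(AccidentId, VehicleId) {
  Categorical AccidentId;
  Categorical VehicleId;
};
"""

# Hypothetical patterns: one for dictionary headers, one for relation variables
dictionary_re = re.compile(r"^\s*(?:Root\s+)?Dictionary\s+(\w+)")
relation_re = re.compile(r"^\s*(Table|Entity)\((\w+)\)\s+(\w+)\s*;")

relations = []
current_dictionary = None
for line in kdic_text.splitlines():
    dictionary_match = dictionary_re.match(line)
    if dictionary_match:
        # Remember which dictionary the following variables belong to
        current_dictionary = dictionary_match.group(1)
        continue
    relation_match = relation_re.match(line)
    if relation_match:
        kind, target, variable = relation_match.groups()
        relations.append((current_dictionary, kind, target, variable))

for relation in relations:
    print(relation)
```

Each tuple reads as (owning dictionary, relation kind, target dictionary, variable name). The variable names (Place, Vehicles, Users) are the ones that appear in the data paths used later for training and deployment.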

Training the Classifier

While the dictionary file specifies the table schemas and their relations, it does not contain any information about the data files. In a single-table task, the third mandatory parameter of train_predictor specifies the data table file. For multi-table tasks, this parameter still specifies the main table; to specify the rest of the tables we use the optional parameter additional_data_tables.

The additional_data_tables parameter is a Python dict whose keys are the data paths of the secondary tables and whose values are their file paths (in our case three pairs, one each for the Vehicle, User and Place tables). For more information about data paths, see the basics of Khiops dictionary files.
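
The data paths used in this tutorial follow the chain of relation variables from the root dictionary, joined with a backtick. The sketch below derives them from the relation variables of Accidents.kdic. The function build_data_paths and the relation_variables mapping are hypothetical helpers written for this illustration, and the path format is simply the one that appears in this tutorial's calls.

```python
# Relation variables copied from Accidents.kdic:
# {dictionary name: {relation variable name: target dictionary name}}
relation_variables = {
    "Accident": {"Place": "Place", "Vehicles": "Vehicle"},
    "Vehicle": {"Users": "User"},
}


def build_data_paths(dictionary_name, prefix):
    """Hypothetical helper: list the data paths of all secondary tables."""
    data_paths = []
    for variable_name, target in relation_variables.get(dictionary_name, {}).items():
        # A data path is the parent's path plus the relation variable name
        data_path = f"{prefix}`{variable_name}"
        data_paths.append(data_path)
        # Recurse into the target dictionary to find deeper tables
        data_paths.extend(build_data_paths(target, data_path))
    return data_paths


accident_data_paths = build_data_paths("Accident", "Accident")
print(accident_data_paths)
```

Note that the path components are variable names (Vehicles, Users), not dictionary names (Vehicle, User).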

By default, Khiops creates at most 100 multi-table variables (max_constructed_variables) and 10 random decision trees (max_trees). For this example, we raise the former to 1000 and disable trees by setting max_trees to 0:

In [9]:

model_report_path, model_kdic_path = kh.train_predictor(
    accidents_kdic_path,
    "Accident",
    accidents_table_path,
    "Gravity",
    "./mt_results",
    additional_data_tables={
        "Accident`Vehicles": vehicles_table_path,
        "Accident`Vehicles`Users": users_table_path,
        "Accident`Place": places_table_path,
    },
    max_constructed_variables=1000,
    max_trees=0,
)

Displaying the Classifier’s Accuracy and AUC

Khiops calculates evaluation metrics for the train/test split datasets. We access them by loading the report file into an AnalysisResults object. Let's check this out:

In [10]:

model_report = kh.read_analysis_results_file(model_report_path)
train_performance = model_report.train_evaluation_report.get_snb_performance()
test_performance = model_report.test_evaluation_report.get_snb_performance()

print(f"Accidents train accuracy: {train_performance.accuracy}")
print(f"Accidents train auc     : {train_performance.auc}")
print(f"Accidents test accuracy : {test_performance.accuracy}")
print(f"Accidents test auc      : {test_performance.auc}")

Accidents train accuracy: 0.94475
Accidents train auc     : 0.844525
Accidents test accuracy : 0.945303
Accidents test auc      : 0.839569

Deploying the Classifier

We are now going to deploy the Accidents classifier that we have just trained.

To this end, we use the model dictionary file that the train_predictor function created, in conjunction with the deploy_model core API function. Note that the name of the dictionary for the model is SNB_Accident.

As in model training, we must set the additional_data_tables parameter to take the secondary tables into account. The data paths now start with the name of the model dictionary (SNB_Accident) instead of Accident.
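
Since only the first component of each data path differs between training and deployment, the deployment mapping can be derived from the training one. The sketch below shows one way to do this. The helper with_root is hypothetical and the file names are placeholders standing in for the real table paths.

```python
# Training-time mapping; the file names are placeholders for illustration
training_tables = {
    "Accident`Vehicles": "Vehicles.txt",
    "Accident`Vehicles`Users": "Users.txt",
    "Accident`Place": "Places.txt",
}


def with_root(data_path, new_root):
    """Hypothetical helper: replace the first component of a data path."""
    _, separator, rest = data_path.partition("`")
    return f"{new_root}{separator}{rest}"


# Same files, with data paths rooted at the model dictionary
deployment_tables = {
    with_root(data_path, "SNB_Accident"): file_path
    for data_path, file_path in training_tables.items()
}
print(deployment_tables)
```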

For simplicity, we'll just deploy on the whole data table file (one usually would do this on new data):

In [11]:

accidents_deployed_path = "./mt_results/accidents_deployed.txt"
kh.deploy_model(
    model_kdic_path,             # Path of the model dictionary file
    "SNB_Accident",              # Name of the model dictionary
    accidents_table_path,        # Path of the table to deploy the model
    accidents_deployed_path,     # Path of the output (deployed) file
    additional_data_tables = {   # Pairs of {"data-path": "file-path"} describing the other tables
        "SNB_Accident`Vehicles": vehicles_table_path,
        "SNB_Accident`Vehicles`Users": users_table_path,
        "SNB_Accident`Place": places_table_path,
    },
)

The deployed output table is stored at the path held in the variable accidents_deployed_path. Let's have a look at it:

In [12]:

display(pd.read_csv(accidents_deployed_path, sep="\t").head(10))

	AccidentId	PredictedGravity	ProbGravityLethal	ProbGravityNonLethal
0	201800000001	NonLethal	0.153842	0.846158
1	201800000002	NonLethal	0.121561	0.878439
2	201800000003	NonLethal	0.067390	0.932610
3	201800000004	NonLethal	0.025705	0.974295
4	201800000005	NonLethal	0.012496	0.987504
5	201800000006	NonLethal	0.121613	0.878387
6	201800000007	NonLethal	0.095323	0.904677
7	201800000008	NonLethal	0.096077	0.903923
8	201800000009	NonLethal	0.167294	0.832706
9	201800000010	NonLethal	0.055217	0.944783

The deployed data table file contains four columns:

AccidentId: Which contains the key of each accident
PredictedGravity: Which contains the class prediction
ProbGravityLethal, ProbGravityNonLethal: Which contain the predicted probability of each class of Gravity.
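
As a quick sanity check on these columns, the sketch below verifies on the first three rows shown above that the two probabilities sum to one and that PredictedGravity is the class with the highest probability. It uses only the standard library and an inline copy of those rows. The check is an illustration on this sample, not a statement about how Khiops chooses the prediction in general.

```python
import csv
import io

# First rows of the deployed file shown above
deployed_sample = (
    "AccidentId\tPredictedGravity\tProbGravityLethal\tProbGravityNonLethal\n"
    "201800000001\tNonLethal\t0.153842\t0.846158\n"
    "201800000002\tNonLethal\t0.121561\t0.878439\n"
    "201800000003\tNonLethal\t0.067390\t0.932610\n"
)

checked_rows = []
for row in csv.DictReader(io.StringIO(deployed_sample), delimiter="\t"):
    probabilities = {
        "Lethal": float(row["ProbGravityLethal"]),
        "NonLethal": float(row["ProbGravityNonLethal"]),
    }
    # Class with the highest predicted probability
    most_probable = max(probabilities, key=probabilities.get)
    checked_rows.append(
        {
            "AccidentId": row["AccidentId"],
            "sums_to_one": abs(sum(probabilities.values()) - 1.0) < 1e-6,
            "matches_prediction": most_probable == row["PredictedGravity"],
        }
    )

for checked in checked_rows:
    print(checked)
```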