sklearn.estimators

Submodule of khiops.sklearn

Scikit-Learn Estimator Classes for the Khiops AutoML Suite

Class Overview

The diagram below describes the relationships in this module:

KhiopsEstimator(ABC, BaseEstimator)
    |
    +- KhiopsCoclustering(ClusterMixin)
    |
    +- KhiopsSupervisedEstimator
       |
       +- KhiopsPredictor
       |  |
       |  +- KhiopsClassifier(ClassifierMixin)
       |  |
       |  +- KhiopsRegressor(RegressorMixin)
       |
       +- KhiopsEncoder(TransformerMixin)

Classes

`KhiopsClassifier`	Khiops Selective Naive Bayes Classifier
`KhiopsCoclustering`	A Khiops Coclustering model
`KhiopsEncoder`	Khiops supervised discretization/grouping encoder
`KhiopsEstimator`	Base class for Khiops Scikit-learn estimators
`KhiopsPredictor`	Abstract Khiops Selective Naive Bayes Predictor
`KhiopsRegressor`	Khiops Selective Naive Bayes Regressor
`KhiopsSupervisedEstimator`	Abstract Khiops Supervised Estimator

class khiops.sklearn.estimators.KhiopsClassifier(n_features=100, n_pairs=0, n_trees=10, n_selected_features=0, n_evaluated_features=0, specific_pairs=None, all_possible_pairs=True, construction_rules=None, group_target_value=False, verbose=False, output_dir=None, auto_sort=True)

Bases: ClassifierMixin, KhiopsPredictor

Khiops Selective Naive Bayes Classifier

This classifier supports automatic feature engineering on multi-table datasets. See Multi-Table Learning Primer for more details.

Note

Visit the Khiops site to learn about the automatic feature engineering algorithm.

Parameters:

n_features : int, default 100: Multi-table only: Maximum number of multi-table aggregate features to construct. See Multi-Table Learning Primer for more details.
n_pairs : int, default 0: Maximum number of pair features to construct. These features are 2D grid partitions of univariate feature pairs. The grid is optimized such that in each cell the target distribution is well approximated by a constant histogram. Only pairs that are jointly more informative than their marginals may be taken into account in the classifier.
n_trees : int, default 10: Maximum number of decision tree features to construct. The constructed trees combine other features, either native or constructed. These features usually improve the classifier’s performance at the cost of interpretability of the model.
n_selected_features : int, default 0: Maximum number of features to be selected in the SNB predictor. If equal to 0 it selects all the features kept in the training.
n_evaluated_features : int, default 0: Maximum number of features to be evaluated in the SNB predictor training. If equal to 0 it evaluates all informative features.
specific_pairs : list of tuple, optional: User-specified pairs as a list of 2-tuples of feature names. If a given tuple contains only one non-empty feature name, then it generates all the pairs containing it (within the maximum limit n_pairs). These pairs have top priority: they are constructed first.
all_possible_pairs : bool, default True: If True, tries to create all possible pairs within the limit n_pairs. Pairs specified with specific_pairs have top priority: they are constructed first.
construction_rules : list of str, optional: Allowed rules for the automatic feature construction. If not set, it uses all possible rules.
group_target_value : bool, default False: Allows grouping of the target values in classification. It can substantially increase the training time.
verbose : bool, default False: If True, it prints debug information and does not erase temporary files when fitting, predicting or transforming.
output_dir : str, optional: Path of the output directory for the AllReports.khj report file and the Modeling.kdic modeling dictionary file. By default these files are deleted.
auto_sort : bool, default True: Advanced. Only for multi-table inputs: If True, input tables are pre-sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the fit, predict and predict_proba methods. Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.
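The note on auto_sort describes a left-to-right, hierarchical, lexicographic sort by key. The sketch below illustrates that ordering on a small in-memory table. It is not the Khiops implementation: the column names are hypothetical, and comparing key values as text is an assumption made for the illustration.

```python
# Hypothetical rows of a secondary table whose key has two columns.
rows = [
    {"customer_id": "C2", "order_id": "10", "amount": 5.0},
    {"customer_id": "C10", "order_id": "2", "amount": 7.5},
    {"customer_id": "C2", "order_id": "9", "amount": 1.0},
]
key_columns = ["customer_id", "order_id"]


def sort_by_key(rows, key_columns):
    """Sort rows on the key columns, leftmost column first.

    Assumption: key values are compared as text, so "10" sorts before "9".
    """
    return sorted(rows, key=lambda row: tuple(str(row[c]) for c in key_columns))


sorted_rows = sort_by_key(rows, key_columns)
for row in sorted_rows:
    print(row["customer_id"], row["order_id"])
```

If the tables passed to the estimator already follow the expected key order, setting auto_sort=False skips this step.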

Attributes:

n_classes_ : int

The number of classes seen in training.

classes_ : ndarray of shape (n_classes_,)

The list of classes seen in training. Depending on the training target, the contents are int or str.

n_features_evaluated_ : int

The number of features evaluated by the classifier.

feature_evaluated_names_ : ndarray of shape (n_features_evaluated_,)

Names of the features evaluated by the classifier.

feature_evaluated_importances_ : ndarray of shape (n_features_evaluated_,)

Level of the features evaluated by the classifier. See below for a definition of the level.

n_features_used_ : int

The number of features used by the classifier.

feature_used_names_ : ndarray of shape (n_features_used_,)

Names of the features used by the classifier.

feature_used_importances_ : ndarray of shape (n_features_used_, 3)

Level, Weight and Importance of the features used by the classifier:

Level: A measure of the predictive importance of the feature taken individually. It ranges between 0 (no predictive interest) and 1 (optimal predictive importance).
Weight: A measure of the predictive importance of the feature taken relative to all features selected by the classifier. It ranges between 0 (little contribution to the model) and 1 (large contribution to the model).
Importance: The geometric mean between the Level and the Weight.
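As a quick illustration of the Importance definition above, the geometric mean of two values is the square root of their product. The helper name and the input values below are illustrative only and are not taken from a real model.

```python
import math


def importance(level, weight):
    """Geometric mean of Level and Weight, both expected in [0, 1]."""
    return math.sqrt(level * weight)


# A feature with a modest individual Level but full Weight in the model.
print(importance(0.25, 1.0))  # 0.5
```

A feature with a Level or a Weight of 0 therefore has an Importance of 0, whatever the other value is.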

is_multitable_model_ : bool

True if the model was fitted on a multi-table dataset.

model_ : DictionaryDomain

The Khiops dictionary domain for the trained classifier.

model_main_dictionary_name_ : str

The name of the main Khiops dictionary within the model_ domain.

model_report_ : AnalysisResults

The Khiops report object.

Examples

See the samples_sklearn.py documentation script for usage examples.

fit(X, y, **kwargs)

Fits a Selective Naive Bayes classifier according to X, y

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
y : array-like of shape (n_samples,): The target values.

Returns:

self : KhiopsClassifier: The calling estimator instance.

predict(X)

Predicts the most probable class for the test dataset X

The predicted class of an input sample is the arg-max of the conditional probabilities P(y|X) for each value of y.
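The arg-max rule can be illustrated in a few lines. This is only a sketch of the decision rule, with made-up class names and probabilities standing in for P(y|X); it does not call Khiops.

```python
def most_probable_class(class_probabilities):
    """Return the class whose conditional probability is the largest."""
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical conditional probabilities P(y|X) for one input sample.
probabilities = {"setosa": 0.1, "versicolor": 0.7, "virginica": 0.2}
print(most_probable_class(probabilities))  # versicolor
```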

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).

Returns:

ndarray

An array containing the predicted classes. A first column containing key column ids is added in multi-table mode. The numpy.dtype of the array matches the type of y used during training. It will be integer, float, or boolean if the classifier was trained with a y of the corresponding type. Otherwise it will be str.

predict_proba(X)

Predicts the class probabilities for the test dataset X

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).

Returns:

numpy.ndarray or str

The probability of the samples for each class in the model. The columns are named with the pattern Prob<class> for each <class> found in the training dataset. The output data container depends on X:

Dataframe or dataframe-based dict dataset specification: numpy.ndarray

The key columns are added for multi-table tasks.
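The fit, predict and predict_proba methods documented above can be chained as in the following sketch. It assumes that the khiops package and a Khiops installation are available and that the inputs are single-table array-likes. The helper name, the estimator argument (which allows passing a preconfigured instance) and the choice of n_trees=0 are illustrative, not part of the library.

```python
def train_and_score(X_train, y_train, X_test, estimator=None):
    """Fit a classifier, then return predicted classes and class probabilities."""
    if estimator is None:
        # Imported lazily: requires the khiops package and a Khiops installation.
        from khiops.sklearn.estimators import KhiopsClassifier

        # n_trees=0 builds no tree features; an illustrative choice only.
        estimator = KhiopsClassifier(n_trees=0)
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    y_probas = estimator.predict_proba(X_test)
    return y_pred, y_probas
```

After fitting, the classes seen in training are available in the classes_ attribute of the estimator.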

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KhiopsClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self : object: The updated object.

class khiops.sklearn.estimators.KhiopsCoclustering(verbose=False, output_dir=None, auto_sort=True, build_name_var=True, build_distance_vars=False, build_frequency_vars=False)

Bases: ClusterMixin, KhiopsEstimator

A Khiops Coclustering model

A coclustering is a non-supervised piecewise constant density estimator.

Parameters:

build_distance_vars : bool, default False: If True, includes a cluster distance variable in the deployment.
build_frequency_vars : bool, default False: If True, includes the frequency variables in the deployment.
build_name_var : bool, default True: If True, includes a cluster id variable in the deployment.
verbose : bool, default False: If True, it prints debug information and does not erase temporary files when fitting, predicting or transforming.
output_dir : str, optional: Path of the output directory for the Coclustering.khcj report file and the Coclustering.kdic modeling dictionary file.
auto_sort : bool, default True: Advanced. Only for multi-table inputs: If True, input tables are automatically sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the predict method. Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.

Attributes:

is_multitable_model_ : bool: True if the model was fitted on a multi-table dataset.
model_ : DictionaryDomain: The Khiops dictionary domain for the trained coclustering. For coclustering it is a multi-table dictionary even though the model is single-table.
model_main_dictionary_name_ : str: The name of the main Khiops dictionary within the model_ domain.
model_report_ : CoclusteringResults: The Khiops report object.

Examples

See the following functions of the samples_sklearn.py documentation script:

samples_sklearn.khiops_coclustering()

fit(X, y=None, **kwargs)

Trains a Khiops Coclustering model

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
id_column : str: The column that contains the id of the instance.
columns : list, optional: The columns to be co-clustered. If not specified it uses all columns.

Returns:

self : KhiopsCoclustering: The calling estimator instance.

fit_predict(X, y=None, **kwargs): Performs clustering on X and returns the result (instead of labels)

predict(X)

Predicts the most probable cluster for the test dataset X

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).

Returns:

ndarray: An array containing the encoded columns. A first column containing key column ids is added in multi-table mode.

simplify(max_preserved_information=0, max_cells=0, max_total_parts=0, max_part_numbers=None)

Creates a simplified coclustering model from the current instance

Parameters:

max_preserved_information : int, default 0: Maximum information preserved in the simplified coclustering. If equal to 0 there is no limit.
max_cells : int, default 0: Maximum number of cells in the simplified coclustering. If equal to 0 there is no limit.
max_total_parts : int, default 0: Maximum number of parts totaled over all variables. If equal to 0 there is no limit.
max_part_numbers : dict, optional: Maximum number of clusters for each of the co-clustered columns. Specifically, a key-value pair of this dictionary represents the column name and its respective maximum number of clusters. If not specified, then no maximum number of clusters is imposed on any column.

Returns:

self : KhiopsCoclustering: A new, simplified KhiopsCoclustering estimator instance.
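The following sketch chains fit, simplify and predict as documented above. It assumes a working Khiops installation; the helper name and the estimator argument are illustrative, and the column names used in the usage comment are hypothetical.

```python
def fit_and_simplify(X, id_column, max_part_numbers, estimator=None):
    """Train a coclustering, simplify it, and predict clusters with the result."""
    if estimator is None:
        # Imported lazily: requires the khiops package and a Khiops installation.
        from khiops.sklearn.estimators import KhiopsCoclustering

        estimator = KhiopsCoclustering()
    # id_column is passed as a keyword argument of fit.
    estimator.fit(X, id_column=id_column)
    # simplify returns a new estimator; the fitted one is left unchanged.
    simplified = estimator.simplify(max_part_numbers=max_part_numbers)
    return simplified, simplified.predict(X)


# Hypothetical call: at most 3 clusters for the column named "Char".
# simplified, clusters = fit_and_simplify(df, "SampleId", {"Char": 3})
```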

class khiops.sklearn.estimators.KhiopsEncoder(categorical_target=True, n_features=100, n_pairs=0, n_trees=0, specific_pairs=None, all_possible_pairs=True, construction_rules=None, informative_features_only=True, group_target_value=False, keep_initial_variables=False, transform_type_categorical='part_id', transform_type_numerical='part_id', transform_type_pairs='part_id', verbose=False, output_dir=None, auto_sort=True)

Bases: TransformerMixin, KhiopsSupervisedEstimator

Khiops supervised discretization/grouping encoder

Parameters:

categorical_target : bool, default True

True if the target column is categorical.

n_features : int, default 100

Multi-table only: Maximum number of multi-table aggregate features to construct. See Multi-Table Learning Primer for more details.

n_pairs : int, default 0

Maximum number of pair features to construct. These features are 2D grid partitions of univariate feature pairs. The grid is optimized such that in each cell the target distribution is well approximated by a constant histogram. Only pairs that are jointly more informative than their marginals may be taken into account in the encoder.

n_trees : int, default 0

Maximum number of decision tree features to construct. The constructed trees combine other features, either native or constructed. These features usually improve a predictor’s performance at the cost of interpretability of the model.

specific_pairs : list of tuple, optional

User-specified pairs as a list of 2-tuples of feature names. If a given tuple contains only one non-empty feature name, then it generates all the pairs containing it (within the maximum limit n_pairs). These pairs have top priority: they are constructed first.

all_possible_pairs : bool, default True

If True, tries to create all possible pairs within the limit n_pairs. Pairs specified with specific_pairs have top priority: they are constructed first.

construction_rules : list of str, optional

Allowed rules for the automatic feature construction. If not set, it uses all possible rules.

informative_features_only : bool, default True

If True, keeps only informative features.

group_target_value : bool, default False

Allows grouping of the target values in classification. It can substantially increase the training time.

keep_initial_variables : bool, default False

If True, the original columns are kept in the transformed data.

transform_type_categorical : str, default “part_id”

Type of transformation for categorical features. Valid values:

“part_id”
“part_label”
“dummies”
“conditional_info”

See the documentation for the categorical_recoding_method parameter of the train_recoder function for more details.

transform_type_numerical : str, default “part_id”

Type of transformation for numerical features. Valid values:

“part_id”
“part_label”
“dummies”
“conditional_info”
“center_reduction”
“0-1_normalization”
“rank_normalization”

See the documentation for the numerical_recoding_method parameter of the train_recoder function for more details.

transform_type_pairs : str, default “part_id”

Type of transformation for bivariate features. Valid values:

“part_id”
“part_label”
“dummies”
“conditional_info”

verbose : bool, default False

If True, it prints debug information and does not erase temporary files when fitting, predicting or transforming.

output_dir : str, optional

Path of the output directory for the AllReports.khj report file and the Modeling.kdic modeling dictionary file. By default these files are deleted.

auto_sort : bool, default True

Advanced. Only for multi-table inputs: If True, input tables are pre-sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the fit and transform methods. Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.
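The specific_pairs description above (a tuple with a single non-empty name stands for all pairs containing that feature, within the limit n_pairs) can be read as in the sketch below. This is one interpretation for illustration only, not the Khiops implementation: the helper name, the feature names and the order in which pairs are generated are assumptions.

```python
def expand_specific_pairs(specific_pairs, feature_names, n_pairs):
    """Expand user-specified pairs into explicit 2-tuples, capped at n_pairs."""
    expanded = []
    for first, second in specific_pairs:
        if first and second:
            candidates = [(first, second)]
        else:
            anchor = first or second
            if not anchor:
                continue  # both names empty: nothing to generate
            candidates = [(anchor, other) for other in feature_names if other != anchor]
        for pair in candidates:
            # Treat (a, b) and (b, a) as the same pair.
            if pair not in expanded and (pair[1], pair[0]) not in expanded:
                expanded.append(pair)
    return expanded[:n_pairs]


features = ["age", "income", "city"]
pairs = expand_specific_pairs([("age", "income"), ("city", "")], features, n_pairs=10)
print(pairs)
```

With the default n_pairs=0 this reading yields no pairs at all, which matches the role of n_pairs as an upper limit.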

Attributes:

n_features_evaluated_ : int: The number of features evaluated by the encoder.
feature_evaluated_names_ : ndarray of shape (n_features_evaluated_,): Names of the features evaluated by the encoder.
feature_evaluated_importances_ : ndarray of shape (n_features_evaluated_,): Level of the features evaluated by the encoder. The Level is a measure of the predictive importance of the feature taken individually. It ranges between 0 (no predictive interest) and 1 (optimal predictive importance).
is_multitable_model_ : bool: True if the model was fitted on a multi-table dataset.
model_ : DictionaryDomain: The Khiops dictionary domain for the trained encoder.
model_main_dictionary_name_ : str: The name of the main Khiops dictionary within the model_ domain.
model_report_ : AnalysisResults: The Khiops report object.

Examples

See the samples_sklearn.py documentation script for usage examples.

fit(X, y=None, **kwargs)

Fits the Khiops Encoder according to X, y

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
y : array-like of shape (n_samples,): The target values.

Returns:

self : KhiopsEncoder: The calling estimator instance.

fit_transform(X, y=None, **kwargs)

Fits the encoder and transforms its inputs

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
y : array-like of shape (n_samples,): The target values.

Returns:

ndarray: An array containing the encoded columns. A first column containing key column ids is added in multi-table mode.

transform(X)

Transforms X with a fitted Khiops supervised encoder

Note

Numerical features are encoded to categorical ones. See the transform_type_numerical parameter for details.

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Dataset to transform. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).

Returns:

ndarray: An array containing the encoded columns. A first column containing key column ids is added in multi-table mode.
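A minimal sketch of the fit and transform sequence documented above, assuming a working Khiops installation and single-table inputs. The helper name and the estimator argument are illustrative; "part_label" is one of the valid transform types listed in the parameters.

```python
def encode(X_train, y_train, X_new, estimator=None):
    """Fit a supervised encoder on the training data and encode X_new."""
    if estimator is None:
        # Imported lazily: requires the khiops package and a Khiops installation.
        from khiops.sklearn.estimators import KhiopsEncoder

        # Encode numerical and categorical features with part labels
        # instead of the default part ids (illustrative choice).
        estimator = KhiopsEncoder(
            transform_type_numerical="part_label",
            transform_type_categorical="part_label",
        )
    estimator.fit(X_train, y_train)
    return estimator.transform(X_new)
```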

class khiops.sklearn.estimators.KhiopsEstimator(verbose=False, output_dir=None, auto_sort=True)

Bases: ABC, BaseEstimator

Base class for Khiops Scikit-learn estimators

Note

The input features collection X needs to have single-line records so that Khiops can handle them. Hence, multi-line records are preprocessed: carriage returns / line feeds are replaced with blank spaces before being handed over to Khiops.
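The preprocessing described in this note can be pictured as follows. This is only a sketch, not the library's code: the helper name is hypothetical, and replacing each carriage return and each line feed by its own blank space is an assumption.

```python
def flatten_record(value):
    """Replace each carriage return or line feed with a blank space."""
    return value.replace("\r", " ").replace("\n", " ")


print(flatten_record("first line\nsecond line"))  # first line second line
```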

Parameters:

verbose : bool, default False: If True, it prints debug information and does not erase temporary files when fitting, predicting or transforming.
output_dir : str, optional: Path of the output directory for the resulting artifacts of Khiops learning tasks. See concrete estimator classes for more information about this parameter.
auto_sort : bool, default True: Advanced. See concrete estimator classes for information about this parameter.

export_dictionary_file(dictionary_file_path): Exports the model’s Khiops dictionary file (.kdic)

export_report_file(report_file_path)

Exports the model report to a JSON file

Parameters:

report_file_path : str: The location of the exported report file.

Raises:

ValueError: When the instance is not fitted.
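The two export methods can be combined in a small helper such as the one sketched below. The helper and the output file names are hypothetical; only the two method names and the ValueError raised by export_report_file on an unfitted instance come from the documentation above.

```python
import os


def export_model_artifacts(estimator, output_dir):
    """Export the report and the dictionary of a fitted Khiops estimator.

    Returns the two file paths, or None if the estimator is not fitted.
    """
    os.makedirs(output_dir, exist_ok=True)
    report_path = os.path.join(output_dir, "report.khj")  # hypothetical name
    dictionary_path = os.path.join(output_dir, "model.kdic")  # hypothetical name
    try:
        estimator.export_report_file(report_path)
    except ValueError:
        # Documented behavior: raised when the instance is not fitted.
        return None
    estimator.export_dictionary_file(dictionary_path)
    return report_path, dictionary_path
```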

fit(X, y=None, **kwargs)

Fit the estimator

Returns:

self : KhiopsEstimator: The fitted estimator instance.

class khiops.sklearn.estimators.KhiopsPredictor(n_features=100, n_trees=10, n_selected_features=0, n_evaluated_features=0, specific_pairs=None, all_possible_pairs=True, construction_rules=None, verbose=False, output_dir=None, auto_sort=True)

Bases: KhiopsSupervisedEstimator

Abstract Khiops Selective Naive Bayes Predictor

predict(X)

Predicts the target variable for the test dataset X

See the documentation of concrete subclasses for more details.

class khiops.sklearn.estimators.KhiopsRegressor(n_features=100, n_trees=0, n_selected_features=0, n_evaluated_features=0, construction_rules=None, verbose=False, output_dir=None, auto_sort=True)

Bases: RegressorMixin, KhiopsPredictor

Khiops Selective Naive Bayes Regressor

This regressor supports automatic feature engineering on multi-table datasets. See Multi-Table Learning Primer for more details.

Note

Visit the Khiops site to learn about the automatic feature engineering algorithm.

Parameters:

n_features : int, default 100

Multi-table only: Maximum number of multi-table aggregate features to construct. See Multi-Table Learning Primer for more details.

n_selected_features : int, default 0

Maximum number of features to be selected in the SNB predictor. If equal to 0 it selects all the features kept in the training.

n_evaluated_features : int, default 0

Maximum number of features to be evaluated in the SNB predictor training. If equal to 0 it evaluates all informative features.

construction_rules : list of str, optional

Allowed rules for the automatic feature construction. If not set, it uses all possible rules.

verbose : bool, default False

If True, it prints debug information and does not erase temporary files when fitting, predicting or transforming.

output_dir : str, optional

Path of the output directory for the AllReports.khj report file and the Modeling.kdic modeling dictionary file. By default these files are deleted.

auto_sort : bool, default True

Advanced. Only for multi-table inputs: If True, input tables are pre-sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the fit and predict methods. Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.

Attributes:

n_features_evaluated_ : int

The number of features evaluated by the regressor.

feature_evaluated_names_ : ndarray of shape (n_features_evaluated_,)

Names of the features evaluated by the regressor.

feature_evaluated_importances_ : ndarray of shape (n_features_evaluated_,)

Level of the features evaluated by the regressor. See below for a definition of the level.

n_features_used_ : int

The number of features used by the regressor.

feature_used_names_ : ndarray of shape (n_features_used_,)

Names of the features used by the regressor.

feature_used_importances_ : ndarray of shape (n_features_used_, 3)

Level, Weight and Importance of the features used by the regressor:

Level: A measure of the predictive importance of the feature taken individually. It ranges between 0 (no predictive interest) and 1 (optimal predictive importance).
Weight: A measure of the predictive importance of the feature taken relative to all features selected by the regressor. It ranges between 0 (little contribution to the model) and 1 (large contribution to the model).
Importance: The geometric mean between the Level and the Weight.

is_multitable_model_ : bool

True if the model was fitted on a multi-table dataset.

model_ : DictionaryDomain

The Khiops dictionary domain for the trained regressor.

model_main_dictionary_name_ : str

The name of the main Khiops dictionary within the model_ domain.

model_report_ : AnalysisResults

The Khiops report object.

Examples

See the following functions of the samples_sklearn.py documentation script:

samples_sklearn.khiops_regressor()

fit(X, y=None, **kwargs)

Fits a Selective Naive Bayes regressor according to X, y

Warning

Make sure that the type of y is float. This is easily done with y = y.astype(float).
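A small helper along the lines of this warning is sketched below. The helper name is hypothetical. It relies on the astype method when the target provides one, as NumPy arrays and pandas Series do, and otherwise falls back to a plain Python conversion.

```python
def as_float_target(y):
    """Return the regression target converted to float values."""
    if hasattr(y, "astype"):
        # NumPy arrays and pandas Series: same call as in the warning above.
        return y.astype(float)
    # Plain Python sequences have no astype; convert element by element.
    return [float(value) for value in y]


print(as_float_target([1, 2, 3]))  # [1.0, 2.0, 3.0]
```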

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
y : array-like of shape (n_samples,): The target values.

Returns:

self : KhiopsRegressor: The calling estimator instance.

predict(X)

Predicts the regression values for the test dataset X

The predicted value is estimated by the Selective Naive Bayes regressor learned during the fit step.

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).

Returns:

numpy.ndarray or str

An array containing the predicted values. A first column containing key column ids is added in multi-table mode. The array is in the form of:

numpy.ndarray if X is array-like, or a dataset spec containing pandas.DataFrame tables.
str (a path for the file containing the array) if X is a dataset spec containing file-path tables.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KhiopsRegressor

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self : object: The updated object.

class khiops.sklearn.estimators.KhiopsSupervisedEstimator(n_features=100, n_trees=10, specific_pairs=None, all_possible_pairs=True, construction_rules=None, verbose=False, output_dir=None, auto_sort=True)

Bases: KhiopsEstimator

Abstract Khiops Supervised Estimator

fit(X, y=None, **kwargs)

Fits a supervised estimator according to X, y

Called by the concrete sub-classes KhiopsEncoder, KhiopsClassifier, KhiopsRegressor.

Parameters:

X : array-like of shape (n_samples, n_features_in) or dict: Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
y : array-like of shape (n_samples,): The target values.

Returns:

self : KhiopsSupervisedEstimator: The calling estimator instance.