sklearn.estimators
Submodule of khiops.sklearn
Scikit-Learn Estimator Classes for the Khiops AutoML Suite
Class Overview
The diagram below describes the relationships in this module:
KhiopsEstimator(ABC, BaseEstimator)
    |
    +- KhiopsCoclustering(ClusterMixin)
    |
    +- KhiopsSupervisedEstimator
       |
       +- KhiopsPredictor
       |  |
       |  +- KhiopsClassifier(ClassifierMixin)
       |  |
       |  +- KhiopsRegressor(RegressorMixin)
       |
       +- KhiopsEncoder(TransformerMixin)
Classes
KhiopsClassifier | Khiops Selective Naive Bayes Classifier
KhiopsCoclustering | A Khiops Coclustering model
KhiopsEncoder | Khiops supervised discretization/grouping encoder
KhiopsEstimator | Base class for Khiops Scikit-learn estimators
KhiopsPredictor | Abstract Khiops Selective Naive Bayes Predictor
KhiopsRegressor | Khiops Selective Naive Bayes Regressor
KhiopsSupervisedEstimator | Abstract Khiops Supervised Estimator
- class khiops.sklearn.estimators.KhiopsClassifier(n_features=100, n_pairs=0, n_trees=10, n_selected_features=0, n_evaluated_features=0, specific_pairs=None, all_possible_pairs=True, construction_rules=None, group_target_value=False, verbose=False, output_dir=None, auto_sort=True)
Bases: ClassifierMixin, KhiopsPredictor
Khiops Selective Naive Bayes Classifier
This classifier supports automatic feature engineering on multi-table datasets. See Multi-Table Learning Primer for more details.
Note
Visit the Khiops site to learn about the automatic feature engineering algorithm.
- Parameters:
- n_features : int, default 100
Multi-table only: Maximum number of multi-table aggregate features to construct. See Multi-Table Learning Primer for more details.
- n_pairs : int, default 0
Maximum number of pair features to construct. These features are 2D grid partitions of univariate feature pairs. The grid is optimized such that in each cell the target distribution is well approximated by a constant histogram. Only pairs that are jointly more informative than their marginals may be taken into account in the classifier.
- n_trees : int, default 10
Maximum number of decision tree features to construct. The constructed trees combine other features, either native or constructed. These features usually improve the classifier's performance at the cost of interpretability of the model.
- n_selected_features : int, default 0
Maximum number of features to be selected in the SNB predictor. If equal to 0, it selects all the features kept in the training.
- n_evaluated_features : int, default 0
Maximum number of features to be evaluated in the SNB predictor training. If equal to 0, it evaluates all informative features.
- specific_pairs : list of tuple, optional
User-specified pairs as a list of 2-tuples of feature names. If a given tuple contains only one non-empty feature name, then it generates all the pairs containing it (within the maximum limit n_pairs). These pairs have top priority: they are constructed first.
- all_possible_pairs : bool, default True
If True, it tries to create all possible pairs within the limit n_pairs. Pairs specified with specific_pairs have top priority: they are constructed first.
- construction_rules : list of str, optional
Allowed rules for the automatic feature construction. If not set, it uses all possible rules.
- group_target_value : bool, default False
Allows grouping of the target values in classification. It can substantially increase the training time.
- verbose : bool, default False
If True, it prints debug information and it does not erase temporary files when fitting, predicting or transforming.
- output_dir : str, optional
Path of the output directory for the AllReports.khj report file and the Modeling.kdic modeling dictionary file. By default these files are deleted.
- auto_sort : bool, default True
Advanced. Only for multi-table inputs: If True, input tables are pre-sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the fit, predict and predict_proba methods.
Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.
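The description of specific_pairs above can be read as the following expansion rule. This is only a sketch of the documented behaviour, not Khiops code: the helper expand_specific_pairs, the feature names, and the order in which generated pairs appear are all assumptions made for illustration.

```python
def expand_specific_pairs(feature_names, specific_pairs, n_pairs):
    """Expand user-specified pairs as the parameter description suggests.

    Hypothetical helper for illustration only: Khiops performs this step
    internally and may order or deduplicate pairs differently.
    """
    expanded = []
    seen = set()
    for first, second in specific_pairs:
        if first and second:
            candidates = [(first, second)]
        else:
            # Only one non-empty name: pair it with every other feature.
            anchor = first or second
            if not anchor:
                continue  # both names empty: nothing to generate
            candidates = [(anchor, other) for other in feature_names if other != anchor]
        for pair in candidates:
            key = frozenset(pair)
            if key not in seen:
                seen.add(key)
                expanded.append(pair)
    # Documented limit: never more than n_pairs pairs are constructed.
    return expanded[:n_pairs]


# Hypothetical feature names.
features = ["age", "income", "city"]
pairs = expand_specific_pairs(features, [("age", "income"), ("city", "")], n_pairs=2)
print(pairs)
```

With the default n_pairs=0 this sketch returns no pairs at all, which matches the reading that specific_pairs only takes effect within the n_pairs limit.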
- Attributes:
- n_classes_ : int
The number of classes seen in training.
- classes_ : ndarray of shape (n_classes_,)
The list of classes seen in training. Depending on the training target, the contents are int or str.
- n_features_evaluated_ : int
The number of features evaluated by the classifier.
- feature_evaluated_names_ : ndarray of shape (n_features_evaluated_,)
Names of the features evaluated by the classifier.
- feature_evaluated_importances_ : ndarray of shape (n_features_evaluated_,)
Level of the features evaluated by the classifier. See below for a definition of the level.
- n_features_used_ : int
The number of features used by the classifier.
- feature_used_names_ : ndarray of shape (n_features_used_,)
Names of the features used by the classifier.
- feature_used_importances_ : ndarray of shape (n_features_used_, 3)
Level, Weight and Importance of the features used by the classifier:
Level: A measure of the predictive importance of the feature taken individually. It ranges between 0 (no predictive interest) and 1 (optimal predictive importance).
Weight: A measure of the predictive importance of the feature taken relative to all features selected by the classifier. It ranges between 0 (little contribution to the model) and 1 (large contribution to the model).
Importance: The geometric mean between the Level and the Weight.
- is_multitable_model_ : bool
True if the model was fitted on a multi-table dataset.
- model_ : DictionaryDomain
The Khiops dictionary domain for the trained classifier.
- model_main_dictionary_name_ : str
The name of the main Khiops dictionary within the model_ domain.
- model_report_ : AnalysisResults
The Khiops report object.
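The Importance column of feature_used_importances_ is defined above as the geometric mean of Level and Weight. A minimal sketch of that formula follows; the helper name and the (Level, Weight) values are made up for illustration and do not come from a real Khiops report.

```python
import math


def importance(level, weight):
    """Geometric mean of Level and Weight, both expected in [0, 1]."""
    return math.sqrt(level * weight)


# Hypothetical (Level, Weight) rows, one per used feature.
rows = [(0.25, 1.0), (0.04, 0.25)]
for level, weight in rows:
    print(level, weight, importance(level, weight))
```

Because it is a geometric mean, a feature with a high Level but a near-zero Weight still ends up with a low Importance.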
Examples
- See the relevant functions of the samples_sklearn.py documentation script.
- fit(X, y, **kwargs)
Fits a Selective Naive Bayes classifier according to X, y
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- y : array-like of shape (n_samples,)
The target values.
- Returns:
- self : KhiopsClassifier
The calling estimator instance.
- predict(X)
Predicts the most probable class for the test dataset X
The predicted class of an input sample is the arg-max of the conditional probabilities P(y|X) for each value of y.
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- Returns:
ndarray
An array containing the predicted classes. A first column containing key column ids is added in multi-table mode. The numpy.dtype of the array matches the type of y used during training. It will be integer, float, or boolean if the classifier was trained with a y of the corresponding type. Otherwise it will be str.
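The arg-max rule stated for predict can be illustrated without Khiops. The sketch below uses plain Python lists; the class labels and probabilities are invented, and most_probable_class is a hypothetical helper, not part of the library.

```python
def most_probable_class(classes, probabilities):
    """Return the class whose conditional probability P(y|X) is highest."""
    best_index = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best_index]


# Hypothetical classes_ attribute and per-sample class probabilities.
classes = ["setosa", "versicolor", "virginica"]
proba_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
predictions = [most_probable_class(classes, row) for row in proba_rows]
print(predictions)
```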
- predict_proba(X)
Predicts the class probabilities for the test dataset X
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- Returns:
numpy.array or str
The probability of the samples for each class in the model. The columns are named with the pattern Prob<class> for each <class> found in the training dataset. The output data container depends on X:
Dataframe or dataframe-based dict dataset specification: numpy.array
The key columns are added for multi-table tasks.
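The Prob<class> naming pattern mentioned above can be sketched as follows. The class labels and probability values are hypothetical; only the naming pattern itself comes from the description.

```python
def proba_column_names(classes):
    """Build one column name per class following the Prob<class> pattern."""
    return [f"Prob{c}" for c in classes]


# Hypothetical classes seen in training and one row of probabilities.
classes = ["no", "yes"]
columns = proba_column_names(classes)
row = dict(zip(columns, [0.8, 0.2]))
print(columns, row)
```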
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> KhiopsClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
- Returns:
- self : object
The updated object.
- class khiops.sklearn.estimators.KhiopsCoclustering(verbose=False, output_dir=None, auto_sort=True, build_name_var=True, build_distance_vars=False, build_frequency_vars=False)
Bases: ClusterMixin, KhiopsEstimator
A Khiops Coclustering model
A coclustering is a non-supervised piecewise constant density estimator.
- Parameters:
- build_distance_vars : bool, default False
If True, includes a cluster distance variable in the deployment.
- build_frequency_vars : bool, default False
If True, includes the frequency variables in the deployment.
- build_name_var : bool, default True
If True, includes a cluster id variable in the deployment.
- verbose : bool, default False
If True, it prints debug information and it does not erase temporary files when fitting, predicting or transforming.
- output_dir : str, optional
Path of the output directory for the Coclustering.khcj report file and the Coclustering.kdic modeling dictionary file.
- auto_sort : bool, default True
Advanced. Only for multi-table inputs: If True, input tables are automatically sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the predict method.
Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.
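To make the "left-to-right, hierarchical, lexicographic" wording concrete, here is a small sketch of such a sort on a list of records. It assumes that key values compare as strings; the table, the column names and the helper sort_by_key are hypothetical, and this is not how Khiops implements auto_sort internally.

```python
def sort_by_key(rows, key_columns):
    """Sort records by the first key column, then the second, and so on."""
    return sorted(rows, key=lambda row: tuple(str(row[c]) for c in key_columns))


# Hypothetical secondary table keyed by (customer_id, order_id).
rows = [
    {"customer_id": "c2", "order_id": "o1"},
    {"customer_id": "c1", "order_id": "o2"},
    {"customer_id": "c1", "order_id": "o1"},
]
sorted_rows = sort_by_key(rows, ["customer_id", "order_id"])
print(sorted_rows)
```

Tables that already arrive in this order are the case where setting auto_sort to False avoids a redundant sort.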
- Attributes:
- is_multitable_model_ : bool
True if the model was fitted on a multi-table dataset.
- model_ : DictionaryDomain
The Khiops dictionary domain for the trained coclustering. For coclustering it is a multi-table dictionary even though the model is single-table.
- model_main_dictionary_name_ : str
The name of the main Khiops dictionary within the model_ domain.
- model_report_ : CoclusteringResults
The Khiops report object.
Examples
- See the relevant functions of the samples_sklearn.py documentation script.
- fit(X, y=None, **kwargs)
Trains a Khiops Coclustering model
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- id_column : str
The column that contains the id of the instance.
- columns : list, optional
The columns to be co-clustered. If not specified it uses all columns.
- Returns:
- self : KhiopsCoclustering
The calling estimator instance.
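Since the signature is fit(X, y=None, **kwargs), id_column and columns are presumably passed as keyword arguments. The wrapper below is a sketch under that assumption; fit_coclustering is a hypothetical helper and it works with any object exposing a compatible fit method.

```python
def fit_coclustering(estimator, X, id_column, columns=None):
    """Fit a coclustering estimator, forwarding the documented keyword arguments."""
    fit_kwargs = {"id_column": id_column}
    if columns is not None:
        # Only forward `columns` when given, so the estimator default applies otherwise.
        fit_kwargs["columns"] = columns
    return estimator.fit(X, **fit_kwargs)


# Intended usage, assuming khiops is installed and X is a dataframe
# with a hypothetical "customer_id" column:
#   from khiops.sklearn.estimators import KhiopsCoclustering
#   model = fit_coclustering(KhiopsCoclustering(), X, id_column="customer_id")
```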
- fit_predict(X, y=None, **kwargs)
Performs clustering on X and returns result (instead of labels)
- predict(X)
Predicts the most probable cluster for the test dataset X
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- Returns:
ndarray
An array containing the encoded columns. A first column containing key column ids is added in multi-table mode.
- simplify(max_preserved_information=0, max_cells=0, max_total_parts=0, max_part_numbers=None)
Creates a simplified coclustering model from the current instance
- Parameters:
- max_preserved_information : int, default 0
Maximum information preserved in the simplified coclustering. If equal to 0 there is no limit.
- max_cells : int, default 0
Maximum number of cells in the simplified coclustering. If equal to 0 there is no limit.
- max_total_parts : int, default 0
Maximum number of parts totaled over all variables. If equal to 0 there is no limit.
- max_part_numbers : dict, optional
Maximum number of clusters for each of the co-clustered columns. Specifically, a key-value pair of this dictionary represents the column name and its respective maximum number of clusters. If not specified, then no maximum number of clusters is imposed on any column.
- Returns:
- self : KhiopsCoclustering
A new, simplified KhiopsCoclustering estimator instance.
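Because simplify is documented to return a new estimator, the fitted original can be kept alongside the simplified model. The sketch below shows the shape of a max_part_numbers dictionary; the column names and the helper simplify_keeping_original are hypothetical.

```python
def simplify_keeping_original(coclustering, max_part_numbers):
    """Return (original, simplified) so both models remain available."""
    simplified = coclustering.simplify(max_part_numbers=max_part_numbers)
    return coclustering, simplified


# Hypothetical column names mapped to their maximum number of clusters.
part_limits = {"customer_id": 5, "product_id": 10}

# Intended usage, assuming `model` is a fitted KhiopsCoclustering:
#   model, small_model = simplify_keeping_original(model, part_limits)
```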
- class khiops.sklearn.estimators.KhiopsEncoder(categorical_target=True, n_features=100, n_pairs=0, n_trees=0, specific_pairs=None, all_possible_pairs=True, construction_rules=None, informative_features_only=True, group_target_value=False, keep_initial_variables=False, transform_type_categorical='part_id', transform_type_numerical='part_id', transform_type_pairs='part_id', verbose=False, output_dir=None, auto_sort=True)
Bases: TransformerMixin, KhiopsSupervisedEstimator
Khiops supervised discretization/grouping encoder
- Parameters:
- categorical_target : bool, default True
True if the target column is categorical.
- n_features : int, default 100
Multi-table only: Maximum number of multi-table aggregate features to construct. See Multi-Table Learning Primer for more details.
- n_pairs : int, default 0
Maximum number of pair features to construct. These features are 2D grid partitions of univariate feature pairs. The grid is optimized such that in each cell the target distribution is well approximated by a constant histogram. Only pairs that are jointly more informative than their marginals may be taken into account in the encoder.
- n_trees : int, default 0
Maximum number of decision tree features to construct. The constructed trees combine other features, either native or constructed. These features usually improve a predictor's performance at the cost of interpretability of the model.
- specific_pairs : list of tuple, optional
User-specified pairs as a list of 2-tuples of feature names. If a given tuple contains only one non-empty feature name, then it generates all the pairs containing it (within the maximum limit n_pairs). These pairs have top priority: they are constructed first.
- all_possible_pairs : bool, default True
If True, it tries to create all possible pairs within the limit n_pairs. Pairs specified with specific_pairs have top priority: they are constructed first.
- construction_rules : list of str, optional
Allowed rules for the automatic feature construction. If not set, it uses all possible rules.
- informative_features_only : bool, default True
If True, keeps only informative features.
- group_target_value : bool, default False
Allows grouping of the target values in classification. It can substantially increase the training time.
- keep_initial_variables : bool, default False
If True, the original columns are kept in the transformed data.
- transform_type_categorical : str, default "part_id"
Type of transformation for categorical features. Valid values:
"part_id"
"part_label"
"dummies"
"conditional_info"
See the documentation for the categorical_recoding_method parameter of the train_recoder function for more details.
- transform_type_numerical : str, default "part_id"
Type of transformation for numerical features. Valid values:
"part_id"
"part_label"
"dummies"
"conditional_info"
"center_reduction"
"0-1_normalization"
"rank_normalization"
See the documentation for the numerical_recoding_method parameter of the train_recoder function for more details.
- transform_type_pairs : str, default "part_id"
Type of transformation for bivariate features. Valid values:
"part_id"
"part_label"
"dummies"
"conditional_info"
- verbose : bool, default False
If True, it prints debug information and it does not erase temporary files when fitting, predicting or transforming.
- output_dir : str, optional
Path of the output directory for the AllReports.khj report file and the Modeling.kdic modeling dictionary file. By default these files are deleted.
- auto_sort : bool, default True
Advanced. Only for multi-table inputs: If True, input tables are pre-sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the fit and transform methods.
Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.
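The valid values of the three transform_type_* parameters differ slightly: the numerical one accepts three extra normalization options. A small pre-flight check along these lines can catch typos before constructing the encoder. It is a sketch: check_transform_types is a hypothetical helper, and the value sets are simply copied from the lists above.

```python
# Values accepted for categorical and pair features, per the lists above.
CATEGORICAL_AND_PAIR_TYPES = {"part_id", "part_label", "dummies", "conditional_info"}
# Numerical features accept the same values plus three normalizations.
NUMERICAL_TYPES = CATEGORICAL_AND_PAIR_TYPES | {
    "center_reduction",
    "0-1_normalization",
    "rank_normalization",
}


def check_transform_types(categorical="part_id", numerical="part_id", pairs="part_id"):
    """Raise ValueError if a transform type is not in the documented lists."""
    checks = [
        ("transform_type_categorical", categorical, CATEGORICAL_AND_PAIR_TYPES),
        ("transform_type_numerical", numerical, NUMERICAL_TYPES),
        ("transform_type_pairs", pairs, CATEGORICAL_AND_PAIR_TYPES),
    ]
    for name, value, valid in checks:
        if value not in valid:
            raise ValueError(f"{name}={value!r} is not one of {sorted(valid)}")
    # Return the validated values as constructor keyword arguments.
    return {name: value for name, value, _ in checks}


encoder_kwargs = check_transform_types(numerical="rank_normalization")
print(encoder_kwargs)
```

The returned dictionary uses the constructor's parameter names, so it could be unpacked into KhiopsEncoder(**encoder_kwargs) when the library is available.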
- Attributes:
- n_features_evaluated_ : int
The number of features evaluated by the encoder.
- feature_evaluated_names_ : ndarray of shape (n_features_evaluated_,)
Names of the features evaluated by the encoder.
- feature_evaluated_importances_ : ndarray of shape (n_features_evaluated_,)
Level of the features evaluated by the encoder. The Level is a measure of the predictive importance of the feature taken individually. It ranges between 0 (no predictive interest) and 1 (optimal predictive importance).
- is_multitable_model_ : bool
True if the model was fitted on a multi-table dataset.
- model_ : DictionaryDomain
The Khiops dictionary domain for the trained encoder.
- model_main_dictionary_name_ : str
The name of the main Khiops dictionary within the model_ domain.
- model_report_ : AnalysisResults
The Khiops report object.
Examples
- See the relevant functions of the samples_sklearn.py documentation script.
- fit(X, y=None, **kwargs)
Fits the Khiops Encoder according to X, y
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- y : array-like of shape (n_samples,)
The target values.
- Returns:
- self : KhiopsEncoder
The calling estimator instance.
- fit_transform(X, y=None, **kwargs)
Fits the encoder and transforms its inputs
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- y : array-like of shape (n_samples,)
The target values.
- Returns:
ndarray
An array containing the encoded columns. A first column containing key column ids is added in multi-table mode.
- transform(X)
Transforms X with a fitted Khiops supervised encoder
Note
Numerical features are encoded to categorical ones. See the transform_type_numerical parameter for details.
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Dataset to transform. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- Returns:
ndarray
An array containing the encoded columns. A first column containing key column ids is added in multi-table mode.
- class khiops.sklearn.estimators.KhiopsEstimator(verbose=False, output_dir=None, auto_sort=True)
Bases: ABC, BaseEstimator
Base class for Khiops Scikit-learn estimators
Note
The input features collection X needs to have single-line records so that Khiops can handle them. Hence, multi-line records are preprocessed: carriage returns / line feeds are replaced with blank spaces before being handed over to Khiops.
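The preprocessing described in this note can be approximated as below. This is a sketch only: to_single_line is a hypothetical helper, and replacing each carriage return and each line feed by its own space (so that a Windows line ending yields two spaces) is an assumption about a detail the note does not specify.

```python
# Map each carriage return and line feed character to a single blank space.
_NEWLINES_TO_SPACES = str.maketrans({"\r": " ", "\n": " "})


def to_single_line(value):
    """Turn a multi-line text value into a single-line record field."""
    return value.translate(_NEWLINES_TO_SPACES)


print(to_single_line("first line\nsecond line"))
```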
- Parameters:
- verbose : bool, default False
If True, it prints debug information and it does not erase temporary files when fitting, predicting or transforming.
- output_dir : str, optional
Path of the output directory for the resulting artifacts of Khiops learning tasks. See concrete estimator classes for more information about this parameter.
- auto_sort : bool, default True
Advanced. See concrete estimator classes for information about this parameter.
- export_dictionary_file(dictionary_file_path)
Exports the model's Khiops dictionary file (.kdic)
- export_report_file(report_file_path)
Exports the model report to a JSON file
- Parameters:
- report_file_path : str
The location of the exported report file.
- Raises:
ValueError
When the instance is not fitted.
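A possible way to call both export methods together is sketched below. The helper export_model_artifacts and the output file names are hypothetical. Only export_report_file is documented to raise ValueError on an unfitted instance; treating export_dictionary_file the same way is an assumption.

```python
import os


def export_model_artifacts(estimator, output_dir):
    """Export the report and dictionary files, or return None if not fitted."""
    # Hypothetical file names; any paths accepted by the estimator would do.
    report_path = os.path.join(output_dir, "report.khj")
    dictionary_path = os.path.join(output_dir, "model.kdic")
    try:
        estimator.export_report_file(report_path)
        estimator.export_dictionary_file(dictionary_path)
    except ValueError:
        # Documented for export_report_file: raised when the instance is not fitted.
        return None
    return report_path, dictionary_path
```

Exporting explicitly is one way to keep the artifacts when output_dir was left unset on the estimator.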
- fit(X, y=None, **kwargs)
Fits the estimator
- Returns:
- self : KhiopsEstimator
The fitted estimator instance.
- class khiops.sklearn.estimators.KhiopsPredictor(n_features=100, n_trees=10, n_selected_features=0, n_evaluated_features=0, specific_pairs=None, all_possible_pairs=True, construction_rules=None, verbose=False, output_dir=None, auto_sort=True)
Bases: KhiopsSupervisedEstimator
Abstract Khiops Selective Naive Bayes Predictor
- predict(X)
Predicts the target variable for the test dataset X
See the documentation of concrete subclasses for more details.
- class khiops.sklearn.estimators.KhiopsRegressor(n_features=100, n_trees=0, n_selected_features=0, n_evaluated_features=0, construction_rules=None, verbose=False, output_dir=None, auto_sort=True)
Bases: RegressorMixin, KhiopsPredictor
Khiops Selective Naive Bayes Regressor
This regressor supports automatic feature engineering on multi-table datasets. See Multi-Table Learning Primer for more details.
Note
Visit the Khiops site to learn about the automatic feature engineering algorithm.
- Parameters:
- n_features : int, default 100
Multi-table only: Maximum number of multi-table aggregate features to construct. See Multi-Table Learning Primer for more details.
- n_selected_features : int, default 0
Maximum number of features to be selected in the SNB predictor. If equal to 0, it selects all the features kept in the training.
- n_evaluated_features : int, default 0
Maximum number of features to be evaluated in the SNB predictor training. If equal to 0, it evaluates all informative features.
- construction_rules : list of str, optional
Allowed rules for the automatic feature construction. If not set, it uses all possible rules.
- verbose : bool, default False
If True, it prints debug information and it does not erase temporary files when fitting, predicting or transforming.
- output_dir : str, optional
Path of the output directory for the AllReports.khj report file and the Modeling.kdic modeling dictionary file. By default these files are deleted.
- auto_sort : bool, default True
Advanced. Only for multi-table inputs: If True, input tables are pre-sorted by their key before executing Khiops. If the input tables are already sorted by their keys, set this parameter to False to speed up the processing. This affects the fit and predict methods.
Note: The sort by key is performed in a left-to-right, hierarchical, lexicographic manner.
- Attributes:
- n_features_evaluated_ : int
The number of features evaluated by the regressor.
- feature_evaluated_names_ : ndarray of shape (n_features_evaluated_,)
Names of the features evaluated by the regressor.
- feature_evaluated_importances_ : ndarray of shape (n_features_evaluated_,)
Level of the features evaluated by the regressor. See below for a definition of the level.
- n_features_used_ : int
The number of features used by the regressor.
- feature_used_names_ : ndarray of shape (n_features_used_,)
Names of the features used by the regressor.
- feature_used_importances_ : ndarray of shape (n_features_used_, 3)
Level, Weight and Importance of the features used by the regressor:
Level: A measure of the predictive importance of the feature taken individually. It ranges between 0 (no predictive interest) and 1 (optimal predictive importance).
Weight: A measure of the predictive importance of the feature taken relative to all features selected by the regressor. It ranges between 0 (little contribution to the model) and 1 (large contribution to the model).
Importance: The geometric mean between the Level and the Weight.
- is_multitable_model_ : bool
True if the model was fitted on a multi-table dataset.
- model_ : DictionaryDomain
The Khiops dictionary domain for the trained regressor.
- model_main_dictionary_name_ : str
The name of the main Khiops dictionary within the model_ domain.
- model_report_ : AnalysisResults
The Khiops report object.
Examples
- See the relevant functions of the samples_sklearn.py documentation script.
- fit(X, y=None, **kwargs)
Fits a Selective Naive Bayes regressor according to X, y
Warning
Make sure that the type of y is float. This is easily done with y = y.astype(float).
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- y : array-like of shape (n_samples,)
The target values.
- Returns:
- self : KhiopsRegressor
The calling estimator instance.
- predict(X)
Predicts the regression values for the test dataset X
The predicted value is estimated by the Selective Naive Bayes Regressor learned during the fit step.
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Test dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- Returns:
numpy.ndarray or str
An array containing the predicted values. A first column containing key column ids is added in multi-table mode. The array is in the form of:
numpy.ndarray if X is array-like, or a dataset spec containing pandas.DataFrame tables.
str (a path for the file containing the array) if X is a dataset spec containing file-path tables.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> KhiopsRegressor
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
- Returns:
- self : object
The updated object.
- class khiops.sklearn.estimators.KhiopsSupervisedEstimator(n_features=100, n_trees=10, specific_pairs=None, all_possible_pairs=True, construction_rules=None, verbose=False, output_dir=None, auto_sort=True)
Bases: KhiopsEstimator
Abstract Khiops Supervised Estimator
- fit(X, y=None, **kwargs)
Fits a supervised estimator according to X, y
Called by the concrete sub-classes KhiopsEncoder, KhiopsClassifier, KhiopsRegressor.
- Parameters:
- X : array-like of shape (n_samples, n_features_in) or dict
Training dataset. Either an array-like or a dict specification for multi-table datasets (see Multi-Table Learning Primer).
- y : array-like of shape (n_samples,)
The target values.
- Returns:
- self : KhiopsSupervisedEstimator
The calling estimator instance.