sklearn.dataset

Submodule of khiops.sklearn

Classes for handling diverse data tables

Functions

check_dataset_spec

Checks that a dataset spec is valid

get_khiops_type

Translates a numpy dtype to a Khiops dictionary type

get_khiops_variable_name

Returns the Khiops variable name associated with a column id

read_internal_data_table

Reads a data table file into a DataFrame using the internal format settings

table_name_of_path

Returns the table name as the last fragment of the table data path

write_internal_data_table

Writes a DataFrame to a data table file using the internal format settings

Classes

Dataset

A representation of a dataset

DatasetTable

A generic dataset table

NumpyTable

DatasetTable encapsulating a NumPy array

PandasTable

DatasetTable encapsulating a pandas dataframe

SparseTable

DatasetTable encapsulating a SciPy sparse matrix

class khiops.sklearn.dataset.Dataset(X, y=None, categorical_target=True)

Bases: object

A representation of a dataset

Parameters:
X: pandas.DataFrame or dict
Either:
  • A single dataframe

  • A dict dataset specification

y: pandas.Series, pandas.DataFrame or numpy.ndarray, optional

The target column.

categorical_target: bool, default True

True if the vector y should be considered as a categorical variable. If False it is considered as numeric. Ignored if y is None.

copy()

Creates a copy of the dataset

The pandas.DataFrame, numpy.ndarray and scipy.sparse.spmatrix objects referenced in the tables are copied as references: the underlying data is not duplicated.

create_khiops_dictionary_domain()

Creates a Khiops dictionary domain representing this dataset

Returns:
DictionaryDomain

The dictionary domain object representing this dataset

create_table_files_for_khiops(output_dir, sort=True)

Prepares the tables of the dataset to be used by Khiops

If this is a multi-table dataset, it will create sorted copies of the tables.

Parameters:
output_dir: str

The directory where the sorted tables will be created.

Returns:
tuple

A tuple containing:

  • The path of the main table

  • A dictionary containing the relation [data-path -> file-path] for the secondary tables. The dictionary is empty for monotable datasets.

get_table(table_name)

Returns a table by its name

Parameters:
table_name: str

The name of the table to be retrieved.

Returns:
DatasetTable

The table object for the specified name.

Raises:
KeyError

If there is no table with the specified name.

property is_multitable

bool : True if the dataset is multitable

property table_type

type : The table type of this dataset’s tables

Possible values:

to_spec()

Returns a dictionary specification of this dataset

class khiops.sklearn.dataset.DatasetTable(name, key=None)

Bases: ABC

A generic dataset table

check_key()

Checks that the key columns exist

create_khiops_dictionary()

Creates a Khiops dictionary representing this table

Returns:
Dictionary

The Khiops Dictionary object describing this table’s schema

abstract create_table_file_for_khiops(output_dir, sort=True)

Creates a copy of the table at the specified directory

n_features()

Returns the number of features of the table

The target column does not count.

class khiops.sklearn.dataset.NumpyTable(name, array, key=None)

Bases: DatasetTable

DatasetTable encapsulating a NumPy array

Parameters:
name: str

Name for the table.

array: numpy.ndarray of shape (n_samples, n_features_in) or Sequence

The array to be encapsulated.

key: array-like of int, optional

The names of the columns composing the key.

create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)

Creates a copy of the table at the specified directory

class khiops.sklearn.dataset.PandasTable(name, dataframe, key=None)

Bases: DatasetTable

DatasetTable encapsulating a pandas dataframe

Parameters:
name: str

Name for the table.

dataframe: pandas.DataFrame

The data frame to be encapsulated. It must be non-empty.

key: list of str, optional

The names of the columns composing the key.

create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)

Creates a copy of the table at the specified directory

class khiops.sklearn.dataset.SparseTable(name, matrix, key=None)

Bases: DatasetTable

DatasetTable encapsulating a SciPy sparse matrix

Parameters:
name: str

Name for the table.

matrix: scipy.sparse.spmatrix

The sparse matrix to be encapsulated.

key: list of str, optional

The names of the columns composing the key.

create_khiops_dictionary()

Creates a Khiops dictionary representing this sparse table

Adds metadata to each sparse variable

Returns:
Dictionary

The Khiops Dictionary object describing this table’s schema

create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)

Creates a copy of the table at the specified directory

khiops.sklearn.dataset.check_dataset_spec(ds_spec)

Checks that a dataset spec is valid

Parameters:
ds_spec: dict

A specification of a multi-table dataset (see Multi-Table Learning Primer).

Raises:
TypeError

If there are objects of the spec with invalid type.

ValueError

If there are objects of the spec with invalid values.

khiops.sklearn.dataset.get_khiops_type(numpy_type)

Translates a numpy dtype to a Khiops dictionary type

Parameters:
numpy_type: numpy.dtype

Numpy type of the column

Returns:
str

Khiops type name. Either “Categorical”, “Numerical” or “Timestamp”

khiops.sklearn.dataset.get_khiops_variable_name(column_id)

Returns the Khiops variable name associated with a column id

khiops.sklearn.dataset.read_internal_data_table(file_path_or_stream, column_dtypes=None)

Reads a data table file into a DataFrame using the internal format settings

The table is read with the following settings:

  • Use tab as separator

  • Read the column names from the first line

  • Use '"' as quote character

  • Use csv.QUOTE_MINIMAL

  • Double quoting enabled (a quote inside a quoted field is escaped as '""')

  • UTF-8 encoding

  • User-specified dtypes (optional)

Parameters:
file_path_or_stream: str or file object

The path of the internal data table file to be read or a readable file object.

column_dtypes: dict, optional

Dictionary linking column names with dtypes. See dtype parameter of the pandas.read_csv function. If not set, then the column types are detected automatically by pandas.

Returns:
pandas.DataFrame

The dataframe representation of the data table.
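
The read settings listed above correspond closely to options of Python's standard csv module. The following is a minimal sketch under that assumption, not the khiops implementation: the helper name read_tab_table and the sample data are hypothetical, and it returns plain lists instead of a DataFrame.

```python
import csv
import io


def read_tab_table(stream):
    """Read a tab-separated table with a header line (hypothetical helper)."""
    reader = csv.reader(
        stream,
        delimiter="\t",             # tab as separator
        quotechar='"',              # '"' as quote character
        quoting=csv.QUOTE_MINIMAL,
        doublequote=True,           # '""' inside a quoted field means '"'
    )
    header = next(reader)           # column names come from the first line
    rows = list(reader)
    return header, rows


# When reading from a real file, open it with encoding="utf-8" and newline="".
sample = 'Id\tComment\n1\t"said ""hi"""\n2\tplain\n'
header, rows = read_tab_table(io.StringIO(sample))
```

All values come back as strings here; in the documented function, column types are either given by column_dtypes or detected by pandas.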

khiops.sklearn.dataset.table_name_of_path(table_path)

Returns the table name as the last fragment of the table data path

Parameters:
table_path: str

Data path of the table, in the format “path/to/table”.

Returns:
str

The name of the table.
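
As an illustration of the documented behavior, the table name is the text after the last "/" of the data path. The sketch below is an assumption about that behavior, written with a hypothetical helper name; it is not the library's code.

```python
def last_path_fragment(table_path):
    """Return the last '/'-separated fragment of a data path (hypothetical helper)."""
    return table_path.split("/")[-1]


name = last_path_fragment("path/to/table")
root_name = last_path_fragment("main")  # a path without '/' is its own last fragment
```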

khiops.sklearn.dataset.write_internal_data_table(dataframe, file_path_or_stream)

Writes a DataFrame to a data table file using the internal format settings

The table is written with the following settings:

  • Use tab as separator

  • Write the column names on the first line

  • Use '"' as quote character

  • Use csv.QUOTE_MINIMAL

  • Double quoting enabled (a quote inside a quoted field is escaped as '""')

  • UTF-8 encoding

  • The index is not written

Khiops cannot handle multi-line records. Hence, the carriage returns / line feeds need to be removed from the records before data is handed over to Khiops.

Parameters:
dataframe: pandas.DataFrame

The dataframe to write.

file_path_or_stream: str or file object

The path of the internal data table file to be written or a writable file object.
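
The write settings and the single-line record constraint can be sketched with the standard csv module. This is an illustration only, not the khiops implementation: the helper names are hypothetical, rows are plain lists rather than a DataFrame, and replacing line breaks with a space is an assumption (the text above only says they must be removed).

```python
import csv
import io


def strip_line_breaks(value):
    """Replace CR/LF in string values with a space (assumed policy)."""
    if isinstance(value, str):
        return value.replace("\r", " ").replace("\n", " ")
    return value


def write_tab_table(stream, header, rows):
    """Write a header line and rows as tab-separated text (hypothetical helper)."""
    writer = csv.writer(
        stream,
        delimiter="\t",             # tab as separator
        quotechar='"',              # '"' as quote character
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # '"' inside a field is written as '""'
        lineterminator="\n",
    )
    writer.writerow(header)         # column names on the first line
    for row in rows:
        writer.writerow([strip_line_breaks(v) for v in row])


# When writing to a real file, open it with encoding="utf-8" and newline="".
buffer = io.StringIO()
write_tab_table(buffer, ["Id", "Comment"], [[1, "first\nsecond"], [2, 'a "b"']])
output = buffer.getvalue()
```

Stripping the line breaks before writing keeps each record on one physical line, so the file has exactly one line per row plus the header.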