sklearn.dataset¶
Submodule of khiops.sklearn
Classes for handling diverse data tables
Functions¶
check_dataset_spec | Checks that a dataset spec is valid
get_khiops_type | Translates a numpy dtype to a Khiops dictionary type
get_khiops_variable_name | Returns the Khiops variable name associated with a column id
read_internal_data_table | Reads into a DataFrame a data table file with the internal format settings
table_name_of_path | Returns the table name as the last fragment of the table data path
write_internal_data_table | Writes a DataFrame to a data table file with the internal format settings
Classes¶
Dataset | A representation of a dataset
DatasetTable | A generic dataset table
NumpyTable | DatasetTable encapsulating a NumPy array
PandasTable | DatasetTable encapsulating a pandas dataframe
SparseTable | DatasetTable encapsulating a SciPy sparse matrix
- class khiops.sklearn.dataset.Dataset(X, y=None, categorical_target=True)¶
Bases: object
A representation of a dataset
- Parameters:
- X : pandas.DataFrame or dict
Either:
- A single dataframe
- A dict dataset specification
- y : pandas.Series, pandas.DataFrame or numpy.ndarray, optional
The target column.
- categorical_target : bool, default True
True if the vector y should be considered as a categorical variable. If False it is considered as numeric. Ignored if y is None.
- copy()¶
Creates a copy of the dataset
Referenced pandas.DataFrame, numpy.ndarray and scipy.sparse.spmatrix objects in tables are copied as references (the underlying data is not duplicated).
- create_khiops_dictionary_domain()¶
Creates a Khiops dictionary domain representing this dataset
- Returns:
DictionaryDomain: The dictionary domain object representing this dataset
- create_table_files_for_khiops(output_dir, sort=True)¶
Prepares the tables of the dataset to be used by Khiops
If this is a multi-table dataset, it creates sorted copies of the tables.
- Parameters:
- output_dir : str
The directory where the sorted tables will be created.
- Returns:
- tuple
A tuple containing:
The path of the main table
A dictionary containing the relation [data-path -> file-path] for the secondary tables. The dictionary is empty for monotable datasets.
- get_table(table_name)¶
Returns a table by its name
- Parameters:
- table_name: str
The name of the table to be retrieved.
- Returns:
DatasetTable: The table object for the specified name.
- Raises:
KeyError: If there is no table with the specified name.
- property is_multitable¶
bool :
True if the dataset is multitable
- property table_type¶
type : The table type of this dataset’s tables
Possible values:
- to_spec()¶
Returns a dictionary specification of this dataset
- class khiops.sklearn.dataset.DatasetTable(name, key=None)¶
Bases: ABC
A generic dataset table
- check_key()¶
Checks that the key columns exist
- create_khiops_dictionary()¶
Creates a Khiops dictionary representing this table
- Returns:
Dictionary: The Khiops Dictionary object describing this table’s schema
- abstract create_table_file_for_khiops(output_dir, sort=True)¶
Creates a copy of the table at the specified directory
- n_features()¶
Returns the number of features of the table
The target column does not count.
- class khiops.sklearn.dataset.NumpyTable(name, array, key=None)¶
Bases: DatasetTable
DatasetTable encapsulating a NumPy array
- Parameters:
- name : str
Name for the table.
- array : numpy.ndarray of shape (n_samples, n_features_in) or Sequence
The array to be encapsulated.
- key : array-like of int, optional
The names of the columns composing the key.
- create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)¶
Creates a copy of the table at the specified directory
- class khiops.sklearn.dataset.PandasTable(name, dataframe, key=None)¶
Bases: DatasetTable
DatasetTable encapsulating a pandas dataframe
- Parameters:
- name : str
Name for the table.
- dataframe : pandas.DataFrame
The data frame to be encapsulated. It must be non-empty.
- key : list of str, optional
The names of the columns composing the key.
- create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)¶
Creates a copy of the table at the specified directory
- class khiops.sklearn.dataset.SparseTable(name, matrix, key=None)¶
Bases: DatasetTable
DatasetTable encapsulating a SciPy sparse matrix
- Parameters:
- name : str
Name for the table.
- matrix : scipy.sparse.spmatrix
The sparse matrix to be encapsulated.
- key : list of str, optional
The names of the columns composing the key.
- create_khiops_dictionary()¶
Creates a Khiops dictionary representing this sparse table
Adds metadata to each sparse variable
- Returns:
Dictionary: The Khiops Dictionary object describing this table’s schema
- create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)¶
Creates a copy of the table at the specified directory
- khiops.sklearn.dataset.check_dataset_spec(ds_spec)¶
Checks that a dataset spec is valid
- Parameters:
- ds_spec : dict
A specification of a multi-table dataset (see Multi-Table Learning Primer).
- Raises:
- TypeError
If there are objects of the spec with invalid type.
- ValueError
If there are objects of the spec with invalid values.
- khiops.sklearn.dataset.get_khiops_type(numpy_type, categorical_str_max_size=None)¶
Translates a numpy dtype to a Khiops dictionary type
- Parameters:
- numpy_type : numpy.dtype
Numpy type of the column
- categorical_str_max_size : int, optional
Maximum length of the entries of the column whose type is numpy_type.
- Returns:
- str
Khiops type name. Either “Categorical”, “Text”, “Numerical” or “Timestamp”.
Note
The “Text” Khiops type is inferred if the Numpy type is “string” and the maximum length of the entries of that type is greater than 100.
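The rule described in the note can be pictured with the minimal sketch below. It is not the library's implementation: `sketch_khiops_type` is a hypothetical helper that works on dtype names given as plain strings. The mapping of integer and float dtypes to “Numerical” and of `datetime64` dtypes to “Timestamp” is an assumption made for illustration. Only the four type names and the 100-character threshold for “Text” come from the description above.

```python
def sketch_khiops_type(dtype_name, categorical_str_max_size=None):
    """Hypothetical illustration of the dtype-to-Khiops-type rule."""
    name = dtype_name.lower()
    # Assumption: integer and floating-point dtypes are numerical
    if name.startswith(("int", "uint", "float")):
        return "Numerical"
    # Assumption: datetime64 dtypes are timestamps
    if name.startswith("datetime64"):
        return "Timestamp"
    # From the note above: string columns with long entries are text
    if (
        name.startswith("string")
        and categorical_str_max_size is not None
        and categorical_str_max_size > 100
    ):
        return "Text"
    # Everything else falls back to categorical in this sketch
    return "Categorical"


print(sketch_khiops_type("int64"))
print(sketch_khiops_type("string", categorical_str_max_size=250))
print(sketch_khiops_type("string", categorical_str_max_size=20))
```

In this sketch a string column is only promoted to “Text” when a maximum entry length is supplied and exceeds 100; without that information it stays “Categorical”.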
- khiops.sklearn.dataset.get_khiops_variable_name(column_id)¶
Returns the Khiops variable name associated with a column id
- khiops.sklearn.dataset.read_internal_data_table(file_path_or_stream, column_dtypes=None)¶
Reads into a DataFrame a data table file with the internal format settings
The table is read with the following settings:
Use tab as separator
Read the column names from the first line
Use ‘”’ as quote character
Double quoting enabled (a quote inside a quoted field is escaped as ‘””’)
UTF-8 encoding
User-specified dtypes (optional)
- Parameters:
- file_path_or_stream : str or file object
The path of the internal data table file to be read or a readable file object.
- column_dtypes : dict, optional
Dictionary linking column names with dtypes. See the dtype parameter of the pandas.read_csv function. If not set, then the column types are detected automatically by pandas.
- Returns:
pandas.DataFrame: The dataframe representation of the data table.
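To make the format concrete, here is a small sketch that parses a table in this format with Python's standard `csv` module. It only illustrates the separator, header and quoting settings listed above and does not call `read_internal_data_table`. The sample content and the `sample` and `rows` names are made up for the example, and the UTF-8 setting is not exercised because the data is read from an in-memory string.

```python
import csv
import io

# Made-up sample: tab-separated, header on the first line,
# a doubled quote ("") standing for a literal quote inside a quoted field
sample = 'id\tcomment\n1\t"said ""hello"""\n2\tplain\n'

reader = csv.DictReader(
    io.StringIO(sample),
    delimiter="\t",
    quotechar='"',
    doublequote=True,
)
rows = list(reader)
print(rows[0]["comment"])
```

Note that the `csv` module returns every field as a string, whereas the documented function returns a dataframe whose column types are either detected by pandas or taken from `column_dtypes`.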
- khiops.sklearn.dataset.table_name_of_path(table_path)¶
Returns the table name as the last fragment of the table data path
- Parameters:
- table_path: str
Data path of the table, in the format “path/to/table”.
- Returns:
- str
The name of the table.
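As a rough illustration of "last fragment of the data path", the sketch below splits the path on “/” and keeps the final piece. `sketch_table_name_of_path` is a hypothetical stand-in written for this example; the real function may treat edge cases differently.

```python
def sketch_table_name_of_path(table_path):
    """Hypothetical stand-in: keep the last "/"-separated fragment."""
    return table_path.split("/")[-1]


print(sketch_table_name_of_path("path/to/table"))
print(sketch_table_name_of_path("table"))
```

Under this assumption, a path without any “/” is its own table name.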
- khiops.sklearn.dataset.write_internal_data_table(dataframe, file_path_or_stream)¶
Writes a DataFrame to a data table file with the internal format settings
The table is written with the following settings:
Use tab as separator
Write the column names on the first line
Use ‘”’ as quote character
Double quoting enabled (a quote inside a quoted field is escaped as ‘””’)
UTF-8 encoding
The index is not written
Khiops cannot handle multi-line records. Hence, the carriage returns / line feeds need to be removed from the records before data is handed over to Khiops.
- Parameters:
- dataframe : pandas.DataFrame
The dataframe to write.
- file_path_or_stream : str or file object
The path of the internal data table file to be written or a writable file object.
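The single-line-record constraint and the quoting settings can be sketched with the standard `csv` module as follows. This does not use `write_internal_data_table`; the `strip_line_breaks` helper, the sample records and the choice to replace line breaks with a space (rather than dropping them) are assumptions made for the example.

```python
import csv
import io


def strip_line_breaks(value):
    """Replace CR/LF characters in strings with a space; leave other values as-is."""
    if isinstance(value, str):
        return value.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    return value


# Made-up records; the second one contains a line feed and a quote
records = [
    {"id": 1, "comment": "plain"},
    {"id": 2, "comment": 'two\nlines and a "quote"'},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["id", "comment"],
    delimiter="\t",
    quotechar='"',
    doublequote=True,
    lineterminator="\n",
)
writer.writeheader()
for record in records:
    # Clean each value so that every record stays on a single line
    writer.writerow({k: strip_line_breaks(v) for k, v in record.items()})

output = buffer.getvalue()
print(output)
```

Each record ends up on exactly one line, and the embedded quote is written doubled inside a quoted field, matching the settings listed above.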