Technical Structures
This section summarizes the derivation rules that are used internally by Khiops to:
- store data preparation and predictor models in dictionary files, and
- to compute any model output values such as variable recoding, prediction, density estimation.
The rules in this section are not intended to be used by the user to construct new variables, but rather to have a quick understanding of them.
Overview
Khiops exploits several technical types to store the results of data preparation and modeling into dictionaries:
- Vectors:
Structure(VectorC)
:Categorical
values vector.Structure(Vector)
:Numerical
values vector.
- Hash Maps:
Structure(HashMapC)
:Categorical
values hash map.Structure(HashMap)
:Numerical
values hash map.
- Data Preparation:
Structure(DataGrid)
: A data grid is defined by its input partitions (discretizations or value groupings) and by the frequencies of the cells of the cross-product of the input partitions.- Partitions:
Structure(IntervalBounds)
:Numerical
partition, defined by a sorted list of interval bounds.Structure(ValueGroups)
:Categorical
partition, defined by a set of exclusive groups of values (Structure(ValueGroup)
).Structure(ValueGroup)
: Set of exclusive categorical values.Structure(ValueSet)
:Numerical
partition, defined by a set of exclusive numerical values.Structure(ValueSetC)
:Categorical
partition, defined by a set of exclusive categorical values.
Structure(Frequencies)
: Frequency vector.
- Data Preparation Statistics:
Structure(DataGridStats)
: TheDataGridStats
structure is defined by a data grid and a set a variables related to the input partitions of the data grid; the statistics are computed with respect to the data grid cell related to the input values.
- Predictors:
Structure(Classifier)
: Classifier specification.Structure(Regressor)
: Regressor specification.Structure(RankRegressor)
: Rank regressor specification.
- Coclustering Deployment:
Structure(DataGridDeployment)
: TheDataGridDeployment
structure is defined by a data grid, the index of the deployed variable and a set a vectors of values (Vector
andVectorC
) for each input partitions of the data grid except the deployed variable, plus an optional vector of frequencies; this structure provides predictions related to the deployed variable, given the input distribution of the other variables.