OneHotEncoder
¶
-
class
ibex.sklearn.preprocessing.
OneHotEncoder
(n_values='auto', categorical_features='all', dtype=<class 'numpy.float64'>, sparse=True, handle_unknown='error')¶ Bases:
sklearn.preprocessing.data.OneHotEncoder
,ibex._base.FrameMixin
Note
The documentation following is of the class wrapped by this class. There are some changes, in particular:
- A parameter
X
denotes apandas.DataFrame
. - A parameter
y
denotes apandas.Series
.
Encode categorical integer features using a one-hot aka one-of-K scheme.
The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature. It is assumed that input features take on values in the range [0, n_values).
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
Read more in the User Guide.
- n_values : ‘auto’, int or array of ints
Number of values per feature.
‘auto’ : determine value range from training data.
- int : number of categorical values per feature.
Each feature value should be in
range(n_values)
- array :
n_values[i]
is the number of categorical values in X[:, i]
. Each feature value should be inrange(n_values[i])
- array :
- categorical_features : “all” or array of indices or mask
Specify what features are treated as categorical.
- ‘all’ (default): All features are treated as categorical.
- array of indices: Array of categorical feature indices.
- mask: Array of length n_features and with dtype=bool.
Non-categorical features are always stacked to the right of the matrix.
- dtype : number type, default=np.float
- Desired dtype of output.
- sparse : boolean, default=True
- Will return sparse matrix if set True else will return an array.
- handle_unknown : str, ‘error’ or ‘ignore’
- Whether to raise an error or ignore if a unknown categorical feature is present during transform.
- active_features_ : array
- Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is
'auto'
. - feature_indices_ : array of shape (n_features,)
- Indices to feature ranges.
Feature
i
in the original data is mapped to features fromfeature_indices_[i]
tofeature_indices_[i+1]
(and then potentially masked by active_features_ afterwards) - n_values_ : array of shape (n_features,)
- Maximum number of values per feature.
Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.
>>> from sklearn.preprocessing import OneHotEncoder >>> enc = OneHotEncoder() >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>, handle_unknown='error', n_values='auto', sparse=True) >>> enc.n_values_ array([2, 3, 4]) >>> enc.feature_indices_ array([0, 2, 5, 9]) >>> enc.transform([[0, 1, 1]]).toarray() array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
- sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
- dictionary items (also handles string-valued features).
- sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
- encoding of dictionary items or strings.
- sklearn.preprocessing.LabelBinarizer : binarizes labels in a one-vs-all
- fashion.
- sklearn.preprocessing.MultiLabelBinarizer : transforms between iterable of
- iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.
- sklearn.preprocessing.LabelEncoder : encodes labels with values between 0
- and n_classes-1.
-
fit
(X, y=None)[source]¶ Note
The documentation following is of the class wrapped by this class. There are some changes, in particular:
- A parameter
X
denotes apandas.DataFrame
. - A parameter
y
denotes apandas.Series
.
Fit OneHotEncoder to X.
- X : array-like, shape [n_samples, n_feature]
- Input array of type int.
self
- A parameter
-
fit_transform
(X, y=None)[source]¶ Note
The documentation following is of the class wrapped by this class. There are some changes, in particular:
- A parameter
X
denotes apandas.DataFrame
. - A parameter
y
denotes apandas.Series
.
Fit OneHotEncoder to X, then transform X.
Equivalent to self.fit(X).transform(X), but more convenient and more efficient. See fit for the parameters, transform for the return value.
- X : array-like, shape [n_samples, n_feature]
- Input array of type int.
- A parameter
-
transform
(X)[source]¶ Note
The documentation following is of the class wrapped by this class. There are some changes, in particular:
- A parameter
X
denotes apandas.DataFrame
. - A parameter
y
denotes apandas.Series
.
Transform X using one-hot encoding.
- X : array-like, shape [n_samples, n_features]
- Input array of type int.
- X_out : sparse matrix if sparse=True else a 2-d array, dtype=int
- Transformed input.
- A parameter
- A parameter