
Writing new estimators is easy. One way of doing this is by writing an estimator conforming to the scikit-learn protocol, and then wrapping it with ibex.frame() (see Adapting Estimators). A different way is to write it directly as a pandas estimator. This might be the only option if the logic of the estimator is pandas-specific. This chapter shows how to write a new estimator from scratch.

Example Transformation

Suppose we have a pandas.DataFrame like this:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 3, 2, 1, 2], 'b': range(5), 'c': range(2, 7)})
>>> df
   a  b  c
0  1  0  2
1  3  1  3
2  2  2  4
3  1  3  5
4  2  4  6

We think that, for each row, the mean values of 'b' and 'c', aggregated by 'a', might make a useful feature. In pandas, we could write this as follows:

>>> df.groupby(df.a).transform(np.mean)
     b    c
0  1.5  3.5
1  1.0  3.0
2  3.0  5.0
3  1.5  3.5
4  3.0  5.0
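
The same per-row group means can also be requested with the string alias 'mean' and an explicit column selection; recent pandas releases may warn when a NumPy function is passed instead. The following is a minimal sketch of an equivalent spelling, not the form used in the rest of this chapter, and the name group_means is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 1, 2], 'b': range(5), 'c': range(2, 7)})

# For every row, replace 'b' and 'c' by the mean over all rows sharing its 'a'.
# The result keeps df's index and row order.
group_means = df.groupby('a')[['b', 'c']].transform('mean')
print(group_means)
```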

We now want to write a transformer to do this, so that it can be used in more general settings (e.g., cross-validation).

Writing A New Transformer Step

We can write a (slightly more general) estimator, as follows:

>>> from sklearn import base
>>> import ibex
>>> class GroupbyAggregator(
...            base.BaseEstimator, # (1)
...            base.TransformerMixin, # (2)
...            ibex.FrameMixin): # (3)
...     def __init__(self, group_col, agg_func=np.mean):
...         self._group_col, self._agg_func = group_col, agg_func
...     def fit(self, X, _=None):
...         self.x_columns = X.columns # (4)
...         self._agg = X.groupby(X[self._group_col]).apply(self._agg_func)
...         return self
...     def transform(self, X):
...         Xt = X[self.x_columns] # (5)
...         Xt = pd.merge(
...             Xt[[self._group_col]],
...             self._agg,
...             how='left')
...         return Xt[[c for c in Xt.columns if c != self._group_col]]

Note the following general points:

  1. We subclass sklearn.base.BaseEstimator, as this is an estimator.
  2. We subclass sklearn.base.TransformerMixin, as, in this case, this is specifically a transformer.
  3. We subclass ibex.FrameMixin, as this estimator deals with pandas entities.
  4. In fit, we make sure to set ibex.FrameMixin.x_columns; this will ensure that the transformer will “remember” the columns it should see in further calls.
  5. In transform, we first index X by x_columns. This verifies the columns of X and, if needed, reorders them to match the order seen in fit.
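
Points (4) and (5) rely on ordinary pandas column selection. As a minimal sketch in plain pandas (it does not use ibex, and fitted_columns is a hypothetical stand-in for the stored x_columns), selecting by a stored column index reorders a frame and fails when a column is missing:

```python
import pandas as pd

# Stand-in for the columns recorded in fit (hypothetical name).
fitted_columns = pd.Index(['a', 'b', 'c'])

# Same columns, different order: selection restores the fitted order.
shuffled = pd.DataFrame({'c': [2, 3], 'a': [1, 3], 'b': [0, 1]})
reordered = shuffled[fitted_columns]
print(list(reordered.columns))

# A frame lacking a fitted column cannot be selected this way.
incomplete = shuffled.drop(columns='b')
try:
    incomplete[fitted_columns]
    missing_detected = False
except KeyError:
    missing_detected = True
print(missing_detected)
```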

The rest is logic specific to this transformer.

  • In __init__, the group column and aggregation function are stored.
  • In fit, X is aggregated by the group column according to the aggregation function, and the result is recorded.
  • In transform, X (which is not necessarily the one used in fit) is left-merged with the aggregation result, and then the relevant columns of the result are returned.
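
The fit and transform logic described above can be sketched with plain pandas, outside any estimator class. This is an illustration under stated assumptions rather than the class's exact code: the frames train and test and the names agg and features are made up for the example, and the group key is kept as an ordinary column (as_index=False) so that the merge key is explicit. The test frame deliberately contains a group value (4) that is absent from train.

```python
import pandas as pd

train = pd.DataFrame({'a': [1, 3, 2, 1, 2], 'b': range(5), 'c': range(2, 7)})
test = pd.DataFrame(
    {'a': [2, 1, 4], 'b': [10, 20, 30], 'c': [0, 0, 0]},
    index=[10, 11, 12])

# "fit": one row of per-group means, with the group key as a regular column.
agg = train.groupby('a', as_index=False)[['b', 'c']].mean()

# "transform": look up each row's group in the table of means.
# A left merge keeps every row of the left frame, in its original order.
merged = pd.merge(test[['a']], agg, on='a', how='left')

# Drop the key so that only the aggregated features remain.
features = merged.drop(columns='a')
print(features)
```

A group that was not seen when the means were computed yields missing values, and the merge returns a fresh 0..n-1 index rather than the index of test; both may matter when the result is combined with other features.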

We can now use this as a regular step. If we fit it on df and then transform the same df, we get the result above:

>>> GroupbyAggregator('a').fit(df).transform(df)
     b    c
0  1.5  3.5
1  1.0  3.0
2  3.0  5.0
3  1.5  3.5
4  3.0  5.0

We can, however, now use it for fitting on one DataFrame, and transforming another:

>>> try:
...     from sklearn.model_selection import train_test_split
... except ImportError:  # Older sklearn versions
...     from ibex.sklearn.cross_validation import train_test_split
>>> tr, te = train_test_split(df, random_state=3)
>>> GroupbyAggregator('a').fit(tr).transform(te)
     b    c
0  0...  2...
1  2...  4...