Verification and Processing

Since sklearn is defined in terms of numpy.ndarray (and not pandas.DataFrame), Ibex estimators perform verification and processing on their inputs and outputs.

In this chapter we’ll use a DataFrame X, with columns 'a' and 'b', and (implied) index 1, 2, 3.

>>> import pandas as pd
>>> X = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

a scaling transformer trn which is fit-ted on X

>>> from ibex.sklearn import preprocessing as pd_preprocessing
>>> trn = pd_preprocessing.StandardScaler().fit(X)

and a linear-regression predictor prd which is also fit-ted on X

>>> from ibex.sklearn import linear_model as pd_linear_model
>>> prd = pd_linear_model.LinearRegression().fit(X, pd.Series([3, 4]))

Input Verification

Following the call to fit, we can apply further methods of trn to any DataFrame with the same column-set. For example, this is OK

>>> X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
>>> trn.transform(X_1)
      a    b
0 -1... -1...
1  1...  1...
2  3...  3...

but this is not

>>> X_2 = X_1.rename(columns={'b': 'c'})
>>> trn.transform(X_2)
Traceback (most recent call last):
...
KeyError: "...'b'...not in index"

Once an estimator has been fit-ed, the order of columns of further inputs no longer matters:

>>> trn.transform(X_1[['a', 'b']])
      a    b
0 -1... -1...
1  1...  1...
2  3...  3...
>>> trn.transform(X_1[['b', 'a']])
      a    b
0 -1... -1...
1  1...  1...
2  3...  3...

The step will reorder the DataFrame to the same order of columns seen by fit.

Output Processing

Indexes

The index of a returned DataFrame or Series objects, is that of the input:

>>> X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}, index=[10, 20, 30])
>>> trn.transform(X_1)
      a    b
10 -1... -1...
20  1...  1...
30  3...  3...
>>>
>>> prd.predict(X_1)
10    3...
20    4...
30    5...
dtype: ...

DataFrame Columns

In general, the columns of an outputted DataFrame object are those on which the estimator was fit-ted:

>>> trn.transform(X_1[['a', 'b']])
      a    b
10 -1... -1...
20  1...  1...
30  3...  3...
>>> trn.transform(X_1[['b', 'a']])
      a    b
10 -1... -1...
20  1...  1...
30  3...  3...

Some outputted DataFrame objects have a number of columns that is different from that of the input. If this is the case, the resulting DataFrame’s columns will all be blank strings (''):

# Tmp Ami

>>> from ibex.sklearn import decomposition as pd_decomposition
>>> pd_decomposition.PCA(n_components=1).fit(X).transform(X)
  comp_0
0 -0.707107
1  0.707107

Note

In some cases, we might want greater control over the naming of output columns. For example, when transforming a 2-component PCA, we might want to name the DataFrame columns 'pc1' and 'pc2'. Specifying Output Columns in Transforming shows how to do this.