Transforming

This chapter describes the ibex.trans() function, which allows

  1. applying functions or estimators to pandas.DataFrame objects
  2. selecting a subset of columns for applications
  3. naming the output columns of the results

or any combination of these.

We’ll use a DataFrame X, with columns 'a' and 'b', and (implied) index 1, 2, 3,

>>> import pandas as pd
>>> X = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

and also import trans:

>>> from ibex import trans

Specifying Functions

The (positionally first) func argument allows specifying the transformation to apply.

This can be None, meaning that the output should be the input:

>>> trans().fit_transform(X)
   a  b
0  1  3
1  2  4

The func argument can alternatively be a function, which will be applied to the pandas.DataFrame.values of the input:

>>> import numpy as np
>>> trans(np.sqrt).fit_transform(X)
          a         b
0  1.000000  1.732051
1  1.414214  2.000000

Finally, it can be a different estimator:

>>> from ibex.sklearn.decomposition import PCA
>>> trans(PCA(n_components=2)).fit_transform(X)
          a  b
0 -0.707107  ...
1  0.707107  ...

Specifying Input Columns

The (positionally second) in_cols argument allows specifying the columns to which to apply the function.

If it is None, then the function will be applied to all columns.

If it is a string, the function will be applied to the DataFrame consisting of the single column corresponding to this string:

>>> trans(None, 'a').fit_transform(X)
   a
0  1
1  2
>>> trans(np.sqrt, 'a').fit_transform(X)
          a
0  1.000000
1  1.414214
>>> trans(PCA(n_components=1), 'a').fit_transform(X)
     a
0 -0.5
1  0.5

If it is a list of strings, the function will be applied to the DataFrame consisting of the columns corresponding to these strings:

>>> trans(None, ['a']).fit_transform(X)
   a
0  1
1  2
>>> trans(np.sqrt, ['a']).fit_transform(X)
          a
0  1.000000
1  1.414214
>>> trans(PCA(n_components=1), ['a']).fit_transform(X)
     a
0 -0.5
1  0.5

Specifying Output Columns

The (positionally third) out_cols argument allows specifying the names of the columns of the result.

If it is None, then the output columns will be as explained in _verification_and_processing_output_dataframe_columns in _verification_and_processing:

>>> trans(np.sqrt, out_cols=None).fit_transform(X)
          a         b
0  1.000000  1.732051
1  1.414214  2.000000

If it is a string, it will become the (single) column of the resulting DataFrame.

>>> trans(PCA(n_components=1), out_cols='pc').fit_transform(X)
        pc
0 -0.707107
1  0.707107

If it is a list of strings, these will become the columns of the resulting DataFrame.

>>> trans(out_cols=['c', 'd']).fit_transform(X)
   c  d
0  1  3
1  2  4
>>> trans(np.sqrt, out_cols=['c', 'd']).fit_transform(X)
          c         d
0  1.000000  1.732051
1  1.414214  2.000000
>>> trans(PCA(n_components=2), out_cols=['pc1', 'pc2']).fit_transform(X)
          pc1  pc2
0 -0.707107  ...
1  0.707107  ...

Tip

As can be seen from the first of the examples just above, this can be used to build a step that simply changes the column names of a DataFrame.

Specifying Combinations

Of course, you can combine the arguments specified above:

>>> trans(None, 'a', 'c').fit_transform(X)
   c
0  1
1  2
>>> trans(None, ['a'], ['c']).fit_transform(X)
   c
0  1
1  2
>>> trans(np.sqrt, ['a', 'b'], ['c', 'd']).fit_transform(X)
          c         d
0  1.000000  1.732051
1  1.414214  2.000000
>>> trans(PCA(n_components=1), 'a', 'pc').fit_transform(X)
     pc
0 -0.5
1  0.5

Multiple Transformations

Applying multiple transformations on a single DataFrame is no different than any other case of uniting features (see Uniting Features). In particular, it’s possible to succinctly use the + operator:

>>> trn = trans(np.sin, 'a', 'sin_a') + trans(np.cos, 'b', 'cos_b')
>>> trn.fit_transform(X)
  functiontransformer_0 functiontransformer_1
                  sin_a                 cos_b
0              0.841471             -0.989992
1              0.909297             -0.653644
>>> trn = trans() + trans(np.sin, 'a', 'sin_a') + trans(np.cos, 'b', 'cos_b')
>>> trn.fit_transform(X)
  functiontransformer_0    functiontransformer_1 functiontransformer_2
                      a  b                 sin_a                 cos_b
0                     1  3              0.841471             -0.989992
1                     2  4              0.909297             -0.653644

Tip

As can be seen from the last of the examples just above, this can be used to build a step that simply adds to the existing columns of some DataFrame.