.. _verification_and_processing:

Verification and Processing
========================================

Since ``sklearn`` is defined in terms of :class:`numpy.ndarray` (and not :class:`pandas.DataFrame`), Ibex estimators perform verification and processing on their inputs and outputs.

In this chapter we'll use a ``DataFrame`` ``X``, with columns ``'a'`` and ``'b'``, and (implied) index ``0, 1``,

    >>> import pandas as pd
    >>> X = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

a scaling transformer ``trn`` which is ``fit``-ted on ``X``,

    >>> from ibex.sklearn import preprocessing as pd_preprocessing
    >>> trn = pd_preprocessing.StandardScaler().fit(X)

and a linear-regression predictor ``prd`` which is also ``fit``-ted on ``X``:

    >>> from ibex.sklearn import linear_model as pd_linear_model
    >>> prd = pd_linear_model.LinearRegression().fit(X, pd.Series([3, 4]))


Input Verification
------------------

Following the call to ``fit``, we can apply further methods of ``trn`` to any ``DataFrame`` with the same column-set. For example, this is OK

    >>> X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
    >>> trn.transform(X_1)
        a    b
    0 -1... -1...
    1  1...  1...
    2  3...  3...

but this is not

    >>> X_2 = X_1.rename(columns={'b': 'c'})
    >>> trn.transform(X_2)
    Traceback (most recent call last):
    ...
    KeyError: "...'b'...not in index"

|

Once an estimator has been ``fit``-ted, the order of columns of further inputs no longer matters:

    >>> trn.transform(X_1[['a', 'b']])
        a    b
    0 -1... -1...
    1  1...  1...
    2  3...  3...

    >>> trn.transform(X_1[['b', 'a']])
        a    b
    0 -1... -1...
    1  1...  1...
    2  3...  3...

The estimator will reorder the ``DataFrame`` to the same order of columns seen by ``fit``.


.. _verification_and_processing_output:

Output Processing
-----------------

Indexes
~~~~~~~

The index of a returned ``DataFrame`` or ``Series`` object is that of the input:

    >>> X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}, index=[10, 20, 30])
    >>> trn.transform(X_1)
         a    b
    10 -1... -1...
    20  1...  1...
    30  3...  3...
    >>>
    >>> prd.predict(X_1)
    10    3...
    20    4...
    30    5...
    dtype: ...


.. _verification_and_processing_output_dataframe_columns:

``DataFrame`` Columns
~~~~~~~~~~~~~~~~~~~~~

In general, the columns of an outputted ``DataFrame`` object are those on which the estimator was ``fit``-ted:

    >>> trn.transform(X_1[['a', 'b']])
         a    b
    10 -1... -1...
    20  1...  1...
    30  3...  3...

    >>> trn.transform(X_1[['b', 'a']])
         a    b
    10 -1... -1...
    20  1...  1...
    30  3...  3...

Some outputted ``DataFrame`` objects have a number of columns that is different from that of the input. If this is the case, the output columns cannot take the input's names. In the example below, the single PCA component is named ``comp_0``:

    >>> from ibex.sklearn import decomposition as pd_decomposition
    >>> pd_decomposition.PCA(n_components=1).fit(X).transform(X)
         comp_0
    0 -0.707107
    1  0.707107

.. note::

    In some cases, we might want greater control over the naming of output columns. For example, when transforming with a 2-component PCA, we might want to name the ``DataFrame`` columns ``'pc1'`` and ``'pc2'``. :ref:`function_transformer_specifying_output_columns` in :ref:`function_transformer` shows how to do this.
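
The verification and processing steps described in this chapter can be pictured with plain ``pandas``. The following is only a minimal sketch under stated assumptions, not Ibex's actual implementation: ``verify_and_reorder`` and ``wrap_output`` are hypothetical helpers introduced for illustration, and the multiplication by 2 merely stands in for an ``ndarray``-based ``sklearn`` step.

```python
import pandas as pd


def verify_and_reorder(X, fitted_columns):
    """Hypothetical helper: check the column set, then restore the fit-time order."""
    missing = [c for c in fitted_columns if c not in X.columns]
    if missing:
        raise KeyError(f"{missing} not in index")
    return X[list(fitted_columns)]


def wrap_output(values, X, columns):
    """Hypothetical helper: re-attach the input's index to a raw ndarray result."""
    return pd.DataFrame(values, index=X.index, columns=columns)


# Column order assumed to have been seen at fit time.
fitted_columns = ['a', 'b']

# Same column set as at fit time, but in a different order and with a custom index.
X_1 = pd.DataFrame({'b': [3, 4, 5], 'a': [1, 2, 3]}, index=[10, 20, 30])

X_ordered = verify_and_reorder(X_1, fitted_columns)

# Stand-in for an sklearn step, which only sees (and returns) a bare ndarray.
raw = X_ordered.to_numpy() * 2.0

# The result regains the input's index and the fit-time column order.
result = wrap_output(raw, X_1, fitted_columns)
```

Passing ``X_1.rename(columns={'b': 'c'})`` to ``verify_and_reorder`` raises a ``KeyError``, which mirrors the rejected input in the verification example above.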