Johannes Haupt remerge.io

Extracting Column Names from the ColumnTransformer

scikit-learn’s ColumnTransformer is a great tool for data preprocessing but returns a numpy array without column names. Its method get_feature_names() fails if at least one transformer does not create new columns. Here’s a quick solution to return column names that works for all transformers and pipelines

Extracting Column Names from the ColumnTransformer

The following quick and dirty helper function is built around the get_feature_names() method of the ColumnTransformer, which can be found here: https://github.com/scikit-learn/scikit-learn/blob/fd237278e895b42abe8d8d09105cbb82dc2cbba7/sklearn/compose/_column_transformer.py#L345

The function walks through the steps of the ColumnTransformer and returns the input column names when the transformer does not provide a get_feature_names() method. For pipelines, it walks through the pipeline steps and will return either the output columns of the pipeline or the input columns, if the pipeline creates no new columns.

import warnings
import sklearn
import pandas as pd
def get_feature_names(column_transformer):
    """Get feature names from all transformers.
    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    # Remove the internal helper function
    #check_is_fitted(column_transformer)
    
    # Turn loopkup into function for better handling with pipeline later
    def get_names(trans):
        # >> Original get_feature_names() method
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            return []
        if trans == 'passthrough':
            if hasattr(column_transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    return column
                else:
                    return column_transformer._df_columns[column]
            else:
                indices = np.arange(column_transformer._n_features)
                return ['x%d' % i for i in indices[column]]
        if not hasattr(trans, 'get_feature_names'):
        # >>> Change: Return input column names if no method avaiable
            # Turn error into a warning
            warnings.warn("Transformer %s (type %s) does not "
                                 "provide get_feature_names. "
                                 "Will return input column names if available"
                                 % (str(name), type(trans).__name__))
            # For transformers without a get_features_names method, use the input
            # names to the column transformer
            if column is None:
                return []
            else:
                return [name + "__" + f for f in column]

        return [name + "__" + f for f in trans.get_feature_names()]
    
    ### Start of processing
    feature_names = []
    
    # Allow transformers to be pipelines. Pipeline steps are named differently, so preprocessing is needed
    if type(column_transformer) == sklearn.pipeline.Pipeline:
        l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
    else:
        # For column transformers, follow the original method
        l_transformers = list(column_transformer._iter(fitted=True))
    
    
    for name, trans, column, _ in l_transformers: 
        if type(trans) == sklearn.pipeline.Pipeline:
            # Recursive call on pipeline
            _names = get_feature_names(trans)
            # if pipeline has no transformer that returns names
            if len(_names)==0:
                _names = [name + "__" + f for f in column]
            feature_names.extend(_names)
        else:
            feature_names.extend(get_names(trans))
    
    return feature_names

Example

This example is based on the sklearn tutorial on the ColumnTransformer with different data types: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py

We would like to structure preprocessing efficiently in a ColumnTransformer for different data types and using pipelines of transformers for each type. To interprete the model results, we would like to preserve the variable names in the transformed data.

from sklearn.datasets import fetch_openml

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

Create a complex data preprocessing pipeline using ColumnTransformer and pipelines of transformers and dropping some variables.

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
model score: 0.775
get_feature_names(preprocessor)
<ipython-input-170-7a1be6e049c5>:27: UserWarning: Transformer imputer (type SimpleImputer) does not provide get_feature_names. Will return input column names if available
  warnings.warn("Transformer %s (type %s) does not "
<ipython-input-170-7a1be6e049c5>:27: UserWarning: Transformer scaler (type StandardScaler) does not provide get_feature_names. Will return input column names if available
  warnings.warn("Transformer %s (type %s) does not "





['num__age',
 'num__fare',
 'onehot__x0_C',
 'onehot__x0_Q',
 'onehot__x0_S',
 'onehot__x0_missing',
 'onehot__x1_female',
 'onehot__x1_male',
 'onehot__x2_1.0',
 'onehot__x2_2.0',
 'onehot__x2_3.0']

The unhelpful variable names in the one-hot encoded variables are an issue with the names returned by the OneHotEncoder.

We need the variable names to understand the model structure.

clf.named_steps['classifier'].coef_
array([[-0.48401448,  0.0064347 ,  0.23762479, -0.15954077, -0.34818517,
         0.27042239,  1.25211668, -1.25179543,  1.01259174,  0.05134565,
        -1.06361614]])

This is where the feature extractor function comes in handy:

pd.DataFrame(clf.named_steps['classifier'].coef_.flatten(), index=get_feature_names(preprocessor))
<ipython-input-170-7a1be6e049c5>:27: UserWarning: Transformer imputer (type SimpleImputer) does not provide get_feature_names. Will return input column names if available
  warnings.warn("Transformer %s (type %s) does not "
<ipython-input-170-7a1be6e049c5>:27: UserWarning: Transformer scaler (type StandardScaler) does not provide get_feature_names. Will return input column names if available
  warnings.warn("Transformer %s (type %s) does not "
0
num__age -0.484014
num__fare 0.006435
onehot__x0_C 0.237625
onehot__x0_Q -0.159541
onehot__x0_S -0.348185
onehot__x0_missing 0.270422
onehot__x1_female 1.252117
onehot__x1_male -1.251795
onehot__x2_1.0 1.012592
onehot__x2_2.0 0.051346
onehot__x2_3.0 -1.063616