Johannes Haupt remerge.io

Categorical Variables and ColumnTransformer in scikit-learn

Dealing with Categorical Variables in Scikit-learn

import numpy as np
import scipy.stats as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import minmax_scale, scale, MinMaxScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, brier_score_loss

Data with mixed data types

Create some data. To create the categorical variables, I bin half of the features into an arbitrary number of bins.

no_cont = 5
no_cat = 5
no_vars = no_cont + no_cat
N= 50000

# Create single dataset to avoid random effects
# Only works for all informative features
X,y = make_classification(n_samples=N, weights=[0.9,0.1], n_clusters_per_class=5,
                              n_features=no_vars, 
                              n_informative=no_vars, 
                              n_redundant=0, n_repeated=0,
                             random_state=123)
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy="quantile")
X[:,no_cont:] = binner.fit_transform(X[:,no_cont:])

X = pd.DataFrame(X, columns=["X"+str(i) for i in [0,2,4,6,8,1,3,5,7,9]])
X[0:5]
X0 X2 X4 X6 X8 X1 X3 X5 X7 X9
0 1.921406 4.480504 -1.231670 -1.814375 4.187405 4.0 4.0 2.0 1.0 1.0
1 1.544821 0.948336 0.472346 -1.126138 2.157616 2.0 3.0 2.0 2.0 2.0
2 -0.874012 0.131283 -3.637079 0.447905 -1.041823 4.0 4.0 3.0 3.0 4.0
3 -1.737486 -1.664507 -0.084009 1.294248 0.492214 2.0 1.0 3.0 0.0 4.0
4 1.292494 0.992953 1.559877 -1.070859 1.391606 3.0 0.0 3.0 3.0 0.0

The raw categorical variables are often not ordinally encoded but contain an ID or hash for each level.

import string
# Efficiently map values to another value with .map(dict)
X.iloc[:,no_cont:] = X.iloc[:,no_cont:].apply(
    lambda x: x.map({i:letter for i,letter in enumerate(string.ascii_uppercase)})
)

The raw data typically also mixes the order of categorical and numeric variables.

X.sort_index(axis=1, inplace=True)

So this is what the raw data looks like when we receive it from the client!

X[0:5]
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 1.921406 E 4.480504 E -1.231670 C -1.814375 B 4.187405 B
1 1.544821 C 0.948336 D 0.472346 C -1.126138 C 2.157616 C
2 -0.874012 E 0.131283 E -3.637079 D 0.447905 D -1.041823 E
3 -1.737486 C -1.664507 B -0.084009 D 1.294248 A 0.492214 E
4 1.292494 D 0.992953 A 1.559877 D -1.070859 D 1.391606 A

Categorical variables (and pandas)

In pandas, the new way to handle categorical variables is to define their type as ‘category’, similar to R’s factor type.

cat_columns = [1,3,5,7,9]
X.iloc[:,cat_columns] = X.iloc[:,cat_columns].astype("category")
X.dtypes
X0     float64
X1    category
X2     float64
X3    category
X4     float64
X5    category
X6     float64
X7    category
X8     float64
X9    category
dtype: object
X[0:5]
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 1.921406 E 4.480504 E -1.231670 C -1.814375 B 4.187405 B
1 1.544821 C 0.948336 D 0.472346 C -1.126138 C 2.157616 C
2 -0.874012 E 0.131283 E -3.637079 D 0.447905 D -1.041823 E
3 -1.737486 C -1.664507 B -0.084009 D 1.294248 A 0.492214 E
4 1.292494 D 0.992953 A 1.559877 D -1.070859 D 1.391606 A

Like in R, the levels are now stored as integer codes, and the mapping from each code back to the original level is kept.

X.X9.cat.categories
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
X.X9.cat.codes[0:5]
0    1
1    2
2    4
3    4
4    0
dtype: int8
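
To see the whole mapping in one place, we can pair each integer code with its level. This is just a small convenience snippet on top of the attributes shown above:

# Reconstruct the code -> level mapping that pandas stores for this column
dict(enumerate(X.X9.cat.categories))
{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}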

Sadly, there’s no describe() output for category variables comparable to R’s summary() for factor variables, which would give, e.g., the counts per level.

X.describe()
X0 X2 X4 X6 X8
count 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 0.070105 0.077555 0.492110 -0.555985 0.157141
std 2.172805 2.211541 2.065442 1.863217 2.068314
min -9.274609 -10.270653 -8.321548 -8.591301 -7.706999
25% -1.370276 -1.419127 -0.894180 -1.789640 -1.282510
50% 0.138365 0.099075 0.514206 -0.675296 0.123451
75% 1.577229 1.594109 1.886729 0.548743 1.551513
max 8.803489 9.373858 9.656678 9.396778 9.688887
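
If we do want the counts per level, we can still get them column by column with value_counts(), and describe(include='category') should give at least the count, the number of unique levels and the most frequent level for the category columns:

# Counts of observations per level of one categorical column
X.X9.value_counts()

# Overview (count, unique, top, freq) of all category columns
X.describe(include='category')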

I’ll also split the data as we often do before building models.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    test_size=0.5, random_state=123)

The column transformer in scikit-learn

Why do we need a column transformer? It applies different transformations to subsets of the data columns. Why is that useful? We like to build a pipeline that does preprocessing (and training) and predicting in one go. When new data comes in, we don’t need to look up the values needed to standardize it; we can just apply the full pipeline.

But what if we have categorical variables in the raw data? We could first one-hot encode them, but then we don’t want to apply the standardizer to the one-hot encoded values! We want to tell the preprocessor to standardize the numeric variables and one-hot encode the categorical variables. That’s what the ColumnTransformer does.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

The ColumnTransformer looks like a sklearn Pipeline with an additional argument to select the columns for each transformation.

Take care when collecting the column indices automatically: they need to be plain Python int, not numpy integer types, or the checks within the ColumnTransformer will fail.

num_columns = [idx for idx in range(X.shape[1]) if idx not in cat_columns]
print(cat_columns, num_columns)
np.all([isinstance(x,int) for x in cat_columns])
[1, 3, 5, 7, 9] [0, 2, 4, 6, 8]
True
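
If the indices are collected programmatically, e.g. via select_dtypes, the positions come back as numpy integers and need an explicit cast. A small sketch (the *_auto names are only for illustration):

# Find the positions of the category columns; Index.get_indexer returns numpy integers
cat_idx = X.columns.get_indexer(X.select_dtypes(include="category").columns)
num_idx = X.columns.get_indexer(X.select_dtypes(exclude="category").columns)
# Cast to plain Python int so the ColumnTransformer checks accept them
cat_columns_auto = [int(i) for i in cat_idx]
num_columns_auto = [int(i) for i in num_idx]
print(cat_columns_auto, num_columns_auto)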

I would like to impute potential missing values in the numeric variables and scale them, so I’ll build a pipe to do these transformations. I’ll then integrate the pipe into the ColumnTransformer to see how that works.

num_preproc = Pipeline([
    ('num_imputer', SimpleImputer(strategy='most_frequent')),
    ("scaler", StandardScaler())
])
ct = ColumnTransformer([
    # (name, transformer, columns)
    # Transformer for categorical variables
    ("onehot", 
         OneHotEncoder(categories='auto', handle_unknown='ignore', ),  
         cat_columns),
    # Transformer for numeric variables
    ("num_preproc",num_preproc, num_columns)
    ], 
    # what to do with variables not specified in the indices?
    remainder="drop")
X_train_transformed = ct.fit_transform(X_train, y_train)

Let’s see how that looks!

pd.DataFrame(X_train_transformed)[0:5]
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 -0.349437 -0.268779 0.002266 -1.042649 -0.827631
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 2.128186 -0.382543 -0.827437 2.914705 -0.048655
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 -0.488686 -1.275460 0.335586 0.154043 -0.325587
3 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 -0.069370 1.443652 1.264377 -0.265769 1.344830
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 1.0 0.0 0.0 0.736065 0.806747 0.381054 -0.793548 1.775778

5 rows × 30 columns

For the numeric variables, the column means should be (close to) 0.

X_train_transformed[:,25:].mean(axis=0)
array([-1.09756648e-17, -1.09601217e-17, -2.52770027e-17, -2.64499533e-17,
        1.84696702e-17])

See how the column transformer takes the data apart and pipes it through each transformer separately? Be careful: the output of the combined transformer is a concatenation of the outputs of the individual transformers. The order of the variables may change, and the column indices that we defined above are no longer correct. If I passed those indices to the classifier or a data balancer in the next step, I’d have a bad time.
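
If you need to keep track of the new column order, newer scikit-learn versions (roughly 1.0 and later) can report the output feature names of the fitted transformers. A sketch under that assumption:

# The fitted OneHotEncoder reports the names of the dummy columns it created;
# the scaled numeric columns follow in the order given by num_columns
onehot_names = ct.named_transformers_["onehot"].get_feature_names_out()
transformed_names = list(onehot_names) + ["X" + str(i) for i in num_columns]
print(transformed_names[:8])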

Next, I’ll integrate this in a modeling pipeline.

from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("preprocessing", ct),
    ("classifier", LogisticRegression(C=1, solver='lbfgs'))
])
pipe.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('preprocessing', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('onehot', OneHotEncoder(categorical_features=None, categories='auto',
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_val...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])
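
A nice side effect of keeping the preprocessing inside the pipeline is that resampling procedures refit it on each training fold, so nothing leaks from the held-out data. A quick sketch with cross-validation:

from sklearn.model_selection import cross_val_score

# Scaler, encoder and classifier are all refit within every fold
cross_val_score(pipe, X_train, y_train, cv=3, scoring="roc_auc")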

We can dive into the pipeline to extract the model coefficients. We can see that the 10 raw variables were expanded to 30 variables after the 5 categorical variables were one-hot encoded into 5 binary variables each.

pipe.named_steps.classifier.coef_
array([[ 0.95781925,  0.42040793, -0.03151696, -0.48188476, -0.86403698,
         0.66353437, -0.03752612, -0.11350985, -0.32103785, -0.19067207,
         0.151134  ,  0.13419442,  0.07671957, -0.13323205, -0.22802746,
         0.35972866,  0.00602037, -0.22095139, -0.11843307, -0.02557609,
         0.0825338 ,  0.17822471,  0.19712107, -0.0180598 , -0.4390313 ,
        -0.52420744, -0.53297249, -0.53005165,  0.27383866,  0.10902432]])

The cool part is this: when we get new data, we don’t need to worry about the cleaning steps. As long as they are included in the pipeline, they are applied at prediction time.

from sklearn.metrics import roc_auc_score
X_test[0:5]
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
8248 -1.160837 D 2.131385 C 0.686961 D -0.880318 E 3.303355 E
2404 2.530105 B 1.366410 D -2.676441 A -0.179998 C -1.467805 E
19796 -2.832175 A 0.291876 E 0.232748 A -0.700043 C -1.170993 C
4970 -2.146905 C -1.138431 B 1.503194 C -3.747254 D 2.213337 D
38743 2.613218 B -3.237024 E 2.826875 D -1.015023 E 2.429915 A
pred = pipe.predict_proba(X_test)[:,1]

roc_auc_score(y_test, pred)
0.8065415723608013