Building a custom Python scikit-learn transformer for machine learning.

I’ve started working with scikit-learn’s pipelines. Zac Stewart’s blog post was a tremendous start, but it wasn’t long before I needed to craft my own custom transformers. Based on his example, and with some help from the Stack Overflow question I asked (link below), I built the following Python notebook to summarize what I learned.

See the source on GitHub.

Question: How should I transform multiple key/value columns in a scikit-learn pipeline?

See http://stackoverflow.com/questions/31749812/how-should-i-transform-multiple-key-value-columns-in-a-scikit-learn-pipeline/

Input data:

In [11]:
import pandas as pd

D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])
print(D)
  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3

This is the type of output data that is required:

In [12]:
from sklearn.feature_extraction import DictVectorizer

row1 = {'a':1, 'b':2}
row2 = {'b':2, 'c':3}
data = [row1, row2]
print(data)

DictVectorizer( sparse=False ).fit_transform(data)
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
Out[12]:
array([[ 1.,  2.,  0.],
       [ 0.,  2.,  3.]])
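To see which key ends up in which column of that array, the fitted vectorizer can be inspected. This is a small sketch using the same two rows as above; `feature_names_` and `vocabulary_` are attributes DictVectorizer sets during fitting, and the variable names `dv` and `matrix` are just my own choices here.

```python
from sklearn.feature_extraction import DictVectorizer

data = [{'a': 1, 'b': 2}, {'b': 2, 'c': 3}]

dv = DictVectorizer(sparse=False)
matrix = dv.fit_transform(data)

# Column order of the output array (sorted by key by default).
print(dv.feature_names_)   # ['a', 'b', 'c']

# Mapping from key to column index.
print(dv.vocabulary_)      # {'a': 0, 'b': 1, 'c': 2}
```

A key that is missing from a row simply gets a 0 in its column, which is why row 0 has a 0 for c and row 1 has a 0 for a.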

Solution

Courtesy of Mike (http://stackoverflow.com/a/31752733/1185562), extended into a general pipeline transformer.

Here is the transformer:

In [13]:
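from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class KVExtractor(TransformerMixin, BaseEstimator):
    """Turn pairs of key/value columns into one dict per row."""

    def __init__(self, kvpairs):
        # Store the argument under its own name so get_params() and clone() work.
        self.kvpairs = kvpairs

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        result = []
        for _, rowdata in X.iterrows():
            rowdict = {}
            for key_col, val_col in self.kvpairs:
                # The key column supplies the feature name, the value column its value.
                rowdict[rowdata[key_col]] = rowdata[val_col]
            result.append(rowdict)
        return result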

Let's try it out:

In [14]:
kvpairs = [ ['k1', 'v1'], ['k2', 'v2'] ]
KVExtractor( kvpairs ).transform(D)
Out[14]:
[{'a': 1, 'b': 2}, {'b': 2, 'c': 3}]
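One thing to keep in mind: each row is collapsed into a single dict, so if two key columns hold the same key in one row, the later pair overwrites the earlier one. The sketch below mirrors the per-row logic with plain dicts. The helper names `row_to_dict` and `row_to_dict_summed` are hypothetical, and summing is only one possible way to resolve the clash, not something the Stack Overflow answer prescribes.

```python
def row_to_dict(row, kvpairs):
    """Mirror the transformer's per-row logic: later pairs overwrite earlier ones."""
    out = {}
    for key_col, val_col in kvpairs:
        out[row[key_col]] = row[val_col]
    return out


def row_to_dict_summed(row, kvpairs):
    """Hypothetical variant: add up values that share a key instead of overwriting."""
    out = {}
    for key_col, val_col in kvpairs:
        key = row[key_col]
        out[key] = out.get(key, 0) + row[val_col]
    return out


kvpairs = [['k1', 'v1'], ['k2', 'v2']]
row = {'k1': 'a', 'v1': 1, 'k2': 'a', 'v2': 5}   # same key in both key columns

overwritten = row_to_dict(row, kvpairs)
summed = row_to_dict_summed(row, kvpairs)
print(overwritten)   # {'a': 5}
print(summed)        # {'a': 6}
```

For my data the keys within a row are distinct, so the simple overwrite behaviour is fine.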

Now try it out in a pipeline with DictVectorizer:

In [15]:
pipeline = Pipeline([
    ('kv', KVExtractor(kvpairs)),
    ('dv', DictVectorizer(sparse=False)),
])
print(D)
A = pipeline.fit_transform(D)
print(A.shape)
print(A)
  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3
(2, 3)
[[ 1.  2.  0.]
 [ 0.  2.  3.]]
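The bare array loses the key names, so it can help to pull them back out of the fitted DictVectorizer step. Here is a self-contained sketch of that idea: it redefines the transformer so the block runs on its own, looks up the step via `named_steps`, and wraps the result in a DataFrame. The name `labeled` is my own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class KVExtractor(TransformerMixin, BaseEstimator):
    """Turn pairs of key/value columns into one dict per row."""

    def __init__(self, kvpairs):
        self.kvpairs = kvpairs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [
            {row[key_col]: row[val_col] for key_col, val_col in self.kvpairs}
            for _, row in X.iterrows()
        ]


D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]],
                 columns=['k1', 'v1', 'k2', 'v2'])

pipeline = Pipeline([
    ('kv', KVExtractor([['k1', 'v1'], ['k2', 'v2']])),
    ('dv', DictVectorizer(sparse=False)),
])
A = pipeline.fit_transform(D)

# The fitted vectorizer remembers which key went to which column.
names = pipeline.named_steps['dv'].feature_names_
labeled = pd.DataFrame(A, columns=names, index=D.index)
print(labeled)
```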

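Now replace one key with a value the pipeline never saw during fitting and transform again without refitting. DictVectorizer ignores keys it did not see during fit, so x should simply drop out: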

In [16]:
D['k2'] = ['x', 'c']
print(D)
print(pipeline.transform(D))
  k1  v1 k2  v2
0  a   1  x   2
1  b   2  c   3
[[ 1.  0.  0.]
 [ 0.  2.  3.]]

Perfect!
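Because unseen keys are dropped silently, it may be worth checking for them before transforming new data. A minimal sketch, assuming the rows are already in dict form (as KVExtractor produces them); the variable names `train`, `new_rows` and `unseen` are mine.

```python
from sklearn.feature_extraction import DictVectorizer

train = [{'a': 1, 'b': 2}, {'b': 2, 'c': 3}]
dv = DictVectorizer(sparse=False).fit(train)

new_rows = [{'a': 1, 'x': 2}, {'b': 2, 'c': 3}]

# Keys present in the new data that the vectorizer did not learn during fit.
unseen = sorted({key for row in new_rows for key in row} - set(dv.vocabulary_))
print(unseen)   # ['x']

# Transforming still works; the value stored under 'x' is just not represented.
print(dv.transform(new_rows))
```

If `unseen` is not empty, that is a hint the pipeline may need to be refitted on data that includes the new keys.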
