python - Vectorizing a Pandas dataframe for Scikit-Learn

ID : 274317

viewed : 31

Tags : python, pandas, scikit-learn





Top 5 Answer for python - Vectorizing a Pandas dataframe for Scikit-Learn

90

First, I can't tell which parts of your sample array are features and which are observations.

Second, DictVectorizer holds no data; it is only a transformation utility that also stores metadata. Once fitted, it stores the feature names and the mapping from feature name to column index (feature_names_ and vocabulary_). Its output is a feature matrix used for further computation: a SciPy sparse matrix by default, or a NumPy array when constructed with sparse=False. The matrix has one row per observation and one column per feature, and each entry is the value of that feature for that observation. So if you know your observations and features, you can build this array any other way you like.
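To make the shape and the stored metadata concrete, here is a minimal sketch. The three records are hypothetical sample data, and sparse=False is chosen only so that the result is a plain NumPy array that is easy to inspect.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical observations: one dict per row, keys are feature names.
records = [
    {"col1": "A", "col2": "foo"},
    {"col1": "B", "col2": "bar"},
    {"col1": "C", "col2": "foo"},
]

# sparse=False asks for a dense NumPy array instead of a SciPy sparse matrix.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# Rows are observations, columns are features.
n_observations, n_features = X.shape

# Metadata learned during fitting: column names and name -> column index.
feature_names = vec.feature_names_
feature_index = vec.vocabulary_
```

String values are one-hot encoded, so each distinct "name=value" pair becomes its own column.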

If you want sklearn to do it for you, you don't have to build the dicts by hand: call to_dict on the transposed dataframe:

    >>> df
      col1 col2
    0    A  foo
    1    B  bar
    2    C  foo
    3    A  bar
    4    A  foo
    5    B  bar
    >>> list(df.T.to_dict().values())  # list() is needed on Python 3
    [{'col1': 'A', 'col2': 'foo'},
     {'col1': 'B', 'col2': 'bar'},
     {'col1': 'C', 'col2': 'foo'},
     {'col1': 'A', 'col2': 'bar'},
     {'col1': 'A', 'col2': 'foo'},
     {'col1': 'B', 'col2': 'bar'}]

Since pandas 0.13.0 (Jan 3, 2014), the to_dict() method accepts a new 'records' value for its orient parameter, so you can now simply call it without any additional manipulation:

    >>> import pandas
    >>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'],
    ...                        'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
    >>> df
      col1 col2
    0    A  foo
    1    B  bar
    2    C  foo
    3    A  bar
    4    A  foo
    5    B  bar
    >>> df.to_dict(orient='records')
    [{'col1': 'A', 'col2': 'foo'},
     {'col1': 'B', 'col2': 'bar'},
     {'col1': 'C', 'col2': 'foo'},
     {'col1': 'A', 'col2': 'bar'},
     {'col1': 'A', 'col2': 'foo'},
     {'col1': 'B', 'col2': 'bar'}]
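For completeness, here is a sketch of how those records might be fed to DictVectorizer and the result wrapped back into a labelled dataframe. The name `encoded` and the choice of sparse=False are assumptions made for readability, not part of the question.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row, exactly the structure DictVectorizer consumes.
records = df.to_dict(orient="records")

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# Optional: put the one-hot columns back into a dataframe for inspection.
encoded = pd.DataFrame(X, columns=vec.feature_names_, index=df.index)
```

The array `X` is what you would pass to an estimator; `encoded` is only a convenience for checking which column is which.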

88

Take a look at sklearn-pandas, which provides exactly what you're looking for. The project has a corresponding GitHub repository.

76

You can definitely use DictVectorizer. Because DictVectorizer expects an iterable of dict-like objects, you could do the following:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction import DictVectorizer


    class RowIterator(BaseEstimator, TransformerMixin):
        """Prepare a dataframe for DictVectorizer."""

        def fit(self, X, y=None):
            # Stateless: there is nothing to learn from the data.
            return self

        def transform(self, X):
            # Yield each row as a dict-like pandas Series.
            return (row for _, row in X.iterrows())


    vectorizer = make_pipeline(RowIterator(), DictVectorizer())

    # Now you can use vectorizer as you might expect, e.g.
    vectorizer.fit_transform(df)
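As a variation on the same idea, the sketch below swaps the custom class for scikit-learn's FunctionTransformer wrapping to_dict(orient="records"), and shows what happens when new data contains a category that was not seen during fitting. The helper `frame_to_records` and the `train` and `new` frames are hypothetical names introduced for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def frame_to_records(frame):
    """Hypothetical helper: turn a dataframe into a list of row dicts."""
    return frame.to_dict(orient="records")


vectorizer = make_pipeline(
    FunctionTransformer(frame_to_records),
    DictVectorizer(sparse=False),
)

train = pd.DataFrame({"col1": ["A", "B", "C"], "col2": ["foo", "bar", "foo"]})
new = pd.DataFrame({"col1": ["B", "D"], "col2": ["foo", "foo"]})

# Columns are learned from the training frame only.
X_train = vectorizer.fit_transform(train)

# "D" was never seen during fit, so it gets no column and is silently dropped.
X_new = vectorizer.transform(new)
```

Because the column layout is fixed at fit time, `X_new` has the same number of columns as `X_train`, which is what a downstream estimator needs.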

62

You want to build a design matrix from a pandas DataFrame containing categoricals (or simply strings). The easiest way to do that is with patsy, a library that replicates and extends R's formula functionality.

Using your example, the conversion would be:

    import pandas as pd
    import patsy

    my_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'],
                          'col2': ['foo', 'bar', 'something', 'foo', 'bar', 'foo']})

    patsy.dmatrix('col1 + col2', data=my_df)      # With added intercept
    patsy.dmatrix('0 + col1 + col2', data=my_df)  # Without added intercept

The resulting design matrices are just NumPy arrays with some extra information and can be directly used in scikit-learn.

Example result with intercept added:

    DesignMatrix with shape (6, 5)
      Intercept  col1[T.B]  col1[T.C]  col2[T.foo]  col2[T.something]
              1          0          0            1                  0
              1          1          0            0                  0
              1          0          1            0                  1
              1          0          0            1                  0
              1          0          0            0                  0
              1          1          0            1                  0
      Terms:
        'Intercept' (column 0)
        'col1' (columns 1:3)
        'col2' (columns 3:5)

Note that patsy avoids multicollinearity by absorbing the effects of A and bar into the intercept. That way, for example, the col1[T.B] predictor should be interpreted as the additional effect of B relative to observations that are classified as A.
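As a rough sketch of passing such a design matrix to scikit-learn: the target values `y` below are made up purely for illustration, and LinearRegression is just one possible estimator. Since the matrix already contains an Intercept column, the estimator's own intercept is switched off.

```python
import numpy as np
import pandas as pd
import patsy
from sklearn.linear_model import LinearRegression

my_df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "something", "foo", "bar", "foo"],
})

# Design matrix with the intercept column added by patsy.
design = patsy.dmatrix("col1 + col2", data=my_df)

# Column labels are kept in the attached design_info metadata.
column_names = design.design_info.column_names

# Drop the patsy metadata and keep a plain NumPy array.
X = np.asarray(design)

# Hypothetical target values, one per row; not taken from the question.
y = np.array([1.0, 2.0, 3.0, 1.5, 0.5, 2.5])

# fit_intercept=False because the design matrix already has an Intercept column.
model = LinearRegression(fit_intercept=False).fit(X, y)
```

In this particular sample, C and something occur only together in the same row, so their two columns are identical and the individual coefficients for them are not uniquely determined.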
