python - numpy: most efficient frequency counts for unique values in an array

Tags: python, arrays, performance, numpy

Top 5 answers for python - numpy: most efficient frequency counts for unique values in an array

91 votes

As of NumPy 1.9, the easiest and fastest method is to use numpy.unique, which has a return_counts keyword argument:

    import numpy as np

    x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
    unique, counts = np.unique(x, return_counts=True)

    # Stack values and counts into a two-column array
    print(np.asarray((unique, counts)).T)

Which gives:

    [[ 1  5]
     [ 2  3]
     [ 5  1]
     [25  1]]
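If a lookup by value is more convenient than a two-column array, the two arrays returned by np.unique can be zipped into a dictionary. This is only a minimal sketch; the name freq is chosen here for illustration and does not come from the answer.

```python
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
values, counts = np.unique(x, return_counts=True)

# Pair each unique value with its count; .tolist() converts NumPy
# scalars to plain Python ints so the dict prints cleanly.
freq = dict(zip(values.tolist(), counts.tolist()))
print(freq)  # {1: 5, 2: 3, 5: 1, 25: 1}
```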

A quick comparison with scipy.stats.itemfreq:

    In [4]: x = np.random.random_integers(0,100,1e6)

    In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
    10 loops, best of 3: 31.5 ms per loop

    In [6]: %timeit scipy.stats.itemfreq(x)
    10 loops, best of 3: 170 ms per loop
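To repeat a comparison like this outside IPython, something along the following lines may work. It is a sketch under a few assumptions: it uses the standard-library timeit module instead of %timeit, it generates the data with np.random.default_rng rather than random_integers, and it times np.unique against the np.bincount approach from another answer on this page, since itemfreq is reported as deprecated. The function names are hypothetical, and timings will vary by machine and library version.

```python
import timeit

import numpy as np

# Assumed test data: one million integers drawn from [0, 100].
rng = np.random.default_rng(0)
x = rng.integers(0, 101, size=1_000_000)


def count_unique(a):
    # Sorted unique values and how often each occurs.
    return np.unique(a, return_counts=True)


def count_bincount(a):
    # Count every integer from 0 to max(a), then keep the non-empty bins.
    counts = np.bincount(a)
    values = np.nonzero(counts)[0]
    return values, counts[values]


# Check that both approaches agree before comparing their timings.
u_vals, u_counts = count_unique(x)
b_vals, b_counts = count_bincount(x)
assert np.array_equal(u_vals, b_vals)
assert np.array_equal(u_counts, b_counts)

for fn in (count_unique, count_bincount):
    # Best of three single runs, reported in milliseconds.
    best = min(timeit.repeat(lambda: fn(x), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1e3:.1f} ms")
```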
84 votes

Take a look at np.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

    import numpy as np

    x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
    y = np.bincount(x)
    ii = np.nonzero(y)[0]

And then:

    list(zip(ii, y[ii]))
    # [(1, 5), (2, 3), (5, 1), (25, 1)]

or:

    np.vstack((ii, y[ii])).T
    # array([[ 1,  5],
    #        [ 2,  3],
    #        [ 5,  1],
    #        [25,  1]])

or however you want to combine the counts and the unique values.
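One caveat worth keeping in mind: np.bincount only accepts non-negative integers and allocates one bin for every integer up to the largest value, so it suits small, dense integer ranges. The sketch below wraps the approach in a helper that falls back to np.unique otherwise. The function value_counts and its max_bins parameter are hypothetical names introduced for illustration; they are not part of NumPy, and the default threshold is an arbitrary assumption.

```python
import numpy as np


def value_counts(a, max_bins=10_000_000):
    """Return (values, counts) for a; hypothetical helper, not a NumPy API."""
    a = np.asarray(a).ravel()
    use_bincount = (
        a.size > 0
        # Restricted to signed integer dtypes to keep the sketch simple.
        and np.issubdtype(a.dtype, np.signedinteger)
        and a.min() >= 0          # bincount rejects negative values
        and a.max() < max_bins    # avoid allocating a huge, mostly empty array
    )
    if use_bincount:
        counts = np.bincount(a)
        values = np.nonzero(counts)[0]
        return values, counts[values]
    # General fallback: works for negative, float, and string data too.
    return np.unique(a, return_counts=True)


values, counts = value_counts([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
print(values, counts)
```

With negative input such as value_counts([-1, -1, 3]), the helper takes the np.unique branch instead of raising an error.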

72 votes

Update: scipy.stats.itemfreq, the method used in the original answer, is deprecated. Use np.unique with return_counts=True instead:

    >>> import numpy as np
    >>> x = [1,1,1,2,2,2,5,25,1,1]
    >>> np.array(np.unique(x, return_counts=True)).T
    array([[ 1,  5],
           [ 2,  3],
           [ 5,  1],
           [25,  1]])

Original answer:

You can use scipy.stats.itemfreq:

    >>> from scipy.stats import itemfreq
    >>> x = [1,1,1,2,2,2,5,25,1,1]
    >>> itemfreq(x)
    /usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated!
    `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
    array([[  1.,   5.],
           [  2.,   3.],
           [  5.,   1.],
           [ 25.,   1.]])
64 votes

I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:

    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    out = np.vstack((ii, y[ii])).T

is by far the fastest. (Note the log-scaling.)

[Figure: perfplot runtime comparison of the methods, plotted against len(a)]

Code to generate the plot:

    import numpy as np
    import pandas as pd
    import perfplot
    from scipy.stats import itemfreq


    def bincount(a):
        y = np.bincount(a)
        ii = np.nonzero(y)[0]
        return np.vstack((ii, y[ii])).T


    def unique(a):
        unique, counts = np.unique(a, return_counts=True)
        return np.asarray((unique, counts)).T


    def unique_count(a):
        unique, inverse = np.unique(a, return_inverse=True)
        count = np.zeros(len(unique), dtype=int)
        np.add.at(count, inverse, 1)
        return np.vstack((unique, count)).T


    def pandas_value_counts(a):
        out = pd.value_counts(pd.Series(a))
        out.sort_index(inplace=True)
        out = np.stack([out.keys().values, out.values]).T
        return out


    b = perfplot.bench(
        setup=lambda n: np.random.randint(0, 1000, n),
        kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
        n_range=[2 ** k for k in range(26)],
        xlabel="len(a)",
    )
    b.save("out.png")
    b.show()
57 votes

Using the pandas module:

    >>> import pandas as pd
    >>> import numpy as np
    >>> x = np.array([1,1,1,2,2,2,5,25,1,1])
    >>> pd.value_counts(x)
    1     5
    2     3
    25    1
    5     1
    dtype: int64
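As the output shows, value_counts orders the result by count, not by value. If the value-ordered two-column layout of the NumPy answers is wanted, a sketch along these lines may help. It assumes the Series.value_counts method rather than the top-level pd.value_counts call, and the names counts and table are illustrative only.

```python
import numpy as np
import pandas as pd

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])

# value_counts() sorts by count (descending); sort_index() reorders by value.
counts = pd.Series(x).value_counts().sort_index()

# Two-column array of (value, count), like the np.unique and np.bincount results.
table = np.column_stack((counts.index.to_numpy(), counts.to_numpy()))
print(table)
```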
