python - How is set() implemented?

ID : 20160

viewed : 22

Tags : pythondata-structuressetcpythonpython

Top 5 Answer for python - How is set() implemented?

vote vote

93

According to this thread:

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.

If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, was originally mostly a cut-and-paste from the dict implementation.

Note: Nowadays, set and dict's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differs; they're still implemented in terms of hashtables, so average case lookup and insertion remains O(1), but set is no longer just "dict, but with dummy/omitted keys".

vote vote

87

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).

vote vote

80

I think its a common mistake, set lookup (or hashtable for that matter) are not O(1).
from the Wikipedia

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?

vote vote

62

We all have easy access to the source, where the comment preceding set_lookkey() says:

/* set object implementation  Written and maintained by Raymond D. Hettinger <python@rcn.com>  Derived from Lib/sets.py and Objects/dictobject.c.  The basic lookup function used by all operations.  This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.  The initial probe index is computed as hash mod the table size.  Subsequent probe indices are computed as explained in Objects/dictobject.c.  To improve cache locality, each probe inspects a series of consecutive  nearby entries before moving on to probes elsewhere in memory.  This leaves  us with a hybrid of linear probing and open addressing.  The linear probing  reduces the cost of hash collisions because consecutive memory accesses  tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,  we then use open addressing with the upper bits from the hash value.  This  helps break-up long chains of collisions.  All arithmetic on hash should ignore overflow.  Unlike the dictionary implementation, the lookkey function can return  NULL if the rich comparison returns an error. */   ... #ifndef LINEAR_PROBES #define LINEAR_PROBES 9 #endif  /* This must be >= 1 */ #define PERTURB_SHIFT 5  static setentry * set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)   { ... 
vote vote

51

To emphasize a little more the difference between set's and dict's, here is an excerpt from the setobject.c comment sections, which clarify's the main difference of set's against dicts.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

source on github

Top 3 video Explaining python - How is set() implemented?

Related QUESTION?