Hash Tables: Content Addressable
- Hash tables allow content addressable data structures
- Access elements using using a key rather than an index
- Key is frequently a string
- Access via an index provided by arrays
- Code Example
phonebook = {"Jack" : "831-1111", "Jill" : "831-9999"} print phonebook("Jack")
- Access by index uses index value to calculate address of element
- Access by key requires some kind of search to find element
- Balanced Search Tree: Lg n search
- Hash Table: expected constant time search
Hash Table: Common in Modern Languages
- Modern languages provide hash tables
- Either built-in: Perl, Python, Ruby
- Or via a library: Java HashMap, Ada Hashed_Map
- Common names
- Map
- Dictionary
- Hash Map
- Associative array
Problem Overview
- How to do fast insertion, search, deletion of data with keys
- Hash tables give expected case behavior of O(1)
- Better than balanced tree
- However, worst case behavior is O(n)
- History:
- Invented in IBM, about 1950
- One of first uses of linked lists
- Related to content addressable memory:
- Look up memory location by its contents (ie value of key)
- Used for page tables, etc
- Presentation based on CLRS, Knuth3, Neop.
Hash Table: Implementation Illustration
- Consider 100 records with keys 1 .. 100
- Easy implementation: array of 100 records: O(1) search
- Consider 100 records with 9-digit ID as keys
- Array of 10^9 records is not practical (array is large and sparse)
- Hash table solution:
- Array of 100 records
- Use last 2 digits of ID as the key (ie array index)
- Fast, but must deal with problem of collisions
- Collision: two records with same key
- Need a better way to get the 2 digit key
- Using the last 2 digits may lead to too many collisions
Hash Tables: Problem Generalization
- Goal: store and search a set S of records
- Assumptions:
- Each element of S has a unique key
- Keys are drawn from some universe U
- |S| << |U|, so allocating array of size |U| would waste space
- |U| is too large to allocate an array 1 .. |U|
- Solution: Store set S in array T
- T: array(0 .. m-1) of Data_Records
- |T| = m
- How big should m be (ie how big should T be)?
- |T| = m = O(|S|)
- In practice, m < c|S|, where c=2(?) or 3
Hash Function
- We want to delegate each data record to be stored in a specific element of T
- This is done by using a function that maps keys to array elements
- We define a function h that maps k to h(k)
- We store record k in slot h(k), not in slot k
- Sometimes we use the phrase "record k" or "element k" to mean the element with key k
- Call the function h, a hash function
- Function h transforms a key k to a unique element of T
- h: U → 0 .. m-1
- That is, h(k) ∈ 0 .. m-1
- We say that k hashes to slot h(k)
- A slot is a location in hash table
- We will consider hash functions after looking at collisions and performance
Collisions
- What if two or more keys hash to the same slot?
- We call this a collision
- When are collisions possible/guaranteed?
- if |S| < m then collision possible
- if |S| ≥ m then collision guaranteed (by the ... principle)
- Two methods of handling collisions:
- Chaining (usually the better method)
- Open addressing
- Other terms:
- Chaining: aka open hashing
- Open addressing: aka closed hashing
- One text says that open hashing is aka open addressing
- Other sources disagree with this
Chaining
- Put all elements that hash to the same slot into a linked list
- Slot j contains a pointer to the head of the list of all stored elements that hash to j
- If no elements hash to j, slot j contains NIL
- Table is initially all nil
- Diagram: Universe of keys U, Set of keys S = {k: k in 1 .. 8}, slots in 1 .. 8
- h(1) = h(4) = 1
- h(2) = h(5) = h(7) = 5
- h(3) = 7
- h(6) = h(8) = 8
Chaining - Insert
- Insert(T, x) inserts x at the head of list T(h(x.key))
- Worst case running time: O(1)
- Precondition: x isn't already in the list
- Performance to guarantee precondition: look at cost of search later
Chaining - Delete
- delete(T, x) delete x from the list T(h(x.key))
- No search of list needed: x is a pointer
- Worst case: O(1), if doubly linked lists
- Worst case: time for search for predecessor, if singly linked lists
- Based on load factor (See below)
Chaining - Search
- Search(T, k) search for element x with x.key=k in list T(h(k))
- Running time: proportional to length of list T(h(k))
- Analyze below
Load Factor: α
- Analysis of hashing with chaining uses the load factor
- α = n / m
- n = number of elements in the table
- m = number of slots in the table = number of linked lists
Simple Uniform Hashing
- Our analysis of hashing will assume simple uniform hashing
- Simple uniform hashing: any given element is equally likely to hash into any of the m slots in the table
- Thus, the probability that xi maps to slot j is 1/m
- The probability that two keys map to the same slot is also 1/m
- If k elements are inserted, what is the expected number of elements at slot j? $$ \sum_{i=1}^k 1/m = (1/m) \sum_{i=1}^k 1 = (1/m) * k = k/m$$
Worst Case Analysis of Search (Hashing with Chaining)
- Search - Worst case: all n elements has to same slot
- Assume m slots
- Worst case: Θ(n), plus time to compute hash
- What is the probability of the worst case occurring? $$ m \times \left ( \frac{1}{m}\right )^{n} = m^{-n+1} $$
- In opening example - 100 keys and 100 slots: $$ 100 \times \left ( \frac{1}{100}\right )^{100} = 10^{-98} $$
Analyzing Search: Some Notation and Facts
- nj = length of list at T(j) [ie slot j]
- n = n0 + n1 + ... + nm-1
- Expected value of nj = α = n / m
- Time to compute hash function is O(1)
Average Case Analysis of Search: Two Cases
- Case 1: Unsuccessful search - hash table contains no element with key k
- Case 2: Successful search - hash table contains an element with key k
Average Case Analysis of Unsuccessful Search
- Any key k that is not in the table is equally likely to hash to any of the m slots (Why?)
- Unsuccessful search must, of course, examine all elements of the list at the slot T(k)
- Expected length of list T(k) is α
- Total time: Θ(1 + α), including cost of computing h(k)
- Add 1 for time to compute value of h(k)
Average Case Analysis of Successful Search - Simple Version
- Assume all lists are length α = n/m (and that m divides n evenly so that α = n/m is an integer)
- Number of elements to look at: 1 + number inserted before element with key k
- Then average number of comparisons is (1 + n/m) / 2 = Θ(1/2 + α/2): $$ 1/\alpha \times \sum_{i=1}^\alpha i = 1/\alpha \times (\alpha(\alpha + 1)) / 2 = (1 + \alpha) / 2 $$
- Including the cost of computing the hash function gives Θ(3/2 + α/2) = Θ(1 + α)
- Same as unsuccessful
Average Case Analysis of Successful Search - More Complex
- What if lists are not same length?
- Probability that each list is searched is proportional to number of elements in that list
- Number of elements examined during a succesful search for x depends number of elements to left of x in T(h(x.key))
- Need to calculate the average, over the n elements xi in the table, of how many elements were inserted into xi's list after xi was inserted
- Let's think about x1, x2, x3, ..., x8 being inserted into a table of size 4 (ie n=8, m=4)
- Assume lists of size 2, 3, 1, 2
- Let xi be the ith element added into the table
- Where does x1 go?
- Where does x2 go?
- After xi is entered, what is the probability that xi+1 will map to the same list (ie be to xi's left)?
- After xi is entered, what is the expected number of elements to its left?
Average Case Analysis of Successful Search (continued)
- Let xi be the ith element added to the table
- On average, how many of the remaining n-i elements will map to the same list as element xi?
- Answer: $\sum_{j=i+1}^n 1/m = (1/m) * \sum_{j=i+1}^n 1 = (1/m) * (n-i) = (n-i)/m$
- Thus, on average, the number of searches needed to find element xi is
$1 + \sum_{j=i+1}^n 1/m = 1 + (n-i)/m $ - Thus, on average, the number of searches to find any element is the average over all xi of finding that xi
- This average is calculated as follows:
- Adding time for a computing h(x) gives Θ(2 + α/2 - α/2n) = Θ(1 + α)
(1/n) [number of comparisons to find element xi, for all xi] |
= (1/n) [Σi=1..n (1 + (n-i)/m)] |
= (1/n) [n + Σi=1..n (n-i)/m] |
= 1 + (1/n) [Σi=1..n (n-i)/m] |
= 1 + (1/mn) [Σi=1..n (n-i)] |
= 1 + (1/mn) [Σj=0..n-1 j] |
= 1 + (1/mn) [(n-1)n/2] |
= 1 + (1/m) [(n-1)/2] |
= 1 + n/2m - 1/2m |
= 1 + α/2 - α/2n |
Average Case Analysis of Successful Search: Bottom Line
- If n = O(m) [ie number of elements proportional to size of table]
- then α = O(1)
- and Search time = O(1)
Comparison with Binary Search
- Binary search run time = O(lg n)
- In a hash table with n=m, what is the probability that search takes lg n time?
- (ie that any list is length lg n)
- Probability that a given bucket contains k keys ≤ ${n \choose k} \times (1/m)^k$
- Probability of any k ending up in same slot is (1/m)^k
- There are ${n \choose k}$ ways of choosing k elements from n elements
- For n=2^16, this gives less than 3 out of 10^9
Hash Functions
- Goal: provide Simple Uniform Hashing (ie each slot equally likely)
- Problem: Probability distribution of keys usually unknown
- Problem: Keys not drawn independently
- Requirement: Must map key to an integer
- Two common functions: Division and Multiplication
Mapping Keys to Integers
- Example: Consider character strings to be numbers in base 128 = 2^7
- 128 basic ascii values
- Example: String "CLRS"
- Ascii: C=67, L=76, R=82, S=83
- CLRS yields 67*128^3 + 76*128^2 + 82*128^1 + 83*128^0 = 141_764_947
Division Method
- h(k) = k mod m
- Example: m = 20, k = 91, yields h(k) = 11
- Advantage: fast
- Disadvantage: must avoid certain values of m
- avoid powers of 2 (x mod m gives low bits of x)
- m = 2^p - 1 can also be bad
- Typical choice: m should be prime not too close to 2^r for some r
Multiplication Method
- h(k) = floor m*(k*A mod 1)
- A in range 0 < A < 1
- kA mod 1 = kA - floor(kA) = fractional part of kA
- Disadvantage: slower than division
- Advantage: Value of m not critical
- Some values of A work better than others
- Knuth suggests A = (sqrt(5) - 1)/2
- What is this number?
Universal Hashing
- Problem with hashing: for any specific hash function, a specific set of keys will cause cause worst case performance
- Hacker could deliberately use that set of keys
- Avoid case where all values hash to same location by
- Choosing a hash function at random from a set of possible functions
- Having no fixed hash function guarantees that no specific set of values will lead to poor performance
- Similar to picking random pivots for quicksort
Open Addressing
- Store collisions in open cell of table instead of in a list
- AKA: Closed hashing
- Schemes:
- Linear probing:
- store collision value in next open slot, in order
- h(k, i) = (h'(k) + i) mod m
- tends to have clustering
- Quadratic probing and Double hashing:
- store collision value in open slot, some distance away
- h(k, i) = (h'(k) + c*i + d*i^2) mod m [quadratic probing]
- h(k, i) = (h'(k) + i*s(k)) mod m [double hashing]
- tends to have less clustering than Linear
Perfect Hashing
- Worst case performance: O(1)
- Requires a fixed set of keys
- Uses two levels of hashing:
- Keep collisions in hash table instead of list
- Each slot has its own has hash function (based on its index)
- Worst case space can be shown to be O(n)