Hash Tables (2024)

Hash Tables: Content Addressable

Hash tables allow content addressable data structures

Access elements using using a key rather than an index

Key is frequently a string
Access via an index provided by arrays

Code Example

 phonebook = {"Jack" : "831-1111", "Jill" : "831-9999"} print phonebook("Jack")

Accessing by index or key

Access by index uses index value to calculate address of element
Access by key requires some kind of search to find element

Common Use: Compiler symbol table

Possible implementations:

Balanced Search Tree: Lg n search
Hash Table: expected constant time search

Hash Table: Common in Modern Languages

Modern languages provide hash tables

Either built-in: Perl, Python, Ruby
Or via a library: Java HashMap, Ada Hashed_Map

Common names

Map
Dictionary
Hash Map
Associative array

Problem Overview

How to do fast insertion, search, deletion of data with keys

Hash tables give expected case behavior of O(1)

Better than balanced tree
However, worst case behavior is O(n)

History:

Invented in IBM, about 1950
One of first uses of linked lists
Related to content addressable memory:

Look up memory location by its contents (ie value of key)
Used for page tables, etc

Presentation based on CLRS, Knuth3, Neop.

Hash Table: Implementation Illustration

Consider 100 records with keys 1 .. 100

Easy implementation: array of 100 records: O(1) search

Consider 100 records with 9-digit ID as keys

Array of 10^9 records is not practical (array is large and sparse)
Hash table solution:

Array of 100 records
Use last 2 digits of ID as the key (ie array index)
Fast, but must deal with problem of collisions

Collision: two records with same key

Need a better way to get the 2 digit key

Using the last 2 digits may lead to too many collisions

Hash Tables: Problem Generalization

Goal: store and search a set S of records

Assumptions:

Each element of S has a unique key

Keys are drawn from some universe U

|S| << |U|, so allocating array of size |U| would waste space

|U| is too large to allocate an array 1 .. |U|

Solution: Store set S in array T

T: array(0 .. m-1) of Data_Records

|T| = m

How big should m be (ie how big should T be)?

|T| = m = O(|S|)
In practice, m < c|S|, where c=2(?) or 3

Hash Function

We want to delegate each data record to be stored in a specific element of T

This is done by using a function that maps keys to array elements

We define a function h that maps k to h(k)

We store record k in slot h(k), not in slot k

Sometimes we use the phrase "record k" or "element k" to mean the element with key k

Call the function h, a hash function

Function h transforms a key k to a unique element of T
h: U → 0 .. m-1
That is, h(k) ∈ 0 .. m-1

We say that k hashes to slot h(k)

A slot is a location in hash table

We will consider hash functions after looking at collisions and performance

Collisions

What if two or more keys hash to the same slot?

We call this a collision

When are collisions possible/guaranteed?

if |S| < m then collision possible
if |S| ≥ m then collision guaranteed (by the ... principle)

Two methods of handling collisions:

Chaining (usually the better method)
Open addressing

Other terms:

Chaining: aka open hashing
Open addressing: aka closed hashing

One text says that open hashing is aka open addressing
Other sources disagree with this

We focus on chaining

Chaining

Put all elements that hash to the same slot into a linked list

Slot j contains a pointer to the head of the list of all stored elements that hash to j

If no elements hash to j, slot j contains NIL

Table is initially all nil

Diagram: Universe of keys U, Set of keys S = {k: k in 1 .. 8}, slots in 1 .. 8

h(1) = h(4) = 1
h(2) = h(5) = h(7) = 5
h(3) = 7
h(6) = h(8) = 8

Chaining - Insert

Insert(T, x) inserts x at the head of list T(h(x.key))

Worst case running time: O(1)

Precondition: x isn't already in the list

Performance to guarantee precondition: look at cost of search later

Chaining - Delete

delete(T, x) delete x from the list T(h(x.key))

No search of list needed: x is a pointer

Worst case: O(1), if doubly linked lists

Worst case: time for search for predecessor, if singly linked lists

Based on load factor (See below)

Chaining - Search

Search(T, k) search for element x with x.key=k in list T(h(k))

Running time: proportional to length of list T(h(k))

Analyze below

Load Factor: α

Analysis of hashing with chaining uses the load factor

α = n / m

n = number of elements in the table
m = number of slots in the table = number of linked lists

Simple Uniform Hashing

Our analysis of hashing will assume simple uniform hashing

Simple uniform hashing: any given element is equally likely to hash into any of the m slots in the table

Thus, the probability that x_i maps to slot j is 1/m

The probability that two keys map to the same slot is also 1/m

If k elements are inserted, what is the expected number of elements at slot j? $$ \sum_{i=1}^k 1/m = (1/m) \sum_{i=1}^k 1 = (1/m) * k = k/m$$

Worst Case Analysis of Search (Hashing with Chaining)

Search - Worst case: all n elements has to same slot
Assume m slots

Worst case: Θ(n), plus time to compute hash

What is the probability of the worst case occurring?
In opening example - 100 keys and 100 slots:

Analyzing Search: Some Notation and Facts

n_j = length of list at T(j) [ie slot j]

n = n₀ + n₁ + ... + n_m-1

Expected value of n_j = α = n / m

Time to compute hash function is O(1)

Average Case Analysis of Search: Two Cases

Case 1: Unsuccessful search - hash table contains no element with key k

Case 2: Successful search - hash table contains an element with key k

Average Case Analysis of Unsuccessful Search

Any key k that is not in the table is equally likely to hash to any of the m slots (Why?)

Unsuccessful search must, of course, examine all elements of the list at the slot T(k)

Expected length of list T(k) is α

Total time: Θ(1 + α), including cost of computing h(k)

Add 1 for time to compute value of h(k)

Average Case Analysis of Successful Search - Simple Version

Assume all lists are length α = n/m (and that m divides n evenly so that α = n/m is an integer)

Number of elements to look at: 1 + number inserted before element with key k

Then average number of comparisons is (1 + n/m) / 2 = Θ(1/2 + α/2): $$ 1/\alpha \times \sum_{i=1}^\alpha i = 1/\alpha \times (\alpha(\alpha + 1)) / 2 = (1 + \alpha) / 2 $$
Including the cost of computing the hash function gives Θ(3/2 + α/2) = Θ(1 + α)

Same as unsuccessful

Average Case Analysis of Successful Search - More Complex

What if lists are not same length?

Probability that each list is searched is proportional to number of elements in that list

Number of elements examined during a succesful search for x depends number of elements to left of x in T(h(x.key))

Need to calculate the average, over the n elements x_i in the table, of how many elements were inserted into x_i's list after x_i was inserted

Let's think about x1, x2, x3, ..., x8 being inserted into a table of size 4 (ie n=8, m=4)

Assume lists of size 2, 3, 1, 2
Let xi be the ith element added into the table
Where does x₁ go?
Where does x₂ go?
After x_i is entered, what is the probability that x_i+1 will map to the same list (ie be to x_i's left)?
After x_i is entered, what is the expected number of elements to its left?

Average Case Analysis of Successful Search (continued)

Let x_i be the i^th element added to the table

On average, how many of the remaining n-i elements will map to the same list as element x_i?

Answer: $\sum_{j=i+1}^n 1/m = (1/m) * \sum_{j=i+1}^n 1 = (1/m) * (n-i) = (n-i)/m$

Thus, on average, the number of searches needed to find element x_i is
$1 + \sum_{j=i+1}^n 1/m = 1 + (n-i)/m $

Thus, on average, the number of searches to find any element is the average over all x_i of finding that x_i

This average is calculated as follows:

(1/n) [number of comparisons to find element x_i, for all xi]

= (1/n) [Σ_i=1^..n (1 + (n-i)/m)]

= (1/n) [n + Σ_i=1^..n (n-i)/m]

= 1 + (1/n) [Σ_i=1^..n (n-i)/m]

= 1 + (1/mn) [Σ_i=1^..n (n-i)]

= 1 + (1/mn) [Σ_j=0^..n-1 j]

= 1 + (1/mn) [(n-1)n/2]

= 1 + (1/m) [(n-1)/2]

= 1 + n/2m - 1/2m

= 1 + α/2 - α/2n

Adding time for a computing h(x) gives Θ(2 + α/2 - α/2n) = Θ(1 + α)

Average Case Analysis of Successful Search: Bottom Line

If n = O(m) [ie number of elements proportional to size of table]
then α = O(1)
and Search time = O(1)

Comparison with Binary Search

Binary search run time = O(lg n)

In a hash table with n=m, what is the probability that search takes lg n time?

(ie that any list is length lg n)

Probability that a given bucket contains k keys ≤ ${n \choose k} \times (1/m)^k$

Probability of any k ending up in same slot is (1/m)^k
There are ${n \choose k}$ ways of choosing k elements from n elements

For n=2^16, this gives less than 3 out of 10^9

Hash Functions

Goal: provide Simple Uniform Hashing (ie each slot equally likely)

Problem: Probability distribution of keys usually unknown
Problem: Keys not drawn independently

Requirement: Must map key to an integer

Two common functions: Division and Multiplication

Mapping Keys to Integers

Example: Consider character strings to be numbers in base 128 = 2^7

128 basic ascii values

Example: String "CLRS"

Ascii: C=67, L=76, R=82, S=83

CLRS yields 67*128^3 + 76*128^2 + 82*128^1 + 83*128^0 = 141_764_947

Division Method

h(k) = k mod m

Example: m = 20, k = 91, yields h(k) = 11

Advantage: fast

Disadvantage: must avoid certain values of m

avoid powers of 2 (x mod m gives low bits of x)
m = 2^p - 1 can also be bad

Typical choice: m should be prime not too close to 2^r for some r

Multiplication Method

h(k) = floor m*(k*A mod 1)

A in range 0 < A < 1
kA mod 1 = kA - floor(kA) = fractional part of kA

Disadvantage: slower than division

Advantage: Value of m not critical

Some values of A work better than others

Knuth suggests A = (sqrt(5) - 1)/2

What is this number?

Universal Hashing

Problem with hashing: for any specific hash function, a specific set of keys will cause cause worst case performance

Hacker could deliberately use that set of keys

Avoid case where all values hash to same location by

Choosing a hash function at random from a set of possible functions

Having no fixed hash function guarantees that no specific set of values will lead to poor performance

Similar to picking random pivots for quicksort

Open Addressing

Store collisions in open cell of table instead of in a list
AKA: Closed hashing
Schemes:

Linear probing:

store collision value in next open slot, in order
h(k, i) = (h'(k) + i) mod m
tends to have clustering

Quadratic probing and Double hashing:

store collision value in open slot, some distance away
h(k, i) = (h'(k) + c*i + d*i^2) mod m [quadratic probing]
h(k, i) = (h'(k) + i*s(k)) mod m [double hashing]
tends to have less clustering than Linear

Perfect Hashing

Worst case performance: O(1)
Requires a fixed set of keys

Uses two levels of hashing:

Keep collisions in hash table instead of list
Each slot has its own has hash function (based on its index)
Worst case space can be shown to be O(n)