Lecture 21: Hash functions (2024)

Hash functions

Hash tables are one of the most useful data structures ever invented. Unfortunately, they are also one of the most misused. Code built using hash tables often falls far short of achievable performance. There are two reasons for this:

Clients choose poor hash functions that do not act like random number generators, invalidating the simple uniform hashing assumption.
Hash table abstractions do not adequately specify what is required of the hash function, or make it difficult to provide a good hash function.

Clearly, a bad hash function can destroy our attempts at a constant running time. A lot of obvious hash function choices are bad. For example, if we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. We want our hash function to use all of the information in the key. This is a bit of an art. While hash tables are extremely effective when used well, all too often poor hash functions are used that sabotage performance.

Recall that hash tables work well when the hash function satisfies the simple uniform hashing assumption: the hash function should look random. If it is to look random, this means that any change to a key, even a small one, should change the bucket index in an apparently random way. If we imagine writing the bucket index as a binary number, a small change to the key should randomly flip the bits in the bucket index. This is called information diffusion. For example, a one-bit change to the key should cause every bit in the index to flip with 1/2 probability.
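One way to make the diffusion property concrete is to measure it. The sketch below is not from the lecture: it flips each input bit in turn and reports the average fraction of output bits that change. The mixer `fmix32` is the 32-bit finalizer from MurmurHash3, chosen here only as a sample function with good diffusion; the helper name `avalanche_fraction` and the sample size are assumptions made for illustration.

```python
import random

def fmix32(h: int) -> int:
    """32-bit finalizer from MurmurHash3, used here only as a sample mixer."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def avalanche_fraction(hash_fn, keys, bits=32):
    """Average fraction of output bits that flip when one input bit flips."""
    flipped = 0
    total = 0
    for k in keys:
        base = hash_fn(k)
        for b in range(bits):
            diff = base ^ hash_fn(k ^ (1 << b))
            flipped += bin(diff).count("1")
            total += bits
    return flipped / total

rng = random.Random(0)  # fixed seed so the experiment is repeatable
sample_keys = [rng.getrandbits(32) for _ in range(200)]

identity_score = avalanche_fraction(lambda k: k, sample_keys)
mixed_score = avalanche_fraction(fmix32, sample_keys)
print(identity_score, mixed_score)
```

The identity function changes exactly one output bit per flipped input bit, so its score is 1/32. A function with good diffusion should score close to 1/2.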

Client vs. implementer

As we've described it, the hash function is a single function that maps from the key type to a bucket index. In practice, the hash function is the composition of two functions, one provided by the client and one by the implementer. This is because the implementer doesn't understand the element type, the client doesn't know how many buckets there are, and the implementer probably doesn't trust the client to achieve diffusion.

The client function h_client first converts the key into an integer hash code, and the implementation function h_impl converts the hash code into a bucket index. The actual hash function is the composition of these two functions, h_impl ∘ h_client: apply h_client first, then apply h_impl to its result.

To see what goes wrong, suppose our hash code function on objects is the memory address of the objects, as is typical of the default hashCode in Java. This is the usual choice. And suppose that our implementation hash function is like the one in SML/NJ; it takes the hash code modulo the number of buckets, where the number of buckets is always a power of two. This is also the usual implementation-side choice. But memory addresses are typically equal to zero modulo 16, so at most 1/16 of the buckets will be used, and the hash table will be about 16 times slower than one might expect.
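The effect is easy to reproduce. The following sketch uses made-up "addresses" (multiples of 16 starting from an arbitrary base) as hash codes and counts how many of 1024 buckets are ever hit. The helper name `buckets_used` and the specific numbers are assumptions for illustration.

```python
def buckets_used(hash_codes, num_buckets):
    """Number of distinct buckets hit when the index is hash_code mod num_buckets."""
    return len({code % num_buckets for code in hash_codes})

num_buckets = 1024  # a power of two, as in the scenario described above
# Hypothetical 16-byte-aligned object addresses used directly as hash codes.
aligned_addresses = [0x10000 + 16 * i for i in range(10_000)]

used = buckets_used(aligned_addresses, num_buckets)
print(used, "of", num_buckets, "buckets used")  # 64 of 1024 buckets used
```

Only 1024/16 = 64 buckets ever receive an element, no matter how many objects are inserted.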

Measuring clustering

When the distribution of keys into buckets is not random, we say that the hash table exhibits clustering. It's a good idea to test your function to make sure it does not exhibit clustering with the data. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely.
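The text does not give a formula for clustering, so the sketch below assumes one common estimate: with n elements in m buckets and x_i elements in bucket i, compute C = (m / (n − 1)) · (Σ x_i² / n − 1). For a uniformly random hash function the expected value of C is 1; values well above 1 indicate clustering, and values below 1 mean the keys are spread more evenly than random. The function name `clustering` is hypothetical.

```python
from collections import Counter

def clustering(bucket_indices, num_buckets):
    """Estimate clustering from the bucket index assigned to each key.

    Returns about 1.0 for a uniformly random assignment, more for clustered
    assignments, less for assignments that are more even than random.
    Requires at least two keys.
    """
    n = len(bucket_indices)
    counts = Counter(bucket_indices)
    sum_sq = sum(c * c for c in counts.values())
    return (num_buckets / (n - 1)) * (sum_sq / n - 1)

worst = clustering([0] * 100, 10)                        # every key in bucket 0
even = clustering([i % 10 for i in range(100)], 10)      # 10 keys per bucket
print(worst, even)
```

When every key lands in a single bucket the estimate equals the number of buckets, which is the slowdown factor relative to a well-behaved table.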

Designing a hash function

For a hash table to work well, we want the hash function to have two properties:

Injection: for two keys k₁ ≠ k₂, the hash function should give different results h(k₁) ≠ h(k₂), with probability (m−1)/m.
Diffusion (stronger than injection): if k₁ ≠ k₂, knowing h(k₁) gives no information about h(k₂). For example, if k₂ is exactly the same as k₁, except for one bit, then every bit in h(k₂) should change with 1/2 probability compared to h(k₁). Knowing the bits of h(k₁) does not give any information about the bits of h(k₂).

As a hash table designer, you need to figure out which of the client hash function and the implementation hash function is going to provide diffusion. For example, Java hash tables provide (somewhat weak) information diffusion, allowing the client hash code computation to just aim for the injection property. In SML/NJ hash tables, the implementation provides only the injection property. Regardless, the hash table specification should say whether the client is expected to provide a hash code with good diffusion (unfortunately, few do).

If clients are sufficiently savvy, it makes sense to push the diffusion onto them, leaving the hash table implementation as simple and fast as possible. The easy way to accomplish this is to break the computation of the bucket index into three steps.

1. Serialization: Transform the key into a stream of bytes that contains all of the information in the original key. Two equal keys must result in the same byte stream. Two byte streams should be equal only if the keys are actually equal. How to do this depends on the form of the key. If the key is a string, then the stream of bytes would simply be the characters of the string.
2. Diffusion: Map the stream of bytes into a large integer x in a way that causes every change in the stream to affect the bits of x apparently randomly. There are a number of good off-the-shelf ways to accomplish this, with a tradeoff in performance versus randomness (and security).
3. Bucket index: Compute the hash bucket index as x mod m. This is particularly cheap if m is a power of two, but see the caveats below.
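The three steps can be sketched for string keys as follows. The diffusion step here uses 64-bit FNV-1a, which is not one of the methods discussed in this lecture; it is swapped in only because it is short and well known. The function names `serialize`, `diffuse`, and `bucket_index` are hypothetical.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime

def serialize(key: str) -> bytes:
    """Step 1: turn the key into bytes; equal strings give equal byte streams."""
    return key.encode("utf-8")

def diffuse(data: bytes) -> int:
    """Step 2: mix the bytes into a 64-bit integer (FNV-1a, an assumed choice)."""
    x = FNV_OFFSET_BASIS
    for byte in data:
        x ^= byte
        x = (x * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return x

def bucket_index(key: str, m: int) -> int:
    """Step 3: reduce the large integer to a bucket index in 0..m-1."""
    return diffuse(serialize(key)) % m

print(bucket_index("alice", 64), bucket_index("alicf", 64))
```

Keeping the steps separate makes it possible to replace the diffusion step with a stronger or faster function without touching serialization or the table itself.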

There are several different good ways to accomplish step 2: multiplicative hashing, modular hashing, cyclic redundancy checks, and cryptographic hash functions such as MD5 and SHA-1.

Frequently, hash tables are designed in a way that doesn't let the client fully control the hash function. Instead, the client is expected to implement steps 1 and 2 to produce an integer hash code, as in Java. The implementation then uses the hash code and the value of m (usually not exposed to the client, unfortunately) to compute the bucket index.

Some hash table implementations expect the hash code to look completely random, because they directly use the low-order bits of the hash code as a bucket index, throwing away the information in the high-order bits. Other hash table implementations take a hash code and put it through an additional step of applying an integer hash function that provides additional diffusion. With these implementations, the client doesn't have to be as careful to produce a good hash code.
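A minimal sketch of the difference, assuming 32-bit hash codes and a 16-bucket table: the implementation-side step below folds the high-order half of the hash code into the low-order half before the low bits are taken. This particular mixing step and the names `spread` and `index_low_bits` are illustrative assumptions; real implementations differ in how much mixing they do.

```python
def spread(hash_code: int) -> int:
    """Extra implementation-side diffusion: XOR the high 16 bits into the low 16."""
    h = hash_code & 0xFFFFFFFF
    return h ^ (h >> 16)

def index_low_bits(hash_code: int, num_buckets: int) -> int:
    """Use the low-order bits as the index (num_buckets must be a power of two)."""
    return hash_code & (num_buckets - 1)

# Hash codes that differ only in their high-order bits.
codes = [i << 16 for i in range(16)]

without_mixing = {index_low_bits(c, 16) for c in codes}
with_mixing = {index_low_bits(spread(c), 16) for c in codes}
print(len(without_mixing), len(with_mixing))  # 1 16
```

Without the extra step all sixteen codes share bucket 0; with it they land in sixteen different buckets.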

Any hash table interface should specify whether the hash function is expected to look random. If the client can't tell from the interface whether this is the case, the safest thing is to compute a high-quality hash code by hashing into the space of all integers. This may duplicate work done on the implementation side, but it's better than having a lot of collisions.

Modular hashing

With modular hashing, the hash function is simply h(k) = k mod m for some m (usually, the number of buckets). The value k is an integer hash code generated from the key. If m is a power of two (i.e., m = 2^p), then h(k) is just the p lowest-order bits of k. The SML/NJ implementation of hash tables does modular hashing with m equal to a power of two. This is very fast, but the client needs to design the hash function carefully.
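The equivalence between k mod 2^p and the p lowest-order bits can be checked directly. A small sketch, with hypothetical function names and non-negative integer hash codes assumed:

```python
def modular_hash(k: int, m: int) -> int:
    """h(k) = k mod m."""
    return k % m

def low_bits(k: int, p: int) -> int:
    """The p lowest-order bits of k, extracted with a bit mask."""
    return k & ((1 << p) - 1)

k = 0b10110110          # 182
print(modular_hash(k, 16), low_bits(k, 4))  # 6 6, i.e. the low bits 0110
```

Everything above bit p is discarded, which is why the client's hash code needs good low-order bits when m is a power of two.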

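A friendlier but slower design uses modular hashing with m equal to a prime number, so that every bit of the hash code influences the bucket index. (Java's HashMap is sometimes described this way, but current versions keep a power-of-two table size and instead fold the high-order bits of the hash code into the low-order bits before masking.) Modulo operations can be accelerated by precomputing 1/m as a fixed-point number, e.g. ⌊2³¹/m⌋. A precomputed table of various primes and their fixed-point reciprocals is therefore useful with this approach, because the implementation can then use multiplication instead of division to implement the mod operation.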
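The reciprocal trick can be sketched as follows. This version assumes hash codes below 2^32 and stores ⌊2^32/m⌋ rather than the ⌊2^31/m⌋ mentioned above, so that the estimated quotient is off by at most one and a single conditional subtraction repairs it. The names `make_reciprocal` and `fast_mod` are hypothetical.

```python
SHIFT = 32  # assumed: hash codes satisfy 0 <= k < 2**32

def make_reciprocal(m: int) -> int:
    """Precompute 1/m as a fixed-point number with SHIFT fractional bits."""
    return (1 << SHIFT) // m

def fast_mod(k: int, m: int, recip: int) -> int:
    """Compute k mod m with a multiply and shift instead of a division.

    Because recip is rounded down, the quotient estimate q is either exact
    or one too small, so at most one correction step is needed.
    """
    q = (k * recip) >> SHIFT
    r = k - q * m
    if r >= m:
        r -= m
    return r

m = 1009                      # a prime table size
recip = make_reciprocal(m)    # computed once, when the table is sized
print(fast_mod(123456789, m, recip), 123456789 % m)
```

In Python the built-in `%` is of course simpler; the point is only to show the arithmetic an implementation in a lower-level language could use.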

Multiplicative hashing

A faster but often misused alternative is multiplicative hashing, in which the hash index is computed as ⌊m · frac(ka)⌋. Here k is again an integer hash code, a is a real number, and frac is the function that returns the fractional part of a real number. Multiplicative hashing sets the hash index from the fractional part of multiplying k by a large real number. It's faster if this computation is done using fixed point rather than floating point, which is accomplished by computing ⌊ka / 2^q⌋ mod m for appropriately chosen integer values of a, m, and q. So q determines the number of bits of precision in the fractional part of a.
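The fixed-point formula can be sketched as follows, assuming a 32-bit word, m = 2^p buckets, and q = 32 − p so that the index is taken from the top p bits of the low 32 bits of the product. The multiplier is the constant used in the lecture's SML example; the function name and the test keys are assumptions for illustration.

```python
MULTIPLIER = 0x678DDE6F   # the multiplier from the lecture's SML example
WORD_BITS = 32            # assumed word size

def multiplicative_hash(k: int, p: int) -> int:
    """Bucket index floor(k*a / 2^q) mod m, with m = 2^p and q = WORD_BITS - p."""
    q = WORD_BITS - p
    m = 1 << p
    return ((k * MULTIPLIER) >> q) % m

# Keys that are all multiples of 16, like aligned memory addresses.
keys = [16 * i for i in range(1024)]
used_by_mod = len({k % 1024 for k in keys})
used_by_mult = len({multiplicative_hash(k, 10) for k in keys})
print(used_by_mod, used_by_mult)
```

Plain modular hashing reaches only 64 of the 1024 buckets on these keys, while the multiplicative version spreads them over many more.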

Here is an example of multiplicative hashing code, written assuming a word size of 32 bits:

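    val multiplier: Word.word = 0wx678DDE6F  (* a recommendation by Knuth *)

    fun findBucket({arr, nelem}, e) (f: bucket array * int * bucket * elem -> 'a) =
      let
        val n = Word.fromInt(Array.length(arr))
        (* d is just over 2^32/n, so dividing a 32-bit product by d
           yields an index in the range 0..n-1 *)
        val d = (0wxFFFFFFFF div n) + 0w1
        (* the word multiplication wraps around, keeping the low-order bits *)
        val i = Word.toInt(Word.fromInt(Hash.hash(e)) * multiplier div d)
        val b = Array.sub(arr, i)
      in
        f(arr, i, b, e)
      end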

Multiplicative hashing works well for the same reason that linear congruential multipliers generate apparently random numbers: it's like generating a pseudo-random number with the hash code as the seed. The multiplier a should be large and its binary representation should be a "random" mix of 1's and 0's. Multiplicative hashing is cheaper than modular hashing because multiplication is usually considerably faster than division (or mod). It also works well with a bucket array of size m = 2^p, which is convenient.

In the fixed-point version, the division by 2^q is crucial. The common mistake when doing multiplicative hashing is to forget to do it, and in fact you can find web pages highly ranked by Google that explain multiplicative hashing without this step. Without this division, there is little point to multiplying by a, because ka mod m = ((k mod m) · (a mod m)) mod m. This is no better than modular hashing with a modulus of m, and quite possibly worse.
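The consequence of dropping the division can be checked numerically. In this sketch (constants and names chosen for illustration), the "broken" version multiplies by a and then reduces mod m directly; on keys that are multiples of 16 it uses exactly as few buckets as plain modular hashing.

```python
A = 0x678DDE6F   # multiplier from the lecture's example
M = 1024         # number of buckets, a power of two

def broken_multiplicative(k: int) -> int:
    """Multiplicative hashing with the division by 2^q forgotten."""
    return (k * A) % M

keys = [16 * i for i in range(1024)]

# The identity from the text: ka mod m = ((k mod m) * (a mod m)) mod m.
identity_holds = all((k * A) % M == ((k % M) * (A % M)) % M for k in keys)

used_broken = len({broken_multiplicative(k) for k in keys})
used_modular = len({k % M for k in keys})
print(identity_holds, used_broken, used_modular)  # True 64 64
```

Only the low-order bits of k ever reach the index, so the multiplication contributes no diffusion.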

Cyclic redundancy checks (CRCs)

For a longer stream of serialized key data, a cyclic redundancy check (CRC) makes a good, reasonably fast hash function. A CRC of a data stream is the remainder after performing a long division of the data (treated as a large binary number), but using exclusive or instead of subtraction at each long division step. This corresponds to computing a remainder in the ring of polynomials with binary coefficients. CRCs can be computed very quickly in specialized hardware. Fast software CRC algorithms rely on accessing precomputed tables of data.
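The long-division view can be sketched bit by bit. The version below computes the widely used CRC-32 (the bit-reflected form with polynomial constant 0xEDB88320 and the usual initial and final inversion), which is one specific CRC chosen here for illustration, and checks itself against `zlib.crc32` from the Python standard library.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # CRC-32 polynomial in bit-reflected form

def crc32_bitwise(data: bytes) -> int:
    """CRC-32 computed one bit at a time: shift, and XOR in the polynomial
    whenever the bit shifted out is 1 (the XOR plays the role of subtraction)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

message = b"123456789"
print(hex(crc32_bitwise(message)), hex(zlib.crc32(message)))  # both 0xcbf43926
```

Table-driven implementations precompute the result of the inner loop for all 256 byte values, which is what makes software CRCs fast in practice.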

Cryptographic hash functions

Sometimes software systems are used by adversaries who might try to pick keys that collide in the hash function, thereby making the system have poor performance. Cryptographic hash functions are hash functions that try to make it computationally infeasible to invert them: if you know h(x), there should be no known way to compute x that is asymptotically faster than just trying all possible values and seeing which one hashes to the right result. Usually these functions also try to make it hard to find different values of x that cause collisions. Examples of cryptographic hash functions are MD5 and SHA-1. Practical collision attacks are known on MD5 (and, more recently, on SHA-1), so neither should be relied on where collision resistance matters for security. MD5 is faster than SHA-1, though, and it still diffuses well enough for generating hash table indices when keys are not chosen adversarially.
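A minimal sketch of using a cryptographic digest as the diffusion step, via Python's standard `hashlib`. It uses SHA-256 rather than the MD5 or SHA-1 named above, purely as an assumed substitute that is available everywhere; the function names and the choice of keeping 64 bits of the digest are illustrative.

```python
import hashlib

def secure_hash_code(data: bytes) -> int:
    """A 64-bit hash code taken from the first 8 bytes of a SHA-256 digest."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big")

def bucket_index(data: bytes, m: int) -> int:
    """Reduce the hash code to a bucket index in 0..m-1."""
    return secure_hash_code(data) % m

print(hex(secure_hash_code(b"")), bucket_index(b"some key", 1024))
```

This is far slower per key than multiplicative hashing or a CRC, so it is usually reserved for cases where the extra cost is justified.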

Precomputing hash codes

High-quality hash functions can be expensive. If the same values are being hashed repeatedly, one trick is to precompute their hash codes and store them with the value. Hash tables can also store the full hash codes of values, which makes scanning down one bucket fast. In fact, if the hash code is long and the hash function is high-quality (e.g., 64+ bits of a properly constructed MD5 digest), two keys with the same hash code are almost certainly the same value. Your computer is then more likely to get a wrong answer from a cosmic ray hitting it than from a hash code collision.
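Storing the full hash code next to each entry can be sketched with a small chained table. This is an illustrative toy (fixed number of buckets, no resizing), and the class name `ChainedTable` and its counter `key_comparisons` are hypothetical; it relies on Python's built-in `hash` for the hash code.

```python
class ChainedTable:
    """Toy chained hash table that stores each entry's full hash code."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.key_comparisons = 0  # counts the (possibly expensive) key == key tests

    def _find(self, key, code):
        bucket = self.buckets[code % len(self.buckets)]
        for entry in bucket:
            # Cheap integer comparison first; compare keys only if the codes match.
            if entry[0] == code:
                self.key_comparisons += 1
                if entry[1] == key:
                    return bucket, entry
        return bucket, None

    def put(self, key, value):
        code = hash(key)
        bucket, entry = self._find(key, code)
        if entry is None:
            bucket.append([code, key, value])
        else:
            entry[2] = value

    def get(self, key, default=None):
        _, entry = self._find(key, hash(key))
        return default if entry is None else entry[2]

table = ChainedTable(num_buckets=1)   # one bucket, so every lookup scans a long chain
for i in range(100):
    table.put(i, i * i)
print(table.get(50), table.key_comparisons)
```

Even with all 100 entries in one chain, a lookup performs a full key comparison only for entries whose stored hash code matches.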

FAQs

What is the Davies–Meyer hash function?

Definition. The Davies–Meyer hash function is a construction for an iterated hash function based on a block cipher, where the length in bits of the hash result is equal to the block length of the block cipher.

How do you explain a hash function?

A hash function is a mathematical function or algorithm that takes a variable number of characters (called a "message") and converts it into a string with a fixed number of characters (called a hash value or simply a hash).

How do you calculate a hash function?

With modular hashing, the hash function is simply h(k) = k mod m for some m (usually, the number of buckets). The value k is an integer hash code generated from the key. If m is a power of two (i.e., m=2^p), then h(k) is just the p lowest-order bits of k.

Why is 31 used in hash functions?

There are perhaps a couple of reasons for choosing 31. The main reason is that it is a prime number, and prime multipliers tend to give better distribution in hashing algorithms; in other words, the hash outputs have fewer collisions for different inputs.

What is the most famous hash function?

The MD5 algorithm, defined in RFC 1321, is probably the most well-known and widely used hash function. It is the fastest of all the .NET hashing algorithms, but it uses a smaller 128-bit hash value, making it the most vulnerable to attack over the long term.

What is one of the most common uses for the hash function?

The most popular use of hashing is for setting up hash tables. A hash table stores key and value pairs in a list that's accessible through its index. Because the number of keys and value pairs is unlimited, the hash function maps the keys to the table size. A hash value then becomes the index for a specific element.

What is the primary purpose of a hash function?

Hash functions are a way to ensure data integrity in public key cryptography. What I mean by that is that hash functions serve as a check-sum, or a way for someone to identify whether data has been tampered with after it's been signed. It also serves as a means of identity verification.

Why can't hash be reversed?

A hash cannot be reversed back to the original data because it is a one-way operation. Hashing is commonly used to verify the integrity of data, commonly referred to as a checksum. If two pieces of identical data are hashed using the same hash function, the resulting hash will be identical.

What are the three criteria for hash functions?

Key Properties of Hash Functions

Deterministic: A hash function must consistently produce the same output for the same input.
Fixed Output Size: The output of a hash function should have a fixed size, regardless of the size of the input.
Efficiency: The hash function should be able to process input quickly.

What makes a bad hash function?

A poor choice of hash function is likely to lead to clustering behavior, in which the probability of keys mapping to the same hash bucket (i.e. a collision) is significantly greater than would be expected from a random function.

Why is hashing irreversible?

It would be impossible to figure out the original data of the function with just the resulting hash – as not much of that data is left – the only workable method is to brute force every possible combination. If we could reverse a hash, we would be able to compress data of any size into a mere few bytes of data.

What is a good hash function to use?

A good hash function to use with integer key values is the mid-square method. The mid-square method squares the key value, and then takes out the middle r bits of the result, giving a value in the range 0 to 2^r − 1. This works well because most or all bits of the key value contribute to the result.

Do hash functions need a key?

Overview. In a hash table, a hash function takes a key as an input, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed-length, like an integer, or variable-length, like a name. In some cases, the key is the datum itself.

Why is it called a hash function?

The term "hash" comes by way of analogy with its non-technical meaning, to "chop and mix". Indeed, typical hash functions, like the mod operation, "chop" the input domain into many sub-domains that get "mixed" into the output range to improve the uniformity of the key distribution.

What does a hash function do in DHT?

Key-Value Storage: To store a key-value pair in the DHT, a node hashes the key using the same hashing algorithm used for node identifiers. This hashing process transforms the key into a numeric value.

What does the hash function determine?

For hash functions in cryptography, the definition is a bit more straightforward. A hash function produces a compact fingerprint of any given piece of content: it takes input data of any size and converts it into a digest of a specific, fixed length. Distinct inputs can in principle share a digest, but for a cryptographic hash function finding such a collision should be computationally infeasible.

What is the principle of hash function?

Hashing generally takes records whose key values come from a large range and stores those records in a table with a relatively small number of slots. Collisions occur when two records hash to the same slot in the table.