re: Why databases use ordered indexes but programming uses hash tables (2024)

This post asked an interesting question, why are hash table so prevalent for in memory usage and (relatively) rare in the case of databases. There is some good points in the post, as well as in the Hacker News thread.

Given that I just did a spike of persistent hash table and have been working on database engines for the past decade, I thought that I might throw my own two cents into the ring.

B+Tree is a profoundly simple concept. You can explain it in 30 minutes, and it make sense. There are some tricky bits to a proper implementation, for sure, but they are more related to performance than correctness.

Hash tables sounds simple, but the moment you have to handle collisions gracefully, you are going to run into real challenges. It is easy to get into nasty bugs with hash tables, the kind that silently corrupt your state without you realizing it.

For example, consider the following code:

This is a hash table using linear addressing. Collisions are handled by adding them to the next available node. And in this case, we have a problem. We want to put “ghi” in position zero, but we can’t, because it is already full. We move it to the first available location. That is well understood and easy. But when we delete “def”, we remove the entry from the array, but we forgot to do fixups for the relocated “ghi”, that value is now gone from the table, effectively. This is the kind of bug you need the moon to be in a certain position while a cat sneeze to figure out.

A B+Tree also maps very nicely to persistent model, but it is entirely non obvious how you can go from the notion of a hash table in memory to one on disk. Extendible hashing exists, and has for a very long time. Literally for more time than I’m alive, but it is not very well known / generically used. It is a beautiful algorithm, mind you. But just mapping the concept to a persistence model isn’t enough, typically, you also had a bunch of additional requirements from disk data structure. In particular, concurrency in database systems is frequently tied closely to the structure of the tree (page level locks).

There is also the cost issue. When talking about disk based data access, we are rarely interested in the actual O(N) complexity, we are far more interested in the number of disk seeks that are involved. Using extendible hashing, you’ll typically get 1 – 2 disk seeks. If the directory is in memory, you have only one, which is great. But with a B+Tree, you can easily make sure that the top levels of the tree will also be memory resident (similar to the extendible hash directory), that leads to typical 1 disk access to read the data, so in many cases, they are roughly the same performance for either option.

Related to the cost issue, you have to also consider security risks. There have been a number of attacks against hash tables that relied on generating hash collisions. The typical in memory fix is to randomize the hash to avoid this, but if you are persistent, you have to use the same hash function forever. That means that an attacker can very easily kill your database server, by generating bad keys.

But these are all relatively minor concerns. The key issue is that B+Tree is just so much more useful. A B+Tree can allow me to:

  • Store / retrieve my data by key
  • Perform range queries
  • Index using a single column
  • Index using multiple columns (and then search based on full / partial key)
  • Iterate over the data in specified order

Hashes allow me to:

  • Store / retrieve my data by key

And that is pretty much it. So B+Tree can do everything that Hashes can, but also so much more. They are typically as fast where it matters (disk reads) and more than sufficiently fast regardless.

Hashes are only good for that one particular scenario of doing lookup by exact key. That is actually a lot more limited than what you’ll consider.

Finally, and quite important, you have to consider the fact that B+Tree has certain access patterns that they excel at. For example, inserting sorted data into a B+Tree is going to be a joy. Scanning the B+Tree in order is also trivial and highly performant.

With hashes? There isn’t an optimal access pattern for inserting data into a hash. And while you can scan a hash at roughly the same cost as you would a B+Tree, you are going to get the data out of order. That means that it is a lot less useful than it would appear to upfront.

All of that said, hashes are still widely used in databases. But they tend to be used as specialty tools. Deployed carefully and for very specific tasks. This isn’t the first thing that you’ll reach to, you need to justify its use.

re: Why databases use ordered indexes but programming uses hash tables (2024)
Top Articles
21 Best FAQ Page Examples (+ How To Create Your Own) (2023) - Shopify
Disable SMBv1: Understanding Risks and Remediation Steps
Calvert Er Wait Time
UPS Paketshop: Filialen & Standorte
It may surround a charged particle Crossword Clue
Wordscapes Level 5130 Answers
Rabbits Foot Osrs
CKS is only available in the UK | NICE
Doby's Funeral Home Obituaries
Elle Daily Horoscope Virgo
Best Restaurants Ventnor
Oscar Nominated Brings Winning Profile to the Kentucky Turf Cup
Transfer Credits Uncc
Grace Caroline Deepfake
Becu Turbotax Discount Code
Zack Fairhurst Snapchat
Mahpeople Com Login
Water Trends Inferno Pool Cleaner
Why Should We Hire You? - Professional Answers for 2024
Touchless Car Wash Schaumburg
A Person That Creates Movie Basis Figgerits
Sand Dollar Restaurant Anna Maria Island
Milwaukee Nickname Crossword Clue
Darrell Waltrip Off Road Center
Egusd Lunch Menu
Cable Cove Whale Watching
Table To Formula Calculator
Dhs Clio Rd Flint Mi Phone Number
How often should you visit your Barber?
The Bold and the Beautiful
Helloid Worthington Login
Here’s how you can get a foot detox at home!
Does Iherb Accept Ebt
Ny Post Front Page Cover Today
How much does Painttool SAI costs?
Dee Dee Blanchard Crime Scene Photos
Atom Tickets – Buy Movie Tickets, Invite Friends, Skip Lines
manhattan cars & trucks - by owner - craigslist
Free Crossword Puzzles | BestCrosswords.com
Pgecom
Gabrielle Abbate Obituary
Menu Forest Lake – The Grillium Restaurant
Tropical Smoothie Address
Ratchet And Clank Tools Of Destruction Rpcs3 Freeze
Craigslist Sparta Nj
Www.homedepot .Com
Blippi Park Carlsbad
Oak Hill, Blue Owl Lead Record Finastra Private Credit Loan
About us | DELTA Fiber
Deviantart Rwby
91 East Freeway Accident Today 2022
Latest Posts
Article information

Author: Tyson Zemlak

Last Updated:

Views: 6149

Rating: 4.2 / 5 (43 voted)

Reviews: 90% of readers found this page helpful

Author information

Name: Tyson Zemlak

Birthday: 1992-03-17

Address: Apt. 662 96191 Quigley Dam, Kubview, MA 42013

Phone: +441678032891

Job: Community-Services Orchestrator

Hobby: Coffee roasting, Calligraphy, Metalworking, Fashion, Vehicle restoration, Shopping, Photography

Introduction: My name is Tyson Zemlak, I am a excited, light, sparkling, super, open, fair, magnificent person who loves writing and wants to share my knowledge and understanding with you.