1 year ago

#123 What is a Hash Table?

Storage for key-value pairs.

Transcript

Speaker A:

Hash tables are some of the most widely used and powerful data structures. In this episode we'll explain what they are and how they work. Welcome to COPEC Explain Software, the podcast where we make computing intelligible. In this episode we're going to be discussing hash tables. But to understand hash tables, you need to understand two more primitive data structures. We already discussed arrays and linked lists on a prior episode called what is a data structure? I'm going to link to that episode in the show Notes, and if you don't know what they are, you may want to listen to it first. To get into hash tables, we first need to motivate why we even want to use them. And that involves understanding the type of data called key value pairs.

Speaker B:

So walk us through that. What are key value pairs?

Speaker A:

A lot of data naturally lends itself to be stored as key value pairs. A key is an identifier usually, but not always. That's a string, so just some kind of text. And the identifier is the identifier for some piece of data. That data is what we call the value. So the key is what we use to look up the value. Let me give you a few examples. One example might be a high score list where we want to know for some particular username what is their highest score. So in that scenario, the key would be the username and the value would be the high score. Another example would be products and prices. Maybe we have a product name like Apple and we want to look up how much it costs. So we go into our key value pairs and we find the key value pair that has the key apple, and the value is going to be the cost of that apple. Here's another one. Let's say we have countries and we want to look up their populations. The key might be the country name and the value might be the population. So I want to know, for example, that the United States has 350,000,000 people or Canada has 40 million people. So in all of these scenarios there's some kind of identifier. We call that the key. That's the thing I'm looking up by and there's some kind of value I want to get back. That's the thing associated with the key. So the key together with the value is called the key value pair. So hash tables are just a convenient data structure for storing key value pairs. A lot of our listeners are programmers. Not all of them, but a lot of them. And if you are a programmer, you've probably used an abstract data type known as either a dictionary, a map, or an associative array. For example, if you're a Python programmer, you've probably used a dictionary before. If you're a PHP programmer, you've probably used an associative array. They're both really the same sort of thing. They're a way of storing key value pairs in fact, the names are almost synonyms and all of those abstract data types are backed by hash tables. So if you've ever used a dictionary in Python or a map in C plus plus, or an associative array in PHP, you've actually used hash tables before. They were just below the surface of those abstract data types that you've already been using. And actually I think dictionary is a really good way of thinking about key value pairs. And it makes sense that that name exists as an abstract data type that sits on top of a hash table. Because if you think about something like an English dictionary, the keys are the words and the values are the definitions. And so you're going in there and you're looking up something by a word to find what does it mean, what's the value associated with it? That's the definition. So I like the word dictionary in particular because it maps a lot to common usage of similar language.

Speaker B:

Tell us about the components of a hash table.

Speaker A:

There are three essential components we must have in place in order to create a typical hash table. The first is a backing array. Now it's not going to be an array of our key value pairs. Typically it'll often actually be an array of linked lists of the key value pairs. So it's actually a hybrid data structure. It's an array of linked lists. So each slot in the array actually points to a linked list. And then we also need to have what's called a hash function. A hash function is a way of taking our key and mapping it to our particular array slot. And the key value pair will exist within the linked list at that array slot. We're going to go through this quite a few more times in the rest of the episode. It can be hard to understand without a diagram, but I'm going to do my best as we go through this. So a hash table, again, is a hybrid data structure. It's an array of linked lists. And the hash function is our way of knowing which of those linked lists contains our key value pair.

Speaker B:

Why don't you walk us through a specific example of how this could work?

Speaker A:

Sure. And in order to do a specific example, we're going to need to create an imaginary hash function. So again, a hash function is a way in the context of hash tables, of taking a key and mapping it to an array index. So a location in the array. So let's say our keys are going to be strings, and that's very common with hash tables. A lot of hash tables will only work with strings as their keys. And let's say specifically we're working with products in a store. Okay, I'm going to use a really simple hash function. Later on we'll talk about why it might not be a good one. But in this hash function, we're just going to take the number of letters in the key and that will be the array slot where the key value pair of the product and its price is going to reside in a linked list.

Speaker B:

So if I'm looking for pens that's going to be in slot three or pen would be in slot three, right?

Speaker A:

Because pen is a three letter word, right? And so the string would be three letters. And so we just say, well we're going to store that key value pair in the linked list in slot three in the array. So for example, let's say we had the word apple and we wanted to know what its price is. So our hash function would go and say, well that string apple has five characters in it and so I'm going to go to erase slot five. And in the linked list, in array slot five I'll look through the key value pairs in that linked list and one of them hopefully will be Apple. And then I'll look at the value associated with it, maybe it's forty five cents and that's what I'll return back. So the hash function again is telling us based on the key, which of the array slots is likely to have the key value pair that I'm looking for. So now that we've run through a simple example, why is this useful? We could just put all the key value pairs in a giant linked list or we could put them in a giant array or something without using a hash function at all.

Speaker B:

So tell us.

Speaker A:

Well it turns out that the average case performance, if the hash function is good and that hash function we were just talking about is not good, is actually going to be constant time. So if I had just my key value pairs in a big array or a big linked list I might have to look through all of them to find the particular one I'm looking for. But using this hash function will allow me to narrow in really quickly on where that key value pair is likely to be, whether I want to insert it for the first time, look up the value associated with the key, or update the value associated with the key. The hash function is going to allow me to really quickly drill in on the right place in the data structure to find that key value pair. So this is going to be a lot faster on average than having to look through a whole linked list or a whole array for the particular key value pair. Remember, with an array we have random access by an integer. So if I was looking something up and I always knew it was going to be an integer and there was going to be a finite number of them, then maybe actually a plain array would be great. But if I want to look something up by a string, well this is where the concept of the hash table gets really powerful because that hash function is going to help me get to the right place where that key value pair with the string being the key is going to reside. So, again, with a hash table, we get constant time, lookups updates and deletes on average, for these key values. That's really fast.

Speaker B:

What are some of the rules of a hash function?

Speaker A:

Yeah, we talked about a really simple one earlier. That's not a good one. So here are some rules that hash functions have to have. One is they have to be consistent. So if I expect that this key value pair is supposed to be in slot five and then I run the hash function again and it tells me it's going to be in slot seven, that's going to be a big problem. So they should always return the same index for the same key. Now, our existing hash function was good on that property, right? It will always return the same slot because there always are, let's say, the same number of letters in the same string. Right? So it was okay on that property. But here's another property. They should also distribute keys as evenly as possible across the array. So we shouldn't just have all of the key value pairs end up in the same location. Do you think the hash function we were using before is going to be good at distributing the keys evenly?

Speaker B:

No.

Speaker A:

Yeah, probably not. Why is that?

Speaker B:

Because it's such a random?

Speaker A:

No, it's not random at all. It's very precise. It's based on the number of letters or characters in the string.

Speaker B:

I don't know why.

Speaker A:

Well, because strings are going to tend to congregate around certain lengths. So there's a lot more let's say our strings are words. There's going to be a lot more words that are four, six letters long than there are that are ten letters long. So you're going to end up with tons of things in the fourth slot or the 6th slot and not a lot of things in the 10th slot. They're definitely not going to be evenly distributed because words tend to be certain lengths. They don't tend to be like just random lengths.

Speaker B:

Well, now I'm going to think a lot about how long words are when I'm in the store.

Speaker A:

Yeah. So hash functions also need to be fast. That's another property. Now, we did have that property. It's going to be really quick to find out how many characters are in a string in most programming languages, but that's not enough if they're not evenly distributed, because that's a really important property. And the reason they have to be fast is if hash functions weren't fast, they would defeat the whole purpose of having a hash table, right? The whole purpose of having a hash table is having that really fast lookup, insert or delete. And if I can't run the hash function quickly, if the hash function has just taken forever. Well, now I've defeated this whole idea of having a high performance data structure. So a lot of work has gone into coming up with good hash functions, hash functions that are consistent, fast, and distribute content as evenly as possible. The mathematics of good hash functions are beyond the scope of what we're talking about today. But rest assured that any good hash function must have the three properties that we discussed. But just to reinforce the idea of a hash function, and the idea being that we take in our case a string and map it into an integer, which is an array location, let's just talk about one other hash function. So we talked earlier about the idea of just the number of characters in the string. What if we just looked at the alphabetical order of the first character in the string and let's just assume that all of our strings start with something in the Latin alphabet. So A through Z, how do you think that would work as a hash function going across our three different properties? Would that be consistent?

Speaker B:

Yes.

Speaker A:

Right, so Apple starts with an A, so it's always going to be in slot zero or one, whatever we decide to start at. Let's say we start at zero. So that's good. Yeah. So it's always going to be in slot zero. So that's consistent. Would it be fast? Yeah, it would. Looking at the first character of the string would be really fast. But would it distribute the keys as evenly as possible across the array? No, because there tends to be a lot more words that start with A than there are that start with Z. So we'd end up or X, for example. So there'd be way too many things concentrated in certain slots. Now you might be wondering what happens if we have more potential types of keys than we have slots, right? Let's say we only have five slots, but we have a possible mapping using our hash function to 100 different locations. How do we get that hundred back into just the five that we have? What we tend to do is after we've turned the original key into a number, so a slot location, if it's greater than the number of slots in the array, we just mod it. So we just basically see how many times does the number of slots that are in the array go into that number and whatever the remainder is, that's the slot we use. So this will work for any hash function or an array with any number of slots being our backing store. Now, what happens if we try to put something into a particular array slot and there's already something there?

Speaker B:

Well, that would be an issue, right?

Speaker A:

That's what we call a collision. Right. If there's already something there, we have to have some way of saying, well, okay, how am I still going to store this thing? Because I don't want to just drop it. I don't want to just say and that could be a decision you make depending on what the type of application you're writing is. But most of the time you want to be able to store anything. So there's two different ways of resolving this problem and we've already explained how we're going to resolve it for those who've been listening closely, because we said every array location is going to contain a linked list. So if there's already something in the slot, that means it's already in the linked list. We just add another link to the end of the linked list. We just add one more node to the end of that linked list. And then when we need to look something up, we have to go through the linked list at that slot and find the particular thing we're looking up by going through all of the items in that linked list. That is called separate chaining. And that is one way of building a hash table. The other way of building a hash table and resolving collisions is what's called open addressing. In this scenario, we still have an array that backs the hash table, but we don't have linked lists at every array slot. Instead, we just look for the next open slot in the array after the slot that was filled already when we have that collision. And so that's called probing. We kind of go through, oh, I thought there was nothing in slot three, but there is something and I want to put this thing in. Let me try putting in slot four. Oh, slot four is also full. Let me see if slot five is open. And then if slot five is open, I put it into slot five. Then when I go to look up, I'll go and I'll look up again and I'll say, oh, there's something in slot three, let me go the next one. See if it's there. There's something in slot four. Let me go the next one. Oh, slot five. That's the thing I was looking for before. So separate chaining and open addressing are two different ways of resolving collisions. They have pros and cons separate chaining. We never run out of space because with open addressing it's possible to run out of array slots, right? On the other hand, there are performance positives and negatives depending on the type of data and the way that we have the specific implementation for both separate chaining and open addressing, they're beyond the scope of our episode. But whether you're using separate chaining or open addressing, you'll reach some point where the hash table is going to get pretty full.

Speaker B:

So then what do you do? Is the hash table just done?

Speaker A:

Typically what you do is you resize it. And to know exactly when it's a good idea to resize the hash table, we look at what's called the load factor. The load factor has a really simple formula. It's just the number of items in the hash table divided by the number of slots in the hash table's backing array. When that ratio becomes greater than typically something like 0.7, it's a good idea to make the hash table bigger. So to expand the number of slots in the backing array and then you have to rearrange all the existing data in the hash table. To give you an example, the Java Standard Library uses a load factor of zero point 75 before it typically resizes its hash tables. Now, if you were following along about the difference between open addressing and separate chaining, you might realize that you can actually have a load factor that's greater than 1.0 in separate chaining because those linked lists can have as many items as we want in them and there's as many linked lists as there are array slots. We can have more items than there are array slots in separate chaining. So it is possible to have a load factor greater than 1.0 in open addressing. It's not because once you've filled up every slot, there's simply no other slots to put things in. There's no linked list hanging off the end of every slot for us to just shove things on the end. So hash tables are a great way of storing key value pairs. And anytime you want to store a key value pair, you probably should be using a hash table or one of the abstract data types like a dictionary, map or associative array that sits on top of them. But there's actually other things we can use hash tables for as well. Like what if you've ever used a set data structure which is analogous to a mathematical set, they can easily be programmed using hash tables. What we can do is each of the keys can be the items and the value can just be a boolean. True or false? Is the thing in there or is it not? Hash tables are often also used as caches. A cache which is going to be a subject of a future episode is a way of storing something temporarily for quick lookup in memory. And hash tables are perfect for that application. And some programming languages that are object oriented actually store the entirety of their objects behind the scenes in hash tables. For example, in JavaScript and Ruby and Python, this is the case. So there's a hash table behind the scenes that says here's where the member variables are for that object, here's where its methods are, and those can really quickly be looked up thanks to the great performance characteristics of hash tables.

Speaker B:

Are there any downsides to hash tables?

Speaker A:

Yeah, there are a couple downsides. One is that if the hash table grows too large, it's going to actually be larger than the CPU's cache. And now we're going to be jumping all over memory in order to find anything that we're looking up. And so the performance can degrade over time. If the hash table becomes too large and starts to exceed the constraints of the cache. And we have to be careful about crafting hash tables for certain data types that may not work well with certain kinds of hash functions. So this is very rarely a problem, but there could be situations where a hash function doesn't work well for a particular kind of data type. Let's summarize what we've talked about.

Speaker B:

Hash tables are an incredibly useful data structure for storing key values, right?

Speaker A:

And a key is some kind of identifier that we want to look something up by, and the value is the thing that we wanted to look up. They're built out of arrays and linked lists, typically with a hash function guiding, where in the backing array is the key value pair.

Speaker B:

Some of the really powerful tools that are built into many of the programming languages are built on a kind of foundation of a hash table.

Speaker A:

And those hash functions need to have three properties in order to work well. They need to be fast, they need to distribute the keys as evenly as possible, and they need to be consistent. So they need to always give us the same array index for the same key. A good exercise for a programmer who wants to learn more about data structures and algorithms is to implement your own hash table. In fact, while many programming language standard libraries have hash tables, the C standard library does not. And so if you're a C programmer, this is an especially good exercise for you, and used to be one that people commonly did back in the day. Thanks for listening to us this week, Rebecca. How can people get in touch with us on Twitter?

Speaker B:

We're at COPEC explains K-O-P-E-C-E-X-P-L-A-I-N-S.

Speaker A:

Thanks to all our listeners. Don't forget to follow us and subscribe. Thanks for all the episode suggestions we've been getting, and we'll see you in two weeks. Bye.

Hash tables are some of the most widely used and powerful data structures. They allow for the efficient storage of key-value pairs. Keys are identifiers that we want to lookup data by, while values are the actual data. Hash tables underly common abstract data types in programming languages used for key-value data known as dictionaries, maps, or associative arrays. Hash tables can accomplish lookups, insertions, updates, and deletions in constant time on average. In this episode we explain what hash tables are used for and how they work.

If you don't know what an array or linked list is, you probably first want to listen to our prior episode, "What is a Data Structure?" Arrays and linked lists are component parts of hash tables and referred to in the episode with assumed knowledge about them.

Show Notes

Episode 61: What is a Data Structure?

Find out more at http://kopec.live