Kopec Explains Software
Computing Concepts Simplified

#8 How do Web Search Engines Work?

They always seem to find just what you're looking for.

Transcript
Speaker A:

How would you find anything without it? Today we talk about search engines. Welcome to Kopec Explains Software, the podcast where we make software intelligible.

Speaker B:

Okay, Dave, today's question, how does a search engine work?

Speaker A:

Well, that's a big topic. You know, there are really three different parts to a search engine, and I think we need to go through each of them individually. So first there's: how does the search engine actually find all the content? And we call that crawling. Then the search engine has to take all that content and store it somewhere, and we call that indexing. And then the search engine has to take the things that you want to search for and find them within that index, and we call that ranking. So we're gonna go through each of those one at a time, and let's do this in a pretty generic way without being too specific to any particular search engine. So we're gonna talk about all of these things at a pretty high level.

Speaker B:

So before we even get to those individual parts, I guess we should define what even is a search engine.

Speaker A:

Well, a search engine could be a tool for searching any kind of database. Now, the one that we're usually most interested in is a database of the web pages on the Internet, right? The World Wide Web is so big, there's billions and billions of pages, and it's very difficult when you just want to find one of those pages, to find it really quickly without using a tool that already has gone through all of them and found one similar to the one that you're looking for and then found a way of bringing it up quickly for you. So, you know, we could always bookmark every site we go to, but how do we find those sites in the first place? Well, that's the purpose of a search engine. So the way we're thinking about search engines for today, although they could be for other kinds of databases, is as a tool for going through a database of web pages and finding ones that are relevant to the terms that you're looking for.

Speaker B:

Great. So that is kind of the broad definition of a search engine. And each search engine is really made up of the three parts that you spoke about earlier, right?

Speaker A:

Crawling, indexing, and ranking.

Speaker B:

Okay, so how does a search engine crawl the web?

Speaker A:

So crawling the web has to do with going through all the web pages on the Internet, ideally, although very rarely will we actually get through all of them, but as many as we can, and storing basically the information on them so that we can later go and retrieve that information. So what we have is each of the search engines using bots, quote unquote. These are just automated programs that run all the time: they start at some web page, look at all the links on that page, and then follow those links and go to more web pages.

Speaker B:

So crawling is happening even if I'm not searching for something immediately, right?

Speaker A:

They're always crawling. So for example, Google has a bot, I think it's called Googlebot. And Googlebot is constantly scouring web pages, looking at all of their links to go to further web pages and downloading all the content on those various sites to index. And we'll get into indexing a little bit later. Now, they might at some point reach a dead end, right? Because you go through all the links and at some point there's no more links to go through. So they also do some random kind of scanning of domains to look at index pages and crawl out from there. They're also going to probably look at IP ranges. Now, a lot of this information is proprietary, so we don't know exactly how every search engine works. We don't know exactly how Googlebot works, but based on the behavior that we've observed of how it's come across different web pages, we can surmise some of the details. We're never going to know all the details, though; a lot of what makes Google so good is its proprietary algorithms. And we'll get into more of how search engines might differ later on today. But basically a crawler has to go through each webpage, look at all of its links, go to all of those further web pages, and keep following links until there's no more links to follow. And if you keep doing that, and you keep going to somewhat random IP addresses and looking at new domains that were registered through the various registrars, eventually you can crawl most of the web and get through most of the information on it. Now, of course, that's a huge task, and that involves going through billions and billions of web pages. And so of course that's not one computer doing it. You're going to have thousands, if not hundreds of thousands, of these computers running these bots for each of the search engines, trying to get through that massive amount of data. It's not an easy task and it has to be parallelized. So we have to have some computers working on some part of that huge amount of data, while other computers are working on other parts of that huge amount of data, and then all that data being put back together later on. If you tried to do it yourself with one computer, with one bot, you would never finish. There's way too much data to do that. So you need a large amount of resources in the first place to be able to have enough computers and bots to go through and crawl all of the web. This was much easier when the web was starting, in the early to mid 1990s; there were few enough web pages that you could still have a startup company that would go and crawl all of them. But today there are so many web pages that you would need really a significant amount of monetary resources just to afford all the computers to go do all the crawling of all these different web pages.
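
To make the crawling idea concrete, here is a minimal sketch of such a bot in Python, assuming the third-party requests and beautifulsoup4 packages; the function name and the breadth-first approach are just illustrative, and a real crawler would be massively parallelized and far more careful about politeness, deduplication, and errors.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_url, max_pages=100):
    """Breadth-first crawl starting from seed_url; returns {url: html}."""
    to_visit = deque([seed_url])  # frontier of URLs still to fetch
    seen = {seed_url}             # avoid fetching the same page twice
    pages = {}

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip pages that fail to load

        pages[url] = response.text

        # Extract every link on the page and add unseen ones to the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                to_visit.append(link)

    return pages
```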

Speaker B:

It's not even just the size, but it's always changing. New websites and new things are getting added to the Internet. It has to be a dynamic process.

Speaker A:

Yeah, thanks for bringing that up. Right. Because the content on a web page might change over time. So I might originally not have had certain keywords on a webpage, and then a couple months later I edit it and add those keywords in. So these bots are actually going and recrawling the same pages over and over again. And if it's a really important website, they're going to be crawling it constantly for all kinds of updates, so that the search index, which we'll get to later, is always up to date and we can then rank the correct pages for up-to-the-minute results.

Speaker B:

So something, a term I hear a lot, is search engine optimization. So is that for the crawling part of this, so that the bots or the software that does this is really pulling up and finding what you want it to be finding?

Speaker A:

Yeah, it's really for all three parts. So search engine optimization is about making a web page appear high in search results. Now, of course, the first part of that is, yes, it has to be discoverable in the first place by the bots. So yes, that is a part of it. But then it's also about making sure that the content on that webpage, including some of the meta content, which is content about content, is developed in such a way that the search engine is going to think this is a really important result and it should be high in my rankings of the results for a certain set of keywords. And so oftentimes when you're doing marketing for a new business, you employ someone who's an expert in search engine optimization to make sure that you're going to appear high in a search engine's results.

Speaker B:

On the opposite side of that, what if I want my website to not be crawled? I guess yeah.

Speaker A:

So there's actually a standard for this. There's a file called a robots.txt file. It's just a plain text file where there's a format for specifying pages on your site that you want to be crawled by a bot and pages on your site that you don't want to be crawled. So every time that Googlebot or one of these other search engine bots goes to a new site, they look at the robots.txt file and they see in that file: okay, these pages I should go and put into the index, and these other pages I should not put into the index. Now, do they have to respect the robots.txt file? Well, really, no. There's no way of being certain that they're going to respect the robots.txt file. But it is a standard out there to allow you to keep some of your stuff from being indexed by a search engine. I wouldn't really rely on it, though. I wouldn't go and put in your robots.txt file, okay, don't put these really confidential documents in the search engine. You shouldn't be putting those on your website in the first place.
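
As a rough illustration of how a well-behaved bot might honor this standard, Python's standard library includes urllib.robotparser for reading robots.txt files; the site URL and bot name below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site; a polite crawler checks robots.txt before fetching pages.
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # download and parse the rules (User-agent / Disallow lines)

page = "https://www.example.com/private/report.html"
if parser.can_fetch("MyCrawlerBot", page):
    print("Allowed to crawl this page")
else:
    print("robots.txt asks us not to crawl this page")
```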

Speaker B:

So we've covered crawling. That's the first step of the search engines. And the next part would be an index.

Speaker A:

Right. So as we crawl, and it kind of goes together with crawling, we look for all the different terms that are on a webpage, and usually we're talking about text here, and we're saying, okay, there's all these different pieces of text, I want to store them in a database so that I can later retrieve them. So when I store them in a database, how do I store them? You're probably going to store them by the keywords that are on that page. And there might be certain ways that you can identify what is a keyword. It might be how many times it appears on the page, or it might be that the person chose to make that word a headline, or it might be that the person explicitly said in some metadata, here are the keywords associated with this page. But we need some way of saying, here are the terms that are associated with this page, and that is the purpose of the index. Now I can't tell you specifically, again, because this information is proprietary, how each of the search engines goes and builds up its index. But I can tell you that we know there are certain things they do look for. For example, like I said, repeated terms, terms that might be in a header, and terms that might be in metadata that specifies what the keywords are. And there are actually some guidelines put out by some of the big search engines like Google to help you make your page properly indexed. And you can go and read those guidelines if you're a website maintainer to make sure that your page really is, quote unquote, known for the right things. So what the index does later on is this: when you go and do a search, and we'll talk in a minute about how ranking works, how do we know which pages are actually associated with the terms that you've searched for? That's the purpose of the index. So for example, if I have a web page about golden retrievers and someone goes and does a search for golden retrievers, the index has a mapping between the term golden retrievers and all the pages about golden retrievers. And so it might, for example, find my page about golden retrievers, because you searched for the keywords golden retrievers, and it knows from the index that this is a page associated with golden retrievers.
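
Here is a toy sketch of that kind of structure, a simple inverted index in Python, assuming we already have the crawled pages as plain text; real indexes also track things like word positions, headings, and metadata weights, and the URLs below are made up.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every word of the query (a simple AND search)."""
    results = None
    for word in query.lower().split():
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return results or set()


pages = {
    "https://example.com/golden-retrievers": "golden retrievers are friendly dogs",
    "https://example.com/cats": "cats are independent pets",
}
index = build_index(pages)
print(search(index, "golden retrievers"))  # {'https://example.com/golden-retrievers'}
```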

Speaker B:

So you've mentioned that for indexing it really relies on text. Does that mean that it can't search images, or does it use a description of images to find an image?

Speaker A:

Yeah. So there's something in HTML called alt text. It's an attribute that goes with an img tag, the image tag in HTML, and it specifies a text alternative for the image. So for example, if I had an image of a golden retriever, I might put golden retriever in the alt attribute, so that people, for example, who are blind and use screen readers know, when their screen reader gets to that image, that this was supposed to be an image of a golden retriever. But it also might be helpful to a search engine to know, here was an image of a golden retriever, based on the alt text. Now there are more advanced techniques that can be used today. For example, we have machine learning algorithms that can identify what's in an image, but alt text is still the main way that we're going to find out what's in various images. And so whenever you've done an image search in some search engine, a lot of what you're getting back is probably generated through that alt attribute. There are also now reverse image searches, and those use much more advanced techniques. That's where you actually provide the image and it provides back to you other images like it. That uses computer vision algorithms, and that's really beyond the scope of what we're talking about today.
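
As a small illustration, assuming the beautifulsoup4 package again, an indexer might pull alt text out of a page's images roughly like this; the HTML below is made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <img src="dog1.jpg" alt="golden retriever playing fetch">
  <img src="dog2.jpg">  <!-- no alt text: nothing here for the indexer to use -->
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
    alt = img.get("alt")
    if alt:
        # The alt text supplies keywords to associate with this image.
        print(img["src"], "->", alt)
```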

Speaker B:

But one of the things I'm hearing is just how important it is as a developer, as someone who's managing a website, to be accurate in your descriptions and your keywords, to make sure that people can find your website in the vastness of all the websites.

Speaker A:

Right. And we would hope that if the search engines have good algorithms, then naturally, organically as we say, good results are coming to the top. Is that always the case? No. There are people who are so good at what we call search engine optimization that they can actually go and take a page that's not super relevant to someone's organic search and still make it appear pretty high in the rankings. And there's kind of a war going on back and forth between the search engines and the search engine optimization people. The search engines change their algorithms, and then the optimization folks figure out how those algorithms work and are able to, maybe erroneously, make some content go to the top of the rankings. There's a push and pull there. But the algorithms do change over time. And so what might have worked really well to organically get your page to the top of results might not work as well as it did, let's say, a few years ago.

Speaker B:

Crawling is changing all the time. Like the web, the Internet is changing.

Speaker A:

I don't think the crawling is really changing that much. I think crawling pretty much works the way that it did in the mid nineties when the first search engines were becoming prevalent, where you just have a bot that goes, it looks at the page, it looks at all the links on the page, follows all those links to go to the next page, and then they're constantly scanning for newly registered sites and sometimes doing some random scanning.

Speaker B:

So it's the volume of crawling that's changing.

Speaker A:

The volume of crawling has certainly changed. So if you think about the web in the mid 1990s, we were talking in the millions of pages. Now we're talking in the billions. I wouldn't be surprised if we're in the tens of billions or hundreds of billions of pages. I don't know the exact figure today.

Speaker B:

So we've crawled, we've made an index, and now we need to rank the results. How does that work?

Speaker A:

Okay, so we have this index, this way of associating various keywords with various pages, and now we need a way of finding what is the best page for some keyword. And there are many different ways of doing this, but really, what brought the most famous search engine, of course, Google, to prominence was the algorithm that it used for this part of the process, the ranking. And not surprisingly, they called their algorithm PageRank. And this is actually how the founders of Google got started. Sergey Brin and Larry Page, they're the two co-founders of Google, and they're still involved with the company to this day. They invented, while in graduate school, the PageRank algorithm. And the way the PageRank algorithm works, without getting into all the specifics, is it looks at backlinks, so it looks at links back to a particular page. So let's say I'm on that golden retriever page, right? And how do I know if that golden retriever page is a good page about golden retrievers? Well, one of the most useful indicators might be how many other pages link back to my page about golden retrievers. So, for example, if ten other sites link to my page about golden retrievers, and there's another page about golden retrievers that only two sites link back to, it might be that my page is better. Why would everyone be linking to my page unless I had a pretty good page about golden retrievers?
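
Here is a tiny, simplified sketch of that backlink idea in Python, a power-iteration style PageRank over a made-up link graph; the damping factor, iteration count, and graph are all illustrative, and Google's real ranking involves many more signals than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns a score for each page."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling pages distribute nothing in this toy version
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share  # each backlink passes along some score
        rank = new_rank

    return rank


# Toy graph: many pages link to "my-golden-retriever-page", so it ranks highest.
graph = {
    "my-golden-retriever-page": ["dog-food-site"],
    "dog-food-site": ["my-golden-retriever-page"],
    "blog-a": ["my-golden-retriever-page"],
    "blog-b": ["my-golden-retriever-page"],
    "blog-c": ["my-golden-retriever-page", "dog-food-site"],
}
print(pagerank(graph))
```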

Speaker B:

It's kind of like a works cited page. If you were looking at a journal article, sometimes the way you know, like, oh, this has been really vetted or really researched, is by seeing how many times it's been cited in other places.

Speaker A:

Yeah, absolutely. That's a great way of thinking about it. And there's a lot more to PageRank than that. But that is the core idea, and it's really a very brilliant idea. And you kind of wonder, wow, how come all the other search engines weren't able to come up with that in the late 1990s? And I've read some things that some of them were doing things sort of like that, but obviously Google was doing it better. And maybe we'll come back a little later to how Google might have been operating a little bit differently than the other search engines of the late 1990s when it came into prominence.

Speaker B:

That's a good transition point, actually, where we can maybe talk about the differences between search engines, because they're not all using the same algorithms or working the exact same way.

Speaker A:

Right. So I think the crawling step is pretty easy for any company with enough resources. So the crawling step, I don't think, is going to be a big differentiator between any search engines. And again, I have to use the words "I think," because these search engines are mostly using proprietary algorithms. So we know that PageRank is a part of Google's process. And actually I've also read that they are now way beyond PageRank; it's just one part of a multi-step algorithmic process that they use. But we don't know for sure how any of them are working specifically, because that's their competitive advantage. They don't just publish, this is how we do it, and then we could just have other Googles, right? So this is their secret sauce. But anyway, I think the crawling step between them is probably pretty similar, so I don't think that's a main point of differentiation. My guess would be that the indexing step is also not that dissimilar, although there might be differences there that are more interesting than the differences in the crawling. But I think the main difference is going to be in the ranking and the different algorithms that they're using for the ranking. So given the same set of data, how do we determine the best result? You search for golden retrievers, I search for golden retrievers. Let's say we even have the same index for both of us, so we both have the same web pages associated with the keyword golden retrievers. How do I decide which of those web pages is the best? Well, I think they're all using different ranking algorithms. I think that's really what's differentiating them. If we go back to when Google started, the biggest web portal at the time was Yahoo. And Yahoo at the time, and how it started, actually, was with a lot of human curation. So Yahoo was not originally a search engine in the same way that we think about a search engine today in terms of crawling, indexing, and ranking. Yahoo was actually a human-curated directory of sites. So there were actually people going and taking each site that was submitted to Yahoo and putting it into a category. And in the first versions of Yahoo, the way you would navigate them is you would click on a category, and then you'd click on a subcategory, and then you'd click on a sub-subcategory, and then you'd see the web pages that you wanted to see. And there was a way of searching through that index of pages, but that index of pages was generated by hand by human beings. There were then search engines in the late 1990s that, of course, were getting into algorithmic processes for doing this big job. And they were companies like Lycos, Excite, AltaVista, Dogpile. I mean, there were many of them. Google did it better. That was the big deal, right? Google, before it had Android, before it had Gmail, before it had any of its other products, was purely just a better search engine than any of the competition, thanks to the PageRank algorithm and whatever other proprietary algorithmic magic they were doing at the ranking stage. So what we can say today is that no longer is anyone doing human curation to the level that Yahoo was doing in the mid 1990s. It's impossible to do; there are too many pages. So the difference is probably mainly that we have different ranking algorithms. And the major search engines today globally are Google, which has most of the market share. I just looked it up; it's 86% from the statistics that I just saw. Then we have Bing from Microsoft, and then we have DuckDuckGo, which is an independent company.
So those three make up the vast majority of searches in the world. There are other country-specific ones. For example, if you're in Russia, Yandex is the big search engine in Russia, but for most of the world, it's those three: Google, Bing, and DuckDuckGo. Now, one interesting thing when we're comparing them is that they actually compare themselves to each other, and that's really an easy thing for them to do, right? So they can have an automated process that goes and says, when I search golden retriever, what results do I get on my competitors' search engines? And there's been some kind of scandalous sort of stuff where some of the search engines have been going and building their results based on the other search engines, so literally copying the other search engines for common terms. That's a great way of operating, right? You don't have to figure out your own algorithm, you'll just go and say, what's Google doing? And let's just show the same results that Google would show. And so there actually is quite a bit of that going on. The other thing that happened historically is that some of the search engines would license technology from other search engines. So when Yahoo realized that they were way behind the game in search and that algorithmic searching was really going to take the place of human curation, they had their own internal efforts to build a search engine. But for a while, they actually used Google; they licensed Google's technology, and for a while they licensed Bing's technology. So unfortunately, they never invested enough early on in their own algorithmic approaches. And that's how they got so far behind when they originally had a huge lead in terms of number of visitors and users. But anyway, you asked, how are they different? The ranking process is really how they differ. They do in fact go and build off of one another and use each other's intelligence to improve themselves. But the only reason they have to do that is because they sometimes have worse algorithmic results than their competitors. But today, you will still see pretty big differences. There are people who go and measure how good the results are, using some human judging, of one engine versus another. And Google still seems to do better on those measurements than Bing or DuckDuckGo.

Speaker B:

One of the topics or areas that we hear a lot about today, I think, is privacy when we're searching the Internet. Like the ads that we're getting: you search something one time, and then all of a sudden, every time you search, all the ads that come up are about golden retrievers or something like that. Can you just talk a little bit about how privacy is protected or not in search engines?

Speaker A:

Yeah. So, I mean, this is the big criticism, of course, of Google, right? Every time you do a search, they are using the data of your search to serve you an ad. And over time, based on your searches and your search history, they are finding out more and more about you, and they can give you better and better targeted ads. Now, they have been attuned to this criticism, and they have started to provide more and more privacy controls over the years for users, but most of those privacy controls are opt-in. So you actually have to go into your Google settings and say, hey, I don't want you to track me, or I don't want you to store this data, or I want you to erase the old data that you have on me. And your average user is never going to do that. So, of course, the way that they're able to provide such a great service for free is by funding it through these ads. And these ads can be pretty scary sometimes in terms of how much they know about you. It can be kind of shocking how specifically they are targeted to you based on your demographics, targeted to you based on your location, targeted to you based on your prior search history. Now, of course, it's a dream for companies, right? Because if I'm a pizza place and I want to specifically sell to the people looking for pizza in the zip code where I exist, well, there never was such a great way to reach people right when they want to make a purchasing decision as there is today with search engine ads. So, of course, they're a great tool for businesses, and in fact, they're good for consumers, too. Sometimes when you see an ad that's super relevant to you, that's actually nice, rather than seeing ads that are not relevant to you. For example, I have no interest in cigars, right? So I would hate to just be seeing cigar ads all the time, and if there's no targeting happening to me at all, so they don't know anything about me, they might throw me cigar ads. So you could say that there's a benefit to users, too, but it is kind of scary how much data is known about you. And, you know, it ends up in a lot of court cases. People's Google search history comes up in cases all the time. You might remember, I forget the details, the case of a mom in Florida. It was really famous about five to ten years ago, and her daughter ended up murdered. And they wanted to see whether any of the family members were involved in the murder. Well, one of the crucial pieces of evidence used in court was the search history of these family members. And I forget who it was, but one of them had actually searched, like, how do you do a strangling, or something like that, how do you do a murder, or something like that. And actually the person was found innocent, but it was a major piece of evidence in the trial, and you can see why. So you can sometimes also be targeted based on this, and there are civil liberties concerns too. There's, of course, the concern that if I had done some searches on, you know, Saddam Hussein, was I a Ba'athist who supported his regime, or something like that? There's all kinds of civil liberties issues with this. So the main value proposition of one of the current search engines, DuckDuckGo, is that they don't store your search history. Google makes all their money by looking at your search history and delivering you targeted ads. DuckDuckGo's value proposition against Google is: okay, maybe our search results are not quite as good as Google's, but we're not going to go and store all your data and use it to target you in the same way that Google does.
And I think that is a pretty valuable proposition considering just how much is known about us. You know, me personally, when I really need to find something, I do go and use Google, because it does provide better results, in my opinion. But on a day-to-day basis, it's not my default search engine, because I don't want to have my privacy invaded that much; I will switch over to it when I need it. So that's kind of my halfway approach to trying to defend my privacy a little bit. Of course, I've gone into Google and set all the privacy settings, and I recommend that other users do that as well, because they do offer some settings that just didn't exist five to ten years ago to protect yourself.

Speaker B:

I feel like it's a trend we've talked about in other episodes. It's just really about being an informed consumer, and that's where this is. And recognizing that with Google, or any search engine really, you are making a choice here, and you can make some informed decisions and change some settings and utilize these search engines the way that you want. At the very least, you should be aware of what's being collected about you.

Speaker A:

Absolutely. And you know, there's a big discussion going on right now in our country. The tech CEOs, several of them, including Google's, were up in Congress a few days ago for hearings about whether or not they're using predatory monopolistic practices. Now, knowing that Google has 86% global market share, some people might go and say, well, Google is a monopoly. And I would almost agree with that, actually. Google does have a dominant share in the search engine space. So if your definition of monopoly is just who is the dominant player that has more than 50% share, then sure, Google is a monopoly. But it's not like they don't have competitors. There are valid competitors. Bing is a real competitor. Its results are not as good as Google's, in my opinion, but it is a valid competitor. They do index the entire web on a regular basis, and you can use it. And DuckDuckGo is a valid competitor too. Its search results, again, are not as good as Google's, but it does protect your privacy, and you can go use it, and it is free, just like Google is. So it's not like Google has somehow used its market position to stop other entrants or players from existing. DuckDuckGo exists. It's up to you as a consumer to go and decide to use DuckDuckGo, and not be lazy and just use Google because the default on your device is Google. But hey, you know, it's beyond my pay grade to decide whether or not that, by definition, makes them in line to be broken up.

Speaker B:

So overall, I think search engines are just such an important tool that we now rely on so much in our everyday lives. I mean, Googling has become a verb in our language. So it's just important to have some understanding of how this actually works. While we might not know the specific algorithms, it's not a complete black box of how this functions. There is something always happening behind the scenes, this crawling and indexing of websites, of what's happening on the Internet, so that we can find the resources that we need. And that's so powerful, the access to information that these search engines provide for us.

Speaker A:

Yeah, it's an amazing free service. I mean, it's totally changed the world. It's put information at everyone's fingertips that we couldn't even dream of. When you and I were born in the eighties, it just wasn't a thing. There was no such giant database of information available to everybody to instantly find what they were looking for. In some ways, it's enhanced all of our lives in ways that we couldn't imagine. In other ways, some people argue that it's actually made us dumber, right? Because now we don't have to memorize as many facts, because we can instantly find answers to anything that we could think of. I think it is a real problem, personally. I teach college, and when I teach people in college, sometimes I'm surprised at some of the facts they don't know. And I'll sometimes say to them, not in an accusatory way, but, I'm surprised you don't know that. And they say, well, I don't need to know that because I can just google it. And that's true, they can. So what am I supposed to answer to that? They're absolutely correct. There's no reason that they have to have that information stored in their brain. But I'll tell you where I think not having some of that information stored in our brains anymore, because we get it from Google now, is a problem: when we make connections across different disciplines. So, for example, let's say I find out the first computer was invented in 1945, right? Well, approximately; the first modern computer came out in the mid 1940s. Well, what was going on in the early to mid 1940s? Somebody who didn't memorize all these facts and is always dependent on Google might not know that World War Two was going on from 1941 to 1945 for the US and from 1939 for other countries. And not knowing that fact, you might not make the connection instantly: oh, wait a second, that's when the computer came out. Maybe modern computers were a direct result of World War Two, and they were. But you might not be able to instantly make that connection and have that greater understanding and insight unless you had all those facts not just at your fingertips on Google, but actually at the ends of your neurons in your brain, so that you can make those instant connections. So I think that while Google's really been an incredible boon to most of society, it has actually made all of us a little bit dumber, too, because we rely on it so much for facts instead of memorizing some facts that help us make other connections.

Speaker B:

Yeah, I think that's probably true. And I think the other thing that's really important with search engines is what questions you're asking. Sometimes it's easy to search for surface things, but how do you do that in-depth analysis? And when things come so simply to us, it's hard to sit and really think about something and make those connections.

Speaker A:

Yeah, absolutely. But let's not blame it on Google, though. They've created a great service. We're really happy that they have this service to make the world a better place, but maybe we should use DuckDuckGo. Yeah. All right, well, thanks for listening. It's been great having everyone this week.

Speaker B:

Yeah, thanks for listening, everybody, and we'll see you next week.

Speaker A:

And don't forget to leave us a review on Apple Podcasts, Overcast, Spotify, wherever you're listening, and don't forget to subscribe. Have a great week. Bye.

Information on the Web is always at our fingertips thanks to search engines. But what makes them tick? In this episode we go over crawling, indexing, and ranking, the three phases a web page must go through to end up in your search results. We briefly discuss the PageRank algorithm and differences between various search engines. We conclude by discussing privacy issues.

Theme “Place on Fire” Copyright 2019 Creo, CC BY 4.0

Find out more at http://kopec.live