1 year ago

#132 What is Machine Learning?

The Field Within Artificial Intelligence Pushing Computing Forward

Transcript

David Kopec

When we think about the last couple decades of computing, the most exciting advances have all been related to a subsection of artificial intelligence known as machine learning. In this episode, we'll give you a general overview of the field of machine learning. Welcome to Kopec Explains Software, the podcast where we make computing intelligible. In this episode, we're going to be discussing machine learning, a big topic, and it's a subset, actually, of an even bigger topic, artificial intelligence. Now, we've done a prior episode on artificial intelligence, which I'm going to link to in the show notes. In that episode, we specifically explain how machine learning is a subset of artificial intelligence and how it differs from other subfields of artificial intelligence. And we also go over the general term artificial intelligence. So if you don't have a lot of familiarity with that, I recommend you go and listen to that prior episode before listening to this one. And also, if you're curious about how machine learning itself kind of came to be and evolved from the discipline of artificial intelligence into its own exciting subfield, then listen to that episode first.

Rebecca Kopec

But today we're really talking about machine learning. So can you give us or start with a general overview?

David Kopec

I want to start with an example that we're going to refer back to several times in the episode, and that's about pricing a home. So suppose you know a little bit about homes in the area where you're thinking about buying one. Perhaps you know what the sale price has been of many homes over the last year, and you also know what the size of those homes has been. And so you know, for a certain size home, on average, how much it is selling for in that area. You could actually make a reasonable, rough prediction of how much a home of the size that you want to buy might cost. It's probably not going to be super accurate with just that one criterion, but it's going to give you a reasonable ballpark. Now, a lot of people are familiar with the company Zillow. They have something called their Zestimate. They tell you how much they predict a certain home is worth, even one that a real estate agent hasn't gone to see yet, and their predictions are more accurate than using just the method we talked about before. They have a lot more information. They know how many bedrooms the home has. They might know what condition the home has been established to be in for tax purposes. They might know how many bathrooms it has. They might know how big the lot is. They might know how much the homes just around the block sold for last year, so they can get really hyper-local. The more information we have, the more accurate a machine learning algorithm generally gets. And of course, that's what Zillow is using. Zillow is using machine learning to make those Zestimate predictions. They're not sending real estate agents out to go to every home and take a tour of it, and yet their predictions tend to be highly accurate. So that's kind of the power and excitement of machine learning. We can use machines and algorithms to automate processes that used to be highly human-centric.
We used to need to have a real estate appraiser go to every home to know how much it was worth. But now we can get a pretty good idea just using Zillow. Now, obviously there are both positive and negative societal implications for technology like that. We're not going to get into that in this episode. We're more talking about what machine learning is and what the different aspects of it are, and giving you a broad understanding. We're also not going to get into a lot of formal definitions today, but we'll come back to that Zillow example as we progress.

Rebecca Kopec

Really, what Zillow has, and what machine learning relies on, is a lot of data. That is the basis of machine learning, right?

David Kopec

Right. Machine learning always starts with a dataset, and the richer and larger the dataset, generally the more accurate the machine learning algorithm can be in whatever it's trying to do. So if you don't have good data on something, you generally can't use machine learning for it. And if you have biased data on something, then you're going to end up with biased results. And we'll talk more about bias later in the episode.

Rebecca Kopec

It's called machine learning. But is the learning happening like how you or I learn about something?

David Kopec

Absolutely not. And I actually want to quote from a book. It's called The Hundred-Page Machine Learning Book. It's a book I recommend that programmers read who are interested in machine learning. It's by Andriy Burkov, and he writes in the preface: "Let's start by telling the truth. Machines don't learn. What a typical learning machine does is finding a mathematical formula which, when applied to a collection of inputs, produces the desired outputs." And later on he goes on: "Why isn't that learning? Because if you slightly distort the inputs, the output is very likely to become completely wrong. It's not how learning in animals works. If you learned to play a video game by looking straight at the screen, you would still be a good player if someone rotates the screen slightly. A machine learning algorithm, if it was trained by looking straight at the screen, unless it was also trained to recognize rotation, will fail to play the game on a rotated screen." Let me extrapolate on this a little bit. As human beings, when we're very young (and Rebecca and I have very young children), we're able to learn from just a few examples. So, my son, if he's learning what a cow looks like, after seeing two or three cows, he'll recognize a cow. Even if it's tipped over, even if it's lying down and sleeping, he'll recognize a cow. Even if it has a slightly different color, he'll still recognize it. A machine learning algorithm will typically need many, many, many examples. And if the examples are biased or tilted in even one way, say it only sees standing cows, and it only sees cows that are black and white, and it doesn't see brown cows, it might not recognize brown cows as cows as well. Whereas a human is going to quickly learn the general principles of what it means to be a cow and be able to recognize different configurations of cows. So what machine learning algorithms do is really make approximations based on large datasets.
Whereas what humans are doing when they're learning is learning general principles that they're able to then apply to many different flexible situations. Machines are trying to mathematically optimize. Humans are learning general principles. Now, there's a lot of debate about how humans really learn, and there's a lot of study and research into that. That's my impression and my understanding. So experts may disagree with me.

Rebecca Kopec

There are different types of machine learning algorithms. Can you talk us through them?

David Kopec

Sure. I'm not going to talk through all the different types, but I will talk through three main types that most people in an intro to machine learning class become familiar with. The first is classification algorithms. In classification, we have a data point, and we want to know what kind of data point it is or what group it belongs to. For example, say we have an image of an animal. We may want to know, is that an image of a dog or a cat? That would be classifying it: we either classify it as a dog image or classify it as a cat image. It can also be a piece of text, and we want to classify it as either angry text or happy text, or we want to classify what kind of emotion is associated with that text. You do this every day when you use your email client. There's a machine learning algorithm working behind the scenes to classify each incoming email as either legitimate or spam. Classification is easy to understand. Let's go to another thing that's slightly harder to understand, which is regression. Regression is about making predictions, typically numerical predictions. And this goes back to our Zillow example. Say we have a data point that is a house, and we know a lot about the house, but we don't know how much the house is worth. We want to be able to predict the value of that house, what its sale price would be. That's a type of regression. And then another interesting area is clustering. In clustering, we have a dataset and we want to know what the groups are. In classification, we already know the groups, and we have a new data point, and we want to know which group it belongs to. In clustering, we don't know what the groups are.
We want the algorithm to figure out what the groups are. That can't be done for all kinds of data; some kinds of data require that we know ahead of time what the groups are. But for certain kinds of data, you can imagine there might be a way to figure out that some of the data looks different than other parts of the data, and those naturally form groups. For example, clustering algorithms are sometimes used for detecting tumors in a radiographic examination. We want to know which of the cells seem to be different from the other cells, because they might actually be a tumor. So if we cluster all the cells, we might be able to figure out that a bunch of them that are all near each other are related to one another because they are a bad type of cell.

Rebecca Kopec

We often hear that these machine learning algorithms are resource intensive, that it can take a lot of money, a lot of time, and a lot of energy for them to be useful and be applied. Why is that?

David Kopec

Right. Because many of these algorithms require what's called a training phase. In the training phase, the algorithm is trained on a dataset, usually an initial dataset, and then new data comes along afterward that we want to use the algorithm with. For example, we hear about the big expense that OpenAI is putting towards GPUs and compute resources, and the months it takes them to come out with a new version of ChatGPT. That's because they're training on literally billions and billions of text documents. And it takes a lot of time and a lot of computer power to run all of that huge dataset through the deep learning neural networks that they use. And that's called the training phase. And the training phase is distinct from when we're actually using the machine learning algorithm on new data that we're interested in. That's called the inference phase. In the inference phase, we're actually getting our classification or getting our prediction, or in the case of ChatGPT, generating the text that we want as a reply to some query that we've sent to it. So machine learning algorithms typically have both this training phase and this inference phase. The training phase is when they're getting ready to be used, and the inference phase is when they're actually being used. Now, not all machine learning algorithms work that way. Some simple machine learning algorithms actually have no training phase. One is k-nearest neighbors, which is actually going to be in my next book, which I'll talk about on the podcast in a future episode. But most algorithms, and especially the deep learning ones that have been behind a lot of the exciting advances of the last decade, do require significant training before they become useful. And that's when you're exposing the algorithm to a huge amount of data, and it's kind of optimizing the curve, so to speak, before you give it that little bit of new data that you actually want to know about.
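
To make the idea of an algorithm with no training phase concrete, here is a minimal Python sketch of k-nearest neighbors applied to the spam example mentioned earlier. It is an illustration under stated assumptions, not code from the episode or from any of the books mentioned: the two features (link count and exclamation mark count), the sample emails, and the function name `knn_classify` are all hypothetical. Notice that nothing is computed ahead of time; all of the work happens at inference, when a new email arrives.

```python
from collections import Counter
from math import dist

# Hypothetical labeled dataset: (link_count, exclamation_count) -> label.
# The features and values are invented purely for illustration.
LABELED_EMAILS = [
    ((0, 0), "legitimate"),
    ((1, 1), "legitimate"),
    ((0, 2), "legitimate"),
    ((7, 5), "spam"),
    ((9, 8), "spam"),
    ((6, 9), "spam"),
]


def knn_classify(examples, point, k=3):
    """Label `point` by majority vote among its k closest labeled examples."""
    # There is no training step: we simply compare against the stored examples.
    nearest = sorted(examples, key=lambda example: dist(example[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify(LABELED_EMAILS, (8, 6)))  # an email with many links
    print(knn_classify(LABELED_EMAILS, (1, 0)))  # an email with almost none
```

The trade-off in this sketch is that every prediction has to look at the whole stored dataset, so inference gets slower as the dataset grows, which is roughly the opposite of the deep learning situation described above.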

Rebecca Kopec

During this training period, does a person have to be sitting there showing a new picture or new piece of information? How does it work?

David Kopec

Yeah, so there are really two different categories of machine learning algorithms when it comes to training. There are supervised algorithms and there are unsupervised algorithms. In supervised algorithms, there is some kind of human intervention, and that doesn't necessarily mean that a human is actively involved as it's training, although those types of algorithms or scenarios do exist. But it often means that there are human labels in the initial dataset. So if we go back to the recognition of the cat versus the dog, a human has already gone through maybe hundreds, thousands, even hundreds of thousands of pictures of cats and dogs and said, this one's a cat, this one's a dog, this one's a cat, this one's a dog. So when the mathematics is optimized and the stats are run, we actually know that we're optimizing for the right thing. There are also unsupervised algorithms that, instead of having any human intervention, operate completely autonomously. There's not necessarily any human-labeled data that goes into their training. An example of that would be k-means clustering, which is actually a chapter in one of my previous books, the Classic Computer Science Problems series, which I'll link to in the show notes. It's chapter six, if you're interested and you're a programmer. Anyway, in that algorithm, we want to figure out what the groups are in a dataset. And instead of having any human saying, well, I think there are some groups here, or I think this particular data point is part of some particular group, we just unleash the algorithm. The algorithm just looks at which of the data points are closest together and where, on average, the center is of each group of data points that are close together. And over time, we're then able to see where the boundaries are between the different groups through an iterative process of checking which things are closest to which things.
So it is possible to have algorithms that have almost no human intervention, but a lot of the useful ones are supervised, or in some cases semi-supervised, which means that there was some human labeling in the dataset that kicked off the ability to do training.
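
The iterative process described above can be sketched in a few lines of Python. This is a simplified illustration of k-means, not the implementation from the book chapter mentioned: the function names, the seeded random starting centers, and the toy two-dimensional points are all assumptions made for the example. No labels are supplied anywhere; the groups emerge from distances alone.

```python
import random
from math import dist


def mean_point(points):
    """Return the coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Group `points` into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Step 1: assign every point to its closest current center.
        clusters = [[] for _ in range(k)]
        for point in points:
            closest = min(range(k), key=lambda i: dist(point, centroids[i]))
            clusters[closest].append(point)
        # Step 2: move each center to the average of the points assigned to it.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabeled data with two visually obvious groups.
POINTS = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 9), (8.5, 9.5)]

if __name__ == "__main__":
    centers, groups = k_means(POINTS, k=2)
    for center, group in zip(centers, groups):
        print(center, group)
```

One design note: the caller still has to choose `k`, the number of groups to look for, so even this unsupervised sketch involves a small human decision up front.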

Rebecca Kopec

I think this would be a good time to talk a little bit about bias, because it's not the algorithms that are biased, but there is a place where a software developer or the human involved needs to be really mindful of the data set that they're choosing, right?

David Kopec

What if those labels are wrong? What if somebody had a certain bias themselves, so they decided to label things a certain way? Or more commonly than that, what if examples that are important are not even included in the original dataset? To give you a very silly but realistic example, say you were training a program for an automatic food dispenser for pets. Maybe it's for cats, and it has a camera on it, so it's supposed to detect when the cat walks up to the food dispenser and then automatically dispense a little bit of food. Let's say you train the algorithm that's used with that camera, and the pictures it takes, only on cats that are orange. Well, then when a gray cat walks up to the camera, it might not get detected. So that's a bias that you've introduced by not having a rich enough dataset to build your program and train your machine learning algorithm on. So we need to be mindful of this. It's something that not only programmers need to be aware of, and they absolutely do, but also project managers, researchers, and the folks who are actually collecting the data. It needs to really be understood throughout the organization. If you have a biased dataset, you end up with a biased machine learning result. Now, it's like you said, Rebecca, it's very unusual that the problem would actually be in the algorithm. The algorithms kind of operate the same way no matter what dataset they're fed. It's not that there's going to be somebody programming an if statement inside the algorithm that says, if the cat is gray, don't dispense a treat, or only if the cat is orange, dispense a treat. It's not going to be like that. It's going to be that the big dataset of images that it was trained on doesn't include a diversity of different cats.
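
The pet feeder example can be turned into a tiny, admittedly artificial Python sketch. Everything here is hypothetical: real systems work on images, whereas this sketch pretends each camera frame has already been reduced to two invented numbers, "orangeness" and "furriness", and it uses a simple nearest-example rule chosen only for brevity. The point it tries to show is the one made above: the detection code is identical in both runs, and only the dataset changes.

```python
from math import dist


def nearest_label(training_set, features):
    """Return the label of the single closest training example."""
    _, label = min(training_set, key=lambda example: dist(example[0], features))
    return label


# Hypothetical features per camera frame: (orangeness, furriness), each 0 to 1.
ORANGE_CATS = [((0.9, 0.8), "cat"), ((0.85, 0.9), "cat")]
GRAY_CATS = [((0.12, 0.85), "cat"), ((0.2, 0.7), "cat")]
NOT_CATS = [((0.1, 0.1), "not cat"), ((0.2, 0.05), "not cat"), ((0.15, 0.4), "not cat")]

# Same algorithm, two datasets: one that left gray cats out, one that did not.
biased_set = ORANGE_CATS + NOT_CATS
richer_set = ORANGE_CATS + GRAY_CATS + NOT_CATS

gray_cat_at_feeder = (0.1, 0.6)

if __name__ == "__main__":
    print("trained on orange cats only:", nearest_label(biased_set, gray_cat_at_feeder))
    print("trained on orange and gray cats:", nearest_label(richer_set, gray_cat_at_feeder))
```

There is no if statement about fur color anywhere in `nearest_label`; the gray cat goes hungry in the first run purely because of what was missing from the data.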

Rebecca Kopec

There are different types of algorithms that go into machine learning. There are some that are simpler or more straightforward. And then, on the other side of the spectrum, there are ones that are much more complicated, richer, and in some ways maybe more interesting in what they can do. Give us a broad overview.

David Kopec

Yeah, this is not the place to go into the details of specific algorithms, but I do want to just show you the range of complexity that exists. So, for example, one of the simplest machine learning algorithms is actually considered to be linear regression. And those of you who have studied economics or sociology or many different disciplines have probably done a linear regression in something like SPSS or in Excel or even by hand, and so you know what the idea is. For those of you that don't, it's basically the idea of trying to draw a line through a dataset. So can I draw a line through a dataset and then make predictions using it? For example, imagine a chart where we have size of house along the x-axis and price of house along the y-axis. For a lot of regions in the country, you could actually draw a reasonable line that would make reasonable predictions. And if you had a new house and you knew how many square feet it is, you could find that size along the x-axis, go up to the line that's drawn through all the previous data, read across to the y-axis, and make a reasonable prediction about how much that house should be priced for. So that's one of the simplest machine learning techniques. And when you think about it, linear regression has been around for over 100 years. We've actually been doing machine learning for a long, long time, and a lot of machine learning techniques came out of the world of statistics. On the other end of the spectrum, we have neural networks. Neural networks are trying to model how biological neurons work, but to the best of our knowledge, they don't really work like real biological neurons work. The biological neurons just served as the inspiration for the artificial neurons in artificial neural networks.
And the revolution in neural networks really happened in the 2000s, when we started using GPUs, graphics processing units, to really speed up training and inference with neural networks. And that has enabled multilayer neural networks, which, when they get very big, is called deep learning. And that's been behind a lot of the most exciting applications of the past couple of decades, like image recognition, speech recognition, and generative AI like ChatGPT. All of that is being done by deep neural networks. And there's a real art and a science to developing these deep neural networks. There are many different forms of them. There can be significant differences in how they're structured, and there's a lot of technique involved and a lot of research ongoing into the optimal structure and the optimal mathematical functions to use for different applications. The reason I'm going into this is just to show you that machine learning ranges from really simple techniques that can be highly effective. Linear regression can work very, very well, and there are other very simple techniques like k-nearest neighbors or k-means. And then there can be very sophisticated algorithms that are totally on the other end of the spectrum that also are highly effective. Most of the really exciting applications of machine learning, again, in the last decade have been on that more sophisticated end of the spectrum. But that doesn't mean that the simpler algorithms are not still used every day. They absolutely are, and they are very effective, too.
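
For programmers who want to see the "draw a line through the data" idea from the linear regression discussion in code, here is a minimal sketch in plain Python using the standard least-squares formulas for a single input. The house sizes and prices are made-up numbers, and the function names `fit_line` and `predict` are just illustrative choices; a real estimate like the one described for Zillow would use many more features than size alone.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much y changes, on average, per unit change in x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    # The least-squares line always passes through the point of averages.
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Read the fitted line at position x."""
    return slope * x + intercept


# Hypothetical past sales: size in square feet and sale price in dollars.
SIZES = [1000, 1500, 2000, 2500]
PRICES = [200_000, 290_000, 410_000, 500_000]

if __name__ == "__main__":
    slope, intercept = fit_line(SIZES, PRICES)
    print(f"price is roughly {slope:.0f} * size + {intercept:.0f}")
    print(f"estimate for 1,800 sq ft: {predict(slope, intercept, 1800):,.0f}")
```

With these invented numbers the fitted line works out to about 204 dollars per square foot with an intercept of about -7,000, so an 1,800 square foot house is estimated at roughly 360,200 dollars.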

Rebecca Kopec

What would you say is the most important or some of the most important things for the everyday person to know or understand about machine learning?

David Kopec

I think the most important thing to know is that it starts with the dataset, and if it's going to work well, it starts with a good dataset. As we discussed before, the richer and larger the dataset, generally, the better the machine learning algorithm will work. And if the dataset is biased, then those same biases will be in the results of the machine learning algorithm. So it really starts with a dataset, and you need to be really concerned about the quality and size of the dataset before you even start to use a machine learning technique. I also just want to stress, and this has been an emphasis not just of this episode but also of our prior episode on artificial intelligence, that machine learning is just a subset of artificial intelligence. Artificial intelligence is a bigger field than just machine learning, and there are a lot of other techniques in artificial intelligence that are very interesting. Sometimes people think the two are just completely equivalent, but really, machine learning is just a subset. And we've done prior episodes not only on artificial intelligence itself, but also on some other subfields of artificial intelligence, and I'll link to at least one of those in the show notes. As a programmer, if you're interested in getting into machine learning, I'm going to very biasedly recommend my own series of books, the Classic Computer Science Problems series. Out of the nine chapters in each of those books, you could say four to five of them actually come from the world of artificial intelligence, including at least two chapters that you could consider machine learning: chapter six, on the k-means algorithm for clustering, and chapter seven, on neural networks, where you actually build a real neural network, a really, really simple one, using no external libraries.
And I have a new book coming out shortly, which I'm sure I'll talk about on the podcast in future episodes, that's going to cover a really, really simple machine learning algorithm called k-nearest neighbors in a couple of its chapters. And if you want the most gentle possible introduction to machine learning, then those couple of chapters in my next book are really fantastic. I'm also going to recommend one other book (I'm doing a lot of book recommendations), and this one I'm unbiased on because I'm not the author: The Hundred-Page Machine Learning Book by Andriy Burkov, which I quoted from at the beginning of this episode. I highly recommend it to programmers. I love books that are super succinct, that get to the point and don't waste your time, just like I hope we mostly do in this podcast. And this book is exactly that. If you're a programmer and you want to get started in machine learning, I don't think there's any better book. So if you're a programmer, those are great places to start. If you're not a programmer, hopefully we already gave you enough in this episode to understand what other people are talking about when they're talking about machine learning. And just remember, it always comes back to the data. Thanks for listening to us this week. Rebecca, how can people get in touch with us on X?

Rebecca Kopec

We're Kopec Explains, K-O-P-E-C E-X-P-L-A-I-N-S.

David Kopec

We have a lot of great new topics that we plan to do in the coming months. We hope to have you listening with us again soon. If you like the episodes that we put out, please don't forget to leave us a review on your podcast player of choice. Talk to you soon. Bye.

Machine Learning is a discipline within the broader field of Artificial Intelligence concerned with using insights from datasets to make predictions, classify new data points, and generate content. The algorithms used vary greatly in complexity and in the real-world applications they are suited to. Instead of concentrating on any particular algorithm, in this episode we aim to provide a broad understanding of machine learning and what it is used for. We also discuss bias in datasets and some common misconceptions.

You may want to listen to our prior episode on Artificial Intelligence before diving into this episode.

Show Notes

Find out more at http://kopec.live