{"id":8087,"date":"2019-07-17T12:01:32","date_gmt":"2019-07-17T19:01:32","guid":{"rendered":"https:\/\/www.springboard.com\/?p=8087"},"modified":"2023-07-27T01:44:16","modified_gmt":"2023-07-27T08:44:16","slug":"introduction-regression-classification-machine-learning","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/introduction-regression-classification-machine-learning\/","title":{"rendered":"Introduction to Regression and Classification in Machine Learning"},"content":{"rendered":"\n<p><span style=\"font-weight: 400;\">In my last post, we explored a general overview of data analysis methods, ranging from basic statistics to machine learning (ML) and advanced simulations. It was a pretty high-level overview, and aside from the statistics, we didn\u2019t dive into much detail. In this post, we\u2019ll take a deeper look at machine-learning-driven regression and classification, two very powerful, but rather broad, tools in the data analyst\u2019s toolbox. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">As my university math professors always said, the devil is in the details. While we will look at these two subjects in more depth, I don\u2019t have programming examples for you. We&#8217;ll go over how the methods work and a couple of examples. We&#8217;ll look at some pros and cons, and we will talk about a couple of important issues when using machine learning. But you&#8217;ll need to learn a little programming and debug your code to interpret your results. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Modern data analysis is fundamentally computer-based. Could you do these calculations by hand? Probably. But it would take a very long time, and it would be extremely tedious. That means you\u2019ll probably need <\/span><i><span style=\"font-weight: 400;\">some<\/span><\/i><span style=\"font-weight: 400;\"> programming skills to accomplish classification or regression tasks. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">However, don\u2019t let coding scare you away from the tutorial. If you have any contact with machine learning, these basics are important to understand, even if you never write a line of code.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Now, let\u2019s get started.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Background<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Types of Machine Learning<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">There are two main types of machine learning: supervised and unsupervised. Supervised ML requires pre-labeled data, which is often a time-consuming process. If your data isn\u2019t already labeled, set aside some time to label it. It will be needed when you test your model. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">By labeling, I mean that your data set should have inputs and the outputs should already be known. For example, a supervised regression model that determines the price of a used car should have many examples of used cars previously sold. It must know the inputs and the consequent output to build a model. For a supervised classifier that, for example, determines whether a person has a disease, the algorithm must have inputs and it must know which output those inputs led to. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Unsupervised ML requires no initial labeling from the data scientist. There is a kind of label \u201cout there in the ether,\u201d we just can\u2019t see it. Unsupervised algorithms are complex to implement and generally require a lot of money and data. The algorithm receives no guidance from the data scientist, like: Patient 1 with symptoms A, D, and Z has cancer, while Patient 2 with symptoms A, D, and \u2013Z does not. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">When the general public thinks of <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/best-books-on-artificial-intelligence\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/best-books-on-artificial-intelligence\/\" rel=\"noreferrer noopener\">artificial intelligence<\/a>, they think of (very good) unsupervised ML algorithms. Their power lies in their ability to learn on their own and to make connections in higher-dimensional vector spaces. That phrase may sound intimidating, but it really just means the algorithms can look at multiple connections at once and discover insights that we humans, with our limited memories, would miss. They might make connections humans cannot understand, too, leading to black box algorithmic decisions. More on that later.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Regression and Classification<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">In the <\/span><a href=\"https:\/\/www.springboard.com\/resources\/learning-paths\/data-analysis\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/resources\/learning-paths\/data-analysis\/\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">last article<\/span><\/a><span style=\"font-weight: 400;\">, I discussed these a bit. Classification tries to discover into which category the item fits, based on the inputs. Regression attempts to predict a certain number based on the inputs. There\u2019s not much more to it than that at the surface level.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Splitting Data Sets<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">When we train a ML model, we need to also test it. Because data can be expensive and time-consuming to gather, we often split the (labeled) data set we have into two sections. 
One is the training set, which the supervised algorithm uses to adjust its internal parameters and make the most accurate prediction based on the inputs.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The remainder of the data set (usually around 30%) is used to test the trained model. If the model is accurate, the test data set should have a similar accuracy score to that of the training data. However, we often see underfitting or overfitting of the model, and that will become apparent in the testing stage.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">For now, it\u2019s sufficient to know that this \u201ctraining-test split\u201d is a very common evaluation method in ML-based data science.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Common Tools<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">There are many tools for data analysts, some more popular than others. This is a <\/span><a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/data-analytics-tools\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-analytics\/data-analytics-tools\/\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">nice introduction<\/span><\/a><span style=\"font-weight: 400;\">. And while it doesn\u2019t cover all tools (the list would be enormous), it does touch on many of the popular ones. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Languages like R and Python are very common, and if you look at any <a href=\"https:\/\/www.springboard.com\/courses\/data-science-career-track\/\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/courses\/data-science-career-track\/\" target=\"_blank\" rel=\"noreferrer noopener\">data science course<\/a>, you\u2019re likely to see some mention of Python libraries like <\/span><i><span style=\"font-weight: 400;\">scikit-learn<\/span><\/i><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">pandas<\/span><\/i><span style=\"font-weight: 400;\">, and <\/span><i><span style=\"font-weight: 400;\">numpy<\/span><\/i><span style=\"font-weight: 400;\">. There is, fortunately, a large online community to help you learn these tools, and I strongly recommend starting out with a high-level language like Python, especially if you have little programming background.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Regression<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Now that we have the preliminaries out of the way, let\u2019s start looking at some regression techniques.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Least Squares<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The concept is 
straightforward: we try to draw a line through the data set and measure the distance from each point to the line. The distance is termed the <\/span><i><span style=\"font-weight: 400;\">error<\/span><\/i><span style=\"font-weight: 400;\">, and we add up all these errors. Then we draw another, slightly different line, add up all the errors, and, if the second line has a lower total error than the first one, we use the second line. This process is repeated until a line with the lowest error is found. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The name of the method derives from its formula, which actually squares the error value. This is because points above the line would offset points below the line (positive plus negative goes to zero). Squared values always give a positive number, so we can be sure that we are always adding positive numbers together.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Least squares regression will produce some linear equation, like: <\/span><\/p>\n\n\n\n<p class=\"has-text-align-center\"><span style=\"font-weight: 400;\"><em>car price<\/em> = 60,000 &#8211; 0.5 * <em>miles<\/em> &#8211; 2200 * <em>age<\/em> (<\/span><span style=\"font-weight: 400;\"><em>in years<\/em>)<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Every two miles driven reduces your sale price by $1, and every year of ownership reduces it a further $2,200. So a 5-year-old car with 50,000 miles will sell for $24,000. This assumes a brand new vehicle, with zero miles and zero years, costs $60,000. Whether this reflects reality is suspect, but this example illustrates the basic equation produced by the least squares method.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Least squares is very helpful when you have a linear relationship. It doesn\u2019t have to be in two dimensions, as our example above illustrates. 
Technically our car example would use a predictor <\/span><i><span style=\"font-weight: 400;\">plane <\/span><\/i><span style=\"font-weight: 400;\">instead of a predictor <\/span><i><span style=\"font-weight: 400;\">line<\/span><\/i><span style=\"font-weight: 400;\">, and higher dimensions (i.e., more variables) would result in <\/span><i><span style=\"font-weight: 400;\">hyperplane <\/span><\/i><span style=\"font-weight: 400;\">predictors. As inhabitants of a three-dimensional world, we cannot easily visualize hyperplanes geometrically. However, the idea remains: if a simple linear relationship can be found, linear regression is applicable.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Least squares is also a common first approach because it\u2019s very cheap in terms of computing power. However, this efficiency comes at a cost: a straight-line fit is not useful when the relationship is not linear, and many real relationships are not.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Non-linear regressors like polynomial and logarithmic ones still use the least squares method, but shift from producing a line (or linear plane or hyperplane) to a curve or curved surface. 
A cubic function for car prices might look like:<\/span><\/p>\n\n\n\n<p class=\"has-text-align-center\"><span style=\"font-weight: 400;\"><em>car price<\/em> = 60,000 &#8211; 0.01 * <\/span><em><span style=\"font-weight: 400;\">miles<\/span><\/em><span style=\"font-weight: 400;\"><sup>3 <\/sup><\/span><span style=\"font-weight: 400;\">+ 0.3 * <\/span><em><span style=\"font-weight: 400;\">miles<\/span><\/em><span style=\"font-weight: 400;\"><sup>2 <\/sup><\/span><span style=\"font-weight: 400;\">* <em>years<\/em> &#8211; 3.5 * <em>miles<\/em> * <\/span><em><span style=\"font-weight: 400;\">years<\/span><\/em><span style=\"font-weight: 400;\"><sup>2 <\/sup><\/span><span style=\"font-weight: 400;\">+ \u2026 &#8211; 0.94 * <\/span><em><span style=\"font-weight: 400;\">years<\/span><span style=\"font-weight: 400;\"><sup>3<\/sup><\/span><\/em><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">I completely made up that equation, but you can see that to reach higher-degree polynomial terms we just multiply variables together. The machine finds the constant factors (0.01, 0.3, 3.5, and 0.94 in our example).<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">k-Nearest Neighbors (KNN) Regression<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The k-nearest neighbors approach is intuitively more closely associated with classification, but it can be used for regression as well. If you remember the discussion about continuous and categorical variables in <\/span><a href=\"https:\/\/www.springboard.com\/resources\/learning-paths\/data-analysis\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/resources\/learning-paths\/data-analysis\/\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">my last post<\/span><\/a><span style=\"font-weight: 400;\">, the line isn\u2019t always clear (recall the checking account example). 
If we break down a continuous output into multiple categories, we can apply classifier models to generate regression-like predictors. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Pure KNN regression simply averages the output values of the nearest points, with the number of neighbors (the k in the name) chosen by the programmer. A regressor that uses five neighbors will find the five training points closest to the new input and output the average of their known values as the prediction.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">I will discuss k-nearest neighbors more later, because it fits better with classification, but know that it can be used for regression purposes. <\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Sanity Checking Your Model<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Sometimes we get so caught up in the programming and error rates and accuracy scores, we go a bit insane. ML requires quite a bit of tinkering, and when we start to see improvements, we get very excited. If you started out with 70% baseline accuracy and tweaked and tweaked and finally got to 80%, you must be doing something right, right? Well, maybe not.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Your data set might really have a 50-50 chance of outcome A or outcome B, and your original regressor was guessing too systematically. To ensure we can trust the results, it\u2019s a good idea to use a dummy regressor. These predict outcomes using simple predetermined rules that ignore the input features entirely. If your dummy regressor is producing similar results to your trained regressor, you haven\u2019t made any true progress or gained any insights. Common dummy regressors always predict the mean, the median, or a chosen quantile of the training outputs. 
Your data set might not actually need machine learning insights if these dummy regressors are sufficient, or you might need a new approach.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Classification<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Classification assigns a data point to a specific category based on its features or characteristics. Based on measurements, what is this plant? What kind of worker will someone be based on the answers to a personality test? Using solely color variation, which bananas are ripe, which are underripe, and which are overripe? <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">As you may have concluded, classification questions are usually \u201cwhat kind of\u2026\u201d while regression questions are usually \u201chow much \u2026\u201d or \u201cwhat is the probability that \u2026\u201d. These are not always mutually exclusive. But this is a good rule of thumb to help you determine whether your problem will require a classification or regression model.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Linear Support Vector Machines (SVMs)<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">In their simplest form, SVMs closely resemble least squares regression. In least squares, we tried to find the line that minimized the error term. With an SVM, we look for a line that fits between the two classes, then we try to widen that line as much as possible. The line that can be widened the most is considered our decision line. Points on one side are Class A and points on the other side are Class B. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This visual might help:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4.png\"><img loading=\"lazy\" decoding=\"async\" width=\"937\" height=\"584\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4.png\" alt=\"Linear Support Vector Machines\" class=\"wp-image-8088\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4.png 937w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4-400x249.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4-768x479.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4-380x237.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4-700x436.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image5-4-380x237.png 420w\" sizes=\"(max-width: 937px) 100vw, 937px\" \/><\/a><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">All these lines can separate the two classes (chimps and humans), but some lines are probably better classifier lines than others. The blue one is heavily dependent on height, so slightly taller chimps or slightly shorter humans may be classified incorrectly. The red line is horizontal and therefore entirely dependent on weight. Extra-heavy chimps (say 62 kg.) will be incorrectly classified as human. The green one seems to take both factors into account.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">SVM will take each guess and try to widen it. The line that can be widened the most before it touches a data point is considered the best classifier. 
Intuitively, it is the decision line that has the greatest buffer between data points and the decision criteria.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Out of our choices, the green one can expand the most without touching a data point:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3.png\"><img loading=\"lazy\" decoding=\"async\" width=\"937\" height=\"584\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3.png\" alt=\"SVM algorithm\" class=\"wp-image-8089\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3.png 937w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3-400x249.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3-768x479.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3-380x237.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3-700x436.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/image7-3-380x237.png 420w\" sizes=\"(max-width: 937px) 100vw, 937px\" \/><\/a><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">If a SVM algorithm could only choose from these three lines, it would choose green. Note that the decision line will be the thin line from the first graph. Technically, the green \u201cline\u201d in the second graph is a rectangle. A real SVM will test hundreds or thousands of lines.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Non-Linear SVMs<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Linear classifiers are great because they\u2019re cheap in terms of compute power and compute time. That means they can easily scale to mammoth data sets. However, sometimes linear classifiers just don\u2019t conform to the data. 
In these cases, you might want to try a different <\/span><i><span style=\"font-weight: 400;\">kernel<\/span><\/i><span style=\"font-weight: 400;\">, often a radial kernel. This means instead of drawing lines, you draw circle-like decision boundaries. <\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma.png\"><img loading=\"lazy\" decoding=\"async\" width=\"796\" height=\"518\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma.png\" alt=\"Radial basis SVM, Low Gamma\" class=\"wp-image-8091\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma.png 796w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma-400x260.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma-768x500.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma-380x247.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma-700x456.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/Low-Gamma-380x247.png 420w\" sizes=\"(max-width: 796px) 100vw, 796px\" \/><\/a><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Here, the green <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Spline_(mathematics)\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">spline<\/span><\/a><span style=\"font-weight: 400;\"> represents the radial basis SVM and all points inside the shaded area will be predicted as Class A. It isn\u2019t a circle centered at the center of the group. It is actually multiple circles drawn together, each radiating out from the data points. 
My drawing isn\u2019t perfect, but the concept should be understandable: draw circles out from data points and then put them together to get the decision line that offers the widest buffer between Class A and Class B data points.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">SVMs with polynomial kernels are also popular, wherein polynomial lines are used instead of circles or straight lines. And while we only looked at problems with two inputs (height and weight, X and Y), SVMs can easily take more inputs.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">When writing the code, there are multiple parameters you can set. Especially important is <\/span><i><span style=\"font-weight: 400;\">gamma<\/span><\/i><span style=\"font-weight: 400;\">, which is particularly noticeable with radial-basis SVMs. The higher your gamma parameter, the tighter the circles are around the data points. High gammas might lead to tight circular boundaries that isolate individual data points, but this is extremely overfit and will not classify new data well.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma.png\"><img loading=\"lazy\" decoding=\"async\" width=\"796\" height=\"519\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma.png\" alt=\"Radial basis SVM, High Gamma\" class=\"wp-image-8092\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma.png 796w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma-400x261.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma-768x501.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma-380x248.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma-700x456.png 700w, 
https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/High-Gamma-380x248.png 420w\" sizes=\"(max-width: 796px) 100vw, 796px\" \/><\/a><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Decision Tree Classifiers (DTCs)<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Let\u2019s look at a different kind of classifier. Decision trees recursively split up the data points into groups (nodes). Each node is a subset of the node above, and if the decision tree is a good classifier, the accuracy of predictions improves as it moves down the branches. Decision trees can achieve extremely high accuracies on the first attempt. However, be cautious about such high accuracy rates, because decision trees are notorious for overfitting the training data. You might get 95% accuracy on your training set then get 65% on your test set. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To illustrate, a decision tree might look like this:<\/span><span style=\"font-weight: 400;\"><br>\n<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"982\" height=\"594\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1.png\" alt=\"decision tree\" class=\"wp-image-8095\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1.png 982w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1-400x242.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1-768x465.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1-380x230.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1-700x423.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img1-380x230.png 420w\" sizes=\"(max-width: 982px) 100vw, 982px\" 
\/><\/a><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">We start with 100 samples, and the tree breaks down the groups multiple times. At the beginning, we have labels for 20 bicycles, 10 unicycles, 30 trucks, 10 motorcycles, and 30 cars. The classifier splits the data points up by their attributes. A classifier doesn\u2019t ask questions like \u201chow many wheels does it have\u201d, but it will mathematically consider that Data Point 1 has two wheels and one motor while Data Point 2 has one wheel and zero motors.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">If all the data points in a node are the same class, the DTC doesn\u2019t have to split it anymore, and that branch of the tree will terminate. Such nodes are called pure nodes. Conversely, if the node contains samples of more than one class, it <\/span><i><span style=\"font-weight: 400;\">could <\/span><\/i><span style=\"font-weight: 400;\">be split further. However, we do not want to keep splitting nodes if it makes the model excessively complex.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">There are two parameters commonly adjusted when trying to train the best DTC: the maximum number of nodes and the maximum depth. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Depth refers to how many times the DTC should split up the data. If you set the maximum depth to three, the DTC will only split subsets three times, even if the ending nodes are not pure. Of course, the DTC tries to split the data such that the purity of the end nodes is as high as possible.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The maximum number of nodes is the upper limit on how many nodes there are in total. If you set this to four, there can only be four subsets. That could manifest in a depth of one, where the original set is split into four subsets right away. It could also be a depth of two, where the original splits into two subsets, and one of the subsets is itself split into two further subsets. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To reduce overfitting, the maximum depth parameter is usually adjusted. This stops the DTC from continuing to split the data once accuracy is high enough.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Random Forests<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The law of large numbers is a central theme in probability and statistics. This idea flows over into <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" target=\"_blank\" rel=\"noreferrer noopener\">data science<\/a>, and having more samples is often viewed as better. Because DTCs can overfit very easily, sometimes we use a set of trees, which would be a forest. Hence the name: a random forest classifier (RFC) is a collection of DTCs, each built from randomly selected data from the training set. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In the code, the <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" rel=\"noreferrer noopener\">data scientist<\/a> would choose a maximum depth, maximum nodes, minimum samples per node, and other parameters for each tree, plus how many trees and features (input variables) to use. Then the algorithm builds trees using different subsets of the training data set. This is done so that the overfitting of any single tree is averaged out across the forest. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To make the RFC more random, you can also choose a subset of features. For instance, if there are 10 features and maximum_features = 7, the trees will only choose from seven of the features. This prevents strongly influential features from dominating every tree and makes the forest more diverse.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The final model is a combination of those trees. 
If there are 30 trees in the forest, and 20 of them predict Class A while 10 predict Class B, the RFC will call that data point Class A.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Clearly, RFCs are more robust because they combine the votes of multiple trees. The two drawbacks are the computing power required and the <\/span><a href=\"https:\/\/www.kdnuggets.com\/2019\/03\/ai-black-box-explanation-problem.html\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">black box problem<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Since your model has to generate multiple DTCs, it has to process and reprocess the training data repeatedly. Then, once a model is settled, every new data point must be run through all the DTCs to make a decision. Depending on the dimensions of your data, the size of your data set, and the complexity of your RFC, this could require quite a bit of computing power. However, because each tree in the forest is independent, an RFC can easily be parallelized. If you\u2019re working with multiple CPU cores or GPUs, the computational burden might not actually be that bad.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The other issue, black box algorithms, is a major focus of computer science, philosophy, and ethics. In a nutshell, the black box problem is our (humans\u2019) inability to decipher <\/span><i><span style=\"font-weight: 400;\">why <\/span><\/i><span style=\"font-weight: 400;\">a decision is reached. This happens frequently in RFCs because there are so many DTCs. Though this can also happen with larger trees, it\u2019s usually easier to visualize how a single DTC reaches its conclusion. A 200-tree, depth-20 forest is much harder to decipher.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Neural Networks<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The last classifier I will discuss is the neural network. 
This is probably the hottest data science topic in popular culture. <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/beginners-guide-neural-network-in-python-scikit-learn-0-18\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/beginners-guide-neural-network-in-python-scikit-learn-0-18\/\" rel=\"noreferrer noopener\">Neural networks<\/a> are loosely modeled after the human brain and can be extremely powerful without much need for tuning the model. Their implementations are very complex, and they too suffer from the black box problem.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In neuroscience, <\/span><a href=\"https:\/\/nba.uth.tmc.edu\/neuroscience\/m\/s1\/introduction.html\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">thought and action are determined by the firing of neurons<\/span><\/a><span style=\"font-weight: 400;\">. Neurons fire based on their inner electrochemical state. Once a specific threshold is reached, the neuron \u201cfires,\u201d causing other neurons to react. This is the basis of neural networks. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Neural networks are layered sets of nodes, where input nodes send signals to nodes in a <\/span><i><span style=\"font-weight: 400;\">hidden layer<\/span><\/i><span style=\"font-weight: 400;\">, which in turn send signals to a final output. Every input node is connected to every hidden node, and every connection has its own weight. For example, Feature X might influence Hidden Node 1 with a weight of 0.5 and Hidden Node 2 with a weight of 0.1. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">If a hidden node reaches its threshold, it propagates a signal forward. 
This could be to another hidden layer, or it could be to the output.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This illustration should be helpful in understanding the concept (yes, I painstakingly <\/span><a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/excel-functions-for-data-analysis\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-analytics\/excel-functions-for-data-analysis\/\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">made it in Excel<\/span><\/a><span style=\"font-weight: 400;\">):<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1115\" height=\"378\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2.png\" alt=\"\" class=\"wp-image-8094\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2.png 1115w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2-400x136.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2-768x260.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2-380x129.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2-700x237.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img2-380x129.png 420w\" sizes=\"(max-width: 1115px) 100vw, 1115px\" \/><\/a><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">This neural network has four features and two hidden layers, the first with three nodes and the second with two nodes. Each one of the arrowed lines carries a weight, which will impact the node it points to. Sophisticated neural networks might have hundreds of nodes and several hidden layers.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To understand how these work better, let\u2019s look at an example. 
This is clearly contrived, and <\/span><i><span style=\"font-weight: 400;\">activation functions\u2014<\/span><\/i><span style=\"font-weight: 400;\">the formula determining whether a node fires\u2014are usually much more complex. But this should get the point across:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1115\" height=\"405\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3.png\" alt=\"\" class=\"wp-image-8093\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3.png 1115w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3-400x145.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3-768x279.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3-380x138.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3-700x254.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/07\/img3-380x138.png 420w\" sizes=\"(max-width: 1115px) 100vw, 1115px\" \/><\/a><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Now, if we have an input set of {2, 1}, 2 will enter the Feature 0 node and send 4 to HN0 and 8 to HN1. The Feature 1 node will send 6 to HN0 and 0.3 to HN1. The hidden layer, in turn, sends 4*0.5=2 and -1*1 = -1 to the output. 2 + (-1) = 1, so an input set of {2,1} should correspond to Class B.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">These weights and signals are adjusted until the resultant data set matches the expected predictions as closely as possible. 
Training typically does this by sending error signals backward through the network (backpropagation), and bleeding-edge approaches might put two neural networks in competition or do other clever operations to improve prediction ability.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The three main disadvantages of neural networks are a voracious appetite for data, memory usage, and the black box problem.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Neural networks tend to perform best when they have huge amounts of input data, often at a scale that only Big Tech commands. Google and Facebook (among others) have far more data than any smaller organization could possibly own, and that is a large part of why their models perform so well. If your project is short on data, a neural network might not be a good idea.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Then, because the layers are not independent, a lot of information must be stored in working memory. Hard drives are cheap, but RAM is not, and neural networks can consume significant amounts of it. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">And, as you can probably guess from just my illustrations, these get complicated fast. When there are five hidden layers, each with 100 nodes, and backpropagation is occurring, humans may have a difficult time understanding the <\/span><i><span style=\"font-weight: 400;\">why <\/span><\/i><span style=\"font-weight: 400;\">of the decision. Some tools and clever programming can help, though.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Ending Remarks<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">I hope you have learned a little about machine learning for regression and classification. There is plenty more to learn, and this is just a first-step introduction. There are many online courses to teach you the programming and practical details, as well as some good classes on the mathematics that supports all of these algorithms. 
<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Remember that machine learning is just computers doing math, not magical spells that pull insights out of nowhere. ML is an awesome tool\u2014and I mean that in both senses: cool, but also so powerful that it inspires awe. Use it wisely and reap great benefit.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my last post, we explored a general overview of data analysis methods, ranging from basic statistics to machine learning (ML) and advanced simulations. It was a pretty high-level overview, and aside from the statistics, we didn\u2019t dive into much detail. 
In this post, we\u2019ll take a deeper look at machine-learning-driven regression and classification, two [&hellip;]<\/p>\n","protected":false},"author":73,"featured_media":8127,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-8087","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/8087"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/73"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=8087"}],"version-history":[{"count":4,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/8087\/revisions"}],"predecessor-version":[{"id":48620,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/8087\/revisions\/48620"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/8127"}],"wp:attachment":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media?parent=8087"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=8087"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=8087"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=8087"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}