{"id":23632,"date":"2020-06-11T02:15:00","date_gmt":"2020-06-11T09:15:00","guid":{"rendered":"https:\/\/www.springboard.com\/blog\/?p=23632"},"modified":"2023-06-25T23:38:42","modified_gmt":"2023-06-26T06:38:42","slug":"k-means-clustering","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/k-means-clustering\/","title":{"rendered":"K Means Clustering Machine Learning Algorithm: Introduction and Implementation"},"content":{"rendered":"\n<p>In this blog post, we are going to discuss the &#8216;K Means clustering Machine Learning algorithm&#8217;. Unlike the KNN algorithm, K Means clustering is an unsupervised learning algorithm. Unsupervised learning does not involve a target output, which means the system is not trained on labeled examples. Instead, the system must learn on its own by identifying and adapting to the structural characteristics of the input patterns. Unsupervised learning methods work with unlabeled data, so the output is based only on patterns observed in the data. Unsupervised learning generates moderately accurate output, but it is reliable. 
The figure below shows the different types of unsupervised learning algorithms.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1051\" height=\"792\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india.webp\" alt=\"\" class=\"wp-image-23633\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india.webp 1051w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-380x286.webp 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-380x286.webp 420w\" sizes=\"(max-width: 1051px) 100vw, 1051px\" \/><\/figure>\n\n\n\n<p>Now we will look at the clustering technique and its use cases.<\/p>\n\n\n\n<p><em><strong>Inspired by this analysis and want to learn how to do it \/ wish to replicate this for your project? We can help you there. Just leave your email address in this <a href=\"https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLSc76hXWSBgCIV38-c3FKMk0J3NbZMUydcKZh47mByS8FU15YQ\/viewform\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">google form<\/a> and we will share the analysis with you within 48 hours.<\/strong><\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>K<\/strong> <strong>Means Clustering: Clustering Technique<\/strong><\/h2>\n\n\n\n<p>Clustering is an unsupervised learning technique. Gestalt&#8217;s Law of similarity says that if two things are similar in some ways, they often share other characteristics. Similarly, a cluster is a set of similar data points or a set of points that are more similar to each other than to points in other clusters. In clustering, we provide an unlabeled training dataset to an algorithm and ask the algorithm to find some structure in the data. 
The output can take a clustered, density-based, or hierarchical form.<\/p>\n\n\n\n<p>Clustering techniques can be used for:<\/p>\n\n\n\n<p>1) Market Segmentation: Analyzing market and customer requirements to target the sale of selected products to a particular demographic.<\/p>\n\n\n\n<p>2) Social Network Analysis: Based on user habits and social interactions, we can group users into clusters, which helps make communication more effective.<\/p>\n\n\n\n<p>3) Detecting Anomalies or Outliers: Clustering techniques can identify outliers, such as students in a classroom who far outperformed other students or failed to even pass. Another example of anomaly detection is fraud detection in credit card transactions at banks.<\/p>\n\n\n\n<p>In this post, we will discuss only the K Means Clustering Algorithm, its implementation, and use cases.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>K Means Clustering Algorithm<\/strong><\/h2>\n\n\n\n<p>The K Means clustering algorithm is one of the most popular clustering algorithms. K-Means is an iterative algorithm. Let\u2019s imagine we have a set of unlabeled data and we want to group the dataset into three clusters. The K-Means algorithm will assign each data point to one of the K groups based on feature similarity. 
Here are the steps by which we can achieve this using K-Means clustering:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>First, we define the value of K, which is the number of clusters we are going to create. In our example, we would like to create 3 cluster groups from the dataset, so the value of K will be 3.<\/li>\n\n\n\n<li>Initialize K randomly selected points from the dataset, one for each cluster. These points are called centroids.<\/li>\n\n\n\n<li>Traverse the dataset to the last data point and assign each point to the cluster\/group of its nearest centroid.<\/li>\n\n\n\n<li>Recompute each centroid as the mean of the points assigned to it, then repeat steps 3 and 4 until the assignments stop changing.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Implementation of K Means Clustering Algorithm<\/strong><\/h3>\n\n\n\n<p>For our implementation, we are using Jupyter Notebook and executing our algorithm in Python v3.0. We have also created one sample dataset on which we will implement the K-Means algorithm. Below are the columns and a few rows from our dataset:<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><strong>CustomerId<\/strong><\/td><td><strong>CreditScore<\/strong><\/td><td><strong>Purchase In Lacs<\/strong><\/td><td><strong>State<\/strong><\/td><\/tr><tr><td>91001<\/td><td>68<\/td><td>5<\/td><td>Active<\/td><\/tr><tr><td>91002<\/td><td>84<\/td><td>8<\/td><td>Active<\/td><\/tr><tr><td>91003<\/td><td>59<\/td><td>21<\/td><td>Active<\/td><\/tr><tr><td>91004<\/td><td>85<\/td><td>4<\/td><td>Active<\/td><\/tr><tr><td>91005<\/td><td>91<\/td><td>3<\/td><td>Active<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Let us start by loading the necessary libraries. We are using NumPy for scientific computing with Python. It\u2019s a widely used open-source library for <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" data-type=\"post\" data-id=\"2291\">data science<\/a>. We are also loading Matplotlib for multiplatform data visualization. 
To implement K-Means clustering, we are going to load the sklearn.cluster module (<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/clustering.html\" target=\"_blank\" rel=\"noopener\">https:\/\/scikit-learn.org\/stable\/modules\/clustering.html<\/a>).<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1430\" height=\"228\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-1-1.webp\" alt=\"\" class=\"wp-image-23634\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-1-1.webp 1430w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-1-1-380x61.webp 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-1-1-380x61.webp 420w\" sizes=\"(max-width: 1430px) 100vw, 1430px\" \/><\/figure>\n\n\n\n<p><em><strong>Related Read:<\/strong> <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-scientist-job-description\/\" data-type=\"post\" data-id=\"2371\">Data Scientist Job Description<\/a><\/em><\/p>\n\n\n\n<p>2) Now we will load our dataset, which has these columns &#8211; CustomerId, CreditScore, Purchase Amount, State.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1430\" height=\"75\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-2.webp\" alt=\"\" class=\"wp-image-23635\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-2.webp 1430w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-2-380x20.webp 380w, 
https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-2-380x20.webp 420w\" sizes=\"(max-width: 1430px) 100vw, 1430px\" \/><\/figure>\n\n\n\n<p>3) Next, we will select only two columns from the dataset on which we want to perform the K-Means algorithm. We are using the iloc indexer from pandas to select the required columns and assign them to the X vector.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1431\" height=\"69\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-4.webp\" alt=\"\" class=\"wp-image-23636\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-4.webp 1431w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-4-380x18.webp 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-4-380x18.webp 420w\" sizes=\"(max-width: 1431px) 100vw, 1431px\" \/><\/figure>\n\n\n\n<p>4) Let us specify how many cluster groups we would like to create. 
In this example, we are going to work with 3 clusters.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1429\" height=\"75\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-5-2.webp\" alt=\"\" class=\"wp-image-23637\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-5-2.webp 1429w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-5-2-380x20.webp 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-5-2-380x20.webp 420w\" sizes=\"(max-width: 1429px) 100vw, 1429px\" \/><\/figure>\n\n\n\n<p>5) Now we create the Y cluster labels by predicting on the X vector. Here we are going to use the fit_predict method, and this is how it will look. Y_kmeans now contains the cluster label predicted by K-Means for each data point:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1430\" height=\"80\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-6.webp\" alt=\"\" class=\"wp-image-23638\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-6.webp 1430w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-6-380x21.webp 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/06\/k-means-clustering-springboard-india-6-380x21.webp 420w\" sizes=\"(max-width: 1430px) 100vw, 1430px\" \/><\/figure>\n\n\n\n<p>6) Let us visualize the clusters using the plt.scatter method. 
Here we are going to create 3 scatter plots, one for each cluster, and label them as Cluster 1, Cluster 2 and Cluster 3.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"203\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-1200x203.png\" alt=\"K Means Clustering, plt.scatter method\" class=\"wp-image-46228\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-1200x203.png 1200w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-400x68.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-768x130.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-380x64.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-700x118.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method.png 1428w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-plt.scatter-method-380x64.png 420w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n\n\n\n<p>7) We have added labels for the X and Y axes as CreditScore and Purchase amount. 
The output plot will look like this.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1022\" height=\"568\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph.png\" alt=\"K Means Clustering, output plot graph\" class=\"wp-image-46230\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph.png 1022w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph-400x222.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph-768x427.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph-380x211.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph-700x389.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2020\/06\/k-means-clustering-output-plot-graph-380x211.png 420w\" sizes=\"(max-width: 1022px) 100vw, 1022px\" \/><\/figure>\n\n\n\n<p>The outputs of running the K Means clustering algorithm on a dataset are:<\/p>\n\n\n\n<p>1) K centroids: One centroid for each of the K clusters identified in the dataset.<\/p>\n\n\n\n<p>2) Cluster labels: The complete dataset, labeled so that each data point is assigned to one of the clusters.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Use Cases of K-Means Algorithm<\/h3>\n\n\n\n<p>1) Document Clustering: Grouping documents based on their topics or content.&nbsp;<\/p>\n\n\n\n<p>2) IT Alert-based Clustering: By clustering operational alerts, we can identify categories of alerts, such as network-related, database-related, or application-related alerts, and support estimates of mean time to repair or predictions of system failures.<\/p>\n\n\n\n<p>3) Fraud Detection: Identifying fraudulent records in a historical dataset and clustering them into one group.<\/p>\n\n\n\n<p>4) Market Segmentation: Clustering a customer database and grouping customers into different market segments.<\/p>\n\n\n\n<p>K Means clustering is one of the most widely used clustering algorithms and is very popular among data experts. Other clustering techniques include density-based clustering and hierarchical clustering. We will discuss these algorithms in the coming posts.<\/p>\n\n\n\n<p class=\"rm has-background\" style=\"background-color:#efeff6\"><strong>Since you\u2019re here\u2026<br><\/strong>Curious about a career in data science? Experiment with our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/resources\/guides\/data-science-process\/\" target=\"_blank\">free data science learning path<\/a>, or join our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/courses\/data-science-career-track\/\" target=\"_blank\">Data Science Bootcamp<\/a>, where you\u2019ll get your tuition back if you don&#8217;t land a job after graduating. 
We\u2019re confident because our courses work \u2013 check out our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/success\/\" target=\"_blank\">student success stories<\/a> to get inspired.<\/p>\n\n\n\n<p><br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, we are going to discuss the &#8216;K Means clustering Machine Learning algorithm&#8217;. Unlike the KNN Algorithm, K Means clustering is an Unsupervised Learning algorithm. Unsupervised learning does not involve the target output which means no training is provided to the system. And the system must learn on its own through determining [&hellip;]<\/p>\n","protected":false},"author":100,"featured_media":46234,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-23632","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/23632"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=23632"}],"version-history":[{"count":3,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/23632\/revisions"}],"predecessor-version":[{"id":46233,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/23632\/revisions\/46233"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/46234"}],"wp:attachment":[{"href":"https:\/\/www.sp
ringboard.com\/blog\/wp-json\/wp\/v2\/media?parent=23632"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=23632"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=23632"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=23632"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}