Atal Bihari Vajpayee once said, “You can change friends but not neighbors.”
We often judge people by the group they live among. People who belong to a particular group are usually considered similar based on the characteristics they share. This is the simple principle on which the KNN algorithm works: “Birds of a feather flock together.”
Understanding KNN with an Example
Let us consider a simple example from Game of Thrones to understand the KNN algorithm. Imagine you have to design a classification algorithm to identify whether a stranger is a Westerosi or a Dothraki. Different features can be used to decide which group the stranger belongs to. For instance, a Dothraki is likely to have greater muscle mass, whereas a Westerosi is likely to be wealthy. In this case, wealth and muscle mass are the independent variables, or features. When you place the stranger among a crowd of known Westerosi and Dothraki, you can classify the person by majority voting, i.e. based on whether most of the stranger's nearest neighbors belong to the Westerosi or to the Dothraki.
How does the KNN algorithm work?
K nearest neighbors is a supervised machine learning algorithm often used in classification problems. It works on the simple assumption that “the apple does not fall far from the tree”, meaning similar things tend to lie close to each other. The algorithm classifies a data point based on how its neighbors are classified: any new case is compared against all the available cases using a similarity (or distance) measure. Technically, the algorithm classifies an unknown item by finding the k already-classified items nearest to it in feature space and assigning the label that wins the majority vote among them.
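The voting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `knn_classify`, the use of Euclidean distance, and the (wealth, muscle mass) scores are all assumptions made up for the Westerosi/Dothraki example.

```python
from collections import Counter
import math

def knn_classify(query, points, labels, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point (Euclidean, by assumption).
    distances = [(math.dist(query, p), label) for p, label in zip(points, labels)]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (wealth, muscle mass) scores on a 0-10 scale.
points = [(9, 3), (8, 2), (7, 4), (2, 9), (3, 8), (1, 7)]
labels = ["Westerosi", "Westerosi", "Westerosi", "Dothraki", "Dothraki", "Dothraki"]

stranger = (2, 8)  # low wealth, high muscle mass
print(knn_classify(stranger, points, labels, k=3))  # prints Dothraki
```

All three of the stranger's nearest neighbors are Dothraki, so the vote is unanimous here; with noisier data the choice of k starts to matter.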
KNN is a –
- Lazy Learning Algorithm – It is a lazy learner because it does not have a training phase but rather memorizes the training dataset. All computations are delayed until classification.
- Case-Based Learning Algorithm – The algorithm uses raw training instances from the problem domain to make predictions and is often referred to as an instance-based or case-based learning algorithm. Case-based learning implies that KNN does not explicitly learn a model. Rather, it memorizes the training instances/cases, which are then used as “knowledge” for the prediction phase. Given an input, when we ask the algorithm to predict a label, it uses the memorized training instances to produce an answer.
- Non-Parametric Algorithm – A non-parametric method does not assume a fixed functional form for the relationship between features and labels, so it has no fixed set of parameters to estimate from the data. In KNN, irrespective of the size of the data, the only value you choose is K, which is a hyperparameter rather than a learned parameter. The trade-off is cost: KNN has a minimal training phase, but it pays for this in both memory and computation. Memory, because the entire training set must be stored; computation, because classifying a single observation at test time requires a pass over the complete dataset.
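The "lazy" behavior in the list above can be made concrete with a small sketch. The class name `LazyKNN` and the `distance_calls` counter are hypothetical, added only to show where the work happens: fitting merely stores the data, while every prediction scans all of it.

```python
import math

class LazyKNN:
    """Stores the training set at fit time and defers all work to predict."""

    def fit(self, points, labels):
        # "Training" is only memorization: no parameters are estimated.
        self.points = list(points)
        self.labels = list(labels)
        self.distance_calls = 0  # bookkeeping for illustration only
        return self

    def predict(self, query, k=1):
        # Every prediction computes a distance to each stored point.
        scored = []
        for point, label in zip(self.points, self.labels):
            scored.append((math.dist(query, point), label))
            self.distance_calls += 1
        scored.sort(key=lambda pair: pair[0])
        top = [label for _, label in scored[:k]]
        # Majority vote among the k nearest labels.
        return max(set(top), key=top.count)

model = LazyKNN().fit([(0, 0), (1, 1), (5, 5), (6, 6)], ["a", "a", "b", "b"])
print(model.predict((0.5, 0.5), k=3), model.distance_calls)  # prints: a 4
```

One query against four stored points costs four distance computations, which is the per-prediction cost that grows with the size of the training set.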
What is K in KNN algorithm?
K in KNN is the number of nearest neighbors considered when assigning a label to the current point. K is an extremely important parameter, and choosing its value is the most critical problem when working with the KNN algorithm. The process of choosing the right value of K is referred to as parameter tuning and is of great significance in achieving better accuracy. If K is too small, the prediction is sensitive to noise and the model tends to overfit; if K is too large, the neighborhood starts to include points from other classes and the model tends to underfit. Most data scientists choose an odd value for K when the number of classes is 2, so that the vote cannot end in a tie. A common rule of thumb for choosing K is k = sqrt(n), where n is the total number of data points.
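As a rough sketch, the square-root rule and the odd-K convention can be combined into one small helper. The name `rule_of_thumb_k` and the choice to round up to the next odd number are assumptions for illustration; the result is only a starting point for tuning.

```python
import math

def rule_of_thumb_k(n_points):
    """Return an odd K close to sqrt(n), as a starting point for tuning."""
    k = max(1, round(math.sqrt(n_points)))
    # Nudge even values up by one so a two-class vote cannot tie.
    return k if k % 2 == 1 else k + 1

for n in (25, 100, 400):
    print(n, rule_of_thumb_k(n))
# prints:
# 25 5
# 100 11
# 400 21
```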
Selecting the value of K depends on the individual case, and sometimes the best method is to run through different values of K and compare the outcomes. Using cross-validation, the KNN algorithm can be tested for different values of K, and the value that gives the highest validation accuracy can be taken as the optimal K for that dataset.
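One way to sketch this search is leave-one-out cross-validation, where each point is predicted from all the others. The function names (`knn_predict`, `loo_accuracy`, `best_k`) and the toy dataset, which contains one deliberately mislabeled point, are hypothetical; a real project would more likely use a library's cross-validation utilities.

```python
from collections import Counter
import math

def knn_predict(query, points, labels, k):
    # Rank stored points by Euclidean distance to the query, then vote.
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(query, pl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    correct = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        correct += knn_predict(point, rest_points, rest_labels, k) == label
    return correct / len(points)

def best_k(points, labels, candidates):
    scores = {k: loo_accuracy(points, labels, k) for k in candidates}
    # Ties go to the smallest K among the top scorers.
    return max(sorted(scores), key=lambda k: scores[k]), scores

# Two clusters plus one noisy "b" sitting inside the "a" cluster.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (1.5, 1.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

k, scores = best_k(points, labels, candidates=[1, 3, 5])
print(k, scores)
```

On this toy data K = 1 is misled by the noisy point, because every "a" point has it as its single nearest neighbor, while K = 3 outvotes it. That is the overfitting effect of a very small K in miniature.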
When should you use the KNN algorithm?
The KNN algorithm is a good choice if you have a small dataset and the data is labeled and noise-free. When the dataset is small, the classifier finishes in a short time. If your dataset is large, plain KNN becomes impractical unless you speed up the neighbor search, because every prediction has to scan the stored data.
Pros of Using KNN
1) KNN is a good first step for machine learning beginners as it is very easy to explain, simple to understand, and surprisingly powerful. It often yields competitive results despite its simplicity. A well-known application is collaborative filtering for recommender systems, where neighbor-based methods are commonly associated with product recommendations such as those on Amazon.
2) KNN is a non-parametric algorithm and does not require any assumptions on the data distribution. This gives KNN an extra edge in specific settings where the data is highly unusual. This is the reason for KNN being the first choice when there is no prior knowledge or very little knowledge about the data distribution.
3) It is a versatile supervised machine learning algorithm that can be used for both regression and classification problems, and also for search.
4) This algorithm does not have an explicit training step, as it is an instance-based learning algorithm. "Training" KNN is fast because it involves only storing the feature vectors and class labels of the training samples. Given the minimal training time, KNN can be a good choice for a quick first analysis of a dataset on which you are planning to run more complex algorithms.
5) Many classification algorithms are formulated for the binary setting, and using them for multi-class problems requires an extension from binary or a transformation to binary. KNN lends itself naturally to multi-class datasets, since majority voting works for any number of classes.
6) Flexible distance criteria to choose from when building a KNN model: Euclidean, Manhattan, and Hamming distance. Each distance function suits a different type of dataset. Based on the nature of the features, it is possible to choose the best option: Manhattan and Euclidean for numeric features, and Hamming for categorical features.
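The three distance functions named in the last point can be written out directly. This is a plain-Python sketch with illustrative function names; the example vectors are made up.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the two (categorical) vectors disagree.
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
print(hamming(("red", "small", "round"), ("red", "large", "round")))  # 1
```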
Cons of Using KNN
1) KNN does not have a training phase, but this comes at the cost of a relatively expensive prediction step. Every time a prediction is made, the algorithm searches for the nearest neighbors in the complete training set. The search can be sped up with data structures such as KD-trees and ball trees.
2) The efficiency of the algorithm declines quickly as the dataset grows, because a brute-force search compares each query against every stored point.
3) It cannot handle missing values: you need a complete feature vector for each instance to compute the distance. One way to deal with this is to fill each missing value with the average value of that feature across the entire dataset.
4) It suffers from skewed class distributions, meaning that if a specific class occurs very frequently in the training set, it is likely to dominate the majority vote for a new example.
5) The accuracy of KNN deteriorates with high-dimensional data, as there is hardly any difference between the distance to the nearest and to the farthest neighbor. High dimensionality is a major problem when working with distance-based classification algorithms like KNN. This is the curse of dimensionality: as the number of features grows, distances concentrate and "nearest" carries less information. Search structures such as KD-trees also lose much of their speed advantage in high dimensions.
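The claim in the last point, that nearest and farthest neighbors become hard to tell apart, can be checked with a small experiment. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, distance is Euclidean, and the function name, sample size and seed are arbitrary choices.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    # A ratio near 0 means the nearest point is much closer than the farthest;
    # a ratio near 1 means all points are almost equally far away.
    return min(distances) / max(distances)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

The ratio should climb toward 1 as the dimension increases, which is why neighbor votes become less reliable when there are many features.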
Few Applications of KNN Algorithm
1) The biggest application of KNN is recommender systems: recommending ads to display to a user (YouTube), recommending products (Amazon), or recommending media to consume. For example, if you buy a smartphone from Amazon, it recommends a mobile cover or earphones to go with it.
2) KNN is used in the retail industry to identify patterns in credit card usage. Some transaction-scrutinizing software applications use KNN to analyze register data and detect unusual or suspicious activity. For instance, if the register data of a retail store shows that a lot of card information is being entered manually instead of being scanned or swiped, this may indicate that an employee is stealing customers' information.
3) KNN also finds application in politics for classifying a potential voter as a “will vote” or “will not vote” candidate.
4) Other advanced applications of KNN include video recognition, image recognition, and handwriting detection.