k Nearest Neighbors (kNN) in machine learning

Divij Sharma
5 min read · Aug 18, 2021
The color of an umbrella is similar to that of its neighbors (Photo by Lazar Gugleta on Unsplash)

k Nearest Neighbors (kNN) is one of the fundamental supervised machine learning algorithms. kNN works on the philosophy that neighbors are usually similar. It can be used for both classification and regression problems, but it is popularly used for classification and seldom for regression tasks.

kNN is a very simple classification model, yet it can compete with the most accurate models because its predictions are often highly accurate. In classification problems it works for two or more categories: a new point is compared with its k nearest neighbors, and the category that occurs most often among those neighbors is picked. In a regression problem, the average of the target values of the k nearest neighbors is taken. Nearness is determined by a similarity measure (distance) to the already labeled data points.

kNN is a non-parametric, instance-based, lazy-learning machine learning algorithm.

Non-parametric methods try to find the best fit to the training data while constructing the mapping function, which also allows them to fit a large number of functional forms. In other words, they do not assume any particular distribution of the data. These methods are a good choice when we have a lot of data, no prior knowledge, and do not want to worry about choosing exactly the right features.

Instance-based learning algorithms are also called memory-based learning algorithms. These algorithms compare new instances with instances seen during training (which have been stored in memory). Because they construct hypotheses directly from the training instances themselves, hypothesis complexity grows with the number of observations.

A lazy-learning algorithm simply stores the training dataset with little or no processing. When new data is provided to the algorithm, the distance from the new data point to every stored data point is calculated, and the results are sorted in ascending order. The k smallest distances are picked and a prediction is made based on those k data points. Because the computation is postponed until a new instance is observed, the algorithm is referred to as lazy.

What does this really mean? It means that at training time the only thing the kNN model does in the background is store the data. Every distance calculation between new and stored data points happens at prediction time.
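To make this concrete, here is a minimal from-scratch sketch (not the article's code) of a lazy kNN classifier in plain NumPy. The class name SimpleKNN, the default k=3, and the Euclidean metric are choices made purely for illustration; note that fit only stores the data, while all distance computation happens inside predict.

import numpy as np

class SimpleKNN:
    """Minimal kNN classifier illustrating lazy learning."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no computation happens here.
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        # All the real work is deferred to prediction time.
        predictions = []
        for x in np.asarray(X):
            # Euclidean distance from the query point to every stored point.
            distances = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            # Indices of the k nearest stored points.
            nearest = np.argsort(distances)[:self.k]
            # Majority vote among the labels of those k points.
            labels, counts = np.unique(self.y_train[nearest], return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)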

When to use kNN?

The kNN algorithm can make very accurate predictions, so it suits applications that require high accuracy. The quality of those predictions depends on the distance measure used, hence good domain knowledge is very important before using the kNN algorithm. Also, because kNN is a lazy-learning algorithm, training time is short, but since the generation of predictions is deferred until classification, prediction time can be long. This also increases the cost of computation compared with other algorithms. All of these points make kNN a good choice for applications where accuracy is important but predictions are not requested frequently.

How to choose k

k indicates the count of nearest neighbors to consult and is the main hyperparameter of the algorithm. To determine the best value of k, we train multiple models with different values of k and compare an evaluation metric across these models.

All the models with different values of k must be trained on the same training dataset and tested on the same test dataset. This ensures that no new variables are introduced into model training: if the training and test datasets changed, the resulting models could not be compared with each other because the underlying data would differ.

Code to find best k

In the code below I have trained kNN models on the Iris dataset. In this example, I have decided that accuracy is the best metric to evaluate the models.

Training multiple kNN models
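The original code embed is not reproduced here; the sketch below shows what such a loop could look like. The 70/30 split, the random_state, and the range of k values tried are assumptions for illustration, not necessarily the article's exact choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset and split it once, so that every model
# is trained and tested on exactly the same data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train one model per candidate k and record its test accuracy.
accuracies = {}
for k in range(1, 51):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, model.predict(X_test))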

The graph of accuracy vs. k may look like the one below. You would notice that the accuracy drops for certain values of k and stays the same across many other values of k.

Accuracy vs k

Our task is now to find the minimum value of k that gives the maximum accuracy.

Finding min k with max accuracy.
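Continuing the sketch above, one way to pick the smallest k that achieves the best accuracy, using the accuracies dictionary built in the earlier snippet:

# Smallest k whose accuracy equals the best accuracy observed.
best_accuracy = max(accuracies.values())
best_k = min(k for k, acc in accuracies.items() if acc == best_accuracy)
print(f"Best k = {best_k}, accuracy = {best_accuracy:.3f}")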

After finding the minimum k with the maximum accuracy, we create the final model with that value of k. This newly trained model is the one deployed to production.

Final model to productionize
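A sketch of this final step, assuming best_k from the previous snippet. Refitting on all available data before deployment is shown here as one common choice, not necessarily the article's:

# Retrain with the chosen k before deploying.
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X, y)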

Distances for kNN

The kNN model uses distances between data points to evaluate the "nearness" of two data points. The kNN classifier in the scikit-learn package, sklearn.neighbors.KNeighborsClassifier, accepts parameters related to the distance metric, namely metric, p and metric_params. The definition of each of them, as per the documentation, is as follows:

metric: str or callable, default=‘minkowski’ — the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric.

p: int, default=2 — Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric_params: dict, default=None. Additional keyword arguments for the metric function.

This means that if the default settings of the kNN classifier are used, it calculates Euclidean distances.
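The Minkowski distance is defined as d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p), so p=1 gives Manhattan distance and p=2 gives Euclidean distance. The short sketch below shows how these parameters might be set; the variable names are illustrative.

from sklearn.neighbors import KNeighborsClassifier

# Default: Minkowski metric with p=2, i.e. Euclidean distance.
knn_euclidean = KNeighborsClassifier(n_neighbors=5)

# p=1 turns the Minkowski metric into Manhattan (l1) distance.
knn_manhattan = KNeighborsClassifier(n_neighbors=5, p=1)

# A named metric can also be passed directly.
knn_chebyshev = KNeighborsClassifier(n_neighbors=5, metric="chebyshev")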

Different distance metrics that can be used

According to the DistanceMetric documentation for KNeighborsClassifier, the distance metrics that can be passed to the classifier fall into the following categories:

Real-valued vector space distances (e.g. euclidean, manhattan, chebyshev, minkowski)
2D vector space distances (e.g. haversine)
Integer-valued vector space distances (e.g. hamming, canberra, braycurtis)
Boolean-valued vector space distances (e.g. jaccard, dice, rogerstanimoto)

One of the main aspects of choosing the correct distance metric is domain knowledge. As already discussed, good domain knowledge is very important for the kNN algorithm.
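For example, for geographic latitude/longitude data, the haversine (great-circle) distance is a better fit than Euclidean distance. A small sketch of this, where the coordinates and labels are invented for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical geographic data: latitude/longitude in degrees.
coords_deg = np.array([[37.77, -122.42], [34.05, -118.24], [40.71, -74.01]])
labels = np.array([0, 0, 1])

# Haversine distance expects coordinates in radians and works with
# the ball-tree algorithm; Euclidean distance would be misleading here.
knn = KNeighborsClassifier(n_neighbors=1, metric="haversine",
                           algorithm="ball_tree")
knn.fit(np.radians(coords_deg), labels)
print(knn.predict(np.radians([[36.0, -120.0]])))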
