K Nearest Neighbors with Python | ML

K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining, and intrusion detection. The K-Nearest Neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. KNN rests on the assumption that similar things exist in close proximity: in other words, similar things are near to each other. KNN captures this idea of similarity (sometimes called distance, proximity, or closeness) with mathematics we might have learned in childhood: calculating the distance between points on a graph. There are other ways of calculating distance, which may be preferable depending on the problem we are solving, but the straight-line distance (also called the Euclidean distance) is a popular and familiar choice. KNN is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data). This article illustrates K-nearest neighbors on sample random data using the sklearn library.
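
To make the distance idea concrete, here is a minimal sketch (using NumPy and two made-up points; the values are illustrative, not from the article's dataset) of how the straight-line distance between two observations is computed:

Python3

import numpy as np

# Two hypothetical observations, each with two features
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean (straight-line) distance:
# sqrt((1 - 4)**2 + (2 - 6)**2) = sqrt(9 + 16) = 5.0
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)  # 5.0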

Importing Libraries and Dataset

Python libraries make it very easy to handle data and to perform typical and complex tasks with a single line of code.

  • Pandas – This library helps to load data into a 2D DataFrame format and has multiple functions to perform analysis tasks in one go.
  • Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
  • Matplotlib/Seaborn – These libraries are used to draw visualizations.
  • Sklearn – This module contains multiple pre-implemented functions for tasks ranging from data preprocessing to model development and evaluation.

Python3

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Now let’s load the dataset into a pandas DataFrame. You can download the dataset, which is used for illustration purposes in this article, from here.

Python3

df = pd.read_csv("prostate.csv")
df.head()

Output:

First five rows of the dataset

Standardize the Variables

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between observations, and hence on the KNN classifier, than variables that are on a small scale.
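
As a toy illustration (the numbers below are made up, not taken from the dataset), a feature measured in large units dominates the Euclidean distance until the variables are brought onto a comparable scale:

Python3

import numpy as np

# Two made-up observations: (income in dollars, age in years)
a = np.array([50000.0, 30.0])
b = np.array([52000.0, 60.0])

# Unscaled: the income gap (2000) swamps the age gap (30)
print(np.linalg.norm(a - b))  # ~2000.22

# Dividing each feature by an assumed standard deviation puts
# both features on a comparable footing
std = np.array([10000.0, 15.0])
print(np.linalg.norm(a / std - b / std))  # ~2.01

With that in mind, we standardize the features with sklearn's StandardScaler: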

Python3

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the feature columns only (everything
# except the Target column), then transform them
scaler.fit(df.drop('Target', axis=1))
scaled_features = scaler.transform(df.drop('Target',
                                           axis=1))

# Rebuild a DataFrame of the standardized features
df_feat = pd.DataFrame(scaled_features,
                       columns=df.columns[:-1])
df_feat.head()

Output:

Features for the KNN model

Model Development and Evaluation

Now we will train a model using the sklearn implementation of the KNN algorithm. After training, we will evaluate the model using the confusion matrix and the classification report.

Python3

from sklearn.metrics import classification_report,\
    confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
 
X_train, X_test,\
    y_train, y_test = train_test_split(scaled_features,
                                       df['Target'],
                                       test_size=0.30)

# Remember that we are trying to come up
# with a model to predict the Target class.
# We'll start with k = 1.
 
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
 
# Predictions and Evaluations
# Let's evaluate our KNN model !
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

Output:

[[19  3]
 [ 2  6]]
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        22
           1       0.67      0.75      0.71         8

    accuracy                           0.83        30
   macro avg       0.79      0.81      0.79        30
weighted avg       0.84      0.83      0.84        30
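
As a sanity check, the headline numbers in this report can be recomputed directly from the confusion matrix above (a small sketch; the matrix values are copied from the output):

Python3

import numpy as np

# Confusion matrix from the K = 1 run:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = np.array([[19, 3],
               [2, 6]])

# Accuracy = correct predictions / all predictions
print(cm.trace() / cm.sum())      # 25 / 30 ~ 0.83

# Recall for class 1 = true positives / actual positives
print(cm[1, 1] / cm[1].sum())     # 6 / 8 = 0.75

# Precision for class 1 = true positives / predicted positives
print(cm[1, 1] / cm[:, 1].sum())  # 6 / 9 ~ 0.67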

Elbow Method

Let’s go ahead and use the elbow method to pick a good K value.

Python3

error_rate = []
 
# Will take some time
for i in range(1, 40):
 
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
 
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue',
         linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
 
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()

Output:

Elbow method to determine a good value of K

Here we can observe that the error rate oscillates for small values of K and then roughly levels off as K grows. So let’s take K = 10, a value at which the error rate is comparatively low and stable.
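
As an aside, a more robust way to choose K than eyeballing the test-set error is cross-validation, since it does not reuse the same held-out split for every candidate. A minimal sketch (the 5-fold setting is an assumption, not part of the original article):

Python3

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 5-fold cross-validation accuracy for each candidate K
cv_scores = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(cross_val_score(knn, X_train,
                                     y_train, cv=5).mean())

best_k = 1 + cv_scores.index(max(cv_scores))
print('Best K by cross-validation:', best_k)

Returning to the article’s simple train/test comparison: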

Python3

# FIRST A QUICK COMPARISON TO OUR ORIGINAL K = 1
knn = KNeighborsClassifier(n_neighbors=1)
 
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
 
print('WITH K = 1')
print('Confusion Matrix')
print(confusion_matrix(y_test, pred))
print('Classification Report')
print(classification_report(y_test, pred))

Output:

WITH K = 1
Confusion Matrix
[[19  3]
 [ 2  6]]
Classification Report
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        22
           1       0.67      0.75      0.71         8

    accuracy                           0.83        30
   macro avg       0.79      0.81      0.79        30
weighted avg       0.84      0.83      0.84        30

Now let’s evaluate the performance of the model using the value of K for which the error rate is the lowest.

Python3

# NOW WITH K = 10
knn = KNeighborsClassifier(n_neighbors=10)
 
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
 
print('WITH K = 10')
print('Confusion Matrix')
print(confusion_matrix(y_test, pred))
print('Classification Report')
print(classification_report(y_test, pred))

Output:

WITH K = 10
Confusion Matrix
[[21  1]
 [ 3  5]]
Classification Report
              precision    recall  f1-score   support

           0       0.88      0.95      0.91        22
           1       0.83      0.62      0.71         8

    accuracy                           0.87        30
   macro avg       0.85      0.79      0.81        30
weighted avg       0.86      0.87      0.86        30

Great! We squeezed some more performance out of our model by tuning it to a better K value.

Advantages of KNN:

  • It is easy to understand and implement.
  • It can handle multiclass classification problems.
  • It is useful when the data does not have a clear distribution.
  • It takes a non-parametric approach.

Disadvantages of KNN:

  • Sensitive to noisy features in the dataset (one common mitigation is sketched below).
  • Computationally expensive for large datasets.
  • It can be biased on imbalanced datasets.
  • Requires choosing an appropriate value of K.
  • Sometimes feature normalization may be required.
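
For the noise and class-imbalance issues above, one common mitigation (a hedged suggestion, not something this article's experiments use) is to weight neighbors by inverse distance, which sklearn's KNeighborsClassifier supports directly:

Python3

from sklearn.neighbors import KNeighborsClassifier

# weights='distance' makes closer neighbors count more than distant
# ones in the vote, which can soften the influence of noisy points
# and of a dominant majority class
knn = KNeighborsClassifier(n_neighbors=10, weights='distance')
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))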
