K-Nearest Neighbors (KNN) is one of the simplest yet most effective classification algorithms in machine learning. Unlike models that learn an explicit decision boundary during training, KNN makes predictions at runtime by looking at the K training samples closest to a new data point and taking a majority vote. This "lazy learning" approach requires no model fitting, which makes it fast to set up but slower at prediction time on large datasets.
In this tutorial you will build a KNN classifier on the Wine dataset, a 13-feature, 178-sample chemical analysis dataset with three cultivar classes. You will explore why raw feature magnitudes hurt KNN, standardize the data, and use cross-validation to find the K that minimizes cross-validation error. The final model reaches over 98 % accuracy.
Prerequisites: Python 3.x, scikit-learn, NumPy, Pandas, Matplotlib, and Seaborn (the setup code imports it).
How KNN Classifies New Data
K is the number of nearest neighbors KNN consults before making a prediction. When K=1 the algorithm is called the nearest neighbor algorithm: it finds the single training point closest to the query point and assigns that training point's label. As K increases, more neighbors vote on the outcome and the decision boundary becomes smoother.
To measure "closeness," KNN relies on a distance metric. The most common choices are:
- Euclidean distance — straight-line distance between two points in feature space
- Manhattan distance — sum of absolute differences along each axis
- Minkowski distance — a generalization that reduces to Euclidean when the power parameter equals 2
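The three metrics above can be sketched with a single helper. This is a minimal illustration, not scikit-learn's implementation; the function name minkowski_distance and the sample points u and v are made up for the example.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two 1-D points.

    p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical sample points chosen so the arithmetic is easy to check by hand.
u = [0.0, 0.0]
v = [3.0, 4.0]

euclidean = minkowski_distance(u, v, p=2)  # sqrt(3^2 + 4^2)
manhattan = minkowski_distance(u, v, p=1)  # |3| + |4|
print(euclidean, manhattan)
```

In scikit-learn the same choice is exposed through the metric and p parameters of KNeighborsClassifier, whose default is Minkowski with p=2.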
Given a new point x, KNN follows three steps: calculate the distance from x to every training point, rank those distances to find the K smallest, and assign the majority class label among those K neighbors.
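The three steps can be written out directly in NumPy. The sketch below assumes Euclidean distance and a simple majority vote; knn_predict and the toy arrays are hypothetical names used only for illustration, and the brute-force approach is meant for understanding rather than speed.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_query = np.asarray(x_query, dtype=float)

    # Step 1: Euclidean distance from the query to every training point.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]

    # Step 3: majority class label among those k neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters labelled 0 and 1.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_first = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_near_second = knn_predict(X_toy, y_toy, [5.0, 5.1], k=3)
print(pred_near_first, pred_near_second)
```

Setting k=1 in this helper reproduces the nearest neighbor algorithm: the query simply takes the label of its closest training point.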
The Curse of Dimensionality
KNN works well in low-dimensional feature spaces, but performance degrades as the number of features grows — a problem known as the curse of dimensionality. In high-dimensional spaces, all points tend to become roughly equidistant from each other, making the concept of "nearest neighbor" meaningless.
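One way to see the "roughly equidistant" effect is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below uses uniformly random points, which is an assumption made for illustration and not a property of the Wine data; relative_contrast is a hypothetical helper name.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrast_low = relative_contrast(500, 2)      # 2 features
contrast_high = relative_contrast(500, 1000)  # 1000 features

# A large value means "nearest" is clearly nearer than "farthest";
# a value close to 0 means all points sit at about the same distance.
print(contrast_low, contrast_high)
```

With only 13 features, the Wine dataset sits far from this regime, which is consistent with KNN performing well on it later in the tutorial.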
To mitigate this, you can apply Principal Component Analysis (PCA) to reduce the number of features before running KNN, or use a feature selection step to discard low-importance variables. Because Euclidean distances become less discriminative in very high dimensions, cosine similarity is sometimes preferred in those settings, although which metric works best depends on the data.
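A PCA step can be chained in front of KNN with a scikit-learn pipeline. The sketch below is illustrative only: the choice of 5 components, K=5, and 5-fold cross-validation are arbitrary assumptions, and the Wine data is used simply because it ships with scikit-learn, not because 13 features require reduction.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale first (PCA is variance-based), project to 5 components, then classify.
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training portion.
pca_scores = cross_val_score(pca_knn, X, y, cv=5, scoring="accuracy")
print(pca_scores.mean())
```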
Choosing the Optimal K
K is a hyperparameter — you set it before training, and its value strongly shapes the model's behavior. A small K produces a highly flexible, low-bias but high-variance boundary that overfits to local noise. A large K smooths the boundary and reduces variance, but can introduce bias by averaging over points that are not truly similar to the query.
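The effect of K on flexibility shows up when training accuracy is compared with accuracy on held-out data. The sketch below is one way to observe it; the K values 1, 15, and 101 and the split parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit the scaler on the training portion only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_val_s = scaler.transform(X_tr), scaler.transform(X_val)

train_acc, val_acc = {}, {}
for k in (1, 15, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    train_acc[k] = model.score(X_tr_s, y_tr)   # accuracy on data the model stored
    val_acc[k] = model.score(X_val_s, y_val)   # accuracy on unseen data
    print(k, train_acc[k], val_acc[k])
```

With K=1 every training point is its own nearest neighbor, so training accuracy says little about generalization. With K=101 out of 124 training samples the vote is dominated by the larger classes, which is the high-bias end of the trade-off.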
No single value of K works best for every dataset. The standard approach is to try multiple values and evaluate each one using k-fold cross-validation, a technique that holds out one subset (fold) of the training data as a validation set, fits the model on the rest, and repeats this process k times so every sample is validated exactly once. Here the lowercase k is the number of folds, not the number of neighbors K. Averaging the k validation scores gives a more stable estimate of performance on unseen data than a single split would.
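The fold mechanics can be checked on a handful of sample indices. The sketch below uses scikit-learn's KFold with 5 folds over 20 dummy samples; both numbers are arbitrary and chosen only to keep the output small.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
indices = np.arange(n_samples)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

validated_count = np.zeros(n_samples, dtype=int)
fold_sizes = []
for train_idx, val_idx in kfold.split(indices):
    # Each iteration trains on 4 folds and validates on the remaining one.
    validated_count[val_idx] += 1
    fold_sizes.append(len(val_idx))

print(fold_sizes)        # size of each validation fold
print(validated_count)   # how many times each sample was used for validation
```

For classification, cross_val_score with an integer cv uses a stratified variant of this split, so that each fold keeps roughly the original class proportions.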
Building the Classifier on the Wine Dataset
The Wine dataset contains 178 samples, each described by 13 chemical measurements (alcohol content, malic acid, ash, and ten more). The target is the wine cultivar class: class_0, class_1, or class_2. The dataset is built into scikit-learn, so no download is needed.
Load the dataset and inspect its structure:
from sklearn import datasets
wine = datasets.load_wine()
wine.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
Print the full description to confirm the feature list and class distribution:
print(wine.DESCR)
.. _wine_dataset:
Wine recognition dataset
------------------------
**Data Set Characteristics:**
:Number of Instances: 178 (50 in each of three classes)
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- class:
- class_0
- class_1
- class_2
:Summary Statistics:
============================= ==== ===== ======= =====
Min Max Mean SD
============================= ==== ===== ======= =====
Alcohol: 11.0 14.8 13.0 0.8
Malic Acid: 0.74 5.80 2.34 1.12
Ash: 1.36 3.23 2.36 0.27
Alcalinity of Ash: 10.6 30.0 19.5 3.3
Magnesium: 70.0 162.0 99.7 14.3
Total Phenols: 0.98 3.88 2.29 0.63
Flavanoids: 0.34 5.08 2.03 1.00
Nonflavanoid Phenols: 0.13 0.66 0.36 0.12
Proanthocyanins: 0.41 3.58 1.59 0.57
Colour Intensity: 1.3 13.0 5.1 2.3
Hue: 0.48 1.71 0.96 0.23
OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71
Proline: 278 1680 746 315
============================= ==== ===== ======= =====
:Missing Attribute Values: None
:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.
Inspect the first three rows of raw feature data:
wine.data[: 3]
array([[1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
1.065e+03],
[1.320e+01, 1.780e+00, 2.140e+00, 1.120e+01, 1.000e+02, 2.650e+00,
2.760e+00, 2.600e-01, 1.280e+00, 4.380e+00, 1.050e+00, 3.400e+00,
1.050e+03],
[1.316e+01, 2.360e+00, 2.670e+00, 1.860e+01, 1.010e+02, 2.800e+00,
3.240e+00, 3.000e-01, 2.810e+00, 5.680e+00, 1.030e+00, 3.170e+00,
1.185e+03]])
Notice that Proline values sit in the thousands while Nonflavanoid phenols values sit near 0.1–0.6. Without scaling, Euclidean distance will be dominated by the large-magnitude features, which is exactly the problem you will fix in the standardization step.
The dataset has 13 features and a three-class target variable. Import the supporting libraries and extract the feature matrix X and target vector y:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
X = wine.data
y = wine.target
X.shape, y.shape
((178, 13), (178,))
Splitting the Data
Splitting the data into a training set and a test set lets you measure how well the model generalizes to unseen samples. Use train_test_split() with a 70/30 split and stratify=y to preserve the class proportions in both subsets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)
Training KNN Without Standardization
Before scaling, train KNN with three different values of K to see how raw features perform. Import the classifier and the accuracy metric:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
Train with K=3:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
Accuracy: 0.6851851851851852
Train with K=5:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
Accuracy: 0.7222222222222222
Train with K=7:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
Accuracy: 0.7407407407407407
Accuracy improves slightly as K grows, but all three models top out below 75 %. This weak performance is a direct consequence of unscaled features — high-magnitude variables like Proline dominate the distance calculation and mask the information contained in lower-magnitude features.
Inspect the raw feature values again to confirm the scale mismatch:
X[: 3]
array([[1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
1.065e+03],
[1.320e+01, 1.780e+00, 2.140e+00, 1.120e+01, 1.000e+02, 2.650e+00,
2.760e+00, 2.600e-01, 1.280e+00, 4.380e+00, 1.050e+00, 3.400e+00,
1.050e+03],
[1.316e+01, 2.360e+00, 2.670e+00, 1.860e+01, 1.010e+02, 2.800e+00,
3.240e+00, 3.000e-01, 2.810e+00, 5.680e+00, 1.030e+00, 3.170e+00,
1.185e+03]])
The Proline column (last) contains values like 1065 and 1185, whereas the Nonflavanoid phenols values (eighth column) in these rows are all 0.3 or lower. This scale mismatch is what cripples KNN's distance metric.
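The size of the mismatch can be quantified. For two samples drawn independently, the expected squared Euclidean distance is twice the sum of the per-feature variances, so each feature's share of the total variance approximates its share of the distance. The sketch below computes that share for Proline; it is an illustrative calculation, not part of the original workflow.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data

# Per-feature variance and its share of the total variance.
variances = X.var(axis=0)
share = variances / variances.sum()

proline_idx = list(wine.feature_names).index("proline")
proline_share = share[proline_idx]

print(f"Proline share of total variance: {proline_share:.4f}")
```

On the raw data, nearly all of the distance between two wines comes from this single column, so the other twelve measurements barely influence which neighbors are chosen.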
Improving KNN with Feature Standardization
Standardization — also called z-score normalization — transforms each feature to have zero mean and unit variance. After scaling, every feature contributes equally to Euclidean distance, regardless of its original magnitude.
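The transformation is z = (x − mean) / standard deviation, applied column by column. The sketch below computes it by hand with NumPy and compares the result against StandardScaler; it assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Manual z-score: subtract each column's mean, divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0 (population standard deviation)
X_manual = (X - mean) / std

# The same transformation through scikit-learn.
X_sklearn = StandardScaler().fit_transform(X)

matches = np.allclose(X_manual, X_sklearn)
print(matches)
```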
Apply StandardScaler to the full feature matrix and inspect the result:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[: 3]
array([[ 1.51861254, -0.5622498 , 0.23205254, -1.16959318, 1.91390522,
0.80899739, 1.03481896, -0.65956311, 1.22488398, 0.25171685,
0.36217728, 1.84791957, 1.01300893],
[ 0.24628963, -0.49941338, -0.82799632, -2.49084714, 0.01814502,
0.56864766, 0.73362894, -0.82071924, -0.54472099, -0.29332133,
0.40605066, 1.1134493 , 0.96524152],
[ 0.19687903, 0.02123125, 1.10933436, -0.2687382 , 0.08835836,
0.80899739, 1.21553297, -0.49840699, 2.13596773, 0.26901965,
0.31830389, 0.78858745, 1.39514818]])
All features now sit on the same scale — values typically range between −3 and +3. Now retrain with K=7 on the scaled data:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state = 0, stratify = y)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
Accuracy: 0.9629629629629629
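Accuracy rises from 74 % to 96 % once the features are standardized, which shows how sensitive KNN is to feature scale. Note that this run uses random_state = 0 rather than the 42 used for the unscaled models, so the two figures come from different train/test splits and the comparison is indicative rather than strictly like-for-like.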
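The code above fits the scaler on all 178 samples before splitting, so the test rows contribute to the means and standard deviations. A stricter variant fits the scaler on the training split only, which a scikit-learn pipeline handles automatically. The sketch below is an alternative under that assumption, not the tutorial's original code, and its accuracy may differ slightly from the figure quoted above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics to transform X_test.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
pipe.fit(X_train, y_train)

pipeline_accuracy = pipe.score(X_test, y_test)
print("Accuracy: ", pipeline_accuracy)
```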
Finding the Optimal K with Cross-Validation
Rather than guessing K manually, use k-fold cross-validation to systematically evaluate every candidate K. The idea is to partition the data into 10 roughly equal folds; for each candidate K, the model is trained on 9 folds and validated on the remaining fold, and this process repeats 10 times. Averaging the 10 accuracy scores yields a robust estimate of how each K generalizes. Note that the code below passes the full scaled dataset (X_scaled, y) to cross_val_score, so the folds are drawn from all 178 samples rather than from the training split alone.
Using the validation set as a selection criterion rather than the test set is critical. If you were to select K by its test-set performance, you would be inadvertently fitting to the test set, underestimating the true error rate and losing the ability to evaluate generalization honestly — a classic overfitting trap.
Import cross_val_score and compute the cross-validated accuracy for every odd K from 1 to 49:
from sklearn.model_selection import cross_val_score
neighbors = list(range(1, 50, 2))
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_scaled, y, cv=10, scoring = 'accuracy')
    cv_scores.append(scores.mean())
Preview the cross-validated accuracy for the first five K values:
cv_scores[: 5]
[0.9434640522875817, 0.9545751633986927, 0.9604575163398692, 0.9663398692810456, 0.9718954248366012]
Convert accuracy to misclassification error (1 − accuracy) to make the minimum easy to find. The list is named MSE in the code, but it holds misclassification error, not mean squared error:
MSE = [1 - x for x in cv_scores]
MSE[: 5]
[0.05653594771241832, 0.04542483660130725, 0.0395424836601308, 0.03366013071895435, 0.028104575163398815]
Find the K with the lowest cross-validation error:
optimal_k = neighbors[MSE.index(min(MSE))]
print('The optimal number of k is: ', optimal_k)
The optimal number of k is: 23
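The same search can be restricted to the training split so that the test samples play no part in choosing K. The sketch below uses GridSearchCV with a scaling pipeline; it is an alternative formulation under that assumption, and the K it selects need not match the value reported above, since the folds and the scaling statistics differ.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 50, 2))}

# 10-fold cross-validation on the training split only.
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # refit model, untouched test data
print("Selected K:", best_k, "Test accuracy:", test_accuracy)
```

By default GridSearchCV refits the best configuration on the whole training split, so search can be used directly for prediction afterwards.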
Plot how error varies across all candidate K values to visualize the relationship:
plt.plot(neighbors, MSE)
plt.xlabel('Number of K')
plt.ylabel('Error')
plt.title('Variation of error with changing K')
plt.show()
The plot shows error declining as K increases from 1, reaching a minimum around K=23, then rising again as the model becomes too smooth. This U-shape is the classic bias-variance trade-off curve for KNN.
Training the Final Model at Optimal K
With the optimal K identified, train the final model and evaluate it on the held-out test set:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state = 0, stratify = y)
knn = KNeighborsClassifier(n_neighbors=23)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
Accuracy: 0.9814814814814815
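The final model achieves 98.1 % accuracy on the test set, a significant improvement over the 68–74 % baseline from the unscaled, untuned models. Because K was selected by cross-validating over the full dataset, these test samples also took part in choosing K, so the figure is likely somewhat optimistic.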
Conclusion
In this tutorial you built a KNN classifier on the Wine dataset, starting from raw unscaled features and finishing with a cross-validated, standardized model. Without standardization, the best accuracy across K=3, 5, and 7 was under 75 %. After applying StandardScaler and using 10-fold cross-validation to identify K=23 as optimal, the final model reached 98.1 % accuracy — demonstrating that data preparation and hyperparameter selection matter as much as the algorithm itself.
Key takeaways:
- KNN makes predictions by majority vote among the K nearest training samples; fitting only stores the training data, and all distance computation is deferred to prediction time.
- Euclidean distance is sensitive to feature magnitude — always standardize your features before using any distance-based algorithm.
- A small K gives a flexible, high-variance boundary; a large K gives a smoother, higher-bias boundary. Cross-validation finds the sweet spot.
- In high-dimensional spaces, Euclidean distance loses meaning; apply PCA or feature selection before running KNN on datasets with many features.
- The test set must never be used to select hyperparameters — use a validation set or cross-validation instead.
Next steps:
- Explore Support Vector Machines to see another powerful distance-aware classifier that handles high-dimensional data more efficiently.
- Read K-Means Clustering to see how the same K concept applies to unsupervised grouping of data points.
- Experiment with weights='distance' in KNeighborsClassifier: closer neighbors will count more than distant ones, which often improves accuracy on noisy datasets.
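As a starting point for the last experiment, the sketch below compares the two weighting schemes under 10-fold cross-validation. K=7 is an arbitrary choice here, and whether distance weighting helps depends on the dataset, so treat the comparison as something to run rather than a result to expect.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

scores_by_weighting = {}
for weighting in ("uniform", "distance"):
    # 'uniform' gives every neighbor one vote;
    # 'distance' weights each vote by the inverse of its distance.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=7, weights=weighting),
    )
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    scores_by_weighting[weighting] = scores.mean()

print(scores_by_weighting)
```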
