KNN (K-Nearest Neighbors) Summary

2019/06/26 ML

The Core Idea of KNN

The K-nearest-neighbors algorithm works as follows: given a training (sample) set, for a new input sample, compute the distance between the new sample and every training sample and find the K training samples closest to it (its K neighbors). For classification, the new sample is assigned to the label that occurs most often among those K training samples; for regression, the mean of the K training samples' target values is used as the prediction for the new sample.
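Restated as formulas (a direct rewriting of the rule above, where $N_K\left( x \right)$ denotes the index set of the K training samples nearest to a new sample $x$ and $\mathbb{I}\left( \cdot \right)$ is the indicator function): for classification $\hat{y}=\mathrm{arg}\max_c\sum_{i\in N_K\left( x \right)}{\mathbb{I}\left( y_i=c \right)}$, and for regression $\hat{y}=\frac{1}{K}\sum_{i\in N_K\left( x \right)}{y_i}$.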

KNN has only one hyperparameter to tune, K; there are no model parameters to learn.

The hyperparameter K can be determined by cross validation. Concretely, the available data is first split into training data (train) and test data (test); the training data is then further split into training data (training data) and validation data (validation data). Different values of K are tried, and the K that gives the best performance on the validation data is selected.

For 5-fold cross validation, the original training data is split into five equal parts; each part is used once as the validation set while the other four serve as the training set, giving five runs in total. The metric (e.g. accuracy) from the five runs is averaged to obtain the final score for that value of K.
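A minimal sketch of selecting K with 5-fold cross validation via scikit-learn's cross_val_score (the iris data and the candidate list of K values below are only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 11, 13, 15]:  # candidate values of K
    # mean accuracy over the 5 folds for this K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print("best K = %d, mean CV accuracy = %.3f" % (best_k, best_score))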

As K increases, the decision boundary of KNN becomes smoother and smoother.

Note: cross validation only serves to make fuller use of the data; K can also be determined without cross validation, by evaluating candidate values directly on a single validation split of the full training data.

Practical Notes

  • Features in different dimensions should not differ too much in scale, otherwise some dimensions will effectively be ignored when computing distances. Features can be rescaled with min-max normalization [$x_{new}=\frac{x-\min \left( x \right)}{\max \left( x \right) -\min \left( x \right)}$] or z-score normalization [$x_{new}=\frac{x-mean\left( x \right)}{std\left( x \right)}$] (see the sketch after this list)
  • For large training sets, a KD-tree data structure can speed up the neighbor search when the feature dimension is low; when the dimension is high, an approximate method such as Locality Sensitive Hashing (LSH) can be used instead (samples that are close to each other are put into the same bucket with high probability)
  • Correlations among the feature dimensions can be taken into account when computing distances, e.g. with the Mahalanobis distance ($d_M\left( x_1,x_2 \right) =\left( x_1-x_2 \right) ^TM\left( x_1-x_2 \right) $, where $M$ is a parameter matrix to be learned); see the paper Distance Metric Learning for Large Margin Nearest Neighbor Classification
  • The K neighbors can also be weighted by their distance to the new sample, instead of only counting how many of them fall into each class; this is known as weighted KNN
  • The kernel trick can be introduced as well, giving kernelized KNN
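A minimal sketch covering the feature-scaling, KD-tree and weighted-KNN points above, using scikit-learn (the data split, the choice of MinMaxScaler and K = 5 are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling: fit on the training data only, then apply the same scaling to the test data
scaler = MinMaxScaler()  # StandardScaler() would give z-score normalization instead
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# KD-tree based neighbor search plus a distance-weighted vote (weighted KNN)
clf = KNeighborsClassifier(n_neighbors=5,
                           algorithm='kd_tree',  # efficient when the feature dimension is low
                           weights='distance')   # closer neighbors get larger voting weights
clf.fit(X_train_s, y_train)
print("test accuracy: %.3f" % clf.score(X_test_s, y_test))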

Advantages

  • The algorithm is simple
  • Relatively well suited to low-dimensional feature spaces
  • Useful as a baseline

Disadvantages

  • The distance between a new sample and every training sample must be computed, so the complexity is high and large datasets require extra handling (e.g. KD-tree or LSH, as noted above)
  • Not well suited to high-dimensional features

KNN in Practice

Classifying irises (the iris dataset)

Source file

Classification by calling scikit-learn's KNN classifier

The data used is the classic iris dataset, a three-class classification problem.

# Import the required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Load the data X, y
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X[:5])
print(y)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2003)
# Build the KNN model with K = 3 and train it
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
# Compute the accuracy
from sklearn.metrics import accuracy_score
# correct = np.count_nonzero((clf.predict(X_test)==y_test)==True)
print("Accuracy is: %.3f" % accuracy_score(y_test, clf.predict(X_test)))
# print ("Accuracy is: %.3f" %(correct/len(X_test)))
Accuracy is: 0.921

Writing a KNN algorithm from scratch

from sklearn import datasets
from collections import Counter  # for the majority vote
from sklearn.model_selection import train_test_split
import numpy as np

# Load the iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2003)
def euc_dis(instance1, instance2):
    """
    Compute the Euclidean distance between two samples instance1 and instance2.
    instance1: the first sample, a numpy array
    instance2: the second sample, a numpy array
    """
    dist = np.sqrt(np.sum((instance1 - instance2)**2))
    return dist
    
 
def knn_classify(X, y, testInstance, k):
    """
    Predict the label of a test sample testInstance with the KNN algorithm.
    X: features of the training data
    y: labels of the training data
    testInstance: the test sample, a numpy array
    k: how many neighbors to use
    """
    # Compute the distance to every training sample, take the k nearest
    # neighbors, and return the majority label among them (here one of {0, 1, 2}).
    distances = [euc_dis(x, testInstance) for x in X]
    kneighbors = np.argsort(distances)[:k]
    count = Counter(y[kneighbors])
    return count.most_common()[0][0]
# Predict every test sample and count the correct predictions.
predictions = [knn_classify(X_train, y_train, data, 3) for data in X_test]
correct = np.count_nonzero(np.array(predictions) == y_test)
#print("Accuracy is: %.3f" % accuracy_score(y_test, clf.predict(X_test)))
print ("Accuracy is: %.3f" %(correct/len(X_test)))
Accuracy is: 0.921

The Decision Boundary of KNN

import matplotlib.pyplot as plt
import numpy as np
from itertools import product
from sklearn.neighbors import KNeighborsClassifier

# Generate some random samples
n_points = 100
X1 = np.random.multivariate_normal([1,50], [[1,0],[0,10]], n_points)
X2 = np.random.multivariate_normal([2,50], [[1,0],[0,10]], n_points)
X = np.concatenate([X1,X2])
y = np.array([0]*n_points + [1]*n_points)
print (X.shape, y.shape)
(200, 2) (200,)
# Train KNN models with different values of K
clfs = []
neighbors = [1,3,5,9,11,13,15,17,19]
for i in range(len(neighbors)):
    clfs.append(KNeighborsClassifier(n_neighbors=neighbors[i]).fit(X,y))
# Visualize the results
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(3,3, sharex='col', sharey='row', figsize=(15, 12))
for idx, clf, tt in zip(product([0, 1, 2], [0, 1, 2]),
                        clfs,
                        ['KNN (k=%d)'%k for k in neighbors]):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y,
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)
    
plt.show()

[Figure: KNN decision boundaries for k = 1, 3, 5, 9, 11, 13, 15, 17, 19]

Predicting Used Car Prices

Source file

Sample data

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Load the data
df = pd.read_csv('data.csv')
df  # data frame
|    | Brand       | Type | Color | Construction Year | Odometer | Ask Price | Days Until MOT | HP |
|----|-------------|------|-------|-------------------|----------|-----------|----------------|----|
| 0  | Peugeot 106 | 1.0  | blue  | 2002              | 166879   | 999       | 138            | 60 |
| 1  | Peugeot 106 | 1.0  | blue  | 1998              | 234484   | 999       | 346            | 60 |
| 2  | Peugeot 106 | 1.1  | black | 1997              | 219752   | 500       | -5             | 60 |
| 3  | Peugeot 106 | 1.1  | red   | 2001              | 223692   | 750       | -87            | 60 |
| 4  | Peugeot 106 | 1.1  | grey  | 2002              | 120275   | 1650      | 356            | 59 |
| 5  | Peugeot 106 | 1.1  | red   | 2003              | 131358   | 1399      | 266            | 60 |
| 6  | Peugeot 106 | 1.1  | green | 1999              | 304277   | 799       | 173            | 57 |
| 7  | Peugeot 106 | 1.4  | green | 1998              | 93685    | 1300      | 0              | 75 |
| 8  | Peugeot 106 | 1.1  | white | 2002              | 225935   | 950       | 113            | 60 |
| 9  | Peugeot 106 | 1.4  | green | 1997              | 252319   | 650       | 133            | 75 |
| 10 | Peugeot 106 | 1.0  | black | 1998              | 220000   | 700       | 82             | 50 |
| 11 | Peugeot 106 | 1.1  | black | 1997              | 212000   | 700       | 75             | 60 |
| 12 | Peugeot 106 | 1.1  | black | 2003              | 255134   | 799       | 197            | 60 |
# Clean the data
# One-hot encode the color
df_colors = df['Color'].str.get_dummies().add_prefix('Color: ')
# One-hot encode the type
df_type = df['Type'].apply(str).str.get_dummies().add_prefix('Type: ')
# Append the one-hot encoded columns
df = pd.concat([df, df_colors, df_type], axis=1)
# Drop the original columns that are no longer needed
df = df.drop(['Brand', 'Type', 'Color'], axis=1)

df

|    | Construction Year | Odometer | Ask Price | Days Until MOT | HP | Color: black | Color: blue | Color: green | Color: grey | Color: red | Color: white | Type: 1.0 | Type: 1.1 | Type: 1.4 |
|----|-------------------|----------|-----------|----------------|----|--------------|-------------|--------------|-------------|------------|--------------|-----------|-----------|-----------|
| 0  | 2002 | 166879 | 999  | 138 | 60 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1  | 1998 | 234484 | 999  | 346 | 60 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2  | 1997 | 219752 | 500  | -5  | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3  | 2001 | 223692 | 750  | -87 | 60 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4  | 2002 | 120275 | 1650 | 356 | 59 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5  | 2003 | 131358 | 1399 | 266 | 60 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6  | 1999 | 304277 | 799  | 173 | 57 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7  | 1998 | 93685  | 1300 | 0   | 75 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8  | 2002 | 225935 | 950  | 113 | 60 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 9  | 1997 | 252319 | 650  | 133 | 75 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 1998 | 220000 | 700  | 82  | 50 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 11 | 1997 | 212000 | 700  | 75  | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 12 | 2003 | 255134 | 799  | 197 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
# Correlation between the variables
matrix = df.corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, square=True)
plt.title('Car Price Variables')

Text(0.5, 1.0, 'Car Price Variables')

[Figure: correlation heatmap of the car price variables]

# Plot the pairwise relationships between features
sns.pairplot(df[['Construction Year', 'Days Until MOT', 'Odometer', 'Ask Price']], height=2)
plt.show()

[Figure: pairplot of Construction Year, Days Until MOT, Odometer and Ask Price]

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
import numpy as np

X = df[['Construction Year', 'Days Until MOT', 'Odometer']]
y = df['Ask Price'].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

# Feature scaling
X_normalizer = StandardScaler()  # standardize to zero mean and unit variance, N(0, 1)
X_train = X_normalizer.fit_transform(X_train)
X_test = X_normalizer.transform(X_test)

y_normalizer = StandardScaler()
y_train = y_normalizer.fit_transform(y_train)
y_test = y_normalizer.transform(y_test)

# Train (fit) the model
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train.ravel())

# Now we can predict prices:
y_pred = knn.predict(X_test)
# Inverse-transform to recover the original scale (prices in the original units)

# Predicted values (the scaler expects a 2-D array, so reshape and then flatten back)
y_pred_inv = y_normalizer.inverse_transform(y_pred.reshape(-1, 1)).ravel()
# True values
y_test_inv = y_normalizer.inverse_transform(y_test)

# Build a plot
plt.scatter(y_pred_inv, y_test_inv)
plt.xlabel('Prediction')
plt.ylabel('Real value')

# Now add the perfect prediction line
diagonal = np.linspace(500, 1500, 100)
plt.plot(diagonal, diagonal, '-r')
plt.xlabel('Predicted ask price')
plt.ylabel('Ask price')
plt.show()

print('Prediction',y_pred_inv)
print('Real value',y_test_inv)

[Figure: predicted ask price vs. real ask price, with the perfect-prediction diagonal]

Prediction [1199. 1199.  700.  899.]
Real value [[1300.]
 [1650.]
 [ 650.]
 [ 799.]]
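Beyond the scatter plot, the quality of the fit could also be summarized with a numeric error metric; a minimal sketch, reusing y_test_inv and y_pred_inv from above:

from sklearn.metrics import mean_absolute_error

# average absolute difference between predicted and real ask prices, in the original units
mae = mean_absolute_error(y_test_inv.ravel(), y_pred_inv.ravel())
print("Mean absolute error: %.1f" % mae)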
