KNN (K-Nearest Neighbors) Summary

2019/06/26 ML

The Core Idea of KNN

The K-nearest-neighbors algorithm works as follows: given a training (sample) set, for a new input sample, compute the distance between the new sample and every training sample and find the K training samples closest to it (its K neighbors). For classification, the new sample is assigned to the label that occurs most often among those K training samples; for regression, the mean of the K training samples' target values is used as the prediction for the new sample.
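Restated as formulas (a direct rewriting of the rule above, where $N_K\left( x \right)$ denotes the index set of the K training samples nearest to a new sample $x$ and $\mathbb{I}\left( \cdot \right)$ is the indicator function): for classification $\hat{y}=\mathrm{arg}\max_c\sum_{i\in N_K\left( x \right)}{\mathbb{I}\left( y_i=c \right)}$, and for regression $\hat{y}=\frac{1}{K}\sum_{i\in N_K\left( x \right)}{y_i}$.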

KNN has only one hyperparameter to tune, K; there are no model parameters to learn.

The hyperparameter K can be determined by cross validation. Concretely, the available data is first split into training data (train) and test data (test); the training data is then further split into training data (training data) and validation data (validation data). Different values of K are tried, and the K that gives the best performance on the validation data is selected.

For 5-fold cross validation, the original training data is split into five equal parts; each part is used once as the validation set while the other four serve as the training set, giving five runs in total. The metric (e.g. accuracy) from the five runs is averaged to obtain the final score for that value of K.
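A minimal sketch of selecting K with 5-fold cross validation via scikit-learn's cross_val_score (the iris data and the candidate list of K values below are only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 11, 13, 15]:  # candidate values of K
    # mean accuracy over the 5 folds for this K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print("best K = %d, mean CV accuracy = %.3f" % (best_k, best_score))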

As K increases, the decision boundary of KNN becomes smoother and smoother.

Note: cross validation only serves to make fuller use of the data; K can also be determined without cross validation, by evaluating candidate values directly on a single validation split of the full training data.

Practical Notes

  • Features in different dimensions should not differ too much in scale, otherwise some dimensions will effectively be ignored when computing distances. Features can be rescaled with min-max normalization [$x_{new}=\frac{x-\min \left( x \right)}{\max \left( x \right) -\min \left( x \right)}$] or z-score normalization [$x_{new}=\frac{x-mean\left( x \right)}{std\left( x \right)}$] (see the sketch after this list)
  • For large training sets, a KD-tree data structure can speed up the neighbor search when the feature dimension is low; when the dimension is high, an approximate method such as Locality Sensitive Hashing (LSH) can be used instead (samples that are close to each other are put into the same bucket with high probability)
  • Correlations among the feature dimensions can be taken into account when computing distances, e.g. with the Mahalanobis distance ($d_M\left( x_1,x_2 \right) =\left( x_1-x_2 \right) ^TM\left( x_1-x_2 \right) $, where $M$ is a parameter matrix to be learned); see the paper Distance Metric Learning for Large Margin Nearest Neighbor Classification
  • The K neighbors can also be weighted by their distance to the new sample, instead of only counting how many of them fall into each class; this is known as weighted KNN
  • The kernel trick can be introduced as well, giving kernelized KNN
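A minimal sketch covering the feature-scaling, KD-tree and weighted-KNN points above, using scikit-learn (the data split, the choice of MinMaxScaler and K = 5 are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling: fit on the training data only, then apply the same scaling to the test data
scaler = MinMaxScaler()  # StandardScaler() would give z-score normalization instead
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# KD-tree based neighbor search plus a distance-weighted vote (weighted KNN)
clf = KNeighborsClassifier(n_neighbors=5,
                           algorithm='kd_tree',  # efficient when the feature dimension is low
                           weights='distance')   # closer neighbors get larger voting weights
clf.fit(X_train_s, y_train)
print("test accuracy: %.3f" % clf.score(X_test_s, y_test))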

Advantages

  • The algorithm is simple
  • Relatively well suited to low-dimensional feature spaces
  • Useful as a baseline

Disadvantages

  • The distance between a new sample and every training sample must be computed, so the complexity is high and large datasets require extra handling (e.g. KD-tree or LSH, as noted above)
  • Not well suited to high-dimensional features

KNN in Practice

Classifying irises (the iris dataset)

Source file

Classification by calling scikit-learn's KNN classifier

The data used is the classic iris dataset, a three-class classification problem.

# Import the required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Load the data X, y
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X[:5])
print(y)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2003)
# Build the KNN model with K = 3 and train it
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
# Compute the accuracy
from sklearn.metrics import accuracy_score
# correct = np.count_nonzero((clf.predict(X_test)==y_test)==True)
print("Accuracy is: %.3f" % accuracy_score(y_test, clf.predict(X_test)))
# print ("Accuracy is: %.3f" %(correct/len(X_test)))
Accuracy is: 0.921

Writing a KNN algorithm from scratch

from sklearn import datasets
from collections import Counter  # for the majority vote
from sklearn.model_selection import train_test_split
import numpy as np

# Load the iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2003)
def euc_dis(instance1, instance2):
    """
    Compute the Euclidean distance between two samples instance1 and instance2.
    instance1: the first sample, a numpy array
    instance2: the second sample, a numpy array
    """
    dist = np.sqrt(np.sum((instance1 - instance2)**2))
    return dist
    
 
def knn_classify(X, y, testInstance, k):
    """
    Predict the label of a test sample testInstance with the KNN algorithm.
    X: features of the training data
    y: labels of the training data
    testInstance: the test sample, a numpy array
    k: how many neighbors to use
    """
    # Compute the distance to every training sample, take the k nearest
    # neighbors, and return the majority label among them (here one of {0, 1, 2}).
    distances = [euc_dis(x, testInstance) for x in X]
    kneighbors = np.argsort(distances)[:k]
    count = Counter(y[kneighbors])
    return count.most_common()[0][0]
# Predict every test sample and count the correct predictions.
predictions = [knn_classify(X_train, y_train, data, 3) for data in X_test]
correct = np.count_nonzero(np.array(predictions) == y_test)
#print("Accuracy is: %.3f" % accuracy_score(y_test, clf.predict(X_test)))
print ("Accuracy is: %.3f" %(correct/len(X_test)))
Accuracy is: 0.921

The Decision Boundary of KNN

import matplotlib.pyplot as plt
import numpy as np
from itertools import product
from sklearn.neighbors import KNeighborsClassifier

# Generate some random samples
n_points = 100
X1 = np.random.multivariate_normal([1,50], [[1,0],[0,10]], n_points)
X2 = np.random.multivariate_normal([2,50], [[1,0],[0,10]], n_points)
X = np.concatenate([X1,X2])
y = np.array([0]*n_points + [1]*n_points)
print (X.shape, y.shape)
(200, 2) (200,)
# Train KNN models with different values of K
clfs = []
neighbors = [1,3,5,9,11,13,15,17,19]
for i in range(len(neighbors)):
    clfs.append(KNeighborsClassifier(n_neighbors=neighbors[i]).fit(X,y))
# Visualize the results
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(3,3, sharex='col', sharey='row', figsize=(15, 12))
for idx, clf, tt in zip(product([0, 1, 2], [0, 1, 2]),
                        clfs,
                        ['KNN (k=%d)'%k for k in neighbors]):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y,
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)
    
plt.show()

[Figure: KNN decision boundaries for k = 1, 3, 5, 9, 11, 13, 15, 17, 19]

Predicting Used Car Prices

Source file

Sample data

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Load the data
df = pd.read_csv('data.csv')
df  # data frame
|    | Brand       | Type | Color | Construction Year | Odometer | Ask Price | Days Until MOT | HP |
|----|-------------|------|-------|-------------------|----------|-----------|----------------|----|
| 0  | Peugeot 106 | 1.0  | blue  | 2002              | 166879   | 999       | 138            | 60 |
| 1  | Peugeot 106 | 1.0  | blue  | 1998              | 234484   | 999       | 346            | 60 |
| 2  | Peugeot 106 | 1.1  | black | 1997              | 219752   | 500       | -5             | 60 |
| 3  | Peugeot 106 | 1.1  | red   | 2001              | 223692   | 750       | -87            | 60 |
| 4  | Peugeot 106 | 1.1  | grey  | 2002              | 120275   | 1650      | 356            | 59 |
| 5  | Peugeot 106 | 1.1  | red   | 2003              | 131358   | 1399      | 266            | 60 |
| 6  | Peugeot 106 | 1.1  | green | 1999              | 304277   | 799       | 173            | 57 |
| 7  | Peugeot 106 | 1.4  | green | 1998              | 93685    | 1300      | 0              | 75 |
| 8  | Peugeot 106 | 1.1  | white | 2002              | 225935   | 950       | 113            | 60 |
| 9  | Peugeot 106 | 1.4  | green | 1997              | 252319   | 650       | 133            | 75 |
| 10 | Peugeot 106 | 1.0  | black | 1998              | 220000   | 700       | 82             | 50 |
| 11 | Peugeot 106 | 1.1  | black | 1997              | 212000   | 700       | 75             | 60 |
| 12 | Peugeot 106 | 1.1  | black | 2003              | 255134   | 799       | 197            | 60 |
# Clean the data
# One-hot encode the color
df_colors = df['Color'].str.get_dummies().add_prefix('Color: ')
# One-hot encode the type
df_type = df['Type'].apply(str).str.get_dummies().add_prefix('Type: ')
# Append the one-hot encoded columns
df = pd.concat([df, df_colors, df_type], axis=1)
# Drop the original columns that are no longer needed
df = df.drop(['Brand', 'Type', 'Color'], axis=1)

df

|    | Construction Year | Odometer | Ask Price | Days Until MOT | HP | Color: black | Color: blue | Color: green | Color: grey | Color: red | Color: white | Type: 1.0 | Type: 1.1 | Type: 1.4 |
|----|-------------------|----------|-----------|----------------|----|--------------|-------------|--------------|-------------|------------|--------------|-----------|-----------|-----------|
| 0  | 2002 | 166879 | 999  | 138 | 60 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1  | 1998 | 234484 | 999  | 346 | 60 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2  | 1997 | 219752 | 500  | -5  | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3  | 2001 | 223692 | 750  | -87 | 60 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4  | 2002 | 120275 | 1650 | 356 | 59 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5  | 2003 | 131358 | 1399 | 266 | 60 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6  | 1999 | 304277 | 799  | 173 | 57 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7  | 1998 | 93685  | 1300 | 0   | 75 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8  | 2002 | 225935 | 950  | 113 | 60 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 9  | 1997 | 252319 | 650  | 133 | 75 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 1998 | 220000 | 700  | 82  | 50 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 11 | 1997 | 212000 | 700  | 75  | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 12 | 2003 | 255134 | 799  | 197 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
# Correlation between the variables
matrix = df.corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, square=True)
plt.title('Car Price Variables')

Text(0.5, 1.0, 'Car Price Variables')

[Figure: correlation heatmap of the car price variables]

# Plot the pairwise relationships between features
sns.pairplot(df[['Construction Year', 'Days Until MOT', 'Odometer', 'Ask Price']], height=2)
plt.show()

[Figure: pairplot of Construction Year, Days Until MOT, Odometer and Ask Price]

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
import numpy as np

X = df[['Construction Year', 'Days Until MOT', 'Odometer']]
y = df['Ask Price'].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

# Feature scaling
X_normalizer = StandardScaler()  # standardize to zero mean and unit variance, N(0, 1)
X_train = X_normalizer.fit_transform(X_train)
X_test = X_normalizer.transform(X_test)

y_normalizer = StandardScaler()
y_train = y_normalizer.fit_transform(y_train)
y_test = y_normalizer.transform(y_test)

# Train (fit) the model
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train.ravel())

# Now we can predict prices:
y_pred = knn.predict(X_test)
# Inverse-transform to recover the original scale (prices in the original units)

# Predicted values (the scaler expects a 2-D array, so reshape and then flatten back)
y_pred_inv = y_normalizer.inverse_transform(y_pred.reshape(-1, 1)).ravel()
# True values
y_test_inv = y_normalizer.inverse_transform(y_test)

# Build a plot
plt.scatter(y_pred_inv, y_test_inv)
plt.xlabel('Prediction')
plt.ylabel('Real value')

# Now add the perfect prediction line
diagonal = np.linspace(500, 1500, 100)
plt.plot(diagonal, diagonal, '-r')
plt.xlabel('Predicted ask price')
plt.ylabel('Ask price')
plt.show()

print('Prediction',y_pred_inv)
print('Real value',y_test_inv)

[Figure: predicted ask price vs. real ask price, with the perfect-prediction diagonal]

Prediction [1199. 1199.  700.  899.]
Real value [[1300.]
 [1650.]
 [ 650.]
 [ 799.]]
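Beyond the scatter plot, the quality of the fit could also be summarized with a numeric error metric; a minimal sketch, reusing y_test_inv and y_pred_inv from above:

from sklearn.metrics import mean_absolute_error

# average absolute difference between predicted and real ask prices, in the original units
mae = mean_absolute_error(y_test_inv.ravel(), y_pred_inv.ravel())
print("Mean absolute error: %.1f" % mae)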
