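Cross-validation:
k-fold cross-validation
k-fold cross-validation partitions the whole dataset into k subsets (folds) of similar size and consistent distribution. In each of k rounds, k-1 folds are used as the training set and the remaining fold as the test set, so every fold serves as the test set exactly once. The final result is the mean of the k test scores.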
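The splitting step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper name k_fold_indices is made up for this example, and the folds are taken as contiguous blocks without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (hypothetical helper)."""
    # The first n_samples % k folds get one extra sample, so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold; in a real run one model would be trained and scored per fold and the k scores averaged.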
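Stratified k-fold cross-validation
Plain k-fold cross-validation has a problem: the data may be unevenly distributed across the folds. For example, if the samples are sorted by class, a fold may contain only a single class, and the score on that fold then says little about how well the model generalizes. Stratified k-fold cross-validation avoids this by splitting the data so that the class proportions in each fold are roughly the same as in the whole dataset.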
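A small pure-Python sketch can make the stratification idea concrete. It deals the samples of each class out to the folds in turn, which is one simple way to keep class proportions similar; the function name stratified_fold_assignment is hypothetical, and scikit-learn's StratifiedKFold may assign samples differently.

```python
from collections import Counter, defaultdict


def stratified_fold_assignment(labels, k):
    """Return the fold number (0..k-1) assigned to each sample (hypothetical helper)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        # Deal the samples of one class out to the folds round-robin.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of


# Labels sorted by class: contiguous, unshuffled folds would each hold a single class.
labels = [0] * 6 + [1] * 3
fold_of = stratified_fold_assignment(labels, k=3)
per_fold = [
    Counter(label for label, fold in zip(labels, fold_of) if fold == f)
    for f in range(3)
]
print(per_fold)
```

Here every fold ends up with two samples of class 0 and one of class 1, matching the 2:1 ratio of the full label list.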
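Leave-one-out
Leave-one-out is a special case of k-fold cross-validation in which k equals the number of samples: each round holds out a single sample as the test set and trains on all the others. Because the training set differs from the full dataset D by only one sample, the model that is actually evaluated is very similar to the model one wants to evaluate, namely the one trained on D. The drawback is that the training cost becomes very high on large datasets, since one model has to be trained per sample.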
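As a rough illustration of leave-one-out, the sketch below evaluates a 1-nearest-neighbour rule on a tiny made-up 1-D dataset. The data and the function name leave_one_out_accuracy are assumptions for this example only; the point is that the loop body runs once per sample, which is where the cost comes from.

```python
def leave_one_out_accuracy(xs, ys):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule on 1-D data (toy sketch)."""
    correct = 0
    for i in range(len(xs)):
        # Training set: every sample except sample i.
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j != i]
        # "Model": predict the label of the closest training point.
        _, predicted = min(train, key=lambda pair: abs(pair[0] - xs[i]))
        correct += int(predicted == ys[i])
    return correct / len(xs)


# Made-up, well-separated toy data: class 0 near 1.0, class 1 near 3.0.
xs = [1.0, 1.2, 1.4, 3.0, 3.1, 3.3]
ys = [0, 0, 0, 1, 1, 1]
accuracy = leave_one_out_accuracy(xs, ys)
print("rounds:", len(xs), "accuracy:", accuracy)
```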
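Grid search
Grid search is an exhaustive search: it loops over every candidate combination of parameter values, tries each one, and returns the combination that performs best. In principle it is like finding the maximum value in an array. For example, with two parameters to tune, grid search tries every pair of values and picks the pair with the best accuracy.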
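The "find the maximum in an array" view can be sketched as follows. The function toy_score is a made-up stand-in for "train a model with these parameters and return its validation accuracy"; its shape (peaking at C=10, gamma=0.01) is an assumption chosen only so that the example has a clear winner.

```python
import itertools
import math


def toy_score(C, gamma):
    # Hypothetical scoring function, NOT a real model: it peaks at C=10, gamma=0.01.
    return -(math.log10(C) - 1) ** 2 - (math.log10(gamma) + 2) ** 2


param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}

best_score, best_params = -math.inf, None
# Try every (C, gamma) combination and remember the best one seen so far.
for C, gamma in itertools.product(param_grid["C"], param_grid["gamma"]):
    score = toy_score(C, gamma)
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params)
```

With 6 values per parameter the loop evaluates 36 combinations; the cost grows multiplicatively with each added parameter.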
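Evaluation metrics and scoring
Precision: the fraction of samples predicted positive (true positives + false positives) that are actually positive, TP / (TP + FP).
Recall: the fraction of actually positive samples (true positives + false negatives) that are predicted positive, TP / (TP + FN).
F-score: the harmonic mean of precision and recall (its reciprocal is the average of the reciprocals of the two), F1 = 2 * precision * recall / (precision + recall).
ROC (Receiver Operating Characteristic): the vertical axis of the ROC curve is the true positive rate (TPR), the fraction of actually positive samples that are correctly judged positive; the horizontal axis is the false positive rate (FPR), the fraction of actually negative samples that are wrongly judged positive.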
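The definitions above can be checked on a small hand-made example. The labels and predictions below are invented for illustration, and the helper binary_metrics is hypothetical; it assumes there is at least one actual positive, one actual negative, and one predicted positive, so no division by zero occurs.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, TPR and FPR from 0/1 labels (toy sketch)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "tpr": recall, "fpr": fpr}


# Invented data: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example TP = 3, FP = 1, FN = 1 and TN = 5, so precision and recall are both 3/4 and the false positive rate is 1/6.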
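AUC (Area Under ROC Curve): the area enclosed under the ROC curve. When the curve is formed by connecting the points $(x_1, y_1), \ldots, (x_N, y_N)$ in order of increasing $x$, the area can be estimated as
$$\mathrm{AUC} = \frac{1}{2}\sum_{i=1}^{N-1}(x_{i+1}-x_i)(y_i+y_{i+1})$$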
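How the ROC curve is drawn, and how the loss is defined:
Suppose there are m+ + m- examples to classify, of which m+ are positive and m- are negative. As a concrete case, take 10 examples, where
the 5 positive examples have predicted probabilities (0.9, 0.8, 0.5, 0.4, 0.3)
the 5 negative examples have predicted probabilities (0.7, 0.6, 0.2, 0.1, 0.01)
Sorting all of these probabilities from largest to smallest gives the label sequence
[pos, pos, neg, neg, pos, pos, pos, neg, neg, neg]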
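First mark a point at (0, 0), which corresponds to a threshold so high that every example is predicted negative. Then set the classification threshold to each example's predicted value in turn, from largest to smallest, so that one more example at a time is classified as positive. Let the previously marked point be (x, y). If the current example is a positive, the new point is
$$\left(x,\; y + \frac{1}{m^+}\right)$$
where m+ is the number of positive examples. If the current example is a negative, the new point is
$$\left(x + \frac{1}{m^-},\; y\right)$$
where m- is the number of negative examples.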
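In the ROC plot for this example, each grid cell has area 1/(m+ * m-) = 1/25. The region above the curve contains 6 cells, one for each (positive, negative) pair in which the negative is ranked above the positive, so the loss is 6/(m+ * m-) = 6/25. In general, when a positive and a negative receive the same predicted value, either order is possible, so such a pair counts as 1/2.
The loss function associated with AUC is therefore
$$\ell_{rank} = \frac{1}{m^+ m^-}\sum_{x^+ \in D^+}\sum_{x^- \in D^-}\left(\mathbb{I}\big(f(x^+) < f(x^-)\big) + \frac{1}{2}\,\mathbb{I}\big(f(x^+) = f(x^-)\big)\right)$$
where D+ and D- are the sets of positive and negative examples, f is the model's predicted value, and I(.) is the indicator function. It follows that
$$\mathrm{AUC} = 1 - \ell_{rank}$$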
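The 10-example case can be verified numerically. The sketch below counts the penalised (positive, negative) pairs directly and, separately, traces the ROC points step by step and sums the area with the trapezoid rule. It assumes there are no tied scores between a positive and a negative when tracing the curve, which holds for this example; the variable names are chosen for this illustration.

```python
positives = [0.9, 0.8, 0.5, 0.4, 0.3]    # predicted probabilities of the 5 positives
negatives = [0.7, 0.6, 0.2, 0.1, 0.01]   # predicted probabilities of the 5 negatives

# Ranking loss: 1 penalty per pair where the positive scores lower, 1/2 per tie.
penalty = 0.0
for p in positives:
    for n in negatives:
        if p < n:
            penalty += 1.0
        elif p == n:
            penalty += 0.5
rank_loss = penalty / (len(positives) * len(negatives))

# ROC points: walk down the sorted scores, stepping up for a positive, right for a negative.
samples = sorted([(s, 1) for s in positives] + [(s, 0) for s in negatives], reverse=True)
tp = fp = 0
points = [(0.0, 0.0)]
for _, label in samples:
    if label == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / len(negatives), tp / len(positives)))

# Trapezoid rule over consecutive ROC points.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

print("rank loss:", rank_loss, "AUC:", auc)
```

Six of the 25 pairs are misordered (the positives at 0.5, 0.4 and 0.3 each fall below the negatives at 0.7 and 0.6), and the area under the traced curve plus the ranking loss adds up to 1.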
Code implementation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn
from IPython.display import display
from sklearn.datasets import load_digits, load_iris, make_blobs
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import (
    GridSearchCV, GroupKFold, KFold, LeaveOneOut, ParameterGrid,
    ShuffleSplit, StratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
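# Nested cross-validation, written out by hand
def nested_cv(X, y, inner_cv, outer_cv, Classifier, parameter_grid):
    outer_scores = []
    # For each split of the outer cross-validation (split returns index arrays)
    for training_samples, test_samples in outer_cv.split(X, y):
        # Restrict to the outer training data; the inner indices refer to these arrays
        X_outer, y_outer = X[training_samples], y[training_samples]
        # Find the best parameters using the inner cross-validation
        best_params = {}
        best_score = -np.inf
        # Iterate over the parameter combinations
        for parameters in parameter_grid:
            # Collect the scores over the inner splits
            cv_scores = []
            # Iterate over the inner cross-validation
            for inner_train, inner_test in inner_cv.split(X_outer, y_outer):
                # Build a classifier for the given parameters and training data
                clf = Classifier(**parameters)
                clf.fit(X_outer[inner_train], y_outer[inner_train])
                # Evaluate on the inner test set
                score = clf.score(X_outer[inner_test], y_outer[inner_test])
                cv_scores.append(score)
            # Compute the mean score over the inner folds
            mean_score = np.mean(cv_scores)
            if mean_score > best_score:
                # Better than every previous model, so remember its parameters
                best_score = mean_score
                best_params = parameters
        # Build a model on the outer training set with the best parameters
        clf = Classifier(**best_params)
        clf.fit(X_outer, y_outer)
        # Evaluate the model on the outer test set
        outer_scores.append(clf.score(X[test_samples], y[test_samples]))
    return np.array(outer_scores)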
if __name__ == "__main__":
    # Create a synthetic dataset
    X, y = make_blobs(random_state=0)
    # Split the data and labels into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # Instantiate the model and fit it to the training set
    logreg = LogisticRegression().fit(X_train, y_train)
    # Evaluate the model on the test set
    print("Test set score:{:.2f}".format(logreg.score(X_test, y_test)))
    # Cross-validation
    # k-fold cross-validation
    # mglearn.plots.plot_cross_validation()
    # plt.show()
    iris = load_iris()
    logreg = LogisticRegression()
    scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
    print("Cross-validation scores:{}".format(scores))
    # Stratified k-fold cross-validation and other strategies
    # mglearn.plots.plot_stratified_cross_validation()
    # plt.show()
    kfold = KFold(n_splits=5)
    print("Cross-validation scores:\n{}".format(cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
    kfold = KFold(n_splits=3, shuffle=True, random_state=0)
    print("Cross-validation scores:\n{}".format(cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
    # Leave-one-out
    loo = LeaveOneOut()
    scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
    print("Number of cv iterations:", len(scores))
    print("Mean accuracy:{:.2f}".format(scores.mean()))
    # Shuffle-split cross-validation
    # mglearn.plots.plot_shuffle_split()
    # plt.show()
    shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
    scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
    print("Cross-validation scores:\n{}".format(scores))
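    # Cross-validation with groups
    # Create a synthetic dataset
    X, y = make_blobs(n_samples=12, random_state=0)
    # Assume the first 3 samples belong to one group, the next 4 to another, and so on
    groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
    # groups is passed by keyword; recent scikit-learn versions no longer accept it positionally
    scores = cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=3))
    print("Cross-validation scores:\n{}".format(scores))
    # mglearn.plots.plot_group_kfold()
    # plt.show()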
    ##########################################################
    # A simple grid search implementation
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
    print("Size of training set:{} size of test set:{}".format(X_train.shape[0], X_test.shape[0]))
    best_score = 0
    for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
        for C in [0.001, 0.01, 0.1, 1, 10, 100]:
            # Train an SVC for each parameter combination
            svm = SVC(gamma=gamma, C=C)
            svm.fit(X_train, y_train)
            # Evaluate the SVC on the test set
            # (the parameters are tuned on the test set here, so this score is optimistic)
            score = svm.score(X_test, y_test)
            # If we got a higher score, store the score and the parameters
            if score > best_score:
                best_score = score
                best_parameters = {'C': C, 'gamma': gamma}
    print("Best score:{:.2f}".format(best_score))
    print("Best parameters:{}".format(best_parameters))
    # mglearn.plots.plot_threefold_split()
    # plt.show()
    # Split the data into a training+validation set and a test set
    X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
    # Split the training+validation set into a training set and a validation set
    X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
    print("Size of training set:{} size of validation set:{} size of test set:{}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
    best_score = 0
    for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
        for C in [0.001, 0.01, 0.1, 1, 10, 100]:
            # Train an SVC for each parameter combination
            svm = SVC(gamma=gamma, C=C)
            svm.fit(X_train, y_train)
            # Evaluate the SVC on the validation set
            score = svm.score(X_valid, y_valid)
            # If we got a higher score, store the score and the parameters
            if score > best_score:
                best_score = score
                best_parameters = {'C': C, 'gamma': gamma}
    # Rebuild a model on the training+validation set and evaluate it on the test set
    svm = SVC(**best_parameters)
    svm.fit(X_trainval, y_trainval)
    test_score = svm.score(X_test, y_test)
    print("Best score on validation set:{:.2f}".format(best_score))
    print("Best parameters:", best_parameters)
    print("Test set score with best parameters:{:.2f}".format(test_score))
    ######################################################################
    # Grid search with cross-validation
    # Reset the best score so the result of the previous search does not carry over
    best_score = 0
    for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
        for C in [0.001, 0.01, 0.1, 1, 10, 100]:
            # Train an SVC for each parameter combination
            svm = SVC(gamma=gamma, C=C)
            # Perform cross-validation
            scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
            # Compute the mean cross-validation accuracy
            score = np.mean(scores)
            # If we got a higher score, store the score and the parameters
            if score > best_score:
                best_score = score
                best_parameters = {'C': C, 'gamma': gamma}
    # Rebuild a model on the training+validation set
    svm = SVC(**best_parameters)
    svm.fit(X_trainval, y_trainval)
    # mglearn.plots.plot_cross_val_selection()
    # plt.show()
    # mglearn.plots.plot_grid_search_overview()
    # plt.show()
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
    print("Parameter grid:\n{}".format(param_grid))
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
    grid_search.fit(X_train, y_train)
    print("Test set score:{:.2f}".format(grid_search.score(X_test, y_test)))
    print("Best parameters:{}".format(grid_search.best_params_))
    print("Best cross-validation score:{:.2f}".format(grid_search.best_score_))
    print("Best estimator:\n{}".format(grid_search.best_estimator_))
    # Analyze the cross-validation results
    # Convert to a DataFrame
    results = pd.DataFrame(grid_search.cv_results_)
    # Show the first 5 rows
    display(results.head())
    # One row per value of C, one column per value of gamma
    scores = np.array(results.mean_test_score).reshape(6, 6)
    # Plot the mean cross-validation scores
    # mglearn.tools.heatmap(scores, xlabel='gamma', xticklabels=param_grid['gamma'], ylabel='C', yticklabels=param_grid['C'], cmap='viridis')
    # plt.show()
    # A list of grids: gamma is only searched for the rbf kernel
    param_grid = [{'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
                  {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}]
    print("List of grids:\n{}".format(param_grid))
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print("Best parameters:{}".format(grid_search.best_params_))
    print("Best cross-validation score:{:.2f}".format(grid_search.best_score_))
    results = pd.DataFrame(grid_search.cv_results_)
    display(results.T)
    # Nested cross-validation
    scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5), iris.data, iris.target, cv=5)
    print("Cross-validation scores:", scores)
    print("Mean cross-validation score:", scores.mean())
    # Test the hand-written nested_cv on iris
    scores = nested_cv(iris.data, iris.target, StratifiedKFold(5), StratifiedKFold(5), SVC, ParameterGrid(param_grid))
    print("Cross-validation scores: {}".format(scores))
    # Build an imbalanced dataset: "is the digit a 9?"
    digits = load_digits()
    y = digits.target == 9
    X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=0)
    # Baseline that always predicts the most frequent class
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    pred_most_frequent = dummy_majority.predict(X_test)
    print("Unique predicted labels:{}".format(np.unique(pred_most_frequent)))
    print("Test score:{:.2f}".format(dummy_majority.score(X_test, y_test)))
    # Use a decision tree
    tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
    pred_tree = tree.predict(X_test)
    print("Test score:{:.2f}".format(tree.score(X_test, y_test)))
    # Note: the default strategy of DummyClassifier changed from 'stratified'
    # to 'prior' in scikit-learn 0.24, so the output depends on the installed version
    dummy = DummyClassifier().fit(X_train, y_train)
    pred_dummy = dummy.predict(X_test)
    print("dummy score:{:.2f}".format(dummy.score(X_test, y_test)))
    logreg = LogisticRegression(C=0.1).fit(X_train, y_train)
    pred_logreg = logreg.predict(X_test)
    print("logreg score:{:.2f}".format(logreg.score(X_test, y_test)))
    # Confusion matrices
    confusion = confusion_matrix(y_test, pred_logreg)
    print("Confusion matrix:\n{}".format(confusion))
    print("Most frequent class:")
    print(confusion_matrix(y_test, pred_most_frequent))
    print("\nDummy model:")
    print(confusion_matrix(y_test, pred_dummy))
    print("\nDecision tree:")
    print(confusion_matrix(y_test, pred_tree))
    print("\nLogistic Regression")
    print(confusion_matrix(y_test, pred_logreg))
    # F1 scores
    print("f1 score most frequent:{:.2f}".format(f1_score(y_test, pred_most_frequent)))
    print("f1 score dummy:{:.2f}".format(f1_score(y_test, pred_dummy)))
    print("f1 score tree:{:.2f}".format(f1_score(y_test, pred_tree)))
    print("f1 score logistic regression:{:.2f}".format(f1_score(y_test, pred_logreg)))
    # Per-class precision, recall and F1
    print(classification_report(y_test, pred_most_frequent, target_names=["not nine", "nine"]))
    print(classification_report(y_test, pred_dummy, target_names=["not nine", "nine"]))
    print(classification_report(y_test, pred_logreg, target_names=["not nine", "nine"]))