Supervised Learning (II)
Linear regression (ordinary least squares):
Ordinary least squares finds the parameters w and b that minimize the sum of squared errors between the predictions and the true targets on the training set:
$$\min_{w,b} \sum_i \left(w^\top x_i + b - y_i\right)^2$$
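As a quick illustration (the toy data and variable names here are my own, not from the original), the OLS fit can be computed directly with NumPy's least-squares routine:

import numpy as np

# toy data: roughly y = 2x + 1 plus a little noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

# append a column of ones so the intercept b is fitted together with w
A = np.hstack([X, np.ones((X.shape[0], 1))])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to 2 and 1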
Ridge regression:
Ridge regression is a biased-estimation regression method designed for analyzing collinear data; it is essentially an improved form of least squares estimation. By giving up the unbiasedness of ordinary least squares, it trades away some information and precision in exchange for regression coefficients that are more realistic and reliable, and it fits ill-conditioned data better than ordinary least squares does. It is introduced to mitigate the overfitting that ordinary least squares is prone to.
Its loss function is
$$\min_{w,b} \sum_i \left(w^\top x_i + b - y_i\right)^2 + \lambda \|w\|_2^2$$
where the regularization parameter $\lambda > 0$. Introducing this L2-norm penalty markedly reduces the risk of overfitting.
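To see where $\lambda$ enters, here is a minimal NumPy sketch of the closed-form ridge solution (the function name ridge_fit and the toy data are mine; the full scikit-learn version appears in the code section below, and the intercept is omitted for brevity):

import numpy as np

def ridge_fit(X, y, lam=1.0):
    # closed form: w = (X^T X + lam * I)^{-1} X^T y
    # the lam * I term keeps the system well-conditioned even when columns
    # of X are collinear, which is exactly the case OLS handles badly
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# two perfectly collinear features: plain OLS has no unique solution here,
# but ridge still returns finite, stable coefficients
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_fit(X, y, lam=1.0))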
LASSO regression:
Its only difference from ridge regression is that the regularization term is the L1 norm:
$$\min_{w,b} \sum_i \left(w^\top x_i + b - y_i\right)^2 + \lambda \|w\|_1$$
The effect of L1 regularization is that, with lasso, some coefficients end up exactly 0, meaning the corresponding features are ignored by the model entirely; this can be viewed as a form of automatic feature selection.
Both the L1 and L2 norms help reduce the risk of overfitting, but L1 is more likely than L2 to yield sparse solutions.
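Why L1 produces exact zeros can be seen from the soft-thresholding operator, the one-dimensional lasso solution used in coordinate-descent solvers. A minimal sketch (the function name is mine):

import numpy as np

def soft_threshold(z, lam):
    # shrink z toward 0 by lam, and snap anything with |z| <= lam to exactly 0;
    # this snapping-to-zero is where lasso's sparsity comes from
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.5, 0.3, 1.5]), lam=1.0))
# [-1. -0.  0.   0.5] -- the two small coefficients become exactly zero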
Logistic regression:
The logistic regression model is
$$y = \frac{1}{1 + e^{-(w^\top x + b)}}$$
Its main characteristic is that y does not change linearly as x grows; instead it changes smoothly, following an S-shaped curve. (For its loss function, see this blog post.)
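A minimal sketch of this formula (the helper names sigmoid and predict_proba are mine, not scikit-learn's API):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(y = 1 | x) = sigmoid(w^T x + b): smooth and S-shaped in x, not linear
    return sigmoid(X @ w + b)

X = np.array([[-3.0], [0.0], [3.0]])
print(predict_proba(X, w=np.array([2.0]), b=0.0))
# rises smoothly from ~0.002 through 0.5 to ~0.998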
Linear support vector machine (Linear Support Vector Machine, LSVM):
In Zhou Zhihua's Machine Learning, this method is called the soft-margin support vector machine.
First, a brief word about the support vector machine (SVM). Simply put, we want to separate the positive and negative examples into two groups, and there are many hyperplanes that do so; how do we pick the best one? One view is to choose the separating hyperplane that lies right in the middle of the two classes of training samples; the vectors lying on the boundaries of the two classes are called support vectors. The distance from any point x in sample space to the hyperplane (w, b) is
$$r = \frac{|w^\top x + b|}{\|w\|}$$
If w and b are rescaled so that the support vectors satisfy $|w^\top x + b| = 1$, the sum of the distances from two support vectors of opposite classes to the hyperplane is
$$\gamma = \frac{2}{\|w\|}$$
At this point the problem becomes maximizing the margin $\gamma$, which is equivalent to minimizing the squared norm of w:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\left(w^\top x_i + b\right) \ge 1$$
Now for the soft-margin support vector machine. The discussion so far assumed the two classes are linearly separable, but in practice they often are not, so we introduce a "soft margin" that allows some samples to violate the constraint. The loss function is
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad y_i\left(w^\top x_i + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$
where $C > 0$ is a penalty parameter: a large C penalizes misclassification heavily, while a small C penalizes it lightly. $\xi_i$ is a slack variable introduced for each sample, so the requirement becomes that the margin plus the slack is at least 1.
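Since the optimal slack is $\xi_i = \max(0, 1 - y_i(w^\top x_i + b))$, i.e. the hinge loss, the soft-margin objective can be evaluated directly. A minimal NumPy sketch (names and toy data are mine; this is not scikit-learn's actual solver):

import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    # (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b));
    # the hinge term equals the slack xi_i for each sample
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slack.sum()

# toy data: the third point violates the margin (slack = 1.1)
X = np.array([[2.0], [-2.0], [0.1]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([1.0]), 0.0
print(soft_margin_objective(w, b, X, y, C=1.0))   # 0.5 + 1 * 1.1 = 1.6
print(soft_margin_objective(w, b, X, y, C=10.0))  # 0.5 + 10 * 1.1 = 11.5 -- larger C punishes violations harder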
Code implementation:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import make_blobs
import mglearn
import numpy as np
if __name__ == "__main__":
    # Ordinary least squares
    X,y = mglearn.datasets.make_wave(n_samples=60)
    X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
    lr = LinearRegression().fit(X_train,y_train)
    # The trailing underscore on the two lr attributes below follows scikit-learn's
    # convention of storing values derived from the training data in attributes
    # that end with an underscore
    print("lr.coef_:{}".format(lr.coef_))
    print("lr.intercept_:{}".format(lr.intercept_))
    # The two scores below are close to each other, which indicates underfitting,
    # since training accuracy is nearly the same as test accuracy
    print("Training set score:{:.2f}".format(lr.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(lr.score(X_test,y_test)))
    X,y = mglearn.datasets.load_extended_boston()
    X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
    lr = LinearRegression().fit(X_train,y_train)
    print("Training set score:{:.2f}".format(lr.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(lr.score(X_test,y_test)))
    #############################################################
    # Ridge regression
    ridge = Ridge().fit(X_train,y_train)
    print("Training set score:{:.2f}".format(ridge.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(ridge.score(X_test,y_test)))
    # alpha is the regularization strength
    ridge10 = Ridge(alpha=10).fit(X_train,y_train)
    print("Training set score:{:.2f}".format(ridge10.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(ridge10.score(X_test,y_test)))
    ridge01 = Ridge(alpha=0.1).fit(X_train,y_train)
    print("Training set score:{:.2f}".format(ridge01.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(ridge01.score(X_test,y_test)))
    '''
    plt.plot(ridge.coef_,"s",label="Ridge alpha=1")
    plt.plot(ridge10.coef_,"^",label="Ridge alpha=10")
    plt.plot(ridge01.coef_,"v",label="Ridge alpha=0.1")
    plt.plot(lr.coef_,'o',label="LinearRegression")
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.hlines(0,0,len(lr.coef_))
    plt.ylim(-25,25)
    plt.legend()
    #plt.show()
    mglearn.plots.plot_ridge_n_samples()
    #plt.show()
    '''
    # Lasso regression
    lasso = Lasso().fit(X_train,y_train)
    print("Training set score:{:.2f}".format(lasso.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(lasso.score(X_test,y_test)))
    print("Number of features used:{}".format(np.sum(lasso.coef_!=0)))
    # Lower alpha, and raise max_iter accordingly
    lasso001 = Lasso(alpha=0.01,max_iter=100000).fit(X_train,y_train)
    print("Training set score:{:.2f}".format(lasso001.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(lasso001.score(X_test,y_test)))
    print("Number of features used:{}".format(np.sum(lasso001.coef_!=0)))
    # Lower alpha further
    lasso00001 = Lasso(alpha=0.0001,max_iter=100000).fit(X_train,y_train)
    print("Training set score:{:.2f}".format(lasso00001.score(X_train,y_train)))
    print("Test set score:{:.2f}".format(lasso00001.score(X_test,y_test)))
    print("Number of features used:{}".format(np.sum(lasso00001.coef_!=0)))
    '''
    plt.plot(lasso.coef_,"s",label="Lasso alpha=1")
    plt.plot(lasso001.coef_,"^",label="Lasso alpha=0.01")
    plt.plot(lasso00001.coef_,"v",label="Lasso alpha=0.0001")
    plt.plot(ridge01.coef_,'o',label="Ridge alpha=0.1")
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.ylim(-25,25)
    plt.legend(ncol=2,loc=(0,1.05))
    #plt.show()
    '''
    # Using LogisticRegression and LinearSVC
    X,y = mglearn.datasets.make_forge()
    fig,axes = plt.subplots(1,2,figsize=(10,3))
    '''
    for model,ax in zip([LinearSVC(),LogisticRegression()],axes):
        clf = model.fit(X,y)
        mglearn.plots.plot_2d_separator(clf,X,fill=False,eps=0.5,ax=ax,alpha=.7)
        mglearn.discrete_scatter(X[:,0],X[:,1],y,ax=ax)
        ax.set_title("{}".format(clf.__class__.__name__))
        ax.set_xlabel("Feature 0")
        ax.set_ylabel("Feature 1")
    axes[0].legend()
    '''
    #mglearn.plots.plot_linear_svc_regularization()
    #plt.show()
    # LogisticRegression
    cancer = load_breast_cancer()
    X_train,X_test,y_train,y_test = train_test_split(cancer.data,cancer.target,stratify=cancer.target,random_state=42)
    logreg = LogisticRegression().fit(X_train,y_train)
    print("Training set score:{:.3f}".format(logreg.score(X_train,y_train)))
    print("Test set score:{:.3f}".format(logreg.score(X_test,y_test)))
    logreg100 = LogisticRegression(C=100).fit(X_train,y_train)
    print("Training set score:{:.3f}".format(logreg100.score(X_train,y_train)))
    print("Test set score:{:.3f}".format(logreg100.score(X_test,y_test)))
    logreg001 = LogisticRegression(C=0.01).fit(X_train,y_train)
    print("Training set score:{:.3f}".format(logreg001.score(X_train,y_train)))
    print("Test set score:{:.3f}".format(logreg001.score(X_test,y_test)))
    '''
    plt.plot(logreg.coef_.T,'o',label="C=1")
    plt.plot(logreg100.coef_.T,'^',label="C=100")
    plt.plot(logreg001.coef_.T,'v',label="C=0.01")
    plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation=90)
    plt.hlines(0,0,cancer.data.shape[1])
    plt.ylim(-5,5)
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.legend()
    plt.show()
    '''
    for C,marker in zip([0.001,1,100],['o','^','v']):
        # the liblinear solver supports the L1 penalty (newer scikit-learn
        # versions default to lbfgs, which does not)
        lr_l1 = LogisticRegression(C=C,penalty="l1",solver="liblinear").fit(X_train,y_train)
        print("Training accuracy of l1 logreg with C={:.3f}:{:.2f}".format(C,lr_l1.score(X_train,y_train)))
        print("Test accuracy of l1 logreg with C={:.3f}:{:.2f}".format(C,lr_l1.score(X_test,y_test)))
        '''
        plt.plot(lr_l1.coef_.T,marker,label="C={:.3f}".format(C))
        plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation=90)
        plt.hlines(0,0,cancer.data.shape[1])
        plt.xlabel("Coefficient index")
        plt.ylabel("Coefficient magnitude")
        plt.ylim(-5,5)
        plt.legend(loc=3)
        plt.show()
        '''
    X,y = make_blobs(random_state=42)
    mglearn.discrete_scatter(X[:,0],X[:,1],y)
    '''
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    plt.legend(["Class 0","Class 1","Class 2"])
    plt.show()
    '''
    linear_svm = LinearSVC().fit(X,y)
    print("Coefficient shape:",linear_svm.coef_.shape)
    print("Intercept shape:",linear_svm.intercept_.shape)
    mglearn.discrete_scatter(X[:,0],X[:,1],y)
    line = np.linspace(-15,15)
    '''
    for coef,intercept,color in zip(linear_svm.coef_,linear_svm.intercept_,['b','r','g']):
        plt.plot(line,-(line*coef[0]+intercept)/coef[1],c=color)
    plt.ylim(-10,15)
    plt.xlim(-10,8)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    plt.legend(['Class 0','Class 1','Class 2','Line class 0','Line class 1','Line class 2'],loc=(1.01,0.3))
    plt.show()
    '''
    mglearn.plots.plot_2d_classification(linear_svm,X,fill=True,alpha=.7)
    mglearn.discrete_scatter(X[:,0],X[:,1],y)
    line = np.linspace(-15,15)
    '''
    for coef,intercept,color in zip(linear_svm.coef_,linear_svm.intercept_,['b','r','g']):
        plt.plot(line,-(line*coef[0]+intercept)/coef[1],c=color)
    plt.legend(['Class 0','Class 1','Class 2','Line class 0','Line class 1','Line class 2'],loc=(1.01,0.3))
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    plt.show()
    '''