Data Representation and Feature Engineering

One-Hot Encoding (dummy variables):
Machine learning algorithms often encounter categorical features whose values are discrete rather than continuous, so these values need to be encoded numerically.
The main idea is to use an N-bit state register to encode N states: each state gets its own bit, and at any time exactly one bit is set. For example:
Gender has two categories, male and female, so they are encoded as 01 and 10.
Nationality has three categories, Chinese, American, and French, so they are encoded as 100, 010, and 001.
Favorite sport has four categories, football, basketball, table tennis, and badminton, so they are encoded as 1000, 0100, 0010, and 0001.
When a new sample arrives, we encode it with the same rules, which yields a high-dimensional sparse matrix.
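As a minimal sketch (the three-row table below is made up for illustration), pandas' get_dummies produces exactly this kind of encoding:

import pandas as pd

# A made-up toy table with two categorical columns
df = pd.DataFrame({'gender': ['male', 'female', 'male'],
                   'country': ['China', 'USA', 'France']})

# Each category value becomes its own 0/1 indicator column,
# and exactly one indicator per original column is 1 in every row
print(pd.get_dummies(df))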

Binning:
Binning discretizes a continuous variable, or merges a many-valued discrete variable into fewer states, which lowers the model's risk of overfitting.
Advantages:

  1. Discrete features are easy to add and remove, which allows fast model iteration;
  2. Inner products with sparse vectors are fast to compute, the results are compact to store, and the representation scales easily;
  3. Discretized features are very robust to anomalous data: for example, a feature "age > 30" is 1, otherwise 0. Without discretization, an anomalous value such as "age = 300" would strongly distort the model (see the sketch after this list);
  4. Logistic regression is a generalized linear model with limited expressiveness; after discretizing one variable into N indicators, each indicator gets its own weight, which introduces nonlinearity and improves the model's ability to fit;
  5. Discretized features can be crossed, turning M + N variables into M * N variables, which introduces further nonlinearity and expressiveness;
  6. Discretization makes the model more stable: if user age is binned with 20-30 as one interval, a user does not become a completely different person just by turning one year older. Of course, samples near interval boundaries behave in exactly the opposite way, so choosing the intervals well is an art in itself;
  7. Discretization simplifies the logistic regression model and lowers the risk of overfitting;
  8. Missing values can be entered into the model as their own category;
  9. All variables are brought onto a similar scale.
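A minimal sketch of the outlier point from item 3 (the ages and bin edges below are made up): np.digitize maps each value to a bin index, so the anomalous age simply lands in the last bin instead of distorting a learned weight:

import numpy as np

# Hypothetical ages, including the anomalous "age 300" from item 3
ages = np.array([5, 23, 31, 45, 300])
# Hand-picked bin edges for illustration
bins = np.array([18, 30, 40, 60])

# np.digitize returns the index of the bin each value falls into
print(np.digitize(ages, bins=bins))   # -> [0 1 2 3 4]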

Linear models vs. tree models
A linear model assigns each feature a weight and sums the weighted features into a single output, while a tree model produces explicit, visualizable split rules, which amount to a step function that is constant on each interval. The code below fits both on the same one-dimensional dataset to make this contrast visible.

Code implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn
from IPython.display import display
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import RFE, SelectFromModel, SelectPercentile
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# The file contains no header row with the column names, so we pass header=None
# and provide the column names explicitly via 'names'
data = pd.read_csv(
    "C:/Users/tonyw/Desktop/adult_income_data.csv", header=None, index_col=False,
    names=['age','workclass','fnlwgt','education','education-num',
           'marital-status','occupation','relationship','race','gender',
           'capital-gain','capital-loss','hours-per-week','native-country',
           'income']
)
# For illustration we select only a few of the columns
data = data[['age','workclass','education','gender','hours-per-week','occupation','income']]
display(data.head())
# Inspect the string-encoded categorical data
print(data.gender.value_counts())
print("Original features:\n", list(data.columns), '\n')
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
display(data_dummies.head())
# DataFrame.ix was removed from pandas; use .loc for label-based slicing
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract the NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape:{} y.shape:{}".format(X.shape, y.shape))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# raise max_iter so the default lbfgs solver converges on this data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Test score:{:.2f}".format(logreg.score(X_test, y_test)))
# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature':[0,1,2,1], 'Categorical Feature':['socks','fox','socks','box']})
display(demo_df)
# By default get_dummies only encodes the string column
display(pd.get_dummies(demo_df))
# Cast the integer column to string so it is one-hot encoded as well
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
display(pd.get_dummies(demo_df, columns=['Integer Feature','Categorical Feature']))

X,y=mglearn.datasets.make_wave(n_samples=100)
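# Binning demo on mglearn's synthetic 1-D wave dataset;
# 'line' below is a dense grid of inputs used to visualize model predictions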
line = np.linspace(-3,3,1000,endpoint=False).reshape(-1,1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(X,y)
'''
plt.plot(line,reg.predict(line),label='decision tree')
reg=LinearRegression().fit(X,y)
plt.plot(line,reg.predict(line),label='linear regression')
plt.plot(X[:,0],y,'o',c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc='best')
plt.show()
'''
bins = np.linspace(-3, 3, 11)
print("bins:{}".format(bins))
which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X[:5])
print("\nBin membership for data points:\n", which_bin[:5])
# Transform with OneHotEncoder
# (the keyword was renamed from 'sparse' to 'sparse_output' in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin)
# transform creates the one-hot encoding
X_binned = encoder.transform(which_bin)
print(X_binned[:5])
print("X_binned.shape:{}".format(X_binned.shape))
line_binned = encoder.transform(np.digitize(line, bins=bins))
reg=LinearRegression().fit(X_binned,y)
'''
plt.plot(line,reg.predict(line_binned),label='linear regression binned')
reg=DecisionTreeRegressor(min_samples_split=3).fit(X_binned,y)
plt.plot(line,reg.predict(line_binned),label='decision tree binned')
plt.plot(X[:,0],y,'o',c='k')
plt.vlines(bins,-3,3,linewidth=1,alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.show()
'''
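# Adding the original feature back gives the binned linear model a single slope
# shared across all bins (the bin indicators contribute per-bin offsets)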
X_combined = np.hstack([X,X_binned])
print(X_combined.shape)
reg = LinearRegression().fit(X_combined,y)
line_combined=np.hstack([line,line_binned])
'''
plt.plot(line, reg.predict(line_combined), label='linear regression combined')
for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.plot(X[:,0], y, 'o', c='k')
plt.show()

# Products of the original feature with the binned indicators give
# the linear model a separate slope within each bin
X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)
reg = LinearRegression().fit(X_product, y)
line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression product')
for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.plot(X[:,0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
plt.show()
'''
# Polynomials up to x**10:
# the default 'include_bias=True' would add a constant feature equal to 1
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
print("X_poly.shape:{}".format(X_poly.shape))
print("Entries of X:\n{}".format(X[:5]))
print("Entries of X_poly:\n{}".format(X_poly[:5]))
# get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
print("Polynomial feature names:\n{}".format(poly.get_feature_names_out()))
reg=LinearRegression().fit(X_poly,y)
line_poly=poly.transform(line)
'''
plt.plot(line,reg.predict(line_poly),label='polynomial linear regression')
plt.plot(X[:,0],y,'o',c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
'''
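# For comparison: a kernel SVM fits a similarly smooth curve
# without any explicit feature transformation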
'''
for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
plt.plot(X[:,0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
'''
# Note: load_boston was removed in scikit-learn 1.2; on newer versions,
# substitute another regression dataset such as fetch_california_housing
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
# Rescale the data to [0, 1]
scaler = MinMaxScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
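# Degree-2 polynomial expansion: a constant column, the original features,
# their squares, and all pairwise interaction terms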
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
print("X_train.shape:{}".format(X_train.shape))
print("X_train_poly.shape:{}".format(X_train_poly.shape))

ridge = Ridge().fit(X_train_scaled, y_train)
print("Score without interactions:{:.3f}".format(ridge.score(X_test_scaled, y_test)))
ridge = Ridge().fit(X_train_poly, y_train)
print("Score with interactions:{:.3f}".format(ridge.score(X_test_poly, y_test)))

# A random forest captures nonlinearities and interactions on its own,
# so the engineered polynomial features may not help it
rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
print("Score without interactions:{:.3f}".format(rf.score(X_test_scaled, y_test)))
rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print("Score with interactions:{:.3f}".format(rf.score(X_test_poly, y_test)))

# Back to the wave data: a Ridge baseline on the raw feature
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score = Ridge().fit(X_train, y_train).score(X_test, y_test)
print("Test score:{:.3f}".format(score))
# Univariate statistics
cancer = load_breast_cancer()
# Get a deterministic random state
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
# Add noise features to the data:
# the first 30 features come from the dataset, the next 50 are noise
X_w_noise = np.hstack([cancer.data, noise])
X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)
# Use f_classif (the default) and SelectPercentile to select 50% of the features
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
# Transform the training set
X_train_selected = select.transform(X_train)

print("X_train.shape:{}".format(X_train.shape))
print("X_train_select.shape:{}".format(X_train_selected.shape))
mask=select.get_support()
print(mask)
#将遮罩可视化--黑色为True,白色为False
'''
plt.matshow(mask.reshape(1,-1),cmap='gray_r')
plt.xlabel("Sample index")
plt.show()
'''
#对测试数据进行变换
X_test_selected = select.transform(X_test)
lr = LogisticRegression(max_iter=1000)  # raise max_iter so lbfgs converges
lr.fit(X_train,y_train)
print("Score with all features:{:.3f}".format(lr.score(X_test,y_test)))
lr.fit(X_train_selected,y_train)
print("Score with only selected features:{:.3f}".format(lr.score(X_test_selected,y_test)))
# Model-based feature selection
select = SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=42),threshold="median")
select.fit(X_train,y_train)
X_train_l1 = select.transform(X_train)
print("X_train.shape:{}".format(X_train.shape))
print("X_train_l1.shape:{}".format(X_train_l1.shape))
X_test_l1=select.transform(X_test)
score = LogisticRegression(max_iter=1000).fit(X_train_l1, y_train).score(X_test_l1, y_test)
print("Test score:{:.3f}".format(score))
# Iterative feature selection: recursive feature elimination (RFE)
select = RFE(RandomForestClassifier(n_estimators=100,random_state=42),n_features_to_select=40)
select.fit(X_train,y_train)

X_train_rfe=select.transform(X_train)
X_test_rfe = select.transform(X_test)
score = LogisticRegression(max_iter=1000).fit(X_train_rfe, y_train).score(X_test_rfe, y_test)
print("Test score:{:.3f}".format(score))
print("Test score:{:.3f}".format(select.score(X_test,y_test)))
# Utilizing expert knowledge (the citibike example)
# Since X = citibike.index.strftime("%s").astype("int").reshape(-1,1)
# raised an error, this experiment could not be run