Data Representation and Feature Engineering

One-Hot Encoding (dummy variables):
Machine learning algorithms often encounter categorical features whose values are discrete rather than continuous, so these values need to be encoded numerically.
The main idea is to use an N-bit state register to encode N states: each state gets its own bit, and at any time exactly one bit is set. For example:
Gender has two categories, male and female, so they are encoded as 01 and 10.
Nationality has three categories, Chinese, American, and French, so they are encoded as 100, 010, and 001.
Favorite sport has four categories, football, basketball, table tennis, and badminton, so they are encoded as 1000, 0100, 0010, and 0001.
When a new sample arrives, we encode it with the same rules, which yields a high-dimensional sparse matrix.
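As a minimal sketch (the three-row table below is made up for illustration), pandas' get_dummies produces exactly this kind of encoding:

import pandas as pd

# A made-up toy table with two categorical columns
df = pd.DataFrame({'gender': ['male', 'female', 'male'],
                   'country': ['China', 'USA', 'France']})

# Each category value becomes its own 0/1 indicator column,
# and exactly one indicator per original column is 1 in every row
print(pd.get_dummies(df))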

Binning:
Binning discretizes a continuous variable, or merges a many-valued discrete variable into fewer states, which lowers the model's risk of overfitting.
Advantages:

  1. Discrete features are easy to add and remove, which allows fast model iteration;
  2. Inner products with sparse vectors are fast to compute, the results are compact to store, and the representation scales easily;
  3. Discretized features are very robust to anomalous data: for example, a feature "age > 30" is 1, otherwise 0. Without discretization, an anomalous value such as "age = 300" would strongly distort the model (see the sketch after this list);
  4. Logistic regression is a generalized linear model with limited expressiveness; after discretizing one variable into N indicators, each indicator gets its own weight, which introduces nonlinearity and improves the model's ability to fit;
  5. Discretized features can be crossed, turning M + N variables into M * N variables, which introduces further nonlinearity and expressiveness;
  6. Discretization makes the model more stable: if user age is binned with 20-30 as one interval, a user does not become a completely different person just by turning one year older. Of course, samples near interval boundaries behave in exactly the opposite way, so choosing the intervals well is an art in itself;
  7. Discretization simplifies the logistic regression model and lowers the risk of overfitting;
  8. Missing values can be entered into the model as their own category;
  9. All variables are brought onto a similar scale.
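A minimal sketch of the outlier point from item 3 (the ages and bin edges below are made up): np.digitize maps each value to a bin index, so the anomalous age simply lands in the last bin instead of distorting a learned weight:

import numpy as np

# Hypothetical ages, including the anomalous "age 300" from item 3
ages = np.array([5, 23, 31, 45, 300])
# Hand-picked bin edges for illustration
bins = np.array([18, 30, 40, 60])

# np.digitize returns the index of the bin each value falls into
print(np.digitize(ages, bins=bins))   # -> [0 1 2 3 4]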

Linear models vs. tree models
A linear model assigns each feature a weight and sums the weighted features into a single output, while a tree model produces explicit, visualizable split rules, which amount to a step function that is constant on each interval. The code below fits both on the same one-dimensional dataset to make this contrast visible.

Code implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn
from IPython.display import display
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import RFE, SelectFromModel, SelectPercentile
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# The file contains no header row with the column names, so we pass header=None
# and provide the column names explicitly via 'names'
data = pd.read_csv(
    "C:/Users/tonyw/Desktop/adult_income_data.csv", header=None, index_col=False,
    names=['age','workclass','fnlwgt','education','education-num',
           'marital-status','occupation','relationship','race','gender',
           'capital-gain','capital-loss','hours-per-week','native-country',
           'income']
)
# For illustration we select only a few of the columns
data = data[['age','workclass','education','gender','hours-per-week','occupation','income']]
display(data.head())
# Inspect the string-encoded categorical data
print(data.gender.value_counts())
print("Original features:\n", list(data.columns), '\n')
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
display(data_dummies.head())
# DataFrame.ix was removed from pandas; use .loc for label-based slicing
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract the NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape:{} y.shape:{}".format(X.shape, y.shape))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# raise max_iter so the default lbfgs solver converges on this data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Test score:{:.2f}".format(logreg.score(X_test, y_test)))
# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature':[0,1,2,1], 'Categorical Feature':['socks','fox','socks','box']})
display(demo_df)
# By default get_dummies only encodes the string column
display(pd.get_dummies(demo_df))
# Cast the integer column to string so it is one-hot encoded as well
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
display(pd.get_dummies(demo_df, columns=['Integer Feature','Categorical Feature']))

X,y=mglearn.datasets.make_wave(n_samples=100)
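# Binning demo on mglearn's synthetic 1-D wave dataset;
# 'line' below is a dense grid of inputs used to visualize model predictions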
line = np.linspace(-3,3,1000,endpoint=False).reshape(-1,1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(X,y)
'''
plt.plot(line,reg.predict(line),label='decision tree')
reg=LinearRegression().fit(X,y)
plt.plot(line,reg.predict(line),label='linear regression')
plt.plot(X[:,0],y,'o',c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc='best')
plt.show()
'''
bins = np.linspace(-3, 3, 11)
print("bins:{}".format(bins))
which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X[:5])
print("\nBin membership for data points:\n", which_bin[:5])
# Transform with OneHotEncoder
# (the keyword was renamed from 'sparse' to 'sparse_output' in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin)
# transform creates the one-hot encoding
X_binned = encoder.transform(which_bin)
print(X_binned[:5])
print("X_binned.shape:{}".format(X_binned.shape))
line_binned = encoder.transform(np.digitize(line, bins=bins))
reg=LinearRegression().fit(X_binned,y)
'''
plt.plot(line,reg.predict(line_binned),label='linear regression binned')
reg=DecisionTreeRegressor(min_samples_split=3).fit(X_binned,y)
plt.plot(line,reg.predict(line_binned),label='decision tree binned')
plt.plot(X[:,0],y,'o',c='k')
plt.vlines(bins,-3,3,linewidth=1,alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.show()
'''
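# Adding the original feature back gives the binned linear model a single slope
# shared across all bins (the bin indicators contribute per-bin offsets)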
X_combined = np.hstack([X,X_binned])
print(X_combined.shape)
reg = LinearRegression().fit(X_combined,y)
line_combined=np.hstack([line,line_binned])
'''
plt.plot(line, reg.predict(line_combined), label='linear regression combined')
for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.plot(X[:,0], y, 'o', c='k')
plt.show()

# Products of the original feature with the binned indicators give
# the linear model a separate slope within each bin
X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)
reg = LinearRegression().fit(X_product, y)
line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression product')
for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k')
plt.plot(X[:,0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
plt.show()
'''
# Polynomials up to x**10:
# the default 'include_bias=True' would add a constant feature equal to 1
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
print("X_poly.shape:{}".format(X_poly.shape))
print("Entries of X:\n{}".format(X[:5]))
print("Entries of X_poly:\n{}".format(X_poly[:5]))
# get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
print("Polynomial feature names:\n{}".format(poly.get_feature_names_out()))
reg=LinearRegression().fit(X_poly,y)
line_poly=poly.transform(line)
'''
plt.plot(line,reg.predict(line_poly),label='polynomial linear regression')
plt.plot(X[:,0],y,'o',c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
'''
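# For comparison: a kernel SVM fits a similarly smooth curve
# without any explicit feature transformation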
'''
for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
plt.plot(X[:,0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
'''
# Note: load_boston was removed in scikit-learn 1.2; on newer versions,
# substitute another regression dataset such as fetch_california_housing
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
# Rescale the data to [0, 1]
scaler = MinMaxScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
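# Degree-2 polynomial expansion: a constant column, the original features,
# their squares, and all pairwise interaction terms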
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
print("X_train.shape:{}".format(X_train.shape))
print("X_train_poly.shape:{}".format(X_train_poly.shape))

ridge = Ridge().fit(X_train_scaled, y_train)
print("Score without interactions:{:.3f}".format(ridge.score(X_test_scaled, y_test)))
ridge = Ridge().fit(X_train_poly, y_train)
print("Score with interactions:{:.3f}".format(ridge.score(X_test_poly, y_test)))

# A random forest captures nonlinearities and interactions on its own,
# so the engineered polynomial features may not help it
rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
print("Score without interactions:{:.3f}".format(rf.score(X_test_scaled, y_test)))
rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print("Score with interactions:{:.3f}".format(rf.score(X_test_poly, y_test)))

# Back to the wave data: a Ridge baseline on the raw feature
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score = Ridge().fit(X_train, y_train).score(X_test, y_test)
print("Test score:{:.3f}".format(score))
# Univariate statistics
cancer = load_breast_cancer()
# Get a deterministic random state
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
# Add noise features to the data:
# the first 30 features come from the dataset, the next 50 are noise
X_w_noise = np.hstack([cancer.data, noise])
X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)
# Use f_classif (the default) and SelectPercentile to select 50% of the features
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
# Transform the training set
X_train_selected = select.transform(X_train)

print("X_train.shape:{}".format(X_train.shape))
print("X_train_select.shape:{}".format(X_train_selected.shape))
mask=select.get_support()
print(mask)
#将遮罩可视化--黑色为True,白色为False
'''
plt.matshow(mask.reshape(1,-1),cmap='gray_r')
plt.xlabel("Sample index")
plt.show()
'''
#对测试数据进行变换
X_test_selected = select.transform(X_test)
lr = LogisticRegression(max_iter=1000)  # raise max_iter so lbfgs converges
lr.fit(X_train,y_train)
print("Score with all features:{:.3f}".format(lr.score(X_test,y_test)))
lr.fit(X_train_selected,y_train)
print("Score with only selected features:{:.3f}".format(lr.score(X_test_selected,y_test)))
# Model-based feature selection
select = SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=42),threshold="median")
select.fit(X_train,y_train)
X_train_l1 = select.transform(X_train)
print("X_train.shape:{}".format(X_train.shape))
print("X_train_l1.shape:{}".format(X_train_l1.shape))
X_test_l1=select.transform(X_test)
score = LogisticRegression(max_iter=1000).fit(X_train_l1, y_train).score(X_test_l1, y_test)
print("Test score:{:.3f}".format(score))
# Iterative feature selection: recursive feature elimination (RFE)
select = RFE(RandomForestClassifier(n_estimators=100,random_state=42),n_features_to_select=40)
select.fit(X_train,y_train)

X_train_rfe=select.transform(X_train)
X_test_rfe = select.transform(X_test)
score = LogisticRegression(max_iter=1000).fit(X_train_rfe, y_train).score(X_test_rfe, y_test)
print("Test score:{:.3f}".format(score))
print("Test score:{:.3f}".format(select.score(X_test,y_test)))
# Utilizing expert knowledge (the citibike example)
# Since X = citibike.index.strftime("%s").astype("int").reshape(-1,1)
# raised an error, this experiment could not be run