Supervised Learning (1)

Classification and Regression
Classification and regression are essentially similar problems. Classification emphasizes a black-or-white outcome and works with a discrete set of labels, while regression is a fitting problem that works with continuous values: when predicting a house price, for example, being off by one yuan makes no essential difference.
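As a quick illustration (a minimal sketch, not from the book; the tiny dataset, the "house area" feature, and the prices below are made up), the same neighbors-based idea can answer either kind of question: the classifier returns one of a finite set of labels, while the regressor returns a continuous number.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[50.0], [60.0], [80.0], [120.0]])  # e.g. house area (made-up data)

# Classification: the prediction is a discrete label; a wrong label is simply wrong.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, np.array([0, 0, 1, 1]))          # discrete labels: 0 = cheap, 1 = expensive
print(clf.predict([[72.0]]))                # -> [1], one of the known classes

# Regression: the prediction is a continuous value; a small error is a small mistake.
reg = KNeighborsRegressor(n_neighbors=1)
reg.fit(X, np.array([100.0, 120.0, 160.0, 240.0]))  # continuous targets (made-up prices)
print(reg.predict([[72.0]]))                # -> [160.], a real number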

Generalization, Overfitting, and Underfitting
Generalization:
Building a model on the training set that can then make accurate predictions on new, unseen data (data with the same characteristics as the training set); the model is said to generalize from the training set to the test set.
Overfitting:
Paying too much attention to the details of the training set when fitting a model, yielding a model that performs well on the training set but cannot generalize to new data.
Underfitting:
Using a model that is too simple to capture the full content of the data and the variation within it; such a model may perform poorly even on the training set.
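A minimal sketch of both failure modes (my illustration, reusing the same breast-cancer dataset that the main listing below loads): a 1-nearest-neighbor model is very complex and tends to overfit, while a model that averages over 100 neighbors is very simple and tends to underfit.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

for k in (1, 100):  # 1 neighbor: very complex model; 100 neighbors: very simple model
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k={:3d}  train accuracy={:.2f}  test accuracy={:.2f}".format(
        k, clf.score(X_train, y_train), clf.score(X_test, y_test)))
# k=1 memorizes the training set (train accuracy 1.00) yet typically does worse
# on the test set: overfitting. A very large k smooths over most of the data
# and tends to score lower on both sets: underfitting.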

Supervised learning:
The process of using a set of samples with known labels to adjust a classifier's parameters until it reaches the required performance. It is the machine-learning task of inferring a function from labeled training data, which consists of a set of training examples.
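In scikit-learn terms this whole process is just fit on the labeled examples, predict on new data, and score to check the performance; a minimal sketch with made-up data:

from sklearn.neighbors import KNeighborsClassifier

# labeled training examples: features X with known classes y (made up)
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)                     # "adjust the parameters" from the labeled samples
print(clf.predict([[0.9, 0.9]]))  # infer the label of an unseen sample -> [1]
print(clf.score(X, y))            # measure whether the required performance is reached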

Code:
The code below is somewhat miscellaneous; it mainly covers displaying the datasets and classifying the data with the k-nearest-neighbors algorithm.

import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

if __name__ == "__main__":
    # forge: a small synthetic two-class classification dataset
    X, y = mglearn.datasets.make_forge()
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
    plt.legend(["Class 0", "Class 1"], loc=4)
    plt.xlabel("First feature")
    plt.ylabel("Second feature")
    # plt.show()
    print("X.shape: {}".format(X.shape))

    # wave: a synthetic single-feature regression dataset
    X, y = mglearn.datasets.make_wave(n_samples=40)
    plt.plot(X, y, 'o')
    plt.ylim(-3, 3)
    plt.xlabel("Feature")
    plt.ylabel("Target")
    # plt.show()

    # breast-cancer: a real-world binary classification dataset
    cancer = load_breast_cancer()
    print("cancer.keys():\n{}".format(cancer.keys()))
    print("Shape of cancer data: {}".format(cancer.data.shape))
    print("Sample counts per class:\n{}".format(
        {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))}))
    print("Feature names:\n{}".format(cancer.feature_names))

    # boston: a real-world regression dataset (house prices)
    boston = load_boston()
    print("Data shape: {}".format(boston.data.shape))
    X, y = mglearn.datasets.load_extended_boston()
    print("X.shape: {}".format(X.shape))

    # k-nearest-neighbors classification on forge
    mglearn.plots.plot_knn_classification(n_neighbors=3)
    # plt.show()
    X, y = mglearn.datasets.make_forge()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    print("Test set predictions: {}".format(clf.predict(X_test)))
    print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))

    # decision boundaries for different numbers of neighbors
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))
    for n_neighbors, ax in zip([1, 3, 9], axes):
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
        mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
        mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
        ax.set_title("{} neighbor(s)".format(n_neighbors))
        ax.set_xlabel("feature 0")
        ax.set_ylabel("feature 1")
    axes[0].legend(loc=3)
    # plt.show()

    # training vs. test accuracy as n_neighbors varies
    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, stratify=cancer.target, random_state=66)
    training_accuracy = []
    test_accuracy = []
    neighbors_settings = range(1, 11)
    for n_neighbors in neighbors_settings:
        # build the model
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X_train, y_train)
        # record training-set accuracy
        training_accuracy.append(clf.score(X_train, y_train))
        # record generalization (test-set) accuracy
        test_accuracy.append(clf.score(X_test, y_test))
    plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
    plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
    plt.ylabel("Accuracy")
    plt.xlabel("n_neighbors")
    plt.legend()
    # plt.show()

    # k-nearest-neighbors regression on wave
    mglearn.plots.plot_knn_regression(n_neighbors=3)
    # plt.show()

    X, y = mglearn.datasets.make_wave(n_samples=40)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    reg = KNeighborsRegressor(n_neighbors=3)
    reg.fit(X_train, y_train)
    print("Test set predictions:\n{}".format(reg.predict(X_test)))
    print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))

    # regression predictions for different numbers of neighbors
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    # create 1,000 data points, evenly spaced between -3 and 3
    line = np.linspace(-3, 3, 1000).reshape(-1, 1)
    for n_neighbors, ax in zip([1, 3, 9], axes):
        # make predictions with 1, 3, or 9 neighbors
        reg = KNeighborsRegressor(n_neighbors=n_neighbors)
        reg.fit(X_train, y_train)
        ax.plot(line, reg.predict(line))
        ax.plot(X_train, y_train, "^", c=mglearn.cm2(0), markersize=8)
        ax.plot(X_test, y_test, "v", c=mglearn.cm2(1), markersize=8)
        ax.set_title("{} neighbor(s)\n train score: {:.2f} test score: {:.2f}".format(
            n_neighbors, reg.score(X_train, y_train), reg.score(X_test, y_test)))
        ax.set_xlabel("Feature")
        ax.set_ylabel("Target")
    axes[0].legend(["Model predictions", "Training data/target", "Test data/target"], loc="best")
    # plt.show()