Bag of words:
Simply put, a sentence is split into tokens and we count how many times each word occurs in that sentence. This operation discards the order relations between the words.
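As a toy illustration (separate from the scikit-learn code at the end of this section, and using made-up sentences), a bag of words can be built with a plain Counter; the two sentences below differ only in word order and end up with exactly the same representation:

from collections import Counter

# hypothetical toy sentences: same words, different order
s1 = "the cat chased the dog"
s2 = "the dog chased the cat"

bow1 = Counter(s1.split())   # {'the': 2, 'cat': 1, 'chased': 1, 'dog': 1}
bow2 = Counter(s2.split())

print(bow1)
print(bow1 == bow2)          # True: word order is lost in the bag-of-words view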
Word vectors:
To encode language for a computer we could use one-hot encoding, but this first produces a large number of sparse vectors, which leads to the curse of dimensionality, and second it cannot express the similarity between two words, because any two one-hot vectors are orthogonal. Encoding words as dense word vectors is therefore more practical. Here the word vectors are stacked row by row to form a word-vector matrix, which is then multiplied by the column vector of the first word.
By the rules of matrix multiplication, every row of the left matrix is multiplied with the column vector on the right, and each resulting number is exactly the similarity (dot product) of the two vectors, so this single computation gives the relation between every word and the first word.
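A minimal numpy sketch of that matrix-vector computation; the word vectors below are made-up toy values rather than learned embeddings (normalizing the rows to unit length first would turn the dot products into cosine similarities):

import numpy as np

# hypothetical 4 words embedded in a 3-dimensional space, one row per word
W = np.array([[0.9, 0.1, 0.0],   # "king"
              [0.8, 0.2, 0.1],   # "queen"
              [0.1, 0.9, 0.2],   # "apple"
              [0.0, 0.8, 0.3]])  # "orange"

v = W[0]            # vector of the first word ("king")
similarity = W @ v  # each entry is the dot product of one row with v
print(similarity)   # the first word is most similar to itself, then to "queen"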
TF-IDF:
This algorithm computes the weight of each word in a text, since different words differ in importance: technical terms, for instance, usually occur only a few times in absolute terms, yet they matter a lot. The weight of a word is computed from two parts, a tf part and an idf part, whose product gives the final tf-idf score:

$\mathrm{tf}_{w,d} = \dfrac{n_{w,d}}{\sum_{k} n_{k,d}}$

where the numerator $n_{w,d}$ is the number of times the word $w$ occurs in document $d$ and the denominator is the total number of word occurrences in that document, and

$\mathrm{idf}_{w} = \log \dfrac{N}{N_w}$

where $N$ is the number of documents in the training set and $N_w$ is the number of training documents in which the word $w$ appears.
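A short sketch that applies the two formulas above to a made-up corpus; note that scikit-learn's TfidfVectorizer, used in the code at the end of this section, computes a smoothed idf and L2-normalizes each row, so its numbers will differ from this plain version:

import math
from collections import Counter

docs = [
    "the movie was good".split(),
    "the movie was bad the acting was bad".split(),
    "good acting".split(),
]

N = len(docs)  # number of documents in the toy training set

def tfidf(word, doc):
    counts = Counter(doc)
    tf = counts[word] / sum(counts.values())    # term frequency in this document
    n_w = sum(1 for d in docs if word in d)     # number of documents containing the word
    idf = math.log(N / n_w)                     # inverse document frequency
    return tf * idf

print(tfidf("bad", docs[1]))   # rare across documents -> higher weight
print(tfidf("the", docs[1]))   # appears in most documents -> lower weight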
n-gram:
The idea behind n-grams is that the occurrence of a word depends on the several words that come before it: given enough of this context, we can guess what the current word is, much like a word-guessing game where you are given clues and have to guess the right word. The n-gram model is a language model, a probability-based model: given a sentence as input, it outputs the probability of that sentence, and the sentence with the highest probability can be taken as the correct prediction. In practice N is usually 2 or 3, which keeps the model simple; choosing a larger N leads to sparsity problems.
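A minimal bigram (N = 2) sketch of this idea over a made-up two-sentence corpus, using plain maximum-likelihood counts with no smoothing; it is separate from the CountVectorizer(ngram_range=...) feature extraction used in the code below:

from collections import Counter

# hypothetical tiny corpus with sentence-boundary markers
corpus = [
    "<s> the wise man is wise </s>",
    "<s> the fool is a fool </s>",
]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def p_bigram(prev, word):
    # maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= p_bigram(prev, word)
    return prob

print(sentence_prob("the wise man is wise"))     # all bigrams were seen -> non-zero probability
print(sentence_prob("the wise fool is a fool"))  # unseen bigram ("wise", "fool") -> probability 0.0

The second sentence gets probability 0 because one of its bigrams never occurs in the corpus, which is exactly the sparsity problem that gets worse as N grows.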
Latent Dirichlet Allocation (LDA):
LDA is generally used for topic extraction: it analyses each document automatically, counts the words it contains, and from these statistics determines which topics the document covers and what proportion each topic accounts for. The corresponding probabilistic model is

$p(\mathbf{w}, \mathbf{z}, \theta, \varphi \mid \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)\, p(w_{m,n} \mid \varphi_{z_{m,n}})$

where $K$ is the number of topics, $M$ is the total number of documents, and $N_m$ is the number of words in the $m$-th document. $\alpha$ is the Dirichlet prior of the per-document topic multinomial and $\beta$ is the Dirichlet prior of the per-topic word multinomial. $z_{m,n}$ is the topic of the $n$-th word in the $m$-th document and $w_{m,n}$ is that word itself. The two latent variables $\theta_m$ and $\varphi_k$ are the topic distribution of the $m$-th document (a $K$-dimensional vector) and the word distribution of the $k$-th topic (a $V$-dimensional vector, with $V$ the vocabulary size).
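To make the generative story behind that formula concrete, here is a small numpy simulation of it; the vocabulary, the priors α and β, and all sizes are made-up toy values, unrelated to the IMDB experiment below:

import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "goal", "team", "vote", "law", "party"]  # toy vocabulary (V = 6)
K, M, N_m = 2, 3, 8      # topics, documents, words per document
alpha, beta = 0.5, 0.1   # Dirichlet priors

# phi_k ~ Dirichlet(beta): word distribution of each topic (K x V)
phi = rng.dirichlet([beta] * len(vocab), size=K)

for m in range(M):
    # theta_m ~ Dirichlet(alpha): topic distribution of document m (K-dimensional)
    theta = rng.dirichlet([alpha] * K)
    words = []
    for n in range(N_m):
        z = rng.choice(K, p=theta)             # z_mn: pick a topic for this word
        w = rng.choice(len(vocab), p=phi[z])   # w_mn: pick a word from that topic
        words.append(vocab[w])
    print("doc {} (theta={}): {}".format(m, np.round(theta, 2), " ".join(words)))

Fitting LDA runs this story in reverse: given only the observed words w, it infers θ and φ, which is what LatentDirichletAllocation does in the code below.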
Code implementation:
from sklearn.datasets import load_files
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
import mglearn
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation
if __name__ == "__main__":
    reviews_train = load_files("C:/Users/tonyw/Desktop/aclImdb/train/")
    # load_files returns a Bunch object containing the training texts and training labels
    text_train, y_train = reviews_train.data, reviews_train.target
    print("type of text_train:{}".format(type(text_train)))
    print("length of text_train:{}".format(len(text_train)))
    print("text_train[1]:\n{}".format(text_train[1]))
    # clean up the HTML line breaks
    text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
    print("Samples per class (training):{}".format(np.bincount(y_train)))
    reviews_test = load_files("C:/Users/tonyw/Desktop/aclImdb/test")
    text_test, y_test = reviews_test.data, reviews_test.target
    print("Number of documents in test data:{}".format(len(text_test)))
    print("Samples per class(test):{}".format(np.bincount(y_test)))
    text_test = [doc.replace(b"<br />", b" ") for doc in text_test]
    # bag-of-words model
    bards_words = ["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]
    vect = CountVectorizer()
    vect.fit(bards_words)
    print("Vocabulary size:{}".format(len(vect.vocabulary_)))
    print("Vocabulary content:\n{}".format(vect.vocabulary_))
    bag_of_words = vect.transform(bards_words)
    print("bag_of_words:{}".format(repr(bag_of_words)))
    print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()))
    # apply the bag-of-words model to the movie reviews
    vect = CountVectorizer().fit(text_train)
    X_train = vect.transform(text_train)
    print("X_train:\n{}".format(repr(X_train)))
    feature_names = vect.get_feature_names()
    print("Number of features:{}".format(len(feature_names)))
    print("First 20 features:\n{}".format(feature_names[:20]))
    print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
    print("Every 2000th feature:\n{}".format(feature_names[::2000]))
    scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
    print("Mean cross-validation accuracy:{:.2f}".format(np.mean(scores)))
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    print("Best parameters: ", grid.best_params_)
    X_test = vect.transform(text_test)
    print("{:.2f}".format(grid.score(X_test, y_test)))
    vect = CountVectorizer(min_df=5).fit(text_train)
    X_train = vect.transform(text_train)
    print("X_train with min_df:{}".format(repr(X_train)))
    feature_names = vect.get_feature_names()
    print("First 50 features:\n{}".format(feature_names[:50]))
    print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
    print("Every 700th feature:\n{}".format(feature_names[::700]))
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    # stop words
    print("Number of stop words:{}".format(len(ENGLISH_STOP_WORDS)))
    print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
    vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
    X_train = vect.transform(text_train)
    print("X_train with stop words:\n{}".format(repr(X_train)))
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    # tf-idf
    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]
    # transform the training dataset
    X_train = vectorizer.transform(text_train)
    # find the maximum value of each feature over the dataset
    max_value = X_train.max(axis=0).toarray().ravel()
    sorted_by_tfidf = max_value.argsort()
    # get the feature names
    feature_names = np.array(vectorizer.get_feature_names())
    print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
    print("Features with highest tfidf:\n{}".format(feature_names[sorted_by_tfidf[-20:]]))
    sorted_by_idf = np.argsort(vectorizer.idf_)
    print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))
    mglearn.tools.visualize_coefficients(grid.best_estimator_.named_steps["logisticregression"].coef_, feature_names, n_top_features=40)
    plt.show()
    # bards_words example
    print("bards_words:\n{}".format(bards_words))
    # unigrams
    cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
    print("Vocabulary size:{}".format(len(cv.vocabulary_)))
    print("Vocabulary:\n{}".format(cv.get_feature_names()))
    # bigrams
    cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
    print("Vocabulary size:{}".format(len(cv.vocabulary_)))
    print("Vocabulary:\n{}".format(cv.get_feature_names()))
    print("Transformed data(dense):\n{}".format(cv.transform(bards_words).toarray()))
    # unigrams, bigrams and trigrams on bards_words
    cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
    print("Vocabulary size:{}".format(len(cv.vocabulary_)))
    print("Vocabulary:\n{}".format(cv.get_feature_names()))
    # use a grid search to find the best setting for the n-gram range
    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    # running the grid search takes a long time, because the grid is large and includes trigrams
    param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    print("Best parameters:\n{}".format(grid.best_params_))
    # extract scores from the grid search
    scores = grid.cv_results_['mean_test_score'].reshape(-1, 3)
    # heatmap visualization
    '''
    heatmap = mglearn.tools.heatmap(
        scores, xlabel="C", ylabel="ngram_range", cmap="viridis", fmt="%.3f",
        xticklabels=param_grid["logisticregression__C"],
        yticklabels=param_grid["tfidfvectorizer__ngram_range"]
    )
    plt.colorbar(heatmap)
    plt.show()
    '''
    # extract feature names and coefficients
    vect = grid.best_estimator_.named_steps['tfidfvectorizer']
    feature_names = np.array(vect.get_feature_names())
    coef = grid.best_estimator_.named_steps['logisticregression'].coef_
    '''
    mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=40)
    plt.show()
    '''
    # find the trigram features
    mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3
    # visualize only the trigram features
    '''
    mglearn.tools.visualize_coefficients(coef.ravel()[mask], feature_names[mask], n_top_features=40)
    plt.show()
    '''
    # LDA (Latent Dirichlet Allocation)
    vect = CountVectorizer(max_features=10000, max_df=.15)
    X = vect.fit_transform(text_train)
    # note: n_topics was renamed to n_components in newer scikit-learn versions
    lda = LatentDirichletAllocation(n_topics=10, learning_method="batch", max_iter=25, random_state=0)
    # build the model and transform the data in one step;
    # computing the transform takes a while, and doing both at once saves time
    document_topics = lda.fit_transform(X)
    print(lda.components_.shape)
    # for each topic (a row of components_), sort the features in ascending order,
    # then reverse the rows with [:, ::-1] to make the order descending
    sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
    # get the feature names from the vectorizer
    feature_names = np.array(vect.get_feature_names())
    # print the first 10 topics
    '''
    mglearn.tools.print_topics(topics=range(10), feature_names=feature_names, sorting=sorting, topics_per_chunk=5, n_words=10)
    plt.show()
    '''
    lda100 = LatentDirichletAllocation(n_topics=100, learning_method="batch", max_iter=25, random_state=0)
    document_topics100 = lda100.fit_transform(X)
    topics = np.array([7, 16, 24, 25, 28, 36, 37, 45, 51, 53, 54, 63, 89, 97])
    sorting = np.argsort(lda100.components_, axis=1)[:, ::-1]
    feature_names = np.array(vect.get_feature_names())
    '''
    mglearn.tools.print_topics(topics=topics, feature_names=feature_names, sorting=sorting, topics_per_chunk=7, n_words=20)
    plt.show()
    '''
    music = np.argsort(document_topics100[:, 45])[::-1]
    # print the documents in which this topic is most prominent (top 10)
    for i in music[:10]:
        # show the first two sentences
        print(b".".join(text_train[i].split(b".")[:2]) + b".\n")
    # topics learned by LDA
    fig, ax = plt.subplots(1, 2, figsize=(10, 10))
    topic_names = ["{:>2} ".format(i) + " ".join(words) for i, words in enumerate(feature_names[sorting[:, :2]])]
    # bar chart with two columns
    for col in [0, 1]:
        start = col * 50
        end = (col + 1) * 50
        ax[col].barh(np.arange(50), np.sum(document_topics100, axis=0)[start:end])
        ax[col].set_yticks(np.arange(50))
        ax[col].set_yticklabels(topic_names[start:end], ha="left", va="top")
        ax[col].invert_yaxis()
        ax[col].set_xlim(0, 2000)
        yax = ax[col].get_yaxis()
        yax.set_tick_params(pad=130)
    plt.tight_layout()
    plt.show()