Bag of words:
Simply put, a sentence is split into tokens and we count how many times each word occurs in that sentence. This operation discards the order relations between the words.
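As a toy illustration (separate from the scikit-learn code at the end of this section, and using made-up sentences), a bag of words can be built with a plain Counter; the two sentences below differ only in word order and end up with exactly the same representation:

from collections import Counter

# hypothetical toy sentences: same words, different order
s1 = "the cat chased the dog"
s2 = "the dog chased the cat"

bow1 = Counter(s1.split())   # {'the': 2, 'cat': 1, 'chased': 1, 'dog': 1}
bow2 = Counter(s2.split())

print(bow1)
print(bow1 == bow2)          # True: word order is lost in the bag-of-words view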
Word vectors:
To encode language for a computer we could use one-hot encoding, but this first produces a large number of sparse vectors, which leads to the curse of dimensionality, and second it cannot express the similarity between two words, because any two one-hot vectors are orthogonal. Encoding words as dense word vectors is therefore more practical. Here the word vectors are stacked row by row to form a word-vector matrix, which is then multiplied by the column vector of the first word.
By the rules of matrix multiplication, every row of the left matrix is multiplied with the column vector on the right, and each resulting number is exactly the similarity (dot product) of the two vectors, so this single computation gives the relation between every word and the first word.
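A minimal numpy sketch of that matrix-vector computation; the word vectors below are made-up toy values rather than learned embeddings (normalizing the rows to unit length first would turn the dot products into cosine similarities):

import numpy as np

# hypothetical 4 words embedded in a 3-dimensional space, one row per word
W = np.array([[0.9, 0.1, 0.0],   # "king"
              [0.8, 0.2, 0.1],   # "queen"
              [0.1, 0.9, 0.2],   # "apple"
              [0.0, 0.8, 0.3]])  # "orange"

v = W[0]            # vector of the first word ("king")
similarity = W @ v  # each entry is the dot product of one row with v
print(similarity)   # the first word is most similar to itself, then to "queen"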
TF-IDF:
This algorithm computes the weight of each word in a text, since different words differ in importance: technical terms, for instance, usually occur only a few times in absolute terms, yet they matter a lot. The weight of a word is computed from two parts, a tf part and an idf part, whose product gives the final tf-idf score:

$\mathrm{tf}_{w,d} = \dfrac{n_{w,d}}{\sum_{k} n_{k,d}}$

where the numerator $n_{w,d}$ is the number of times the word $w$ occurs in document $d$ and the denominator is the total number of word occurrences in that document, and

$\mathrm{idf}_{w} = \log \dfrac{N}{N_w}$

where $N$ is the number of documents in the training set and $N_w$ is the number of training documents in which the word $w$ appears.
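A short sketch that applies the two formulas above to a made-up corpus; note that scikit-learn's TfidfVectorizer, used in the code at the end of this section, computes a smoothed idf and L2-normalizes each row, so its numbers will differ from this plain version:

import math
from collections import Counter

docs = [
    "the movie was good".split(),
    "the movie was bad the acting was bad".split(),
    "good acting".split(),
]

N = len(docs)  # number of documents in the toy training set

def tfidf(word, doc):
    counts = Counter(doc)
    tf = counts[word] / sum(counts.values())    # term frequency in this document
    n_w = sum(1 for d in docs if word in d)     # number of documents containing the word
    idf = math.log(N / n_w)                     # inverse document frequency
    return tf * idf

print(tfidf("bad", docs[1]))   # rare across documents -> higher weight
print(tfidf("the", docs[1]))   # appears in most documents -> lower weight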
n-gram:
The idea behind n-grams is that the occurrence of a word depends on the several words that come before it: given enough of this context, we can guess what the current word is, much like a word-guessing game where you are given clues and have to guess the right word. The n-gram model is a language model, a probability-based model: given a sentence as input, it outputs the probability of that sentence, and the sentence with the highest probability can be taken as the correct prediction. In practice N is usually 2 or 3, which keeps the model simple; choosing a larger N leads to sparsity problems.
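A minimal bigram (N = 2) sketch of this idea over a made-up two-sentence corpus, using plain maximum-likelihood counts with no smoothing; it is separate from the CountVectorizer(ngram_range=...) feature extraction used in the code below:

from collections import Counter

# hypothetical tiny corpus with sentence-boundary markers
corpus = [
    "<s> the wise man is wise </s>",
    "<s> the fool is a fool </s>",
]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def p_bigram(prev, word):
    # maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= p_bigram(prev, word)
    return prob

print(sentence_prob("the wise man is wise"))     # all bigrams were seen -> non-zero probability
print(sentence_prob("the wise fool is a fool"))  # unseen bigram ("wise", "fool") -> probability 0.0

The second sentence gets probability 0 because one of its bigrams never occurs in the corpus, which is exactly the sparsity problem that gets worse as N grows.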
Latent Dirichlet Allocation (LDA):
LDA is generally used for topic extraction: it analyses each document automatically, counts the words it contains, and from these statistics determines which topics the document covers and what proportion each topic accounts for. The corresponding probabilistic model is

$p(\mathbf{w}, \mathbf{z}, \theta, \varphi \mid \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)\, p(w_{m,n} \mid \varphi_{z_{m,n}})$

where $K$ is the number of topics, $M$ is the total number of documents, and $N_m$ is the number of words in the $m$-th document. $\alpha$ is the Dirichlet prior of the per-document topic multinomial and $\beta$ is the Dirichlet prior of the per-topic word multinomial. $z_{m,n}$ is the topic of the $n$-th word in the $m$-th document and $w_{m,n}$ is that word itself. The two latent variables $\theta_m$ and $\varphi_k$ are the topic distribution of the $m$-th document (a $K$-dimensional vector) and the word distribution of the $k$-th topic (a $V$-dimensional vector, with $V$ the vocabulary size).
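To make the generative story behind that formula concrete, here is a small numpy simulation of it; the vocabulary, the priors α and β, and all sizes are made-up toy values, unrelated to the IMDB experiment below:

import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "goal", "team", "vote", "law", "party"]  # toy vocabulary (V = 6)
K, M, N_m = 2, 3, 8      # topics, documents, words per document
alpha, beta = 0.5, 0.1   # Dirichlet priors

# phi_k ~ Dirichlet(beta): word distribution of each topic (K x V)
phi = rng.dirichlet([beta] * len(vocab), size=K)

for m in range(M):
    # theta_m ~ Dirichlet(alpha): topic distribution of document m (K-dimensional)
    theta = rng.dirichlet([alpha] * K)
    words = []
    for n in range(N_m):
        z = rng.choice(K, p=theta)             # z_mn: pick a topic for this word
        w = rng.choice(len(vocab), p=phi[z])   # w_mn: pick a word from that topic
        words.append(vocab[w])
    print("doc {} (theta={}): {}".format(m, np.round(theta, 2), " ".join(words)))

Fitting LDA runs this story in reverse: given only the observed words w, it infers θ and φ, which is what LatentDirichletAllocation does in the code below.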
Code implementation:
from sklearn.datasets import load_files
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
import mglearn
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation
if __name__ == "__main__":
    reviews_train = load_files("C:/Users/tonyw/Desktop/aclImdb/train/")
    # load_files returns a Bunch object containing the training texts and training labels
    text_train, y_train = reviews_train.data, reviews_train.target
    print("type of text_train:{}".format(type(text_train)))
    print("length of text_train:{}".format(len(text_train)))
    print("text_train[1]:\n{}".format(text_train[1]))
    # clean up the HTML line breaks
    text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
    print("Samples per class (training):{}".format(np.bincount(y_train)))
    reviews_test = load_files("C:/Users/tonyw/Desktop/aclImdb/test")
    text_test, y_test = reviews_test.data, reviews_test.target
    print("Number of documents in test data:{}".format(len(text_test)))
    print("Samples per class(test):{}".format(np.bincount(y_test)))
    text_test = [doc.replace(b"<br />", b" ") for doc in text_test]
    # bag-of-words model
    bards_words = ["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]
    vect = CountVectorizer()
    vect.fit(bards_words)
    print("Vocabulary size:{}".format(len(vect.vocabulary_)))
    print("Vocabulary content:\n{}".format(vect.vocabulary_))
    bag_of_words = vect.transform(bards_words)
    print("bag_of_words:{}".format(repr(bag_of_words)))
    print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()))
    # apply the bag-of-words model to the movie reviews
    vect = CountVectorizer().fit(text_train)
    X_train = vect.transform(text_train)
    print("X_train:\n{}".format(repr(X_train)))
    feature_names = vect.get_feature_names()
    print("Number of features:{}".format(len(feature_names)))
    print("First 20 features:\n{}".format(feature_names[:20]))
    print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
    print("Every 2000th feature:\n{}".format(feature_names[::2000]))
    scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
    print("Mean cross-validation accuracy:{:.2f}".format(np.mean(scores)))
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    print("Best parameters: ", grid.best_params_)
    X_test = vect.transform(text_test)
    print("{:.2f}".format(grid.score(X_test, y_test)))
    vect = CountVectorizer(min_df=5).fit(text_train)
    X_train = vect.transform(text_train)
    print("X_train with min_df:{}".format(repr(X_train)))
    feature_names = vect.get_feature_names()
    print("First 50 features:\n{}".format(feature_names[:50]))
    print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
    print("Every 700th feature:\n{}".format(feature_names[::700]))
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    # stop words
    print("Number of stop words:{}".format(len(ENGLISH_STOP_WORDS)))
    print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
    vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
    X_train = vect.transform(text_train)
    print("X_train with stop words:\n{}".format(repr(X_train)))
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    # tf-idf
    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]
    # transform the training dataset
    X_train = vectorizer.transform(text_train)
    # find the maximum value of each feature over the dataset
    max_value = X_train.max(axis=0).toarray().ravel()
    sorted_by_tfidf = max_value.argsort()
    # get the feature names
    feature_names = np.array(vectorizer.get_feature_names())
    print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
    print("Features with highest tfidf:\n{}".format(feature_names[sorted_by_tfidf[-20:]]))
    sorted_by_idf = np.argsort(vectorizer.idf_)
    print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))
    mglearn.tools.visualize_coefficients(grid.best_estimator_.named_steps["logisticregression"].coef_, feature_names, n_top_features=40)
    plt.show()
    # bards_words example
    print("bards_words:\n{}".format(bards_words))
    # unigrams
    cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
    print("Vocabulary size:{}".format(len(cv.vocabulary_)))
    print("Vocabulary:\n{}".format(cv.get_feature_names()))
    # bigrams
    cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
    print("Vocabulary size:{}".format(len(cv.vocabulary_)))
    print("Vocabulary:\n{}".format(cv.get_feature_names()))
    print("Transformed data(dense):\n{}".format(cv.transform(bards_words).toarray()))
    # unigrams, bigrams and trigrams on bards_words
    cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
    print("Vocabulary size:{}".format(len(cv.vocabulary_)))
    print("Vocabulary:\n{}".format(cv.get_feature_names()))
    # use a grid search to find the best setting for the n-gram range
    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    # running the grid search takes a long time, because the grid is large and includes trigrams
    param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, y_train)
    print("Best cross-validation score:{:.2f}".format(grid.best_score_))
    print("Best parameters:\n{}".format(grid.best_params_))
    # extract scores from the grid search
    scores = grid.cv_results_['mean_test_score'].reshape(-1, 3)
    # heatmap visualization
    '''
    heatmap = mglearn.tools.heatmap(
        scores, xlabel="C", ylabel="ngram_range", cmap="viridis", fmt="%.3f",
        xticklabels=param_grid["logisticregression__C"],
        yticklabels=param_grid["tfidfvectorizer__ngram_range"]
    )
    plt.colorbar(heatmap)
    plt.show()
    '''
    # extract feature names and coefficients
    vect = grid.best_estimator_.named_steps['tfidfvectorizer']
    feature_names = np.array(vect.get_feature_names())
    coef = grid.best_estimator_.named_steps['logisticregression'].coef_
    '''
    mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=40)
    plt.show()
    '''
    # find the trigram features
    mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3
    # visualize only the trigram features
    '''
    mglearn.tools.visualize_coefficients(coef.ravel()[mask], feature_names[mask], n_top_features=40)
    plt.show()
    '''
    # LDA (Latent Dirichlet Allocation)
    vect = CountVectorizer(max_features=10000, max_df=.15)
    X = vect.fit_transform(text_train)
    # note: n_topics was renamed to n_components in newer scikit-learn versions
    lda = LatentDirichletAllocation(n_topics=10, learning_method="batch", max_iter=25, random_state=0)
    # build the model and transform the data in one step;
    # computing the transform takes a while, and doing both at once saves time
    document_topics = lda.fit_transform(X)
    print(lda.components_.shape)
    # for each topic (a row of components_), sort the features in ascending order,
    # then reverse the rows with [:, ::-1] to make the order descending
    sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
    # get the feature names from the vectorizer
    feature_names = np.array(vect.get_feature_names())
    # print the first 10 topics
    '''
    mglearn.tools.print_topics(topics=range(10), feature_names=feature_names, sorting=sorting, topics_per_chunk=5, n_words=10)
    plt.show()
    '''
    lda100 = LatentDirichletAllocation(n_topics=100, learning_method="batch", max_iter=25, random_state=0)
    document_topics100 = lda100.fit_transform(X)
    topics = np.array([7, 16, 24, 25, 28, 36, 37, 45, 51, 53, 54, 63, 89, 97])
    sorting = np.argsort(lda100.components_, axis=1)[:, ::-1]
    feature_names = np.array(vect.get_feature_names())
    '''
    mglearn.tools.print_topics(topics=topics, feature_names=feature_names, sorting=sorting, topics_per_chunk=7, n_words=20)
    plt.show()
    '''
    music = np.argsort(document_topics100[:, 45])[::-1]
    # print the documents in which this topic is most prominent (top 10)
    for i in music[:10]:
        # show the first two sentences
        print(b".".join(text_train[i].split(b".")[:2]) + b".\n")
    # topics learned by LDA
    fig, ax = plt.subplots(1, 2, figsize=(10, 10))
    topic_names = ["{:>2} ".format(i) + " ".join(words) for i, words in enumerate(feature_names[sorting[:, :2]])]
    # bar chart with two columns
    for col in [0, 1]:
        start = col * 50
        end = (col + 1) * 50
        ax[col].barh(np.arange(50), np.sum(document_topics100, axis=0)[start:end])
        ax[col].set_yticks(np.arange(50))
        ax[col].set_yticklabels(topic_names[start:end], ha="left", va="top")
        ax[col].invert_yaxis()
        ax[col].set_xlim(0, 2000)
        yax = ax[col].get_yaxis()
        yax.set_tick_params(pad=130)
    plt.tight_layout()
    plt.show()