A simple and efficient way to get per-word counts from a corpus is to use CountVectorizer from scikit-learn. Getting back from a feature index to the word it represents is not immediately obvious; here's how to do it:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ...  # load your docs as an iterable of strings

count_vect = CountVectorizer()
doc_counts = count_vect.fit_transform(docs)

# doc_counts is a scipy.sparse CSR matrix; summing over axis 0 returns a
# 1 x n_features numpy matrix, which is why we need .ravel() below to
# flatten it into a 1-D array.
word_counts = zip(count_vect.get_feature_names_out(), np.asarray(doc_counts.sum(axis=0)).ravel())
word_counts = sorted(word_counts, key=lambda wc: wc[1], reverse=True)

# Display the top 100 words by total count
word_counts[:100]