Tuesday, June 21, 2016

How to list word occurrences using CountVectorizer from Scikit Learn

A simple and efficient way to get word frequency counts (the total number of times each word occurs) from a corpus is to use CountVectorizer from Scikit Learn.

Getting back to the word from the index is not immediately obvious. Here's how to do it:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ...  # placeholder: load your docs as an iterable of strings

count_vect = CountVectorizer()
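# fit_transform returns a SciPy sparse matrix (scipy.sparse.csr_matrix) with one row per document.
# Summing it gives a 1 x n_words NumPy matrix, not a flat array, which is why we need np.asarray(...).ravel() below.
doc_counts = count_vect.fit_transform(docs)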

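# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0 and later.
word_counts = zip(count_vect.get_feature_names_out(), np.asarray(doc_counts.sum(axis=0)).ravel())
# Sort by count, highest first.
word_counts = sorted(word_counts, key=lambda pair: -pair[1])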
 

# Display top 100 words by frequency
word_counts[:100]
