During the Google Summer of Code program of this year (2017), I am working on a project about topic modelling. The intention of this project is to identify the topics of each news article so that they can be grouped for further researches or analysis. I write this post to share some thoughts of mine and briefly discuss the program I implemented. I have also created some slides for this discussion at Prezi.
I would like to divide my implementation into four steps, tokenize data, remove stop-words, apply machine learning algorithms and tune parameters.
The data I am using is fetched from the MediaCloud database; each datum consists of two attributes: sentence, and ariticle_id. Attribute sentence contains a plain text string of one sentence from the article, while article_id represents to which article this sentence belongs.
In general, tokenization means breaking sentences up into words. Simple as this may sound, there are always some things we can do better, for example, how to handle verbs in various tense/voice (e.g., study, studying and studied) and nouns in singular/plural forms (e.g., datum and data). This is where we need to introduce lemmatization.
I have used the lemmatizer from nltk to solve this issue. In particular, it uses the WordNet which contains lexical data of English that can help to find conceptual relationships between words such as antonyms, hyponyms, hypernyms, and synonyms.
After this step, all sentences should be divided into words with uniformed forms.
Blindly passing all tokens into machine learning algorithm has been proven to be inefficient and inaccurate. A useful preprocessing step should at least contain stop-words removal.
In this project, there are three kinds of stop-words to be removed, meaningless words, meaningless topics and document-frequent words. Meaningless words refer to the words such as ‘to’, ‘for’, and others similar ones. They do not have real meanings hence including them only wastes computation powers. Meaningless topics are analogous to the meaningless words, they may have semantic meanings, but they do not give any useful information as topics (e.g., we do not want to label an article with the topic ‘Mr.’). Document-frequent words refer to words frequently occur in all articles. We prefer to filter them out because they contradict to the idea of topics (i.e., what makes this article different from the others).
Based on the definitions above, there are two ways for us to eliminate stop-words, TF-IDF, and stop-words list. TF-IDF stands for term frequency-inverse document frequency; it weights each word by dividing the number of its occurrence in the current file by its presence in the whole set. Stop-word list contains all the stop-words we want to remove. While the stop-word list may take time and effort to build, we can have more control on it. In this project, I used an existing stop-word list.
Applying machine learning algorithms
Now that all words are meaningful, it is time to feed them to the machine-learning algorithms. While there are numerous approaches, I used Latent Dirichlet Allocation (LDA) model and Non-negative Matrix Factorization (NMF) model. The LDA model posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. This assumption is also akin to the standard bag of words model in our project. The NMF model is less commonly used than the LDA model and has fewer parameters to be tuned. In theory, there are some diﬀerence in the assumptions of LDA and NMF, but here in this project, they have generated very similar topics, which is a good news showing the correctness of topics. Here I focused on the LDA algorithm.
After the model is built, we can further improve its accuracy by ﬁnding the most optimal value for each parameter. In particular, we can tune two parameters in our LDA model, alpha and the total number of topics.
Alpha denotes the sparsity of topics among stories. The smaller alpha, the more sparse the topic distribution. This is a bit like the long-tail theory, where most topics have about zero signiﬁcance while a limited number of topics are shared by most stories.
In our project, I focused work on the second parameter, the total number of topics (I will call it topic_num for short).
The simplest way to ﬁnd optimal topic_num would be the brute-force, where we can start with topic_num equals to 1 and then 2, 3 and so on. Although this guarantees us with the correctness of the result, it simply takes too long and wastes tons of computation power.
An alternative way is using recursion, which is a bit like binary search. We start from extreme values and gradually shrink the size until we ﬁnd the optimal value. Again this also requires a relatively large computation power.
The method I use is via solving polynomial equations. After testing the model on various sizes of data set, I found that the likelihood often ﬁrst increases and then decreases with increasing topic_num. Based on this observation, I further assumed the polynomial relationship between likelihood and topic_num.
This allows me to build the polynomial equation based on only a few pre-computed points and identify the maximum value straight away.
The result shows this method exhibits a relatively good performance with significantly lowered costs in time and computation power.
This model also ﬁts well in an increasing data set. For x new stories, instead of re-tune the model, we can simply assume topic_num will also increase by x +/- e, where e is a small value.
Although the project is finished, there are some improvements to be made. Firstly, we can achieve a better lemmatization by giving the exact position of each word in the sentence. Secondly, we may further investigate the relationship between likelihood and topic_num to give a more accurate prediction. Thirdly, we can hack into the API of LDA model and try to reduce the number of iterations while training.