What I love about the Google Summer of Code program

The intention of this post is to share the wonderful experience I had in the Google Summer of Code (GSoC) program with other students whose backgrounds are similar to mine. By doing so, I sincerely hope more students will learn about and take part in this fantastic opportunity in the near future. I have faith that both students and host organisations will benefit a lot from it.

 

What is Google Summer of Code?

GSoC is a global program that offers students an opportunity to contribute their knowledge and coding skills to a list of open-source software and technology-related organisations.

I had the honour of joining this program together with 1,317 other students this year (2017). Specifically, I chose, and was selected by, the Berkman Klein Center for Internet and Society at Harvard University to implement topic creation using machine learning in the MediaCloud project. I “fell in love” with this project the moment I saw it; it was not only my first preference but also the only project I applied for in GSoC. That said, I would suggest that applicants apply for multiple projects to improve their chances of securing a position in GSoC.

 

Why do I love it?

The best part of this program is that each student is guided by an excellent mentor from the host organisation through the entire project. It is my first time writing code for a large open-source project, and my mentor, Linas Valiukas, has helped me a lot in adapting through efficient communication. Every week, I send my progress report to him via email, and he replies with valuable advice and suggestions. For every pull request, he points out things I have done well as well as what I can do to improve. He is also always ready to answer my questions on Slack via quick messages.

The second thing I love is that the content of a GSoC project can actually help me improve my skills. My task in GSoC is to develop a new feature in an existing project, and I have sharpened my coding skills in two ways. First, I learn by reading the existing code of the project, which shows me how a large open-source project should be organised, constructed and contributed to by multiple people methodically. By carefully reading through the coding standards, I learnt the best way to guarantee readability and understandability in such a project. Moreover, through frequent PRs and reviews, I get to know things like the most appropriate code structure, test cases and other important details that are often neglected in university projects but crucial in real-life ones. Understanding this shows me how to further prepare myself for work in industry.

Furthermore, I love GSoC because of how thoughtfully the program is designed for students from diverse backgrounds. Given that GSoC is a global program with participants from all over the world, it is inevitable that students from different countries and regions cannot all work to the same schedule. For example, for a student in Australia like me, the summer holiday is not as long as it is for students in the United States, so I had not finished my final exams by the start date of GSoC. I was pleasantly surprised to learn that Google has thought about this and emphasises that the program should give way to students’ academic duties. My mentor and host organisation also kindly gave me a week to prepare for my exams. As a result, I managed to obtain high marks at university while working on this project at the same time.

 

Last but not least, the remote meetings organised by my host organisation (Berkman Klein Center for Internet and Society at Harvard University) at the end of each stage gave me a chance not only to demonstrate my work to my peers but also to hear what they have achieved. I find this interesting as well as encouraging.

 

How to apply?

All university students, from bachelor’s to doctoral level, are more than welcome to join. Google announces the timetable in February every year, so please keep an eye out for it. After that, you can submit proposals for at most five projects among thousands of options. In each proposal, you can present and elaborate on your ideas, plans and time schedule for that project. If your proposal is selected by the host organisation, then you are ready to start your own unique journey with GSoC! Have a lovely trip!

 

 

Some thoughts about topic modelling

During the Google Summer of Code program this year (2017), I have been working on a project about topic modelling. The intention of this project is to identify the topics of each news article so that articles can be grouped for further research or analysis. I am writing this post to share some of my thoughts and briefly discuss the program I implemented. I have also created some slides for this discussion at Prezi.

I would like to divide my implementation into four steps: tokenizing data, removing stop-words, applying machine learning algorithms and tuning parameters.

Tokenizing data

The data I am using is fetched from the MediaCloud database; each datum consists of two attributes: sentence and article_id. The sentence attribute contains a plain text string of one sentence from the article, while article_id indicates which article the sentence belongs to.

In general, tokenization means breaking sentences up into words. Simple as this may sound, there are always some things we can do better, for example, handling verbs in various tenses and forms (e.g., study, studying and studied) and nouns in singular and plural forms (e.g., datum and data). This is where we need to introduce lemmatization.

I have used the lemmatizer from nltk to solve this issue. In particular, it uses WordNet, a lexical database of English that helps to find conceptual relationships between words such as antonyms, hyponyms, hypernyms, and synonyms.

After this step, all sentences should be divided into words in a uniform form.
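To make this step concrete, here is a minimal sketch of tokenizing and lemmatizing rows shaped like the data described above. It is an illustration only: the LEMMAS table is a hypothetical stand-in for the WordNet-backed lemmatizer from nltk, and the sample rows and function names are made up rather than taken from the project.

```python
import re

# Hypothetical stand-in for a WordNet-backed lemmatizer: a tiny lookup table
# mapping inflected forms to a base form. The project itself used nltk.
LEMMAS = {
    "studying": "study",
    "studied": "study",
    "studies": "study",
    "data": "datum",
}


def tokenize(sentence):
    """Lower-case a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def lemmatize(tokens):
    """Replace each token with its base form when the table knows one."""
    return [LEMMAS.get(token, token) for token in tokens]


# Made-up rows with the two attributes: sentence and article_id.
rows = [
    {"sentence": "She studied the data.", "article_id": 1},
    {"sentence": "He is studying new data.", "article_id": 1},
]

# Collect the lemmatized tokens of every sentence under its article.
tokens_by_article = {}
for row in rows:
    words = lemmatize(tokenize(row["sentence"]))
    tokens_by_article.setdefault(row["article_id"], []).extend(words)
```

Grouping by article_id at the end reflects that topics are assigned per article, even though the data arrives one sentence at a time.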

Removing stop-words

Blindly passing all tokens into a machine learning algorithm is inefficient and tends to give inaccurate results. A useful preprocessing step should at least include stop-word removal.

In this project, there are three kinds of stop-words to be removed: meaningless words, meaningless topics and document-frequent words. Meaningless words are words such as ‘to’, ‘for’, and similar ones. They carry no real meaning, so including them only wastes computation power. Meaningless topics are analogous to meaningless words: they may have semantic meaning, but they do not give any useful information as topics (e.g., we do not want to label an article with the topic ‘Mr.’). Document-frequent words are words that occur frequently in all articles. We prefer to filter them out because they contradict the idea of a topic (i.e., what makes this article different from the others).

Based on the definitions above, there are two ways for us to eliminate stop-words: TF-IDF and a stop-word list. TF-IDF stands for term frequency-inverse document frequency; it weights each word by how often it occurs in the current document, scaled down by how many documents in the whole set contain it. A stop-word list contains all the stop-words we want to remove. While a stop-word list may take time and effort to build, it gives us more control. In this project, I used an existing stop-word list.
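The two approaches can be sketched side by side. This is a simplified illustration under my own assumptions: STOP_WORDS is a tiny hypothetical list (the project used an existing, much longer one), the documents are made up, and the weighting shown is one common TF-IDF variant (relative term frequency times the logarithm of the inverse document frequency).

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"to", "for", "the", "mr"}


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    weight = (count of word in document / document length)
             * log(number of documents / number of documents containing word)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Made-up, already tokenized documents.
docs = [
    ["mr", "smith", "said", "election", "for", "senate"],
    ["storm", "said", "to", "hit", "the", "coast"],
    ["election", "results", "said", "to", "surprise"],
]
filtered = [remove_stop_words(doc) for doc in docs]
weights = tf_idf(filtered)
```

In this toy set the word "said" appears in every document, so its weight is zero: it is exactly the kind of document-frequent word that says nothing about what makes one article different from the others.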

Applying machine learning algorithms

Now that all remaining words are meaningful, it is time to feed them to the machine learning algorithms. While there are numerous approaches, I used the Latent Dirichlet Allocation (LDA) model and the Non-negative Matrix Factorization (NMF) model. The LDA model posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. This assumption is also akin to the standard bag-of-words model used in our project. The NMF model is less commonly used than the LDA model and has fewer parameters to tune. In theory, there are some differences between the assumptions of LDA and NMF, but in this project they generated very similar topics, which is encouraging evidence that the topics are sensible. Here I focused on the LDA algorithm.
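The assumption behind LDA is easier to see when it is run "forwards". The sketch below does not train a model; it only generates a document the way LDA assumes documents come about. The topics, their word probabilities and the function name are all hypothetical.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "politics": {"election": 0.5, "senate": 0.3, "vote": 0.2},
    "weather": {"storm": 0.6, "rain": 0.3, "coast": 0.1},
}


def generate_document(topic_mixture, length, rng):
    """Generate words following the generative story that LDA assumes.

    For every word position: pick a topic from the document's topic
    mixture, then pick a word from that topic's word distribution.
    """
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocabulary = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocabulary]
        words.append(rng.choices(vocabulary, weights=word_weights)[0])
    return words


rng = random.Random(0)
# A document that is mostly about politics with a little weather mixed in.
document = generate_document({"politics": 0.8, "weather": 0.2}, 10, rng)
```

Training an LDA model goes the other way round: given only the words of many documents, it tries to recover topic mixtures and word distributions that would make those documents likely. Word order plays no role in either direction, which is why the model sits well with a bag-of-words representation.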

Tuning parameters

After the model is built, we can further improve its accuracy by finding a good value for each parameter. In particular, we can tune two parameters in our LDA model: alpha and the total number of topics.

Alpha controls the sparsity of topics among stories. The smaller alpha is, the sparser the topic distribution. This is a bit like the long-tail theory, where most topics have about zero significance while a limited number of topics are shared by most stories.

In our project, I focused on the second parameter, the total number of topics (I will call it topic_num for short).

The simplest way to find the optimal topic_num would be brute force, where we start with topic_num equal to 1, then try 2, 3 and so on. Although this guarantees the correctness of the result, it simply takes too long and wastes a great deal of computation power.
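The effect of alpha can be illustrated without training anything, assuming alpha is used as the parameter of a symmetric Dirichlet prior over each story's topic distribution (the usual set-up in LDA). The sketch below draws topic distributions for a small and a large alpha and compares how much weight the single largest topic receives on average. The function names, the number of topics and the sample size are arbitrary choices of mine.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha).

    Normalising independent Gamma(alpha, 1) draws gives a Dirichlet sample.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [draw / total for draw in draws]


def average_top_share(alpha, num_topics=10, samples=500, seed=0):
    """Average weight of the single largest topic over many draws."""
    rng = random.Random(seed)
    return sum(
        max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(samples)
    ) / samples


sparse = average_top_share(alpha=0.1)   # small alpha: one topic tends to dominate
spread = average_top_share(alpha=10.0)  # large alpha: weights are more even
```

With the small alpha, the largest topic takes up a much bigger share of each sampled distribution than with the large alpha, which is what "sparser" means here.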

An alternative is to use recursion, which is a bit like binary search. We start from the extreme values and gradually shrink the range until we find the optimal value. This approach also requires a relatively large amount of computation.

The method I use is to fit a polynomial. After testing the model on data sets of various sizes, I found that the likelihood often first increases and then decreases as topic_num increases. Based on this observation, I further assumed a polynomial relationship between likelihood and topic_num.
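Both search strategies can be sketched as follows. The toy_likelihood function is a hypothetical stand-in for the expensive step of training a model and measuring its likelihood, and shrink_search is my own ternary-search-style reading of "shrinking the range": it assumes the score has a single peak inside the range and is not necessarily the exact recursion used in the project.

```python
def toy_likelihood(topic_num):
    """Hypothetical stand-in for training and scoring a model; peaks at 37."""
    return -(topic_num - 37) ** 2


def brute_force(score, low, high):
    """Score every candidate in [low, high] and keep the best one."""
    return max(range(low, high + 1), key=score)


def shrink_search(score, low, high):
    """Recursively narrow [low, high], assuming a single peak in the range."""
    if high - low < 3:
        return max(range(low, high + 1), key=score)
    third = (high - low) // 3
    left, right = low + third, high - third
    if score(left) < score(right):
        # The peak must lie to the right of `left`.
        return shrink_search(score, left + 1, high)
    # Otherwise the peak must lie to the left of `right`.
    return shrink_search(score, low, right - 1)


# Record which candidates the shrinking search actually evaluates.
calls = []


def counted_likelihood(topic_num):
    calls.append(topic_num)
    return toy_likelihood(topic_num)


best_brute = brute_force(toy_likelihood, 1, 200)
best_shrink = shrink_search(counted_likelihood, 1, 200)
```

Brute force scores all 200 candidates, while the shrinking search scores only a small fraction of them. Since every evaluation means training a model, even that smaller number is costly.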

This allows me to build the polynomial equation based on only a few pre-computed points and identify the maximum value straight away.
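As a simplified illustration of this idea, the sketch below assumes the polynomial is a parabola, fits it exactly through three pre-computed points, and reads the maximum off the vertex. The degree, the function names and the sample points are my own assumptions for the example; the likelihood values are made up and are not results from the project.

```python
def fit_parabola(points):
    """Return (a, b, c) of y = a*x^2 + b*x + c through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = points
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3 ** 2 * (y1 - y2) + x2 ** 2 * (y3 - y1) + x1 ** 2 * (y2 - y3)) / denom
    c = (x2 * x3 * (x2 - x3) * y1
         + x3 * x1 * (x3 - x1) * y2
         + x1 * x2 * (x1 - x2) * y3) / denom
    return a, b, c


def estimate_best_topic_num(points):
    """Estimate the topic_num that maximises the fitted parabola."""
    a, b, _ = fit_parabola(points)
    if a >= 0:
        raise ValueError("points do not bend downwards; no maximum to find")
    return round(-b / (2 * a))  # x-coordinate of the vertex


# Made-up (topic_num, likelihood) pairs standing in for three trained models.
measured = [(10, -724.0), (50, -164.0), (90, -2804.0)]
best_topic_num = estimate_best_topic_num(measured)
```

Only three models need to be trained here, compared with one per candidate for brute force. The price is that the estimate is only as good as the assumed shape of the curve.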

The results show that this method performs relatively well at a significantly lower cost in time and computation power.

This approach also suits a growing data set. For x new stories, instead of re-tuning the model, we can simply assume topic_num will also increase by x +/- e, where e is a small value.

Future work

Although the project is finished, there are still some improvements to be made. Firstly, we can achieve better lemmatization by giving the exact position of each word in the sentence. Secondly, we may further investigate the relationship between likelihood and topic_num to give a more accurate prediction. Thirdly, we can dig into the API of the LDA model and try to reduce the number of iterations during training.