Some thoughts on GSoC 2017 and LibraryBox

This summer, I was part of Google Summer of Code, working with Jason Griffey on his Berkman Klein project, LibraryBox.

I wrote on Medium about parts of the project, but I wanted to reflect here on some of the things that I learned.

First, I’d like to say that this summer was a blast. The end was a bit hectic, since I’d gotten right into the thick of pinning everything down when I went back to school at the University of New Mexico (shoutout to my Burqueño peeps!) and ran into professors who like to pack the first few weeks with content. When I started working on LibraryBox this summer, I was more or less familiar with how it is used. I’d played around with PirateBox and gotten it working on a laptop previously, having gone to my fair share of DEF CON events where I saw them pop up for ad hoc file sharing. There’s a lot of great stuff that LibraryBox and PirateBox do that I wanted to help make better.

Lesson 1: Real, working hardware and such is the greatest debugging platform.

When I picked up porting it to the Raspberry Pi, I ran into a lot of tiny issues here and there. As a note to future people doing things like this: be ready to do some footwork when moving from an embedded platform like OpenWRT to a platform like the Pi. There are subtle differences, and I can’t stress enough that until I was about halfway through the summer, I really didn’t know what I was doing. I learned that getting things running on real hardware as fast as possible is your best bet to make sure your life is going to be just fine. Once I got my software running on the Pi, I had a lot more fun moving the project forward.

Lesson 2: A day you don’t code isn’t a day wasted

It’s easy to think that a day you write zero lines of code is a day you’ve wasted. The reality is that this is a terrible way to look at it, and it will only drive you to the edge of burnout before you realize what you’re doing. Take days to think, work on something different, or even not work at all. All-nighters are not worth it unless you have a good reason, and the more I tried to force myself to work, the less I got done.

The reality of the matter is that taking a day to work out notes and lay out a design on paper can be your most productive day. I did a lot of “go to Starbucks and sketch out the problem I’m having,” and as a result my notebook is filled with pages like this:

Those notes were a few hours of thinking, sketching out ideas and getting things down on paper. Some notes I took while on a plane, some I took at dinner, and other times notes were the byproduct of someone mentioning something.

Lesson 3: You have people around you who can solve your problems.

Software dev isn’t done in a vacuum. It’s tempting to hunker down, hide away, and not talk to people, but there’s a lot your friends and colleagues know that you don’t know they know. I learned that one friend of mine was dealing with some of the same container problems I was facing, but in a slightly different context. This helped me find the source of one of my woes and pin down what I really needed to solve. I had been burning for a week trying to figure out an obscure problem with systemd and how to make chroots work; then a friend mentioned systemd-nspawn, the skies opened up, and this happened:

true story.

That was a good week of me being unhappy with my work, getting progressively more and more frustrated. Ten minutes of conversation solved a week of me hiding from people and just beating at the problem. It was bad enough that I hadn’t bothered to really mention it to my partner, who was a sounding board for so many other things this summer.

Wrapping up

These were my experiences. My mentor helped me get back on track when I had been wavering and helped set up a schedule, and my partner really kept me in line. We build software not in isolation but by looking to others for help. I hope to be back next year for more of this and to get to know more people as I refine my skills! This summer had challenges, things that I didn’t expect. Did I hit every goal? Not by a long shot. But I hit the important goals, getting somewhere that felt like more than just a proof of concept and more like something that can be actively demoed soon.

SwellRT/Wave E2E Encryption: Overview

The code can be downloaded from this git branch (compare changes).

Synopsis

Apache Wave is a software framework for online real-time collaborative editing. Similarly to Google Docs and Etherpad, it uses Operational Transformations to manage user collaboration.

During this Google Summer of Code we have provided end-to-end encryption for wave documents. This means that only the people who know a particular key have access to a document and can edit and retrieve its contents, protecting in that way the privacy of Wave users.

We have based our work on this awesome paper, which explains how some researchers encrypted Google Docs’ Operational Transformations. We took their ideas and adapted them to Apache Wave’s architecture.

Produced work

To summarize the work we have produced, we have recorded this video:

To encrypt the messages we have used the AES-GCM algorithm from the WebCrypto API. We have used JsInterop bindings to call it from our Java classes.
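For illustration only, here is a rough sketch of the encrypt/decrypt round-trip that AES-GCM gives us, written in Python with the cryptography package rather than the WebCrypto/JsInterop calls the project actually uses; all names and parameters below are illustrative:

    # Illustrative AES-GCM round-trip (Python 'cryptography' package), analogous
    # to what the project does through WebCrypto's AES-GCM in the browser.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=128)   # the symmetric wave key
    aesgcm = AESGCM(key)

    nonce = os.urandom(12)                      # fresh nonce for every message
    plaintext = b"an encrypted wave operation"
    ciphertext = aesgcm.encrypt(nonce, plaintext, None)

    # Decryption fails loudly if the ciphertext was tampered with (integrity).
    assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext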

Messages are properly encrypted and decrypted when they are sent and received by the clients. The text of a document is also properly recovered from the server’s snapshot. Everything seems to run smoothly, except for some annoying bugs that appear sporadically, and a serious user interface bug that prevents users who did not create the wave from decrypting its snapshot. My mentor and I think that we can fix them quickly, just after the program has ended.

How to use it

Running our modified version of Wave does not require any additional configuration; just use Gradle commands as usual. To compile the code and run the server, use:

$ ./gradlew run

Then open the URL http://localhost:9898/ in any browser. Once registered and logged in, use the “New Encrypted Wave” button to create a new encrypted wave.

Encrypted Wave button

In its URL you can see that the new wave’s identifier starts with “ew+” instead of “w+”, as is usual for regular waves. Also, a symmetric cryptographic key is appended after the wave identifier, separated by an exclamation mark (!).

Encrypted Wave URL

The user must preserve that URL (or at least the key part) in order to open the
wave again in the future.

Future work

AES-GCM assures both confidentiality and integrity for the messages written by legitimate users, but an attacker who has control over the server can still do a lot of harm:

  • Only the text of a document is encrypted, but not other parts like the content
    of its hyperlinks, for example. We should extend the encryption beyond the
    inserted characters.
  • The authentication could also be extended to all components, not only text
    ones. Also, as the paper states, the history of a document should also be
    authenticated (see appendix A.2).
  • It is unlikely that we can hide the structure and format of the document from
    the server, but we may be able to hide some more information, like users’
    typing traits.

On the other hand, it is not convenient to have users handling symmetric keys by themselves. Keys should be encrypted and stored in the server as user data. To do so, we should derive a key from the user’s password using PBKDF2 (available in the WebCrypto API) and use it to encrypt all the keys a user generates or registers for her waves.
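As a rough sketch of this idea (again in Python rather than the WebCrypto API, with made-up names and parameters), a password-derived key could wrap each per-wave key before it is stored server-side:

    # Hypothetical key-wrapping sketch: PBKDF2 derives a key-encryption key
    # from the password, which then AES-GCM-encrypts the per-wave key.
    import os, hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def wrap_wave_key(password: bytes, wave_key: bytes) -> dict:
        salt = os.urandom(16)
        kek = hashlib.pbkdf2_hmac('sha256', password, salt, 100_000)
        nonce = os.urandom(12)
        return {'salt': salt, 'nonce': nonce,
                'wrapped': AESGCM(kek).encrypt(nonce, wave_key, None)}

    def unwrap_wave_key(password: bytes, record: dict) -> bytes:
        kek = hashlib.pbkdf2_hmac('sha256', password, record['salt'], 100_000)
        return AESGCM(kek).decrypt(record['nonce'], record['wrapped'], None)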

Users could use public-key cryptography in order to be able to invite each other to edit a wave document. This feature was part of the original plan of work for this summer, but we did not have enough time to develop it.

What do I love about the Google Summer of Code program?

The intention of this post is to share the wonderful experience I had in the Google Summer of Code (GSoC) program with other students who have backgrounds similar to mine. By doing so, I sincerely hope more students can learn about and participate in this fantastic opportunity in the near future. I have faith that both students and host organisations will benefit a lot from it.


What is Google Summer of Code?

GSoC is a global program that offers students an opportunity to contribute their knowledge and coding skills to a range of open-source software and technology-related organisations.

I had the honour of joining this program together with 1,317 other students this year (2017). Specifically, I chose and was selected by the Berkman Klein Center for Internet and Society at Harvard University to implement topic creation using machine learning in the MediaCloud project. I “fell in love” with this project the moment I saw it; it was not only my first preference but also the only project I applied to in GSoC. With that being said, I would suggest applicants apply for multiple projects to secure a position in GSoC.


Why do I love it?

The best part of this program is that each student is guided by an excellent mentor from the host organisation through the entire project. It was my first time writing code for a large open-source program, and my mentor, Linas Valiukas, helped me a lot in this adaptation process through efficient communication. Every week, I sent my progress report to him via email, and he replied with valuable advice and suggestions. For every pull request, he pointed out things I had done well as well as what I could do to improve. He was also always ready to answer my questions on Slack via quick messages.

The second thing I love is the content of the GSoC project itself, which genuinely helped me improve my skills. My task in GSoC was to develop a new feature in an existing project, and I sharpened my coding skills in two ways. First, I could learn by reading the project’s existing code, which showed me how a large open-source project should be organised, constructed and contributed to by multiple people methodically. By carefully reading through the coding standards, I learnt the best ways to guarantee readability and understandability in such a project. Moreover, through frequent pull requests and reviews, I got to know things like the most appropriate code structure, test cases and other important details that are often neglected in university projects but crucial in real-life ones. Understanding this showed me a way to further prepare myself for work in the industry.

Furthermore, I love GSoC because of how thoughtfully this program is designed for a diverse body of students. Given that GSoC is a global program with participants from all over the world, it is inevitable that students from different countries and regions might not be able to work on the same schedule. For example, as a student in Australia, my summer holiday is not as long as that of students in the United States, so I had not finished my final exams before the GSoC start date. I was pleasantly surprised to learn that Google has thought about this and emphasised that the program should give way to students’ academic duties at university. My mentor and host organisation also kindly gave me a week to prepare for my exams. As a result, I managed to obtain high marks at university and work on this project at the same time.


Last but not least, the remote meetings organised by my host organisation (Berkman Klein Center for Internet and Society at Harvard University) at the end of each stage provided me with a chance not only to demonstrate my work to my peers but also to hear what they had accomplished. I found this interesting as well as encouraging.


How to apply?

All university students, from bachelor’s to doctoral degrees, are more than welcome to join. Google announces the timetable in February every year, so please keep an eye on it here. After that, you can submit proposals for at most five projects among thousands of options. In each proposal, you can present and elaborate on your ideas, plans and time schedule for that project. If your proposal is selected by the host organisation, then you are ready to start your own unique journey with GSoC! Have a lovely trip!


Some thoughts about topic modelling

During this year’s Google Summer of Code program (2017), I am working on a project about topic modelling. The intention of the project is to identify the topics of each news article so that articles can be grouped for further research or analysis. I am writing this post to share some of my thoughts and briefly discuss the program I implemented. I have also created some slides for this discussion at Prezi.

I would like to divide my implementation into four steps: tokenizing the data, removing stop-words, applying machine learning algorithms, and tuning parameters.

Tokenizing data

The data I am using is fetched from the MediaCloud database; each datum consists of two attributes: sentence and article_id. The sentence attribute contains the plain text of one sentence from an article, while article_id indicates which article the sentence belongs to.

In general, tokenization means breaking sentences up into words. Simple as this may sound, there are always some things we can do better, for example, handling verbs in various tenses and voices (e.g., study, studying and studied) and nouns in singular/plural forms (e.g., datum and data). This is where we need to introduce lemmatization.

I have used the lemmatizer from nltk to solve this issue. In particular, it uses WordNet, a lexical database of English that helps find conceptual relationships between words such as antonyms, hyponyms, hypernyms, and synonyms.

After this step, all sentences should be divided into words with uniform forms.
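As a minimal sketch of this step (assuming nltk and its tokenizer and WordNet data are installed; this is illustrative, not the exact project code):

    # Illustrative tokenize-and-lemmatize step using nltk's WordNet lemmatizer.
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def tokenize_sentence(sentence):
        tokens = word_tokenize(sentence.lower())
        # Lemmatize as a noun (the default) and then as a verb, so plural nouns
        # ("studies" -> "study") and inflected verbs ("studying" -> "study")
        # collapse to a common base form.
        return [lemmatizer.lemmatize(lemmatizer.lemmatize(tok), pos='v')
                for tok in tokens if tok.isalpha()]

    print(tokenize_sentence("Researchers were studying the earlier studies."))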

Removing stop-words

Blindly passing all tokens into a machine learning algorithm has proven to be inefficient and inaccurate. A useful preprocessing step should at least include stop-word removal.

In this project, there are three kinds of stop-words to be removed: meaningless words, meaningless topics and document-frequent words. Meaningless words are words such as ‘to’, ‘for’, and other similar ones. They do not carry real meaning, hence including them only wastes computation power. Meaningless topics are analogous to meaningless words; they may have semantic meaning, but they do not give any useful information as topics (e.g., we do not want to label an article with the topic ‘Mr.’). Document-frequent words are words that occur frequently across all articles. We prefer to filter them out because they contradict the idea of a topic (i.e., what makes this article different from the others).

Based on the definitions above, there are two ways for us to eliminate stop-words: TF-IDF and a stop-word list. TF-IDF stands for term frequency-inverse document frequency; it weights each word by its frequency in the current document, scaled down by how many documents in the whole collection contain it. A stop-word list simply contains all the stop-words we want to remove. While a stop-word list may take time and effort to build, we have more control over it. In this project, I used an existing stop-word list.
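Both strategies can be sketched with scikit-learn (the tiny corpus and the parameter values below are made up for illustration, not the project’s actual settings):

    # Illustrative stop-word handling: an explicit stop-word list plus a
    # document-frequency cutoff, via scikit-learn's TF-IDF vectorizer.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "the senate debated the new budget bill",
        "the budget bill cuts education funding",
        "the home team won the championship game",
    ]

    vectorizer = TfidfVectorizer(
        stop_words='english',  # drop meaningless words ('to', 'for', ...)
        max_df=0.9,            # drop words present in >90% of documents
    )
    tfidf = vectorizer.fit_transform(documents)
    print(sorted(vectorizer.vocabulary_))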

Applying machine learning algorithms

Now that all words are meaningful, it is time to feed them to the machine learning algorithms. While there are numerous approaches, I used the Latent Dirichlet Allocation (LDA) model and the Non-negative Matrix Factorization (NMF) model. The LDA model posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. This assumption fits the standard bag-of-words representation used in our project. The NMF model is less commonly used than LDA and has fewer parameters to tune. In theory, there are some differences in the assumptions behind LDA and NMF, but in this project they generated very similar topics, which is good news suggesting the topics are sound. Here I focused on the LDA algorithm.
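As an illustrative sketch of this step with scikit-learn (the tiny corpus and the number of topics are made up; they are not the project’s real data or settings):

    # Illustrative LDA and NMF topic models fitted with scikit-learn.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation, NMF

    documents = [
        "the senate debated the new budget bill",
        "the budget bill cuts education funding",
        "the home team won the championship game",
        "fans celebrated the championship win downtown",
    ]

    # LDA works on raw term counts; NMF is usually fed TF-IDF weights.
    count_vec = CountVectorizer(stop_words='english')
    counts = count_vec.fit_transform(documents)
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    nmf = NMF(n_components=2, random_state=0).fit(tfidf)

    # Print the highest-weighted words of each LDA topic.
    vocab = np.array(sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get))
    for topic in lda.components_:
        print(vocab[topic.argsort()[-4:]])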

Tuning parameters

After the model is built, we can further improve its accuracy by finding good values for its parameters. In particular, we can tune two parameters in our LDA model: alpha and the total number of topics.

Alpha controls the sparsity of topics among stories: the smaller the alpha, the sparser the topic distribution. This is a bit like the long-tail theory, where most topics have close to zero significance while a limited number of topics are shared by most stories.

In our project, I focused my work on the second parameter, the total number of topics (I will call it topic_num for short).

The simplest way to find the optimal topic_num would be brute force, where we start with topic_num equal to 1, then 2, 3 and so on. Although this guarantees the correctness of the result, it simply takes too long and wastes an enormous amount of computation power.

An alternative is a recursive search, which is a bit like binary search: we start from extreme values and gradually shrink the interval until we find the optimal value. Again, this requires relatively large computation power.

The method I use instead is to fit and solve a polynomial equation. After testing the model on data sets of various sizes, I found that the likelihood often first increases and then decreases as topic_num grows. Based on this observation, I assumed a polynomial relationship between likelihood and topic_num.

This allows me to build the polynomial equation based on only a few pre-computed points and identify the maximum value straight away.
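A rough sketch of the idea with numpy (the likelihood values below are made up for illustration; in the real pipeline they would come from evaluating the LDA model at each sampled topic_num):

    # Illustrative polynomial fit over a few (topic_num, likelihood) samples.
    import numpy as np

    topic_nums = np.array([2, 5, 10, 20, 40])
    likelihoods = np.array([-9.1e5, -8.4e5, -8.1e5, -8.3e5, -9.0e5])  # made up

    # Fit a quadratic through the sampled points.
    poly = np.poly1d(np.polyfit(topic_nums, likelihoods, deg=2))

    # The maximum of the fitted curve estimates the best topic_num.
    candidates = np.arange(topic_nums.min(), topic_nums.max() + 1)
    best = candidates[np.argmax(poly(candidates))]
    print("estimated optimal topic_num:", best)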

The results show that this method performs relatively well while significantly lowering the costs in time and computation power.

This approach also fits well with a growing data set. For x new stories, instead of re-tuning the model, we can simply assume topic_num will also increase by x +/- e, where e is a small value.

Future work

Although the project is finished, there are still some improvements to be made. Firstly, we can achieve better lemmatization by providing the exact position of each word in the sentence. Secondly, we may further investigate the relationship between likelihood and topic_num to give a more accurate prediction. Thirdly, we can dig into the LDA model’s API and try to reduce the number of iterations during training.