Google Summer of Code 2017 @ Teem

Hi ! This blog post is a summary of the awesome experience that I had throughout the Google Summer of Code 2017 working with the Project Teem under Berkman Klein Center of Internet and Security, Harvard University. I would like to thank Berkman Klein Center for selecting me for the project Teem for this summer!

While writing this post I was thinking, how fast did we reach the end of Google Summer of Code? Really, time just flies when you start enjoying something!

Introduction to the project

Teem is a web application meant for collaborative development through commons based peer production. I liked the idea that Teem is a decentralised application, meaning that you can host on your own! As I was reading a bit about decentralised technologies I found an interest in contributing towards Teem!

I started exploring the suggested ideas page given by Berkman and there was one line written over there that caught my attention and there was one line that caught my attention, it was:

Internet applications do not exist as independent islands but rather as a part of a live ecosystem where everything is connected.

I started darfting my proposal keeping in mind the above written line and hence came up with the proposal containing three major goals to be covered over the entire summer. These 3 major goals were:

Link previews in Teem

We all like to see how facebook or whatsapp show us a small contextual link preview of the URL’s we paste in the those websites. Why not add this to Teem also, after all our goal is to have a application that is an all round place for collaborative development. We don’t want this place to be lagging from other social web applications!

Better sharing of Teem pages

Since Teem entirely depends on javascript for rendering all the data, we need to take care in some cases where data has to be fetched by a web crawler. Since a web crawler will not allow Teem to execute javascript, it will not be able to render any useful data. Thus, in order to have better sharing on social networking sites, I decided to add this as the next goal.

Integrating third-party apps

We want Teem to be a one stop collaborative development environment and in order to do so, we have to add integrations in Teem that allow people to do more! So I decided to work on to integrate Trello with Teem so that people can better track what work they have done!

With all this given in the proposal I was ready to start the summer.

GSoC 2018 Question tool – Report

I would like to thank my mentors Ellen Popko and Justin Clark for constantly supporting me, guiding me and reviewing my code during the GSoC project. This post is a summary of work done by me on Question Tool as a part of Google Summer of Code 2018 with Berkman Klein Center.

Introduction

Question Tool is a free, open source discussion tool which can be used either during a class/lecture or during an online discussion.

Gene Koo, Fellow at the Berkman Center, said, “It’s an effective way to keep feedback focused, direct speakers to audience interests, and potentially prevent the mic from being hijacked by that weirdo.”

The participants can add their instances to start a discussion or they can reply to or upvote a question which they want to get answered. It’s a great tool to be used for an online lecture where the online audience can post questions in the tool and the speaker can address them. The most important questions are sorted towards the top.

Project Description

GitHub Repository – https://github.com/berkmancenter/question_tool

During the summer I worked on solving bugs in the tool and adding some new features to it.

Question div UI

Changes were made to the question card’s UI. The upvote button was been given a separate div on the left side to prevent any confusion, flags were changed, share icons moved to a single option and ‘show x replies’ and ‘reply’ are now two similar buttons beside each other.

Editing Advanced features for admins and moderators

Each instance has its advanced features like maximum response length, social media sharing and so on which were decided when the instance was getting set up and remained the same throughout. I implemented an additional tab where the admins/mods can change the advanced options anytime.

Revert Upvote

Right now there is no option for users to cancel their upvotes. Once upvoted, on clicking again they get an error message saying ‘already upvoted’. This was solved by showing a pop up when the user clicks on an already upvoted question which acts as a confirmation for canceling the upvote by asking the user yes/no

Admin Interface bugs

Solved bugs present in the admin interface. Proper checks were implemented when adding a moderator. Earlier it was possible to add more than the maximum number (four) of moderators if the user presses the done button instead of plus. This and other minor errors in the admin interface were solved.

A Bug in the Presentation Mode

This was a small issue but an important one. This solves the error of hidden questions being visible in the presentation mode. Presentation mode is likely to be used in a projector screen so it was important to hide the hidden questions (visible to admins and moderators) once admin/mods enter to presentation mode.

Change Password

Added the ability for users to changes their passwords.

Future work

The PRs for some of the above tasks are still not merged. Firstly I’ll finish the unmerged changes. Then I plan to make changes based on the reviews I get from users using the new version of the tool. Apart from that, there are two major things which can be added to the tool in future.

Side div

Currently, when opening the ‘show reply’ box, the popover template covers the full screen and hides other questions but after this functionality, the user can easily view one question’s replies and all other questions simultaneously as shown in the screenshot below since the reply box will be moved to a separate div on the right side of the screen.

Increased functionality to Revert Upvote

After the addition of revert upvote functionality, there arises a single issue which needs to be solved to avoid any misuse of it. This is not a present problem but there can be some cases where users may try to change the ranking of questions through their ability to un-upvote. To prevent that we can add a time limit for the users to be able to cancel their upvotes which will serve the sole purpose of canceling accidental upvotes.

Conclusion

This summer has been a great learning experience for me. I have learned a lot from my mentors while working on the project.

I am very grateful to my mentors Ellen Popko and Justin Clark, and the entire Berkman Klein community for giving me this wonderful opportunity.

Thank you

Some thoughts on GSoC 2017 and LibraryBox

This summer, I was a part of Summer of Code working with Jason Griffey on his Berkman Klein project, LibraryBox.

I wrote on Medium about parts, but I wanted to reflect here on some of the things that I learned.

First, I’d like to say that this summer was a blast. The end was a bit hectic, since I’d gotten right into the thick of getting everything pinned down when I went back to school at the University of New Mexico (shoutout to my burqueno peeps!) and ran into professors that like packing the first few weeks with content. When I started working on LibraryBox this summer, I was more or less familiar with its usage. I’d played around with PirateBox and getting things working on a laptop previously, having gone to my fair share of DefCon events where I saw them pop up for adhoc file sharing. There’s a lot of great stuff that LibraryBox and PirateBox do that I wanted to help make better.

Lesson 1: Real, working hardware and such is the greatest debugging platform.

When I picked up porting it to the raspberry pi, I ran into a lot of little tiny issues here and there. As a note to future people doing things like this: Be ready to do some footwork when moving from an embedded platform like OpenWRT onto a platform like the rpi. There’s subtle differences and I can’t stress enough that until I was about halfway through the summer, I really didn’t know what I was doing. Learning that getting things running on real hardware as fast as possible is your best bet to make sure that your life is going to be just fine. Once I got my software running on the pi, I had a lot more fun getting the project moved forward.

Lesson 2: A day you don’t code isn’t a day wasted

It’s easy to think that a day you write zero lines of code is a day you’ve wasted. The reality is that this is a terrible way to handle the situation and will only drive you to the edge of burnout before you realize what you’re doing. Take days to think, work on something different, and even just not work at all. All-nighters are not worth it unless you have a good reason and the more I tried to force myself to work the less I got done.

The reality of the matter is that taking a day to work out notes and lay out design on paper can be the best day you get things done. I did a lot of “go to starbucks and sketch out the problem I’m having” and my notebook is as a result filled with pages like this:

Those notes were a few hours of thinking, sketching out ideas and getting things down on paper. Some notes I took while on a plane, some notes I took at dinner, other times notes were the byproduct of someone mentioning something.

Lesson 3: You have people around you who can solve your problems.

Software dev isn’t done in a vacuum. It’s tempting to hunker down and hide away and not talk to people, but there’s a lot of things that your friends and collegues know that you don’t know they know. I learned that one friend of mine was currently dealing with some of the same container problems I was facing, but in a slightly different context. This helped me find the source of one of my woes and hammer down what I really needed to solve. I had been burning for a week trying to figure out an obscure problem with systemd and how to make chroots work, then a friend mentioned systemd-nspawn and the skies opened up and this happened:

This was a good week of me being unhappy with my work, getting progressively more and more frustrated. Ten minutes of conversation solved a week of me hiding from people and just beating at the problem. It was bad enough that I hadn’t bothered to really mention it to my partner, who was a bouncing board for so many other things this summer.

Wrapping up

These were my experiences. My mentor helped me get on track when I had been wavering, helped set up a schedule and my partner really kept me in line. We build software not in isolation but by looking at others for help. I hope to be back next ear for more of this and get to know more people as I refine my skills! This summer had challenges, things that I didn’t expect. Did I hit every goal? Not by a longshot. But I hit the important goals, getting somewhere that I felt was more than just a proof of concept and more something that can be actively demoed soon.

SwellRT/Wave E2E Encryption: Overview

The code can be downloaded from this git branch (compare changes).

Synopsis

Apache Wave is a software framework for online real-time collaborative
edition. Similarly to Google Docs and Etherpad, it uses Operational
Transformations to manage user collaboration.

During this Google Summer of Code we have provided end-to-end encryption to wave
documents. This means that only the people who know a particular key, have
access to the documents and can edit and retreive the contents of a them,
protecting in that way the privacy of Wave users.

We have based our work on this awesome paper that explains how some
researchers encrypted Google Docs’ Operational Transformations. We have took
their ideas and adapted them to Apache Wave’s architecture.

Produced work

To sumarize the work we have produced, we have recorded this video:

To encrypt the messages we have used the algorithm AES-GCM from the WebCrypto
API. We have used JsInterop bindings to call it from our Java classes.

Messages are properly encrypted and decrypted when they are sent and received
by the clients. The texts of a documents are also properly recovered from the
server’s snapshot. Everything seems to run smoothly, except for some annoying
bugs that appear sparsely, and a serious user interface bug that prevents users
that did not created the wave to decrypt its snapshot. My mentor and me think
that we can fix them quickly, just after the program has ended.

How to use it

Running our modified version of Wave does not require any additional
configuration, just use Gradle commands as usual. To compile the code and
run the server use:

$ ./gradlew run

And open the url http://localhost:9898/ with any browser. Once registered and
logged in, use the “New Encrypted Wave” button to create a new encrypted wave.

Encrypted Wave button

In its URL you can see that the new wave’s identifier starts with “ew+” instead
of “w+”, as it is usual in common waves. Also, a symmetric cryptographic key is
attached, after the wave identifier, separated by an exclamation mark (!).

Encrypted Wave URL

The user must preserve that URL (or at least the key part) in order to open the
wave again in the future.

Future work

AES-GCM assures both confidentiality and integrity for the messages written by
the legitimate users, but an attacker who has the control over the server can
still do a lot of harm:

Only the text of a document is encrypted, but not other parts like the content
of its hiperlinks, for example. We should extend the encryption beyond the
inserted characters.
The authentication could also be extended to all the components, not only text
ones. Also, as the paper states that the history of a document should
also be authenticaded (see appendix A.2).
It is unlikely to hide the structure and format of the document to the server,
but we may be able to hide some more information, like user’s typing traits.

On the other hand, it is not convenient having users handling symmetric keys by
themselves. Keys should be encrypted and stored in the server as user data. To
do so, we should derive a key from the user’s password using pbkdf2 (available
in the WebCrypto API), to encrypt all the keys a user generates or registers
for her waves.

The users could use public key cryptograpy in order to being able to invite each
other to edit in a wave document. This feature were part of the original plan of
work for this Summer, but we have had not enough time to develop this part.

Relevant links

What do I love about the Google Summer of Code Program

The intention of this post is to share the wonderful experience I had in the Google Summer of Code (GSoC) program to other students who have similar backgrounds like me. By doing so, I sincerely hope more students can know and participate in this fantastic opportunity in the close future. I have faith that both students and host organisations will benefit a lot from it.

What is Google Summer of Code?

GSoC is a global program that offers students with an opportunity to contribute their knowledge and coding skills to a list of open-source software and technology-related organisations.

I had the honour to join this program together with other 1,317 students in this year (2017). Specifically, I chose and was selected by the Berkman Klein Center for Internet and Society at Harvard University to implement topic creation using machine learning in the MediaCloud project. I “fall in love” with this project at the first moment seeing it, it is not only my first preference but also the only project I have applied in GSoC. With that being said, I would suggest applicants apply for multiple projects to secure a position in GSoC.

Why do I love it?

The best part of this program is that each student will be guided by an excellent mentor from the host organisation through the entire project. It is my first time to write code for a large open-source program, and my mentor, Linas Valiukas, indeed helped me a lot in this adapting process by efficient communications. Every week, I send my weekly progress reports to him via email, he replies with valuable advice and suggestions. For every push request, he points out things I have done well as well as what I can do to improve. He is also always ready to answer my questions on Slack via quick messages.

The second thing I love is the content of GSoC project it proved can actually assist me to improve my skills. My task in GSoC is to develop a new feature in an existing project, and I have sharpened my coding skills in two ways. First, I can learn by reading existing code from this project, which tells me how a large open-source project should be organised, constructed and contributed by multiple people methodically. By carefully read through the coding standards, I learnt what the best way to guarantee the readability and understandability is in such project. Moreover, by frequent PR and reviews, I get to know things like most appropriate code structure, test cases and other important details that are often neglected by university projects but crucial in real-life ones. Understanding this shows me a way to further prepare myself for the works in the industry.

Furthermore, I love GSoC also because of how thoughtful this program is designed for students with diversities. Given GSoC is a global project participated by students all over the world, it is universal and inevitable that students from different countries and regions might not be able to work on the same time schedule. For example, as a student in Australia, our summer holiday is not as long as the student from the United States. Thus I have not finished my final exams before the beginning date of GSoC. I was pleasantly surprised to learn that Google has thought about this and emphasised that this program should give way to students’ academic duties from the university. My mentor and hosting organisation also kindly gave me a week to prepare for my exams. As a result, I managed to obtain high marks from the university and working on this project at the same time.

Last but not least, remote meetings organised by my host organisation (Berkman Klein Center for Internet and Society at Harvard University) at the end of each stage provided me with a chance to not only demonstrate my work to my peers but also listen to what achievements they have accomplished. I find this interesting as well as encouraging.

How to apply?

All university students from a bachelor degree to a doctoral degree are more than welcome to join. Google will announce the timetable in February every year, so please keep an open eye on here. After that, you can submit proposals for at most five projects among thousands of options. In each proposal, you can show and elaborate your genius ideas, plans and time schedules for that project. If your project proposal is selected by the host organisation, then you are ready to start your own unique journey with GSoC! Have a lovely trip!

Some thoughts about topic modelling

During the Google Summer of Code program of this year (2017), I am working on a project about topic modelling. The intention of this project is to identify the topics of each news article so that they can be grouped for further researches or analysis. I write this post to share some thoughts of mine and briefly discuss the program I implemented. I have also created some slides for this discussion at Prezi.

I would like to divide my implementation into four steps, tokenize data, remove stop-words, apply machine learning algorithms and tune parameters.

Tokenizing data

The data I am using is fetched from the MediaCloud database; each datum consists of two attributes: sentence, and ariticle_id. Attribute sentence contains a plain text string of one sentence from the article, while article_id represents to which article this sentence belongs.

In general, tokenization means breaking sentences up into words. Simple as this may sound, there are always some things we can do better, for example, how to handle verbs in various tense/voice (e.g., study, studying and studied) and nouns in singular/plural forms (e.g., datum and data). This is where we need to introduce lemmatization.

I have used the lemmatizer from nltk to solve this issue. In particular, it uses the WordNet which contains lexical data of English that can help to find conceptual relationships between words such as antonyms, hyponyms, hypernyms, and synonyms.

After this step, all sentences should be divided into words with uniformed forms.

Removing stop-words

Blindly passing all tokens into machine learning algorithm has been proven to be inefficient and inaccurate. A useful preprocessing step should at least contain stop-words removal.

In this project, there are three kinds of stop-words to be removed, meaningless words, meaningless topics and document-frequent words. Meaningless words refer to the words such as ‘to’, ‘for’, and others similar ones. They do not have real meanings hence including them only wastes computation powers. Meaningless topics are analogous to the meaningless words, they may have semantic meanings, but they do not give any useful information as topics (e.g., we do not want to label an article with the topic ‘Mr.’). Document-frequent words refer to words frequently occur in all articles. We prefer to filter them out because they contradict to the idea of topics (i.e., what makes this article different from the others).

Based on the definitions above, there are two ways for us to eliminate stop-words, TF-IDF, and stop-words list. TF-IDF stands for term frequency-inverse document frequency; it weights each word by dividing the number of its occurrence in the current file by its presence in the whole set. Stop-word list contains all the stop-words we want to remove. While the stop-word list may take time and effort to build, we can have more control on it. In this project, I used an existing stop-word list.

Applying machine learning algorithms

Now that all words are meaningful, it is time to feed them to the machine-learning algorithms. While there are numerous approaches, I used Latent Dirichlet Allocation (LDA) model and Non-negative Matrix Factorization (NMF) model. The LDA model posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. This assumption is also akin to the standard bag of words model in our project. The NMF model is less commonly used than the LDA model and has fewer parameters to be tuned. In theory, there are some diﬀerence in the assumptions of LDA and NMF, but here in this project, they have generated very similar topics, which is a good news showing the correctness of topics. Here I focused on the LDA algorithm.

Tuning parameters

After the model is built, we can further improve its accuracy by ﬁnding the most optimal value for each parameter. In particular, we can tune two parameters in our LDA model, alpha and the total number of topics.

Alpha denotes the sparsity of topics among stories. The smaller alpha, the more sparse the topic distribution. This is a bit like the long-tail theory, where most topics have about zero signiﬁcance while a limited number of topics are shared by most stories.

In our project, I focused work on the second parameter, the total number of topics (I will call it topic_num for short).

The simplest way to ﬁnd optimal topic_num would be the brute-force, where we can start with topic_num equals to 1 and then 2, 3 and so on. Although this guarantees us with the correctness of the result, it simply takes too long and wastes tons of computation power.

An alternative way is using recursion, which is a bit like binary search. We start from extreme values and gradually shrink the size until we ﬁnd the optimal value. Again this also requires a relatively large computation power.

The method I use is via solving polynomial equations. After testing the model on various sizes of data set, I found that the likelihood often ﬁrst increases and then decreases with increasing topic_num. Based on this observation, I further assumed the polynomial relationship between likelihood and topic_num.

This allows me to build the polynomial equation based on only a few pre-computed points and identify the maximum value straight away.

The result shows this method exhibits a relatively good performance with significantly lowered costs in time and computation power.

This model also ﬁts well in an increasing data set. For x new stories, instead of re-tune the model, we can simply assume topic_num will also increase by x +/- e, where e is a small value.

Future works

Although the project is finished, there are some improvements to be made. Firstly, we can achieve a better lemmatization by giving the exact position of each word in the sentence. Secondly, we may further investigate the relationship between likelihood and topic_num to give a more accurate prediction. Thirdly, we can hack into the API of LDA model and try to reduce the number of iterations while training.