Category Archives: Big Data

Developing big data analysis tools 2.0

What we worked on

The Big Data team spent the past few weeks introducing the group’s work so far and setting concrete goals for this semester. We are extremely excited to welcome new highly qualified and interested team members!

During our first meeting, the team joined the Privacy Tools group of the Center for Research in Computation and Society to talk about the results from the group’s work last year and the CACM paper that followed.

Next, we met to learn about every member’s interests and goals, and discussed how we might work together to define and explore the key questions that interest the group.

Our Work Plan for the Semester

Throughout the semester we will be asking three key questions:

(1) Are current anonymization techniques used on large datasets able to keep the data and its properties reliable and complete?

Can we use anonymized datasets in research?

Can robust insights be generated from such anonymized datasets?

To answer such questions, we will analyze samples of such datasets and try to understand whether analysis of the original datasets and the anonymized ones generates the same results.
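One way to picture this comparison is to compute the same statistic on both versions of a dataset and look at the gap. The sketch below is only an illustration under assumed conditions: the records, the `user_id` and `completed` fields, and the choice of completion rate as the statistic are all hypothetical, and the "anonymized" sample is produced by simply suppressing one record and dropping the identifier.

```python
def completion_rate(records):
    """Fraction of records marked completed; None for an empty dataset."""
    if not records:
        return None
    return sum(1 for r in records if r["completed"]) / len(records)


# Hypothetical original sample (not real course data).
original = [
    {"user_id": 1, "completed": True},
    {"user_id": 2, "completed": False},
    {"user_id": 3, "completed": True},
    {"user_id": 4, "completed": False},
]

# Assumed anonymization step: suppress user 3 (treated here as identifiable)
# and drop the user_id field from every remaining record.
anonymized = [
    {"completed": r["completed"]} for r in original if r["user_id"] != 3
]

rate_original = completion_rate(original)
rate_anonymized = completion_rate(anonymized)
gap = abs(rate_original - rate_anonymized)

print(f"original:   {rate_original:.3f}")
print(f"anonymized: {rate_anonymized:.3f}")
print(f"gap:        {gap:.3f}")
```

Even in this toy case, suppressing a single record moves the statistic, which is the kind of distortion the first question asks about.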

(2) If current methods do not maintain the data’s key properties, is there an anonymization method that can do so?

We will experiment with different ways of anonymizing data and try to understand which one, if any, generates robust and satisfactory results in a way that maintains the qualities of the original data and does not compromise users’ privacy.

(3) Finally, if anonymizing data in a way that maintains its original properties is not possible, we will research and brainstorm new concepts of privacy.

Can privacy exist without anonymity?

This is a huge undertaking, and one that many have thought of in the past. We will spend the semester researching different notions of privacy and try to understand what lies at the core of it, and whether we can generate a kind of privacy without anonymity. While we may not succeed, we think that spending time on this issue is important.

What went well so far

Everyone seems genuinely fascinated by the problem and excited to get our hands dirty trying to do some meaningful work on the subject. We are all thrilled to be working on a subject that not many others have explored before us.

What was challenging

Since we are stepping onto a path not many have walked before us, we will have to figure things out as we go. This may be challenging at times, but we will work to create a supportive community that will facilitate a productive process.

What’s up next

We have been building a reading list to get everyone up to speed on concepts in data anonymization, de-identification methods, legal requirements, and related notions of privacy. In the next week or so we will discuss the readings and the larger themes around them. We are also working on securing access to some large data sets so that we can start conducting preliminary analysis and visualization.

Navigating Regulation and Data Sets (Big Data Team)

Dealing with the complexities of data in both programming and policy is an apt way of characterizing the last two months for the Big Data team. On the programming side, we’ve tackled a massive trove of information that is in need of organization and translation. Our product aims to resolve the discrepancies in that data and provide a useful tool for educational researchers. On the policy side, we have dug even deeper into the Family Educational Rights and Privacy Act (FERPA) and have found, to our surprise, that no one seems to have “the answer” on de-identifying student data in the MOOC (massive open online course) context.

The programming team of our group has been working on understanding the data we receive from edX and HarvardX courses. We have started writing code to convert the data we receive into a more digestible form. We are also putting together basic sanity checks on the data. For example, if a student appears in the list of students considered for a certificate in a course, do they also appear in the list of users signed up for the course? After these basic checks, we will continue interfacing with the policy side of the group to figure out exactly how this data can be made available to educational researchers. One interesting way to test the dataset is the concept of k-anonymity: for a user who has a certain set of attributes, are there at least k-1 other users who have the same set of attributes? This allows us to see how many users would be uniquely identifiable in a dataset, and then to work on making sure the attributes that are included give us a fully de-identified dataset.
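The two checks described above can be sketched in a few lines of Python. This is a minimal illustration and not the team's actual code: the record layout, the field names (`user_id`, `country`, `year_of_birth`, `education`), and the sample rows are all hypothetical, and the real edX exports may be structured quite differently.

```python
from collections import Counter


def certificate_users_not_enrolled(certificate_ids, enrolled_ids):
    """Sanity check: IDs considered for a certificate but absent from enrollment."""
    return sorted(set(certificate_ids) - set(enrolled_ids))


def k_anonymity_violations(records, quasi_identifiers, k):
    """Return records whose attribute combination is shared by fewer than k users."""
    def key(record):
        return tuple(record[field] for field in quasi_identifiers)

    group_sizes = Counter(key(r) for r in records)
    return [r for r in records if group_sizes[key(r)] < k]


# Hypothetical sample data (not real student records).
enrolled = [
    {"user_id": 1, "country": "US", "year_of_birth": 1990, "education": "bachelors"},
    {"user_id": 2, "country": "US", "year_of_birth": 1990, "education": "bachelors"},
    {"user_id": 3, "country": "DE", "year_of_birth": 1975, "education": "doctorate"},
]
certificate_candidates = [1, 3, 4]

missing = certificate_users_not_enrolled(
    certificate_candidates, [r["user_id"] for r in enrolled]
)
at_risk = k_anonymity_violations(
    enrolled, ["country", "year_of_birth", "education"], k=2
)

print("Certificate candidates not enrolled:", missing)
print("Users below k=2:", [r["user_id"] for r in at_risk])
```

In this sample, user 4 fails the sanity check because they were never enrolled, and user 3 is the only person with their combination of attributes, so the dataset is not 2-anonymous over those three fields. Which attributes count as identifying is itself a judgment call that this sketch leaves open.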

On the user experience design branch of the team, we have created a plan for the design process and interviewed Justin and Sergiy, HarvardX Research Fellows, to get their insights on the existing user interface provided by edX. From their feedback, we have better defined the task at hand. The next goal is to meet with a few faculty members to develop use cases and continue the design process from there.

Our policy team has met with lawyers involved with edX and other Harvard staff well-versed in FERPA and data privacy. These meetings have revealed how unclear FERPA is on online-only education. We have concluded, however, that FERPA applies to our project and thus that we need to identify levels of data for release to our different constituencies. The most challenging aspect of this is developing a de-identification process that will allow the release of student data to researchers without violating FERPA. Our next step is to complete a memorandum discussing our options and possible de-identification methods.

— Elise Young

From the DPSI Launch Event –

Jim Waldo, mentor of the Tools for Big Data Analysis team, explains why the time is ripe for the creation of powerful instruments to advance digital education. Tools for big data analysis can enable non-technologists to use, benefit from, and respond to data analysis in educational contexts.

Big Data Team – FERPA Overview

Education researchers are excited about the potential to improve teaching methods and enhance student learning by analyzing data from massive open online courses (MOOCs). The DPSI Big Data Team is creating a platform to normalize and distribute edX data to researchers. The Big Data Team is also evaluating student privacy concerns and researching whether the Family Educational Rights and Privacy Act (FERPA) applies to edX data.

In brief, FERPA protects student data by prohibiting third-party access to “education records” that contain personally identifiable information without student consent. FERPA only applies to schools that receive funds under any applicable program of the U.S. Department of Education; this may include direct funding or application of federal funds to student tuition (i.e. FAFSA).

The Big Data Team’s initial research indicates that FERPA does not provide a private cause of action, so there will be no FERPA-based lawsuits. However, a policy or practice of confidentiality violations may result in the discontinuation of federal funding, a serious blow to any institution, such as Harvard, that receives significant federal funds.

The Big Data Team is currently researching several questions including:

1. Is edX data subject to FERPA, and if so, how should the team anonymize the data so that personally identifiable information is removed?

2. Who is a “student” within the context of FERPA? Is an online-only user in Germany considered a student, the same as a Harvard-enrolled student who is required to use edX?

3. Who within a school may access personally identifiable information? Are school software engineers, data analytics professionals, or Big Data Team members eligible “third party researchers” who may be permitted to access personally identifiable information? If so, what additional requirements must be met, and how should the team comply?

In the upcoming weeks the Big Data Team will meet with Harvard and MIT administrators to research edX’s organizational structure, funding sources, and relationship with federally funded institutions.

David Gobaud (JD ’15), Lindsay Lin (JD ’15), Sharon Stovezky (College ’15), and Elise Young (JD ’14)