Navigating Regulation and Data Sets (Big Data Team)

Dealing with the complexities of data in both programming and policy is an apt way of characterizing the last two months for the Big Data team. On the programming side, we’ve tackled a massive trove of information that needs organization and translation. Our product aims to bring order to that data and provide a useful tool for educational researchers. On the policy side, we have dug even deeper into the Family Educational Rights and Privacy Act (FERPA) and have found, to our surprise, that no one seems to have “the answer” on de-identifying student data in the MOOC (massive open online course) context.

The programming side of our group has been working on understanding the data we receive from edX and HarvardX courses. We have started writing code to convert that data into a more digestible form, and we are putting together basic sanity checks on it. For example, if a student appears in the list of students considered for a certificate in a course, do they also appear in the list of users signed up for that course? After these basic checks, we will continue working with the policy side of the group to figure out exactly how this data can be made available to educational researchers. One interesting way to test the dataset is the concept of k-anonymity: for a user who has a certain set of attributes, are there at least k-1 other users who have the same set of attributes? This allows us to see how many users would be uniquely identifiable in a dataset, and then to choose the included attributes so that the released dataset is fully de-identified.
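To make the two checks described above concrete, here is a minimal Python sketch. It is only an illustration and not the team's actual code: the function names, the user IDs, and the attribute fields (`country`, `birth_year`, `education`) are all hypothetical, and the real edX data almost certainly has a different layout.

```python
from collections import Counter


def certificate_students_not_enrolled(certificate_ids, enrolled_ids):
    """Sanity check: IDs considered for a certificate but missing from enrollment."""
    return sorted(set(certificate_ids) - set(enrolled_ids))


def k_anonymity_violations(records, attributes, k):
    """Return each attribute combination shared by fewer than k records.

    A dataset is k-anonymous over `attributes` when this result is empty,
    i.e. every user has at least k-1 others with the same attribute values.
    """
    counts = Counter(tuple(record[a] for a in attributes) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical sample data, invented purely for illustration.
certificate_ids = ["u1", "u3", "u9"]
enrolled_ids = ["u1", "u2", "u3"]

records = [
    {"country": "US", "birth_year": 1990, "education": "bachelors"},
    {"country": "US", "birth_year": 1990, "education": "bachelors"},
    {"country": "FR", "birth_year": 1975, "education": "doctorate"},
]

missing = certificate_students_not_enrolled(certificate_ids, enrolled_ids)
violations = k_anonymity_violations(
    records, ["country", "birth_year", "education"], k=2
)

print(missing)     # ['u9']
print(violations)  # {('FR', 1975, 'doctorate'): 1}
```

In this toy example, the user from France is the only person with that combination of attributes, so the dataset is not 2-anonymous. Dropping or coarsening an attribute (for instance, releasing a birth decade instead of a birth year) and rerunning the check is one way to see whether the violation disappears.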

On the user experience design branch of the team, we have created a plan for the design process and interviewed Justin and Sergiy, HarvardX Research Fellows, to get their insights on the existing user interface provided by edX. From their feedback, we have better defined the task at hand. The next goal is to meet with a few faculty members to develop use cases and continue the design process from there.

Our policy team has met with lawyers involved with edX and with other Harvard staff well-versed in FERPA and data privacy. These meetings have revealed how unclear FERPA is on online-only education. We have concluded, however, that FERPA applies to our project and thus that we need to identify levels of data for release to our different constituencies. The most challenging aspect of this is developing a de-identification process that will allow the release of student data to researchers without violating FERPA. Our next step is to complete a memorandum discussing our options and possible de-identification methods.

— Elise Young