
Anonymity Ethics Assignment: Technical Track
Link to GitHub repo.

K-Anonymizing a Redacted HarvardX Dataset: A Report

Background

For this assignment, I had to k-anonymize (k=5) a redacted HarvardX dataset using different techniques. My work is largely based on Dr. Latanya Sweeney’s paper, “Achieving k-Anonymity Privacy Protection Using Generalization and Suppression” (2002).

Data

The dataset I was working with had 199,999 records and 15 columns. Two of these were identifiers: course_id and user_id. The remaining 13 were treated as quasi-identifiers and contained information about the users’ demographics and forum usage.

Methods

Determining the K-Anonymity

Considering only the set of quasi-identifiers, I determined that the k-anonymity of the dataset was k=1. I found this in two ways. First, I counted how many times each record instance, that is, each distinct combination of quasi-identifier values, appeared in the dataset. If a record instance appears only once, that combination of values uniquely identifies a record. Since at least one record instance (in fact, most of them) appeared exactly once, the dataset was only 1-anonymous. Below is a histogram of the counts of record instances, highlighting that more than half of all the records in the dataset were completely unique in terms of the original quasi-identifiers.
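Here is a minimal sketch of that counting step in pandas. The file name and the quasi-identifier list are hypothetical, abbreviated stand-ins (the real dataset has 13 quasi-identifier columns), not the actual names from the Notebook.

```python
import pandas as pd

# Hypothetical file name and an abbreviated, illustrative list of
# quasi-identifier columns; the real dataset has 13 of them.
QUASI_IDENTIFIERS = ["gender", "level_of_education", "year_of_birth",
                     "country_code", "postal_code", "city", "n_forum_posts"]

df = pd.read_csv("harvardx_redacted.csv")

# Count how many records share each exact combination of quasi-identifier
# values, i.e. the size of each record instance (equivalence class).
instance_counts = df.groupby(QUASI_IDENTIFIERS, dropna=False).size()

# A record is uniquely identifying if its combination appears exactly once.
unique_records = (instance_counts == 1).sum()
print(f"Fraction of records that are unique: {unique_records / len(df):.1%}")

# Histogram data of record-instance counts (how many instances appear 1x, 2x, ...).
print(instance_counts.value_counts().sort_index())
```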

I confirmed that the k-anonymity of the original dataset was k=1 using a second approach modeled directly on Dr. Sweeney’s Theorem 1. See Figure 11 in the Jupyter Notebook for the implementation; the function returns the k-anonymity of the dataset.
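A sketch of that second check, reusing the hypothetical df and QUASI_IDENTIFIERS from above: per Theorem 1, the k of a table is the size of its smallest group of records sharing the same quasi-identifier values.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the k-anonymity of df with respect to quasi_identifiers.

    Per Sweeney's Theorem 1, a table is k-anonymous when every combination
    of quasi-identifier values appearing in it occurs at least k times, so
    k is the size of the smallest equivalence class.
    """
    sizes = df.groupby(quasi_identifiers, dropna=False, observed=True).size()
    return int(sizes.min())

# k_anonymity(df, QUASI_IDENTIFIERS) returned 1 for the original dataset.
```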

Making Dataset 5-Anonymous With Only Record Suppression

By suppressing 150,286 records, roughly 75% of the data, I was able to make the dataset 5-anonymous, which shows that this approach alone filters out most of the data. Specifically, I suppressed every record whose quasi-identifier combination appeared four or fewer times.
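A sketch of the record-suppression step under the same assumptions as above; user_id (which the dataset does contain) is used only to carry the per-group size back to each row.

```python
# Keep only records whose quasi-identifier combination occurs at least 5 times.
class_size = df.groupby(QUASI_IDENTIFIERS, dropna=False)["user_id"].transform("size")
df_records = df[class_size >= 5]

print(f"Suppressed {len(df) - len(df_records)} records")
print(f"k after record suppression: {k_anonymity(df_records, QUASI_IDENTIFIERS)}")
```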

Making Dataset 5-Anonymous With Only Column Suppression

By suppressing 11 of the 13 quasi-identifier columns, I was able to make the dataset 5-anonymous. The only remaining quasi-identifier columns were level of education and gender. Clearly, this approach also results in a dataset that is not especially useful.
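The column-suppression version is short, again with the hypothetical column names from the earlier sketch: everything except level of education and gender is dropped.

```python
KEPT = ["level_of_education", "gender"]

# Drop every quasi-identifier column except the two kept ones.
df_columns = df.drop(columns=[c for c in QUASI_IDENTIFIERS if c not in KEPT])

print(f"k after column suppression: {k_anonymity(df_columns, KEPT)}")
```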

Making Dataset 5-Anonymous With Only Generalization

I was able to use generalization to make the dataset 5-anonymous by bucketing the values in certain columns. Specifically, I generalized the lower and/or upper bounds of 8 attributes. I generalized the city, postal code, and country code attributes in a simpler way that is closer to suppression, though I suggest some more sophisticated generalization strategies for these columns in the Notebook.
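A sketch of the bucketing, with illustrative column names and bin edges (not the actual cut points used in the Notebook), plus a coarse location generalization that is close in effect to suppression.

```python
df_gen = df.copy()

# Bucket a numeric attribute into ranges instead of exact values.
df_gen["year_of_birth"] = pd.cut(
    df_gen["year_of_birth"],
    bins=[1900, 1960, 1980, 1990, 2000, 2010],
    labels=["<=1960", "1961-1980", "1981-1990", "1991-2000", "2001-2010"])

# Bucket a count attribute, collapsing the long tail into an open upper bound.
df_gen["n_forum_posts"] = pd.cut(
    df_gen["n_forum_posts"],
    bins=[-1, 0, 5, 20, float("inf")],
    labels=["0", "1-5", "6-20", ">20"])

# Coarse generalization of country code: keep the most common values and
# map everything else to a single "OTHER" bucket.
common = df_gen["country_code"].value_counts().head(20).index
df_gen["country_code"] = df_gen["country_code"].where(
    df_gen["country_code"].isin(common), other="OTHER")
```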

Making Dataset 5-Anonymous With Combination Strategy

Ultimately, the combination strategy made the most sense to me. I started by using generalization to bucket 8 columns as described above. Then I used column suppression to drop the postal code and city columns. Lastly, I used record suppression to handle the remaining small equivalence classes in country code. Using this approach, I was able to preserve 73% of the data while maintaining 5-anonymity.
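Putting the pieces together, here is a sketch of the combined pipeline under the same assumptions as the earlier snippets (the generalized df_gen from above, hypothetical column names).

```python
# 1. Generalization has already produced df_gen (see the previous sketch).
# 2. Column suppression: drop postal code and city entirely.
df_combo = df_gen.drop(columns=["postal_code", "city"])
remaining_qis = [c for c in QUASI_IDENTIFIERS if c not in ("postal_code", "city")]

# 3. Record suppression: remove the remaining equivalence classes smaller than 5.
class_size = df_combo.groupby(
    remaining_qis, dropna=False, observed=True)["user_id"].transform("size")
df_combo = df_combo[class_size >= 5]

print(f"k after combination strategy: {k_anonymity(df_combo, remaining_qis)}")
print(f"Records preserved: {len(df_combo) / len(df):.0%}")  # ~73% in the report
```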

Conclusion

This assignment highlighted the pros and cons of the different anonymization strategies and allowed me to explore how to combine them effectively. I now understand Dr. Latanya Sweeney’s k-anonymity theorems much better, and I have learned more about different k-anonymization algorithms and metrics by implementing them myself.
