~ Archive for Privacy and Technology ~



Anonymity Ethics Assignment: Technical Track
Link to GitHub repo.

K-Anonymizing a Redacted HarvardX Dataset: A Report


For this assignment, I had to k-anonymize (k=5) a redacted HarvardX dataset using different techniques. My work in this assignment is largely based on Dr. Latanya Sweeney’s paper, “Achieving k-Anonymity Privacy Protection Using Generalization and Suppression” (2002).


The dataset I was working with had 199,999 records and 15 columns. Two of these were considered identifiers: course_id and user_id. The rest were considered quasi-identifiers and contained information about the users’ demographics and forum usage.


Determining the K-Anonymity

I determined that, considering the full set of quasi-identifiers, the k-anonymity of the dataset was k=1. I found this in two ways. First, I counted how many times each record instance, that is, each distinct combination of quasi-identifier values, appeared. If a record instance appears more than once, that combination of values is not uniquely identifying. Most record instances appeared exactly once, meaning the dataset was only 1-anonymous. Below is a histogram of record-instance counts, highlighting that more than half of all the records in the dataset were completely unique in terms of the original quasi-identifiers.

I confirmed that the original anonymity of the dataset was k=1 using a second approach modeled directly on Dr. Sweeney’s Theorem 1. See Figure 11 in the Jupyter Notebook for the implementation; the function returns the k-anonymity of the dataset.
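The Theorem-1-style check can be sketched in a few lines: k is simply the size of the smallest equivalence class over the quasi-identifiers. This is a minimal stdlib illustration on toy tuples, not the actual HarvardX columns.

```python
from collections import Counter

# Toy stand-in for the real dataset: each tuple holds one record's
# quasi-identifier values (the real dataset had 13 such columns).
records = [
    ("US", "30-39", "M"),
    ("US", "30-39", "M"),
    ("US", "30-39", "M"),
    ("FR", "20-29", "F"),
    ("FR", "20-29", "F"),
    ("DE", "40-49", "F"),  # appears once, so the dataset is only 1-anonymous
]

def k_anonymity(rows):
    """Return k: the size of the smallest equivalence class, i.e. the
    count of the least frequent quasi-identifier combination."""
    return min(Counter(rows).values())

print(k_anonymity(records))  # -> 1
```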

Making Dataset 5-Anonymous With Only Record Suppression

By suppressing 150,286 records, I was able to make the dataset 5-anonymous. That is 75% of the data, so this approach alone filters out most of the dataset. Specifically, I suppressed every record whose record instance appeared four or fewer times.
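In code, this suppression step amounts to dropping every record whose equivalence class is smaller than k. A stdlib sketch on made-up toy values:

```python
from collections import Counter

def suppress_records(rows, k=5):
    """Keep only records whose quasi-identifier combination appears
    at least k times; everything rarer is suppressed outright."""
    counts = Counter(rows)
    return [r for r in rows if counts[r] >= k]

rows = [("US", "M")] * 6 + [("FR", "F")] * 3 + [("DE", "F")]
kept = suppress_records(rows, k=5)
print(len(kept))  # 6: the classes with counts of 3 and 1 are dropped
```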

Making Dataset 5-Anonymous With Only Column Suppression

By suppressing 11 quasi-identifier columns, I was able to make the dataset 5-anonymous. The only remaining quasi-identifier columns were level of education and gender. Clearly this approach also results in a dataset that is not especially useful.
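Column suppression itself is mechanically simple whatever the table representation. A sketch on a list-of-dicts table, with hypothetical column names standing in for the real dataset's:

```python
def suppress_columns(rows, drop):
    """Remove entire quasi-identifier columns from a list-of-dicts table."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Hypothetical column names; the real notebook dropped 11 of 13 quasi-identifiers.
rows = [
    {"gender": "f", "level_of_education": "b", "city": "Boston", "postal_code": "02138"},
    {"gender": "m", "level_of_education": "m", "city": "Paris", "postal_code": "75001"},
]
kept = suppress_columns(rows, drop={"city", "postal_code"})
```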

Making Dataset 5-Anonymous With Only Generalization

I was able to use generalization to make the dataset 5-anonymous by bucketing values in certain columns; specifically, I generalized the lower and/or upper bounds of 8 attributes. I generalized the city, postal code, and country code attributes in a simpler way, closer to suppressing them, though I suggest some more sophisticated generalization strategies for these columns in the Notebook.
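As a concrete example of bucketing, here is how a hypothetical numeric attribute like age could be generalized into decade ranges (illustrative only; the actual notebook generalized eight attributes, partly by clipping their lower and/or upper bounds):

```python
def generalize_age(age):
    """Bucket an exact age into a decade range, e.g. 23 -> '20-29'.
    Coarser buckets trade utility for larger equivalence classes."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

print([generalize_age(a) for a in (21, 28, 35)])  # ['20-29', '20-29', '30-39']
```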

Making Dataset 5-Anonymous With Combination Strategy

Ultimately, the combination strategy made the most sense to me. I started with using generalization to bucket 8 columns as previously detailed. Then I used column suppression to suppress the postal code and city columns. Lastly, I used record suppression to k-anonymize country code. Using this approach, I was able to preserve 73% of the data while maintaining 5-anonymity.


This assignment highlighted to me the pros and cons of the different anonymization strategies and allowed me to explore how to combine them effectively. I now understand Dr. Latanya Sweeney’s k-anonymization Theorems much better. I have also learned more about different k-anonymization algorithms and metrics by implementing them myself.

Privacy Policies


A simple text analysis of 11 website privacy policies accompanied by a report. Written for a Computer Science class in Fall 2021.

Take a look at the code I wrote to analyze these privacy policies!

An Analysis of the Privacy Policies of “Back to School” Websites 


For this project, I compared the privacy policies of various websites I used in the first two weeks of school to purchase back-to-school supplies or otherwise use a related service.


I made a list of the websites I had visited for “back to school” reasons. I chose this subset of websites because (1) I had recently visited them, (2) I had a particular non-entertainment-related reason or need to visit them, and (3) I think my peers may have visited similar sites. Then I searched for their Privacy Policies, copied the policy texts into .txt files, and saved them locally on my computer. Links to all the Privacy Policies are available in the Appendix, and all the .txt files are available on the GitHub repo linked below.

Originally, I intended to build a tool that would take in a link or PDF file and scrape the text directly from there. However, I discovered that there was significant variation in page structures, which would make such scraping difficult across different websites. Since I did not want to limit the scope of my analysis to websites with policies structured in a way I could easily scrape, I opted to copy and paste the texts into .txt files. This step can be reproduced by visiting the linked privacy policies and copy-pasting their main bodies of text into .txt files.

I then wrote Python code, in a Jupyter Notebook, to process the text of each policy. First, I calculated overall metrics for each text, like its line count, total words, and total unique words. Then I calculated various readability and time metrics using textstat.
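textstat computes several standard readability formulas; as an illustration of what is under the hood, here is a stdlib sketch of the Flesch Reading Ease score with a deliberately crude vowel-group syllable heuristic (textstat’s own syllable counting is more careful):

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher = more readable.
    FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

simple = "The cat sat on the mat. It was warm."
dense = ("Notwithstanding aforementioned contractual obligations, organizational "
         "accountability necessitates comprehensive documentation.")
# Short common words score far higher than dense legalese.
```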

In the second part of my analysis, I “cleaned” each text. This cleaning allowed for a more rigorous analysis of the frequencies and counts of words. For each policy, I removed punctuation marks and made all the words lowercase. Then I used a lemmatizer to remove plural suffixes from words. Lastly, I removed “stop words,” which are commonly used English words like “the” and “and.” I got the list of “stop words” from the nltk library.
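The cleaning pipeline can be sketched with the stdlib alone; here a tiny hardcoded stopword list and a crude strip-the-trailing-s rule stand in for nltk’s stopword corpus and WordNet lemmatizer:

```python
import re

# Tiny stand-in for nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "we", "you", "your", "our"}

def clean(text):
    """Lowercase, strip punctuation, crudely de-pluralize, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    # Crude plural stripper in place of a real lemmatizer.
    words = [w[:-1] if w.endswith("s") and not w.endswith("ss") and len(w) > 3 else w
             for w in words]
    return [w for w in words if w not in STOP_WORDS]

print(clean("We collect your personal information and share the services."))
# -> ['collect', 'personal', 'information', 'share', 'service']
```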

After cleaning the text, I conducted my overall analysis on it. I also found the longest and the most common words in each policy. The “cleaning” allowed me to find the most common words that aren’t “the” or other “stop words.” 
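Once the text is tokenized and cleaned, the frequency and length analyses reduce to collections.Counter and max; for example:

```python
from collections import Counter

# Toy cleaned token list (not real policy output).
tokens = ["information", "service", "information", "personal",
          "information", "service", "pseudoephedrine"]

top_three = Counter(tokens).most_common(3)   # most frequent words
longest = max(tokens, key=len)               # longest word

print(top_three)  # [('information', 3), ('service', 2), ('personal', 1)]
print(longest)    # pseudoephedrine
```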

Visit this project’s repo here to see all privacy policy .txt files and the Python Jupyter Notebook script.


In the Appendix are tables containing the full calculated results of my analysis. 

The longest privacy policy by word count was Kayak.com’s, at over 10,000 words, followed in length by Ebay’s and Condé Nast’s. Kayak’s was also the policy calculated to take the longest to read; this relationship between word count and reading time is to be expected. The shortest privacy policy was that of the Harvard Book Store’s website, which was not surprising given the website’s simplicity. The next shortest were Zappos and BBC.

The most readable privacy policies, as calculated by two different scores, were BBC and the Harvard Book Store, in that order. Given BBC’s general mission and commitment to accessibility, a good readability score makes sense here. Zoom’s policy had the lowest (least readable) Flesch score.

I also identified the longest words in each policy. There was some overlap, with a word appearing as the longest in several policies, and some policies had several longest words of the same length. Here is a selection of the longest words: “personalisation” (BBC, Google), “responsibility” (Harvard Book Store, Zappos), “confidentiality” (StackOverflow, Google), “acknowledgement” (StackOverflow), “pseudoephedrine” (Target), “troubleshooting” (Google), “recommendation” (Zappos), and “disproportionate” (Zoom).

The most common word in all but two of the privacy policies was “information.” The words “personal” and “service” also appeared amongst the top three most common words of nearly all the privacy policies.


Overall, with some exceptions, I found that the policies I analyzed were long and not very readable. All 11 privacy policies had a reading-level score of grade 13 or above, meaning every policy was rated to require at least a completed high school education to be fully understood. This raises several problems. I expect a high proportion of users of some of these websites, YouTube and Zoom for example, to be younger than early college age. Furthermore, not all adult users have completed high school. If we agree that informed consent is critical to ethical Terms of Service, then this analysis demonstrates that many commonly used websites have privacy policies that may not be understood by significant portions of their user bases.

On a different note, I had expected that Google would have the longest privacy policy given how many different services it has. I was surprised that Google had a centralized privacy policy document because many of its services are very different. Does it make sense to have the same privacy policy for Gmail as for YouTube? I understand in theory that a company may want to hold some values constant across different services. However, I think the contexts are different enough to warrant more distinction between which parts of the policy apply where (and maybe with examples). At the same time, Google can and does leverage its data across several of its services, so maybe it is not surprising that it has a single policy. I see this example as different from Wired magazine’s privacy policy being that of its parent company, Condé Nast, since many of Condé Nast’s companies are magazines too.

I believe my findings are reproducible as they are outlined very clearly in this report. That being said, I acknowledge that my analysis has many limitations, which I will go over in the last section. I am confident in my results inasmuch as I am confident that my code works as expected. However, I think there are many subtleties to the readability and accessibility of knowledge not captured in my analysis. I would need to incorporate more readability and other metrics to confidently rank privacy policies from best to worst, where best means the policy is clear and accessible. The one exception may be BBC, which I do think had the most comprehensive and accessible privacy policy of the ones I analyzed.


There are several limitations to my study. Firstly, I am not distinguishing in my analysis between text that corresponds to the main policy text and other secondary text like section headers. By including all of this secondary text, I may be slightly skewing word counts. 

Another limitation is a small sample size (n=11) of the policies, as well as the biased nature of their selection given that I selected them based on my personal use. I proposed in this study that these websites are commonly used amongst Harvard students coming “back to school.” However, I have not conducted any research to support this. 

Thirdly, I am comparing privacy policies across fields and very different services. Although I was eager to analyze different policies, I also recognize that comparing “apples to pears to lemons” can make overall conclusions more difficult to draw.

Lastly, I am only extracting and processing the policy text quantitatively. I think there is also a qualitative element to the accessibility of privacy policies: How is the policy formatted? How much text is hidden behind an “expand to read more” button? How many links are there? Thus, adding a qualitative or design analysis would also be beneficial.


Table 1: Overall Analysis

Policy | Line count | All words | All unique words | Cleaned words | Cleaned unique words
BBC | 378 | 3730 | 666 | 1726 | 530
Ebay | 371 | 8357 | 1239 | 4818 | 1046
Harvard Book Store | 8 | 176 | 107 | 89 | 77
Kayak | 403 | 10589 | 1234 | 5783 | 1032
StackOverflow | 240 | 5307 | 927 | 2922 | 778
Sweetgreen | 158 | 4729 | 876 | 2634 | 722
Target | 453 | 6406 | 1030 | 3798 | 857
Wired (Condé Nast) | 449 | 7276 | 1187 | 4341 | 996
YouTube (Google) | 391 | 5168 | 902 | 2951 | 731
Zappos | 47 | 1821 | 539 | 1022 | 437
Zoom | 180 | 4718 | 861 | 2742 | 706

Note: In the Word Count column of Table 1, cells with values over 5,000 words are colored in orange to flag that these policies could take up over 10 A4 single-spaced pages to print. (5,000 single-spaced words roughly equals 10 A4 pages.)

Table 2: Readability and Time

Policy | Flesch score | Consensus grade | Time in minutes
(Flesch: higher is more readable; 0-29 is “very confusing.” Consensus: lower is more readable; the score is the “estimated school grade level required to understand the text.”)
BBC | 58.5 | 13 | ~4
Ebay | 19.98 | 21 | ~11
Harvard Book Store | 41.19 | 13 | ~12 seconds
Kayak | 21.1 | 19 | ~14
StackOverflow | 14.4 | 21 | ~7
Sweetgreen | 31.35 | 15 | ~6
Target | 18.86 | 19 | ~8.5
Wired (Condé Nast) | 8.27 | 21 | ~6.5
YouTube (Google) | 25.7 | 21 | ~6.5
Zappos | 29.93 | 17 | ~2
Zoom | 4.65 | 27 | ~6

Note: In the Consensus column of Table 2, cells with values that denote low readability are colored in red.

Table 3: Most Common Words

Policy | #1 Word | #2 Word | #3 Word
BBC | information | bbc | service
Ebay | data | personal | service
Harvard Book Store | information | privacy | right
Kayak | information | service | may
StackOverflow | information | stack | overflow
Sweetgreen | information | service | personal
Target | information | target | service
Wired (Condé Nast) | information | personal | service
YouTube (Google) | information | google | service
Zappos | information | zappos | com
Zoom | zoom | data | information

Note: Cells are grayed out where the most common word is the name of the company.

Table 4: Links to Privacy Policies

Website | Type | Date | Privacy Policy Title | Privacy Policy Link | Comments
BBC.co.uk | News | Page updated: July 2021 | The BBC Privacy and Cookies Policy | https://www.bbc.co.uk/usingthebbc/privacy-policy/ |
Ebay.com | Retail | Effective from June 20, 2020 | User Privacy Notice | https://www.ebay.com/help/policies/member-behaviour-policies/user-privacy-notice-privacy-policy?id=4260 |
Harvard.com | Books | N/A | Privacy & Policies | https://www.harvard.com/about/privacy_policies/ | This is the Harvard Book Store’s website.
Kayak.com | Travel | Effective Date: July 1, 2021 | Privacy Policy | https://www.kayak.com/privacy |
StackOverflow.com | Q+A | Updated 17 March 2021 | Stack Exchange, Inc. Privacy Policy | https://stackoverflow.com/legal/privacy-policy |
Sweetgreen.com | Food | Effective Date: January 2, 2021 | Privacy policy | https://www.sweetgreen.com/privacy-policy |
Target.com | Retail | Last update: 7/1/2021 | Target Privacy Policy | https://www.target.com/c/target-privacy-policy/-/N-4sr7p |
Wired.com | News | Last Updated: June 30, 2021 | Privacy Policy and Cookie Statement | https://www.condenast.com/privacy-policy#privacypolicy | Wired is owned by Condé Nast and thus falls under its Privacy Policy.
YouTube.com | Video | Effective July 1, 2021 | Privacy Policy | https://policies.google.com/privacy?hl=en | The general Google Privacy Policy is the policy linked in YouTube’s Terms of Service.
Zappos.com | Retail | Last updated: January 15, 2020 | Privacy Notice | https://www.zappos.com/c/privacy-policy |
Zoom.com | Video conferencing | Last updated: June 4, 2021 | Zoom Privacy Statement | https://explore.zoom.us/docs/en-us/privacy.html |

Note: Last Accessed 9/12/21.

Reasonable Expectations


A position paper exercise on privacy. Written for a Computer Science class in Fall 2021.

The Case For Subjective, Individualized Expectations of Privacy


The virtual world challenges the legacy idea of “a reasonable expectation of privacy.” Society cannot broadly determine what should be recognized as “reasonable” in the face of varied digital literacy rates, mixed financial incentives of many of those in society, and ever-changing technologies. Instead, in order to protect important privacy rights, new privacy norms should rely primarily on the subjective expectation of privacy expressed through informed consent.


There are several definitions of “privacy,” conceptually ranging from something morally valuable to a legal right (1). In the United States, the Fourth Amendment grants the right “to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures” (2). This Amendment has been central to American privacy case law and theory. Notably, in Katz v. United States, the interpretation of the Fourth Amendment was expanded to include as private and protected what an individual “seeks to preserve as private,” as Justice Potter Stewart wrote in his majority opinion (3).

In his concurring opinion for the same case, Justice John Marshall Harlan II additionally set forth the idea of “a reasonable expectation of privacy” (4). He suggested that this “reasonable expectation” requires (1) “that a person have exhibited an actual (subjective) expectation of privacy,” and (2) “that the expectation be one that society is prepared to recognize as ‘reasonable.’” This paper will discuss why this idea is no longer useful in the virtual world.

Digital Literacy

The outdated notion of “a reasonable expectation of privacy” is partly founded on society’s supposed ability to broadly agree on expectations of privacy. However, such a consensus cannot possibly be reached because there is too great a difference between the most and the least digitally literate in society. In the United States today, the major concern about digital life is no longer so much around digital access. Instead, the Pew Research Center reports, the critical issue is “the degree to which people succeed or struggle when they use technology to try to navigate their environments, solve problems, and make decisions” (italics mine) (5). The report demonstrates that there are several distinct groups of Americans along the digital literacy spectrum.

This is concerning because in order to define “a reasonable expectation of privacy” in the virtual world, a society needs to be able to define those expectations together. A digitally illiterate society, or one with highly varied levels of digital literacy, surely cannot do this. If a society as a whole is ill-equipped to determine these expectations, then that leaves the possibility of selecting a specific subset of society to do so. However, that presents its own challenges. For example, American elected representatives in Congress themselves appeared fairly digitally illiterate in the 2018 hearings against Facebook (6).

Mixed Incentives

Other challenges arise if a society hands off determining “reasonable expectation[s] of privacy” to a representative or other stakeholder. Evidence has already been presented highlighting the American Congress’s elementary digital literacy. The ability of elected representatives to determine these “reasonable expectation[s]” should further be called into question given that many large technology companies spend millions on lobbying in American politics (7). Lastly, given that many of these technology companies’ business models rely on limited privacy protections for their users, it would be naive to think that they could act on behalf of society to recognize what these reasonable expectations of privacy might be. As Maciej Ceglowski puts it:

Those who benefit from the death of privacy attempt to frame our subjugation in terms of freedom, just like early factory owners talked about the sanctity of contract law. They insisted that a worker should have the right to agree to anything, from sixteen-hour days to unsafe working conditions, as if factory owners and workers were on an equal footing (8).

Fundamentally, society’s inability to determine reasonable privacy expectations for itself in the digital world undermines Justice Harlan II’s framework for thinking about privacy.

Ever-changing Technology

Lastly, given the unprecedented pace of technology development, it is unreasonable to expect that society could keep up with all the new developments. Indeed, it might only be reasonable to assume that the technology companies themselves could keep up with their technology—if even that is a reasonable assumption. Either only those building the virtual world fully understand the implications of that world, or no one does. In either case, Justice Harlan II’s requirement “that the expectation be one that society is prepared to recognize as ‘reasonable’” cannot be fulfilled.

Informed Consent and Expectation of Privacy

Where Justice Harlan II’s second requirement fails in the virtual world, the first requirement thrives. In the virtual world, “a reasonable expectation of privacy” exists only when the person has “exhibited an actual (subjective) expectation of privacy,” as Justice Harlan II puts it in his first requirement (9). This subjective expectation can be reached when there is informed consent. In its report on privacy, the Office of the United Nations High Commissioner for Human Rights reaches a similar conclusion, declaring that users must “have meaningful transparency about how their data are being gathered, stored, used and potentially shared with others, so that they are able to raise concerns and make informed decisions” (10).

Without this meaningful transparency, informed consent cannot be attained and privacy rights are seriously compromised. For instance, in “Markets and Privacy,” Kenneth C. Laudon refutes the classic “fair information principles” on the grounds that the individual cannot possibly review all the systems in which her information is stored (11). When informed consent is not prioritized, it is hard for people to understand and be critical of their privacy in practice. The existing gaps around informed consent in the virtual world are well-documented. A Pew Research study, for example, found that in late 2018, 74% of adult Facebook users “were not aware that the site lists their traits and interests for advertisers” (12).

The subjective expectation of privacy is different in the case of children, who are themselves major Internet users. According to a 2017 National Center for Education Statistics survey, 51% of children aged 3 and 4 and 77% of children aged 11-14 used the Internet at home, at school, or elsewhere (13). It is also known that many children use the Internet and other services, like Facebook, while pretending they are older (14). Under an expectation of privacy grounded in informed consent, existing standards of informed consent for children should be applied. For example, children should have extra protections under privacy law, as they do under other types of laws. The processes for informing consent should also be tailored to children’s needs and comprehension abilities.


Privacy rights are important in both the physical and virtual worlds. The idea of “a reasonable expectation of privacy” does not fully contend with complexities of the digital age like varying digital literacy rates, mixed financial and other incentives in society, and rapidly evolving technology. In its place, the subjective, individualized expectation of privacy expressed through informed consent emerges as the most promising tenet of modern privacy theory.


1 “Privacy.” Stanford Encyclopedia of Philosophy, revised January 18, 2018. https://plato.stanford.edu/entries/priva…

2 “Fourth Amendment.” Legal Information Institute, Cornell Law School. https://www.law.cornell.edu/constitution…

3 “Charles KATZ, Petitioner, v. UNITED STATES.” Legal Information Institute, Cornell Law School. https://www.law.cornell.edu/supremecourt…

4 Ibid.

5 Horrigan, John B. “Digital Readiness Gaps.” Pew Research Center, September 20, 2016. https://www.pewresearch.org/internet/201…

6 Stewart, Emily. “Lawmakers seem confused about what Facebook does — and how to fix it.” Vox, April 10, 2018. https://www.vox.com/policy-and-politics/…

7 Kang, Cecilia and Kenneth P. Vogel. “Technology Giants Amass a Lobbying Army for an Epic Washington Battle.” The New York Times, June 5, 2019. https://www.nytimes.com/2019/06/05/us/po…

8 Ceglowski, Maciej. “The Moral Economy of Tech.” Idle Words, June 26, 2016. https://idlewords.com/talks/sase_panel.h…

9 “Charles KATZ, Petitioner, v. UNITED STATES.” Legal Information Institute, Cornell Law School. https://www.law.cornell.edu/supremecourt…

10 “The right to privacy in the digital age: Report of the Office of the United Nations High Commissioner for Human Rights.” United Nations Human Rights Council, Twenty-seventh session, June 30, 2014.

11 Laudon, Kenneth C. “Markets and Privacy.” Communications of the ACM 39, no. 9 (September 1996): 97. Association for Computing Machinery.

12 Gramlich, John. “10 facts about Americans and Facebook.” Pew Research Center, June 1, 2021. https://www.pewresearch.org/fact-tank/20…

13 “Percentage of persons age 3 and over who use the Internet anywhere and who use the Internet at selected locations, by selected characteristics: 2011 and 2017.” National Center for Education Statistics. https://nces.ed.gov/programs/digest/d18/…

14 Aiken, Mary. “The Kids Who Lie About Their Age to Join Facebook.” The Atlantic, August 30, 2016. https://www.theatlantic.com/technology/a…
