Privacy Policies

A simple text analysis of 11 website privacy policies accompanied by a report. Written for a Computer Science class in Fall 2021.

// Take a look at the code I wrote to analyze these privacy policies!

An Analysis of the Privacy Policies of “Back to School” Websites 

Aim:

To compare the privacy policies of various websites I used in the first two weeks of school to purchase back to school supplies or otherwise use a related service.

Methods: 

I made a list of the websites I had visited for “back to school” reasons. I chose this subset of websites because (1) I had recently visited them, (2) I had a particular non-entertainment-related reason or need to visit them, and (3) I think my peers may have visited similar sites. Then I searched for their Privacy Policies, copied the policy texts into .txt files, and saved them locally on my computer. Links to all the Privacy Policies are available in the Appendix and all the .txt files are available on the GitHub repo linked below.

Originally, I intended to build a tool that would take in a link or PDF file and scrape the text directly from it. However, I discovered that page structures varied significantly between websites, which would have made scraping difficult. Since I did not want to limit the scope of my analysis to websites with policies structured in a way I could easily scrape, I opted to copy and paste the texts into .txt files. This step can be reproduced by visiting the linked privacy policies and copy-pasting their main bodies of text into .txt files.

I then wrote some Python code to process the text of each policy. I used a Jupyter Notebook to facilitate the text analysis. First, I calculated overall metrics of each text like its line count, total words, and total unique words. Then I calculated various readability and time metrics using textstat.
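The overall metrics described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and skips blank lines when counting lines; both are assumptions rather than details confirmed by the report, and the function and folder names are hypothetical. The readability and time metrics from textstat are not reproduced here.

```python
from pathlib import Path


def overall_metrics(text: str) -> dict:
    """Return line count, total words, and unique words for one policy text."""
    # Assumption: blank lines are not counted as lines.
    lines = [line for line in text.splitlines() if line.strip()]
    # Whitespace tokenization: punctuation stays attached to words at this stage.
    words = text.split()
    return {
        "line_count": len(lines),
        "word_count": len(words),
        "unique_words": len(set(words)),
    }


def metrics_for_folder(folder: str) -> dict:
    """Compute the metrics for every .txt file in a (hypothetical) folder."""
    return {
        path.stem: overall_metrics(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.txt"))
    }


sample = "We collect information.\nWe use information to provide the service.\n"
print(overall_metrics(sample))
```

Because punctuation is still attached at this point, "information." and "information" count as different unique words, which is one reason a cleaning step matters for frequency counts.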

In the second part of my analysis, I “cleaned” each text. This cleaning allowed for a more rigorous analysis of the frequencies and counts of words. For each policy, I removed punctuation marks and made all the words lowercase. Then I used a lemmatizer to remove plural suffixes from words. Lastly, I removed “stop words,” which are commonly used words in English like “the” and “and.” I took the list of “stop words” from the nltk library.
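The cleaning steps can be sketched as follows using only the standard library. This is not the nltk pipeline from the report: the stop-word set is a tiny illustrative list, and `naive_singular` is a crude, hypothetical stand-in for a real lemmatizer that only strips a trailing plural "s".

```python
import string

# Tiny illustrative stop-word list (assumption); the report used the list shipped with nltk.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "we", "our", "your", "you", "is", "are"}


def naive_singular(word: str) -> str:
    """Crude stand-in for a lemmatizer: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def clean_text(text: str) -> list:
    """Remove ASCII punctuation, lowercase, singularize, and drop stop words."""
    # string.punctuation covers ASCII marks only; curly quotes would need extra handling.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.lower().split()
    singular = [naive_singular(word) for word in words]
    return [word for word in singular if word not in STOP_WORDS]


print(clean_text("We share your Cookies, and the services use cookies."))
```

A real lemmatizer handles irregular forms ("policies", "children") that this suffix rule gets wrong, so the sketch only shows the order of the steps.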

After cleaning the text, I repeated the overall analysis on the cleaned version. I also found the longest and the most common words in each policy. The “cleaning” allowed me to find the most common words that aren’t “the” or other “stop words.”

Visit this project’s repo here to see all privacy policy .txt files and the Python Jupyter Notebook script.
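Finding the most common and the longest words in a cleaned word list could look like the sketch below. The function names and the sample list are hypothetical; the sketch returns every word tied for the greatest length, since several policies had more than one longest word.

```python
from collections import Counter


def most_common_words(words, n=3):
    """Return the n most frequent words with their counts."""
    return Counter(words).most_common(n)


def longest_words(words):
    """Return every distinct word tied for the greatest length, sorted alphabetically."""
    if not words:
        return []
    max_len = max(len(word) for word in words)
    return sorted({word for word in words if len(word) == max_len})


# Hypothetical cleaned word list for illustration.
cleaned = ["information", "service", "information", "personal",
           "confidentiality", "service", "information", "troubleshooting"]

print(most_common_words(cleaned, n=2))
print(longest_words(cleaned))
```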

Results:

In the Appendix are tables containing the full calculated results of my analysis. 

The longest privacy policy by word count was Kayak.com, which had over 10,000 words. It was followed in length by Ebay and Condé Nast. Kayak was also the privacy policy calculated to take the longest time to read. This relationship between word count and time to read is to be expected. The shortest privacy policy was that of the Harvard Book Store’s website. This was not surprising given its website’s simplicity. The next shortest were Zappos and BBC. 

The most readable privacy policies as calculated by two different scores were BBC and the Harvard Book Store, in that order. Given BBC’s general mission and commitment to accessibility, a good readability score makes sense here. Zoom had the least readable Flesch score.
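For context on what a Flesch score measures, here is a rough sketch of the Flesch Reading Ease formula: 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The syllable counter is a naive vowel-group heuristic and the sentence splitter is a simple regular expression, both assumptions of this sketch. textstat counts syllables differently, so these numbers will not match Table 2.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example sentences, not taken from any of the policies.
simple = "We keep your data safe. You can ask us to delete it."
dense = ("Notwithstanding the aforementioned provisions, personally identifiable "
         "information may be disclosed to affiliated organizations.")

print(round(flesch_reading_ease(simple), 1))
print(round(flesch_reading_ease(dense), 1))
```

Long sentences and long words both pull the score down, which is consistent with legal-style policy text scoring poorly.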

I also identified the longest words in each policy. Some words appeared as the longest in several policies, and some policies had several longest words of the same length. Here is a selection of the longest words: “personalisation” (BBC, Google), “responsibility” (Harvard Book Store, Zappos), “confidentiality” (StackOverflow, Google), “acknowledgement” (StackOverflow), “pseudoephedrine” (Target), “troubleshooting” (Google), “recommendation” (Zappos), and “disproportionate” (Zoom).

The most common word in all but two of the privacy policies was “information.” The word “service” appeared among the top three most common words in seven of the 11 policies, and “personal” in three (see Table 3).

Conclusion/Discussion: 

Overall, with some exceptions, I found that the policies I analyzed were long and not very readable. All 11 privacy policies had a reading level score of grade level 13 or above. This means that all of the privacy policies were rated as requiring at least a completed high school education to be fully understood. This raises several problems. I expect a high proportion of users of some of these websites, for example YouTube and Zoom, to be younger than early college age. Furthermore, not all adult users have completed high school. If we agree that informed consent is critical to ethical Terms of Service, then this analysis suggests that many commonly used websites have privacy policies that may not be understood by significant portions of their user bases.

On a different note, I had expected that Google would have the longest privacy policy given how many different services it has. I was surprised that Google had a centralized privacy policy document because many of its services are very different. Does it make sense to have the same privacy policy for Gmail as for YouTube? I understand in theory that a company may want to hold some values constant across different services. However, I think the contexts are different enough to warrant more distinction between which parts of the policy apply where (and maybe with examples). At the same time, Google can and does leverage its data across several of its services, so maybe it is not surprising that it has a single policy. I see this example as different from Wired magazine’s privacy policy being that of its parent company, Condé Nast, since many of Condé Nast’s companies are magazines too.

I believe my findings are reproducible because the method is outlined step by step in this report. That being said, I acknowledge that my analysis has many limitations, which I discuss in the Limitations section. I am confident in my results insofar as I am confident that my code works as expected. However, I think there are many subtleties to readability and accessibility of knowledge not captured in my analysis. I would need to incorporate more readability and other metrics to confidently rank privacy policies from best to worst, where best means the policy is clear and accessible. The one exception may be BBC, which I do think had the most comprehensive and accessible privacy policy of the ones I analyzed.

Limitations:

There are several limitations to my study. Firstly, I am not distinguishing in my analysis between text that corresponds to the main policy text and other secondary text like section headers. By including all of this secondary text, I may be slightly skewing word counts. 

Another limitation is a small sample size (n=11) of the policies, as well as the biased nature of their selection given that I selected them based on my personal use. I proposed in this study that these websites are commonly used amongst Harvard students coming “back to school.” However, I have not conducted any research to support this. 

Thirdly, I am comparing privacy policies across fields and very different services. Although I was eager to analyze different policies, I also recognize that comparing “apples to pears to lemons” can make overall conclusions more difficult to draw.

Lastly, I am only extracting and processing the policy text quantitatively. I think there is also a qualitative element to the accessibility of privacy policies: How is the policy formatted? How much text is hidden behind an “expand to read more” button? How many links are there? Thus, adding a qualitative or design analysis would also be beneficial.

Appendix

Table 1: Overall Analysis

Policy | Line count | All words: word count | All words: unique words | Cleaned words: word count | Cleaned words: unique words
BBC | 378 | 3730 | 666 | 1726 | 530
Ebay | 371 | 8357 | 1239 | 4818 | 1046
Harvard Book Store | 8 | 176 | 107 | 89 | 77
Kayak | 403 | 10589 | 1234 | 5783 | 1032
StackOverflow | 240 | 5307 | 927 | 2922 | 778
Sweetgreen | 158 | 4729 | 876 | 2634 | 722
Target | 453 | 6406 | 1030 | 3798 | 857
Wired (Condé Nast) | 449 | 7276 | 1187 | 4341 | 996
YouTube (Google) | 391 | 5168 | 902 | 2951 | 731
Zappos | 47 | 1821 | 539 | 1022 | 437
Zoom | 180 | 4718 | 861 | 2742 | 706

Note: In the Word Count column of Table 1, cells with values over 5,000 words are colored in orange to flag that these policies could take up over 10 A4 single-spaced pages to print. (5,000 single-spaced words roughly equals 10 A4 pages.)
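The page estimate in this note can be checked with a short sketch. It uses the “All words” word counts from Table 1 and assumes roughly 500 words per single-spaced A4 page, which is the ratio implied by the note; the variable and function names are hypothetical.

```python
WORDS_PER_A4_PAGE = 500  # assumption implied by the note: 5,000 words is roughly 10 pages
FLAG_THRESHOLD = 5000    # word count above which a policy is flagged

# "All words" word counts copied from Table 1.
word_counts = {
    "BBC": 3730, "Ebay": 8357, "Harvard Book Store": 176, "Kayak": 10589,
    "StackOverflow": 5307, "Sweetgreen": 4729, "Target": 6406,
    "Wired (Condé Nast)": 7276, "YouTube (Google)": 5168, "Zappos": 1821,
    "Zoom": 4718,
}


def estimated_pages(word_count, words_per_page=WORDS_PER_A4_PAGE):
    """Estimate how many single-spaced A4 pages a text would fill."""
    return word_count / words_per_page


flagged = {name: count for name, count in word_counts.items() if count > FLAG_THRESHOLD}

# Print flagged policies from longest to shortest.
for name, count in sorted(flagged.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {count} words, about {estimated_pages(count):.1f} pages")
```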

Table 2: Readability and Time

Readability scores: for the Flesch score, higher is more readable and 0-29 is “very confusing.” For the Consensus score, lower is more readable; the score is the “estimated school grade level required to understand the text.”

Policy | Flesch score | Consensus score | Time in minutes
BBC | 58.5 | 13 | ~4
Ebay | 19.98 | 21 | ~11
Harvard Book Store | 41.19 | 13 | ~12 seconds
Kayak | 21.1 | 19 | ~14
StackOverflow | 14.4 | 21 | ~7
Sweetgreen | 31.35 | 15 | ~6
Target | 18.86 | 19 | ~8.5
Wired (Condé Nast) | 8.27 | 21 | ~6.5
YouTube (Google) | 25.7 | 21 | ~6.5
Zappos | 29.93 | 17 | ~2
Zoom | 4.65 | 27 | ~6

Note: In the Consensus column of Table 2, cells with values that denote low readability scores are colored in red.

Table 3: Most Common Words

Policy | #1 Word | #2 Word | #3 Word
BBC | information | bbc | service
Ebay | data | personal | service
Harvard Book Store | information | privacy | right
Kayak | information | service | may
StackOverflow | information | stack | overflow
Sweetgreen | information | service | personal
Target | information | target | service
Wired (Condé Nast) | information | personal | service
YouTube (Google) | information | google | service
Zappos | information | zappos | com
Zoom | zoom | data | information

Note: Cells are grayed out where the most common word is the name of the company.

Table 4: Links to Privacy Policies

Website | Type | Date | Privacy Policy Title | Privacy Policy Link | Comments
BBC.co.uk | News | Page updated: July 2021 | The BBC Privacy and Cookies Policy | https://www.bbc.co.uk/usingthebbc/privacy-policy/ |
Ebay.com | Retail | effective from June 20, 2020 | User Privacy Notice | https://www.ebay.com/help/policies/member-behaviour-policies/user-privacy-notice-privacy-policy?id=4260 |
Harvard.com | Books | N/A | Privacy & Policies | https://www.harvard.com/about/privacy_policies/ | This is the Harvard Book Store’s website.
Kayak.com | Travel | Effective Date: July 1, 2021 | Privacy Policy | https://www.kayak.com/privacy |
StackOverflow.com | Q+A | Updated 17 March 2021 | Stack Exchange, Inc. Privacy Policy | https://stackoverflow.com/legal/privacy-policy |
Sweetgreen.com | Food | Effective Date: January 2, 2021 | Privacy policy | https://www.sweetgreen.com/privacy-policy |
Target.com | Retail | Last update: 7/1/2021 | Target Privacy Policy | https://www.target.com/c/target-privacy-policy/-/N-4sr7p |
Wired.com | News | Last Updated: June 30, 2021 | Privacy Policy and Cookie Statement | https://www.condenast.com/privacy-policy#privacypolicy | Apparently Wired is owned by Condé Nast and thus falls under their Privacy Policy.
YouTube.com | Video | Effective July 1, 2021 | Privacy Policy | https://policies.google.com/privacy?hl=en | The general Google Privacy Policy is the policy linked on YouTube’s Terms of Service.
Zappos.com | Retail | Last updated: January 15, 2020 | Privacy Notice | https://www.zappos.com/c/privacy-policy |
Zoom.com | Video conferencing | Last updated: June 4, 2021 | Zoom Privacy Statement | https://explore.zoom.us/docs/en-us/privacy.html |

Note: Last Accessed 9/12/21.
