A note on data

I’ve come into possession of some UC election voting statistics, which I’ll be posting here over the next couple of weeks (mainly because Cliff and a couple of other people thought it sounded interesting).

First, however, I need to list some caveats. The statistics and charts I put up are derived from backups of the UC’s election software database. After some finessing, I was able to piece together the various bits of data to recreate anonymous voters. Each individual is identified by a unique encrypted string and has a known dorm or house affiliation. Additionally, each individual is attached to his or her complete voting history, including candidate rankings.

Despite my best efforts, I cannot decrypt the unique identifier, and I have no way of identifying the real people who correspond to the individuals in the database. Furthermore, the data is not flawless. Voting histories are (nearly) perfect, and are (nearly) perfectly attached to correct individuals. Due to the design of the voting system, however, individuals’ dorm or house affiliations are overwritten each time an individual votes. This poses two problems: we cannot determine how individual dorms vote in any given election, and (more importantly) we cannot perfectly determine an individual’s class year.

We’ll get into this more as I share the data, but two recent successful campaigns (John Haddock in 2005 and Ryan Petersen in 2006) credited their victories, in part, to successful freshman outreach. In order to evaluate these claims, we need to be able to separate freshman voters from upperclassman voters. We have database backups from 2007 and 2008 that retain the original dorm and house affiliations, so we can perfectly reconstruct this distinction for last year’s presidential election. For years prior to 2007 we have to do it algorithmically: if an individual voted in a freshman election in a given year, voted in house elections in three consecutive years, or voted in any election in four consecutive years, we can guess the individual’s freshman year with some reliability.
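For the curious, here is a rough Python sketch of those three rules. It is not the actual code I ran against the database. The vote layout (a list of `(year, kind)` pairs), the kind labels `"freshman"`, `"house"` and `"general"`, and the function names are all made up for illustration. It also assumes that years are academic years labeled by the fall they start in, and that a three-year run of house elections covers sophomore through senior year, which puts freshman year one year before the run begins.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record layout: (academic year, election kind).
# The kind labels "freshman", "house" and "general" are illustrative only.
Vote = Tuple[int, str]


def first_run_start(years: Iterable[int], length: int) -> Optional[int]:
    """Return the first year of the earliest run of `length` consecutive years."""
    present = set(years)
    for year in sorted(present):
        if all(year + offset in present for offset in range(length)):
            return year
    return None


def guess_freshman_year(history: Iterable[Vote]) -> Optional[int]:
    """Guess a voter's freshman year, or return None if no rule applies."""
    history = list(history)

    # Rule 1: voted in a freshman election, so that year is freshman year.
    freshman_years = [year for year, kind in history if kind == "freshman"]
    if freshman_years:
        return min(freshman_years)

    # Rule 2: house elections in three consecutive years.
    # Assumption: those are sophomore, junior and senior year.
    start = first_run_start((year for year, kind in history if kind == "house"), 3)
    if start is not None:
        return start - 1

    # Rule 3: any election in four consecutive years.
    # Assumption: the run spans all four undergraduate years.
    start = first_run_start((year for year, _ in history), 4)
    if start is not None:
        return start

    # None of the rules fired: this voter stays unclassified.
    return None


history = [(2004, "house"), (2005, "house"), (2006, "house")]
print(guess_freshman_year(history))
```

The rules are tried from most to least direct evidence, and a voter who trips none of them simply stays unclassified rather than being forced into a class year.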

After running through a series of these algorithms, we are left with a group of about 7-13% of the presidential electorate whom we are unable to classify. Additionally, since the electronic voting system wasn’t used before 2002, our algorithms break down for lack of data as we get closer to that singularity. 2004 is the earliest year for which we can classify individuals by freshman status with any reliability, and even then 13% of the voters remain unclassified. Needless to say, this creates a built-in margin of error for any of our breakdowns across the freshman divide, and I will give the min, the max and my guesstimate of the likely range whenever I use this data.

As an aside, the algorithms are much better at detecting freshman voters than upperclassmen because so many of the freshmen vote in the fall general elections (usually 55-65% turnout), so it is likely that more of these unclassified voters are upperclassmen than freshmen. Taking this into account, my guesstimate range assumes that no fewer than 20% and no more than 66% of the unclassified voters are freshmen. At some point it may be possible to get a rough idea of the percentage of each class that votes in any election, as well as the percentage of voters who did not vote freshman year, and that might allow us to refine this estimate. Until then, I’ll just be flying by the seat of my pants.
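To make the min, max and guesstimate range concrete, here is a small Python sketch of the arithmetic. The function name and the vote counts in the example are invented for illustration and are not real election data. I am also assuming that the min treats every unclassified voter as an upperclassman and the max treats every one as a freshman, with the 20% and 66% bounds giving the likely range in between.

```python
def freshman_share_range(freshmen, upperclassmen, unclassified,
                         low_frac=0.20, high_frac=0.66):
    """Bound the freshman share of an electorate that has unclassified voters.

    low_frac and high_frac are the assumed smallest and largest fractions of
    the unclassified voters who are really freshmen (20% and 66% in the post).
    """
    total = freshmen + upperclassmen + unclassified
    if total <= 0:
        raise ValueError("need at least one voter")

    def share(frac_of_unclassified):
        # Freshman share if this fraction of the unclassified are freshmen.
        return (freshmen + frac_of_unclassified * unclassified) / total

    return {
        "min": share(0.0),           # every unclassified voter is an upperclassman
        "likely_low": share(low_frac),
        "likely_high": share(high_frac),
        "max": share(1.0),           # every unclassified voter is a freshman
    }


# Made-up counts, purely to show the shape of the result.
for name, value in freshman_share_range(400, 500, 100).items():
    print(f"{name}: {value:.1%}")
```

The gap between the min and the max is exactly the unclassified share of the electorate, so the 7-13% of voters we cannot classify translates directly into a 7 to 13 point spread on any freshman figure.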

I’m going to try to get some information on freshman turnout up tonight, and we’ll take it from there. Questions and comments welcome.