
Life Just Got Busy


Busy since DSA yesterday afternoon, that is. Once off the phone with Beau, I rushed home to get Kati ready for her flight to DC. Spent a much longer evening than usual working on the invitations—glue, ribbons, envelopes, calligraphy, etc.—and didn’t get to bed until well after midnight. This morning I thought I’d be able to finish the rest in a quick whirl before work, but it ended up swallowing most of the day, and it was 3:30 by the time I finally made it to work via the post office to drop off the 72 beauties in their handwritten, bright yellow envelopes.

In the midst of the mounting stress over the upcoming meeting with Mike Foote, and the PlanktonTech report I have to write, and the status meeting with Andy to prepare for, and the résumé that needs putting together, and the R interface to finish up, I got an email from Dave Lazarus asking for comments on a paper he and John Barron are planning to submit to Nature. On—wait for it!—Cenozoic diatom diversity. Yes, believe it or not, I just got scooped once again. On work that I haven’t done yet, so I guess it’s not technically being scooped if it’s something you were thinking about doing someday but hadn’t gotten around to. Anyway, this is another thing on my plate, and while I’m sure it’ll be fascinating to read, it’s also another blow to my confidence for not getting that project done quickly, not to mention several hours next week down the drain at a time when I really need them. But I also need Dave, so there’s no way I can blow him off—though I did email him back to ask how much feedback he wanted, and to say that I wouldn’t be able to provide much before late next week, because I’m busy. Humph.

Anyway. Now that I’ve vented some time-crunch stress, I’ll get to the task I started (but didn’t finish) yesterday—putting the keystone pieces into the Bridge of Rads database interface. Once they’re in place, we’ll see whether I can jump up and down on it without it crumbling into a sad pile of rubble, as the dust swirls around me.

Made some decent progress over the course of the afternoon, putting all the major pieces in place—but noticed that there were quite a lot of little pieces missing altogether. This is the downside of “sit down and code” rather than developing a very detailed plan of what the program’s going to look like beforehand… But whatever, it’s not an enormously complex piece of software, and I can figure it out as I go along.

Left work at 7:15pm, by which point I’d gotten much of the way through—although I have yet to do a first test run of the whole interface to see if it all works… perhaps tomorrow?!

Tying the Interface Together


Completely neglected to post yesterday—it was a slow, distracted day but I still got a few things done—mostly assembling bits and pieces of the R routine to read the ImageJ “pipe” file into R and create the SQL INSERT statements.
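
The routine is still in bits, but the rough shape I’m aiming for is something like the sketch below. The file name, the tab-delimited layout, and the column names are all placeholders at this point, since I haven’t pinned down the exact format of the ImageJ output yet:

# Read the ImageJ output (assumed here to be a tab-delimited text file with
# placeholder column names) and turn each row into an SQL INSERT statement.
raw <- read.table("imagej_output.txt", header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)

inserts <- sprintf(
  "INSERT INTO measurements (indiv_id, meas_type, meas_value) VALUES ('%s', '%s', %f);",
  raw$indiv_id, raw$meas_type, raw$meas_value
)

# Eyeball the generated statements before sending them off to the database.
cat(inserts, sep = "\n")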

In the course of thinking that through, made a minor design change in the database. Rather than storing the complete name of each measurement type in the relevant column of the “Measurements” table, e.g. “Length from top of shell to base of ante-/postcephalic chamber”, I thought it might be prudent to store just an integer ID related 1-to-1 to a measurement name. Wrote a quick look-up function to return the full name for each integer ID, so that the original design can be easily recovered with a function call. The rationale was that once I have finished collecting data and am at the data analysis stage, it might be very clumsy to formulate the mathematical models in terms of SQL queries containing those long labels: the volume of a sphere, for example, is more neatly written as “(π/6) × (1)^3” than as “(π/6) × ‘Width of outer medullary shell’^3”. This might be daft and overkill, but I had a horror vision of running into problems with that and decided to do it this way instead. And given that SQLite is so forgiving in terms of data types, I think it would be quite easy to revert to my original design and swap the measurement type IDs back out for the full strings.
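
The look-up function itself is trivial. Something along these lines, with only a couple of example entries shown (the ID numbering here is arbitrary; the real table will hold the full list of measurement names):

# Full measurement names, keyed by their integer IDs (numbering arbitrary here;
# only two example entries shown).
meas_names <- c(
  "1" = "Width of outer medullary shell",
  "2" = "Length from top of shell to base of ante-/postcephalic chamber"
)

# Return the full descriptive name for a given measurement-type ID.
meas_name <- function(id) {
  unname(meas_names[as.character(id)])
}

meas_name(1)   # "Width of outer medullary shell"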

Anyhow, that was yesterday, and today is today. For the morning I’d like to get the bits and pieces together and get the interface up and running.

The Interface Begins to Form


Worked long and hard today on the first steps of the interface to the RSQLite database. Got the structure and logic of the main menu working, and have a completely working first menu item—”New slide”. “New individual” is next, and more difficult. Spent quite a long time struggling with how to get raster images to plot in R, but eventually figured out a way, using the pixmap package (which needed to be installed). Not fast, not elegant, but it’ll do the job for my purposes.

library(pixmap)                 # provides read.pnm() for PNM raster images
data <- read.pnm("piccie.pnm")  # read the image file into a pixmap object
plot(data)                      # display it in the current graphics device

As ever with R—it’s three lines of code, and three hours to find it.

I also now need to confront the fact that I don’t really know the taxonomy of the lineages I’m setting out to study. So, in the process of preparing the “here’s what to measure” images for the interface, I’ll need not only to build the mathematical models I’ll use to calculate silica volume, but also to assemble the taxonomic literature I’ll have to be familiar with in order to identify the buggers.

A long list, and I’m not as far along as I’d hoped to be—but I’m making good progress, and enjoying myself. That said, I’ve spent a full 9-5 day (9-5:23, actually) staring at the computer screen today, and I’m pooped.

RadData: Hello, World


Struggled some more with setting up the tables. I had changed my schema last week, because I had discovered something of a redundancy in the Slides table: both the Slide ID and the Hole ID columns would contain the hole designation (for the reason I described earlier—the non-Hole-ID part of the Slide ID is not unique on its own). So I thought I’d implement this as a composite primary key, with the primary key for the Slides table consisting of a Slide ID column and a Hole ID column. That means both of those columns would also need to be in the child table, Individuals. As I was trying to implement it, I realized that a composite primary key would make for unwieldy joins and searches, all for the sake of a relatively small saving in storage (the redundant 4-character hole designation in the Slides table, which will have a few hundred entries, so on the order of a few thousand characters).
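
So, back to one plain-vanilla key per table: the full Slide ID (hole designation included) stays as the sole primary key of Slides, and Individuals just carries that one column as its link to the parent. In concrete terms it comes out roughly like the sketch below; column names and types are approximate rather than the final schema, and the database file name is just a stand-in.

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "raddata.sqlite")   # placeholder file name

# Slides keeps the full Slide ID (hole designation included) as its sole
# primary key; the separate hole_id column is redundant but cheap.
dbGetQuery(con, "CREATE TABLE slides (
    slide_id     TEXT PRIMARY KEY,
    hole_id      TEXT,
    age          REAL,
    preservation TEXT
)")

# Individuals then needs only the one slide_id column to link back to its
# parent slide, with no composite key to drag through every join.
dbGetQuery(con, "CREATE TABLE individuals (
    indiv_id TEXT PRIMARY KEY,
    slide_id TEXT,
    species  TEXT,
    comment  TEXT
)")

dbDisconnect(con)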

Once I had fixed the table setup to reflect this design change, it came time to return to testing the tables set up so far (everything except Measurements) to see whether they work. The kind of query I want to be able to make first: show me all the individuals from a certain hole. By my current understanding of SQL, the query should look something like this:

SELECT indiv_id, species, comment, preservation FROM holes h, slides s, individuals i WHERE h.hole_id=s.hole_id AND s.slide_id=i.slide_id AND h.hole_id='0699A'

This, in fact, works. And is a substantial cause for celebration! Pretty satisfying; I can hardly believe I’ve taught myself enough SQL to do this in a couple of short weeks. Very cool. Next tasks: add the measurements table, and address the foreign key constraints issue from last week.

Asked Andy whether it would be OK to move the microscope—he made the very good point that moving the microscope is a major undertaking since it will most likely cause the optics to get jiggled out of alignment, and would then require a technician to be called in to re-center everything and get it all working again. With that in mind, it might be easier to just run to the computer store and pick up an external hard disk, and then get a set-up going in the back room with my laptop.

The measurements table seems to be working now, too. And I’m able to, for example, retrieve measurements of a particular type from a particular species found in a specified hole:

SELECT h.hole_id, i.indiv_id, species, meas_type, meas_value FROM holes h, slides s, individuals i, measurements m WHERE h.hole_id=s.hole_id AND s.slide_id=i.slide_id AND i.indiv_id=m.indiv_id AND m.meas_type='Length' AND h.hole_id='0709C' AND i.species='Pterocorys oblongatus'

This is deeply awesome.

465 Days


Or, one year and one hundred days to go, exactly.

Before the fear strikes deep, let me throw myself back into the fray. I left off reading through Beau’s recommended SQL book, which has been terrific. I’m going to plow ahead with that until I get to a point where I can start implementing simple SQL commands through RSQLite. Hopefully before lunch!

Chapter 2 of Beau’s suggested text, “The Practical SQL Handbook”, recommends going through “normalization”, i.e. making sure the database design conforms to the four design standards known as normal forms.

  1. The first normal form requires each row-and-column intersection to contain one and only one atomic value. Since I have separated the measurements table out from the individuals table, I am OK here, I think. If I had kept measurements as a list in the individuals table, I would be in violation of this form. The other tables are all OK too.
  2. The second normal form applies only to tables where the primary key is a combination of two or more columns; I have no such tables, so I’m OK.
  3. The third normal form requires every non-key column in a table to be an attribute of the primary key (not an attribute of another column); as the book says, “Every non-key column must describe the key, the whole key, and nothing but the key”. I’m pretty sure this is true of all of my tables.
  4. The fourth normal form forbids independent one-to-many relationships between key and non-key columns. Since every non-key column in my tables is a single-valued attribute of the key column (e.g. every individual has only one species attribute; every measurement has only one measurement type, one measurement value, and one associated image file; every slide has only one age and one preservation attribute), I think I’m OK here too.

The book recommends setting up the tables in SQL and testing them with some dummy data before going ahead with full implementation. This seems like a sensible idea, so off we go with Chapter 3, which will hopefully teach me how to do that (probably with some help from the RSQLite documentation). Hit a bit of a bump here—still can’t find any decent documentation for RSQLite that covers the basics, like how to create a new database, send a query, etc. The official CRAN documentation is an obtuse mess of technical-speak.

After reading half the archives of the extremely unfriendly R-sig-DB mailing list/support forum, I discovered a hint that the documentation I should be looking for is not that of RSQLite, the SQLite implementation for R, but rather that of DBI, the R package that handles database interfacing generically (for SQLite as well as MySQL, PostgreSQL, etc.). Took a while to get to this point.

The DBI documentation finally gave me some of the basic information I’d been hunting for. Like the heavens opened up, and a beam of knowledge shone down upon me, the angelic chorus singing hallelujah!

SQL queries can be sent with either dbSendQuery or dbGetQuery. dbGetQuery sends the query and retrieves the results as a data frame. dbSendQuery sends the query and returns an object of a class inheriting from “DBIResult”, which can be used to fetch the results and is then passed to dbClearResult to release them.
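
To save future me another hunt through the documentation, the basic pattern seems to be something like this (the database file name and the query are just stand-ins, assuming a “holes” table already exists):

library(RSQLite)                                    # loads DBI as well
con <- dbConnect(SQLite(), dbname = "test.sqlite")  # placeholder file name

# dbGetQuery: send the query and get the whole result back as a data frame.
holes <- dbGetQuery(con, "SELECT * FROM holes")

# dbSendQuery: send the query, fetch the results (in chunks if need be),
# then clear the result set when done.
res <- dbSendQuery(con, "SELECT * FROM holes")
first_few <- fetch(res, n = 5)
dbClearResult(res)

dbDisconnect(con)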

Finally something I can use. Yay! Now that I finally know how to connect to SQLite, I can jump back to reading the book, which will tell me more about the actual SQL commands I’ll need to set up my first table.

Success! Flying high. Made an actual table, even if it’s just a toy. Reading about indexes, but unsure whether I need one (at least at this stage). Also unsure whether to store the images themselves within the database (as a BLOB data type, apparently), or to simply store the system path to each image file and keep the images in a folder. The latter might offer a bit more stability/security in case the database goes to shit for some reason, and presumably would help performance by keeping the .sqlite file smaller. It might—but this is a real uncertainty—make data entry a little harder. As I think about it, though, I imagine it’ll be about as difficult to read an image into a BLOB field as it would be to enter/copy-paste/read the image path or filename into a text field.
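
For what it’s worth, the toy table was nothing fancy; something along these lines (all names invented on the spot), with option two amounting to nothing more than an ordinary text column holding the path to the image:

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "toy.sqlite")   # throwaway database file

# A toy table, with the image kept on disk and only its path stored.
dbGetQuery(con, "CREATE TABLE toy (
    id         INTEGER,
    label      TEXT,
    image_path TEXT
)")

dbGetQuery(con, "INSERT INTO toy VALUES (1, 'first try', 'images/piccie.pnm')")
dbGetQuery(con, "SELECT * FROM toy")   # returns the single row as a data frame

dbDisconnect(con)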

Break for Lurgee, Now SQL


After the Engaged Encounter weekend, took three days to pay my dues to The Lurgee for the year. I’d held out for long enough, and I suppose it was inevitable that I would come down eventually.

Since I’ve been back up and running—and it’s really only been since yesterday—I’ve been sinking my teeth into learning SQL, having made the decision to get the radiolarian project rolling before I expend too much energy on the somewhat risky mathematical morphospace work. Beau was extremely helpful in providing lecture notes and recommended reading on SQL basics, which I have been devouring; they’ve shifted my state of mind on the matter from confusion and anxiety to a sense of having an overview, and to excitement about starting to tinker with SQL through R (using RSQLite).

One of the first things I did, in order to show Beau what I was thinking of doing, was to describe the database I wanted to construct using a diagram (something I’ve since seen described as a data model, and to which the various sources Beau provided ascribe much importance):

Initially I was unsure whether to include the measurements as attributes of the “Individuals” table, i.e. to have fields for length of first segment, length of second segment, width of first segment, width of second segment, and so on, rather than having a separate table in which each measurement is an entity, with measurement type (length of first segment, length of second segment, etc.) as an attribute. I went with the above layout on gut feeling, but that choice has since been vindicated by my reading of the lecture notes Beau supplied, which suggest that lists are not a good way to go and don’t conform to normal form. A relational set-up like this will be better: firstly, it prevents each row from having numerous empty columns (because different taxa are going to have different measurements associated with them); secondly, it lets me add new measurement types as I go along, in case I don’t fully anticipate the range of measurements right from the outset (or choose to expand the study to another taxon that requires a new measurement—say, fourth segment width—to be properly quantified).

What’s not shown in the model above is a key for the “Measurements” table, which will be a system-generated (if possible) Measurement ID field.
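
In SQLite that system-generated ID should come for free: a column declared INTEGER PRIMARY KEY gets filled in automatically whenever it is left out of an INSERT. The Measurements side of the model would then eventually look roughly like this (column names approximate, database file name and the dummy values in the INSERT just stand-ins to show the ID being assigned):

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "raddata.sqlite")   # placeholder file name

# meas_id is an INTEGER PRIMARY KEY, so SQLite assigns it automatically
# whenever it is omitted from an INSERT.
dbGetQuery(con, "CREATE TABLE measurements (
    meas_id    INTEGER PRIMARY KEY,
    indiv_id   TEXT,
    meas_type  TEXT,
    meas_value REAL,
    image_file TEXT
)")

dbGetQuery(con, "INSERT INTO measurements (indiv_id, meas_type, meas_value, image_file)
                 VALUES ('RAD-001', 'Length of first segment', 123.4, 'piccie.pnm')")

dbGetQuery(con, "SELECT * FROM measurements")   # meas_id comes back as 1

dbDisconnect(con)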