Verbatim

My thoughts, written down.

Archive for August, 2012

Harvard Career & Academic Resource Center (CARC) Awesomeness

Posted by signal on 29th August 2012

Today I registered for many of the CARC workshops for Fall 2012. Some of the best ones are offered only on campus; however, quite a few good ones are offered online, and I was able to register for these:

Listen Up: Three Ways to Make Your Audience Pay Attention (webinar)
Gaining Grammar Confidence (webinar)
How to use Positive Psychology to Improve Performance and Wellbeing (teleconference)
Resume and Cover Letter 101 (webinar)
Perfectionism (webinar)
Lifting the Curtain on Powerful Persuasive Speaking (webinar)

One that looked particularly interesting had already filled up, so I was not able to get into it:

Networking: Making Your Connections Count (webinar)

 

Posted in Uncategorized | No Comments »

Ready to try my first Coursera course: Statistics One

Posted by signal on 26th August 2012

In preparation for taking STAT E-50 in Spring 2013, I am taking a class from Coursera: Statistics One. I have heard many good things about this class, and about using Coursera for general Data Science training. They have classes in Machine Learning, Data Analysis, Neural Networks and much more. I picked up the text, Statistics, 4th Edition (by Freedman, Pisani, & Purves; Norton Publishing), cheap online. The international version, which I bought, is supposed to be the same as the US version, just in a different color.

My wife will also be working through her undergraduate Statistics class at Champlain College, and so we will be doing statistics together.

One of the main draws of the Coursera class is that it uses R. I have been trying to get my R skills in shape, and doing some practical statistics will be great. The class is taught by Professor Andrew Conway of Princeton University. It will be an interesting experience to take a class “unofficially”, with much less stress, before I take the real class for credit.

Posted in Uncategorized | No Comments »

Using Coda2 – My experiences during S-75

Posted by signal on 7th August 2012

Shortly after registering for S-75 Building Dynamic Websites, I began searching for the tools I would use to build my projects. I specifically looked at tools that run on Mac OS X. At the very least, I was looking for an IDE. Some of the programs I looked at were:

Versions

Cornerstone2

Coda2

After reviewing all of these programs, I decided that Coda2 was what I wanted. It can edit files remotely, which mattered because I knew I would be storing most of my files on the CS-50 appliance (virtual machine) used for the class, but I still wanted rich editing tools. Coda2 also supports SFTP/FTP, CSS, PHP, Version Control (Git/SVN) and more.

There is a forum for discussing Coda2; you can find it here. Coda2 is definitely not without bugs. I experienced a lot of sluggish behavior, and at times it became unresponsive and I had to force quit and restart. I never lost any data.

My biggest disappointment had to do with the code validation and error checking. If you are developing monolithic files, where everything you are trying to do is in one file, I am sure it works well. However, when developing dynamic web sites, it's very typical to have one file output your header (with your document type declaration, etc.) and then have many files that are included together to create your overall page. Coda2 doesn’t like this. If it sees HTML in a file but no header, for example, it freaks out. It’s not smart enough to look at all the files in the project, start with index.html, and assemble them logically. Hopefully they fix this. In the meantime I was basically on my own when it came to validation and error correction: I manually copied my rendered code from “View Source” in my browser and uploaded it to the W3C's Validation Service.
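To illustrate why checking each include file in isolation produces false alarms, here is a minimal sketch in Python (not the PHP I used for the class). The fragment names, the `assemble` and `check` helpers, and the sample markup are all made up for illustration, and the check only balances open and close tags; it is nowhere near a real validator like the W3C service.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are ignored when balancing.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def check(html):
    """Return a list of tag-balance problems found in the given markup."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.errors + [f"unclosed <{tag}>" for tag in checker.stack]


def assemble(fragments, order):
    """Join the fragments in include order, the way the server would emit them."""
    return "".join(fragments[name] for name in order)


# Hypothetical include files: none of these names come from my project.
fragments = {
    "header": "<!DOCTYPE html><html><head><title>Demo</title></head><body>",
    "content": "<h1>Stations</h1><p>Pick a route.</p>",
    "footer": "</body></html>",
}

print(check(fragments["header"]))  # the header alone looks broken
print(check(assemble(fragments, ["header", "content", "footer"])))  # the whole page is fine
```

The header on its own reports unclosed tags, while the assembled page reports nothing, which is roughly the behavior I wished the editor had.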

Things I liked about Coda2:

  • Syntax highlighting
  • File navigation
  • Powerful Editor
  • Good page preview ability

I should mention that I did not use the version control built into Coda2. That is no reflection on the feature itself. Because the code was actually being stored on the CS-50 appliance, it made more sense for me to use the git built into the appliance.

I will say that an IDE is definitely not necessary for a class like S-75, although I did find value in using one. If you are already comfortable with something like TextWrangler or vi, then that may work just as well.

 

Posted in Uncategorized | 3 Comments »

My Data Science Roadmap

Posted by signal on 3rd August 2012

I have set a goal to learn Data Analytics and began this journey a while back. One way I am learning Data Science is through EMC’s Data Science Training. They succinctly outline the skills I am looking to master for building a practical foundation in analytics:

Problem | Category of Techniques | Methods to Learn
Group items by similarity; find structure and commonalities in the data | Clustering | K-means clustering
Discover relationships between actions or items | Association Rules | Apriori
Discover relationships between the outcome and input variables | Regression | Linear Regression, Logistic Regression
Assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
Find the structure in a temporal process; forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
Analyze text data | Text Analysis | Regular Expressions, Document representation (Bag of Words), TF-IDF

 

In addition to the above, I plan to build foundational knowledge in Mathematics, Computer Science, Machine Learning, Artificial Intelligence, Predictive Analytics and Life Science. Some of this will come from my degree program at Harvard; however, the program I am in, Information Technology, offers only some courses that are useful in Data Science. Other knowledge will come from additional courses I will take outside of my degree program, books, and possibly even the pursuit of another graduate degree specific to Data Analytics.

A few degree programs that look very attractive are below. The prerequisites are what prevent me from pursuing one of these programs at this time. I have a significant amount of work to do to build up my Mathematics and Life Sciences foundations before I could be admitted. My background is in technology and computer science, which is very useful to Data Science, but only one part of a much larger domain of knowledge.

Master of Science in Bioinformatics – Johns Hopkins University

Master of Science in Analytics – North Carolina State University 

Master of Science in Analytics – Northwestern University 

Master of Science in Predictive Analytics – Northwestern University

Mining Massive Data Sets Graduate Certificate – Stanford University

MSc Machine Learning – University of London

Master of Science in Data Mining – Central Connecticut State University

Master of Science Biomedical Informatics

It would likely be three years or more before I would be able to pursue one of the programs above. In the meantime I plan to build up my knowledge in the various domains.

College Courses I will take outside of Harvard (all of the below have co-requisite labs as well):

Biology I
Biology II
Chemistry I
Chemistry II
Organic Chemistry I
Organic Chemistry II

Courses I am taking or have taken at Harvard that will help in Data Science:

Introduction to Statistics
Java for Distributed Computing
Oracle Database Administration
Visualization
Computing Foundations for Computational Science

Books I will be working through:

R

Data Mining with R: Learning with Case Studies (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series)
The R Book
Data Mashups in R
R in a Nutshell: A Desktop Quick Reference
R Cookbook (O’Reilly Cookbooks)
Getting Started with RStudio
Parallel R

Statistics

Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)
All of Statistics: A Concise Course in Statistical Inference (Springer Texts in Statistics)
Think Stats
Statistics in a Nutshell: A Desktop Quick Reference (In a Nutshell (O’Reilly))
Statistics Hacks: Tips & Tools for Measuring the World and Beating the Odds

Linear Algebra

Introduction to Linear Algebra, Fourth Edition

Machine Learning

Machine Learning in Action
Machine Learning for Hackers

Data Mining

Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites
21 Recipes for Mining Twitter
Big Data Glossary
Data Analysis with Open Source Tools

Visualization

Designing Data Visualizations
Now You See It: Simple Visualization Techniques for Quantitative Analysis
Beautiful Visualization: Looking at Data through the Eyes of Experts (Theory in Practice)
Visualize This: The FlowingData Guide to Design, Visualization, and Statistics

Hadoop

Hadoop: The Definitive Guide
HBase: The Definitive Guide
Programming Pig
Cassandra: The Definitive Guide

There is much I have left out, I am sure, so if anyone has any good books to recommend, please do. I have found the Quora forums to be particularly helpful for networking with others about Data Science.

 

 

Posted in Uncategorized | 2 Comments »

My final project for S-75 = Ajax, PHP, JavaScript, XML, CSS, MySQL, HTML, Google and BART

Posted by signal on 2nd August 2012

I am finally done with my Summer School class S-75 Building Dynamic Websites, taught by Professor David Malan. It was a very intense course. I liked how fast things moved and how we were challenged every single day. I already had a background in some of the technologies (PHP, HTML, MySQL) but had no real working knowledge of much of the rest, such as XPath, XML, CSS, Ajax and JavaScript.

My final project was a mashup of the BART (Bay Area Rapid Transit) API and the Google Maps v3 API. It was written in PHP, used MySQL as a datastore cache, and pulled realtime information from BART using Ajax. The program was written using the Model/View/Controller (MVC) pattern and was even version controlled via git, using bitbucket.org as a repository host.
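For anyone curious what "a datastore cache" means in practice, here is a rough sketch of the idea in Python, not my actual PHP code. It uses an in-memory SQLite table as a stand-in for MySQL, and the `CachedFetcher` class, the 30-second lifetime, the fake fetch function and the station key are all assumptions made up for illustration; nothing here calls the real BART API.

```python
import sqlite3
import time


class CachedFetcher:
    """Serves a stored response while it is fresh, otherwise fetches and stores a new one."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.time):
        self.fetch = fetch          # callable that performs the slow upstream request
        self.ttl = ttl_seconds      # how long a stored response counts as fresh (assumed value)
        self.clock = clock          # injectable so the example can control time
        self.db = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
        self.db.execute(
            "CREATE TABLE cache ("
            "key TEXT PRIMARY KEY, body TEXT NOT NULL, fetched_at REAL NOT NULL)"
        )

    def get(self, key):
        now = self.clock()
        row = self.db.execute(
            "SELECT body, fetched_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None and now - row[1] < self.ttl:
            return row[0]  # cache hit: no upstream request needed
        body = self.fetch(key)  # cache miss or stale entry: go upstream
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, body, fetched_at) VALUES (?, ?, ?)",
            (key, body, now),
        )
        self.db.commit()
        return body


# Hypothetical upstream call that just records how often it is invoked.
calls = []

def fake_fetch(station):
    calls.append(station)
    return f"<departures station='{station}'/>"

now = [1000.0]  # fake clock, advanced by hand below
cache = CachedFetcher(fake_fetch, ttl_seconds=30.0, clock=lambda: now[0])

cache.get("station-a")  # miss: fetched upstream and stored
cache.get("station-a")  # hit: served from the table
now[0] += 31            # let the entry go stale
cache.get("station-a")  # stale: fetched upstream again
print(len(calls))       # 2 upstream fetches for 3 requests
```

The point of the design is simply that repeated clicks on the same station within a short window do not each trigger a new request to the transit API.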

I pretty much gave my soul to this class for the 7 weeks. I was even coding on my vacation :). However, I feel it was a good investment and I would do it again. If you ever have the opportunity to take any course from Professor Malan I highly recommend it. I am considering taking CS-50 possibly in Spring 2013 to fill in some of the gaps of my computer science background.

Below is a screenshot of my final project, where the Pittsburg/Bay Point – SFIA/Millbrae route has been selected. You can see stations plotted along a path that was drawn in the actual route color used by BART.

The user can click on a station, and the page pulls realtime data from BART. Since it uses Ajax, there is no additional page load, so the whole process is a very seamless user experience. It was one of those projects where, when it starts off, you wonder how you will make it happen in such a short period of time, but then you amaze yourself by pulling it off. Now that I know my way around the Google Maps API and have a reasonably good foundation in XPath/XML, I am on the hunt for other sites with GIS data, where I can build a mashup of something that does not yet exist, something with Big Data.

Below is another screenshot of what it looks like when the user clicks on a station and receives realtime data, this time on the Fremont – Daly City route (green).

Posted in Uncategorized | 3 Comments »