My thoughts, written down.

My Data Science Roadmap

Posted by signal on August 3rd, 2012

I have set a goal to learn Data Analytics and began this journey a while back.  One of the ways I am learning Data Science is through EMC’s Data Science Training.  They succinctly outline the skills I am looking to master to build a practical foundation in analytics:

Problem | Category of Techniques | Methods to Learn
Group items by similarity; find structure and commonalities in the data | Clustering | K-means clustering
Discover relationships between actions or items | Association Rules | Apriori
Discover relationships between the outcome and input variables | Regression | Linear Regression, Logistic Regression
Assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
Find the structure in a temporal process; forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
Analyze text data | Text Analysis | Regular Expressions, Document representation (Bag of Words), TF-IDF
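
To make the text-analysis row a little more concrete, here is a minimal sketch in Python of how its three methods fit together: a regular expression splits text into tokens, a bag of words counts them, and TF-IDF weights each count by how rare the token is across documents. The function names, the sample sentences, and the exact TF-IDF variant (term frequency times the natural log of total documents over documents containing the term) are my own assumptions for illustration, not something taken from the EMC course.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens with a regular expression."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Represent a document as a mapping from token to raw count."""
    return Counter(tokenize(text))


def tf_idf(documents):
    """Return one {token: weight} dict per document.

    Assumed variant (one of several in common use):
      tf  = count of the token in the document / total tokens in the document
      idf = ln(number of documents / number of documents containing the token)
    """
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        })
    return weights


# Hypothetical sample documents, purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
scores = tf_idf(docs)
```

With these sample sentences, a word that appears in only one document (such as "cat") ends up weighted higher than a word shared by several documents (such as "the"), which is the intuition behind TF-IDF.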


In addition to the above, I plan to build foundational knowledge in Mathematics, Computer Science, Machine Learning, Artificial Intelligence, Predictive Analytics and Life Science.  Some of this will come from my degree program at Harvard; however, the program I am in, Information Technology, offers only some courses that are useful in Data Science.  Other knowledge will come from additional courses I will take outside of my degree program, from books, and possibly even from the pursuit of another graduate degree specific to Data Analytics.

A few degree programs that look very attractive are listed below.  The prerequisites are what prevent me from pursuing one of these programs at this time.  I have a significant amount of work to do to build up my Mathematics and Life Sciences foundations before I could be admitted.  My background is in technology and computer science, which is very useful to Data Science but is only one part of a much larger domain of knowledge.

Master of Science in Bioinformatics – Johns Hopkins University

Master of Science in Analytics – North Carolina State University 

Master of Science in Analytics – Northwestern University 

Master of Science in Predictive Analytics – Northwestern University

Mining Massive Data Sets Graduate Certificate – Stanford University

MSc Machine Learning – University of London

Master of Science in Data Mining – Central Connecticut State University

Master of Science Biomedical Informatics

It would likely be three years or more before I would be able to pursue a program such as those above.  In the meantime I plan to build up my knowledge in the various domains.

College Courses I will take outside of Harvard (all of the below have co-requisite labs as well):

Biology I
Biology II
Chemistry I
Chemistry II
Organic Chemistry I
Organic Chemistry II

Courses I am taking or have taken at Harvard that will help in Data Science:

Introduction to Statistics
Java for Distributed Computing
Oracle Database Administration
Computing Foundations for Computational Science

Books I will be working through:


Data Mining with R: Learning with Case Studies (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series)
The R Book
Data Mashups in R
R in a Nutshell: A Desktop Quick Reference
R Cookbook (O’Reilly Cookbooks)
Getting Started with RStudio
Parallel R


Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)
All of Statistics: A Concise Course in Statistical Inference (Springer Texts in Statistics)
Think Stats
Statistics in a Nutshell: A Desktop Quick Reference (In a Nutshell (O’Reilly))
Statistics Hacks: Tips & Tools for Measuring the World and Beating the Odds

Linear Algebra

Introduction to Linear Algebra, Fourth Edition

Machine Learning

Machine Learning in Action
Machine Learning for Hackers

Data Mining

Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites
21 Recipes for Mining Twitter
Big Data Glossary
Data Analysis with Open Source Tools


Designing Data Visualizations
Now You See It: Simple Visualization Techniques for Quantitative Analysis
Beautiful Visualization: Looking at Data through the Eyes of Experts (Theory in Practice)
Visualize This: The FlowingData Guide to Design, Visualization, and Statistics


Hadoop: The Definitive Guide
HBase: The Definitive Guide
Programming Pig
Cassandra: The Definitive Guide

I am sure there is much I have left out, and if anyone has good books to recommend, please do.  I have found the Quora forums to be particularly helpful for networking with others about Data Science.



2 Responses to “My Data Science Roadmap”

  1. mark shearman Says:

    Hi Brian:

    Could you let me know the source of the ‘data brain’ graphic ?

    Thank you.

  2. signal Says:

    I am not sure. If you download the graphic and go to you can upload it and you will see it’s everywhere. Who to attribute it to, I don’t know.
