The Analysis of Data Project

The Analysis of Data Project provides educational material in the area of data analysis.

  • The project covers the disciplines relevant to data analysis, including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to illustrate the material. Full code is available, so readers can reproduce the experiments and try variations of the code.
  • The presentation is mathematically rigorous and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website http://theanalysisofdata.com. Hard copies are available at affordable prices.

Please email the author with typos, comments, and suggestions for improvements.

About the Author
Guy Lebanon is a senior manager at Amazon, where he leads the Machine Learning Science Group. Prior to that, he was a tenured professor at the Georgia Institute of Technology and a scientist at Google and Yahoo. His main research areas are machine learning and data science. Guy received his PhD in 2005 from Carnegie Mellon University and his BA and MS degrees from the Technion - Israel Institute of Technology. Dr. Lebanon has authored over 60 refereed publications. He is an action editor of the Journal of Machine Learning Research, was the program chair of the 2012 ACM CIKM Conference, and will be the conference co-chair of AI & Statistics (AISTATS 2015). He received the NSF CAREER Award, the ICML best paper runner-up award, and the Yahoo Faculty Research and Engagement Award, and is a Siebel Scholar.

Volume 1: Probability

Introduction to multivariate probability theory, including random vectors, random processes, Markov chains, limit theorems, and related mathematics such as set theory, metric spaces, differentiation, integration, and measure theory.

Print: amazon e-store
HTML: viewer 1, viewer 2
PDF: chapter 1, chapter A

Table of Contents

Volume 2: Computing

Overview of essential computing for data analysis, including operating systems, C++ and R programming, data structures, databases, parallel computing, and big data.

Print:
HTML: viewer 1, viewer 2
PDF: chapter 4, chapter 5

Expected publication date: August 2013.

 

Volume 3: Statistics and Machine Learning

Introduction to statistics and machine learning, including M-estimators, hypothesis tests, regression, clustering, classification, regularization, and non-parametric methods. The text will cover theory, methodology, and case studies.