Wednesday, May 7, 2014

S

John Chambers * Bell Labs * 1976

R

Robert Gentleman & Ross Ihaka * University of Auckland, New Zealand * 1993

ETL

http://www.rcasts.com/2012/11/big-data-etl-and-big-data-analysis.html

Modeling

Linear Models

Poisson Regression

Time Series

Graphics

Machine Learning

  • Ridge and Lasso Regression
  • K-means Clustering
  • K-medoids Clustering
  • Hierarchical Clustering
  • Decision Trees
  • Random Forests
  • Splines
  • Generalized Additive Models

Penalized Regression

K-means Clustering

Plot of wine data scaled into two dimensions and color coded by results of K-means clustering
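A plot like the one described can be sketched as follows. This is a minimal illustration, not the code behind the slide: the wine data is not included here, so the built-in `iris` measurements stand in for it (an assumption), and classical multidimensional scaling via `cmdscale()` is assumed as the method for projecting into two dimensions.

```r
# Stand-in data (assumption): the slide uses a wine data set not shown here
x <- scale(iris[, 1:4])

# Fit K-means with three clusters; nstart tries several random starts
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)

# Project the observations into two dimensions with classical MDS
coords <- cmdscale(dist(x), k = 2)

# Color each point by its assigned cluster
plot(coords, col = km$cluster, pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2",
     main = "K-means clusters in two dimensions")
```

The clustering is done on the full scaled data; the two-dimensional projection is used only for display.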

K-means Clustering

Gap curves for wine data. The blue curve is the observed within-cluster dissimilarity, and the green curve is the expected within-cluster dissimilarity. The red curve represents the Gap statistic (expected minus observed), and the error bars are the standard deviation of the gap.
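The quantities in this caption can be computed with `clusGap()` from the `cluster` package. The sketch below is a hedged illustration rather than the slide's own code: `iris` stands in for the wine data (an assumption), and the values of `K.max` and `B` are arbitrary choices to keep it quick.

```r
library(cluster)  # recommended package shipped with R; provides clusGap()

# Stand-in data (assumption): the slide uses a wine data set not shown here
x <- scale(iris[, 1:4])

# Compute the Gap statistic for 1 to 6 clusters using 20 reference data sets
set.seed(42)
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 20,
               nstart = 10, verbose = FALSE)

# Columns: logW (observed), E.logW (expected), gap, SE.sim (error bars)
gap_tab <- as.data.frame(gap$Tab)
print(gap_tab)
```

Here `logW` corresponds to the observed curve, `E.logW` to the expected curve, and `gap` is their difference, with `SE.sim` supplying the error bars.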

Hierarchical Clustering

Hierarchical clustering of wine data

Hierarchical Clustering

Hierarchical clustering of wine data split into three groups (red) and 13 groups (blue)
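Splitting a dendrogram into a chosen number of groups can be sketched with base R's `hclust()`, `cutree()`, and `rect.hclust()`. As before, `iris` is a stand-in for the wine data (an assumption), and the default complete linkage is an arbitrary choice since the slide does not name one.

```r
# Stand-in data (assumption): the slide uses a wine data set not shown here
x <- scale(iris[, 1:4])

# Build the hierarchy from pairwise Euclidean distances (complete linkage)
hc <- hclust(dist(x))

# Cut the tree into 3 groups and into 13 groups
groups3  <- cutree(hc, k = 3)
groups13 <- cutree(hc, k = 13)

# Draw the dendrogram and outline both splits
plot(hc, labels = FALSE, main = "Hierarchical clustering")
rect.hclust(hc, k = 3,  border = "red")
rect.hclust(hc, k = 13, border = "blue")
```

`cutree()` returns one group label per observation, so the same tree can be cut at several levels without refitting.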

Decision Trees

Splines

Reporting and Presenting

This whole presentation

R code and all

R for Everyone

Based on. . .

What's Inside

Encouraging Girls in STEM

Fundamentals of R

  • Getting and installing R
  • The RStudio Environment
  • The basics of R
  • Basic Math
  • Advanced Data Structures
  • Reading Data into R
  • Matrix Calculations
  • Data Munging
  • Writing functions
  • Conditionals
  • Loops
  • String manipulation and regular expressions
  • ggplot2

Modeling and Analytics with R

  • Basic Statistics
    • Probability Distributions
    • Averages, standard deviations and correlations
    • t-test
  • Linear Models
    • Simple linear regression
    • Multiple Regression
  • Generalized Linear Models
    • Logistic Regression
    • Poisson Regression
    • Survival Analysis
  • Assessing Model Quality
  • Time Series

Machine Learning in R

  • Variable selection for high dimensional data with the Elastic Net as implemented by the glmnet package
  • Reduce uncertainty with weakly informative priors and Bayesian regression
  • K-Means clustering
  • Hierarchical clustering
  • Multidimensional scaling
  • Matrix Factorization for recommendations
  • Decision Trees for classification
  • Random Forests for ensembling decision trees
  • Bootstrap for measuring uncertainty
  • Cross validation for model assessment

High Performance Computing in R

  • Benchmarking and profiling code
  • Converting tricky loops into vectorized code
  • Using matrix algebra and cross product functions instead of standard loops and sums
  • Fast data manipulation with dplyr
  • Using alternative backend data storage such as PostgreSQL
  • The data.table package
  • The parallel, doParallel and foreach packages for parallel computations
  • Integrating C++ into R packages
  • Alternative matrix algebra libraries
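As a small illustration of two of the items above (vectorizing a loop, and using a cross product in place of a loop and sum), the following sketch computes the same inner product three ways. The data and variable names are made up for the example.

```r
# Made-up data for illustration
set.seed(1)
x <- rnorm(1e4)
y <- rnorm(1e4)

# 1. Explicit loop: accumulate the sum of element-wise products
loop_sum <- 0
for (i in seq_along(x)) {
  loop_sum <- loop_sum + x[i] * y[i]
}

# 2. Vectorized: multiply whole vectors, then sum
vec_sum <- sum(x * y)

# 3. Matrix algebra: crossprod(x, y) computes t(x) %*% y as a 1x1 matrix
cp_sum <- drop(crossprod(x, y))

# All three agree up to floating-point tolerance
all.equal(loop_sum, vec_sum)
all.equal(vec_sum, cp_sum)
```

The vectorized and cross-product forms push the iteration into compiled code, which is typically where the speedup over an R-level loop comes from.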

Data Presentation and Portability

  • Reproducible reports using knitr
  • Basic Introduction to LaTeX
  • Basic Introduction to Markdown
  • Using LaTeX and knitr to automatically generate reports with embedded analytics
  • Using Markdown and knitr to automatically generate websites with embedded analytics
  • Combine Markdown, knitr, reveal.js and pandoc to make HTML5 slideshows with embedded analytics
  • Advanced plotting
    • rCharts
    • ggvis
  • Building R Packages

The R Community

Europe Too

Asia Pacific

World's Biggest

3,824 Members!

Giving Back to the Community

Jared P. Lander

The Tools