Today, Google announced two new services that are sure to be loved by data geeks. First is BigQuery, which lets you analyze “Terabytes of data, trillions of records.” This is great for people with large datasets. I wonder if a program like R (my favorite statistical analysis package) can read it? If so, would R just pull down the data as it would from any other database? That would most likely result in a data.frame far too large for a standard computer to handle. Maybe R can be run in a way that it queries the BigQuery service and leaves the data there. Maybe the processing can even be done on Google’s end, allowing for much faster computation. This is something I’ve been dreaming of for a while now.
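As a rough sketch of what that “leave the data there” workflow could look like, assuming an R client for BigQuery that speaks DBI (the bigrquery package plays this role); the project, dataset, and table names below are placeholders, and the point is that the heavy aggregation happens on Google’s side while only a small summary comes back:

```r
# A minimal sketch, assuming the bigrquery package (a DBI backend for
# BigQuery). "my-project", "my_dataset", and "my_table" are hypothetical.
library(DBI)
library(bigrquery)

con <- dbConnect(
  bigquery(),
  project = "my-project",   # hypothetical Google Cloud project
  dataset = "my_dataset",   # hypothetical dataset
  billing = "my-project"
)

# The GROUP BY runs on Google's end; only the small summary table comes
# back as a data.frame, so nothing near terabyte scale hits local memory.
result <- dbGetQuery(con, "
  SELECT category, COUNT(*) AS n, AVG(value) AS mean_value
  FROM my_table
  GROUP BY category
")

dbDisconnect(con)
```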
The other new service is Google Prediction, which just may make my aforementioned dream of running R on Google unnecessary. It essentially lets you use Google’s supervised learning algorithms to make predictions. Right now it can only make discrete predictions (such as categorization), which I suppose means they are using either multinomial logistic regression or some sort of CART analysis. Eventually they plan to support continuous outcomes.
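To make that speculation concrete, here is what the first guess, multinomial logistic regression, looks like in R using nnet::multinom on the built-in iris data. This illustrates the kind of discrete prediction the service offers, not Google’s actual implementation, which they have not disclosed:

```r
# Illustration only: multinomial logistic regression, one of the guesses
# above, fit with nnet::multinom on R's built-in iris data.
library(nnet)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = iris, trace = FALSE)

# Discrete predictions: each new observation gets a category label
predict(fit, newdata = head(iris), type = "class")

# The class probabilities behind those labels
predict(fit, newdata = head(iris), type = "probs")
```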
Both of these services require Google Storage, which lets you store data in Google’s cloud for $0.17 per gigabyte per month, plus bandwidth fees. A lot of people are saying this is meant to compete with Amazon’s EC2, which I suppose bodes well for running R on data stored there. In terms of sheer storage, this doesn’t seem like an economical alternative to buying a terabyte hard drive. But I’m guessing the advantage will be in using Google’s horsepower to run algorithms (of your own design) for your analysis.
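The back-of-envelope math behind that comparison, in R (the drive price is my own rough assumption, not from Google’s announcement):

```r
# Back-of-envelope: Google Storage at $0.17/GB/month versus a one-time
# terabyte drive. The ~$100 drive price is an assumption on my part.
gb_per_tb   <- 1024
monthly_gcs <- gb_per_tb * 0.17   # ~ $174/month, before bandwidth fees
drive_price <- 100                # hypothetical one-time cost

monthly_gcs               # 174.08
drive_price / monthly_gcs # the drive pays for itself in under a month
```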
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; organizer of the New York Open Statistical Programming Meetup and the New York and Washington, DC R Conferences; and author of R for Everyone.