Recently I have gone to a number of conventions like Strata NYC and Predictive Analytics World NYC. I heard the same call over and over. There is a shortage of Data Scientists! It is going to get worse! We need another 190,000 Data Scientists just to fill the need! For those of you who do not know what a Data Scientist is, Mike Driscoll describes it on Quora as a blend of Red-Bull-fueled hacking and espresso-inspired statistics. Awesome!
I started to wonder where this number came from, and how it was developed. Why? Well, I am a Data Scientist of sorts, and I am not confident there is a real shortage of people who do this work or who can do this work. It also sets off my alarm bells when I see different people give presentations that all cite the same numbers. The chance of so many people arriving at exactly the same numbers independently is about as likely as five people in the US dying from drinking tap water (the same chance as winning Powerball). In 2006 I estimated the number of R users on a napkin at a Subway, and that estimate was reused by countless people over the next couple of years. Thank god others have taken a more detailed look at that issue since, and people now use their numbers.
Turns out the 190,000 number comes from the McKinsey Global Institute, which projects that shortfall by 2018. When I found that out, I really began to question the number, which had already been misquoted in most of the presentations I had seen. Some presentations had even presented the 190,000-person shortfall as a current condition rather than a projection for 2018. The term Data Scientist was first coined by Jeff Hammerbacher at Facebook in 2007. I am leery of a projection seven years out for a position that was not even named until four years ago. It reminds me of Morris's paper predicting season batting averages for MLB batters from their first 40 at bats. Not a very useful training set.
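To put a rough number on why 40 at bats is such a thin training set, here is a napkin-style sketch in Python. It is not taken from Morris's paper or from McKinsey. It assumes a hypothetical hitter whose true average is .300 and treats each at bat as an independent coin flip, which is a simplification. The function name and the .300 figure are my own illustrative choices.

```python
import math

def batting_average_interval(true_avg, at_bats, z=1.96):
    """Normal-approximation range for an observed batting average.

    true_avg is an assumed (hypothetical) true hitting probability;
    each at bat is treated as an independent Bernoulli trial.
    """
    # Standard error of a sample proportion: sqrt(p * (1 - p) / n)
    se = math.sqrt(true_avg * (1 - true_avg) / at_bats)
    return true_avg - z * se, true_avg + z * se, se

# Hypothetical .300 hitter observed over 40 at bats
low, high, se = batting_average_interval(0.300, 40)
print(f"standard error: {se:.3f}")
print(f"95% range: {low:.3f} to {high:.3f}")
```

Under those assumptions the observed average after 40 at bats could plausibly land anywhere from about .158 to .442, which covers everyone from a benchwarmer to a legend. A projection built on a similarly short history deserves the same skepticism.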
While I was writing this I was sent a post from Andrew Gelman's blog. I am a firm believer that no statistics blog post is complete without an Andrew Gelman quote or post, so here it is: "The #1 way to lie with statistics is... to just lie." Do not read anything into the coincidence of the quote with this post, but the timing is surprising. Besides, it is a good warning to us all to let the data speak for itself, and not to try to support our own opinions through the use of statistics or the lack thereof.
Now to the McKinsey Report. If you are dying to read all 156 pages of the report, here is the link: McKinsey Big Data Report. You will need the Red Bulls and espressos that Mike Driscoll mentioned earlier! I will save you the time. McKinsey talks about how they came to that number on page 134 in the appendix. I see a lot of problems. First, there is no data or sample data, and there is no description of the predictive model used. Without the means to attempt to validate it, I have to question whether the conclusion is valid. Even in their brief description of what they did to come up with these numbers, I already see problems. McKinsey says their raw data is based on SOC code numbers from 2008. That is one year after the term Data Scientist was coined, and what is required to be one has changed quite a bit since then. A static description of a moving target may be highly inaccurate. Second, they list the SOC codes they used to determine their population. I see a number of SOC codes that Data Scientists come from that are missing from the start. The most glaring one is physicist. Some of the best Data Scientists in the field are physicists, and there are a lot of them in the field.
Looks like we need to get a Data Scientist to look at how many Data Scientists we are going to need in the future.