July 21, 2016

Variations of Data Scientists

I spent some time last month at a SIAM conference. I hadn’t been to a conference since I graduated and joined industry two years ago. The attendees were mostly academics, but there were a few data scientists from industry as well. Why so few? Why don’t we get to go to conferences? There are lots of questions about what data science is and what it isn’t. This made me wonder about the current state of data scientists, because data scientists are a weird breed. How many are there? How many should a company have?

We know a lot about how many people get PhDs; fewer than 2% of Americans have them. How many of those end up being data scientists? There are about 52k new PhDs every year [1]. But how many are prepared to become data scientists? That stat isn’t easy to find. And since the data isn’t super rigorous, I’m going to do some estimations, something I’m much more comfortable with now that I’ve spent a few years giving “directional” analysis summaries. Suffice it to say, back-of-the-envelope computations will probably be good enough for this article. Let’s try it.

There were 1,900 new math PhDs in the United States in 2014 [2]. But people who aren’t mathematicians can be data scientists, so let’s expand to include everyone in STEM. This may be an overstatement, but let’s soldier on anyway. Of those 52k new PhDs, 40,588 were in science and engineering fields in 2016 [3]. Of those, about 60% have non-academic jobs [4]. So there are roughly 24.5k new PhDs looking for industrial work each year, potentially as data scientists. Let’s assume these people all have at least 25 good working years. Then there are about 612k PhDs who are currently working and, theoretically, capable of being data scientists. There are about 308M people in the U.S., so the (potential) data scientists make up about 0.2% of the total population… plus or minus a few thousand who leave or enter the country over time.
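For anyone who wants to check the envelope math, here is a minimal Python sketch using only the figures quoted above. The variable names are mine, and it assumes the non-academic share is exactly 60%, which is why it lands slightly below the rounded 24.5k and 612k (about 24.4k per year and about 609k in total) while still rounding to 0.2%.

```python
# Back-of-the-envelope estimate using the figures quoted in the text.
# Assumption: "about 60%" is treated as exactly 0.60.
new_stem_phds_per_year = 40_588   # science and engineering PhDs [3]
non_academic_share = 0.60         # share with non-academic jobs [4]
working_years = 25                # assumed length of a working career
us_population = 308_000_000       # population figure used in the text

# New PhDs heading to industry each year
industry_phds_per_year = new_stem_phds_per_year * non_academic_share

# Steady-state pool: one cohort per year over a full career
working_pool = industry_phds_per_year * working_years

# Potential data scientists as a share of the whole population
share_of_population = working_pool / us_population

print(f"Industry-bound PhDs per year: {industry_phds_per_year:,.0f}")
print(f"Working pool: {working_pool:,.0f}")
print(f"Share of population: {share_of_population:.2%}")
```

Changing `working_years` or `non_academic_share` is a quick way to see how sensitive the 0.2% figure is to each assumption.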

At this point in our analysis, we have the following starting point: 0.2% of all Americans are doing work that looks similar to that of a data scientist. If data scientists were equally distributed across all the companies in the U.S., I would expect to see data scientists making up a maximum of 0.2% of each company… maybe. Because, of course, there are hundreds of thousands of companies with no data scientists at all.

Now I want to know how many companies are actually employing data scientists. This brings us to another big source of uncertainty: there are lots of intrinsic biases and opportunities for error in researching employment numbers. Primarily, job title nomenclature is fairly arbitrary. There are a lot of data analysts and engineers out there who aren’t called “data scientists” but are doing data science work. Additionally, a company might start giving out data science titles because it’s the “hot” thing to do right now. However, I would argue that if a company is embracing the idea of data scientists and you want a job that has the specialization of a data scientist, then it’s worth knowing who is hiring a “data scientist.” Regardless, there is definitely error coming from how a company decides to title its employees.

The second bias could come from my research methods for determining how many data scientists a particular company has. I’ll go into my methods in the next paragraph. But at a high level, public companies are a little easier to get information on than private companies. And it’s difficult to find out how many data scientists a company has no matter their public/private status. Let’s look at my methodology next.

As a first step, I chose a few companies that are either big or popular right now. Then I used LinkedIn to get an estimate of how many people claimed to be data scientists at each company. Since LinkedIn is subject to self-selection bias and to people not updating their profiles, I think these numbers underrepresent reality. So, for a few companies, I looked for a more reliable source for the number of data scientists and compared it against the LinkedIn count. From these few companies, I can build a modeled view of how many data scientists there are. For example, Yahoo Labs says they have 200 employees in the lab, but LinkedIn says they only have 34 data scientists. Meanwhile, Google has 231 data scientists on LinkedIn while their website says they have 982 people in their research lab. From these, and a few others, I inferred an “effective” number of data scientists per company. With my newly created effective data scientist title, I’m trying to measure the number of actual data scientists + the employees who act like data scientists. Thus, the set of effective data scientists is greater than or equal to the set of titled data scientists. From here on out I’m mostly going to be talking about “effective” data scientists.
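The exact model behind the “effective” number isn’t spelled out here, so the following is only an illustrative sketch of one way such an inference could work. It assumes the lab headcount stands in for the effective count, and it averages the lab-to-LinkedIn ratio over just the two reference points quoted above. The function names and the simple-average choice are hypothetical, not the method actually used.

```python
# Reference points quoted in the text: (titled on LinkedIn, lab headcount).
# Assumption: lab headcount is a stand-in for "effective" data scientists.
reference_points = {
    "Yahoo Labs": (34, 200),
    "Google": (231, 982),
}


def effective_multiplier(points):
    """Average ratio of lab headcount to LinkedIn-titled count."""
    ratios = [lab / titled for titled, lab in points.values()]
    return sum(ratios) / len(ratios)


def estimate_effective(titled_count, multiplier):
    """Scale a LinkedIn count up to an estimated effective count.

    The effective set includes the titled set, so the estimate
    is never allowed to fall below the titled count.
    """
    return max(titled_count, round(titled_count * multiplier))


multiplier = effective_multiplier(reference_points)
print(f"Effective-to-titled multiplier: {multiplier:.2f}")
```

With only these two reference points the multiplier comes out at about 5, so a company with 100 titled data scientists on LinkedIn would be estimated at roughly 500 effective ones. A single fixed multiplier is exactly the kind of assumption that can break down for particular regions or industries.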

Numerical distribution of data scientists in the total number of employees at designated companies. Image credit: Samantha Schumacher.

Above are the results from my initial research. I focused on headquarters population only; satellite and store employees are not included in the company size. Note: some of these companies are quite small – I’m looking at you, Snapchat. Thus, it may be more useful to look at the percentage of effective data scientists at each company. Here are those results:

Estimated percentage of data scientists at company headquarters. Image credit: Samantha Schumacher.

As you can see, Uber and Snapchat have the highest percentage of data scientists of any of the companies I considered. Therefore, perhaps there is a startup bias to this… or, once we notice that the top four companies are all located in Silicon Valley, perhaps Silicon Valley is the reason for these results.

Map of the locations of companies that hire data scientists in the United States. Image credit: Samantha Schumacher.

So, location can play a large role in how many data scientists a particular company hires. However, these apparently high levels of data scientists could be due to a methodology problem. I assumed that the number of people who are titled “data scientists” is a fixed fraction of the number of people who do data science-type work. This may be a flaw in my analysis that affects Silicon Valley companies. If most of the effective data scientists are actually titled data scientists within Silicon Valley, that is, for Silicon Valley,

effective data scientists = titled data scientists,

then my inferred results may be overestimating the percentage of effective data scientists. But I’m also willing to believe that Silicon Valley companies are more focused on data-supported results. So these companies might, as a consequence or cause of their location, believe that more data scientists will result in higher earnings. Startups also contain a higher-than-average percentage of data scientists. But who knows if this is because they are startups, or because they are hip to the hotness of the data science title, or because they actually use that many data scientists.

Looking past the “Silicon Valley Effect,” the companies with higher percentages of data scientists are companies that are known for data science. Netflix is famous for its data science, and the company’s percentage data show that. Meanwhile, Walmart has a negative reputation for not being able to keep data scientists [5], and they don’t have as many. Maybe there is something to this relationship?

Lastly, I took an informal survey of a small collection of my friends with effective data science titles (n=15). They are going to help me make a totally subjective guess at what the relative reputation for good data science is at each of these companies. I took the mean of responses I received and plotted it against the percentage of effective data scientists.

Scatterplot of company reputation versus effective data scientists. Image credit: Samantha Schumacher.

With this fairly random-looking scatterplot, I have no great conclusions. Clearly the respectability of the data science department is not a function of its size for my data set. But beyond that, there isn’t much to say. I don’t have a recommendation about how many data scientists a company should have because the [limited] data I’ve collected does not yield any strong correlations. What do you think? How many data scientists/mathematicians are appropriate for a particular company to employ?
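One way to put a number on “random-looking” is a Pearson correlation coefficient between the two plotted quantities. The sketch below is a generic implementation, not the analysis behind the scatterplot. The function name and its two arguments (per-company percentages and mean reputation scores) are hypothetical, and no data from the survey is included.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products and squares of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / sqrt(var_x * var_y)
```

Values near 0 would match the “no strong correlation” reading, while values near 1 or -1 would suggest a linear relationship. With only a handful of companies, any value should be read loosely.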

This is something I’ll continue to investigate. I’m also planning to get some resources together for academics who want to transfer into the world of “data science.” So, perhaps in a few months or a few years, we’ll have a better answer on what it means for a company to have data scientists, and what kind of value those data scientists bring.

This post has been republished from the author's blog, SocialMathematics.net.

Samantha Schumacher is a Senior Supply Chain Analyst/Effective Data Scientist for Target Corporation. She holds a PhD in Applied Mathematics from the University of Minnesota. She is also the cartoonist and blogger behind SocialMathematics.net, where she enthusiastically investigates many of the ways math interacts with the modern world.