Harvard Business Review calls data scientist the “Sexiest Job of the 21st Century”. So who is a data scientist, and what does he or she do? I will describe this in this article.
Data scientist is one of the fanciest job titles around, and most of us love fancy job titles. But what makes a data scientist? What does a data scientist do? Let’s cut through the clutter and understand it with the help of an example.
Who is a data scientist – a real-life use case?
I have two accounts on Facebook. One is strictly personal, while the other had to be created because I wanted a company page on Facebook. The two accounts have nothing in common except me; even the email ID and mobile number are different.
Last week, when I logged into the official account, I was surprised to see some suggestions in the People You May Know section.
The first and third ones were real surprises. How could Facebook figure this out? The first one is a college batchmate of mine from almost 30 years ago, and we had no contact after we graduated. It was indeed a pleasant surprise. This is not an isolated case; I have discovered many such friends through Facebook.
The Harvard Business Review article titled “Data Scientist: The Sexiest Job of the 21st Century” says the following about LinkedIn in 2006, when Jonathan Goldman joined the company:
“The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience.”
That’s when Goldman came up with the idea of suggesting possible network connections (branded as “People You May Know”, or PYMK). After analyzing and validating multiple possibilities, he developed an algorithm that worked to everyone’s liking. PYMK was launched, resulting in millions of page views and new connections. LinkedIn’s success has a lot to do with the PYMK feature. To quote the article further:
“Goldman, a PhD in physics from Stanford, was intrigued by the linking he did see going on and by the richness of the user profiles. It all made for messy data and unwieldy analysis, but as he began exploring people’s connections, he started to see possibilities. He began forming theories, testing hunches, and finding patterns that allowed him to predict whose networks a given profile would land in. He could imagine that new features capitalizing on the heuristics he was developing might provide value to users.”
Goldman is a data scientist. The quote above captures the essence of what a data scientist does. Let’s dig deeper and try to understand the data scientist a little better.
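To make the idea of a connection heuristic concrete, here is a toy sketch in Python. It is only an illustration and not LinkedIn’s or Facebook’s actual algorithm: I am assuming one simple heuristic, namely ranking people you are not yet connected to by the number of connections you share with them. The network data and the function name suggest_connections are made up for this example.

```python
from collections import Counter

# Hypothetical toy network: each member maps to the set of people
# they are already connected to.
network = {
    "asha": {"bala", "chen"},
    "bala": {"asha", "chen", "dev"},
    "chen": {"asha", "bala", "dev", "emma"},
    "dev": {"bala", "chen"},
    "emma": {"chen"},
}


def suggest_connections(network, member, limit=3):
    """Rank people `member` is not connected to by shared connections."""
    direct = network[member]
    shared = Counter()
    for friend in direct:
        for candidate in network[friend]:
            # Skip the member themselves and people already connected.
            if candidate != member and candidate not in direct:
                shared[candidate] += 1
    # Highest shared count first; names break ties so the order is stable.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


suggestions = suggest_connections(network, "asha")
print(suggestions)
```

A real system would combine many more signals (shared schools, employers, locations and so on), but the loop of forming a hunch, coding it and checking the suggestions against reality is the same.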
Let’s start with the “scientist” part of data scientist. A scientist is a professional who makes discoveries; Newton, for example, developed the theory of gravity. Newton was curious when he saw the apple fall and wanted to know why. He thought of some possible explanations (also known as hypotheses) for why this happened. He then used the principles of physics to test those possibilities until he arrived at the real reason: the theory of gravity.
This is what Goldman did in the LinkedIn scenario. This is what the Google founders did while developing the Google search engine. They possibly started with the problem:
How can we make search more relevant, contextual and meaningful?
One of the hypotheses must have been PageRank, which ranks a page by the links pointing to it, combined with signals such as keyword relevance, the age of the page and links from authoritative sites (nobody outside Google knows all the parameters; maybe they will publish another paper someday). As you may know, they used a combination of statistical models, machine learning and home-grown tools (the inspiration for Hadoop) to validate their hypotheses. The result of this effort was Google. They worked with data, in this case piles of hyperlinked pages and images across the internet. A data scientist works with data.
Not every data scientist needs to be a Goldman, a Larry Page or a Sergey Brin, but the approach and the essence of the role remain the same. Data is available in every organization; only the size and scale differ. Which organization does not want to find a new way to reach its customers? Who would not like to find new ways of generating leads? Who does not want better revenue forecasting? If you can use this data effectively to help a business grow, you are a data scientist.
Skills of a Data Scientist
It’s a sea out there. Some time back, I was researching tools for my own understanding, and the more I researched, the more tools came up. So I have decided to group them into categories for better understanding. I will publish that once I get to know all of them, well, almost all of them.
Anyway, to give you a broad understanding of the areas where technology plays a role: it starts with data storage, then extraction and loading of data (ETL/ingest), data warehousing (DWH), data mining and analytics, and visualization. There are plenty of tools and technologies available for each of these areas, and you can’t possibly know all of them. So, as a data scientist, which ones should you know?
Let’s start from the beginning. You should understand data extraction and loading, because good data is key to your success. So you should be familiar with data cleansing and data profiling concepts. As far as tools are concerned, you can look at ingest tools like Sqoop.
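To show what data cleansing and data profiling mean in practice, here is a minimal sketch using only the Python standard library. The sample extract, its column names and the helper functions clean_rows and profile are all made up for illustration; a real project would more likely use a dedicated library or an ETL tool.

```python
import csv
import io

# Hypothetical raw extract with typical quality problems: stray spaces,
# inconsistent casing, a missing value and a duplicate row.
raw = """customer_id,city,revenue
101, Mumbai ,2500
102,pune,
103,Pune,1800
101, Mumbai ,2500
"""


def clean_rows(text):
    """Cleansing: trim whitespace, normalise city casing, drop duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        row = {key: value.strip() for key, value in row.items()}
        row["city"] = row["city"].title()
        fingerprint = tuple(row.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(row)
    return cleaned


def profile(rows):
    """Profiling: count missing and distinct values for each column."""
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [value for value in values if value != ""]
        report[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


rows = clean_rows(raw)
report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

Even a small profile like this tells you where the data needs attention (here, a missing revenue figure) before you build any model on top of it.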
The extracted data can be stored in a data warehouse (DWH) or in file-based systems like HDFS. A basic understanding of these can prove handy.
The most important part comes next: data mining, analysis and using models to validate hypotheses. You need to know the concepts of statistical models, machine learning algorithms and some programming languages.
Python and R are two of the most powerful languages for this work. You can choose either one. SQL is also important: data scientists deal with data, much of which sits in relational databases, and you can’t get at it without knowing SQL.
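As a small illustration of how Python and SQL work together, the sketch below uses Python’s built-in sqlite3 module with an in-memory database. The orders table, its columns and the sample rows are hypothetical; the point is only to show the kind of aggregate question a data scientist asks of a database every day.

```python
import sqlite3

# In-memory database so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 300.0), ("initech", 50.0)],
)

# Which customers have spent at least 100 in total, biggest spender first?
query = """
    SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 100
    ORDER BY total DESC
"""
top_customers = conn.execute(query).fetchall()
conn.close()

for customer, order_count, total in top_customers:
    print(customer, order_count, total)
```

The same GROUP BY and HAVING pattern carries over to most relational databases and data warehouses.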
Basic Statistical modeling
An important weapon in any data scientist’s arsenal. Statistical models like regression, ANOVA and ANCOVA enable a data scientist to understand the relationships in data sets and use them to develop predictive models. The R programming language, with its packages distributed through CRAN, is one way to implement statistical models.
You don’t need to know the intricacies of the statistical models (most Python and R libraries have already implemented them). You need to understand how to use these models to solve customer problems.
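To give a feel for what a regression model does, here is a minimal sketch of simple linear regression (ordinary least squares with one predictor) written in plain Python. The ad spend and leads figures are invented for illustration, and fit_line is a hypothetical helper; in practice you would call a library routine instead of writing this yourself.

```python
from statistics import mean

# Hypothetical data: monthly ad spend (in thousands) and leads generated.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
leads = [12.0, 19.0, 29.0, 37.0, 45.0]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = sum of cross-deviations / sum of squared x deviations.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    spread = sum((xi - x_bar) ** 2 for xi in x)
    slope = cross / spread
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(ad_spend, leads)

# Use the fitted line to forecast leads for a spend of 6 (thousand).
forecast = intercept + slope * 6.0
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The slope tells you how many extra leads to expect for each additional unit of spend, which is exactly the kind of relationship a business wants to understand.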
A data scientist also needs to know about machine learning and how to use it to improve models. Weka (written in Java), Mahout (part of the Hadoop stack) and the Prediction API (from Google) are examples of tools used for machine learning implementation.
Data scientists are storytellers. They need visualization tools to showcase their solutions. A tool with dashboard and reporting capabilities can prove handy. Tableau, Power BI and QlikView are among the most powerful tools.
In my view, analytical thinking is probably the most important skill for any data scientist. You can’t be a problem solver if you are not analytical. You don’t need to be a Newton or an Albert Einstein, though; it’s more of an approach. Want to evaluate your analytical skills? Start with the University of Kent lateral thinking quiz.
To go a notch higher, try the McKinsey problem solving tests, which are among the most well-rounded tests of problem solving. They are available on the McKinsey website with answers. These tests don’t require you to know statistical techniques or any programming languages.
Last but not least, Kaggle competitions provide the most comprehensive analytics problem-solving opportunities. These are real-life problems and need you to use almost all the skills of a data scientist.