What is Data Science?
“Data scientist” and “data science” have been the buzzwords of this decade. Everybody seems excited about the opportunities and the potential, so it becomes important to understand what data science actually is.
In this article, I will try to explain the concept of data science with examples. I will also provide a high-level view of the key skills required to become a data scientist.
Understanding Data Science
I have two accounts on Facebook. One is strictly personal, while the other one had to be created because I wanted to have a company page on Facebook.
The two accounts have absolutely nothing in common except my name (even the email IDs and mobile numbers are different).
Last week, when I logged into the official account, I was surprised to see some suggestions in the “People You May Know” section.
The first and third ones were real surprises. How could Facebook figure these out?
The first one is a college batchmate from almost 25 years ago, and we had not been in touch since we graduated. It was indeed a pleasant surprise. This is not an isolated case; I have rediscovered many such friends through Facebook or LinkedIn.
The Harvard Business Review article “Data Scientist: The Sexiest Job of the 21st Century” says the following about LinkedIn in 2006, when Jonathan Goldman joined the company:
The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience.
That’s when the idea of suggesting possible network connections (branded as “People You May Know”, or PYMK) occurred to Goldman.
After validating and analyzing multiple possibilities, he finally developed an algorithm to everyone’s liking. PYMK was launched, resulting in millions of page views and new connections. LinkedIn’s success has a lot to do with the PYMK feature. To quote the article further:
Goldman, a PhD in physics from Stanford, was intrigued by the linking he did see going on and by the richness of the user profiles. It all made for messy data and unwieldy analysis, but as he began exploring people’s connections, he started to see possibilities. He began forming theories, testing hunches, and finding patterns that allowed him to predict whose networks a given profile would land in. He could imagine that new features capitalizing on the heuristics he was developing might provide value to users.
Goldman is a data scientist. The quote captures the essence of what a data scientist does. Let’s dig deeper and try to understand the data scientist a little better.
Let’s start with the “scientist” part of “data scientist”.
A scientist is a professional who makes discoveries; Newton, for example, formulated the theory of gravity. Newton was curious when he saw the apple fall, and he wanted to know “why?”.
He came up with some possibilities or assumptions (also known as hypotheses) as to why this happened. He then used the principles of mathematics and physics to test those possibilities until he arrived at the real reason: the theory of gravity.
This is what Goldman did in the LinkedIn scenario. He tried to solve the following problem:
How can we help connect users with their old friends, college mates, or colleagues, and thereby expand the network?
To solve this problem, he made use of the existing data and applied algorithms and techniques. Let’s try to understand the problem better.
Consider me as a Facebook user with 200 connections, each of whom also has 200 connections.
If I go three levels beyond my direct connections, there are possibly 1.6 billion (200 × 200 × 200 × 200) connections for me, assuming no overlap. And I have still not considered connections from college, company, likes patterns, etc.
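The arithmetic behind this estimate is easy to check. The short sketch below assumes that every user has exactly 200 connections and that no two people share a connection; both are simplifications, and the function name is made up for illustration. Under those assumptions the number of profiles at each hop is a power of 200, and the 1.6 billion figure corresponds to the fourth hop, that is, three levels beyond my direct connections.

```python
# Rough upper bound on reachable profiles.
# Assumptions (simplifications, not real Facebook data): every user has
# exactly 200 connections and nobody shares a connection with anyone else.
CONNECTIONS_PER_USER = 200


def profiles_at_hop(hop: int, per_user: int = CONNECTIONS_PER_USER) -> int:
    """Number of profiles exactly `hop` steps away under the assumptions above."""
    return per_user ** hop


for hop in range(1, 5):
    print(f"hop {hop}: {profiles_at_hop(hop):,}")
# hop 1: 200
# hop 2: 40,000
# hop 3: 8,000,000
# hop 4: 1,600,000,000
```

In a real network many of these profiles overlap, so the true number is smaller, but the candidate pool is still far too large to browse by hand.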
How can Facebook predict the top 10-15 connections I would really be interested in, and not just any 10 faces? Data scientists develop algorithms and models to solve these problems.
The most interesting part is that these models have the capability to improve themselves. Over time, the accuracy of “PYMK” predictions gets better and better (that is, the model can learn). This is why data science is being touted as the sexiest job of the 21st century.
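To make the ranking problem concrete, here is a minimal sketch of one simple heuristic: suggest the people with whom I share the most mutual connections. This is only an illustration, not the actual algorithm used by Facebook or LinkedIn (neither is described in this article), and it does not learn over time. The function name, the tiny graph, and the people in it are all made up.

```python
from collections import Counter


def suggest_connections(graph, user, top_n=10):
    """Rank people the user is not yet connected to by number of mutual connections.

    `graph` maps each person to the set of people they are connected to.
    """
    direct = graph.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone already connected to the user.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual connections first; ties broken alphabetically for stable output.
    ranked = sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical toy network, invented purely for this example.
graph = {
    "me": {"asha", "ben", "chen"},
    "asha": {"me", "dev", "elena"},
    "ben": {"me", "dev", "farid"},
    "chen": {"me", "dev", "elena"},
    "dev": {"asha", "ben", "chen"},
    "elena": {"asha", "chen"},
    "farid": {"ben"},
}

print(suggest_connections(graph, "me", top_n=2))  # ['dev', 'elena']
```

Here "dev" comes first because three of my connections know him, while "elena" is known to two. A production system would combine many more signals, such as college, company, and likes patterns.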
Skills to become a data scientist
Technical Skills – It’s a sea out there. Some time back, I was researching tools for my own understanding, and the more I researched, the more tools came up. So instead of focusing on tools, one should concentrate on understanding techniques and concepts.
Data science involves data cleansing, analytics techniques, and data visualization. Python and R are two powerful programming environments for this work, each with extensive libraries.
Knowledge of SQL is also key.
Basic statistical modeling and machine learning – An important weapon in any data scientist’s arsenal. Statistical models such as regression, ANOVA, and ANCOVA enable a data scientist to understand the relationships in data sets and use them to develop predictive models.
You don’t need to get into the depth of these models, but having a basic understanding is important.
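As a small illustration of what “understanding the relationship in data” means, the sketch below fits a straight line to a handful of points using ordinary least squares, which is the simplest form of regression. The data points are invented for the example and the function name is hypothetical; in practice you would more likely use a library in Python or R.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # use the fitted relationship to predict a new point
print(slope, intercept, prediction)  # 2.0 1.0 13.0
```

The fitted slope describes the relationship (each extra unit of x adds about 2 to y), and the same line can then be used to predict values that were not in the data.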
Visualization tools – Data scientists are storytellers. They need visualization tools to showcase their solutions. A tool with dashboard and reporting capabilities can prove handy. Tools like QlikView, SAS, and R can be useful.
Analytical Skills – In my view, this is probably the most important skill for any data scientist. You can’t be a problem solver if you are not analytical. You don’t need to be a Newton or an Albert Einstein, though. It’s more about the approach. You can find resources to understand structured problem-solving approaches.
Several tests and resources are available to help you hone your problem-solving and analytical skills.
Start with the University of Kent lateral thinking quiz.
To go a notch higher, some of the most well-rounded problem-solving tests are from McKinsey. The McKinsey problem-solving tests are available on their website with answers. These tests don’t require you to know statistical techniques or any programming languages.
Last but not least, Kaggle competitions provide the most comprehensive analytics problem-solving opportunities. These are real-life problems.