As a beginner in data science, there are two concepts you cannot ignore: statistics and machine learning. They can be confusing at first, because they often use the same algorithms to make predictions. But if you look at how they are used, you will find that they serve quite different purposes.
It seems like statistics focuses more on experiments and machine learning focuses more on actual problems. So, what is the relationship between them?
Let’s start with two other subjects first: pure mathematics and computer science. You may think they have nothing in common, but they pursue the same goal: solving everyday problems efficiently.
Both mathematicians and computer scientists build abstract models to simulate real-world problems. The difference is that mathematics leans more heavily on theory than computer science does; just think how long mathematics has been developing throughout history.
I’m as enthusiastic about the future of AI as (almost) anyone, but I would estimate I’ve created 1000X more value from careful manual analysis of a few high quality data sets than I have from all the fancy ML models I’ve trained combined.
— Sean J. Taylor (@seanjtaylor) February 20, 2018
We can agree that statistics aims to explain data in a reasonable way, while machine learning only works when the process behind the data is predictable. Put another way, statistics focuses more on the model itself (its accuracy, meaning and bias), while machine learning focuses more on the end result.
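To make the contrast concrete, here is a minimal sketch that fits the same straight line once and then reads it in two ways. The data is synthetic and every name in it (`fit_line`, `slope_standard_error`, `rmse`) is made up for this illustration. The standard error of the slope stands in for the statistics question (how much can we trust the model itself?), and the error on held-out points stands in for the machine learning question (how good is the end result?). It is one simple way to picture the difference, not a definition of either field.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def slope_standard_error(xs, ys, intercept, slope):
    """Statistics view: how uncertain is the estimated slope?"""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / (n - 2) / sxx)


def rmse(xs, ys, intercept, slope):
    """Machine learning view: how far off are predictions on unseen data?"""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic data (an assumption for illustration): y = 2 + 0.5 * x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points to stand in for "unseen" data
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

intercept, slope = fit_line(train_x, train_y)
slope_se = slope_standard_error(train_x, train_y, intercept, slope)
test_rmse = rmse(test_x, test_y, intercept, slope)

print(f"slope = {slope:.3f} +/- {slope_se:.3f} (standard error)")
print(f"held-out RMSE = {test_rmse:.3f}")
```

The model and the fitting algorithm are identical in both readings; only the question asked of the result changes.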
The goal of this explanation is not to outline every difference between these subjects, but to understand the relationship between them. Scientists seek theories to explain WHY natural rules hold, while data engineers mostly care about HOW to use them.
As future data scientists, we are a mix of real scientists and handy engineers: we design the frameworks of models and use them to make predictions about real-world problems. But if we are not strong in statistics, we will easily get lost in the sea of data. So let’s start with a calculation!