How many times have you started out with a clear idea of goals and techniques and ultimately found yourself lost in a maze of big data?
If your answer is “quite often,” this article might help you find a way out or, even better, avoid the trap altogether.
I have seen the way some data scientists look at a dataset and start trying to make sense of everything right away.
They go through the basics, like finding outliers and handling missing values, and then feel they have already accomplished most of their goal.
However, this could not be further from the truth.
Here are my eight essential pre-analysis steps for every data scientist to consider:
1. Know Why You Have the Data You Have
Question the credibility of data if you can. How reliable is it? Is there a possibility for bias?
Don’t trust everything to be correctly represented.
2. Define the Business Goal
You should not stop until you have a clear picture of what your business goal is.
Go through the cycle of analyzing the problem, defining it, and then refining the problem statement. The goal may or may not be monetary.
3. Understand the Nature of Your Project
Usually any data analysis project is expected to pay future dividends. If it’s only a one-time project, it may be approached in a manner that best fits your usual style.
However, when the goal is to create a dynamic model, the approach and technique must be defined before diving into any details.
4. Create a Mental Map of the Factors
Create a mental map of the factors. If you have a team, discuss how different variables are to be handled.
Ask experts in the field which factors are important and which are not. Venn diagrams and association matrices are particularly helpful here.
The objective here is to understand and connect your data with the real world; often, these connections and associations are not easy to discover.
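As a rough illustration of an association matrix, the sketch below computes pairwise Pearson correlations between a few numeric factors using only the Python standard library. The column names and values are hypothetical toy data, and Pearson correlation is just one possible association measure (it only captures linear relationships), so treat this as a starting point for discussion with your team and domain experts rather than a finished method.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column shows no linear association
    return cov / sqrt(var_x * var_y)


def association_matrix(columns):
    """Return {(a, b): correlation} for every pair of named columns."""
    matrix = {}
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        matrix[(a, b)] = matrix[(b, a)] = r  # the matrix is symmetric
    for name in columns:
        matrix[(name, name)] = 1.0  # diagonal set to 1.0 by convention
    return matrix


# Hypothetical toy data: three factors observed over five periods.
columns = {
    "ad_spend": [10, 20, 30, 40, 50],
    "visits": [110, 190, 320, 390, 510],
    "returns": [9, 7, 8, 6, 5],
}

matrix = association_matrix(columns)
for a in columns:
    print(a.ljust(10), "  ".join(f"{matrix[(a, b)]:+.2f}" for b in columns))
```

A strong value in this matrix is a prompt to ask an expert why two factors move together, not proof that one drives the other.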
5. Look into the Details and Understand the Headers
Find out how each variable connects with the business goal.
At this point, the business problem statement should be firmly fixed in your mind. You should automatically be able to connect your independent variables to the ultimate business goal.
6. Clear out the Weeds
You must understand the data just enough to root out major typos and errors. Streamline the dataset and create a quality data matrix.
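One way to read “quality data matrix” is as a per-column summary of how clean each variable is. The sketch below, which uses only the Python standard library, counts missing entries and lists values that appear very rarely, since a rare spelling is often a typo. The set of missing-value markers, the rarity threshold, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter

# Assumed markers for "missing"; adjust to match your own dataset.
MISSING = {"", "na", "n/a", "null", "none"}


def column_quality(values, rare_threshold=1):
    """Summarise one column: missing count, distinct count, rare values."""
    cleaned = [v.strip() for v in values]
    present = [v for v in cleaned if v.lower() not in MISSING]
    counts = Counter(present)
    rare = sorted(v for v, c in counts.items() if c <= rare_threshold)
    return {
        "missing": len(cleaned) - len(present),
        "distinct": len(counts),
        "rare_values": rare,  # candidates for review, not proven errors
    }


def quality_matrix(rows):
    """Build {column: summary} from a list of dict rows with string values."""
    names = rows[0].keys()
    return {name: column_quality([row[name] for row in rows]) for name in names}


# Hypothetical sample rows containing one typo and a few missing entries.
rows = [
    {"region": "North", "status": "active"},
    {"region": "North", "status": "active"},
    {"region": "Nroth", "status": "N/A"},
    {"region": "South", "status": "inactive"},
    {"region": "South", "status": ""},
    {"region": "", "status": "inactive"},
]

report = quality_matrix(rows)
for name, summary in report.items():
    print(name, summary)
```

The report only flags suspicious values. Deciding whether “Nroth” really is a misspelling of “North” is a judgment to confirm with the data owner.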
7. Stay Legal
It is extremely important not to modify any data point without explicitly informing the project and data owners. It is a crime to modify data points to your liking, or to correct data based on a bit of online research or because you have a strong “hunch” as to what the actual data point is.
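In practice, one way to honor this rule is to record suspected errors as proposals and leave the raw data untouched until the owners sign off. The sketch below is a minimal illustration of that idea using the Python standard library. The `ProposedCorrection` class, the helper functions, and the sample records are hypothetical names invented for this example, and a real project would also need an actual approval workflow.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ProposedCorrection:
    """A suggested change, kept separate from the raw data."""
    row_id: int
    column: str
    current_value: str
    suggested_value: str
    reason: str


def propose(records, row_id, column, suggested_value, reason):
    """Describe a possible fix without touching the record itself."""
    current = records[row_id][column]
    return ProposedCorrection(row_id, column, current, suggested_value, reason)


def apply_approved(records, approved):
    """Return a corrected copy; the original records are left as they were."""
    corrected = {row_id: dict(record) for row_id, record in records.items()}
    for change in approved:
        corrected[change.row_id][change.column] = change.suggested_value
    return corrected


# Hypothetical raw records, wrapped in read-only views to prevent edits.
raw = {
    1: MappingProxyType({"city": "Bostn", "sales": "1200"}),
    2: MappingProxyType({"city": "Denver", "sales": "-50"}),
}

proposals = [
    propose(raw, 1, "city", "Boston", "likely typo"),
    propose(raw, 2, "sales", "50", "negative sales; sign may be flipped"),
]

# Suppose the data owner approves only the first proposal.
approved = proposals[:1]
corrected = apply_approved(raw, approved)

print(corrected[1]["city"], "| raw value kept:", raw[1]["city"])
```

Keeping the reason next to each proposal gives the project and data owners what they need to accept or reject a change, and the unapproved “hunch” about the negative sales figure never reaches the working copy.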
8. Reiterate Your Problem Statement
Once you have a clean slate, it is time to reiterate the problem statement. Quite often the problem statement gets revised based on steps 4, 5, 6 and 7 above.
So, What Happens Next?
Now you have the requirements defined and fresh material ready to query. It is time to start painting the big picture. More often than not, new learners forget that slow and steady wins the race.
Remember to look beyond the data and see how you can improve data collection.
- How can you make the process self-sufficient?
- How can you create a tool that learns from itself?
- How can you reduce the amount of effort to reach the same conclusion for new datasets?
The maze is always a challenge when you are looking at its walls from within. Create a blueprint of some successful exit routes from the maze because, after all, your analysis method must be sustainable.
Congratulations, you can finally navigate the maze like a champ! Follow your mind map and conquer the maze!