Quickly Visualize All Variables in a New Dataset

A new data scientist can feel overwhelmed when tasked with exploring a new dataset; each dataset brings forward different challenges in preparation for modeling.

This article will focus on getting a quick glimpse at your data in R and, specifically, dealing with these three aspects:

Viewing the distribution: is it normal?
Identifying the skewness
Identifying the outliers

My goal is to show you how to make a quick visual of the data to solve these questions using ggplot2‘s Small Multiple Chart.

Examine the Target Variable

The following guide will use a default dataset from the “MASS” library in R. This dataset contains data on median price for houses in the Boston area. Load it like this:

> library(MASS)
> Boston <- Boston
> head(Boston)

Conveniently, the data is tidy. No cleaning will be necessary. Instead, we can direct our focus on visualizing the data using ggplot2, starting with the key indicator of the dataset, “medv,” which is the median house price.

> require(ggplot2)

#Histogram of medv
------------------
> ggplot(data = Boston, aes(x = medv)) + 
geom_histogram()

#Density Plot of medv
---------------------
> ggplot(data = Boston, aes(x = medv)) + 
stat_density()

Both histograms and density plots show the general shape of the data; however, it is worth mentioning the differences between the two:

histograms specifically show where the data is and are preferred when visualizing one variable
density plots generalize by showing a line of best fit, which helps us glance the distribution of the variable

Some noteworthy observations about the indicator variable “medv”:

it is not normally distributed
it is skewed to the right
there are extreme values/outliers at the tail

It seems we’ll need to handle those outliers to come closer to normalizing our data.

Visualize All Variables

For convenience—and convenience is always welcomed in data science—let’s continue to look at the rest of the Boston dataset using ggplot2‘s Small Multiple Chart.

First, we start by “melting” our data from a wide format into a long format. We will need to call the reshape2 package to perform this. Take a look at the code below for the transformation our data needs for ggplot2‘s small multiple chart.

> require(reshape2)

#Convert wide to long
---------------------
> melt.boston <- melt(Boston)

> head(melt.boston)

Here, we’ve successfully created a “toy” dataset where the values have been spread into rows alongside a “variable” column. Now, let’s bring in ggplot2 to make quick visuals of all the variables.

#Small Multiple Chart
---------------------
> ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")

In the plot definition, we faceted (or grouped) by values for each variable and kept the scales “free” so each plot has its own parameters. It’s easy to see how convenient this would be for larger datasets rather than investigating each variable.

Returning to our main questions:

1. Viewing the distribution: is it normal?

The “rm” variable is closest to normal. This means most of our data will need to be transformed.

2. Identifying the skewness

We can see that “crime,” “zn,” “chaz,” “dis,” and “black” are highly skewed. Several other variables have moderate skewness.

3. Identifying the outliers

boxplot(Boston)

Outliers can be found using R’s boxplot function and the methods to deal with them will vary. Simply removing them is rarely the correct method. We won’t touch on methods here, but rather focus on the variables that need transformation to normal distributions.

Looks like transforming our data to remove outliers will be a big task to help normalize the data.

Dealing with these outliers will take a more subjective course of action (e.g., K-Nearest Neighbour, mean values, etc.) which we can discuss in a future article.

Recap

We managed to position ourselves to better address the data’s needs by:

plotting the target variable using ggplot2
reshaping the dataset from wide to long using reshape2‘s melt() function
plotting all the variables with ggplot2‘s small multiple chart
examining the output for skewness and outliers

For a more thorough understanding of ggplot2‘s capabilities, I recommend taking all three of DataCamp‘s Data Visualization with ggplot2 courses to become more acquainted with the syntax.

9 comments on “Quickly Visualize All Variables in a New Dataset”

nyandemo

March 6, 2018

What about visualizing categorical data?

LikeLiked by 1 person

JCV

March 6, 2018

Great Article! Please post one regarding the handling of outliers?

LikeLiked by 1 person

- Jake Daniels
  
  March 13, 2018
  
  In due time!
  
  LikeLike
  
Ravish Khare

March 7, 2018

Nice idea using melt

LikeLiked by 1 person

Bontemps

March 7, 2018

Nice post on a great way to use ggplot + tidyr,, but..; I think the package DataExplorer (https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html ) is doing that without effort.
However, your post is very pedagogical and informative

LikeLiked by 1 person

- Jake Daniels
  
  March 13, 2018
  
  I’ll check it out Bontemps. It never hurts to shorten the process.
  
  LikeLike
  
Stefanos

March 10, 2018

Hey, nice and practical guide! Thanks for sharing!
A small note: medv is skewed to the right, not to the left, as its tail goes to the right, and that’s where skewness is characterized from.

LikeLike

- Jake Daniels
  
  March 13, 2018
  
  Good Catch! Thanks.
  
  LikeLike
  
Wagner Rodeski

March 26, 2018

Could you please add a Python equivalent codes to those nice viz you just showed up?

LikeLike