Intermediate Learning R

Quickly Visualize All Variables in a New Dataset

A new data scientist can feel overwhelmed when tasked with exploring a new dataset;  each dataset brings forward different challenges in preparation for modeling.

This article will focus on getting a quick glimpse at your data in R and, specifically, dealing with these three aspects:

  1. Viewing the distribution: is it normal?
  2. Identifying the skewness
  3. Identifying the outliers

My goal is to show you how to make a quick visual of the data to solve these questions using ggplot2‘s Small Multiple Chart.


Examine the Target Variable

The following guide will use a default dataset from the “MASS” library in R. This dataset contains data on median price for houses in the Boston area. Load it like this:

> library(MASS)
> Boston <- Boston
> head(Boston)

pic1

Conveniently, the data is tidy. No cleaning will be necessary. Instead, we can direct our focus on visualizing the data using ggplot2, starting with the key indicator of the dataset, “medv,” which is the median house price.

> require(ggplot2)

#Histogram of medv
------------------
> ggplot(data = Boston, aes(x = medv)) + 
geom_histogram()

#Density Plot of medv
---------------------
> ggplot(data = Boston, aes(x = medv)) + 
stat_density()

Both histograms and density plots show the general shape of the data; however, it is worth mentioning the differences between the two:

  • histograms specifically show where the data is and are preferred when visualizing one variable
  • density plots generalize by showing a line of best fit, which helps us glance the distribution of the variable

Some noteworthy observations about the indicator variable “medv”:

  1. it is not normally distributed
  2. it is skewed to the right
  3. there are extreme values/outliers at the tail

It seems we’ll need to handle those outliers to come closer to normalizing our data.


Visualize All Variables

For convenience—and convenience is always welcomed in data science—let’s continue to look at the rest of the Boston dataset using ggplot2‘s Small Multiple Chart.

First, we start by “melting” our data from a wide format into a long format. We will need to call the reshape2 package to perform this. Take a look at the code below for the transformation our data needs for ggplot2‘s small multiple chart.

> require(reshape2)

#Convert wide to long
---------------------
> melt.boston <- melt(Boston)

> head(melt.boston)

pic2

Here, we’ve successfully created a “toy” dataset where the values have been spread into rows alongside a “variable” column. Now, let’s bring in ggplot2 to make quick visuals of all the variables.

#Small Multiple Chart
---------------------
> ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")

smallmchart.PNG

In the plot definition, we faceted (or grouped) by values for each variable and kept the scales “free” so each plot has its own parameters. It’s easy to see how convenient this would be for larger datasets rather than investigating each variable.

Returning to our main questions:

      1. Viewing the distribution: is it normal?

The “rm” variable is closest to normal. This means most of our data will need to be transformed.

      2. Identifying the skewness

We can see that “crime,” “zn,” “chaz,” “dis,” and “black” are highly skewed. Several other variables have moderate skewness.

      3. Identifying the outliers

boxplot(Boston)

boxplot.PNG

Outliers can be found using R’s boxplot function and the methods to deal with them will vary. Simply removing them is rarely the correct method. We won’t touch on methods here, but rather focus on the variables that need transformation to normal distributions.

Looks like transforming our data to remove outliers will be a big task to help normalize the data.

Dealing with these outliers will take a more subjective course of action (e.g., K-Nearest Neighbour, mean values, etc.) which we can discuss in a future article.


Recap

We managed to position ourselves to better address the data’s needs by:

  • plotting the target variable using ggplot2
  • reshaping the dataset from wide to long using reshape2‘s melt() function
  • plotting all the variables with ggplot2‘s small multiple chart
  • examining the output for skewness and outliers

For a more thorough understanding of ggplot2‘s capabilities, I recommend taking all three of DataCamp‘s Data Visualization with ggplot2 courses to become more acquainted with the syntax.

9 comments on “Quickly Visualize All Variables in a New Dataset

  1. What about visualizing categorical data?

    Liked by 1 person

  2. Great Article! Please post one regarding the handling of outliers?

    Liked by 1 person

  3. Nice idea using melt

    Liked by 1 person

  4. Nice post on a great way to use ggplot + tidyr,, but..; I think the package DataExplorer (https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html ) is doing that without effort.
    However, your post is very pedagogical and informative

    Liked by 1 person

  5. Stefanos

    Hey, nice and practical guide! Thanks for sharing!
    A small note: medv is skewed to the right, not to the left, as its tail goes to the right, and that’s where skewness is characterized from.

    Like

  6. Could you please add a Python equivalent codes to those nice viz you just showed up?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: