
Rolling up the Sleeves on My First Data Project

Exploring Toronto Data with R and Hive

Preface

While discussing our final data science project, the capstone requirement for the certificate I’m pursuing, my professor suggested we explore the City of Toronto’s Open Data Catalogue and use local datasets for it.

So I kept my eyes peeled for recent additions, and the City of Toronto soon published a Neighbourhood Profiles dataset covering its 140 social planning neighbourhoods, with socio-economic and geographic data for each. The purpose of this dataset is to help government and community organizations with local planning. It is fascinating and valuable for the city to break down this data to somehow quantify the diversity Toronto often champions.


The Wake Before the Storm

I currently prefer previewing the data in Excel before importing it into R, to get a heads-up on how much data cleanup is required, if any:

[Screenshot: preview of the raw Neighbourhood Profiles dataset in Excel]

Structurally speaking, this looks very different from the datasets I’ve seen in my training, as well as those found on CRAN. There are multiple layers of data point descriptions from columns A to D, along with labels across rows one and two. There are also commas and percentage signs within the numeric values, an inconvenience that hasn’t come up with the datasets I’ve used thus far.

There’s an opportunity here for a data cleanup project. My goal will be to build Hive tables containing some contents of this dataset. I’ll have to get my hands dirty in R to clean and restructure some data, then apply some Hadoop knowledge to move our subsetted data into Hive. This blog will go through some of my data exploration in R, the data restructuring itself, and finally how I loaded a subset of the data into Hive tables. All R and Hive scripts are available on my GitHub Toronto Neighbourhoods Repository.


Beginning Our Data Wrangling in R

You’ll notice that the first four columns of the original file are factors with categories and subcategories:

[Screenshot: the first four factor columns — Category, Topic, Data.Source, Characteristic]
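For reference, a minimal sketch of the load itself; the file name here is hypothetical, and (as I found out later) read.csv() brings the text columns in as factors by default:

> profiles2016 <- read.csv("neighbourhood-profiles-2016.csv")
> str(profiles2016[,1:4])  #Category, Topic, Data.Source, Characteristic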

It’s good practice to count how many variables we potentially have to work with.

> data.frame(Categories = length(unique(profiles2016$Category))
           ,Topics = length(unique(profiles2016$Topic))
           ,DataSources = length(unique(profiles2016$Data.Source))
           ,Characteristics = length(unique(profiles2016$Characteristic))
      )

[Screenshot: counts of unique Categories, Topics, Data Sources, and Characteristics]

DataSources is an interesting variable, but for now I suspect there’s a column hierarchy going on here.

My guess at this point is perhaps a Category might be something like “Real Estate,” Topics under that might be “Commercial” or “Private,” and then Characteristics might be “Commercial high rise,” “Commercial low rise,” “Commercial plaza,” etc., with a count of such types of real estate in each of the 140 neighbourhoods.


Exploring Factor Hierarchy with Functions

I wrote the following code to help explore this hierarchy; explanations inside:

#Output number of subcategories per category:
#x is parent category, y is child
#example: list number of subcategories y per category x

> col.hierarchy <- function (x,y){
  subs <- NULL
  for (i in 1:length(unique(x))){
    subs[i] <- length(unique(y[x==unique(as.character(x))[i]]))
  }
  data.frame(Category=as.character(unique(x)),numSubcategories=subs)
}
#Output subcategories for given category:
#x is category, y is desired subcategories, z is which category
#example: list subcategory y's where category x equals z

> col.subnames <- function(x,y,z){
  unique(as.character(y)[x==z])
}

In action:

[Screenshot: col.hierarchy() output — number of Topics under each Category]
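For readers following along, the call producing that table is:

> col.hierarchy(profiles2016$Category, profiles2016$Topic)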

From the above, I’ve counted the number of Topics under each of the fifteen Categories. All seem reasonable to analyze. None of these fifteen “themes” has an overwhelming number of variables to consider. More on the “source()” line later.

[Screenshot: col.hierarchy() output — number of Characteristics under each Topic]

Here’s the same function, except counting Characteristics under each Topic (32 of the 49 are listed above). It appears that characteristics related to languages spoken and ethnic backgrounds have many variables to look at. From exploring the data on my own (which I won’t outline in this blog), I can confirm these are specific languages and countries, with a count of each across the 140 neighbourhoods.

[Screenshot: col.subnames() output — the six Topics under the “Income” Category]

To explore what Topics fall under a Category, as opposed to how many, I scripted the col.subnames() function. Above, it lists the six Topics under the Category of “Income.” The call below lists the nine Characteristics under the last of those Topics, “Income taxes.”

[Screenshot: col.subnames() output — the nine Characteristics under the “Income taxes” Topic]
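The two calls behind those screenshots, for reference:

> col.subnames(profiles2016$Category, profiles2016$Topic, "Income")
> col.subnames(profiles2016$Topic, profiles2016$Characteristic, "Income taxes")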

Through hours of exploring the dataset, I found enough grounds to eliminate the Data.Source column: commonly themed Characteristics and Topics always came from the same data source.

Through this exploration, I also decided that the Labour Category was a good candidate to continue my small data project: building a HiveQL database of labour data in Toronto.


The Realized Benefit of source()

Before I continue, let’s go back to the source() function as promised.

[Screenshot: the source() call at the top of the main analysis script]
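In other words, the top of the main analysis script reduces to a single line along these lines (the file name here is hypothetical):

> source("groundwork.R")  #imports the .csv and builds col.hierarchy(), col.subnames(), etc.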

I’ve developed a personal preference for batch-running “groundwork” scripts, like importing data and building user-defined functions, before conducting the meat of any analysis in the main R script file. Thus, I’ve moved this groundwork script to another file, and from my main analysis script, I will run all of it with one line.

This keeps things efficient, or at least easy on the eyes, when you have many files and many functions to initialize before calling them as part of your analysis.


Cleaning out the Punctuation

On lines 11-20 of my groundwork script, I removed the commas and percentage signs, which would otherwise have impeded R’s ability to read the numeric values (coercing them into characters or strings, which I don’t want).
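A sketch of that cleanup, where the exact column range is my assumption (the numeric columns start after the four descriptive ones); gsub() strips the commas and percent signs before the conversion to numeric:

> for (j in 5:ncol(profiles2016)) {
    profiles2016[,j] <- as.numeric(gsub("[,%]", "", as.character(profiles2016[,j])))
  }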

[Screenshot: the groundwork script’s comma and percentage-sign cleanup loop]

I’ll mention here that another thing I’m starting to pick up is a feeling of guilt when looping in this manner. Eventually, I’d like to get more comfortable with R’s apply() family of functions.

Array-based programming is one of the advantages of using R, and I should practice it so I can handle higher-dimensional data structures the way R intends.
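For instance, the same cleanup without an explicit loop might look like this, under the same column-range assumption as above:

> profiles2016[5:ncol(profiles2016)] <- lapply(profiles2016[5:ncol(profiles2016)],
      function(col) as.numeric(gsub("[,%]", "", as.character(col))))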


Converting the Labour Data from .csv to a Hive Database

I know it’s fairly straightforward to import .csv files into Hive, as long as you have a first row of field names and subsequent rows of records, all separated by commas. So let’s try that right now and see how far off we are.

> test <- t(profiles2016[which(profiles2016$Category=="Labour"),])

[Screenshot: the transposed Labour subset]

> View(test)

[Screenshot: View(test) output]

> write.csv(test, "test.csv", row.names = F)

Beautiful, we have our work cut out for us. We need to:

1. Pick out Labour data

> LabourData <- profiles2016[which(profiles2016$Category=="Labour"),]

#OR

> LabourData <- split(profiles2016, profiles2016$Category)$Labour

2. Remove the index, Category, Data Source, and City of Toronto (an aggregate column)

> rownames(LabourData) <- c()
> LabourData$Category <- NULL
> LabourData$Data.Source <- NULL
> LabourData$City.of.Toronto <- NULL

3. Break out by Topics: each topic will constitute a relation/table in our Labour database

> LabourStatus   <- LabourData[which(LabourData$Topic==as.character(unique(LabourData$Topic)[1])),]
> LabourActivity <- LabourData[which(LabourData$Topic==as.character(unique(LabourData$Topic)[2])),]
> WorkerClass    <- LabourData[which(LabourData$Topic==as.character(unique(LabourData$Topic)[3])),]
> WorkerOccup    <- LabourData[which(LabourData$Topic==as.character(unique(LabourData$Topic)[4])),]
> LabourIndustry <- LabourData[which(LabourData$Topic==as.character(unique(LabourData$Topic)[5])),]
> LabourPlace    <- LabourData[which(LabourData$Topic==as.character(unique(LabourData$Topic)[6])),]
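(As an aside, split() could have produced all six subsets in one shot, much like the Category split earlier; a sketch, where droplevels() discards the Topic levels belonging to other Categories:

> topicTables <- split(LabourData, droplevels(LabourData$Topic))

I’ll continue with the explicit subsets, since the readable names carry through to the table builds below.)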

4. Get neighbourhood names in the .csv (without the new index numbers!)

> LabourTables <- list(LabourStatus,
                 LabourActivity,
                 WorkerClass,
                 WorkerOccup,
                 LabourIndustry,
                 LabourPlace)

> rm(LabourStatus,
   LabourActivity,
   WorkerClass,
   WorkerOccup,
   LabourIndustry,
   LabourPlace)

 

It turns out the Topic column is still there, along with Characteristic, plus we have row names again. So let’s remove them, build the neighbourhood names, and finally:

5. Transpose

[Screenshot: groundwork script — dropping Topic and Characteristic, transposing, and building the neighbourhood names]

So here we have the final stages complete. In the script I actually transposed first (line 77) before building the neighbourhood names (lines 85-87), reversing tasks 4 and 5 above. Again, I removed row names, as I found the neighbourhood names ended up there after the transpose.
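To make that concrete, here is a sketch of what those script lines amount to, reconstructed from the description above (the details are my assumptions, not the script verbatim):

> t.LabourTables <- lapply(LabourTables, function(tbl) {
    characteristics <- as.character(tbl$Characteristic)
    tbl$Topic <- NULL                 #drop the leftover factor columns
    tbl$Characteristic <- NULL
    out <- as.data.frame(t(tbl))      #transpose: neighbourhoods become rows
    names(out) <- characteristics     #Characteristic labels become the field names
    out <- cbind(Neighbourhood = rownames(out), out)  #the transpose leaves the neighbourhood names in the row names; keep them as a column
    rownames(out) <- c()              #then drop the row names so they don't leak into the .csv
    out
  })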


Sending the Dataset to Hive

Let’s preview what we have now:

[Screenshot: preview of one finished table in t.LabourTables]

Aside from the NAs to the far right, it looks good. I’m ready to export the .csv files from R.

#Export csv Files
> myTableNames <- c("LabourStatus",
                  "LabourActivity",
                  "WorkerClass",
                  "WorkerOccup",
                  "LabourIndustry",
                  "LabourPlace")

> for(i in 1:length(t.LabourTables)){
  write.csv(t.LabourTables[[i]],paste(myTableNames[i],".csv",sep=""),row.names = F)
}

[Screenshot: the six exported .csv files]

These six .csv files were brought into Hortonworks’ Hadoop environment, where I built the Hive tables that hold this labour data. Below, I’ll show how I built one of the tables and populated it with one of the .csv files’ data.

[Screenshot: the Hive CREATE TABLE and LOAD DATA statements]

Looks dandy so far. However, you’ll notice I initialized the integer columns as strings. This is because the .csv files contain quotation marks around all the values; had I initialized those columns as integers, the quotation marks would have caused NULL values to be loaded into the Hive tables.
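To give a flavour of what’s in that screenshot, here’s a sketch of the table build with hypothetical column names (the real tables have one column per Characteristic); the tblproperties line excludes the .csv’s first row of field names:

CREATE TABLE labourstatus (
  neighbourhood STRING,
  in_labour_force STRING,   -- integer in spirit, loaded as STRING because of the quotes
  employed STRING,
  unemployed STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA INPATH '/tmp/LabourStatus.csv' INTO TABLE labourstatus;  -- hypothetical HDFS path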

The solution, aside from exporting cleaner .csv files from R in the first place (write.csv()’s quote = FALSE argument would omit the quotation marks), is to remove them in Hive and then cast the string datatype to integer.

Below I used Regular Expressions in Hive to find and replace quotation marks with blanks, and then the data was in good shape for casting to the integer datatype.

[Screenshot: the regexp_replace and CAST statements in Hive]
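Along these lines, continuing the hypothetical table above:

SELECT
  regexp_replace(neighbourhood, '"', '') AS neighbourhood,
  CAST(regexp_replace(in_labour_force, '"', '') AS INT) AS in_labour_force,
  CAST(regexp_replace(employed, '"', '') AS INT) AS employed,
  CAST(regexp_replace(unemployed, '"', '') AS INT) AS unemployed
FROM labourstatus;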

Let’s see what we have now:

[Screenshot: the cleaned, integer-typed table contents]

Fantastic! Five more tables to go on my own time.


Concluding Thoughts

I was very pleased to gain these hours of practice exploring and manipulating data in R and handling gentle curveballs in Hive. I suspect a fair amount of open data out there has already been cleaned up by the many talented contributors in the field, so I’m glad to have practiced one of the steps towards becoming one of those contributors!

I do get the feeling this City of Toronto dataset was constructed using Excel pivot tables or something like them. Pivot-table layouts are fairly common in my experience, so I think the functions I’ve built here may come in handy in the future.

Strictly within R and Hive, getting the data into the required .csv format was also a bit trickier than I expected. [Examples: having to use the tblproperties option in Hive to exclude the first row of field names from the .csv file, and having to explicitly initialize the attributes as strings after R had initialized everything as factors on the read.csv load.]

All in all, I recommend taking on cleanup projects like this. The smallest surprises creep up and force you to learn something new that the textbook didn’t cover, precisely because it was such a nuance.

Student of Big Data science: Machine Learning & Business Intelligence Tools | Practitioner of Insurance Data Analytics
