
TGIF: The Grind Includes Fridays

Now that I’m halfway through the winter semester, I thought a reflection on my Christmas study break was long overdue.

In my last blog, I set out to do as much self-study as possible while finding balance with holiday festivities and two young daughters.

And after four weeks of…

  • several Santa Claus sightings
  • family gatherings
  • kindergarten events
  • a snowy visit to the zoo
  • 18 hours of online courses

…this was certainly the most studying I’ve ever achieved over a Christmas break. Also, if I were still counting steps on my wearable tracker, the festivities alone would have registered a heck of a lot of them.

The courses I chose to spend a lovely eighteen hours with gave me exposure to tools and methodologies in Hadoop and R. As someone who learns best by repetition and practice, this was a great way to set my future self up for success.

I’ll highlight two courses that have left lasting impressions on me so far. I’ve applied their concepts often over the past six weeks as I continue to delve into data science, and I recommend you try them out too.


[Lynda.com] R Statistics Essential Training (Instructor: Barton Poulson)

Organizing the course around topics like “Charting and Analysis of One or Multiple Variables” and “Regression and Multiple Regressions” turned out to be more helpful than I realized at the time.

It helped me establish the good habit of framing a data question in these terms early on. Immediately after being presented with a question or problem, I ask myself:

  • What end result am I looking for? Is it a chart, an acceptance/rejection of a hypothesis, or a projection?
  • How many variables do I have to work with?

The answers to these questions quickly point me to the specific R functions I should reach for.
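As a toy illustration of how those answers can map to R functions (these are my own examples using the built-in mtcars data, not exercises from the course):

  # One variable, want a chart: a histogram of fuel economy
  hist(mtcars$mpg)

  # Two variables, want a chart: scatter plot of weight against fuel economy
  plot(mtcars$wt, mtcars$mpg)

  # Two variables, accept/reject a hypothesis: test whether they are correlated
  cor.test(mtcars$wt, mtcars$mpg)

  # Several variables, want a projection: fit a model, then predict a new case
  fit <- lm(mpg ~ wt + hp, data = mtcars)
  predict(fit, newdata = data.frame(wt = 3, hp = 150))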

I knew there were many parameters for graphing functions; however, it was something else to see all of them being demonstrated. Below is one example of some customization that can be done with a box plot. I’ll definitely remember to revisit the provided R exercise files when building any visuals in the future.
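Here is a rough sketch of the kind of customization I mean, using R’s built-in mtcars data rather than the course’s exercise files:

  # A box plot of fuel economy by cylinder count, dressed up with a few of
  # the many available parameters: title, axis labels, colours and label style
  boxplot(mpg ~ cyl, data = mtcars,
          main = "Fuel Economy by Cylinder Count",
          xlab = "Cylinders", ylab = "Miles per gallon",
          col = "lightblue", border = "darkblue",
          las = 1)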

Shortcuts, Shortcuts, Shortcuts!

Not coding directly in the console window is another habit that’s been invaluable to me.

Having a script readily available to execute from the source window has been so useful, especially with these RStudio keyboard shortcuts:

Ctrl+Enter : Run selected line(s)
Ctrl+Shift+P : Run previously run line(s)
Ctrl+Shift+C : Comment out/in
Ctrl+L : Clear console


Another Slick Tip

If you format a block of code (like the boxplot function below) with each parameter on a new line preceded by a comma, and each closing bracket on its own line, you enable simple experimentation with function parameters or control statements.

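A rough sketch of that layout, again with the built-in mtcars data rather than the course’s files; the commented-out lines are ready to be toggled back in one at a time:

  boxplot(mpg ~ cyl
          , data = mtcars
          , main = "Fuel Economy by Cylinder Count"
          , xlab = "Cylinders"
          , ylab = "Miles per gallon"
          # , col = "lightblue"
          # , notch = TRUE
          # , horizontal = TRUE
          )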

You rerun the entire block each time, simply commenting lines in or out as desired, and you quickly see the result.

On some occasions, Poulson started with every parameter commented out and reran the same block over and over, adding one parameter back each time. The boxplot changed one customization at a time, which made it much easier to see what each parameter does.


[Lynda.com] Learning Hadoop (Instructor: Lynn Langit)


The gap between being aware of the different components of the Hadoop ecosystem and actually understanding them is certainly not one you close overnight. This course from Lynda did a great job of helping me take a step toward closing it.

Langit covered common tools inside Hadoop like MapReduce, Hive, Pig and Oozie at such a level and pace that I felt real progress with my understanding by the minute!

In my experience as a student, whenever I hear “distribution” I think histograms, curves, and mounds. However, I’ve learned that a distribution in Hadoop refers to a vendor’s packaged bundle of the Hadoop ecosystem, typically with its own interface for working with the components. Cloudera, for instance, was often used to demonstrate concepts in this course and was easy to follow.

Some other intriguing distribution and visualization tools were mentioned throughout the course. I will be revisiting the tools listed below in greater depth in the future.

  • BigML
  • Datameer
  • Circos
  • D3.js



SEIZE THE DAY: to do the things one wants to do when there is the chance instead of waiting for a later time. (Merriam-Webster.com)

Conclusion

I remember feeling overwhelmed knowing there were so many data science courses on Lynda.com.

If anyone happens to feel the same way when browsing Lynda.com, the two courses above are safe bets as starting points for data analysis in R and Hadoop. Lynda.com also offers a 30-day free trial, which is what I used for all the courses I’ve blogged about over my four-week Christmas break, so there’s really no reason not to make use of the resources.

By leading by example, I hope to inspire at least one person to “seize the day,” even when heading into a weekend, a vacation or a break.

Just remember TGIF: The Grind Includes Fridays.

Student of Big Data science: Machine Learning & Business Intelligence Tools | Practitioner of Insurance Data Analytics
