The aim of this article is to provide you with a quick look-up guide for your first step towards a data science project.
Before importing data, a data scientist needs to identify the relevant sources of data for the problem at hand. Data collection and data management are the foundation stones of any data-related project. Many enterprises have a dedicated data management team that continuously identifies different sources of data, then extracts, transforms and loads the data (a.k.a. ETL) into a central repository called a data warehouse.
This topic is enormous and, hence, out of scope for this article, but in my opinion it is a very important concept for any aspiring data scientist to understand.
Once you have identified your source of data, you can import it into R for further analysis. R has different import functions depending on your data file type (e.g., CSV, TXT, HTML, XLSX).
Importing TXT/CSV Files to R
Using Base Functions
The table below summarizes the base functions (i.e., no additional package installation required) for importing your data to R, based on file format.
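As a complement to the table, the short sketch below prints the default separator, decimal mark and header setting of the base readers straight from R. The choice of these five functions is my own selection of the readers that base R provides for delimited text; the variable names are only illustrative.

```r
# Base R readers for delimited text files (all live in the 'utils' package).
readers <- c("read.table", "read.csv", "read.csv2", "read.delim", "read.delim2")

# Collect each function's default separator, decimal mark and header setting.
defaults <- do.call(rbind, lapply(readers, function(fn) {
  args <- formals(match.fun(fn))
  data.frame(
    fun    = fn,
    sep    = args$sep,
    dec    = args$dec,
    header = args$header,
    stringsAsFactors = FALSE
  )
}))

print(defaults)
```

Reading the defaults this way also makes it easy to check what your own R version assumes before you import a file.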
Each of the functions in the table above comes with its own set of default argument values, which is what distinguishes it from the others. These arguments are:
- header: logical value. If TRUE, the function assumes your file has a header row. If that’s not the case, you can add the argument header = FALSE.
- fill: logical value. If TRUE, rows of unequal length are implicitly padded with blank fields.
- sep: the field separator character. For example, “\t” is used for a tab-delimited file.
- dec: the character used in the file for decimal points.
- stringsAsFactors: logical value. Set it to FALSE if you don’t want your text data to be converted to factors. Since R 4.0.0, FALSE is the default.
Watch Out! If you don’t explicitly set the above arguments, the function will assume the default argument values.
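To see these arguments working together, here is a minimal sketch that writes a small sample file and reads it back with every argument set explicitly. The file contents and column names are made up for illustration.

```r
# Write a small semicolon-separated sample file that uses decimal commas.
# The contents and column names are invented for this example.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "name;score;city",
  "Ann;1,5;Oslo",
  "Bob;2,25"          # short row: the 'city' field is missing
), path)

scores <- read.table(
  path,
  header = TRUE,             # first line holds the column names
  sep = ";",                 # fields are separated by semicolons
  dec = ",",                 # decimal comma instead of decimal point
  fill = TRUE,               # pad the short row with a blank field
  stringsAsFactors = FALSE   # keep text columns as character
)

str(scores)
```

Without fill = TRUE, read.table() would stop with an error on the short row, and without dec = "," the score column would be read as text rather than numbers.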
Within each of the above functions, you also need to specify either the file name (if it is on your local machine) or the URL (if the file is located on the web).
Reading a Local File
To locate a file on your machine, you can follow one of these approaches:
- Set your working directory to the folder containing your file with the command setwd(), passing the folder path as a string, and then provide just the file name in the import function.
- Use file.choose() within the import function. This lets you interactively choose your file from your machine.
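The two approaches can be sketched as follows. A temporary folder and a file named sample.csv stand in for your own data folder and file; both are assumptions made purely for illustration.

```r
# Create a sample CSV in a temporary folder (a stand-in for your own data folder).
data_dir <- tempdir()
write.csv(
  data.frame(id = 1:3, value = c(10, 20, 30)),
  file.path(data_dir, "sample.csv"),
  row.names = FALSE
)

# Approach 1: point the working directory at the folder, then use the bare file name.
old_wd <- setwd(data_dir)      # setwd() returns the previous working directory
from_wd <- read.csv("sample.csv")
setwd(old_wd)                  # restore the original working directory

# Approach 2: pick the file in a dialog; this only works in an interactive session.
if (interactive()) {
  picked <- read.csv(file.choose())
}
```

Restoring the previous working directory is optional, but it keeps the rest of your script from silently depending on the new location.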
Tip: read.table() is a general function that can read any file in table format, provided you set the arguments to match your file. The data is imported as a data frame. For example, if you have a text file with data fields separated by “|”, you can call read.table() with sep = "|".
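A minimal sketch of such a call is shown here. The sample file, its contents and the object name products are hypothetical; replace the path with your own file name.

```r
# Create a small pipe-separated sample file (contents invented for illustration).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "id|product|price",
  "1|pen|1.20",
  "2|notebook|3.50"
), path)

# Read it with the general-purpose reader, naming the separator explicitly.
products <- read.table(path, header = TRUE, sep = "|", stringsAsFactors = FALSE)

print(products)
```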
Using the readr Package
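The functions in this package are used in much the same way as the base functions. The readr package is often much faster (around 10 times) than the base functions and, hence, very useful with large TXT or CSV files. Its main import functions include read_delim(), read_csv() and read_tsv(). Key arguments are: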
- delim: the character that separates values in the data file.
- col_names: can be either TRUE (default value), FALSE or a character vector specifying column names. If TRUE, the first row of the input will be used as the column names.
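As an illustration of delim and col_names, the sketch below reads a pipe-separated file that has no header row and supplies the column names as a character vector. It assumes readr is already installed; the sample data and column names are made up.

```r
library(readr)  # install once with install.packages("readr")

# A small pipe-separated file without a header row (contents are invented).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "1|pen|1.20",
  "2|notebook|3.50"
), path)

# Supply the column names ourselves because the file has none.
products <- read_delim(
  path,
  delim = "|",
  col_names = c("id", "product", "price")
)

print(products)
```

If the file did have a header row, leaving col_names at its default of TRUE would pick the names up from the first line instead.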
As with the base functions, within each of the above functions you also need to specify either the file name (if it is on your local machine, or use file.choose()) or the URL (if the file is located on the web).
In this article, we looked at different ways to import TXT/CSV files into R depending on the data volume, file location and data separators. The base functions need no additional package, whereas for the more advanced functions you first need to install the package (readr in our case) and then load it with library() before using its functions.
In the next article we shall summarize the functions for importing other file types to R. Cheers!