Files

A comma-separated values (CSV) file is a delimited text file that generally uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by the delimiter. CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. R makes it easy to export and import data in CSV format.

Local Files

Export data to a CSV file

data("mtcars")                          # load the mtcars dataset
write.csv(mtcars, file = 'mtcars.csv')  # export to file

Import data from a CSV file

x <- read.csv('mtcars.csv')             # read file 
head(x)                                 # print data
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
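The unnamed first column X above holds the row names of mtcars, which write.csv() stores by default. A minimal sketch, using a temporary file, of how to restore them on import with the row.names argument:

```r
# write.csv() saves row names as an unnamed first column by default;
# read.csv(row.names = 1) turns that column back into row names
f <- tempfile(fileext = '.csv')
write.csv(mtcars, file = f)
x <- read.csv(f, row.names = 1)
head(rownames(x))   # "Mazda RX4" "Mazda RX4 Wag" ...
unlink(f)           # clean up the temporary file
```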

Remote Files

Some data providers offer data in CSV format on their websites. One of these is STOXX, a financial index provider. Open this link for the EURO STOXX 50 Index: the Data -> Historical Data tab provides some freely available files with historical prices. Clicking on EUR Price will open this link. The read.csv() function can read this file directly from the internet.

# read.csv is very flexible. For the full list of arguments type ?read.csv
x <- read.csv('https://www.stoxx.com/document/Indices/Current/HistoricalData/h_3msx5e.txt', sep = ';') 
head(x)
##         Date Symbol Indexvalue  X
## 1 22.03.2019   SX5E    3305.73 NA
## 2 25.03.2019   SX5E    3300.48 NA
## 3 26.03.2019   SX5E    3319.53 NA
## 4 27.03.2019   SX5E    3322.04 NA
## 5 28.03.2019   SX5E    3320.29 NA
## 6 29.03.2019   SX5E    3351.71 NA
rownames(x) <- as.Date(x[,1], format = '%d.%m.%Y') # assign rownames
x[,c(1,ncol(x))] <- NULL                           # drop the first and last column
head(x)                                            # print data
##            Symbol Indexvalue
## 2019-03-22   SX5E    3305.73
## 2019-03-25   SX5E    3300.48
## 2019-03-26   SX5E    3319.53
## 2019-03-27   SX5E    3322.04
## 2019-03-28   SX5E    3320.29
## 2019-03-29   SX5E    3351.71
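Since the code above depends on a remote file, here is a minimal offline sketch of the same cleanup on an in-memory string (the two sample rows are taken from the output above). The trailing semicolon in each line is what produces the empty X column:

```r
# two sample rows in the same semicolon-separated layout
txt <- "Date;Symbol;Indexvalue;
22.03.2019;SX5E;3305.73;
25.03.2019;SX5E;3300.48;"
x <- read.csv(text = txt, sep = ';')
rownames(x) <- as.Date(x$Date, format = '%d.%m.%Y')  # assign rownames
x$Date <- NULL                                       # drop the date column
x$X <- NULL                                          # drop the empty column
x
```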

R Packages

The ‘quantmod’ Package

The quantmod package provides a convenient function for downloading financial data from the web. This function is called getSymbols and works with a variety of sources.

# install the package
install.packages('quantmod')
# load the package
require(quantmod)

For stocks and shares, the yahoo source is used. Symbols can be found here.

# retrieve Facebook quotes
x <- getSymbols(Symbols = 'FB', src = 'yahoo', auto.assign = FALSE)   
tail(x)
##            FB.Open FB.High FB.Low FB.Close FB.Volume FB.Adjusted
## 2019-06-14  180.51  181.84 180.00   181.33  16773700      181.33
## 2019-06-17  185.01  189.50 184.41   189.01  29459900      189.01
## 2019-06-18  194.00  194.53 187.28   188.47  37571400      188.47
## 2019-06-19  187.00  188.10 184.55   187.48  21417100      187.48
## 2019-06-20  190.95  191.16 187.64   189.53  14635700      189.53
## 2019-06-21  188.75  192.00 188.75   191.14  22663200      191.14

For currencies and metals, the oanda source is used. Symbols are the instruments’ ISO codes separated by /. ISO codes can be found here.

# retrieve the historical euro/dollar exchange rate
x <- getSymbols(Symbols = 'EUR/USD', src = 'oanda', auto.assign = FALSE)   
tail(x)
##             EUR.USD
## 2019-06-17 1.122164
## 2019-06-18 1.120698
## 2019-06-19 1.120980
## 2019-06-20 1.128537
## 2019-06-21 1.132506
## 2019-06-22 1.136910

For economic series, the FRED source is used. Symbols can be found here.

# retrieve the historical Gross Domestic Product for Japan
x <- getSymbols(Symbols = 'JPNNGDP', src = 'FRED', auto.assign = FALSE)   
tail(x)
##             JPNNGDP
## 2017-10-01 549849.1
## 2018-01-01 548682.4
## 2018-04-01 550560.5
## 2018-07-01 546999.6
## 2018-10-01 549735.0
## 2019-01-01 554340.1

RESTful APIs

An Application Programming Interface (API) is essentially a messenger that takes a request, tells a system what you want to do, and returns the response to you. A RESTful API is an API that uses HTTP requests to GET, PUT, POST and DELETE data. The httr R package is a useful tool for working with HTTP. Each API has its own specific usage and documentation.

# install the package
install.packages('httr')
# load the package
require(httr)

CRAN downloads

The CRAN downloads database provides an API. Documentation is available here

Example. Which was the most downloaded package in the last month?

baseurl <- 'https://cranlogs.r-pkg.org/'        # API base url. See documentation
endpoint <- 'top/'                              # API endpoint. See documentation
period <- 'last-month/'                         # API parameter. See documentation
count <- 1                                      # API parameter. See documentation
url <- paste0(baseurl, endpoint, period, count) # build full url 
x <- GET(url)                                   # retrieve url
data <- content(x)                              # extract data
data                                            # print data
## $start
## [1] "2019-05-24T00:00:00.000Z"
## 
## $end
## [1] "2019-06-22T00:00:00.000Z"
## 
## $downloads
## $downloads[[1]]
## $downloads[[1]]$package
## [1] "Rcpp"
## 
## $downloads[[1]]$downloads
## [1] "896438"

The most downloaded package between 2019-05-24 and 2019-06-22 was Rcpp with a total of 896438 downloads.
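Note that content() returns the values as strings inside nested lists, so it can be convenient to flatten them into a data frame with numeric counts. A sketch on a hand-built list mimicking the structure shown above:

```r
# nested list mimicking the structure returned by content() above
data <- list(downloads = list(list(package = "Rcpp", downloads = "896438")))
# bind each download entry into a row of a data frame
df <- do.call(rbind, lapply(data$downloads, as.data.frame,
                            stringsAsFactors = FALSE))
df$downloads <- as.numeric(df$downloads)   # convert counts to numeric
df
```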

KuCoin API

KuCoin, a cryptocurrency exchange, provides an API. Documentation is available here

Example. Retrieve and plot the Bitcoin price at one-minute intervals over the last 24 hours.

# set GMT timezone. See documentation
Sys.setenv(TZ='GMT')                        
# API base url. See documentation
baseurl <- 'https://api.kucoin.com'  
# API endpoint. See documentation
endpoint <- '/api/v1/market/candles'    
# today and yesterday in seconds
today <- as.integer(as.numeric(Sys.time()))  
yesterday <- today - 24*60*60
# API parameters. See documentation
param <- c(symbol = 'BTC-USDT', type = '1min', startAt = yesterday, endAt = today)
# build full url. See documentation
url <- paste0(baseurl, endpoint, '?', paste(names(param), param, sep = '=', collapse = '&')) 
# retrieve url
x <- GET(url)    
# extract data
x <- content(x)      
data <- x$data
# formatting
data <- sapply(seq_along(data), function(i) {
  # extract single candle
  candle <- as.numeric(data[[i]])
  # formatting. See documentation
  return( c(time = candle[1], open = candle[2], close = candle[3], high = candle[4], low = candle[5]) )
})
# convert to xts (xts is loaded together with quantmod above; otherwise run require(xts))
datetime <- as.POSIXct(data[1,], origin = '1970-01-01')
data <- xts(t(data[-1,]), order.by = datetime)
# plot closing values
plot(data$close, main = 'Bitcoin price in dollars')
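Since the request above needs live network access, here is an offline sketch of just the formatting step, applied to two made-up candles in the same field order the code assumes (time, open, close, high, low; the price values are invented):

```r
# two made-up candles as returned in x$data (all values are strings)
raw <- list(list('1561161600', '10000', '10010', '10020', '9990'),
            list('1561161660', '10010', '10005', '10015', '10000'))
m <- sapply(seq_along(raw), function(i) {
  candle <- as.numeric(unlist(raw[[i]]))
  c(time = candle[1], open = candle[2], close = candle[3],
    high = candle[4], low = candle[5])
})
# first row holds the epoch seconds, the remaining rows the prices
as.POSIXct(m['time', ], origin = '1970-01-01', tz = 'GMT')
```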

Web Scraping

Web scraping is a technique for converting data presented in an unstructured format (HTML tags) on the web into a structured format that can easily be accessed and used. The rvest package is a useful tool for scraping information from web pages.

# install the package
install.packages('rvest')
# load the package
require(rvest)

Example. Write a function to retrieve articles from Google Scholar given a generic query string q.

getArticles <- function(q){
  # build url
  url <- paste0('https://scholar.google.com/scholar?hl=en&q=', q)
  # sanitize url
  url <- URLencode(url)
  # get results
  res <- read_html(url) %>%           # get url
    html_nodes('div.gs_ri h3 a') %>%  # select titles by css selector 
    html_text()                       # extract text
  # return results
  return(res)
}
# retrieve articles about web scraping in r
getArticles('web scraping in r')
##  [1] "Automated data collection with R: A practical guide to web scraping and text mining"                                          
##  [2] "Web Scraping with Python: Collecting More Data from the Modern Web"                                                           
##  [3] "A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research."
##  [4] "Web Scraping and Naïve Bayes Classification for Job Search Engine"                                                            
##  [5] "The use of web-scraping software in searching for grey literature"                                                            
##  [6] "Web scraping technologies in an API world"                                                                                    
##  [7] "RCrawler: An R package for parallel web crawling and scraping"                                                                
##  [8] "Web scraping with Python"                                                                                                     
##  [9] "Exploiting web scraping in a collaborative filtering-based approach to web advertising."                                      
## [10] "Programming by a sample: rapidly creating web applications with d. mix"

Code Download

Download the full code to generate this document and reproduce the examples. The file is in R Markdown, a format for creating dynamic documents with R. An R Markdown document is written in markdown, an easy-to-write plain-text format, and contains chunks of embedded R code.
Download

Exercise: create the PDF version of this web page
Hint: download the file above and have a look at the introductory 1-minute video of the official R Markdown guide

Find more R tutorials here

by Emanuele Guidotti