Useful new R packages for data visualization and analysis (2024)

The following is from a hands-on session I led at the recent Computer Assisted Reporting conference.

There’s a lot of activity in R packages right now, thanks to a new R development package called htmlwidgets that makes it easy to write R wrappers for existing JavaScript libraries.

The first htmlwidgets-inspired package I want us to demo is

Leaflet for R.

If you’re not familiar with Leaflet, it’s a JavaScript mapping library. To install the R package, you need the devtools package to get it from GitHub (if you don’t already have devtools installed on your system, download and install it with install.packages("devtools")).

devtools::install_github("rstudio/leaflet")

Load the library

library("leaflet")

Step 1: Create a basic map object and add tiles

mymap <- leaflet()
mymap <- addTiles(mymap)

View the empty map by typing the object name:

mymap

Step 2: Set where you want the map to be centered and its zoom level

mymap <- setView(mymap, -84.3847, 33.7613, zoom = 17)
mymap

Add a pop-up

addPopups(-84.3847, 33.7616, 'Data journalists at work, NICAR 2015')

And now I’d like to introduce you to a somewhat new chaining function in R: %>%

This takes the result of one function and sends it to the next one, so you don’t have to keep repeating the name of the variable where you’re storing results, similar to the one-character Unix pipe command. We could compact the code above to:
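The idea is easy to see with plain functions. This sketch uses base R’s native |> pipe (available in R 4.1+), which behaves like %>% for simple cases, so it runs without any extra packages:

```r
# Nested calls read inside-out:
sum(sqrt(c(4, 9, 16)))          # sqrt gives 2, 3, 4; the sum is 9

# Piped calls read left to right; each result feeds the next function
# (magrittr's %>% works the same way for calls like these):
c(4, 9, 16) |> sqrt() |> sum()  # also 9
```

Either way you get the same answer; the pipe just lets you read the steps in the order they happen.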

mymap <- leaflet() %>%
  addTiles() %>%
  setView(-84.3847, 33.7613, zoom = 17) %>%
  addPopups(-84.3847, 33.7616, 'Data journalists at work, NICAR 2015')

View the finished product:

mymap

Or if you didn’t want to store the results in a variable for now but just work interactively:

leaflet() %>%
  addTiles() %>%
  setView(-84.3847, 33.7613, zoom = 16) %>%
  addPopups(-84.3847, 33.7616, 'Data journalists at work, NICAR 2015')

Now let’s do something a little more interesting – map nearby Starbucks locations. Load the starbucks.csv data set – See data source at: https://opendata.socrata.com/Business/All-Starbucks-Locations-in-the-US-Map/ddym-zvjk

Data files for these exercises are available on my NICAR15data repository on GitHub. You can also download the Starbucks data file directly from Socrata’s OpenData site in R with the code

download.file("https://opendata.socrata.com/api/views/ddym-zvjk/rows.csv?accessType=DOWNLOAD", destfile="starbucks.csv", method="curl")
starbucks <- read.csv("https://opendata.socrata.com/api/views/ddym-zvjk/rows.csv?accessType=DOWNLOAD", stringsAsFactors = FALSE)
str(starbucks)
atlanta <- subset(starbucks, City == "Atlanta" & State == "GA")
leaflet() %>%
  addTiles() %>%
  setView(-84.3847, 33.7613, zoom = 16) %>%
  addMarkers(data = atlanta, lat = ~ Latitude, lng = ~ Longitude, popup = atlanta$Name) %>%
  addPopups(-84.3847, 33.7616, 'Data journalists at work, NICAR 2015')

A script created by a TCU prof lets you create choropleth maps of World Bank data with a single line of code! More info here:

https://rpubs.com/walkerke/wdi_leaflet

We don’t have time to do more advanced work with Leaflet, but you can do considerably more sophisticated GIS work with Leaflet and R. More on that at the end of this post.

More info on the Leaflet project page https://rstudio.github.io/leaflet/

A little more fun with Starbucks data: How many people are there per Starbucks in each state? Let’s load in a file of state populations

statepops <- read.csv("acs2013_1yr_statepop.csv", stringsAsFactors = FALSE)
# A little glimpse at the dplyr library; lots more on that soon
library(dplyr)

There’s a very easy way to count Starbucks by state with dplyr’s count function; the format is count(mydataframe, mycolumnname)

starbucks_by_state <- count(starbucks, State)
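If you’re curious what count() is doing under the hood, base R’s table() gets you the same tally. This sketch uses a tiny invented stand-in for the Starbucks data so it runs on its own:

```r
# Hypothetical stand-in for the Starbucks data: one row per store
stores <- data.frame(State = c("GA", "GA", "WA", "WA", "WA", "VT"))

# dplyr: count(stores, State)
# Base R equivalent: tally with table(), then convert to a data frame
by_state <- as.data.frame(table(State = stores$State))
by_state  # one row per state, with a Freq column holding the counts
```

With the real data, count(starbucks, State) gives you the same counts, just with nicer column naming (n instead of Freq).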

We’ll need to add state population here. You can do that with base R’s merge or dplyr’s left_join; left_join is faster, but I find merge more intuitive.

starbucks_by_state <- merge(starbucks_by_state, statepops, all.x = TRUE, by.x = "State", by.y = "State") # No need for by.x and by.y if the columns have the same name
# better names
names(starbucks_by_state) <- c("State", "NumberStarbucks", "StatePopulation")
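To see what all.x = TRUE buys you, here’s a self-contained toy example (the state rows and population figures are just illustrative). Rows of x with no match in y are kept, with NA filled in, which is exactly a SQL-style left join:

```r
# x: counts by state; y: populations, deliberately missing one state
x <- data.frame(State = c("GA", "PR", "WA"), n = c(2, 1, 3))
y <- data.frame(State = c("GA", "WA"), Pop = c(9992167, 6971406))

# all.x = TRUE keeps every row of x; PR gets NA for Pop
m <- merge(x, y, all.x = TRUE)
m
```

Without all.x = TRUE, merge does an inner join and the PR row would silently disappear, which is usually not what you want when joining lookup data onto your main table.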

Add a new column to starbucks_by_state with dplyr’s mutate function, which just means altering the data frame by adding one or more columns. Then we’ll store the result in a new data frame, starbucks_data, so as not to muck with the original.

starbucks_data <- starbucks_by_state %>%
  mutate(PeoplePerStarbucks = round(StatePopulation / NumberStarbucks)) %>%
  select(State, NumberStarbucks, PeoplePerStarbucks) %>%
  arrange(desc(PeoplePerStarbucks))

Again the %>% operator, so we don’t have to keep writing things like

starbucks_data <- mutate(starbucks_by_state, PeoplePerStarbucks = round(StatePopulation / NumberStarbucks))
starbucks_data <- select(starbucks_data, State, NumberStarbucks, PeoplePerStarbucks)
starbucks_data <- arrange(starbucks_data, desc(PeoplePerStarbucks))

Can we pretend for a moment that doing a histogram of this data is meaningful :-)? Because I want to show you a cool new histogram tool in Hadley Wickham’s ggvis package, still under development:

library(ggvis)
starbucks_data %>% ggvis(x = ~PeoplePerStarbucks, fill := "gray") %>% layer_histograms()

Not a big deal? How about one with interactive sliders?

starbucks_data %>% ggvis(x = ~PeoplePerStarbucks, fill := "gray") %>%
  layer_histograms(width = input_slider(1000, 20000, step = 1000, label = "width"))
# Can even add a rollover tooltip
starbucks_data %>% ggvis(x = ~PeoplePerStarbucks, fill := "gray") %>%
  layer_histograms(width = input_slider(1000, 20000, step = 1000, label = "width")) %>%
  add_tooltip(function(df) (df$stack_upr_ - df$stack_lwr_))

Time Series Graphing

Load the needed libraries: dygraphs and xts. If they’re not on your system yet, first install them with

install.packages("dygraphs")
install.packages("xts")

To begin, let’s run some demo code with sample data sets already included with R: monthly male and female deaths from lung diseases in the UK from 1974 to 1979. The data sets are mdeaths and fdeaths.

First we’ll create a single object from the two of them with the cbind() – combine by column – function.

library("dygraphs")
library("xts")
lungDeaths <- cbind(mdeaths, fdeaths)
# And now here's how easy it is to create an interactive multi-series graph:
dygraph(lungDeaths)
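Before graphing, it’s worth a quick peek at what cbind() produced. Since mdeaths and fdeaths ship with base R’s datasets package, this runs anywhere, no downloads needed:

```r
# Combining two ts objects column-wise yields a multivariate time series ("mts")
lungDeaths <- cbind(mdeaths, fdeaths)

inherits(lungDeaths, "mts")  # TRUE: it's still a time series, just with 2 columns
dim(lungDeaths)              # 72 rows (6 years x 12 months), 2 columns
start(lungDeaths)            # 1974, period 1 (January)
end(lungDeaths)              # 1979, period 12 (December)
```

Because the result is still a time-series object, dygraph() can read the dates and series names straight off it, which is why the one-line graph call works.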

The most “complicated” thing about dygraphs is that it is specifically for time series graphing and requires a time series object. You can create one with the base R ts() function

ts(data vector, frequency of measurements per year, start date as c(year, month))
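Here’s a tiny concrete example of that pattern, with made-up monthly values (a real series would come from your data file). Two years of monthly data starting January 1990 should end in December 1991, and R tracks that for you:

```r
# 24 invented monthly values; frequency = 12 means monthly data
vals <- round(runif(24, min = 4, max = 8), 1)
x <- ts(vals, frequency = 12, start = c(1990, 1))

frequency(x)  # 12
start(x)      # 1990, period 1
end(x)        # 1991, period 12 - R computed the end date from the length
```

Once your data is a ts object like this, dygraph() knows how to place every point on the time axis.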

Read in a data file on Atlanta unemployment rates

atl_un <- read.csv("FRED-ATLA-unemployment.csv", stringsAsFactors = FALSE)
# now we need to convert this into a time series
atl_ts <- ts(data = atl_un$Unemployment, frequency = 12, start = c(1990, 1))
dygraph(atl_ts, main = "Monthly Atlanta Unemployment Rate")

More info on dygraphs: https://rstudio.github.io/dygraphs

Note: There is an existing package called quantmod that will pull a lot of financial and economic data for you and put it into xts format. It pulls data from the Federal Reserve Bank of St. Louis. I searched their website and found that the URL for Atlanta unemployment was

https://research.stlouisfed.org/fred2/series/ATLA013URN

which means the series code is ATLA013URN

library("quantmod")

This command

getSymbols("ATLA013URN", src="FRED")

automatically pulls the data into R in the right time-series format, storing it in a variable the same name as the symbol, in this case ATLA013URN. Then we can use dygraph:

dygraph(ATLA013URN, main="Atlanta Unemployment")

To change name of data column in the ATLA013URN time series:

names(ATLA013URN) <- "rate"

Now re-graph:

dygraph(ATLA013URN, main="Atlanta Unemployment")

Aside: Quantmod has its own data visualization if you’re just exploring:

chartSeries(ATLA013URN, subset="last 10 years")

Another very new package lets you do more exploring of FRED data from the Federal Reserve Bank of St. Louis; you need a free API key from the Federal Reserve site. More info here

https://github.com/jcizel/FredR

There’s another new package, rbokeh, implementing an R version of the Python Bokeh interactive Web plotting library. I’m going to skip this one since there’s so much else to go over, but I wanted you to know about it. It’s still under development but already well documented at

https://hafen.github.io/rbokeh/rd.html

One other htmlwidgets-inspired package:

library(DT)
datatable(atl_un)

Now that we’ve enjoyed some eye candy, I want to spend the rest of the session today on dplyr, a relatively new package by Hadley Wickham. Hadley is the author of a number of popular R packages including the ggplot2 visualization library.

The goal of dplyr is to offer fairly easy, rational data manipulation. Wickham talks about five basic, core things you want to do when manipulating data:

To choose only certain observations or rows by one or more criteria: filter()

To choose only certain variables or columns: select()

To sort: arrange()

To add new columns: mutate()

To summarize or otherwise analyze by subgroups: group_by() and summarise()

To apply a function to data by subgroups: group_by() and do()

There are other useful functions, such as the ranking functions top_n() (for the top n items in a group), min_rank() and dense_rank(), plus lead() and lag().
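If you already know base R, it may help to see rough counterparts to those verbs. This is just a sketch on a toy data frame (the column names are invented for illustration); the dplyr versions are shown in the comments:

```r
df <- data.frame(city = c("ATL", "ATL", "BOS"), delay = c(10, 30, 5))

df[df$city == "ATL", ]                  # filter(df, city == "ATL")
df[, "delay", drop = FALSE]             # select(df, delay)
df[order(-df$delay), ]                  # arrange(df, desc(delay))
transform(df, late = delay > 15)        # mutate(df, late = delay > 15)
aggregate(delay ~ city, data = df,
          FUN = mean)                   # group_by(city) %>% summarise(mean(delay))
```

Everything dplyr does is possible in base R; the appeal is that the dplyr versions are more readable and chain cleanly with %>%.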

dplyr creates a class of data frame called tbl_df that behaves largely like a regular data frame but has some convenience functionality, such as not accidentally printing out hundreds of rows when you type its name.

Hadley has a sample data package called nycflights13 for learning dplyr, but let’s see if we can load in my CSV file of domestic flights in and out of Georgia airports.

Note that only data for January through November 2014 was available when I downloaded this.

library(dplyr)
ga <- read.csv("GAontime.csv", stringsAsFactors = FALSE, header = TRUE)

NOTE: read.csv can take awhile to process large data files. In a hurry?

Use the data.table package’s fread function. data.table has its own object classes and own ecosystem of functions. If you’re not planning to use those (I don’t), just convert the object back to a data frame or dplyr tbl_df object

ga <- data.table::fread("GAontime.csv")
# We can turn this into a dplyr class tbl_df object with
ga <- tbl_df(ga)
# Now see what happens if you just type the variable name
ga
# We'll look at the structure:
str(ga)
# There's also a dplyr-specific function glimpse() with a slightly better format
glimpse(ga)
# Let's just get Hartsfield data. We want to filter for either ORIGIN or DEST being Hartsfield, with code ATL
atlanta <- filter(ga, ORIGIN == "ATL" | DEST == "ATL")

Now there are all sorts of questions we can answer with this data

What are the average, median and longest delays for flights to a specific place, by carrier? Feel free to pick the airport you’re flying to from Atlanta if it’s domestic; I’m going to use Boston’s Logan Airport.

bosdelays1 <- atlanta %>%
  filter(DEST == "BOS") %>%
  group_by(CARRIER) %>%
  summarise(
    avgdelay = mean(DEP_DELAY, na.rm = TRUE),
    mediandelay = median(DEP_DELAY, na.rm = TRUE),
    maxdelay = max(DEP_DELAY, na.rm = TRUE)
  )
bosdelays1
# Or just the average delay by airline to Boston?
avg_delays <- atlanta %>%
  filter(DEST == "BOS") %>%
  group_by(CARRIER) %>%
  summarise(avgdelay = mean(DEP_DELAY, na.rm = TRUE))
avg_delays
# What's the average delay by airline for each month to a specific destination?
avg_delays_by_month <- atlanta %>%
  filter(DEST == "BOS") %>%
  group_by(CARRIER, MONTH) %>%
  summarise(avgdelay = round(mean(DEP_DELAY, na.rm = TRUE), 1))
avg_delays_by_month
# Not as easy to see those; let's make a datatable:
datatable(avg_delays_by_month)

Miss Excel pivot tables? You can do them in R too!

First let’s get a subset of the data we want

bos_delays <- subset(atlanta, DEST == "BOS", select = c("CARRIER", "DEP_DELAY", "MONTH"))
library("rpivotTable")
rpivotTable(bos_delays)

Let’s select Average from the dropdown, and then DEP_DELAY. Drag CARRIER to the row box. Want to see average delay by month? Drag MONTH to the column header. Want more visuals? Select a heatmap by column.

But back to “regular” R….

What were the top 5 longest delays per airline?

delays <- atlanta %>%
  select(CARRIER, DEP_DELAY, DEST, FL_NUM, FL_DATE) %>% # columns I want
  group_by(CARRIER) %>%
  top_n(5, DEP_DELAY) %>%
  arrange(CARRIER, desc(DEP_DELAY))
View(delays)
# Which are the unlucky destinations in those top 5?
table(delays$DEST)
# What were the top 5 longest delays per destination?
delays2 <- atlanta %>%
  select(CARRIER, DEP_DELAY, DEST, FL_NUM, FL_DATE) %>% # columns I want
  group_by(DEST) %>%
  top_n(5, DEP_DELAY) %>%
  arrange(CARRIER, desc(DEP_DELAY))
View(delays2)
# Can do basics such as percentage of delayed flights by airline
# Can use either subset or the true dplyr way below
atlanta_delays1 <- subset(atlanta, select = c("CARRIER", "DEP_DEL15")) %>%
  group_by(CARRIER) %>%
  summarize(Percent = sum(DEP_DEL15, na.rm = TRUE) / n())
atlanta_delays2 <- atlanta %>%
  group_by(CARRIER) %>%
  summarize(
    Delays = sum(DEP_DEL15, na.rm = TRUE),
    Total = n(),
    Percent = round((Delays / Total) * 100, 1)
  ) %>%
  arrange(desc(Percent))
# and a basic bar chart of the percentages
library(ggplot2)
ggplot(data = atlanta_delays2, aes(x = CARRIER, y = Percent)) +
  geom_bar(stat = "identity")
# If you want to order by Percent and not alphabetically, plus add color and a title:
ggplot(data = atlanta_delays2, aes(x = reorder(CARRIER, Percent), y = Percent)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "black") +
  xlab("Airline") +
  ggtitle("Percent delayed flights from Atlanta Jan-Nov 2014")

Not a new package, but if you’re not familiar with googleVis and want to see the code – this generates an HTML page:

library("googleVis")
# get just the data we want - carrier and percent
delay_subset <- subset(atlanta_delays2, select = c("CARRIER", "Percent"))
gchart <- gvisColumnChart(delay_subset, options = list(title = "Percent ATL delays by carrier"))
plot(gchart)

Are there specific airplanes that flew in/out of Atlanta most often?

by_plane <- count(atlanta, TAIL_NUM) %>% arrange(desc(n))
# what's the distribution?
by_plane %>% ggvis(x = ~n, fill := "gray") %>%
  layer_histograms(width = input_slider(10, 200, step = 10, label = "binwidth"))
# How might delays be related to distances flown? Hadley shows this code:
by_tailnum <- group_by(atlanta, TAIL_NUM)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(DISTANCE, na.rm = TRUE),
  delay = mean(ARR_DELAY, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Hadley also found not much correlation between distance flown and delays in the NYC data.

The googleVis library may not be a new library, but it’s got some new options. Let’s read in some Atlanta weather data:

atlwx <- read.csv("AtlantaTemps.csv", stringsAsFactors = FALSE)
atlwx$date <- as.Date(atlwx$date, format = "%Y-%m-%d")
# Run this code and see what happens:
dataviz <- gvisCalendar(atlwx, datevar = "date", numvar = "max",
  options = list(title = "Daily high temps in Atlanta", width = 1000, height = 8000))
plot(dataviz)

We won’t run this now because everyone would need a plot.ly account and API key – both free. But with two lines of code you can turn a ggplot image into an interactive JavaScript visualization and embed that.

library("plotly")
# if you had a plot.ly API key installed:
myplotly <- plotly()
myplotly$ggplotly()

For more on R, see my Beginner’s Guide to R:

PDF download: https://cwrld.us/LearnRpdf
HTML version: https://cwrld.us/IntroToR

Other packages I would have liked us to work with if we had more time:

rvest: Easy Web scraping package by Hadley Wickham. Step-by-step instructions on using it with the Selectorgadget bookmarklet: https://bit.ly/1zgq8JW

FredR: More exploring of FRED data from the Federal Reserve Bank of St. Louis; you need a free API key from the Federal Reserve site. https://github.com/jcizel/FredR

plot.ly: Turn static ggplot2 graphics into interactive JavaScript visualizations easily embedded on the Web. Free plot.ly account and API key needed. https://plot.ly/ggplot2/

rbokeh: implements an R version of the Python bokeh interactive Web plotting library. It’s still under development but already well documented. https://hafen.github.io/rbokeh/rd.html

metricsgraphics: another interesting graphing project, interfacing to the MetricsGraphics.js D3 JavaScript library https://github.com/hrbrmstr/metricsgraphics

Additional things we could have done with more time using the packages we tried:

A script created by a TCU prof lets you create choropleth maps of World Bank data with Leaflet in a single line of code! More info here:

https://rpubs.com/walkerke/wdi_leaflet

You can do considerably more sophisticated GIS work with Leaflet and R.

Draw circles with a 2km radius around each marker, for example. Tutorial by TCU assistant prof Kyle Walker https://rpubs.com/walkerke/rstudio_gis

Tutorial on creating choropleth maps with your own shapefiles and data

https://rpubs.com/walkerke/leaflet_choropleth

More info about Leaflet on the Leaflet project page https://rstudio.github.io/leaflet/

Do you use the ggplot2 visualization package? Save typing — and syntax-lookup — time with our free ggplot2 code snippets.
