Advanced Web Scraping with R

Chhaya Wagmi

  • Sep 15, 2020
  • 11 Min read
  • 7,887 Views

Data
Data Analytics
Languages and Libraries
R

Introduction

You've just finished watching a sci-fi movie in which the protagonist develops a humanoid robot that can hold a conversation with people and even express feelings just as humans do. Excited, you now want to build your own. But wait! Did you know that intelligence is built on information? How can you obtain that information?

Web scraping provides one path to such information. This guide gets you started by covering several angles of fetching data from the web using R.

Fetching Data from a Single Table or Multiple Tables on an HTML Webpage

Yahoo! Finance hosts stock market data for equities, commodities, futures, and more. Once you land on the site, search for "Pluralsight" or its ticker "PS" in the search box. This opens a webpage dedicated to Pluralsight's stock market data. Since the webpage gives you an upfront option to download the historical data, there is no need to scrape it. But what about the company's holders?

Click on the Holders tab, which will list out three sections:

  1. Major Holders
  2. Top Institutional Holders
  3. Top Mutual Fund Holders

Each of these sections consists of tabular data. To scrape these tables, use the rvest and xml2 libraries.

The following code gets the task done. Go through the comments to understand how each command works:

# --
# Import the rvest library
# It internally imports the xml2 library too
# --
library(rvest)


# --
# Store the link of the Holders tab in a variable, here link
# --
link <- "https://finance.yahoo.com/quote/PS/holders?p=PS"


# --
# Read the HTML webpage using the xml2 package function read_html()
# --
driver <- read_html(link)


# --
# Since we know there is tabular data on the webpage, we pass "table" as the CSS selector
# The variable "allTables" will hold all three tables
# --
allTables <- html_nodes(driver, css = "table")


# --
# Fetch any of the three tables based on its index
# 1. Major Holders
# --
majorHolders <- html_table(allTables)[[1]]
majorHolders

#       X1                                    X2
# 1   5.47%       % of Shares Held by All Insider
# 2 110.24%      % of Shares Held by Institutions
# 3 116.62%       % of Float Held by Institutions
# 4     275 Number of Institutions Holding Shares


# --
# 2. Top Institutional Holders
# --
topInstHolders <- html_table(allTables)[[2]]
topInstHolders

#                             Holder     Shares Date Reported  % Out       Value
# 1      Insight Holdings Group, Llc 18,962,692  Dec 30, 2019 17.99% 326,347,929
# 2                         FMR, LLC 10,093,850  Dec 30, 2019  9.58% 173,715,158
# 3       Vanguard Group, Inc. (The)  7,468,146  Dec 30, 2019  7.09% 128,526,792
# 4  Mackenzie Financial Corporation  4,837,441  Dec 30, 2019  4.59%  83,252,359
# 5               Crewe Advisors LLC  4,761,680  Dec 30, 2019  4.52%  81,948,512
# 6        Ensign Peak Advisors, Inc  4,461,122  Dec 30, 2019  4.23%  76,775,909
# 7         Riverbridge Partners LLC  4,021,869  Mar 30, 2020  3.82%  44,160,121
# 8          First Trust Advisors LP  3,970,327  Dec 30, 2019  3.77%  68,329,327
# 9       Fred Alger Management, LLC  3,875,827  Dec 30, 2019  3.68%  66,702,982
# 10 ArrowMark Colorado Holdings LLC  3,864,321  Dec 30, 2019  3.67%  66,504,964


# --
# 3. Top Mutual Fund Holders
# --
topMutualFundHolders <- html_table(allTables)[[3]]
topMutualFundHolders

#                                                           Holder    Shares Date Reported % Out      Value
# 1                 First Trust Dow Jones Internet Index (SM) Fund 3,964,962  Dec 30, 2019 3.76% 68,236,996
# 2                                     Alger Small Cap Focus Fund 3,527,274  Oct 30, 2019 3.35% 63,773,113
# 3  Fidelity Select Portfolios - Software & IT Services Portfolio 3,297,900  Jan 30, 2020 3.13% 63,946,281
# 4                         Vanguard Total Stock Market Index Fund 2,264,398  Dec 30, 2019 2.15% 38,970,289
# 5                                  Vanguard Small-Cap Index Fund 2,094,866  Dec 30, 2019 1.99% 36,052,643
# 6                                      Ivy Small Cap Growth Fund 1,302,887  Sep 29, 2019 1.24% 21,881,987
# 7                            Vanguard Small Cap Value Index Fund 1,278,504  Dec 30, 2019 1.21% 22,003,053
# 8                            Vanguard Extended Market Index Fund 1,186,015  Dec 30, 2019 1.13% 20,411,318
# 9       Franklin Strategic Series-Franklin Small Cap Growth Fund 1,134,200  Oct 30, 2019 1.08% 20,506,336
# 10                          Fidelity Stock Selector All Cap Fund 1,018,833  Jan 30, 2020 0.97% 19,755,171
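If you want to keep these tables for later analysis, you can write them to disk. Here is a minimal sketch using base R's write.csv(); the file names are illustrative:

# Persist each scraped table as a CSV file (file names are arbitrary)
write.csv(majorHolders, "major_holders.csv", row.names = FALSE)
write.csv(topInstHolders, "top_institutional_holders.csv", row.names = FALSE)
write.csv(topMutualFundHolders, "top_mutual_fund_holders.csv", row.names = FALSE)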

Fetching Different Nodes from a Webpage Using CSS Selector

You can learn about fetching data using CSS selectors from my blog, available on GitHub.
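As a quick refresher, rvest lets you target specific nodes with a CSS selector and extract their text. Below is a minimal sketch, assuming a Skill page whose course titles sit in elements with the class course-item__title (the same class used later in this guide); the URL is illustrative:

library(rvest)

# Read a Skill page (URL shown for illustration)
page <- read_html("https://www.pluralsight.com/browse/machine-learning")

# Select every node matching the CSS selector and extract its text
titles <- html_nodes(page, css = ".course-item__title") %>% html_text()
head(titles)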

Automatic Navigation to Multiple Pages and Fetching Entities

The above section helps you understand how to get entities when you have only one webpage devoted to a Skill. But Pluralsight has many more Skills than just Machine Learning. The image below shows the major skills taken from https://www.pluralsight.com/browse:

[Image: the Pluralsight Browse page listing the major skill categories]

You can observe that there are a total of 10 major skills, each with a distinct URL. Ignore the Browse all courses section, as it redirects back to the same webpage.

The objective here is to provide only one URL to the R program, here https://www.pluralsight.com/browse, and let the program automatically navigate to each of those 10 skill webpages and extract all course details as shown:

library(rvest)
library(stringr) # For data cleaning

link <- "https://www.pluralsight.com/browse"

driver <- read_html(link)

# Extract the sub-URLs
# Here, tile-box is a parent class that holds the content in a nested class
# First, go inside the child nodes using html_children(), then fetch the URL of each Skill page
subURLs <- html_nodes(driver, 'div.tile-box') %>% 
            html_children() %>% 
            html_attr('href')

# Remove NA values and the trailing `/browse` URL
subURLs <- subURLs[!is.na(subURLs)][1:10]

# Main URL - to complete the relative URLs above
mainURL <- "https://www.pluralsight.com"

# This function fetches the four entities you learned about in the previous section of this guide
entity <- function(s){
  
  # Course Title
  # Since the number of courses may differ from Skill to Skill,
  # the course names are fetched dynamically
  
  v <- html_nodes(s, "div.course-item__info") %>%
    html_children() 
  
  titles <- gsub("<|>", "", str_extract(v[!is.na(str_match(v, "course-item__title"))], ">.*<"))
  
  # Course Authors
  authors <- html_nodes(s, "div.course--item__list.course-item__author") %>% html_text()
  
  # Course Level
  level <- html_nodes(s, "div.course--item__list.course-item__level") %>% html_text()
  
  # Course Duration
  duration <- html_nodes(s, "div.course--item__list.course-item__duration") %>% html_text()
  
  # Creating the final data frame
  courses <- data.frame(titles, authors, level, duration)
  
  return(courses)
}


# A for loop that goes through all the URLs, fetches the entities, and displays them on the screen
for (i in 1:10) {
  subDriver <- read_html(paste(mainURL, subURLs[i], sep = ""))
  print(entity(subDriver))
}

In the above code, note the significance of html_children() and html_attr(): html_children() steps into the child nodes of each selected element, and html_attr() extracts the value of a named attribute such as href. The comments explain what each command does. The output will be similar to that of the previous section, produced once for each skill.
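To see these two helpers in isolation, here is a small, self-contained sketch on an inline HTML fragment (the markup is made up for illustration):

library(rvest)

# A tiny, made-up HTML fragment mimicking the tile-box structure
doc <- read_html('<div class="tile-box"><a href="/browse/data">Data</a><a href="/browse/it-ops">IT Ops</a></div>')

# html_children() returns the child nodes of the selected <div>
children <- html_nodes(doc, 'div.tile-box') %>% html_children()

# html_attr() extracts the value of a named attribute from each node
html_attr(children, 'href')
# [1] "/browse/data"   "/browse/it-ops"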

Controlling Browser from R and Scraping Data

Suppose you want to scrape the latest Google News about Pluralsight. Manually, you would open www.google.com, search for the keyword "Pluralsight", and click on News.

What if all these steps could be automated and you could fetch the latest news with just a small R script?

Note: Before proceeding with the R code, follow these steps to set up Docker on your system:

  1. Download and install Docker.
  2. Open the Docker terminal and run docker pull selenium/standalone-chrome. Replace chrome with firefox if you're a Firefox user.
  3. Then run docker run -d -p 4445:4444 selenium/standalone-chrome.
  4. If the above two commands succeed, run docker-machine ip and note the IP address to be used in the R code (a quick connectivity check follows this list).
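Before writing the full script, you can verify that R can reach the Selenium server. A minimal check, assuming the container maps port 4445 and docker-machine reported 192.168.99.100:

library(RSelenium)

# Connect to the Selenium server running inside the Docker container
# (replace the IP address with the one docker-machine reported)
remDr <- remoteDriver(browserName = "chrome", remoteServerAddr = "192.168.99.100", port = 4445L)
remDr$open(silent = TRUE)

# getStatus() returns server details if the connection works
remDr$getStatus()

remDr$close()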

Given below is an elaborated script using the RSelenium library:

library(RSelenium)

# Initiate the connection; remember that remoteServerAddr needs to be replaced
# with the IP address you received from the Docker terminal
driver <- remoteDriver(browserName = "chrome", remoteServerAddr = "192.168.99.100", port = 4445L)
driver$open()


# Provide the URL and let the driver load it for you
driver$navigate("https://www.google.com/")


# Google's search box has the attribute name="q". Find this element.
init <- driver$findElement(using = 'name', "q")

# Enter the search keyword and hit the Enter key
init$sendKeysToElement(list("Pluralsight", key = "enter"))


# We have now landed on the page with the "All" results for Pluralsight. Select the XPath of the News tab and click it.
News_tab <- driver$findElement(using = 'xpath', "//*[@id=\"hdtb-msb-vis\"]/div[2]/a")
News_tab$clickElement()


# You are now on the News results. Select the CSS selector for all the news links (here, a.l)
# Note that you have to use findElements (with an s), not findElement; the latter returns only one result.
res <- driver$findElements(using = 'css selector', 'a.l')

# List out the latest headlines
headlines <- unlist(lapply(res, function(x){x$getElementText()}))
headlines

# [1] "Pluralsight has free courses to help you learn Microsoft Azure ..."
# [2] "Pluralsight offers free access to full portfolio of skill ..."     
# [3] "Will Pluralsight Continue to Surge Higher?"                        
# [4] "The CEO of Pluralsight explains why the online tech skills ..."    
# [5] "Pluralsight Is Free For the Month of April"                        
# [6] "This Pluralsight deal lets you learn some new skills from home ..."
# [7] "Pluralsight One Commits Over $1 Million to Strategic Nonprofit ..."
# [8] "Pluralsight Announces First Quarter 2020 Results"                  
# [9] "Pluralsight Announces Date for its First Quarter 2020 Earnings ..."
# [10] "Learn Adobe Photoshop, Microsoft Excel, Python for free with ..."
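The script above leaves the browser session open. Once you have the headlines, it is good practice to close the session; a small addition:

# Close the remote browser session when you are finished scraping
driver$close()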

Conclusion

You have learned how to fetch data directly from tables using CSS selectors, automatically navigate through multiple pages to retrieve information, and control a web browser from an R script with the RSelenium library.