You finished watching a sci-fi movie in which the protagonist develops a humanoid robot that can hold a conversation with people and even express its feelings just like humans do. You got excited and now want to build your own. But wait! Did you know that intelligence is built on information? How can you obtain that information?
Web scraping provides one path to such information. To get you started, this guide walks through different ways of fetching data from the web using R.
Yahoo! Finance provides stock market data for equities, commodities, futures, and more. Once you land on the site, search for "Pluralsight" or its ticker "PS" in the search box. This opens a webpage dedicated to Pluralsight's stock market data. Since the webpage provides an upfront option to download the historical data, there is no need to scrape it. But what about the company's holders?
Click on the Holders tab, which lists three sections: Major Holders, Top Institutional Holders, and Top Mutual Fund Holders. Each of these sections consists of tabular data. To scrape these tables, use the rvest and xml2 libraries.
The following code gets the task done. Go through the comments to understand how each command works:
# --
# Importing the rvest library
# It internally imports the xml2 library too
# --
library(rvest)


# --
# Store the link of the Holders tab in a variable, here "link"
# --
link <- "https://finance.yahoo.com/quote/PS/holders?p=PS"


# --
# Read the HTML webpage using the xml2 package function read_html()
# --
driver <- read_html(link)


# --
# Since we know there is tabular data on the webpage, we pass "table" as the CSS selector
# The variable "allTables" will hold all three tables
# --
allTables <- html_nodes(driver, css = "table")


# --
# Fetch any of the three tables based on their index
# 1. Major Holders
# --
majorHolders <- html_table(allTables)[[1]]
majorHolders

# X1 X2
# 1 5.47% % of Shares Held by All Insider
# 2 110.24% % of Shares Held by Institutions
# 3 116.62% % of Float Held by Institutions
# 4 275 Number of Institutions Holding Shares


# --
# 2. Top Institutional Holders
# --
topInstHolders <- html_table(allTables)[[2]]
topInstHolders

# Holder Shares Date Reported % Out Value
# 1 Insight Holdings Group, Llc 18,962,692 Dec 30, 2019 17.99% 326,347,929
# 2 FMR, LLC 10,093,850 Dec 30, 2019 9.58% 173,715,158
# 3 Vanguard Group, Inc. (The) 7,468,146 Dec 30, 2019 7.09% 128,526,792
# 4 Mackenzie Financial Corporation 4,837,441 Dec 30, 2019 4.59% 83,252,359
# 5 Crewe Advisors LLC 4,761,680 Dec 30, 2019 4.52% 81,948,512
# 6 Ensign Peak Advisors, Inc 4,461,122 Dec 30, 2019 4.23% 76,775,909
# 7 Riverbridge Partners LLC 4,021,869 Mar 30, 2020 3.82% 44,160,121
# 8 First Trust Advisors LP 3,970,327 Dec 30, 2019 3.77% 68,329,327
# 9 Fred Alger Management, LLC 3,875,827 Dec 30, 2019 3.68% 66,702,982
# 10 ArrowMark Colorado Holdings LLC 3,864,321 Dec 30, 2019 3.67% 66,504,964


# --
# 3. Top Mutual Fund Holders
# --
topMutualFundHolders <- html_table(allTables)[[3]]
topMutualFundHolders

# Holder Shares Date Reported % Out Value
# 1 First Trust Dow Jones Internet Index (SM) Fund 3,964,962 Dec 30, 2019 3.76% 68,236,996
# 2 Alger Small Cap Focus Fund 3,527,274 Oct 30, 2019 3.35% 63,773,113
# 3 Fidelity Select Portfolios - Software & IT Services Portfolio 3,297,900 Jan 30, 2020 3.13% 63,946,281
# 4 Vanguard Total Stock Market Index Fund 2,264,398 Dec 30, 2019 2.15% 38,970,289
# 5 Vanguard Small-Cap Index Fund 2,094,866 Dec 30, 2019 1.99% 36,052,643
# 6 Ivy Small Cap Growth Fund 1,302,887 Sep 29, 2019 1.24% 21,881,987
# 7 Vanguard Small Cap Value Index Fund 1,278,504 Dec 30, 2019 1.21% 22,003,053
# 8 Vanguard Extended Market Index Fund 1,186,015 Dec 30, 2019 1.13% 20,411,318
# 9 Franklin Strategic Series-Franklin Small Cap Growth Fund 1,134,200 Oct 30, 2019 1.08% 20,506,336
# 10 Fidelity Stock Selector All Cap Fund 1,018,833 Jan 30, 2020 0.97% 19,755,171
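One detail to keep in mind: html_table() returns the Shares, Value, and % Out figures as character strings with comma separators and percent signs. Below is a minimal cleanup sketch, assuming the topInstHolders data frame created above and the column names shown in its output:

# Hypothetical helper: strip the thousands separators and convert to numeric
toNumber <- function(x) as.numeric(gsub(",", "", x))

topInstHolders$Shares <- toNumber(topInstHolders$Shares)
topInstHolders$Value  <- toNumber(topInstHolders$Value)

# The "% Out" column holds values such as "17.99%"; drop the sign before converting
topInstHolders$`% Out` <- as.numeric(sub("%", "", topInstHolders$`% Out`))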
You can learn more about fetching data using CSS selectors from my blog, available on GitHub.
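As a quick illustration of the same idea, you can point rvest at a single element with a more specific CSS selector instead of collecting every table. The selector used below ("h1") is only a placeholder; inspect the page to find the selector you actually need:

library(rvest)

# Read the page once
page <- read_html("https://finance.yahoo.com/quote/PS/holders?p=PS")

# Select one element by a CSS selector and pull out its text
heading <- html_text(html_node(page, css = "h1"))
heading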
Suppose you want to scrape the latest Google News results about Pluralsight. Manually, you would open www.google.com, search for the keyword "Pluralsight," and click on News.
What if all these steps could be automated, and you could fetch the latest news with a small R script?
Note: Before proceeding with the R code, make sure you follow these steps to set up Docker on your system:
1. Pull the Selenium image with docker pull selenium/standalone-chrome. Replace chrome with firefox if you're a Firefox user.
2. Start the container with docker run -d -p 4445:4444 selenium/standalone-chrome
3. Run docker-machine ip and note the IP address to be used in the R code.
Given below is an elaborated script using the RSelenium library:
library(RSelenium)

# Initiate the connection; remember that remoteServerAddr needs to be replaced with the IP address you
# received from the Docker terminal
driver <- remoteDriver(browserName = "chrome", remoteServerAddr = "192.168.99.100", port = 4445L)
driver$open()


# Provide the URL and let the driver load it for you
driver$navigate("https://www.google.com/")


# The Google search box has the attribute name="q". Find this element.
init <- driver$findElement(using = 'name', "q")

# Enter the search keyword and hit the Enter key
init$sendKeysToElement(list("Pluralsight", key = "enter"))


# We have now landed on the page with the "All" results for Pluralsight. Select the XPath of the News tab and click it.
News_tab <- driver$findElement(using = 'xpath', "//*[@id=\"hdtb-msb-vis\"]/div[2]/a")
News_tab$clickElement()


# You are now on the News results. Select the CSS selector for all the news headlines (here, a.l)
# Note that you have to use findElements (with an s), not findElement; the latter returns only one result.
res <- driver$findElements(using = 'css selector', 'a.l')

# List out the latest headlines
headlines <- unlist(lapply(res, function(x){x$getElementText()}))
headlines

# [1] "Pluralsight has free courses to help you learn Microsoft Azure ..."
# [2] "Pluralsight offers free access to full portfolio of skill ..."
# [3] "Will Pluralsight Continue to Surge Higher?"
# [4] "The CEO of Pluralsight explains why the online tech skills ..."
# [5] "Pluralsight Is Free For the Month of April"
# [6] "This Pluralsight deal lets you learn some new skills from home ..."
# [7] "Pluralsight One Commits Over $1 Million to Strategic Nonprofit ..."
# [8] "Pluralsight Announces First Quarter 2020 Results"
# [9] "Pluralsight Announces Date for its First Quarter 2020 Earnings ..."
# [10] "Learn Adobe Photoshop, Microsoft Excel, Python for free with ..."
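If you also need the URL behind each headline, or want to end the browser session cleanly, the element and driver objects from the script above can be reused. A minimal sketch, assuming res and driver still exist in your session:

# Extract the href attribute of every headline element
links <- unlist(lapply(res, function(x){x$getElementAttribute("href")}))
links

# Close the browser session once you are done
driver$close()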
You have learned how to fetch data directly from tables using CSS selectors, automate navigation across multiple pages to retrieve information, and control a web browser from a script with the RSelenium library.