DATA SCIENCE
A detailed guide for web scraping https://finance.yahoo.com using Requests, BeautifulSoup, Selenium, HTML tags, and embedded JSON data.
Vinod Dhole · Follow
17 min read · Mar 18, 2022
Table Of Contents
· Introduction
∘ What is “web scraping”?
∘ Objective
∘ The problem statement
∘ Prerequisites
∘ How to run the Code
∘ Setup and Tools
· 1. Web Scraping Stock Market News
∘ 1.1 Download & Parse web page using Requests & BeautifulSoup
∘ 1.2 Exploring and locating Elements
∘ 1.3 Extract & Compile the information into a Python list
∘ 1.4 Save the extracted information to a CSV file
· 2. Web Scraping Cryptocurrencies
∘ 2.1 Introduction of selenium
∘ 2.2 Download & Set-up
∘ 2.3 Install & Import libraries
∘ 2.4 Create Web Driver
∘ 2.5 Exploring and locating Elements
∘ 2.6 Extract & Compile the information into a Python list
∘ 2.7 Save the extracted information to a CSV file
· 3. Web Scraping Market Events Calendar
∘ 3.1 Install & Import libraries
∘ 3.2 Download & Parse web page
∘ 3.3 Get Embedded Json data
∘ 3.4 Locating Json Keys
∘ 3.5 Pagination & Compiling the information in a Python list
∘ 3.6 Save the extracted information to a CSV file
· References
· Future Work
· Conclusion
What is “web scraping”?
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It’s a useful technique for creating datasets for research and learning.
Objective
The main objective of this tutorial is to showcase different web scraping methods that can be applied to any web page. This is for educational purposes only. Please read the terms and conditions carefully for any website to see whether you can legally use the data.
In this project, we will perform web scraping using the following 3 techniques:
- Use Requests, BeautifulSoup, and HTML tags to extract web page content.
- Use Selenium to scrape data from dynamically loading websites.
- Use embedded JSON data to scrape the website.
The problem statement
1. Web Scraping Stock Market News (URL: https://finance.yahoo.com/topic/stock-market-news/)
This web page shows the latest news related to the stock market. We will try to extract data from this web page and store it in a CSV (comma-separated values) file. The file layout would be as mentioned below.
2. Web Scraping Cryptocurrencies (URL: https://finance.yahoo.com/cryptocurrencies)
This Yahoo! Finance web page shows a list of trending cryptocurrencies in table format. We will perform web scraping to retrieve the first 10 columns for the top 100 cryptocurrencies in the following CSV format.
3. Web Scraping Market Events Calendar (URL: https://finance.yahoo.com/calendar)
This page shows date-wise market events. Users can select the date and choose any one of the following market events: Earnings, Stock Splits, Economic Events & IPO. Our aim is to create a script that can be run for any single date and market event, grabbing the data and loading it into CSV format as shown below.
Prerequisites
- Basic knowledge of Python
- Basic knowledge of HTML (helpful, but not strictly necessary)
How to run the Code
You can execute the code using “Google Colab” or “Run Locally”
The code is available on Github: https://github.com/vinodvidhole/yahoo-finance-scraper
Setup and Tools
- Run on Google Colab: You will need to sign in with your Google account.
- Run on Local Machine: Download and install the Anaconda framework; we will use Jupyter Notebook for writing & executing the code.
1. Web Scraping Stock Market News
In this section, we will learn a basic Python web scraping technique using Requests, BeautifulSoup, and HTML tags. The objective here is to perform web scraping of Yahoo! Finance Stock Market News.
Let’s kick-start with the first objective. Here’s an outline of the steps we’ll follow.
1.1 Download & Parse web page using Requests & BeautifulSoup
1.2 Exploring and locating Elements
1.3 Extract & Compile the information into a Python list
1.4 Save the extracted information to a CSV file
1.1 Download & Parse webpage using Requests and BeautifulSoup
The first step is to install the requests & beautifulsoup4 libraries using pip.
To download the web page, we can use the requests.get() function, which returns a response object. This object contains the data from the web page and some other information.
The response.ok & response.status_code attributes can be used for error trapping & tracking. We can get the contents of the page using response.text.
Finally, we can use BeautifulSoup to parse the HTML data. This returns a bs4.BeautifulSoup object, which lets us get hold of the required data with the help of the different methods offered by BeautifulSoup. We are going to learn some of these methods in the next subsection.
Let’s put all this together into a function.
Calling the get_page function and analyzing the output:
You can access different properties, data, and images of an HTML web page using methods like .find(), .find_all(), etc. The following example will display the title of the web page.
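The notebook's code cells are not embedded here, so below is a minimal sketch of what such a helper could look like (the 'html.parser' choice and the exception message are assumptions, not taken from the original notebook):

```python
import requests
from bs4 import BeautifulSoup

def get_page(url):
    """Download a web page and return it as a parsed BeautifulSoup object."""
    response = requests.get(url)
    # response.ok / response.status_code are used for error trapping & tracking
    if not response.ok:
        raise Exception(f'Failed to load page {url}: status {response.status_code}')
    return BeautifulSoup(response.text, 'html.parser')

# Displaying the title works the same on any parsed document, e.g.:
doc = BeautifulSoup('<html><head><title>Yahoo Finance</title></head></html>',
                    'html.parser')
print(doc.find('title').text)
```

Calling `get_page('https://finance.yahoo.com/topic/stock-market-news/')` would return the parsed page, on which `.find('title')` works the same way.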
We can use the get_page function to download any web page and parse it using Beautiful Soup.
1.2 Exploring and locating Elements
Now it’s time to explore the elements linked to the required data points on the web page. Web pages are written in a language called HTML (Hyper Text Markup Language). HTML is a fairly simple language made up of tags (also called nodes or elements), e.g. <a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>. An HTML tag has three parts:
- Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
- Attributes: (href, target, class, id, etc.) Properties of the tag, used by the browser to customize how a tag is displayed and to decide what happens on user interactions.
- Children: A tag can contain some text or other tags or both between the opening and closing segments, e.g., <div>Some content</div>.
Let’s inspect the source code of the web page by right-clicking → selecting the “Inspect” option. First, we need to identify the tag which represents the news listing.
In this case, we can see that the <div> tag with the class name "Ov(h) Pend(44px) Pstart(25px)" represents a news listing. We can use the find_all function to grab this information.
The total number of elements in the <div> tag list matches the number of news items displayed on the webpage, so we are heading in the right direction.
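A hedged sketch of this lookup (the class name is the one shown above; Yahoo's generated class names change over time, so verify it against the live page before relying on it):

```python
from bs4 import BeautifulSoup

# Class name observed on the page at the time of writing; treat it as a
# snapshot rather than a stable identifier.
NEWS_DIV_CLASS = 'Ov(h) Pend(44px) Pstart(25px)'

def get_news_tags(doc):
    """Return the list of <div> tags that represent news listings."""
    # BeautifulSoup matches the exact string value of the class attribute here
    return doc.find_all('div', {'class': NEWS_DIV_CLASS})

# Offline demonstration on a synthetic snippet:
html = ('<div class="Ov(h) Pend(44px) Pstart(25px)">news 1</div>'
        '<div class="Ov(h) Pend(44px) Pstart(25px)">news 2</div>')
doc = BeautifulSoup(html, 'html.parser')
print(len(get_news_tags(doc)))  # number of matching news <div> tags
```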
The next step is to inspect a single <div> tag and try to dig up more information. I am using "Visual Studio Code", but you can use any tool, even one as simple as Notepad.
I copied the above output and pasted it into the “Visual Studio Code" and then identified tags & properties representing News Source, Headline, News content, etc.
Luckily, most of the required data points are available in one single <div> tag, so now we can use the find method to grab each item.
If any tag is not accessible directly, you can use methods like findParent() or findChild() to point to the required tag.
The key takeaway from this exercise is to identify the optimal tag/element that will provide the required information. This is mostly straightforward, but sometimes you will have to do a little more research.
1.3 Extract & Compile the information into a Python list
We’ve identified all the required tags and information. Let’s put this together in a helper function. We will create one more function to parse individual <div> tags and return the information in dictionary form.
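A sketch of such a parser. The inner tags used below (an inner div for the source, h3 for the headline, p for the content, a for the link) are illustrative assumptions; inspect the live page to confirm the exact tags and class names:

```python
from bs4 import BeautifulSoup

def parse_news(news_tag):
    """Parse one news <div> and return its data points as a dictionary.

    The inner tags queried here are assumptions for illustration only.
    """
    return {
        'source': news_tag.find('div').text,      # first inner <div>
        'headline': news_tag.find('h3').text,     # headline text
        'content': news_tag.find('p').text,       # news summary
        'url': news_tag.find('a')['href'],        # link to the article
    }

# Offline demonstration on a synthetic news tag:
html = ('<div><div>Reuters</div><h3><a href="/news/1">Markets rally</a></h3>'
        '<p>Stocks closed higher.</p></div>')
tag = BeautifulSoup(html, 'html.parser').find('div')
print(parse_news(tag))
```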
1.4 Save the extracted information to a CSV file
This is the last step. We are going to use the Python library pandas to save the data in CSV format. Let’s install and then import the pandas library.
Let’s create a wrapper function. The first step is to use the get_page function to download the HTML page; then we pass the output to get_news_tags to identify the list of <div> tags for news.
After that, we will use a list comprehension to parse each <div> tag using parse_news; the output will be a list of dictionaries.
Finally, we will use the pd.DataFrame() method to create a pandas DataFrame and the to_csv function to store the required data in CSV format.
Scraping the news using the scrape_yahoo_news function:
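Put together, the wrapper could look like the following sketch (it assumes the get_page, get_news_tags, and parse_news helpers from the earlier steps; the default URL and filename are the ones used in this article):

```python
import pandas as pd

def scrape_yahoo_news(url='https://finance.yahoo.com/topic/stock-market-news/',
                      path='stock-market-news.csv'):
    """Download the news page, parse every news <div>, and save a CSV.

    Relies on the get_page / get_news_tags / parse_news helpers sketched
    earlier in this section.
    """
    doc = get_page(url)                            # download & parse the page
    news_tags = get_news_tags(doc)                 # locate the news <div> tags
    news = [parse_news(tag) for tag in news_tags]  # list of dictionaries
    df = pd.DataFrame(news)                        # one row per news item
    df.to_csv(path, index=None)
    return df
```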
The “stock-market-news.csv” should be available in the File → Open Menu. You can download the file or directly open it in a browser. Please verify the file content and compare it with the actual information available on the webpage.
You can also check the data by grabbing a few rows from the data frame returned by the scrape_yahoo_news function.
Summary: Hopefully I was able to explain this simple but very powerful Python technique to scrape Yahoo! Finance market news. These steps can be used to scrape any web page; you just have to do a little research to identify the required <tags>/elements and use the relevant Python methods to collect the data.
2. Web Scraping Cryptocurrencies
In phase one, we were able to scrape the Yahoo market news web page. However, as you may have noticed, more news appears at the bottom of the page as we scroll down. This is called dynamic page loading. The previous technique is a basic Python method useful for scraping static data; to scrape dynamically loaded data, we will use a different method: web scraping with Selenium. The goal of this section is to extract the top-listed cryptocurrencies from Yahoo! Finance.
Here’s an outline of the steps we’ll follow.
2.1 Introduction of selenium
2.2 Download & Set-up
2.3 Install & Import libraries
2.4 Create Web Driver
2.5 Exploring and locating Elements
2.6 Extract & Compile the information into a Python list
2.7 Save the extracted information to a CSV file
2.1 Introduction of selenium
Selenium is an open-source web-based automation tool. Python and other languages can be used with Selenium for testing as well as for web scraping. Here we will use the Chrome browser, but you can try it with any browser. You can find the official Selenium documentation here.
Why should you use Selenium?
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screenshot
- Refreshing the page
The following methods are helpful for finding elements on a web page (these methods return a list; if you are looking for only a single element, just remove the ‘s’ from the method name, e.g. find_element_by_<…>):
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
In this tutorial we will use only find_elements_by_xpath and find_elements_by_tag_name. You can find the complete documentation of these methods here.
2.2 Download & Set-up
In this section, we’ll do some prep work to implement this method. We will need to install Selenium & the proper web browser driver.
Google Colab:
If you are using the Google Colab platform, execute the following code to perform the initial installation. The expression 'google.colab' in str(get_ipython()) is used to identify the Google Colab platform.
Local Machine:
To run it locally, you will need the WebDriver for Chrome on your machine. You can download it from https://chromedriver.chromium.org/downloads and simply copy the driver into the folder where we will execute the Python file (no installation needed). Just make sure that the driver version matches the Chrome browser version installed on the local machine.
2.3 Install & Import libraries
Installation of the required libraries.
Once the library installation is done, the next step is to import all the required modules/libraries. Please note that for the local platform we need to import additional modules.
So all the necessary prep work is done. Let’s move ahead to implement this method.
2.4 Create Web Driver
In this step, we will first create an instance of the Chrome WebDriver using the webdriver.Chrome() method. After that, the driver.get() method will load the page at the given URL. In this case too, there is a slight variation based on the platform.
We have used some options parameters; for example, the --headless option loads the driver in the background. Check this link for more details.
Test run of get_driver
2.5 Exploring and locating Elements
This is almost the same step that we performed in Phase One: we will try to identify relevant information like <tags>, class, XPath, etc. from the web page.
Get Table Headers (Column names):
Right-click and select "Inspect" to do further analysis. Since the web page shows the cryptocurrency information in table form, we can grab the table header using the <th> tag. Let’s use find_elements by tag name to get the table headers; these will represent the columns in the CSV file.
Let’s create a helper function to get the first 10 columns from the header; we use a list comprehension with a condition. You can also check out the usage of the enumerate function.
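The column-trimming itself is plain Python and can be sketched independently of the browser. The header texts below are illustrative stand-ins; in the notebook they would come from something like `[th.text for th in driver.find_elements(By.TAG_NAME, 'th')]`:

```python
def get_table_columns(header_texts, n=10):
    # Keep only the first n headers using enumerate inside a list comprehension
    return [text for i, text in enumerate(header_texts) if i < n]

# Illustrative header texts; the live page may differ.
headers = ['Symbol', 'Name', 'Price (Intraday)', 'Change', '% Change',
           'Market Cap', 'Volume in Currency (Since 0:00 UTC)',
           'Volume in Currency (24Hr)', 'Total Volume All Currencies (24Hr)',
           'Circulating Supply', '52 Week Range', 'Day Chart']
print(get_table_columns(headers))  # first 10 column names only
```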
Get Table Row count:
Next, we will find out the number of rows available on a page. You can see the table rows are placed in <tr> tags. Here we will use XPath to find the <tr> tag. We can capture the XPath by selecting the <tr> tag, then Right Click → Copy → Copy XPath.
So we get the XPath value //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]. Let's use this with find_element() & By.XPATH.
The above XPath points to the first row. We can remove the row-number part from the XPath and use it with find_elements to get hold of all the available rows. Let's implement this in a function. Check out the XPath variations and the output in both examples.
Get Table Column data:
Similarly, we can capture the XPath for any column value. This is the XPath of a column: //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[2]. Note that the numbers after tr & td represent the row_number and column_number.
Let’s verify this with the find_element() method.
We can change the row_number & column_number in the XPath and loop over the row and column counts to get the required column values. Let's generalize this and put it in a function: we will get the data one row at a time and return the column values in the form of a dictionary.
Pagination:
The Yahoo! Finance web page shows only 25 cryptocurrencies per page, and users have to click the Next button to load the next set of cryptocurrencies. This is called pagination.
This is the main reason we are using Selenium to tackle scenarios like pagination. With the help of Selenium you can perform multiple actions/events on a web page like clicking, scrolling, refreshing etc. The possibilities are endless, which makes this tool very powerful in web scraping.
Now we will grab the XPath of the Next button.
Then we get the element for the Next button using the find_element method, after which we can perform the click action using the .click() method.
Now let’s check the first row on the web page to verify that .click() really worked: you will see the first row has changed, so the click action was successful.
In this section we learned how to get required data points, and how to perform events / actions on the web page.
2.6 Extract & Compile the information into a Python list
Let’s put all the pieces of the puzzle together. We will pass the total number of rows to be scraped (in this case 100) as an integer argument (total_crypto). After that, we parse each row of the page and append the data to a list until the total parsed row count matches total_crypto. In addition, we click the Next button whenever we reach the last row of the page.
Note: to identify the Next button element here, we have used the WebDriverWait class instead of the find_element() method. With this technique we can allow some wait time before grabbing the element; this is done to avoid a StaleElementReferenceException.
Code Sample:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
2.7 Save the extracted information to a CSV file
This is the last step of this section. We create a placeholder function which calls all the previously created helper functions, and then we save the data in CSV format using the DataFrame.to_csv method.
Time to scrape some cryptos! We will scrape the top 100 cryptocurrencies from the Yahoo! Finance web page by calling scrape_yahoo_crypto.
The “crypto-currencies.csv” should be available in the File → Open Menu. You can download the file or directly open it in a browser. Please verify the file content and compare it with the actual information available on the webpage.
You can also check the data by grabbing a few rows from the data frame returned by the scrape_yahoo_crypto function.
Summary: Hope you’ve enjoyed this tutorial. Selenium enables us to perform multiple actions in the web browser, which is very handy for scraping different types of data from any webpage.
3. Web Scraping Market Events Calendar
This is the final segment of the tutorial. In this section, we will learn how to extract embedded JSON-formatted data from an HTML web page, which can easily be converted to a Python dictionary. The problem statement for this section is to scrape date-wise market events from Yahoo! Finance.
Here’s an outline of the steps we’ll follow.
3.1 Install & Import libraries
3.2 Download & Parse web page
3.3 Get Embedded Json data
3.4 Locating Json Keys
3.5 Pagination & Compiling the information in a Python list
3.6 Save the extracted information to a CSV file
3.1 Install & Import libraries
The first step is to install and import the Python libraries.
3.2 Download & Parse web page
This is exactly the same step that we performed to download the webpage in section 1.1, except that here we pass custom headers to requests.get(). Most of the details are explained in section 1.1, so let’s create the helper function.
3.3 Get Embedded Json data
In this step, we will locate the JSON-formatted data which stores all the information displayed on the webpage.
Open the web page and do Right Click → View Page Source. If you scroll down the source page, you will find the JSON-formatted data inside a <script> tag. Fortunately, there is a very convenient way to grab this tag: locate the text /* -- Data -- */ in the webpage source code.
We will use regular expressions to get the text inside the <script> tag.
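One way to do this with bs4 and re (a sketch; the marker string is the one quoted above):

```python
import re
from bs4 import BeautifulSoup

def get_data_script(doc):
    """Return the <script> tag whose text contains the '/* -- Data -- */' marker."""
    return doc.find('script', string=re.compile(r'/\* -- Data -- \*/'))

# Offline demonstration on a synthetic page:
html = ('<html><script>var a = 1;</script>'
        '<script>/* -- Data -- */ root.App.main = {"context":{}};</script></html>')
doc = BeautifulSoup(html, 'html.parser')
print(get_data_script(doc).string[:20])  # start of the matched script text
```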
The next step is to grab the JSON-formatted string from the <script> tag. I am printing the first and last 150 characters of the <script> tag.
On further analysis, we can see that the formatted string starts at the context key and ends 12 characters from the end of the tag, so we can grab the JSON string using Python slicing.
Lastly, we will use the json.loads() method to convert the JSON string into a Python dictionary. Now let’s create a function using this information.
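As a sketch (the -12 offset and the context anchor are the values observed in this article; they are specific to this page's markup and may change):

```python
import json

def script_to_dict(script_text):
    """Slice the embedded JSON out of the <script> text and parse it.

    On the pages inspected in this article the JSON starts at the
    "context" key and ends 12 characters before the end of the script.
    """
    start = script_text.find('context') - 2   # step back to the opening {"
    json_string = script_text[start:-12]
    return json.loads(json_string)

# Offline demonstration with a synthetic script body shaped like Yahoo's:
script_text = 'root.App.main = {"context":{"dispatcher":{}}};\n}(this));\n'
print(script_to_dict(script_text))
```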
3.4 Locating Json Keys
Basically, the JSON text is a multi-level nested dictionary whose keys store all the metadata displayed on the webpage. In this section, we will identify the keys for the data we are trying to scrape.
We’ll need a JSON formatter tool to navigate through the many keys. I am using the online tool https://jsonblob.com/, but you can choose any tool.
To simplify the analysis, we will write the JSON text into a my_json_file.json file. After that, copy the file content and paste it into the left panel of JSON Blob, and it will format it neatly. We can then easily navigate through the keys and search for any item.
The next step is to find the required key location. Let’s search for the company name 3D Systems Corporation from the webpage in the JSON text using the JSON Blob formatter.
You can see the table data is stored in the rows key, and we can track down its parent keys as shown in the screen above. Check out the content of the rows key.
This sub-dictionary holds all the data displayed on the current page. You can do more research and exploration to get more useful information from the web page; a few examples are shown below.
Putting this in helper functions.
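A sketch of such a helper; the key path below ('context' → 'dispatcher' → 'stores' → 'ScreenerResultsStore' → 'results' → 'rows') is the kind of path found during this exploration, and Yahoo may reorganize it, so verify it in the formatter first:

```python
def get_rows(page_json):
    """Walk the nested dictionary down to the table rows.

    The key path is an assumption based on the exploration described in
    the text; confirm it against the current page.
    """
    return (page_json['context']['dispatcher']['stores']
                     ['ScreenerResultsStore']['results']['rows'])

# Offline demonstration with a synthetic dictionary of the same shape:
page_json = {'context': {'dispatcher': {'stores': {'ScreenerResultsStore': {
    'results': {'rows': [{'companyshortname': '3D Systems Corporation'}]}}}}}}
print(get_rows(page_json))
```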
3.5 Pagination & Compiling the information into a Python list
In the previous section, we handled pagination using Selenium methods; here we’ll learn a new technique for accessing multiple pages (pagination).
Most of the time, the webpage URL changes at runtime depending on user actions. For example, in the screenshot below, I selected Earnings for 1-March-2022; notice how that information is passed in the URL.
Similarly, when I click the Next button, the offset and size values change in the URL. So we can figure out the pattern & structure of the URL and how it affects page navigation.
In this case, the web page URL pattern is as follows:
- The following values are used for the calendar event types: event_types = ['splits','economic','ipo','earnings']
- The date is passed in yyyy-mm-dd format
- The page number is controlled by the offset value (for the first page, offset=0)
- The maximum number of rows on a page is assigned to size
Based on the above information, we can build the URL at runtime, download the page, and then extract the information. This is how we handle pagination.
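Those observations can be folded into a small URL builder (the parameter names day, offset, and size are the ones visible in the URLs above; the exact path shape is an inference from them):

```python
def build_url(event_type, date, offset=0, size=100):
    """Build the calendar URL for a given event type, date, and page offset."""
    return ('https://finance.yahoo.com/calendar/{}?day={}&offset={}&size={}'
            .format(event_type, date, offset, size))

print(build_url('earnings', '2022-03-01'))         # first page of earnings
print(build_url('ipo', '2022-03-01', offset=100))  # second page of IPOs
```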
Let’s create a function to which we pass event_type and date; we then calculate the total number of rows matching the criteria using the get_total_rows function. The maximum number of rows per page is constant (i.e., 100), so we can build simple summation logic to calculate the total number of pages involved for the current criteria and extract each page’s data in a loop.
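The page arithmetic can be sketched like this (get_total_rows, parse_page, and build_url are hypothetical stand-ins for the helpers described in the text, not their actual implementations):

```python
def page_offsets(total_rows, page_size=100):
    """Offsets of every page needed to cover total_rows rows."""
    return list(range(0, max(total_rows, 1), page_size))

def scrape_calendar(event_type, date):
    """Loop over page offsets for one event type and date.

    get_total_rows, parse_page, and build_url are illustrative helper
    names; the notebook's own functions may be organized differently.
    """
    rows = []
    total_rows = get_total_rows(event_type, date)
    for offset in page_offsets(total_rows):
        rows.extend(parse_page(build_url(event_type, date, offset)))
    return rows

print(page_offsets(250))  # 250 rows at 100 per page need three pages
```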
3.6 Save the extracted information to a CSV file
In this last section, we will save the data in CSV format using pd.DataFrame() & to_csv(), and call everything from a single placeholder function.
Executing the final function scrape_yahoo_calendar:
A total of 4 CSV files named “event_type_yyyy-mm-dd.csv” should be available in the File → Open menu. You can download the files or open them directly in a browser. Please verify the file contents and compare them with the actual information available on the webpage.
Summary: This is a very useful technique which can be easily replicated. Without writing any customized code, we were able to extract data from multiple types of web pages just by changing one variable (in this case, event_type).
References to some useful links.
- https://github.com/vinodvidhole/yahoo-finance-scraper
- https://blog.jovian.ai/automate-web-scraping-using-python-aws-lambda-amazon-s3-amazon-eventbridge-cloudwatch-c4c982c35fa7
- https://htmldog.com/guides/html/
- https://selenium-python.readthedocs.io/index.html
- https://www.programcreek.com/python/example/100025/selenium.webdriver.ChromeOptions
- https://stackoverflow.com/questions/27003423/staleelementreferenceexception-on-python-selenium
- https://www.w3schools.com/js/js_json_intro.asp
- https://hhsm95.dev/blog/the-importance-of-using-user-agent-to-scraping-data/
- https://medium.com/@vinodvidhole
Ideas for future work
- Automate this process using AWS Lambda to download daily market calendar, crypto-currencies & market news in CSV format.
- Move the old files to an Archive folder, append date-stamp to the file if required, and also delete the Archived files older than two weeks.
- Process the raw data extracted from the third technique using pandas.
In this tutorial, we have implemented the following web scraping techniques.
- We have used Requests, BeautifulSoup, and HTML tags to extract data from a web page.
- We used Selenium to perform clicks on dynamically loading websites and captured the information.
- We extracted the existing embedded JSON data to scrape a website.
I hope I was able to teach you these web scraping methods, and that you can now use this knowledge to scrape any website.
If you have any questions or feedback, please feel free to post a comment or contact me on LinkedIn. Thank you for reading and if you liked this post, please consider following me. Until next time… Happy coding !!