Get data from a website with ease using Python BeautifulSoup
Extract and parse HTML data from a website using a web scraping library in Python.
Web scraping is the process of extracting data from websites. There are two ways to do this:
Using the website's API (if available).
Using the HTML content of the website by accessing the DOM.
How to use BeautifulSoup
In this example, we will extract NIRF ranking data in a table format from https://www.nirfindia.org/2022/OverallRanking.html and save it in a csv file.
Also read: How the states perform in the NIRF ranking 2022.
Import the following libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
Get content from the desired website and access the HTML
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
page = requests.get("https://www.nirfindia.org/2022/OverallRanking.html", headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
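Network requests can fail or hang, so it may help to wrap the fetch in a small helper. The sketch below is one possible way to do that; the helper name fetch_soup and the 10-second timeout are illustrative assumptions, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object.

    fetch_soup and the 10-second default timeout are illustrative choices.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling fetch_soup("https://www.nirfindia.org/2022/OverallRanking.html", headers=headers) would then give the same soup object as the two lines above, but stop early if the server returns an error.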
Now we can search for HTML elements in the variable soup using find().
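As a quick illustration of how find() behaves, here is a minimal sketch on a made-up HTML snippet (the list items are invented for this example). It also shows find_all(), which is used later in this post, and what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = """
<ul>
  <li class="item">First</li>
  <li class="item">Second</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")      # the first matching tag, or None if there is none
all_items = soup.find_all("li")   # every matching tag, as a list-like ResultSet
missing = soup.find("table")      # this snippet has no <table>

print(first_item.get_text())
print([li.get_text() for li in all_items])
print(missing)
```

Because find() returns None when there is no match, it is worth checking the result before calling methods on it.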
Identify the HTML element
Identify the HTML element from which the data has to be extracted using the browser's Inspect Element tool (CMD/CTRL + SHIFT + I). Always use a specific attribute such as id to refer to the desired HTML element. For example, in our case the id of the table that contains the data is tbl_overall.
Extract data from the element
After finding the table by its attributes, extract the contents of its tbody element.
table = soup.find('table', attrs={'id':'tbl_overall'})
table_body = table.find('tbody')
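If the page layout changes and the table is no longer found, find() returns None and the next call fails with an AttributeError. A minimal sketch of a guarded lookup follows; the helper name get_table_body and the sample HTML are assumptions made for this example.

```python
from bs4 import BeautifulSoup


def get_table_body(soup, table_id):
    """Return the <tbody> of the table with the given id, or raise ValueError."""
    table = soup.find("table", attrs={"id": table_id})
    if table is None:
        raise ValueError(f"No table with id={table_id!r} found")
    table_body = table.find("tbody")
    if table_body is None:
        raise ValueError(f"Table {table_id!r} has no <tbody>")
    return table_body


# Made-up HTML standing in for the real page.
sample = BeautifulSoup(
    "<table id='tbl_overall'><tbody><tr><td>1</td></tr></tbody></table>",
    "html.parser",
)
body = get_table_body(sample, "tbl_overall")
print(len(body.find_all("tr")))
```

A clear error message like this makes it easier to notice when the site's markup has changed.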
Now the table_body variable contains the required data inside the table body. Use a for loop to iterate over each row in the table, and use the find_all function to get the column data from each td element. We can set recursive=False to get only the top-level elements in the table. With recursive=True (the default), we would also get the data nested inside the cells; for example, the “Name” column hides another table with 5 columns.
data = []
rows = table_body.find_all('tr', recursive=False)
for row in rows:
    # Only the row's direct <td> children, not the cells of nested tables
    cols = row.find_all('td', recursive=False)
    # Keep the first child of each cell, skipping cells that have no content
    cols = [ele.contents[0] for ele in cols if ele.contents]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
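To see what recursive=False changes, here is a small sketch on a made-up row whose second cell contains a nested table, similar in spirit to the “Name” column. The HTML and values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up row: the second cell holds a nested table with two cells.
html = """
<table><tbody>
<tr>
  <td>ID-1</td>
  <td>Example Institute
    <table><tbody><tr><td>a</td><td>b</td></tr></tbody></table>
  </td>
  <td>Example City</td>
</tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")  # the first <tr> in the document is the outer row

top_level = row.find_all("td", recursive=False)  # direct children only
everything = row.find_all("td")                  # recursive=True is the default

print(len(top_level))    # 3
print(len(everything))   # 5: the 3 outer cells plus the 2 nested ones
print(top_level[1].contents[0].strip())  # Example Institute
```

Taking contents[0] of each top-level cell, as in the loop above, keeps the visible text and leaves the nested table behind.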
Convert data into required format
Now that we have the ranking table data in data as a list, convert it into a data frame and save it as a CSV file.
rank_data = pd.DataFrame(data)
rank_data.columns = ['Institute ID', 'Name', 'City', 'State', 'Score', 'Rank']
rank_data.to_csv("rank_data.csv")
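Scraped cells arrive as strings, and to_csv writes the row index as an extra column by default. The sketch below shows one way to handle both, using made-up rows in place of the scraped data and an in-memory buffer in place of a file.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped data.
data = [
    ["ID-1", "Example Institute A", "City A", "State A", "90.04", "1"],
    ["ID-2", "Example Institute B", "City B", "State B", "83.57", "2"],
]
columns = ["Institute ID", "Name", "City", "State", "Score", "Rank"]

# Column names can be passed when the data frame is created.
rank_data = pd.DataFrame(data, columns=columns)

# Convert the numeric columns from strings before doing any analysis.
rank_data["Score"] = pd.to_numeric(rank_data["Score"])
rank_data["Rank"] = pd.to_numeric(rank_data["Rank"])

buffer = io.StringIO()                  # in-memory stand-in for a file on disk
rank_data.to_csv(buffer, index=False)   # index=False omits the row-number column
print(buffer.getvalue())
```

Passing a file name such as "rank_data.csv" instead of the buffer writes the same content to disk.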
Conclusion
I hope this blog gives a glimpse of how to extract data from a website and use it in our applications. Please use the links in the References section for detailed tutorials on web scraping with Python and BeautifulSoup.
References
Implementing Web Scraping using BeautifulSoup: https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
Beautiful Soup: Build a Web Scraper With Python: https://realpython.com/beautiful-soup-web-scraper-python/
Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Code to extract ranking data of all the categories and perform data analysis on them: https://github.com/abhishekbm1996/nirf-2022-statewise