Get data from a website with ease using Python BeautifulSoup

Extract and parse HTML data from a website using a web scraping library in Python.

Jul 26, 2022

soup in bowl — Photo by Ella Olsson on Unsplash

Web scraping is the process of extracting data from websites. There are two ways to do this:

Using the website's API (if available).
Parsing the HTML content of the website by accessing the DOM.

How to use BeautifulSoup

In this example, we will extract the NIRF ranking data, which is published as a table at https://www.nirfindia.org/2022/OverallRanking.html, and save it in a CSV file.

Also read: How the states perform in the NIRF ranking 2022.

Import the following libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

Get content from the desired website and access the HTML

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
page = requests.get("https://www.nirfindia.org/2022/OverallRanking.html", headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
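
The request above assumes the server always answers successfully. As a minimal sketch, the same fetch can be wrapped in a small helper that sets a timeout and stops on HTTP errors. The helper name fetch_soup, the shortened User-Agent string, and the 10-second timeout are assumptions made for illustration, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup

# Shortened placeholder User-Agent (assumption); use a full browser string if needed
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and return it parsed; raise on HTTP error codes."""
    response = requests.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# Usage (needs network access):
# soup = fetch_soup("https://www.nirfindia.org/2022/OverallRanking.html")
```

Without a timeout, requests can wait indefinitely on an unresponsive server, so setting one is usually a sensible default for a scraper.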

Now we can search for HTML elements in the variable soup using find().
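
To make the behaviour of find() concrete, here is a small sketch that runs on a made-up HTML snippet instead of the live page. The snippet and its tag contents are invented for illustration. It shows that find() returns only the first match, find_all() returns every match, and find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = """
<html><body>
  <h1 id="title">Rankings</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_note = soup.find("p", class_="note")           # first matching element only
all_notes = soup.find_all("p", class_="note")        # list of every matching element
missing = soup.find("table", id="does-not-exist")    # None when nothing matches

print(first_note.get_text())
print(len(all_notes))
print(missing)
```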

Identify the HTML element

Identify the HTML element that holds the data using the browser's Inspect tool (CTRL + SHIFT + I on Windows/Linux, CMD + OPTION + I on macOS in most browsers). Prefer specific attributes such as id when referring to the desired HTML element.

In our case, the id of the table that contains the data is tbl_overall.

Inspecting the table element in the page

Extract data from the element

After finding the table element by its id attribute, extract the tbody element it contains.

table = soup.find('table', attrs={'id':'tbl_overall'})
table_body = table.find('tbody')
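
If the id changes or the page layout is different, find() returns None and the next call fails with an AttributeError. The following is a hedged sketch of a guard around the lookup. The helper name get_table_body and the one-row sample table are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def get_table_body(soup, table_id):
    """Return the tbody of the table with the given id, or raise a clear error."""
    table = soup.find("table", attrs={"id": table_id})
    if table is None:
        raise ValueError(f"No <table> with id={table_id!r} found")
    # html.parser does not insert a missing <tbody>, so fall back to the table itself
    body = table.find("tbody")
    return body if body is not None else table


# Made-up one-row table standing in for the real page
sample = BeautifulSoup(
    '<table id="tbl_overall"><tbody><tr><td>1</td></tr></tbody></table>',
    "html.parser",
)
table_body = get_table_body(sample, "tbl_overall")
```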

The table_body variable now contains the rows of the table body. Use a for loop to iterate over each row, and call find_all on each row to get its td cells. Passing recursive=False returns only the direct children of the element being searched. With the default, recursive=True, find_all also returns elements nested deeper inside the table. For example, each cell in the “Name” column contains another, hidden table with 5 columns.

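data = []
# Only the direct child rows of tbody, not the rows of nested tables
rows = table_body.find_all('tr', recursive=False)
for row in rows:
    # Only this row's own cells, not the cells of the nested table
    cols = row.find_all('td', recursive=False)
    # Keep the first child of each cell; cells with no content are skipped
    cols = [ele.contents[0] for ele in cols if ele.contents]
    data.append([ele for ele in cols if ele])  # Get rid of empty values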
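
The effect of recursive=False is easier to see on a small example. This sketch uses a made-up table whose middle cell hides a nested table, loosely imitating the structure described above. The ids and cell values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table: the second cell contains a nested table
html = """
<table id="tbl_demo">
  <tbody>
    <tr>
      <td>ID-1</td>
      <td>Example Institute
        <table><tbody><tr><td>nested A</td><td>nested B</td></tr></tbody></table>
      </td>
      <td>Example City</td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("table", id="tbl_demo").find("tbody")

top_rows = body.find_all("tr", recursive=False)  # direct child rows only
all_rows = body.find_all("tr")                   # also the row of the nested table

row = top_rows[0]
top_cells = row.find_all("td", recursive=False)  # this row's own cells
all_cells = row.find_all("td")                   # also the nested table's cells

# The first child of the second cell is the visible text before the nested table
name = top_cells[1].contents[0].strip()

print(len(top_rows), len(all_rows))
print(len(top_cells), len(all_cells))
print(name)
```

Taking contents[0] rather than the full cell text is what keeps the nested table's values out of the “Name” column.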

Convert data into required format

Now that the ranking table is stored in data as a list of rows, convert it into a DataFrame and save it as a CSV file.

rank_data = pd.DataFrame(data)
rank_data.columns = ['Institute ID', 'Name', 'City', 'State', 'Score', 'Rank']
rank_data.to_csv("rank_data.csv")
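
As a variation on this step, the sketch below passes the column names directly to the DataFrame constructor, converts the numeric columns, and leaves out the index column when writing. The two rows are made-up placeholder values, not real NIRF data, and the CSV is written to an in-memory buffer so that the example runs without touching the disk.

```python
import io

import pandas as pd

columns = ["Institute ID", "Name", "City", "State", "Score", "Rank"]

# Placeholder rows (invented values) in the same shape as the scraped data
data = [
    ["ID-1", "Example Institute A", "City A", "State A", "90.5", "1"],
    ["ID-2", "Example Institute B", "City B", "State B", "85.25", "2"],
]

rank_data = pd.DataFrame(data, columns=columns)

# Scraped values arrive as text; convert the numeric columns explicitly
rank_data["Score"] = pd.to_numeric(rank_data["Score"])
rank_data["Rank"] = pd.to_numeric(rank_data["Rank"])

# index=False leaves out the 0, 1, 2, ... row index column
buffer = io.StringIO()
rank_data.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file name such as "rank_data.csv" in place of the buffer.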

Conclusion

I hope this blog gives a glimpse of how to extract data from a website and use it in our applications. Please use the links in the References section for detailed tutorials on web scraping using Python and BeautifulSoup.

References

Implementing Web Scraping using BeautifulSoup: https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

Beautiful Soup: Build a Web Scraper With Python: https://realpython.com/beautiful-soup-web-scraper-python/

Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Code to extract ranking data of all the categories and perform data analysis on them: https://github.com/abhishekbm1996/nirf-2022-statewise
