Fundamentals of Python Web Scraping | EduGrad

In the growing world of data science, data sources matter: they are where we gather data and extract valuable insights. Sometimes data is collected internally at the administrative end of a company, and sometimes we need to extract data for analytics from other sources. Websites are one of the richest of those sources.

Websites store a huge amount of data aggregated from various sources, and that data can be extracted by scraping. Scraping is a process in which we parse a web page and collect data along the way. Web scraping in Python is performed using a “web scraper”, also known as a “bot”, “spider”, or “crawler”. A web scraper is a program that sends a request to a web page, downloads the content, collects only the required data from the response, and stores it in a database.

What are the steps involved in Python web scraping?

Web scraping using Python is a three-step process.

Step 1 – Sending an HTTP request to the webpage you want to scrape

First, we send an HTTP request to the target URL of the webpage we want to access. Just as it would for a browser, the server responds by returning the HTML content of the target webpage. That is, we get the HTML code of the target page as the response to the request we made from Python.

In Python, to send web requests, we import the requests library.

import requests
response = requests.get('target URL')
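
In practice it helps to check that the request succeeded before using the response. The following is a minimal sketch of that idea, not a definitive implementation: the function name fetch_html, the 10-second timeout, and the get parameter (which only exists so that the function can be tried without network access) are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the HTML text of `url`, raising an exception on an HTTP error status."""
    # `get` defaults to requests.get; it is a hypothetical hook for swapping in a stand-in
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html('target URL') with a real address would then return the page's HTML as a string, or raise an exception if the server answered with an error status.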

Step 2 – Parsing the HTML content

Once we have the HTML content of the webpage, we need a way to parse it. We cannot simply pull the data out of the raw text, because HTML is nested: elements contain other elements. An HTML parser turns the markup into a nested tree structure that we can navigate. Several parsers are available in Python, such as html5lib, lxml, and the built-in html.parser.
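
To make the idea of a nested structure concrete, here is a small sketch that parses a made-up HTML snippet and walks down the resulting tree. The markup, the class names, and the variable names are invented for illustration. It uses Beautiful Soup (introduced in the next step) with Python's built-in html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = """
<html>
  <body>
    <div class="phone">
      <h2 class="name">Example Phone</h2>
      <span class="price">9,999</span>
    </div>
  </body>
</html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed
soup = BeautifulSoup(html, "html.parser")

phone = soup.body.div          # first <div> inside <body>
name_tag = phone.h2            # first <h2> inside that <div>

print(name_tag.get_text())     # text inside the <h2>
print(name_tag.parent.name)    # tag name of the enclosing element
print(phone["class"])          # class attribute, returned as a list
```

Each tag knows both its children and its parent, which is what lets us move through the page instead of searching raw text.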

Step 3 – Pulling data out of HTML using Beautiful Soup

Now that the HTML is parsed, all we need to do is navigate and search the parsed data. For this task, we will use another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files.

We first find the class names of the elements that hold the data we need, using the Inspect Element feature of Chrome, and then access all the data inside those blocks using Beautiful Soup.
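
As a rough illustration of searching by class, the sketch below runs find_all on a made-up snippet. The class names product, title, and cost are hypothetical stand-ins for whatever names Inspect Element shows on a real page.

```python
from bs4 import BeautifulSoup

# Made-up markup; real pages will use their own class names
html = """
<div class="product"><span class="title">Phone A</span><span class="cost">100</span></div>
<div class="product"><span class="title">Phone B</span><span class="cost">200</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every tag whose class attribute contains the given class
title_tags = soup.find_all(class_="title")
titles = [tag.get_text(strip=True) for tag in title_tags]

# Searching inside each product block keeps related fields together
products = []
for block in soup.find_all("div", class_="product"):
    products.append({
        "title": block.find(class_="title").get_text(strip=True),
        "cost": block.find(class_="cost").get_text(strip=True),
    })

print(titles)
print(products)
```

Searching block by block is one possible design choice: if a field is missing for one product, the values of the other products do not shift out of alignment.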

The following code illustrates the above process.

Scraper for extracting mobile phone name, price, rating, and description from Flipkart.

Importing our required libraries

#Importing the Beautiful Soup Library
from bs4 import BeautifulSoup

#Importing the requests Library
import requests

Sending Request to the URL

# Build the URL from adjacent string literals so that no line breaks end up inside it
url = (
    'https://www.flipkart.com/search?q=nokia+smartphones&'
    'sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_0_10_na_'
    'na_pr&otracker1=AS_QueryStore_OrganicAutoSuggest_0_10_na_na_pr&as-pos=0&as-type='
    'RECENT&suggestionId=nokia+smartphones&requestId='
    '675612e2-512b-4d0e-8b75-6bdf91921d7c&as-backfill=on'
)
response = requests.get(url)

Parsing the returned HTML data

soup = BeautifulSoup(response.text, 'lxml')
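
The 'lxml' parser is a separate package, and Beautiful Soup raises FeatureNotFound when the requested parser is not installed. One possible way to handle that, sketched here with a hypothetical helper named make_soup, is to fall back to Python's built-in parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse `markup` with lxml when it is available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.get_text())
```

Different parsers can build slightly different trees from malformed HTML, so it is worth using the same parser wherever the scraper runs.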

Accessing the required data from the HTML content

mname, mrating, mprice, mdesc = list(), list(), list(), list()
mobile_name = soup.find_all(class_='_3wU53n')
rating = soup.find_all(class_='hGSR34')
price = soup.find_all(class_='_1vC4OE _2rQ-NK')
description = soup.find_all(class_='vFw0gD')
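
The find_all calls above return lists of tags, not text. One plausible way to fill the four lists is sketched below; this is an assumption about the remaining step, not necessarily the original author's code. So that it runs without network access, it parses a small made-up snippet that reuses the class names from the article. Those class names are generated by the site and may no longer match the live page.

```python
from bs4 import BeautifulSoup

# Stand-in markup that reuses the article's class names; the live page may differ
sample_html = """
<div><div class="_3wU53n">Phone A</div><div class="hGSR34">4.3</div>
<div class="_1vC4OE _2rQ-NK">9,999</div>
<ul class="vFw0gD"><li>4 GB RAM</li><li>5 inch display</li></ul></div>
<div><div class="_3wU53n">Phone B</div><div class="hGSR34">4.1</div>
<div class="_1vC4OE _2rQ-NK">7,499</div>
<ul class="vFw0gD"><li>3 GB RAM</li><li>5 inch display</li></ul></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

mobile_name = soup.find_all(class_='_3wU53n')
rating = soup.find_all(class_='hGSR34')
price = soup.find_all(class_='_1vC4OE _2rQ-NK')
description = soup.find_all(class_='vFw0gD')

# get_text(strip=True) returns a tag's text without surrounding whitespace
mname = [tag.get_text(strip=True) for tag in mobile_name]
mrating = [tag.get_text(strip=True) for tag in rating]
mprice = [tag.get_text(strip=True) for tag in price]
# A description holds several list items, so join them with a space
mdesc = [tag.get_text(" ", strip=True) for tag in description]

for row in zip(mname, mrating, mprice, mdesc):
    print(row)
```

Note that zip pairs the lists by position, so this only lines up correctly when every product on the page has all four fields.
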