Web Scraping?
Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, engineer, or anybody who analyzes large amounts of datasets, the ability to scrape data from the web is a useful skill to have. Let’s say you find data from the web, and there is no direct way to download it, web scraping using Python is a skill you can use to extract the data into a useful form that can be imported.
Web Scraping using Beautiful Soup
Using Jupyter Notebook, you should start by importing the necessary modules (pandas, numpy, matplotlib.pyplot, seaborn). If you don’t have Jupyter Notebook installed, I recommend installing it using the Anaconda Python distribution which is available on the internet. To easily display the plots, make sure to include the line %matplotlib inline as shown below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
To perform web scraping, you should also import the libraries shown below. The urllib.request module is used to open URLs. The Beautiful Soup package is used to extract data from html files. The Beautiful Soup library’s name is bs4 which stands for Beautiful Soup, version 4.
from urllib.request import urlopen
from bs4 import BeautifulSoup
After importing necessary modules, you should specify the URL containing the dataset and pass it to urlopen() to get the html of the page.
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
Getting the html of the page is just the first step. Next step is to create a Beautiful Soup object from the html. This is done by passing the html to the BeautifulSoup() function. The Beautiful Soup package is used to parse the html, that is, take the raw html text and break it into Python objects. The second argument ‘lxml’ is the html parser whose details you do not need to worry about at this point
soup = BeautifulSoup(html, 'lxml')
type(soup)
bs4.BeautifulSoup
The soup object allows you to extract interesting information about the website you’re scraping such as getting the title of the page as shown below.