Gice

Technology and General Blog

Web scraping is a technique for selecting and extracting specific content from websites. For instance, to monitor prices and how they change, we can use a web scraper to extract just the information we want from a website and save it to an Excel file. In this tutorial, we will learn how to scrape the web using BeautifulSoup.

First, install BeautifulSoup as follows:

pip install beautifulsoup4

BeautifulSoup works on HTML, so we must begin by getting the HTML content of a webpage. This is typically done using the requests module. In this example, we will get the HTML content of a webpage and display it. We first set the URL; in this case, I’ve chosen the Common Sense Media website (because it has a list of movies with ratings, which we may be interested in scraping). We then use the get() method to fetch the response object and extract the HTML using either the content attribute (raw bytes) or the text attribute (a decoded string).

import requests

url = "https://www.commonsensemedia.org/movie-reviews"

body = requests.get(url)

body_text = body.content  # raw bytes; body.text gives a decoded string instead

print(body_text)
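In practice a request can fail or hang, so it can help to wrap the fetch step in a small helper. The following is a minimal sketch and not part of the original example: the function name fetch_html and the 10-second timeout are assumptions of mine. It calls raise_for_status() so that an HTTP error surfaces as an exception instead of being parsed as if it were the page.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes of a page (hypothetical helper, not from the tutorial)."""
    # The timeout value is an assumption; without one, requests may wait indefinitely.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # .content is bytes; .text would be the decoded string.
    return response.content
```

With this helper, the fetch step would become body_text = fetch_html(url).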

Now, we can begin using BeautifulSoup. We create a BeautifulSoup object, which takes two arguments: the HTML markup and the name of the parser. There are four parsers available: html.parser, lxml, lxml-xml, and html5lib.

from bs4 import BeautifulSoup

soup = BeautifulSoup(body_text, "lxml")

The parser must also be installed, unless you use html.parser, which ships with Python. In this case, I’ve chosen the lxml parser, so I will install it:

pip install lxml
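If lxml is not available, the built-in html.parser is enough for a quick check that the parsing step behaves as expected. Here is a minimal sketch on a made-up HTML string; the markup and the variable names sample_html and sample_soup are invented for illustration and are not taken from the website.

```python
from bs4 import BeautifulSoup

# Invented sample markup, used only to illustrate the parsing step.
sample_html = (
    "<html><head><title>Movie Reviews</title></head>"
    "<body><p>Hello</p></body></html>"
)

# "html.parser" ships with Python, so it needs no extra installation.
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.string)  # text of the <title> tag
print(sample_soup.p.string)      # text of the first <p> tag
```

Swapping "html.parser" for "lxml" should give the same result on well-formed markup like this; the parsers mainly differ in speed and in how they repair broken HTML.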

Now we can do just about anything with the soup object, so let’s explore a few of the possibilities before we begin scraping.

(i) The prettify() method returns the markup as an indented, readable (“pretty”) string.

(ii) The title attribute retrieves the page’s title tag.

(iii) The p attribute retrieves the first p tag in the HTML code; to get all of them, use find_all(“p”).

(iv) The a attribute likewise retrieves the first a tag in the HTML code.

(v) The find_all() method finds every element that matches its arguments. For example, find_all(“a”) returns a list of all the “a” tags.

(vi) The find() method returns only the first element that matches its arguments. For example, with the argument id = “password”, it searches the HTML code for an element with that id and, if one exists, returns that tag; otherwise it returns None.
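The six items above can be tried on a small page without touching the network. This is a sketch on invented markup (the HTML string and the name sample_soup are assumptions made for illustration), using the built-in html.parser so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration only.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a href="/one">One</a>
    <a href="/two">Two</a>
    <input id="password" type="password">
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.prettify())           # (i) indented, readable markup
print(sample_soup.title)                # (ii) the <title> tag
print(sample_soup.p)                    # (iii) the first <p> tag only
print(sample_soup.a)                    # (iv) the first <a> tag only
print(sample_soup.find_all("a"))        # (v) a list of every <a> tag
print(sample_soup.find(id="password"))  # (vi) the first tag with id="password"
```

Note the difference between (iv) and (v): the attribute form gives a single tag, while find_all() always gives a list, even when only one element matches.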

Typically, we’d like to scrape a web page for jobs, movies, courses, etc., along with their respective information (such as prices and ratings). In this case, we’re interested in scraping the movie list from the Common Sense Media website.

import requests

url = "https://www.commonsensemedia.org/movie-reviews"
body = requests.get(url)
body_text = body.content

from bs4 import BeautifulSoup
soup = BeautifulSoup(body_text, "lxml")

In this particular case, the HTML code of each movie name (what we are scraping) sits inside a container. We begin by inspecting the element in question with the browser’s developer tools. In my case, I have chosen to inspect the first movie’s title (“Till Death”).

When you inspect the element, you will notice that what we are after, the movie title “Till Death”, is contained within a “div” tag with class “content-content-wrapper.” This “div” tag recurs throughout the HTML code, since each movie title is contained within such a “div” tag. So for each div in divs, we select the nested “div” tag with the class “views-field views-field-field-reference-review-ent-prod result-title.” Inside that, we see a “strong” tag with class “field-content,” so we do the same thing again. Finally, the title itself is nested within an “a” tag, so we select the “a” tag.

divs = soup.find_all("div", class_="content-content-wrapper")

Please note that there is an underscore after the word class. Because class is a reserved keyword in Python, BeautifulSoup uses class_ as the name of the argument that filters on the HTML class attribute. This find_all call extracts every “div” tag with the class “content-content-wrapper.”
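The class filter can be written in more than one way. The sketch below, run on invented markup (the HTML string and variable names are assumptions for illustration), compares the class_ keyword with the dictionary forms and checks that they select the same tags.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics a page with several containers.
sample_html = """
<div class="content-content-wrapper"><a href="#">First</a></div>
<div class="other"><a href="#">Ignored</a></div>
<div class="content-content-wrapper"><a href="#">Second</a></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Three spellings of the same filter on the HTML class attribute.
by_keyword = sample_soup.find_all("div", class_="content-content-wrapper")
by_dict = sample_soup.find_all("div", {"class": "content-content-wrapper"})
by_attrs = sample_soup.find_all("div", attrs={"class": "content-content-wrapper"})

print(by_keyword == by_dict == by_attrs)
print([div.a.text for div in by_keyword])
```

The keyword form is the shortest; the dictionary form is handy when filtering on several attributes at once.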

Then you write:

# Equivalent to the earlier find_all call, passing the class in a dictionary:
# divs = soup.find_all("div", {"class": "content-content-wrapper"})

for div in divs:
    divs2 = div.find_all("div", class_="views-field views-field-field-reference-review-ent-prod result-title")
    for div2 in divs2:
        strongs = div2.find_all("strong", class_="field-content")
        for strong in strongs:
            aa = strong.find_all("a")
            for a in aa:
                print(a.text)

The for loops walk down the nesting to pick out each movie. Finally, to get the text inside the tag, we use a.text. This prints each movie title, and in the same manner, we can scrape whatever else we want.
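The nested loops can also be expressed as a single CSS selector with the select() method. This is an alternative sketch, not the approach used in this tutorial, and it runs on invented markup that follows the structure described above; the real page may differ or may have changed, and the second movie title is a placeholder.

```python
from bs4 import BeautifulSoup

# Invented markup that follows the structure described in this tutorial.
sample_html = """
<div class="content-content-wrapper">
  <div class="views-field views-field-field-reference-review-ent-prod result-title">
    <strong class="field-content"><a href="#"> Till Death </a></strong>
  </div>
</div>
<div class="content-content-wrapper">
  <div class="views-field views-field-field-reference-review-ent-prod result-title">
    <strong class="field-content"><a href="#"> Another Movie </a></strong>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Each space means "somewhere inside", so this selector follows the same
# div -> div -> strong -> a nesting as the four for loops.
selector = "div.content-content-wrapper div.result-title strong.field-content a"

# get_text(strip=True) returns the tag's text without surrounding whitespace.
titles = [a.get_text(strip=True) for a in sample_soup.select(selector)]
print(titles)
```

The selector matches on the single class result-title rather than the full class string. That is a design choice: when class_ is given a string containing several classes, it matches only that exact string in that exact order, which breaks if the site reorders or adds a class.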

Now, suppose that we wish to save this data to a CSV file; that’s possible too. To write to CSV, you must first import the csv module. First, let’s open the file where we want the information stored. Here we pass three arguments: the name of the file, the mode, and the newline argument. Setting newline to an empty string prevents extra blank lines from appearing after each entry in the CSV file, which can otherwise happen on Windows. Second, we pass the file to the writer() method. Third, we write a new row. In this case, I’m calling my new row “Movies” because it’s the header for what follows.

import csv

file = open("movie.csv", "w", newline="")
file_write = csv.writer(file)
file_write.writerow(["Movies"])

Fourth, instead of just printing the text of each “a” tag, we strip it of surrounding whitespace and then use the writerow() method to write it to the CSV file.

for div in divs:
    divs2 = div.find_all("div", class_="views-field views-field-field-reference-review-ent-prod result-title")
    for div2 in divs2:
        strongs = div2.find_all("strong", class_="field-content")
        for strong in strongs:
            aa = strong.find_all("a")
            for a in aa:
                file_write.writerow([a.text.strip()])
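One detail worth adding is closing the file, since rows may not reach the disk until it is closed. A common way to handle this is the with statement. The sketch below uses a stand-in list of titles and a temporary directory (both are assumptions for illustration, so that it runs without the scraping step and does not overwrite a real file).

```python
import csv
import os
import tempfile

# Stand-in titles; in the tutorial these come from a.text.strip().
titles = ["Till Death", "Another Movie"]

# Write to a temporary directory so the sketch does not overwrite a real file.
csv_path = os.path.join(tempfile.mkdtemp(), "movie.csv")

# The with statement closes the file automatically, even if an error occurs.
with open(csv_path, "w", newline="", encoding="utf-8") as file:
    file_write = csv.writer(file)
    file_write.writerow(["Movies"])   # header row
    for title in titles:
        file_write.writerow([title])  # one movie per row

# Read the file back to confirm what was written.
with open(csv_path, newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))
print(rows)
```

In the scraper itself, the nested for loops would sit inside the first with block in place of the loop over titles.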

The whole code would look something like this:

import requests

url = "https://www.commonsensemedia.org/movie-reviews"
body = requests.get(url)
body_text = body.content

from bs4 import BeautifulSoup
soup = BeautifulSoup(body_text, "lxml")

divs = soup.find_all("div", class_="content-content-wrapper")

import csv

file = open("movie.csv", "w", newline="")
file_write = csv.writer(file)
file_write.writerow(["Movies"])

for div in divs:
    divs2 = div.find_all("div", class_="views-field views-field-field-reference-review-ent-prod result-title")
    for div2 in divs2:
        strongs = div2.find_all("strong", class_="field-content")
        for strong in strongs:
            aa = strong.find_all("a")
            for a in aa:
                file_write.writerow([a.text.strip()])

# Close the file so that all rows are written to disk.
file.close()

This is just a simple example. The same approach (fetch the page, locate the tags, extract the text) lets you scrape and monitor many other web pages.

Happy Coding!
