Ping Shiuan Chua

Identifying the Right Element Location when Scraping with BeautifulSoup

02 Feb 2020

Scraping usually involves identifying the right location of the element, and then passing that information (e.g. CSS Selector or Xpath) to a scraping code to automate the process.

However, there are occasions when you think you have identified the right selector using the "Inspect" feature on the website, yet the code returns no result when you run it.

There are several reasons this might happen, such as:

  • The element is rendered by JavaScript, which your code (e.g. Python's BeautifulSoup) cannot load or see
  • The website identified the bot traffic and blocked it outright or served a Captcha (a quick check for this is sketched right after this list)
  • The website loads differently for your code than it did for you
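
Before assuming the third case, it is worth ruling out the second one first. A rough sanity check (a sketch only; the marker strings below are assumptions, and Google's exact wording changes over time) is to look at the response status code and scan the returned HTML for Captcha hints:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
response = requests.get("https://www.google.com/search?q=test",
                        headers={"User-Agent": ua.random})

# A 429 or 503 status usually means the request was rate-limited or blocked
print(response.status_code)

# Crude check for a Captcha interstitial (marker strings are assumptions)
blocked = "unusual traffic" in response.text or "captcha" in response.text.lower()
print("Possibly blocked or served a Captcha:", blocked)

If that check comes back clean, the third case becomes the likelier culprit.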

This post will address the third case which, from experience, happens most often when you try to scrape the Google Search engine results page. As such, this is a follow-up to the following blog posts:

  • Finding SEO Backlink Opportunities
  • Scraping Search Results from Google

Tools Involved

Several tools that I will use to demonstrate the method are:

  • Spyder IDE - part of the Anaconda distribution
  • Atom IDE

The Spyder IDE will be used as a simple way to copy a variable out into text form, while the Atom IDE is my preferred HTML editor for previewing the exported text file.

You may use any other tools that have the same features.

Exporting the Soup Variable

Referencing the code from Scraping Search Results from Google, run the code up to the point where the "soup" variable is created.

import urllib.parse

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

query = "'trade war'"
query = urllib.parse.quote_plus(query)  # format the query into URL encoding
number_result = 20

ua = UserAgent()  # rotates realistic browser User-Agent strings

google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
response = requests.get(google_url, headers={"User-Agent": ua.random})  # note: pass the User-Agent via headers=, not as a positional argument
soup = BeautifulSoup(response.text, "html.parser")

Next, we'll export the "soup" variable out to a text file. To do that, assign the string version of "soup" to a new variable.

export = str(soup)

In the Spyder IDE, you'll now see the variable "export" in the Variable Explorer (it might take a moment to appear). Double-click it and copy the entire output.

[Image: Variable Explorer showing all recent variables]

[Image: Raw HTML text of the soup variable]
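
If you are not working in Spyder, or want to skip the copy-and-paste step entirely, you can also write the string straight to a file from Python. A minimal sketch (the file name here is arbitrary):

# Write the raw HTML that BeautifulSoup received out to a file for inspection
with open("soup_output.html", "w", encoding="utf-8") as f:
    f.write(str(soup))

Saving it with a .html extension straight away also lets you skip the manual renaming step described in the next section.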

Previewing what BeautifulSoup is Seeing

Next, open Atom IDE or your preferred HTML editor (e.g. Notepad++).

Create a new file and paste your clipboard contents into it.

If you do not like how it looks currently, you can use the Beautify feature of the Atom Beautify package. [This step is totally optional]

If you do not already have the Atom Beautify package, go ahead and download it. Here are instructions on how to install a package in Atom.

[Image: Install Atom Beautify Package on Atom IDE]

Once installed, go ahead and trigger the Beautify feature of Atom Beautify under the Packages menu. You'll now see a far more legible view of the HTML.

Next, save the file as a .html file by clicking save and appending ".html" to the end of the file name. Then, open the HTML file.

[Image: HTML of soup as seen on Chrome Desktop]

You will now see a replica of the Google SERP, though it might look a bit different from the live site. This is exactly what BeautifulSoup is seeing, and it is the most accurate way to find the CSS selector of the item you want to scrape.

Find the CSS Selector

You may skip this step if you already know how to inspect and find the CSS class of an element on a webpage.

Scroll down to find the 1st organic search result of the HTML page. Right-click the text and select "Inspect" if you're on Chrome.

Next, hover across the HTML text you see until you find a line that highlights the entire box of the 1st organic search result.

[Image: Using the Inspect feature to view the element's HTML]

For me, the line is the one below:

<div class="ZINbbc xpd O9g5cc uUPGi">

The class here is the CSS class of the "div" element. CSS classes control the look and feel of elements in a way that is scalable, as opposed to defining styles at the element level.

Pick one of the classes that is unique compared to the non-relevant search results (e.g. Google News, Paid Ads, Instant Answers). In this case, it is ZINbbc for me, which is why it is referenced in the "result_div" variable in Scraping Search Results from Google.

Use the same method to find the classes for the title and description and replace them in your code.
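
To see how the class plugs back into the scraping code, here is a rough sketch. ZINbbc is the container class I found at the time of writing, and the "BNeawe" title class below is only a placeholder; Google rotates these names, so replace both with whatever your own export shows:

# Select every organic result container by the CSS class found above
result_divs = soup.find_all("div", attrs={"class": "ZINbbc"})

for div in result_divs:
    # "BNeawe" is a placeholder class name - swap in the title and
    # description classes you found in your own HTML export
    title = div.find("div", attrs={"class": "BNeawe"})
    link = div.find("a", href=True)
    if title and link:
        print(title.get_text(), "->", link["href"])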

Closing Note

This method is especially useful whenever you face a problem where you are very certain the selector you used is the right one, based on your inspection of the live website, yet it never works in BeautifulSoup.

It lets you see what BeautifulSoup is seeing, so you can work out whether you need to render JavaScript, switch to another tool like Selenium (for example, to complete a Captcha manually), or simply find the right CSS class as we just did.
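
For example, if the export reveals that the content only appears after JavaScript runs, a browser-driven tool is the usual fallback. A minimal Selenium sketch, assuming a recent Selenium release and a local Chrome install (this is an illustration, not part of the workflow above):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # requires Chrome; recent Selenium versions can fetch the driver automatically
driver.get("https://www.google.com/search?q=trade+war")

# Page source after the browser has executed JavaScript
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()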
