Scraping usually involves identifying the right location of the element, and then passing that information (e.g. a CSS selector or XPath) to your scraping code to automate the process.
However, there are occasions when you think you have identified the right selector using the "Inspect" feature on the website, yet the code returns no results when you run it.
There are several reasons this might happen, such as:
This post addresses the 3rd case, which, in my experience, happens most often when you try to scrape the Google Search engine results page (SERP). As such, this is a follow-up to the following blog posts:
The tools I will use to demonstrate the method are:
The Spyder IDE will be used as a simple way to copy a variable out to text form, while the Atom IDE is my preferred HTML editor for previewing the exported text file.
You may use any other tools that have the same features.
Referencing the code from Scraping Search Results from Google, run the code up to the point where the "soup" variable is created.
import urllib.parse  # "import urllib" alone does not expose urllib.parse

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

query = "'trade war'"
query = urllib.parse.quote_plus(query)  # Format into URL encoding
number_result = 20

ua = UserAgent()
google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
# Note: the User-Agent must be passed via the headers keyword; passing the
# dict positionally would send it as query parameters instead.
response = requests.get(google_url, headers={"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")
Next, we'll export the "soup" variable to a text file. To do that, assign the string version of "soup" to a variable.
export = str(soup)
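If you prefer to skip the manual copy-paste through the IDE, you can also write the parsed HTML straight to a file from Python. This is a minimal sketch; `export_soup` is a hypothetical helper name, and it works on anything whose `str()` form is the HTML (such as a BeautifulSoup object):

```python
def export_soup(soup, path="soup_export.html"):
    """Save the string form of a BeautifulSoup object (or any HTML) to disk,
    so it can be opened in Atom, Notepad++, or a browser for inspection."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))
    return path

# With the code above, you would call: export_soup(soup)
```

The resulting file can then be opened directly in your HTML editor, which achieves the same effect as the copy-and-paste steps below.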
On the Spyder IDE, you'll now see the variable "export" in the Variable Explorer (it might take a moment to appear). Double-click it and copy the entire output.
Next, open Atom IDE or your preferred HTML editor (e.g. Notepad++).
Create a new file and paste your clipboard contents into it.
If you do not like how it currently looks, you can use the Beautify feature of the Atom Beautify package. [This step is totally optional]
If you do not already have the Atom Beautify package, go ahead and download it. Here are instructions on how to install a package in Atom.
Once installed, trigger the Beautify feature of Atom Beautify under the Packages menu. Now you'll see a more legible view of the HTML.
Next, save the file as a .html file by clicking save and appending ".html" to the end of the file name. Then, open the HTML file.
You will now see a replica of the Google SERP, though it might look slightly different from the live page. This is exactly what BeautifulSoup sees, and it is the most accurate way to find the CSS selector of the item you want to scrape.
You may skip this step if you already know how to inspect and find the CSS class of an element in a webpage.
Scroll down to find the first organic search result in the HTML page. Right-click the text and select "Inspect" if you're on Chrome.
Next, hover across the HTML text you see until you find a line that highlights the entire box of the 1st organic search result.
For me, the line is the one below:
<div class="ZINbbc xpd O9g5cc uUPGi">
The class here is the CSS class of the "div" element. The CSS class controls the look and feel of elements in a scalable way, as opposed to defining styles at the element level.
Take one of the classes that is unique compared to the non-relevant search results (e.g. Google News, Paid Ads, Instant Answers). In this case, it is ZINbbc for me, which is why it was referenced in the variable "result_div" in Scraping Search Results from Google.
Use the same method to find the title's and description's classes and replace them in your code.
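Putting it together, here is a sketch of how the discovered classes might be used with BeautifulSoup. The class names below (ZINbbc for the result block, vvjwJb and s3v9rd for the title and description) are illustrative, taken from my own export; Google changes them frequently, so you should substitute whatever your exported HTML shows:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the exported Google SERP HTML.
html = """
<div class="ZINbbc xpd O9g5cc uUPGi">
  <div class="BNeawe vvjwJb">Trade war - Wikipedia</div>
  <div class="BNeawe s3v9rd">A trade war is an economic conflict...</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
# Search by the class that is unique to organic results, then drill into
# each block for the title and description classes.
for block in soup.find_all("div", class_="ZINbbc"):
    title = block.find("div", class_="vvjwJb")
    desc = block.find("div", class_="s3v9rd")
    if title and desc:
        results.append((title.get_text(), desc.get_text()))
```

Note that `class_` (with a trailing underscore) is BeautifulSoup's keyword for filtering by CSS class, since `class` is a reserved word in Python.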
This method is especially useful when you are certain, based on your inspection of the live website, that the selector you used is the right one, yet it never works in BeautifulSoup.
It lets you see exactly what BeautifulSoup sees, so you can work out whether you need to render JavaScript, switch to a tool like Selenium to manually complete the CAPTCHA, or simply find the right CSS class, as we just did.