Selenium is a versatile web scraping tool accessible from multiple programming languages.
It's distinguished from text-parsing scrapers like BeautifulSoup in that it simulates an actual browsing session, which lets you scrape websites that rely heavily on JavaScript and iframes.
That makes Selenium especially powerful when you need to scrape large websites, such as e-commerce sites.
However, on large websites the pages you scrape won't be perfectly identical to one another.
Hence, error (exception) handling is crucial.
Without proper exception handling you will run into error after error and waste time, because any uncaught error simply halts your scraping job.
This is especially painful when you have set the job to run over lunch or overnight.
It is very common for some pages to lack a particular element.
For example, if you are scraping Amazon for product reviews, products with no reviews simply have no review element to show you.
This is easily handled by appending "" or None to the list you are populating with your results.
from selenium.common.exceptions import NoSuchElementException

reviews = []
for page in pages_to_scrape:
    try:
        # Insert your scraping action here
        reviews.append(driver.find_element_by_css_selector('div.review').text)
    except NoSuchElementException:
        # Just append a None or ""
        reviews.append(None)
Some websites are simply slow, or so small that your scraping overloads their server.
The latter is always bad: you shouldn't crash other people's websites while scraping.
In any case, timeouts (i.e. a page failing to load) are common on large websites.
However, I don't recommend catching this error unless you know from experience that the website is prone to timeouts; it can be a huge waste of resources if the website times out only 0.01% of the time.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.get("your url")

# Remove the while loop and break if you don't want to try again when it takes too long
while True:
    try:
        # Define an element that you can start scraping from once it appears
        # If the element appears within 5 seconds, break the loop and continue
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "your selector"))
        )
        break
    except TimeoutException:
        # If loading took too long, print a message and try again
        print("Loading took too much time!")
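If you'd rather give up after a fixed number of attempts instead of looping forever, the retry logic can be factored into a small generic helper. This is a sketch of my own (`retry` and its parameters are not part of Selenium); you would pass it a function wrapping the `WebDriverWait` call and catch `TimeoutException`:

```python
def retry(action, exceptions, attempts=3):
    """Call action() until it succeeds, retrying on the given
    exception types at most `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: let the error surface
            print(f"Attempt {attempt} failed, trying again...")

# With Selenium this would look something like:
# retry(lambda: WebDriverWait(driver, 5).until(...), (TimeoutException,))
```

Re-raising on the final attempt means a persistently broken page still stops the run instead of silently being skipped.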
This error occurs when the element you try to click (e.g. the "next page" button) is covered by another element and becomes unclickable.
A common cause is a pop-up being triggered, or a chat box appearing at the bottom right of the page.
There are a few ways to solve this. The first is to close the pop-up.
import time
from selenium.common.exceptions import ElementClickInterceptedException

try:
    # Try to click an element
    driver.find_element_by_css_selector("button selector").click()
except ElementClickInterceptedException:
    # If a pop-up overlay appears, click the X button to close it
    time.sleep(2)  # Sometimes the pop-up takes time to load
    driver.find_element_by_css_selector("close button selector").click()
You can also use JavaScript to remove the blocking element (credit to Louis' StackOverflow answer):
from selenium.common.exceptions import ElementClickInterceptedException

try:
    # Try to click an element
    driver.find_element_by_css_selector("button selector").click()
except ElementClickInterceptedException:
    # Remove the blocking element from the DOM with JavaScript
    element = driver.find_element_by_class_name("blocking element's class")
    driver.execute_script("""
        var element = arguments[0];
        element.parentNode.removeChild(element);
    """, element)
If it's not a pop-up, the problem may be solved by scrolling away, in the hope that the blocking element moves with you and away from the button or link to be clicked.
from selenium.common.exceptions import ElementClickInterceptedException

try:
    # Try to click an element
    driver.find_element_by_css_selector("button selector").click()
except ElementClickInterceptedException:
    # Use JavaScript to scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
A stale element error occurs when the element you hold a reference to has been deleted or is no longer attached to the DOM.
Although not a very common error, some websites are prone to it.
When you encounter this error, you can simply retry a limited number of times.
from selenium.common.exceptions import StaleElementReferenceException

for attempt in range(3):  # retry up to 3 times
    try:
        item_list.append(driver.find_element_by_id("item id").text)
        break  # once the try succeeds, stop retrying
    except StaleElementReferenceException:
        continue  # if the element went stale, try again
Catching errors with these exception handlers slows your code down noticeably; wrapping every scraping action in a try/except can make a regular one-hour scraping task take twice as long.
Sometimes that is unavoidable, and full coverage of exceptions is then necessary so that you can run the job overnight without worries.
In general, though, I recommend trial and error: implement exception handling only for the errors you actually encounter.
Start with a small, representative sample of your web pages to find out which types of errors can happen.
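One way to run such a survey (a sketch of my own; `survey_errors`, `scrape_page`, and `urls` are hypothetical names, not Selenium APIs) is to scrape a random sample of pages and tally which exception types come up:

```python
import random
from collections import Counter

def survey_errors(urls, scrape_page, sample_size=20):
    """Scrape a random sample of pages and tally the exception
    types raised, so you know which handlers are worth writing."""
    sample = random.sample(urls, min(sample_size, len(urls)))
    error_counts = Counter()
    for url in sample:
        try:
            scrape_page(url)  # your normal one-page scraping routine
        except Exception as exc:  # broad on purpose: we only want the names
            error_counts[type(exc).__name__] += 1
    return error_counts
```

The resulting Counter shows which exceptions actually occur in practice, so you only add the handlers you need.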
If you're scraping several elements per page, it's best to assign each scraped value to a temporary placeholder variable before appending to its list.
Otherwise, if you append directly and an error occurs partway through, the lists appended earlier will end up with more elements than the others.
# ====================
# Use this way
# ====================
try:
    item_one_temp = driver.find_element_by_id("item one id").text
    item_two_temp = driver.find_element_by_id("item two id").text
    item_one.append(item_one_temp)
    item_two.append(item_two_temp)
except NoSuchElementException:
    item_one.append(None)
    item_two.append(None)

# ====================
# Instead of this
# ====================
try:
    item_one.append(driver.find_element_by_id("item one id").text)
    # If the next element does not exist, item_one has already been appended,
    # i.e. one of your lists ends up longer than the other
    item_two.append(driver.find_element_by_id("item two id").text)
except NoSuchElementException:
    item_one.append(None)
    item_two.append(None)
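A variation on the same idea (my own suggestion, not from the snippets above) is to build one record per page and only combine them at the end, so the fields can never drift out of sync. `fetch_field` here is a hypothetical stand-in for your `find_element` calls, and `KeyError` stands in for `NoSuchElementException`:

```python
def collect_row(fetch_field, field_names, missing=None):
    """Fetch every field for one page; if any field is missing,
    record `missing` for that field instead of skipping it."""
    row = {}
    for name in field_names:
        try:
            row[name] = fetch_field(name)
        except KeyError:  # stand-in for NoSuchElementException
            row[name] = missing
    return row

# Rows built this way are always complete, so lists derived
# from them always have matching lengths
rows = [collect_row(lambda name: {"title": "Widget"}[name], ["title", "price"])]
```

Because every row carries all fields, a missing element on one page can never shift the columns of every page after it.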
It's useful to catch the specific errors you expect, since different errors call for different treatments.
Only use a broad except statement if you plan to do the same thing no matter which error arises.
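The principle can be sketched without a browser at all: map each expected exception type to its own remedy and let anything unexpected propagate. `attempt` and `fallbacks` are hypothetical names of my own, not Selenium's:

```python
def attempt(action, fallbacks):
    """Run action(); if it raises an expected exception type,
    return that type's fallback value, otherwise re-raise."""
    try:
        return action()
    except Exception as exc:
        for exc_type, fallback in fallbacks.items():
            if isinstance(exc, exc_type):
                return fallback
        raise  # an error we did not plan for should still halt the run

# With Selenium, a missing element could map to None:
# attempt(lambda: driver.find_element_by_id("x").text,
#         {NoSuchElementException: None})
```

Anything not listed in `fallbacks` still crashes loudly, which is exactly what you want for errors you haven't diagnosed yet.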
I wish I had known all this before I started scraping; it would have saved me a lot of time on StackOverflow. I hope this guide has been useful :)