Selenium is a versatile browser automation tool, widely used for web scraping, that is accessible from multiple programming languages.
That makes Selenium especially useful when you need to scrape large websites, like e-commerce sites.
However, on large websites, the pages you scrape won't all be identical to one another.
Hence, error (exception) handling is very important.
Without proper exception handling, you may hit error after error and waste time, as any unhandled exception will simply halt your scraping job.
This is especially bad when you have set up your scraping task to run over lunch or overnight.
Sometimes a page simply does not have a certain element, which is very common. Selenium raises NoSuchElementException when you try to find it.
For example, you might be scraping Amazon for product reviews. Products that do not have any reviews simply do not have a review element to show you.
This can be easily solved by appending "" or None to the list that you're populating with your scrape results.
```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

reviews = []
for page in pages_to_scrape:
    try:
        # Insert your scraping action here
        # (Selenium 4 syntax; older tutorials use find_element_by_css_selector)
        reviews.append(driver.find_element(By.CSS_SELECTOR, 'div.review').text)
    except NoSuchElementException:
        # Just append a None or ""
        reviews.append(None)
```
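If you would rather avoid the exception altogether, one option is `find_elements` (plural), which returns an empty list when nothing matches instead of raising NoSuchElementException. Below is a minimal sketch of that idea. The helper name `first_text_or_none` and the `FakeDriver` / `FakeElement` stand-ins are made up for this example so that it runs without a browser; with a real driver you would pass `By.CSS_SELECTOR` and your own selector.

```python
def first_text_or_none(driver, by, selector):
    """Return the text of the first matching element, or None if there is none."""
    # find_elements (plural) returns an empty list when nothing matches,
    # so a missing element does not raise NoSuchElementException.
    elements = driver.find_elements(by, selector)
    return elements[0].text if elements else None


# --- Hypothetical stand-ins so the sketch runs without a browser ---
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, selector):
        return self._elements


# "css selector" is the string value of Selenium's By.CSS_SELECTOR
with_review = first_text_or_none(
    FakeDriver([FakeElement("Great product")]), "css selector", "div.review"
)
without_review = first_text_or_none(FakeDriver([]), "css selector", "div.review")
print(with_review, without_review)
```

Keep in mind that if you have set an implicit wait on the driver, `find_elements` will still wait that long before returning the empty list.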
Some websites are simply slow, or so small that your scraping overloads their server.
The latter is always bad: you shouldn't crash other people's websites when scraping.
In any case, timeouts (i.e. the page fails to load in time) are common on large websites, and they show up as a TimeoutException.
However, I don't recommend catching this error unless you know from experience that the website is prone to timeouts. It can be a huge waste of resources if the website has only a 0.01% chance of timing out.
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.get("your url")

# Remove the while loop and break if you don't want to try again when it took too long
while True:
    try:
        # Define an element that you can start scraping when it appears
        # If the element appears within 5 seconds, break the loop and continue
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "your selector"))
        )
        break
    except TimeoutException:
        # If the loading took too long, print message and try again
        print("Loading took too much time!")
```
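A `while True` loop like this never gives up, so a page that is really down will stall an overnight run. One way around that, sketched here under my own assumptions, is to cap the number of attempts and keep a list of the URLs that never loaded. The function `load_pages`, its parameters and the `fake_load` stand-in are hypothetical names for this example; with Selenium, `load_page` would wrap `driver.get` plus the `WebDriverWait` call, and you would pass `TimeoutException` as `timeout_error`.

```python
def load_pages(urls, load_page, timeout_error, max_attempts=3):
    """Try each URL up to max_attempts times; return (loaded, failed) URL lists."""
    loaded, failed = [], []
    for url in urls:
        for _ in range(max_attempts):
            try:
                load_page(url)
            except timeout_error:
                continue  # timed out, try this URL again
            loaded.append(url)
            break
        else:
            # Every attempt timed out: give up and record the URL for later
            failed.append(url)
    return loaded, failed


# --- Hypothetical stand-in so the sketch runs without a browser ---
attempts = {}


def fake_load(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == "https://example.com/slow" and attempts[url] < 2:
        raise TimeoutError("simulated timeout")  # loads on the 2nd attempt
    if url == "https://example.com/down":
        raise TimeoutError("simulated timeout")  # never loads


loaded, failed = load_pages(
    ["https://example.com/ok", "https://example.com/slow", "https://example.com/down"],
    fake_load,
    TimeoutError,  # built-in stand-in; pass Selenium's TimeoutException in real use
)
print("loaded:", loaded)
print("failed:", failed)
```

The failed list can be saved and retried later, once the overnight run has finished.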
ElementClickInterceptedException happens when the element you try to click (e.g. the "next page" button) gets blocked by another element and becomes unclickable.
Common causes are a pop-up being triggered, or a chat box that appears at the bottom-right of the page.
There are a few ways to solve this. The first way is to close the pop-up.
```python
import time

from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.by import By

try:
    # Tries to click an element
    driver.find_element(By.CSS_SELECTOR, "button selector").click()
except ElementClickInterceptedException:
    # If pop-up overlay appears, click the X button to close
    time.sleep(2)  # Sometimes the pop-up takes time to load
    driver.find_element(By.CSS_SELECTOR, "close button selector").click()
```
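The second way is to remove the blocking element from the page with JavaScript.
```python
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.by import By

try:
    # Tries to click an element
    driver.find_element(By.CSS_SELECTOR, "button selector").click()
except ElementClickInterceptedException:
    # Find the element that is in the way and delete it from the DOM
    element = driver.find_element(By.CLASS_NAME, "blocking element's class")
    driver.execute_script("""
        var element = arguments[0];
        element.parentNode.removeChild(element);
    """, element)
```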
If it's not a pop-up, the problem can sometimes be solved by scrolling, in the hope that the blocking element ends up away from the button or link you want to click.
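Here is one possible way to do the scrolling, as a rough sketch rather than a guaranteed fix: scroll the target to the middle of the viewport with JavaScript before clicking, since chat boxes and sticky bars tend to sit near the screen edges. The helper name `scroll_into_view_and_click` and the `FakeDriver` / `FakeElement` stand-ins are my own inventions so that the example runs without a browser.

```python
def scroll_into_view_and_click(driver, element):
    """Scroll the element to the middle of the viewport, then click it."""
    # block: 'center' keeps the target away from the top and bottom edges,
    # where sticky headers and chat widgets usually sit.
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
    element.click()


# --- Hypothetical stand-ins so the sketch runs without a browser ---
class FakeElement:
    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


class FakeDriver:
    def __init__(self):
        self.scripts = []

    def execute_script(self, script, *args):
        self.scripts.append((script, args))


fake_driver, fake_button = FakeDriver(), FakeElement()
scroll_into_view_and_click(fake_driver, fake_button)
print(fake_button.clicked)
```

With a real driver, `element` would be the result of `driver.find_element(...)`. If the click is still intercepted, fall back to closing or removing the blocking element.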
A StaleElementReferenceException happens when the element has been deleted or is no longer attached to the DOM.
Though it is not a very common error, some websites are prone to it.
When you encounter this error, you can just try again a number of times.
```python
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

while True:
    try:
        # Re-locate the element on every attempt so the reference is fresh
        item_list.append(driver.find_element(By.ID, "item id").text)
    except StaleElementReferenceException:
        continue  # If StaleElement appears, try again
    break  # once try is successful, stop while loop
```
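If you want "a number of times" to be an actual number, a small retry helper is one way to do it. This is only a sketch under my own assumptions: `retry`, `FakeStaleError` and `flaky_lookup` are hypothetical names, and with Selenium you would pass `StaleElementReferenceException` as `retry_on` and a function that re-locates the element as `action`.

```python
import time


def retry(action, retry_on, max_attempts=3, pause=0.5):
    """Call action() until it succeeds, at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise  # still failing after the last attempt: let the caller decide
            time.sleep(pause)  # give the page a moment before trying again


# --- Hypothetical stand-ins so the sketch runs without a browser ---
class FakeStaleError(Exception):
    """Plays the role of StaleElementReferenceException in this demo."""


calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeStaleError("element went stale")
    return "item text"


result = retry(flaky_lookup, FakeStaleError, max_attempts=5, pause=0)
print(result, calls["count"])
```

The important part is that `action` finds the element again on each call. Retrying with an old element reference will just raise the same error.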
Handling exceptions can slow your scraper down significantly, largely because of the waits and retries that come with it. It can double the running time of a regular 1-hour scraping task, especially if you wrap all your scraping actions in exception handling.
Sometimes it's totally inescapable, and that's when full coverage of exceptions is necessary so that you can run the task overnight without any worries.
However, I recommend testing your code by trial and error and handling only the exceptions you actually encounter.
Start with a small, representative sample of your web pages to find out the types of errors that can happen.
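A trial run like that could look something like the sketch below, which scrapes a random sample of pages and counts the exception types it runs into. Everything here is an assumption for illustration: `trial_run`, its parameters and the `fake_scrape` stand-in are made-up names, and in real use `scrape_page` would be your own function that loads and scrapes one URL.

```python
import random
from collections import Counter


def trial_run(urls, scrape_page, sample_size=20, seed=0):
    """Scrape a random sample of URLs and count the exception types raised."""
    rng = random.Random(seed)
    sample = rng.sample(urls, min(sample_size, len(urls)))
    errors = Counter()
    for url in sample:
        try:
            scrape_page(url)
        except Exception as exc:  # broad on purpose: this run only surveys errors
            errors[type(exc).__name__] += 1
    return errors


# --- Hypothetical stand-in so the sketch runs without a browser ---
def fake_scrape(url):
    page_number = int(url.rsplit("/", 1)[1])
    if page_number % 5 == 0:
        raise KeyError("simulated missing element")
    if page_number == 3:
        raise TimeoutError("simulated timeout")


urls = [f"https://example.com/p/{i}" for i in range(10)]
errors = trial_run(urls, fake_scrape, sample_size=10)
print(errors)
```

The counts tell you which exceptions are worth handling in the full run and which ones you can ignore.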
If you're scraping a few elements in a row, it's best to assign each result to a temporary placeholder variable before appending to a list.
This is because if you append directly to the respective lists, an error at a later step leaves the earlier lists with more elements than the later ones.
```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# ====================
# Use this way
# ====================
try:
    item_one_temp = driver.find_element(By.ID, "item one id").text
    item_two_temp = driver.find_element(By.ID, "item two id").text
    item_one.append(item_one_temp)
    item_two.append(item_two_temp)
except NoSuchElementException:
    item_one.append(None)
    item_two.append(None)

# ====================
# Instead of this
# ====================
try:
    item_one.append(driver.find_element(By.ID, "item one id").text)
    # if the next element does not exist, item_one list is already appended
    # i.e. one of your lists is longer than the other
    item_two.append(driver.find_element(By.ID, "item two id").text)
except NoSuchElementException:
    item_one.append(None)
    item_two.append(None)
```
It's useful to catch the specific exceptions that you expect, as different errors require different treatment.
Only use a broad "except" statement if you're planning to do the same thing whichever error arises.
I wish I had known all this before I started scraping, as I wasted a lot of time on Stack Overflow. I hope this guide has been useful :)