Python

Scraping Internal Links from Entire Domain

14 Sep 2020

Being able to collate all active links in a domain can be helpful either to build a sitemap or for competitive analysis.

One way to achieve that is via Python with BeautifulSoup.

In this quick post, I'll demonstrate a way to recursively gather links in a domain.

The links gathered will only be those that can be accessed via the domain by a normal user i.e. it's linked from somewhere within the domain. Some websites have hidden pages and this solution won't cover that.


Initializing

First, we'll initialize a few important things:

  1. The domain URL
  2. The random user-agent dictionary

from tqdm import tqdm
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import requests

# Random user-agent header to reduce the chance of getting blocked
ua = UserAgent()
ua_dict = {"User-Agent": ua.random}

url = "https://flutter.dev/" # You can use any other link

The link above is chosen because it's not a very big site, yet it can still be multi-layered thanks to its documentation pages. In this example, we'll do two layers of link gathering.

The first layer is done on the homepage itself, while each subsequent layer builds on the list of links gathered in the previous iteration. Two layers means one more iteration after the homepage.


First Iteration

To get the first list of links from the homepage, we gather the href attribute from every tag that has one. Note that we don't restrict the search to <a> tags, as some links are not housed within <a> tags.

body = requests.get(url, headers=ua_dict).content
soup = BeautifulSoup(body, 'html.parser')

# Collect the href attribute from every tag inside <body> that has one
body = soup.find('body')
href_tags = body.find_all(href=True)
all_links = list(set([tag.get('href') for tag in href_tags]))  # deduplicated

Now, you want to look into the all_links variable to see how the website refers to its internal links. There are two common patterns:

  1. Regular link with the domain included (e.g. https://mydomain/page1)
  2. Without the domain and starts with a forward slash (e.g. /page1)

The following code uses pattern 1, with the pattern 2 version commented out right below it for the same variable assignment. Choose the relevant one depending on your use case.

all_internal_links = [x for x in all_links if url in x] # For pattern 1
#all_internal_links = [x for x in all_links if x[0] == '/'] # For pattern 2

Note that this is not a foolproof method to gather internal links. Look through the patterns in the list of links you obtained for the site and modify the code appropriately.
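
If the link patterns are messier, a more robust (though still not foolproof) option is to normalise every link with Python's urllib.parse before comparing domains. This is a sketch of an alternative approach, not part of the original code; the is_internal helper is my own naming.

from urllib.parse import urljoin, urlparse

domain = urlparse(url).netloc  # e.g. "flutter.dev"

def is_internal(link):
    # Resolve relative links (e.g. "/page1") against the base URL,
    # then compare the domain of the resulting absolute URL
    absolute = urljoin(url, link)
    return urlparse(absolute).netloc == domain

all_internal_links = list(set(urljoin(url, x) for x in all_links if is_internal(x)))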

Now, we've finished the first iteration.


Multiple Iterations Henceforth

Moving forward, proceed at your own risk: pinging the domain too many times might crash the website or get your IP permanently banned for suspected DDoS.

This post is purely for educational purposes and I'm not condoning using this for nefarious attacks on websites!

Usually, I reserve this analysis for very small sites, and I typically only need the first layer. That's usually enough to get a good grasp of the kind of content they focus on.

Going into the second layer might help if the home page is built as a clean, minimal landing page.

However, some sites can go very, very deep, especially if one of the links you obtained from the homepage is a blog or documentation page, so be careful before setting the code to go straight into the next loop.

The following is the code to go into the second layer.

second_layer = []
for link in tqdm(all_internal_links, ncols=100):
    body = requests.get(link, headers=ua_dict).content  # For pattern 1
    #body = requests.get(url + link, headers=ua_dict).content  # For pattern 2
    soup = BeautifulSoup(body, 'html.parser')

    body = soup.find('body')
    href_tags = body.find_all(href=True)
    links = list(set([tag.get('href') for tag in href_tags]))
    internal_links = [x for x in links if url in x]  # For pattern 1
    #internal_links = [x for x in links if x[0] == '/']  # For pattern 2
    second_layer.extend(internal_links)

all_internal_links = list(set(all_internal_links + second_layer))

Using this code, you force yourself to stop at the second layer, before evaluating whether it's worth going to the next layer.

If you really want to scrape the entire site (e.g. you own the site), you can follow these steps; a sketch of the full loop follows the list:

  1. Add a cache variable assigned with the homepage links.
  2. Place the code above in a loop e.g. for n in range(2) to go into the third layer.
  3. Create an intermediary list to house the new links.
  4. Loop the cache list.
  5. After scraping that round, push all the links to the intermediary list.
  6. Filter for internal links, then check them against all_internal_links for duplicates.
  7. If there's no new link, break the loop.
  8. If there is, extend all_internal_links with the new links, then assign new links to cache. Repeat the next loop for n.
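
As a rough illustration, here's a minimal sketch of that loop, assuming pattern 1 links and reusing ua_dict from earlier. The new_links name is my own for the intermediary list, and I've used a while loop with a break instead of for n in range(...), but the logic follows the steps above.

cache = list(all_internal_links)  # links to scrape in the next round

while cache:
    new_links = []  # intermediary list for this round's discoveries
    for link in tqdm(cache, ncols=100):
        try:
            body = requests.get(link, headers=ua_dict).content
        except requests.RequestException:
            continue  # skip links that fail to load
        soup = BeautifulSoup(body, 'html.parser')
        body = soup.find('body')
        if body is None:
            continue  # skip non-HTML responses
        href_tags = body.find_all(href=True)
        links = list(set([tag.get('href') for tag in href_tags]))
        new_links.extend([x for x in links if url in x])  # For pattern 1

    # Keep only links we haven't seen before
    new_links = [x for x in set(new_links) if x not in all_internal_links]
    if not new_links:
        break  # no new links, so every reachable page has been covered
    all_internal_links.extend(new_links)
    cache = new_links  # only the new links need scraping next round

Because each round only scrapes links that haven't been seen before, the loop eventually runs out of new links and stops on its own.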

You can do other analysis once you have the list. For example, you can scrape all the body content and perform all sorts of analysis, though again you don't want to be pinging the domain too many times.
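
For instance, here's a minimal sketch of pulling each page's text into a dictionary for later analysis; the one-second time.sleep is my own addition to keep the request rate polite.

import time

page_text = {}
for link in tqdm(all_internal_links, ncols=100):
    body = requests.get(link, headers=ua_dict).content
    soup = BeautifulSoup(body, 'html.parser')
    page_text[link] = soup.get_text(separator=' ', strip=True)  # plain text of the page
    time.sleep(1)  # space out the requests to be kind to the server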

Regardless, always be responsible in your code.
