Being able to collate all active links in a domain can be helpful either to build a sitemap or for competitive analysis.
One way to achieve that is via Python with BeautifulSoup.
In this quick post, I'll demonstrate a way to recursively gather links in a domain.
The links gathered will only be those that can be accessed via the domain by a normal user, i.e. they're linked from somewhere within the domain. Some websites have hidden pages, and this solution won't cover those.
Firstly, we'll initialise several important things:
from tqdm import tqdm
from fake_useragent import UserAgent
ua = UserAgent()
ua_dict = {"User-Agent": ua.random}
from bs4 import BeautifulSoup
import requests
url = "https://flutter.dev/" # You can use any other link
The link above is chosen because it's not a very big site, yet it can still be multi-layered thanks to its documentation pages. In this example, we'll do two layers of link gathering.
The first layer will be done on the homepage itself, while subsequent layers build onto the list of links gathered in the previous iteration. Two layers = 1 more iteration after the homepage.
To get the first list of links from the homepage, we gather the href attribute of every tag that has one. Note that we don't restrict the search to <a> tags, as some links are not housed within <a> tags.
body = requests.get(url, headers = ua_dict).content
soup = BeautifulSoup(body, 'html.parser')
body = soup.find('body')
href_tags = body.find_all(href=True)
all_links = list(set([tag.get('href') for tag in href_tags]))
Now, you want to look into the all_links variable to see the way the website refers to an internal link. There are two common patterns: absolute URLs that contain the domain (e.g. https://flutter.dev/docs), and relative paths that start with a slash (e.g. /docs).
The following code uses pattern 1, with pattern 2 commented out right below it for the same variable assignment. Choose the relevant one depending on your use case.
all_internal_links = [x for x in all_links if url in x] # For pattern 1
#all_internal_links = [x for x in all_links if x[0] == '/'] # For pattern 2
Note that this is not a foolproof method to gather internal links. Look through the patterns in the list of links you obtained for the site and modify the code appropriately.
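If you'd rather not commit to one pattern, a more robust alternative is to resolve every href against the base URL with urllib.parse.urljoin and then filter by domain. This is a sketch of that idea, not part of the original code above; the function name is my own:

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, links):
    """Resolve relative hrefs against the base URL and keep only
    links that stay on the same domain (covers both patterns)."""
    domain = urlparse(base_url).netloc
    internal = set()
    for href in links:
        absolute = urljoin(base_url, href)  # "/docs" -> "https://flutter.dev/docs"
        if urlparse(absolute).netloc == domain:
            internal.add(absolute)
    return sorted(internal)

links = ["/docs", "https://flutter.dev/showcase", "https://example.com/page"]
print(normalize_links("https://flutter.dev/", links))
# ['https://flutter.dev/docs', 'https://flutter.dev/showcase']
```

This also deduplicates the two patterns into a single absolute form, which makes the set operations later on more reliable.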
Now, we've finished the first iteration.
Moving forward, proceed at your own risk: pinging the domain too many times might crash the website or get your IP permanently banned for a suspected DDoS attack.
This post is purely for educational purposes and I'm not condoning using this for nefarious attacks on websites!
Usually, I reserve this analysis for very small sites, and the first layer alone is usually enough to get a good grasp of the kind of content they focus on.
Going into the second layer might help if the home page is built as a clean, minimal landing page.
However, as some sites can go very, very deep especially if one of the links you obtained in the homepage is a blog or documentation page, be careful before setting the code to go straight to the next loop.
Here's the code to go into the second layer. It iterates over the internal links we gathered from the homepage:
second_layer = []
for link in tqdm(all_internal_links, ncols = 100):
    body = requests.get(link, headers = ua_dict).content
    #body = requests.get(url + link, headers = ua_dict).content # For pattern 2
    soup = BeautifulSoup(body, 'html.parser')
    body = soup.find('body')
    href_tags = body.find_all(href=True)
    links = list(set([tag.get('href') for tag in href_tags]))
    internal_links = [x for x in links if url in x] # For pattern 1
    #internal_links = [x for x in links if x[0] == '/'] # For pattern 2
    second_layer.extend(internal_links)
all_internal_links = list(set(all_internal_links + second_layer))
This code forces you to stop at the second layer, so you can evaluate whether it's worth going one layer deeper.
If you really want to scrape the entire site, e.g. you own the site, you can keep repeating this process layer by layer until no new links are found.
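The repeat-until-no-new-links idea amounts to a breadth-first crawl with a visited set. Here's a minimal stdlib-only sketch of that loop; `get_links` is a stand-in for the requests + BeautifulSoup fetching shown earlier, and the toy link map is made-up data so the example runs without hitting any real server:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, get_links, max_pages=200):
    """Breadth-first crawl: visit each internal page once, stopping
    when no new links appear or max_pages is reached.
    get_links(url) should return the hrefs found on that page."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        for link in get_links(page):
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Toy link map standing in for real HTTP requests:
site = {
    "https://flutter.dev/": ["https://flutter.dev/docs"],
    "https://flutter.dev/docs": ["https://flutter.dev/",
                                 "https://flutter.dev/showcase"],
    "https://flutter.dev/showcase": [],
}
print(crawl("https://flutter.dev/", lambda u: site.get(u, [])))
```

The max_pages cap and the seen set are what keep this from hammering the domain indefinitely; for a real run, you'd also want to sleep between requests.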
You can run other analyses once you have the lists. For example, you can scrape all body content and perform all sorts of analysis, though again, you don't want to be pinging the domain too many times.
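As one illustration of scraping body content, here's a small helper that extracts the visible text from a page's HTML with BeautifulSoup's get_text; the function name is my own, and the HTML string is a stand-in for a fetched page:

```python
from bs4 import BeautifulSoup

def page_text(html):
    """Extract the visible body text from a page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("body")
    return body.get_text(separator=" ", strip=True) if body else ""

html = "<html><body><h1>Flutter</h1><p>Build apps.</p></body></html>"
print(page_text(html))  # Flutter Build apps.
```

From there, word frequencies or keyword counts over each page's text give a quick picture of what the site focuses on.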
Regardless, always be responsible in your code.