• Home
  • Guides
    • All
    • Linux
    • Programming
    • Tools
    Building a Python Broken Link Checker

    Building a Python Broken Link Checker

    Working With Webhooks in PHP

    Working With Webhooks in PHP

    My Backup Setup for Linux PCs

    My Backup Setup for Linux PCs

    Detecting Hidden WordPress Malware Disguised as Images

    Detecting Hidden WordPress Malware Disguised as Images

    Server-Side Image Conversion with Apache

    Server-Side Image Conversion with Apache

    Fastest Way to Extract a Massive .tar.gz File on Linux

    Fastest Way to Extract a Massive .tar.gz File on Linux

    Monitor SSL Expiration with Python

    Monitor SSL Expiration with Python

    Building a Simple WordPress Post List Tool with PHP

    Building a Simple WordPress Post List Tool with PHP

    Monitoring Web Page Changes with Python

    Monitoring Web Page Changes with Python

  • Blog
    • All
    • Artificial Intelligence
    • Developer Life
    • Privacy
    • Reviews
    • Security
    • Tutorials
    What Drawing and Animating in BASIC Taught Me About Programming

    What Drawing and Animating in BASIC Taught Me About Programming

    Why I Stopped Treating SMS As Good Enough

    Why I Stopped Treating SMS As Good Enough

    Why I Built My Own Ad Server Instead of Relying on Outdated Platforms

    Why I Built My Own Ad Server Instead of Relying on Outdated Platforms

    How I Stay Focused When Working on Long Term Projects

    How I Stay Focused When Working on Long Term Projects

    Imposter Syndrome as a Self-Taught Developer

    Imposter Syndrome as a Self-Taught Developer

    Why Stable Websites Outperform Flashy Redesigns

    Why Stable Websites Outperform Flashy Redesigns

    AdGuard Ad Blocker Review

    AdGuard Ad Blocker Review

    Surfshark VPN Review

    Surfshark VPN Review

    Nmap Unleash the Power of Cybersecurity Scanning

    Nmap: Unleash the Power of Cybersecurity Scanning

  • Apps
    • Bible App
    • Bible Verse Screensaver
    • Blue AI Chatbot
    • Early Spring Predictor
    • FIGlet Generator
    • Password Generator
    • StegX
    • The Matrix
    • WeatherX
    • Website Risk Level Tool
  • About
    • About JMooreWV
    • Live Cyber Attack Stats
  • Contact
    • General Contact
    • Website Administration & Cybersecurity
No Result
View All Result
  • Home
  • Guides
    • All
    • Linux
    • Programming
    • Tools
    Building a Python Broken Link Checker

    Building a Python Broken Link Checker

    Working With Webhooks in PHP

    Working With Webhooks in PHP

    My Backup Setup for Linux PCs

    My Backup Setup for Linux PCs

    Detecting Hidden WordPress Malware Disguised as Images

    Detecting Hidden WordPress Malware Disguised as Images

    Server-Side Image Conversion with Apache

    Server-Side Image Conversion with Apache

    Fastest Way to Extract a Massive .tar.gz File on Linux

    Fastest Way to Extract a Massive .tar.gz File on Linux

    Monitor SSL Expiration with Python

    Monitor SSL Expiration with Python

    Building a Simple WordPress Post List Tool with PHP

    Building a Simple WordPress Post List Tool with PHP

    Monitoring Web Page Changes with Python

    Monitoring Web Page Changes with Python

  • Blog
    • All
    • Artificial Intelligence
    • Developer Life
    • Privacy
    • Reviews
    • Security
    • Tutorials
    What Drawing and Animating in BASIC Taught Me About Programming

    What Drawing and Animating in BASIC Taught Me About Programming

    Why I Stopped Treating SMS As Good Enough

    Why I Stopped Treating SMS As Good Enough

    Why I Built My Own Ad Server Instead of Relying on Outdated Platforms

    Why I Built My Own Ad Server Instead of Relying on Outdated Platforms

    How I Stay Focused When Working on Long Term Projects

    How I Stay Focused When Working on Long Term Projects

    Imposter Syndrome as a Self-Taught Developer

    Imposter Syndrome as a Self-Taught Developer

    Why Stable Websites Outperform Flashy Redesigns

    Why Stable Websites Outperform Flashy Redesigns

    AdGuard Ad Blocker Review

    AdGuard Ad Blocker Review

    Surfshark VPN Review

    Surfshark VPN Review

    Nmap Unleash the Power of Cybersecurity Scanning

    Nmap: Unleash the Power of Cybersecurity Scanning

  • Apps
    • Bible App
    • Bible Verse Screensaver
    • Blue AI Chatbot
    • Early Spring Predictor
    • FIGlet Generator
    • Password Generator
    • StegX
    • The Matrix
    • WeatherX
    • Website Risk Level Tool
  • About
    • About JMooreWV
    • Live Cyber Attack Stats
  • Contact
    • General Contact
    • Website Administration & Cybersecurity
No Result
View All Result
Home Guides Programming Python

Building a Python Broken Link Checker

Jonathan Moore by Jonathan Moore
15 hours ago
Reading Time: 26 mins read
A A
Building a Python Broken Link Checker
FacebookTwitter

Broken links have a way of hiding until they become somebody else’s problem. A page gets renamed, a migration misses a detail, a plugin changes a path, or an old article keeps pointing to something that no longer exists. By the time someone notices it, the site has already been serving dead links for days, weeks, or longer. I like catching that kind of cleanup early, especially on larger sites where checking things by hand turns into a waste of time quickly.

I wanted a tool that would start from one URL, crawl through a site, and tell me exactly where broken links were being found. I did not want a vague report that only listed bad targets without context. I wanted the page where the bad link appeared, the URL it pointed to, and the response or error that made it fail. With that information, I can go straight to the page that needs attention.

I also wanted it to stay practical. It needed to crawl internal pages, avoid checking the same things over and over, and give real time feedback while it was working. I added sitemap support too, because sometimes a sitemap is the fastest way to cover a site without waiting for a crawler to discover everything naturally. That gave me a tool I could use for small cleanups, migrations, and larger site audits without changing my workflow much.

What The Script Needs To Do

A useful broken link checker has to do more than request a page and look for a 404. It needs to fetch HTML, extract links, normalize them, decide whether they are internal or external, and then keep moving through the internal ones without crawling forever. It also needs to track what it has already seen so it does not waste time revisiting the same pages over and over. Without that piece, a crawler becomes noisy and slow in a hurry.

I also wanted the checker to support two entry points. The first is the obvious one, which is starting from a single URL and letting the crawler discover the rest of the site from there. The second is sitemap mode, where I can point it at a sitemap XML file and have it pull page URLs directly from that list. Some sites are easier to audit from a sitemap than from navigation alone, so both modes earn their place.

The last piece was reporting. I wanted the script to print progress while it runs so it does not sit there looking dead on a bigger site. I also wanted it to support JSON and CSV output because there are plenty of times when the terminal is not the final destination. Sometimes I want a quick answer on screen, and sometimes I want a report I can sort later.

I kept the scope narrow on purpose. I was not trying to build a full SEO platform or a headless browser crawler with endless configuration. I wanted something I could run against sites I actually maintain, understand a few months later, and adjust without having to relearn the whole thing.

Starting With A Simple Command Line Tool

I kept the script as a single command line tool because that fits the kind of work I usually do. When I am checking a site, I do not need a full interface. I need something I can point at a URL, limit with a few flags, and run from a terminal without extra friction. Keeping it small also makes the important parts easier to find when I need to change something later.

The script begins with a straightforward header and argument parser. I gave it options for request timeout, page limits, skipping external links, sitemap mode, and optional JSON or CSV output. Those switches cover the way I usually run this kind of scan without turning the command into a pile of flags. If a tool becomes annoying to run, it gets ignored, so I try to keep the command clean.

The first part of the script sets up the imports and command line. Once those pieces are in place, the crawling logic has a clear entry point. It also makes the script feel like something I can run directly instead of a loose collection of functions. Here is the top of the script where the main behavior and options begin:

#!/usr/bin/env python3
#
# Crawl a website or sitemap and report broken links with the page where each
# bad link was found. Supports internal crawling, optional external link
# checking, and JSON or CSV output.
#
# Example:
# python3 broken_link_checker.py https://example.com --max-pages 25 --skip-external

from __future__ import annotations

import argparse
import csv
import json
import sys
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from pathlib import Path
from typing import Iterable
from xml.etree import ElementTree
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen

The argument setup is not flashy, but it decides whether I will actually keep using the tool. A crawler like this should be obvious to run and easy to adjust. I would rather have a focused command I reach for often than a larger one with options I never touch.

Defining The Data The Script Tracks

Before the crawler starts doing any work, I define the basic pieces of data it needs to pass around. The constants keep the default timeout, user agent, HTML content types, and skipped URL schemes in one place. The two small data classes give the rest of the script a cleaner way to talk about page responses and broken link results.

I like using data classes for this kind of utility because they make the output obvious. A broken link is not just a string, and a page response is not just a chunk of HTML. Each one has a few pieces of information that need to travel together through the script.

This setup gives the rest of the script a stable vocabulary. I can tell the difference between a page that was fetched and a link issue that needs to be reported. Here is the setup for those constants and data classes.

DEFAULT_TIMEOUT = 15
DEFAULT_USER_AGENT = "JMooreWV-BrokenLinkChecker/1.0"
HTML_TYPES = ("text/html", "application/xhtml+xml")
SKIPPED_SCHEMES = {"mailto", "tel", "javascript", "data", "ftp"}


@dataclass(slots=True)
class LinkIssue:
    source_url: str
    target_url: str
    status_code: int | None
    error: str


@dataclass(slots=True)
class PageResult:
    url: str
    status_code: int
    content_type: str
    body: str

The LinkIssue class is what ends up in the final report. The PageResult class is what the crawler gets back after loading a page. Separating those two keeps response handling and reporting from blurring together.

Fetching Pages

The script uses Python’s standard library for HTTP requests, so there is no dependency to install before running it. I build a request with a user agent and a broad accept header, then pass it to urlopen with the configured timeout. When the method is GET, the response body gets decoded so the crawler can inspect the HTML.

The same fetch function is used for page crawling, sitemap loading, and link checking. That keeps the request behavior consistent across the script. If I ever need to change headers or timeout behavior later, there is one place to do it.

The fetch layer is intentionally plain. It wraps the standard library request call, captures the response details, and returns a PageResult. This is the request and fetch layer.

def make_request(url: str, method: str, timeout: int) -> Request:
    return Request(
        url,
        method=method,
        headers={
            "User-Agent": DEFAULT_USER_AGENT,
            "Accept": "*/*",
        },
    )


def fetch_url(url: str, timeout: int, method: str = "GET") -> PageResult:
    request = make_request(url, method=method, timeout=timeout)
    with urlopen(request, timeout=timeout) as response:
        status_code = getattr(response, "status", response.getcode())
        content_type = response.headers.get("Content-Type", "")
        body = ""
        if method == "GET":
            charset = response.headers.get_content_charset() or "utf-8"
            body = response.read().decode(charset, errors="replace")
        return PageResult(
            url=response.geturl(),
            status_code=status_code,
            content_type=content_type,
            body=body,
        )


def fetch_text(url: str, timeout: int) -> str:
    result = fetch_url(url, timeout=timeout, method="GET")
    return result.body

The fetch_text helper keeps the sitemap code direct. It says exactly what that part of the script needs: load this URL and give me text back. That tiny wrapper avoids repeating the same fetch and body access pattern later.

Extracting Links From HTML

Once the script can fetch a page, the next job is pulling links out of the HTML cleanly. I used Python’s HTMLParser module for that instead of adding a dependency. For this kind of internal tool, I do not always need a heavier parser if the task is simple and focused. In this case, I only care about collecting href values from links and area tags.

That parser gets fed the HTML body from each fetched page. From there, I convert relative links into absolute URLs with urljoin, strip fragments with urldefrag, and ignore schemes that are not useful for this kind of check, like mailto, tel, javascript, data, and ftp. That keeps the crawler focused on the links that actually matter for a website audit.

Normalizing the URL removes a surprising amount of noise. A page with and without a fragment is still the same page for this job, and a bare domain should collapse to / instead of being treated like a different address. Cleaning that up early prevents duplicate work later in the crawl.

This section is where a lot of cleanup happens before a link ever reaches the checker. By the time a URL leaves this function, it is already in a more predictable shape. That saves the rest of the script from making the same decisions repeatedly. Here is the part of the script that extracts and normalizes links:

class LinkExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag not in {"a", "area"}:
            return

        attr_map = dict(attrs)
        href = (attr_map.get("href") or "").strip()
        if href:
            self.links.append(href)


def normalize_url(url: str) -> str:
    clean_url, _fragment = urldefrag(url)
    parsed = urlparse(clean_url)
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()
    path = parsed.path or "/"
    normalized = parsed._replace(scheme=scheme, netloc=netloc, path=path, fragment="")
    return normalized.geturl()


def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html)
    results = []
    for raw_link in parser.links:
        absolute = urljoin(base_url, raw_link)
        if should_skip(absolute):
            continue
        results.append(normalize_url(absolute))
    return results

The parser does more than collect links. If URLs are not normalized, the same page can show up several times with small differences and waste a lot of crawl time. If unhelpful schemes are not filtered out, the report gets cluttered with things that were never supposed to be treated like pages. A crawler works better when it is strict about what belongs in the scan.

Filtering Pages And URLs

After links are extracted, the script still has to decide what they mean for the crawl. Some URLs should be skipped completely, some should be checked but not crawled, and some should be added back into the queue. These helper functions keep those decisions out of the middle of the main loop.

The is_html function prevents the crawler from trying to parse images, PDFs, feeds, or other non HTML responses as pages. The domain helpers keep the crawl tied to the original site boundary. The skip helper removes URL schemes that do not make sense for this kind of checker.

Each helper answers one question about a URL or response. That keeps the main loop focused on crawl behavior instead of low-level checks. Here are the filtering helpers.

def is_html(content_type: str) -> bool:
    lowered = content_type.lower()
    return any(kind in lowered for kind in HTML_TYPES)


def same_domain(url_a: str, url_b: str) -> bool:
    a = urlparse(url_a)
    b = urlparse(url_b)
    return a.scheme.lower() == b.scheme.lower() and a.netloc.lower() == b.netloc.lower()


def filter_same_domain(urls: Iterable[str], root_url: str) -> list[str]:
    return [url for url in urls if same_domain(root_url, url)]


def should_skip(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme.lower() in SKIPPED_SCHEMES

This group is also where the sitemap domain fix lives. The crawler compares discovered URLs against the original root URL, not whichever page happens to be loaded at the moment. Without that rule, an external URL can accidentally become the new point of comparison.

Staying Inside The Website

One thing I wanted to keep under control was scope. If I tell the script to crawl a website, I want it to stay on that website. That means internal links should be added to the crawl queue, while external links should never be followed recursively. I may still want to validate external targets in some cases, but I do not want the script wandering across the internet because a page happened to link somewhere else.

The way I handled that was by anchoring the crawl to the original root domain. If a link matches the scheme and host of that root URL, it counts as internal and can be queued. If it does not match, it stays out of the crawl path, even when the script starts from a sitemap.

The practical result is predictable behavior. If I point it at a client site or a staging site, I know it is not going to fan out across every social network, CDN landing page, or partner site linked from the footer. It can still inspect those targets if I allow external checking, but it will not start crawling them.

The crawl loop uses that check before adding anything back into the queue. Without it, a link checker can drift away from the site it was supposed to inspect. The queue should only grow with pages that belong to the original site. This is the key part of the loop where the script decides whether a link gets queued:

for link in extract_links(final_url, page.body):
    discovered_links += 1

    internal = same_domain(normalized_root, link)
    if internal and link not in seen_pages and link not in queue:
        # Only internal links are added back into the crawl queue.
        queue.append(link)
        print(f"  Queued internal link: {link}")

    if not internal and skip_external:
        continue

That one decision controls the behavior of the entire crawl. I can still validate external links when I want to, or ignore them entirely with --skip-external. For maintenance work, having both choices is better than forcing every scan to behave the same way.

Checking Links Without Repeating Work

A crawler can waste a lot of time if it checks the same broken link every time it appears on a different page. That becomes especially common with repeated navigation, footer links, and reused content blocks. To avoid that, I cache checked links in a dictionary so each target gets requested once and the result is reused everywhere else it appears.

I also start with HEAD requests when checking links, because there is no reason to download a full response body just to confirm whether a URL is alive. Some servers do not support HEAD properly, though, so the script falls back to GET when it sees a 405 or 501. The checker needs that fallback because servers are not always consistent.

I have run into that exact problem enough times that I did not want to pretend every server behaves correctly. A checker that only works against polite servers is not much help when you are dealing with older sites, pieced together hosting, or strange application stacks. The fallback makes the script slower in a few cases, but much less fragile overall.

The fallback logic stays small, but it saves a lot of false failures. I want the checker to be quick when the server cooperates and patient when it does not. A broken link report only helps when the results are reliable enough to act on. Here is that part of the script:

def check_link(url: str, timeout: int) -> tuple[int | None, str | None]:
    try:
        result = fetch_url(url, timeout=timeout, method="HEAD")
        return result.status_code, None
    except HTTPError as exc:
        if exc.code in {405, 501}:
            try:
                # Some servers reject HEAD requests, so fall back to GET.
                result = fetch_url(url, timeout=timeout, method="GET")
                return result.status_code, None
            except HTTPError as retry_exc:
                return retry_exc.code, retry_exc.reason or "HTTP error"
            except URLError as retry_exc:
                return None, str(retry_exc.reason)
        return exc.code, exc.reason or "HTTP error"
    except URLError as exc:
        return None, str(exc.reason)

This keeps the script efficient without assuming every server behaves the same way. I still get the speed benefit of lightweight checks in most cases, while the fallback path handles the servers that need a little more patience. For a script I might run often, that tradeoff is worth it.

Adding Real Time Feedback

I do not like long running scripts that sit silently while they work. If I point a checker at a bigger site and it prints nothing for a while, I start wondering whether it is stuck, whether the requests are hanging, or whether I made a mistake with the command. Real time feedback removes that uncertainty from the crawl loop. I want to see what page is being processed, how large the queue is, and when links are being checked or flagged as broken.

This kind of feedback is not just cosmetic. It tells me the tool is still moving, and it helps me spot problems while the scan is running. On larger sites, it also gives me a rough feel for progress without needing to wait for a final summary. During a migration or cleanup window, that visibility saves a lot of guessing.

The feedback does not need to be fancy to help. It just needs to show enough movement that I can see what the script is doing. A short progress trail is usually enough to tell me whether the scan is behaving normally. The script now prints output like this as it runs:

Crawling page 1/25: https://example.com/
  Queue size: 1
  Loaded: https://example.com/ [200]
  Queued internal link: https://example.com/about/
  Checking link: https://example.com/about/

I also like that this output doubles as a sanity check. If the queue starts exploding or the same kind of URL keeps showing up, I can usually spot the pattern right in the terminal before the scan is finished. That saves time when I am testing the crawler against a site with odd navigation or duplicate content paths.

The goal was not to create a dashboard in the terminal. I just wanted enough visibility to know the script was alive, moving, and behaving the way I expected. In practice, the progress output made the tool feel much better during real scans.

Adding Sitemap Support

Starting from a single URL works well, but it is not always the fastest way to cover a site. Sometimes a sitemap gives me a cleaner list of pages right away, especially on larger sites or sites with pages that are not easy to discover through normal navigation. Adding --sitemap support gave the script a second path into the same audit.

The script can read a standard sitemap that lists page URLs directly, and it can also handle a sitemap index that points to child sitemap files. In sitemap mode, those URLs get loaded into the same checking flow as normal crawl mode, but they still get filtered against the original site boundary first. That means I do not need a second tool or a separate report format just because I started from XML instead of from a page.

I wanted that mode because not every site is well linked internally. Some pages live deep in archives, some are only exposed through search or taxonomy pages, and some are technically published without being easy to discover from the front end. A sitemap can surface those pages much faster than waiting for a crawler to stumble into them.

The sitemap parser looks for the XML namespace first because sitemap files often include one. After that, it handles either a sitemap index or a standard URL set. If the sitemap points to child sitemap files, the script loads each one and combines the URLs into a single list.

def extract_sitemap_urls(xml_text: str) -> list[str]:
    try:
        root = ElementTree.fromstring(xml_text)
    except ElementTree.ParseError as exc:
        raise RuntimeError(f"Failed to parse sitemap XML: {exc}") from exc

    namespace = ""
    if root.tag.startswith("{"):
        namespace = root.tag.split("}", 1)[0] + "}"

    urls: list[str] = []
    if root.tag == f"{namespace}sitemapindex":
        # A sitemap index points to one or more child sitemap files.
        for loc in root.findall(f".//{namespace}loc"):
            if loc.text and loc.text.strip():
                urls.append(loc.text.strip())
    elif root.tag == f"{namespace}urlset":
        # A standard sitemap lists page URLs directly.
        for loc in root.findall(f".//{namespace}loc"):
            if loc.text and loc.text.strip():
                urls.append(loc.text.strip())
    else:
        raise RuntimeError("Unsupported sitemap format.")

    return urls


def expand_sitemap_urls(sitemap_url: str, timeout: int) -> list[str]:
    print(f"Loading sitemap: {sitemap_url}")
    xml_text = fetch_text(sitemap_url, timeout=timeout)
    first_pass = extract_sitemap_urls(xml_text)

    if not first_pass:
        return []

    normalized_sitemap = normalize_url(sitemap_url)
    if normalized_sitemap.endswith(".xml") and all(url.endswith(".xml") for url in first_pass):
        expanded: list[str] = []
        for child_sitemap in first_pass:
            print(f"  Loading child sitemap: {child_sitemap}")
            # Pull URLs out of each child sitemap and combine them into one list.
            child_xml = fetch_text(child_sitemap, timeout=timeout)
            expanded.extend(extract_sitemap_urls(child_xml))
        return [normalize_url(url) for url in expanded]

    return [normalize_url(url) for url in first_pass]

With that in place, the script works for more than one kind of audit. I can run a regular crawl, a sitemap driven audit, or both depending on what kind of site I am dealing with. For site maintenance, that flexibility is worth adding early.

Building The Main Crawl Loop

The crawl loop is where the earlier pieces come together. It starts with either sitemap URLs or a single start URL, keeps a queue of pages to visit, and remembers pages it has already seen. It also keeps a cache of checked links so the same target does not get requested again and again.

When the script loads a page, it records page level errors, skips non HTML responses, extracts links from valid HTML, and decides whether each link is internal. Internal links can go back into the queue. External links are checked only when the user has not chosen --skip-external.

The loop follows the same order every time: set up the queue, fetch a page, parse links, check targets, and record issues. It is also where the real time feedback gets printed. This is the full crawl loop.

def crawl_site(
    start_url: str | None,
    root_url: str,
    timeout: int,
    max_pages: int | None,
    skip_external: bool,
    sitemap_urls: list[str] | None = None,
) -> tuple[list[LinkIssue], int, int]:
    normalized_root = normalize_url(root_url)
    queue: deque[str] = deque()
    if sitemap_urls:
        # Sitemap mode starts with every sitemap URL already queued.
        queue.extend(
            filter_same_domain(
                (normalize_url(url) for url in sitemap_urls),
                normalized_root,
            )
        )
    elif start_url:
        # Crawl mode starts with a single page and discovers the rest as it goes.
        queue.append(normalize_url(start_url))
    else:
        raise RuntimeError("A start URL or sitemap URL is required.")

    seen_pages: set[str] = set()
    checked_links: dict[str, tuple[int | None, str | None]] = {}
    issues: list[LinkIssue] = []
    discovered_links = 0

    while queue:
        current = queue.popleft()
        if current in seen_pages:
            continue
        if max_pages is not None and len(seen_pages) >= max_pages:
            break

        queued_count = len(queue) + 1
        print(
            f"Crawling page {len(seen_pages) + 1}"
            f"{f'/{max_pages}' if max_pages is not None else ''}: {current}"
        )
        print(f"  Queue size: {queued_count}")

        try:
            page = fetch_url(current, timeout=timeout, method="GET")
        except HTTPError as exc:
            issues.append(
                LinkIssue(
                    source_url=current,
                    target_url=current,
                    status_code=exc.code,
                    error=exc.reason or "HTTP error",
                )
            )
            print(f"  Page error: HTTP {exc.code} ({exc.reason or 'HTTP error'})")
            seen_pages.add(current)
            continue
        except URLError as exc:
            issues.append(
                LinkIssue(
                    source_url=current,
                    target_url=current,
                    status_code=None,
                    error=str(exc.reason),
                )
            )
            print(f"  Page error: {exc.reason}")
            seen_pages.add(current)
            continue

        final_url = normalize_url(page.url)
        seen_pages.add(final_url)
        print(f"  Loaded: {final_url} [{page.status_code}]")

        if not is_html(page.content_type):
            print(f"  Skipping non-HTML content: {page.content_type or 'unknown'}")
            continue

        for link in extract_links(final_url, page.body):
            discovered_links += 1

            internal = same_domain(normalized_root, link)
            if internal and link not in seen_pages and link not in queue:
                # Only internal links are added back into the crawl queue.
                queue.append(link)
                print(f"  Queued internal link: {link}")

            if not internal and skip_external:
                continue

            if link not in checked_links:
                # Cache link checks so the same target is not requested repeatedly.
                print(f"  Checking link: {link}")
                checked_links[link] = check_link(link, timeout=timeout)

            status_code, error = checked_links[link]
            if error is not None or (status_code is not None and status_code >= 400):
                status = status_code if status_code is not None else "ERR"
                print(f"  Broken link found [{status}]: {link}")
                issues.append(
                    LinkIssue(
                        source_url=final_url,
                        target_url=link,
                        status_code=status_code,
                        error=error or f"HTTP {status_code}",
                    )
                )

    return issues, len(seen_pages), discovered_links

This function is the longest part of the script because it coordinates almost everything else. I could split it further, but for this size of tool I like being able to read the crawl flow in one place. The helper functions keep the loop from getting too tangled.

Reporting The Broken Links Clearly

A broken link report is only useful if it tells me where to fix the problem. The script stores the source page, the target URL, the status code when there is one, and the error string when there is not. If all I know is that a URL failed somewhere on the site, I still have work to do before I can fix it. If I know exactly where it was found, the cleanup gets much easier.

I also added optional JSON and CSV output so the results can move beyond the terminal when needed. JSON works well when another script needs the results, and CSV is better when I want to sort or review the report in a spreadsheet. Those formats make the scan results easier to use after the command finishes.

I have learned that reports fail when they make me do one more round of detective work. If I have to manually retrace where a dead URL came from, the script only solved half of the problem. The more directly a result points back to the source, the more likely it is that I fix it immediately instead of leaving it for later.

The report writers take the same list of LinkIssue objects and write it in whichever format I asked for. JSON keeps the structure easy for another script to consume. CSV is better when I want to sort the results or hand them to someone who prefers a spreadsheet.

def write_json(path: Path, issues: Iterable[LinkIssue]) -> None:
    payload = [
        {
            "source_url": issue.source_url,
            "target_url": issue.target_url,
            "status_code": issue.status_code,
            "error": issue.error,
        }
        for issue in issues
    ]
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")


def write_csv(path: Path, issues: Iterable[LinkIssue]) -> None:
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["source_url", "target_url", "status_code", "error"],
        )
        writer.writeheader()
        for issue in issues:
            writer.writerow(
                {
                    "source_url": issue.source_url,
                    "target_url": issue.target_url,
                    "status_code": issue.status_code or "",
                    "error": issue.error,
                }
            )

This is the kind of script I like to keep around. It solves a specific problem, it stays readable, and it produces results I can act on right away. If it saves time during the next cleanup, it has already earned its place.

Wiring Up The Command

The last part of the script turns the helper functions into a command I can actually run. The argument parser defines the start URL, timeout, page limit, external link behavior, sitemap mode, and optional output files. The main function decides whether to load sitemap URLs first, runs the crawl, prints the summary, and writes reports when requested.

I like keeping this wiring near the bottom because the script reads from general pieces into execution. The earlier functions describe what the tool can do. The final section shows how those pieces get connected when the command runs.

This is the part that turns the script into a usable command. Both sitemap scans and normal crawls end up flowing into the same checker. Here is the command line wiring.

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Crawl a website and report broken links."
    )
    parser.add_argument(
        "start_url",
        nargs="?",
        help="Starting URL to crawl.",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT,
        help=f"Request timeout in seconds. Default: {DEFAULT_TIMEOUT}",
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=None,
        help="Maximum number of internal pages to crawl.",
    )
    parser.add_argument(
        "--skip-external",
        action="store_true",
        help="Skip checking external links and only validate internal links.",
    )
    parser.add_argument(
        "--sitemap",
        help="Sitemap URL to scan instead of starting from a single page.",
    )
    parser.add_argument(
        "--json-out",
        type=Path,
        help="Write broken link results to a JSON file.",
    )
    parser.add_argument(
        "--csv-out",
        type=Path,
        help="Write broken link results to a CSV file.",
    )
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    if not args.start_url and not args.sitemap:
        raise RuntimeError("You must provide a start_url or use --sitemap.")

    root_url = args.start_url or args.sitemap
    sitemap_urls = None
    if args.sitemap:
        # Load sitemap URLs first, then feed them into the same checker.
        sitemap_urls = expand_sitemap_urls(args.sitemap, timeout=args.timeout)
        sitemap_urls = filter_same_domain(sitemap_urls, root_url)
        print(f"Sitemap URLs loaded: {len(sitemap_urls)}")

    issues, crawled_pages, discovered_links = crawl_site(
        start_url=args.start_url,
        root_url=root_url,
        timeout=args.timeout,
        max_pages=args.max_pages,
        skip_external=args.skip_external,
        sitemap_urls=sitemap_urls,
    )

    print(f"Crawled pages: {crawled_pages}")
    print(f"Discovered links: {discovered_links}")
    print(f"Broken links found: {len(issues)}")

    for issue in issues:
        status = issue.status_code if issue.status_code is not None else "ERR"
        print(f"[{status}] {issue.target_url}")
        print(f"  Found on: {issue.source_url}")
        print(f"  Reason: {issue.error}")

    if args.json_out:
        write_json(args.json_out, issues)
        print(f"Wrote JSON report to {args.json_out}")

    if args.csv_out:
        write_csv(args.csv_out, issues)
        print(f"Wrote CSV report to {args.csv_out}")

    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("Interrupted.", file=sys.stderr)
        raise SystemExit(1)

The keyboard interrupt block is small, but it keeps a manual stop from dumping an ugly traceback into the terminal. That matters when I am running a scan, realize I gave it the wrong option, and just want to stop it cleanly. The rest of the command output stays focused on crawl results instead of Python internals.

Where This Is Useful In Real Work

This kind of checker is useful any time a site is changing faster than someone can manually inspect it. Migrations are an obvious case, but they are not the only one. It also helps with site rebuilds, SEO cleanup, content audits, plugin changes, and old WordPress sites that have been patched together for years. Broken links tend to collect quietly until a tool like this forces them into the open.

I also like this kind of project because it stays close to the kind of work I actually do. It is not an academic crawler and it is not trying to become a search engine. It is a focused maintenance script built to answer a practical question: which links are broken, and where are they coming from. A tool with a narrow job is usually easier to test, fix, and keep using.

It also fits the way I tend to work on infrastructure and content problems. I would rather build a small tool that gives me reliable answers than spend an hour clicking through pages and hoping I do not miss anything. Once a script proves useful on two or three real jobs, it usually earns a permanent place in my toolbox.

I built this one to run quickly, stay readable, and solve the problem without turning into a larger project. When a utility hits that balance, it usually ends up doing more work than the bigger tools I thought I needed. That kind of practical result is exactly why I still reach for Python for site maintenance work.

The Full Script

#!/usr/bin/env python3
#
# Crawl a website or sitemap and report broken links with the page where each
# bad link was found. Supports internal crawling, optional external link
# checking, and JSON or CSV output.
#
# Example:
# python3 broken_link_checker.py https://example.com --max-pages 25 --skip-external

from __future__ import annotations

import argparse
import csv
import json
import sys
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from pathlib import Path
from typing import Iterable
from xml.etree import ElementTree
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen


DEFAULT_TIMEOUT = 15
DEFAULT_USER_AGENT = "JMooreWV-BrokenLinkChecker/1.0"
HTML_TYPES = ("text/html", "application/xhtml+xml")
SKIPPED_SCHEMES = {"mailto", "tel", "javascript", "data", "ftp"}


@dataclass(slots=True)
class LinkIssue:
    source_url: str
    target_url: str
    status_code: int | None
    error: str


@dataclass(slots=True)
class PageResult:
    url: str
    status_code: int
    content_type: str
    body: str


class LinkExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag not in {"a", "area"}:
            return

        attr_map = dict(attrs)
        href = (attr_map.get("href") or "").strip()
        if href:
            self.links.append(href)


def extract_sitemap_urls(xml_text: str) -> list[str]:
    try:
        root = ElementTree.fromstring(xml_text)
    except ElementTree.ParseError as exc:
        raise RuntimeError(f"Failed to parse sitemap XML: {exc}") from exc

    namespace = ""
    if root.tag.startswith("{"):
        namespace = root.tag.split("}", 1)[0] + "}"

    urls: list[str] = []
    if root.tag == f"{namespace}sitemapindex":
        # A sitemap index points to one or more child sitemap files.
        for loc in root.findall(f".//{namespace}loc"):
            if loc.text and loc.text.strip():
                urls.append(loc.text.strip())
    elif root.tag == f"{namespace}urlset":
        # A standard sitemap lists page URLs directly.
        for loc in root.findall(f".//{namespace}loc"):
            if loc.text and loc.text.strip():
                urls.append(loc.text.strip())
    else:
        raise RuntimeError("Unsupported sitemap format.")

    return urls


def normalize_url(url: str) -> str:
    clean_url, _fragment = urldefrag(url)
    parsed = urlparse(clean_url)
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()
    # Treat URLs without a path as the site root so duplicates collapse cleanly.
    path = parsed.path or "/"
    normalized = parsed._replace(scheme=scheme, netloc=netloc, path=path, fragment="")
    return normalized.geturl()


def is_html(content_type: str) -> bool:
    lowered = content_type.lower()
    return any(kind in lowered for kind in HTML_TYPES)


def same_domain(url_a: str, url_b: str) -> bool:
    a = urlparse(url_a)
    b = urlparse(url_b)
    return a.scheme.lower() == b.scheme.lower() and a.netloc.lower() == b.netloc.lower()


def filter_same_domain(urls: Iterable[str], root_url: str) -> list[str]:
    return [url for url in urls if same_domain(root_url, url)]


def should_skip(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme.lower() in SKIPPED_SCHEMES


def make_request(url: str, method: str, timeout: int) -> Request:
    return Request(
        url,
        method=method,
        headers={
            "User-Agent": DEFAULT_USER_AGENT,
            "Accept": "*/*",
        },
    )


def fetch_url(url: str, timeout: int, method: str = "GET") -> PageResult:
    request = make_request(url, method=method, timeout=timeout)
    with urlopen(request, timeout=timeout) as response:
        status_code = getattr(response, "status", response.getcode())
        content_type = response.headers.get("Content-Type", "")
        body = ""
        if method == "GET":
            charset = response.headers.get_content_charset() or "utf-8"
            body = response.read().decode(charset, errors="replace")
        return PageResult(
            url=response.geturl(),
            status_code=status_code,
            content_type=content_type,
            body=body,
        )


def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html)
    results = []
    for raw_link in parser.links:
        absolute = urljoin(base_url, raw_link)
        if should_skip(absolute):
            continue
        results.append(normalize_url(absolute))
    return results


def check_link(url: str, timeout: int) -> tuple[int | None, str | None]:
    try:
        result = fetch_url(url, timeout=timeout, method="HEAD")
        return result.status_code, None
    except HTTPError as exc:
        if exc.code in {405, 501}:
            try:
                # Some servers reject HEAD requests, so fall back to GET.
                result = fetch_url(url, timeout=timeout, method="GET")
                return result.status_code, None
            except HTTPError as retry_exc:
                return retry_exc.code, retry_exc.reason or "HTTP error"
            except URLError as retry_exc:
                return None, str(retry_exc.reason)
        return exc.code, exc.reason or "HTTP error"
    except URLError as exc:
        return None, str(exc.reason)


def fetch_text(url: str, timeout: int) -> str:
    result = fetch_url(url, timeout=timeout, method="GET")
    return result.body


def expand_sitemap_urls(sitemap_url: str, timeout: int) -> list[str]:
    print(f"Loading sitemap: {sitemap_url}")
    xml_text = fetch_text(sitemap_url, timeout=timeout)
    first_pass = extract_sitemap_urls(xml_text)

    if not first_pass:
        return []

    normalized_sitemap = normalize_url(sitemap_url)
    if normalized_sitemap.endswith(".xml") and all(url.endswith(".xml") for url in first_pass):
        expanded: list[str] = []
        for child_sitemap in first_pass:
            print(f"  Loading child sitemap: {child_sitemap}")
            # Pull URLs out of each child sitemap and combine them into one list.
            child_xml = fetch_text(child_sitemap, timeout=timeout)
            expanded.extend(extract_sitemap_urls(child_xml))
        return [normalize_url(url) for url in expanded]

    return [normalize_url(url) for url in first_pass]


def crawl_site(
    start_url: str | None,
    root_url: str,
    timeout: int,
    max_pages: int | None,
    skip_external: bool,
    sitemap_urls: list[str] | None = None,
) -> tuple[list[LinkIssue], int, int]:
    normalized_root = normalize_url(root_url)
    queue: deque[str] = deque()
    if sitemap_urls:
        # Sitemap mode starts with every sitemap URL already queued.
        queue.extend(
            filter_same_domain(
                (normalize_url(url) for url in sitemap_urls),
                normalized_root,
            )
        )
    elif start_url:
        # Crawl mode starts with a single page and discovers the rest as it goes.
        queue.append(normalize_url(start_url))
    else:
        raise RuntimeError("A start URL or sitemap URL is required.")

    seen_pages: set[str] = set()
    checked_links: dict[str, tuple[int | None, str | None]] = {}
    issues: list[LinkIssue] = []
    discovered_links = 0

    while queue:
        current = queue.popleft()
        if current in seen_pages:
            continue
        if max_pages is not None and len(seen_pages) >= max_pages:
            break

        queued_count = len(queue) + 1
        print(
            f"Crawling page {len(seen_pages) + 1}"
            f"{f'/{max_pages}' if max_pages is not None else ''}: {current}"
        )
        print(f"  Queue size: {queued_count}")

        try:
            page = fetch_url(current, timeout=timeout, method="GET")
        except HTTPError as exc:
            issues.append(
                LinkIssue(
                    source_url=current,
                    target_url=current,
                    status_code=exc.code,
                    error=exc.reason or "HTTP error",
                )
            )
            print(f"  Page error: HTTP {exc.code} ({exc.reason or 'HTTP error'})")
            seen_pages.add(current)
            continue
        except URLError as exc:
            issues.append(
                LinkIssue(
                    source_url=current,
                    target_url=current,
                    status_code=None,
                    error=str(exc.reason),
                )
            )
            print(f"  Page error: {exc.reason}")
            seen_pages.add(current)
            continue

        final_url = normalize_url(page.url)
        seen_pages.add(final_url)
        print(f"  Loaded: {final_url} [{page.status_code}]")

        if not is_html(page.content_type):
            print(f"  Skipping non-HTML content: {page.content_type or 'unknown'}")
            continue

        for link in extract_links(final_url, page.body):
            discovered_links += 1

            internal = same_domain(normalized_root, link)
            if internal and link not in seen_pages and link not in queue:
                # Only internal links are added back into the crawl queue.
                queue.append(link)
                print(f"  Queued internal link: {link}")

            if not internal and skip_external:
                continue

            if link not in checked_links:
                # Cache link checks so the same target is not requested repeatedly.
                print(f"  Checking link: {link}")
                checked_links[link] = check_link(link, timeout=timeout)

            status_code, error = checked_links[link]
            if error is not None or (status_code is not None and status_code >= 400):
                status = status_code if status_code is not None else "ERR"
                print(f"  Broken link found [{status}]: {link}")
                issues.append(
                    LinkIssue(
                        source_url=final_url,
                        target_url=link,
                        status_code=status_code,
                        error=error or f"HTTP {status_code}",
                    )
                )

    return issues, len(seen_pages), discovered_links


def write_json(path: Path, issues: Iterable[LinkIssue]) -> None:
    payload = [
        {
            "source_url": issue.source_url,
            "target_url": issue.target_url,
            "status_code": issue.status_code,
            "error": issue.error,
        }
        for issue in issues
    ]
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")


def write_csv(path: Path, issues: Iterable[LinkIssue]) -> None:
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["source_url", "target_url", "status_code", "error"],
        )
        writer.writeheader()
        for issue in issues:
            writer.writerow(
                {
                    "source_url": issue.source_url,
                    "target_url": issue.target_url,
                    "status_code": issue.status_code or "",
                    "error": issue.error,
                }
            )


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Crawl a website and report broken links."
    )
    parser.add_argument(
        "start_url",
        nargs="?",
        help="Starting URL to crawl.",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT,
        help=f"Request timeout in seconds. Default: {DEFAULT_TIMEOUT}",
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=None,
        help="Maximum number of internal pages to crawl.",
    )
    parser.add_argument(
        "--skip-external",
        action="store_true",
        help="Skip checking external links and only validate internal links.",
    )
    parser.add_argument(
        "--sitemap",
        help="Sitemap URL to scan instead of starting from a single page.",
    )
    parser.add_argument(
        "--json-out",
        type=Path,
        help="Write broken link results to a JSON file.",
    )
    parser.add_argument(
        "--csv-out",
        type=Path,
        help="Write broken link results to a CSV file.",
    )
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    if not args.start_url and not args.sitemap:
        raise RuntimeError("You must provide a start_url or use --sitemap.")

    root_url = args.start_url or args.sitemap
    sitemap_urls = None
    if args.sitemap:
        # Load sitemap URLs first, then feed them into the same checker.
        sitemap_urls = expand_sitemap_urls(args.sitemap, timeout=args.timeout)
        sitemap_urls = filter_same_domain(sitemap_urls, root_url)
        print(f"Sitemap URLs loaded: {len(sitemap_urls)}")

    issues, crawled_pages, discovered_links = crawl_site(
        start_url=args.start_url,
        root_url=root_url,
        timeout=args.timeout,
        max_pages=args.max_pages,
        skip_external=args.skip_external,
        sitemap_urls=sitemap_urls,
    )

    print(f"Crawled pages: {crawled_pages}")
    print(f"Discovered links: {discovered_links}")
    print(f"Broken links found: {len(issues)}")

    for issue in issues:
        status = issue.status_code if issue.status_code is not None else "ERR"
        print(f"[{status}] {issue.target_url}")
        print(f"  Found on: {issue.source_url}")
        print(f"  Reason: {issue.error}")

    if args.json_out:
        write_json(args.json_out, issues)
        print(f"Wrote JSON report to {args.json_out}")

    if args.csv_out:
        write_csv(args.csv_out, issues)
        print(f"Wrote CSV report to {args.csv_out}")

    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("Interrupted.", file=sys.stderr)
        raise SystemExit(1)
Tags: AutomationPythonRequestsSEOWebsite
ShareTweetSharePinShareShareScan
ADVERTISEMENT
Jonathan Moore

Jonathan Moore

I am a Software Architect and Senior Software Engineer with 30+ years of experience building applications for Linux and Windows systems. I focus on system architecture, custom web platforms, server infrastructure, and security-focused tools, with an emphasis on performance and reliability. Over the years, I have built everything from WordPress plugins and automation systems to full platforms, ad serving systems, monitoring tools, and API-driven applications. I prefer working close to the system, solving real problems, and building tools that are meant to be used.

Related Articles

Why I Stopped Treating SMS As Good Enough

Why I Stopped Treating SMS As Good Enough

I used to treat SMS based 2FA as a decent upgrade and leave it at that. It was clearly better...

Working With Webhooks in PHP

Working With Webhooks in PHP

I have used webhooks in a lot of different situations over the years, from payment alerts to form submissions to...

My Backup Setup for Linux PCs

My Backup Setup for Linux PCs

March 31st is World Backup Day. It is meant to be a reminder, but I do not rely on reminders...

Recommended Services

Latest Articles

Building a Python Broken Link Checker

Building a Python Broken Link Checker

Broken links have a way of hiding until they become somebody else's problem. A page gets renamed, a migration misses...

Read moreDetails

What Drawing and Animating in BASIC Taught Me About Programming

What Drawing and Animating in BASIC Taught Me About Programming

My first real introduction to programming came through BASICA and GWBASIC on a Tandy 2000 my father got in 1990...

Read moreDetails

Why I Stopped Treating SMS As Good Enough

Why I Stopped Treating SMS As Good Enough

I used to treat SMS based 2FA as a decent upgrade and leave it at that. It was clearly better...

Read moreDetails

Why I Built My Own Ad Server Instead of Relying on Outdated Platforms

Why I Built My Own Ad Server Instead of Relying on Outdated Platforms

I built my own ad server because the options I kept finding were either missing features I needed or felt...

Read moreDetails
  • Privacy Policy
  • Terms of Service
  • Legal Policies

© 2026 JMooreWV. All rights reserved.

No Result
View All Result
  • Home
  • Guides
    • Linux
    • Programming
      • JavaScript
      • PHP
      • Python
    • Tools
    • WordPress
  • Blog
    • Artificial Intelligence
    • Tutorials
    • Privacy
    • Security
  • Apps
    • Bible App
    • Bible Verse Screensaver
    • Blue AI Chatbot
    • Early Spring Predictor
    • FIGlet Generator
    • Password Generator
    • StegX
    • The Matrix
    • WeatherX
    • Website Risk Level Tool
  • About
    • About JMooreWV
    • Live Cyber Attack Stats
  • Contact
    • General Contact
    • Website Administration & Cybersecurity