firecrown.fctools.link_checker
Scan HTML files and check for broken anchor links.
The module provides a SiteChecker class which scans local HTML files,
extracts anchor identifiers (from id and name attributes), resolves
links (including optional downloading of external HTTP(S) pages into a
temporary cache), and reports missing files or missing anchor identifiers.
Usage example:
python -m firecrown.fctools.link_checker path/to/html_dir -v
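Attributes
app |
Classes
PageAnchors | Holds information about a single HTML page.
SiteChecker | Scan a directory of HTML files and validate anchor links.
Functions
extract_ids(file_path) | Extract anchor identifiers from an HTML file.
main(root_dir[, download_timeout, verbose, skip_external]) | Main function.
cli(root_dir[, download_timeout, verbose, skip_external]) | Command-line entry point using Typer and Rich for output.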
Module Contents
- class firecrown.fctools.link_checker.PageAnchors[source]
Holds information about a single HTML page.
- Parameters:
url_str – Canonical string used as a key for the page (local path or URL).
path – Filesystem Path pointing to the local copy of the page.
ids – Set of anchor identifiers discovered on the page.
- url_str: str
- path: pathlib.Path
- ids: set[str]
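A record with these three fields can be pictured as a small dataclass. The sketch below is only an illustration: the name `PageAnchorsSketch` is hypothetical, and both the use of `dataclasses` and the empty-set default for `ids` are assumptions, not something this page confirms about `PageAnchors`.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PageAnchorsSketch:
    """Hypothetical stand-in with the same three fields as PageAnchors."""

    url_str: str  # canonical key for the page (local path or URL)
    path: Path  # local copy of the page on disk
    ids: set[str] = field(default_factory=set)  # assumed default: no anchors yet


page = PageAnchorsSketch("index.html", Path("site/index.html"), {"top", "usage"})
has_top = "top" in page.ids  # membership test used when validating a fragment
has_missing = "missing" in page.ids
```

Keeping `ids` as a set makes the later "does this fragment exist on the target page" check a constant-time membership test.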
- firecrown.fctools.link_checker.extract_ids(file_path)[source]
Extract anchor identifiers from an HTML file.
- Parameters:
file_path (pathlib.Path) – Path to the local HTML file.
- Returns:
Set of anchor identifier strings found in the file.
- Return type:
set[str]
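One way to collect identifiers from `id` and `name` attributes is with the standard library's `html.parser`. This is a minimal sketch under stated assumptions, not the implementation behind `extract_ids`: the helper names `_IdCollector` and `sketch_extract_ids` are hypothetical, and the parser the module actually uses is not stated on this page.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class _IdCollector(HTMLParser):
    """Collect the values of id and name attributes from every start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            # value is None for attributes written without a value
            if key in ("id", "name") and value:
                self.ids.add(value)


def sketch_extract_ids(file_path: Path) -> set[str]:
    """Return the set of id/name values found in a local HTML file."""
    parser = _IdCollector()
    parser.feed(file_path.read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return parser.ids


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "page.html"
    sample.write_text(
        '<html><body><h1 id="top">T</h1><a name="legacy"></a><p>x</p></body></html>',
        encoding="utf-8",
    )
    found = sketch_extract_ids(sample)
```

Note that this sketch collects `name` from any tag, so form fields or `<meta name=...>` would also be picked up; whether the real function filters by tag is not documented here.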
- class firecrown.fctools.link_checker.SiteChecker(root_dir, console, download_timeout, verbose=False, skip_external=False)[source]
Scan a directory of HTML files and validate anchor links.
SiteChecker walks root_dir collecting local .html files and the set of anchors found in each. When checking links it normalizes each href, downloading external pages into a temporary cache so their anchors can be validated. The object is a context manager and will remove the temporary download cache when closed.
The instance maintains counters of valid/invalid links and anchors for reporting.
- Parameters:
root_dir (str | pathlib.Path)
console (rich.console.Console)
download_timeout (int)
verbose (bool)
skip_external (bool)
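The context-manager behaviour described above (a temporary download cache that is removed on close) can be illustrated with a stdlib-only sketch. `TempCache` and its `store` and `close` methods are hypothetical names chosen for this example; only the attribute name `tmp_root` is borrowed from the class documented here, and the real cleanup logic may differ.

```python
import shutil
import tempfile
from pathlib import Path


class TempCache:
    """Hypothetical cache that owns a temporary directory and deletes it on exit."""

    def __init__(self) -> None:
        self.tmp_root = Path(tempfile.mkdtemp(prefix="link_cache_"))

    def store(self, name: str, content: str) -> Path:
        """Write content into the cache and return the path of the cached file."""
        path = self.tmp_root / name
        path.write_text(content, encoding="utf-8")
        return path

    def close(self) -> None:
        """Remove the cache directory and everything in it."""
        shutil.rmtree(self.tmp_root, ignore_errors=True)

    def __enter__(self) -> "TempCache":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


with TempCache() as cache:
    cached = cache.store("page.html", "<p id='a'></p>")
    existed_inside = cached.exists()
removed_after = not cache.tmp_root.exists()
```

Using the object in a `with` block means the cache is cleaned up even if checking raises an exception part-way through.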
- root_dir
- html_files: list[pathlib.Path] = []
- targets: dict[str, PageAnchors]
- downloaded_files: dict[str, pathlib.Path]
- tmp_root
- valid_links: int = 0
- invalid_links: int = 0
- valid_anchors: int = 0
- invalid_anchors: int = 0
- console
- download_timeout
- verbose = False
- skip_external = False
- session
- add_to_targets(url_str, full_path)[source]
Ensure a URL string is present in the targets mapping.
- Parameters:
url_str (str) – Canonical URL string used as the dictionary key.
full_path (pathlib.Path) – Filesystem path to the file (downloaded or local).
- Return type:
None
- extract_links(file_path)[source]
Collect links from an HTML file and group fragments by target URL.
- Parameters:
file_path (pathlib.Path) – Local path of the HTML file to scan for anchor tags.
- Returns:
Mapping from target URL string to a set of anchor fragments (strings).
- Return type:
dict[str, set[str]]
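The grouping of fragments by target URL can be sketched with `urllib.parse.urldefrag`, which splits an href into its URL part and its fragment. This is an illustration only: `_HrefCollector` and `group_fragments` are hypothetical names, the sketch works on an HTML string rather than a file path, and it skips the href normalization that `SiteChecker` performs (same-page links stay under the empty-string key).

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _HrefCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                self.hrefs.append(value)


def group_fragments(html_text: str) -> dict[str, set[str]]:
    """Map each target URL to the set of fragments requested on it."""
    parser = _HrefCollector()
    parser.feed(html_text)
    parser.close()
    grouped: dict[str, set[str]] = defaultdict(set)
    for href in parser.hrefs:
        url, fragment = urldefrag(href)
        fragments = grouped[url]  # registers the target even without a fragment
        if fragment:
            fragments.add(fragment)
    return dict(grouped)


sample_html = (
    '<a href="api.html#foo">f</a><a href="api.html#bar">b</a>'
    '<a href="#top">t</a><a href="https://example.org/x.html">x</a>'
)
links = group_fragments(sample_html)
```

Targets with no fragment still appear in the mapping with an empty set, so a later pass can check that the file itself is reachable.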
- check_anchors()[source]
Validate all links and anchors collected from the site.
The method iterates over all collected HTML files, resolves links, ensures target files are reachable (downloading externals when necessary), and checks that requested anchor fragments exist on the target page.
- Returns:
A list of (source_file, target_url_str, reason) tuples describing broken links or missing anchors. Counters for valid/invalid links and anchors are updated on the instance.
- Return type:
list[tuple[str, str, str]]
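The core check (is the target known, and does each requested fragment exist on it) can be sketched over plain dictionaries. This is a simplified model, not the method's code: `find_broken` is a hypothetical helper, the reason strings are invented for illustration, and downloading, counters, and console output are left out.

```python
def find_broken(
    links_by_source: dict[str, dict[str, set[str]]],
    ids_by_target: dict[str, set[str]],
) -> list[tuple[str, str, str]]:
    """Return (source_file, target_url_str, reason) for each broken link."""
    broken: list[tuple[str, str, str]] = []
    for source, links in sorted(links_by_source.items()):
        for target, fragments in sorted(links.items()):
            if target not in ids_by_target:
                # The target page itself is unknown, so its anchors cannot be checked.
                broken.append((source, target, "target not found"))
                continue
            for fragment in sorted(fragments):
                if fragment not in ids_by_target[target]:
                    broken.append((source, target, f"missing anchor '{fragment}'"))
    return broken


ids_by_target = {"index.html": {"top"}, "api.html": {"foo"}}
links_by_source = {"index.html": {"api.html": {"foo", "bar"}, "gone.html": set()}}
report = find_broken(links_by_source, ids_by_target)
```

Sorting the sources, targets, and fragments keeps the report order deterministic, which is convenient when comparing runs.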
- firecrown.fctools.link_checker.main(root_dir, download_timeout=20, verbose=False, skip_external=False)[source]
Main function: check the HTML files under root_dir for broken links and missing anchors.
- Parameters:
root_dir (str | pathlib.Path) – Directory to scan for HTML files.
download_timeout (int) – Timeout (seconds) for downloading external links.
verbose (bool) – Enable verbose output (show downloads and skipped links).
skip_external (bool) – When True, do not download or validate external http(s) links.
- Return type:
int
- firecrown.fctools.link_checker.app
- firecrown.fctools.link_checker.cli(root_dir, download_timeout=20, verbose=False, skip_external=False)[source]
Command-line entry point using Typer and Rich for output.
- Parameters:
root_dir (Annotated[pathlib.Path, typer.Argument(exists=True, file_okay=False, dir_okay=True, help='Directory with HTML files')])
download_timeout (Annotated[int, typer.Option('-t', '--download-timeout', help='Timeout in seconds for downloading external links')])
verbose (Annotated[bool, typer.Option('-v', '--verbose', help='Enable verbose output (show downloads)')])
skip_external (Annotated[bool, typer.Option('--skip-external', help='Do not download or validate external http(s) links; treat them as skipped')])