firecrown.fctools.link_checker

Scan HTML files and check for broken anchor links.

The module provides a SiteChecker class which scans local HTML files, extracts anchor identifiers (from id and name attributes), resolves links (including optional downloading of external HTTP(S) pages into a temporary cache), and reports missing files or missing anchor identifiers.

Usage example:

python -m firecrown.fctools.link_checker path/to/html_dir -v

Attributes

app

Classes

`PageAnchors`	Holds information about a single HTML page.
`SiteChecker`	Scan a directory of HTML files and validate anchor links.

Functions

`extract_ids`(file_path)	Extract anchor identifiers from an HTML file.
`main`(root_dir[, download_timeout, verbose, skip_external])	Check the HTML files under root_dir and return an integer result.
`cli`(root_dir[, download_timeout, verbose, skip_external])	Command-line entry point using Typer and Rich for output.

Module Contents

class firecrown.fctools.link_checker.PageAnchors[source]

Holds information about a single HTML page.

Parameters:

url_str – Canonical string used as a key for the page (local path or URL).
path – Filesystem Path pointing to the local copy of the page.
ids – Set of anchor identifiers discovered on the page.

url_str: str

path: pathlib.Path

ids: set[str]

firecrown.fctools.link_checker.extract_ids(file_path)[source]

Extract anchor identifiers from an HTML file.

Parameters: file_path (pathlib.Path) – Path to the local HTML file.
Returns: Set of anchor identifier strings found in the file.
Return type: set[str]
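To illustrate what collecting anchor identifiers from id and name attributes can look like, here is a minimal standard-library sketch. It is not the firecrown implementation: the names `_IdCollector` and `sketch_extract_ids` are hypothetical, and the choice of `html.parser` is an assumption made for the example.

```python
from html.parser import HTMLParser
from pathlib import Path


class _IdCollector(HTMLParser):
    """Collect the values of id and name attributes from every start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs; value may be None.
        for key, value in attrs:
            if key in ("id", "name") and value:
                self.ids.add(value)


def sketch_extract_ids(file_path: Path) -> set[str]:
    """Hypothetical stand-in for extract_ids, for illustration only."""
    collector = _IdCollector()
    collector.feed(file_path.read_text(encoding="utf-8", errors="replace"))
    collector.close()
    return collector.ids
```

Note that this sketch treats every name attribute as an anchor, including those on elements such as meta or input; whether the real function filters by tag is not stated in this reference.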

class firecrown.fctools.link_checker.SiteChecker(root_dir, console, download_timeout, verbose=False, skip_external=False)[source]

Scan a directory of HTML files and validate anchor links.

SiteChecker walks root_dir, collecting local .html files and the set of anchors found in each. When checking links, it normalizes each href and, unless skip_external is set, downloads external pages into a temporary cache so their anchors can be validated. The object is a context manager and removes the temporary download cache when closed.

The instance maintains counters of valid/invalid links and anchors for reporting.

Parameters:

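root_dir (str | pathlib.Path) – Directory to scan for HTML files.
console (rich.console.Console) – Rich console used for output.
download_timeout (int) – Timeout (seconds) for downloading external links.
verbose (bool) – Enable verbose output (show downloads and skipped links).
skip_external (bool) – When True, do not download or validate external http(s) links.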

root_dir

html_files: list[pathlib.Path] = []

targets: dict[str, PageAnchors]

downloaded_files: dict[str, pathlib.Path]

tmp_root

valid_links: int = 0

invalid_links: int = 0

valid_anchors: int = 0

invalid_anchors: int = 0

console

download_timeout

verbose = False

skip_external = False

session

add_to_targets(url_str, full_path)[source]

Ensure a URL string is present in the targets mapping.

Parameters:

url_str (str) – Canonical URL string used as the dictionary key.
full_path (pathlib.Path) – Filesystem path to the file (downloaded or local).

Return type:

None

extract_links(file_path)[source]

Collect links from an HTML file and group fragments by target URL.

Parameters: file_path (pathlib.Path) – Local path of the HTML file to scan for anchor tags.
Returns: Mapping from target URL string to a set of anchor fragments (strings).
Return type: dict[str, set[str]]
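The shape of the returned mapping can be illustrated with a small standard-library sketch. It is not the method's actual code: `sketch_group_links` and `_HrefCollector` are hypothetical names, the sketch takes HTML text rather than a file path, and it skips the href normalization that SiteChecker performs.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _HrefCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def sketch_group_links(html_text: str) -> dict[str, set[str]]:
    """Group '#fragment' parts by the URL they point at (illustrative only)."""
    collector = _HrefCollector()
    collector.feed(html_text)
    collector.close()

    grouped: dict[str, set[str]] = defaultdict(set)
    for href in collector.hrefs:
        target, fragment = urldefrag(href)
        fragments = grouped[target]  # record the target even with no fragment
        if fragment:
            fragments.add(fragment)
    return dict(grouped)
```

In this sketch a same-page link such as `#top` is grouped under the empty string; a real checker would presumably resolve it to the source page first.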

check_anchors()[source]

Validate all links and anchors collected from the site.

The method iterates over all collected HTML files, resolves links, ensures target files are reachable (downloading externals when necessary), and checks that requested anchor fragments exist on the target page.

Returns: A list of (source_file, target_url_str, reason) tuples describing broken links or missing anchors. Counters for valid/invalid links and anchors are updated on the instance.
Return type: list[tuple[str, str, str]]
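As a rough illustration of the validation step and of the tuple format in the return value, the following sketch compares requested fragments against known anchor sets. It is a simplification under stated assumptions: `sketch_find_broken` is a hypothetical name, the reason strings are invented for the example, and downloading, counters, and href resolution are left out.

```python
def sketch_find_broken(
    links_by_source: dict[str, dict[str, set[str]]],
    ids_by_target: dict[str, set[str]],
) -> list[tuple[str, str, str]]:
    """Return (source_file, target_url_str, reason) tuples for broken links.

    links_by_source maps each source page to its {target: fragments} mapping;
    ids_by_target maps each known page to the anchor ids found on it.
    """
    broken: list[tuple[str, str, str]] = []
    for source, targets in links_by_source.items():
        for target, fragments in targets.items():
            if target not in ids_by_target:
                # The target page itself is unknown, so its anchors cannot be checked.
                broken.append((source, target, "missing file"))
                continue
            for fragment in sorted(fragments):
                if fragment not in ids_by_target[target]:
                    broken.append((source, target, f"missing anchor #{fragment}"))
    return broken
```

For example, a page linking to `api.html#gone` when `api.html` only defines `intro` yields one tuple whose reason names the missing anchor.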

close()[source]

Remove the temporary download cache directory used for external pages.

Return type: None

__enter__()[source]: Enter context manager.

__exit__(exc_type, exc, tb)[source]: Exit context manager and clean up temporary files.
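The close/`__enter__`/`__exit__` trio follows the usual pattern for an object that owns a temporary directory. The sketch below shows that pattern in isolation; `SketchCache` is a hypothetical class, and the use of `tempfile.mkdtemp` and `shutil.rmtree` is an assumption about how such a cache might be managed, not a description of SiteChecker's internals.

```python
import shutil
import tempfile
from pathlib import Path


class SketchCache:
    """Own a temporary directory and remove it on close (illustrative only)."""

    def __init__(self) -> None:
        self.tmp_root = Path(tempfile.mkdtemp(prefix="link_checker_sketch_"))

    def close(self) -> None:
        # ignore_errors makes close() safe to call more than once.
        shutil.rmtree(self.tmp_root, ignore_errors=True)

    def __enter__(self) -> "SketchCache":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Returning None lets any exception from the with-body propagate.
        self.close()
```

Used in a with statement, the directory exists inside the block and is gone afterwards, even if the body raises.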

firecrown.fctools.link_checker.main(root_dir, download_timeout=20, verbose=False, skip_external=False)[source]

Check the HTML files under root_dir and return an integer result.

Parameters:

root_dir (str | pathlib.Path) – Directory to scan for HTML files.
download_timeout (int) – Timeout (seconds) for downloading external links.
verbose (bool) – Enable verbose output (show downloads and skipped links).
skip_external (bool) – When True, do not download or validate external http(s) links.

Return type:

int

firecrown.fctools.link_checker.app

firecrown.fctools.link_checker.cli(root_dir, download_timeout=20, verbose=False, skip_external=False)[source]

Command-line entry point using Typer and Rich for output.

Parameters:

root_dir (Annotated[pathlib.Path, typer.Argument(exists=True, file_okay=False, dir_okay=True, help='Directory with HTML files')])
download_timeout (Annotated[int, typer.Option('-t', '--download-timeout', help='Timeout in seconds for downloading external links')])
verbose (Annotated[bool, typer.Option('-v', '--verbose', help='Enable verbose output (show downloads)')])
skip_external (Annotated[bool, typer.Option('--skip-external', help='Do not download or validate external http(s) links; treat them as skipped')])