firecrown.fctools.link_checker
==============================

.. py:module:: firecrown.fctools.link_checker

.. autoapi-nested-parse::

   Scan HTML files and check for broken anchor links.

   The module provides a SiteChecker class which scans local HTML files,
   extracts anchor identifiers (from ``id`` and ``name`` attributes), resolves
   links (including optional downloading of external HTTP(S) pages into a
   temporary cache), and reports missing files or missing anchor identifiers.

   Usage example::

       python -m firecrown.fctools.link_checker path/to/html_dir -v


Attributes
----------

.. autoapisummary::

   firecrown.fctools.link_checker.app


Classes
-------

.. autoapisummary::

   firecrown.fctools.link_checker.PageAnchors
   firecrown.fctools.link_checker.SiteChecker


Functions
---------

.. autoapisummary::

   firecrown.fctools.link_checker.extract_ids
   firecrown.fctools.link_checker.main
   firecrown.fctools.link_checker.cli


Module Contents
---------------

.. py:class:: PageAnchors

   Holds information about a single HTML page.

   :param url_str: Canonical string used as a key for the page (local path or URL).
   :param path: Filesystem Path pointing to the local copy of the page.
   :param ids: Set of anchor identifiers discovered on the page.

   .. py:attribute:: url_str
      :type: str

   .. py:attribute:: path
      :type: pathlib.Path

   .. py:attribute:: ids
      :type: set[str]


.. py:function:: extract_ids(file_path)

   Extract anchor identifiers from an HTML file.

   :param file_path: Path to the local HTML file.
   :returns: Set of anchor identifier strings found in the file.


.. py:class:: SiteChecker(root_dir, console, download_timeout, verbose = False, skip_external = False)

   Scan a directory of HTML files and validate anchor links.

   SiteChecker walks ``root_dir`` collecting local ``.html`` files and the set
   of anchors found in each. When checking links it normalizes each ``href``,
   downloading external pages into a temporary cache so their anchors can be
   validated.

   The object is a context manager and will remove the temporary download
   cache when closed. The instance maintains counters of valid/invalid links
   and anchors for reporting.

   .. py:attribute:: root_dir

   .. py:attribute:: html_files
      :type: list[pathlib.Path]
      :value: []

   .. py:attribute:: targets
      :type: dict[str, PageAnchors]

   .. py:attribute:: downloaded_files
      :type: dict[str, pathlib.Path]

   .. py:attribute:: tmp_root

   .. py:attribute:: valid_links
      :type: int
      :value: 0

   .. py:attribute:: invalid_links
      :type: int
      :value: 0

   .. py:attribute:: valid_anchors
      :type: int
      :value: 0

   .. py:attribute:: invalid_anchors
      :type: int
      :value: 0

   .. py:attribute:: console

   .. py:attribute:: download_timeout

   .. py:attribute:: verbose
      :value: False

   .. py:attribute:: skip_external
      :value: False

   .. py:attribute:: session

   .. py:method:: add_to_targets(url_str, full_path)

      Ensure a URL string is present in the ``targets`` mapping.

      :param url_str: Canonical URL string used as the dictionary key.
      :param full_path: Filesystem path to the file (downloaded or local).

   .. py:method:: extract_links(file_path)

      Collect links from an HTML file and group fragments by target URL.

      :param file_path: Local path of the HTML file to scan for anchor tags.
      :returns: Mapping from target URL string to a set of anchor fragments
                (strings).

   .. py:method:: check_anchors()

      Validate all links and anchors collected from the site.

      The method iterates over all collected HTML files, resolves links,
      ensures target files are reachable (downloading externals when
      necessary), and checks that requested anchor fragments exist on the
      target page.

      :returns: A list of (source_file, target_url_str, reason) tuples
                describing broken links or missing anchors. Counters for
                valid/invalid links and anchors are updated on the instance.

   .. py:method:: close()

      Remove the temporary download cache directory used for external pages.

   .. py:method:: __enter__()

      Enter context manager.

   .. py:method:: __exit__(exc_type, exc, tb)

      Exit context manager and clean up temporary files.


.. py:function:: main(root_dir, download_timeout = 20, verbose = False, skip_external = False)

   Main function.

   :param root_dir: Directory to scan for HTML files.
   :param download_timeout: Timeout (seconds) for downloading external links.
   :param verbose: Enable verbose output (show downloads and skipped links).
   :param skip_external: When True, do not download or validate external
                         http(s) links.


.. py:data:: app

.. py:function:: cli(root_dir, download_timeout = 20, verbose = False, skip_external = False)

   Command-line entry point using Typer and Rich for output.
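The anchor check described above (collect the ``id`` and ``name`` attribute values on a page, then confirm that each requested fragment is among them) can be illustrated with a minimal standard-library sketch. This is not the module's actual implementation: ``AnchorCollector``, ``collect_ids`` and ``missing_fragments`` are hypothetical names chosen for illustration, and the sketch works on an HTML string rather than a file path.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: gather every ``id`` and ``name`` attribute value."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for key, value in attrs:
            if key in ("id", "name") and value:
                self.ids.add(value)


def collect_ids(html_text):
    """Return the set of anchor identifiers found in an HTML string."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return collector.ids


def missing_fragments(html_text, fragments):
    """Return the requested fragments that have no matching anchor on the page."""
    return set(fragments) - collect_ids(html_text)


page = '<h1 id="intro">Intro</h1><a name="legacy"></a><p>text</p>'
print(sorted(collect_ids(page)))                             # ['intro', 'legacy']
print(sorted(missing_fragments(page, {"intro", "usage"})))   # ['usage']
```

In this sketch a link such as ``page.html#usage`` would be reported as a missing anchor, while ``page.html#intro`` would count as valid. How the real ``SiteChecker`` records and counts these cases may differ.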