firecrown.fctools.link_checker
==============================

.. py:module:: firecrown.fctools.link_checker

.. autoapi-nested-parse::

   Scan HTML files and check for broken anchor links.

   The module provides a ``SiteChecker`` class that scans local HTML files,
   extracts anchor identifiers (from ``id`` and ``name`` attributes), resolves
   links (optionally downloading external HTTP(S) pages into a temporary
   cache), and reports missing files or missing anchor identifiers.

   Usage example::

       python -m firecrown.fctools.link_checker path/to/html_dir -v


Attributes
----------

.. autoapisummary::

   firecrown.fctools.link_checker.app


Classes
-------

.. autoapisummary::

   firecrown.fctools.link_checker.PageAnchors
   firecrown.fctools.link_checker.SiteChecker


Functions
---------

.. autoapisummary::

   firecrown.fctools.link_checker.extract_ids
   firecrown.fctools.link_checker.main
   firecrown.fctools.link_checker.cli


Module Contents
---------------

.. py:class:: PageAnchors

   Holds information about a single HTML page.

   :param url_str: Canonical string used as a key for the page (local path or URL).
   :param path: Filesystem Path pointing to the local copy of the page.
   :param ids: Set of anchor identifiers discovered on the page.


   .. py:attribute:: url_str
      :type:  str


   .. py:attribute:: path
      :type:  pathlib.Path


   .. py:attribute:: ids
      :type:  set[str]
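
   The three documented fields can be pictured as a small record type. The
   sketch below is a hypothetical stand-in named ``PageAnchorsSketch``; whether
   the real class is a dataclass, and whether ``ids`` has a default, are
   assumptions made only for illustration.

   .. code-block:: python

      from dataclasses import dataclass, field
      from pathlib import Path


      @dataclass
      class PageAnchorsSketch:
          """Hypothetical stand-in mirroring the three documented fields."""

          url_str: str  # key for the page (local path or URL)
          path: Path  # local copy of the page on disk
          ids: set[str] = field(default_factory=set)  # anchors found on the page


      page = PageAnchorsSketch(
          "index.html", Path("docs/index.html"), {"intro", "usage"}
      )
      # Anchor lookup is a plain set membership test.
      has_usage = "usage" in page.ids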


.. py:function:: extract_ids(file_path)

   Extract anchor identifiers from an HTML file.

   :param file_path: Path to the local HTML file.
   :returns: Set of anchor identifier strings found in the file.
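
   As an illustration of the technique, anchor identifiers can be collected
   with the standard library's ``html.parser``. This is a minimal sketch, not
   the implementation used by ``extract_ids``: the names ``_IdCollector`` and
   ``extract_ids_sketch`` are hypothetical, and the sketch assumes that ``id``
   and ``name`` values are gathered from every tag.

   .. code-block:: python

      import tempfile
      from html.parser import HTMLParser
      from pathlib import Path


      class _IdCollector(HTMLParser):
          """Collect the values of ``id`` and ``name`` attributes."""

          def __init__(self):
              super().__init__()
              self.ids: set[str] = set()

          def handle_starttag(self, tag, attrs):
              # attrs is a list of (name, value) pairs; value may be None.
              for key, value in attrs:
                  if key in ("id", "name") and value:
                      self.ids.add(value)


      def extract_ids_sketch(file_path: Path) -> set[str]:
          """Return the anchor identifiers found in one HTML file."""
          collector = _IdCollector()
          collector.feed(file_path.read_text(encoding="utf-8"))
          collector.close()
          return collector.ids


      with tempfile.TemporaryDirectory() as tmp:
          page = Path(tmp) / "page.html"
          page.write_text(
              '<h1 id="top">Title</h1><a name="legacy"></a><p>text</p>',
              encoding="utf-8",
          )
          found = extract_ids_sketch(page)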


.. py:class:: SiteChecker(root_dir, console, download_timeout, verbose = False, skip_external = False)

   Scan a directory of HTML files and validate anchor links.

   SiteChecker walks ``root_dir`` collecting local ``.html`` files and the
   set of anchors found in each. When checking links, it normalizes each
   ``href`` and, unless ``skip_external`` is set, downloads external pages
   into a temporary cache so their anchors can be validated. The object is
   a context manager and removes the temporary download cache when closed.

   The instance maintains counters of valid/invalid links and anchors for reporting.


   .. py:attribute:: root_dir


   .. py:attribute:: html_files
      :type:  list[pathlib.Path]
      :value: []


   .. py:attribute:: targets
      :type:  dict[str, PageAnchors]


   .. py:attribute:: downloaded_files
      :type:  dict[str, pathlib.Path]


   .. py:attribute:: tmp_root


   .. py:attribute:: valid_links
      :type:  int
      :value: 0


   .. py:attribute:: invalid_links
      :type:  int
      :value: 0


   .. py:attribute:: valid_anchors
      :type:  int
      :value: 0


   .. py:attribute:: invalid_anchors
      :type:  int
      :value: 0


   .. py:attribute:: console


   .. py:attribute:: download_timeout


   .. py:attribute:: verbose
      :value: False


   .. py:attribute:: skip_external
      :value: False


   .. py:attribute:: session


   .. py:method:: add_to_targets(url_str, full_path)

      Ensure a URL string is present in the ``targets`` mapping.

      :param url_str: Canonical URL string used as the dictionary key.
      :param full_path: Filesystem path to the file (downloaded or local).


   .. py:method:: extract_links(file_path)

      Collect links from an HTML file and group fragments by target URL.

      :param file_path: Local path of the HTML file to scan for anchor tags.
      :returns: Mapping from target URL string to a set of anchor fragments (strings).
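
      The grouping described above can be sketched with the standard library.
      This is an illustrative assumption, not the method's actual code: the
      names ``_HrefCollector`` and ``group_fragments`` are hypothetical, and the
      sketch does not resolve relative links against the source file, so a
      same-page link such as ``#top`` is stored under the empty string.

      .. code-block:: python

         from collections import defaultdict
         from html.parser import HTMLParser
         from urllib.parse import urldefrag


         class _HrefCollector(HTMLParser):
             """Collect ``href`` values from ``<a>`` tags."""

             def __init__(self):
                 super().__init__()
                 self.hrefs: list[str] = []

             def handle_starttag(self, tag, attrs):
                 if tag == "a":
                     for key, value in attrs:
                         if key == "href" and value:
                             self.hrefs.append(value)


         def group_fragments(html_text: str) -> dict[str, set[str]]:
             """Map each target URL to the set of fragments requested on it."""
             collector = _HrefCollector()
             collector.feed(html_text)
             collector.close()
             grouped: dict[str, set[str]] = defaultdict(set)
             for href in collector.hrefs:
                 url, fragment = urldefrag(href)
                 fragments = grouped[url]  # registers targets without fragments too
                 if fragment:
                     fragments.add(fragment)
             return dict(grouped)


         links = group_fragments(
             '<a href="api.html#foo">foo</a>'
             '<a href="api.html#bar">bar</a>'
             '<a href="#top">top</a>'
             '<a href="other.html">other</a>'
         )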


   .. py:method:: check_anchors()

      Validate all links and anchors collected from the site.

      The method iterates over all collected HTML files, resolves links,
      ensures target files are reachable (downloading externals when
      necessary), and checks that requested anchor fragments exist on the
      target page.

      :returns: A list of (source_file, target_url_str, reason) tuples
          describing broken links or missing anchors. Counters for
          valid/invalid links and anchors are updated on the instance.
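
      The shape of the returned report can be illustrated with a small
      stand-alone function. This is a sketch under stated assumptions: the name
      ``find_broken`` and the reason strings are hypothetical, downloading and
      counter updates are omitted, and ``known_ids`` stands in for the anchors
      already collected per page.

      .. code-block:: python

         def find_broken(source, links, known_ids):
             """Return (source_file, target_url_str, reason) tuples.

             ``links`` maps a target URL string to the fragments requested on it;
             ``known_ids`` maps a target URL string to the anchors it defines.
             """
             broken = []
             for target, fragments in links.items():
                 if target not in known_ids:
                     broken.append((source, target, "missing file"))
                     continue
                 # Sort for a deterministic report order.
                 for fragment in sorted(fragments):
                     if fragment not in known_ids[target]:
                         broken.append((source, target, f"missing anchor #{fragment}"))
             return broken


         known_ids = {"api.html": {"foo"}, "index.html": {"top"}}
         links = {"api.html": {"foo", "bar"}, "gone.html": set()}
         report = find_broken("index.html", links, known_ids)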


   .. py:method:: close()

      Remove the temporary download cache directory used for external pages.


   .. py:method:: __enter__()

      Enter the context manager.


   .. py:method:: __exit__(exc_type, exc, tb)

      Exit context manager and clean up temporary files.
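
   The interplay of ``close()``, ``__enter__()`` and ``__exit__()`` follows the
   usual context-manager pattern for temporary resources. The sketch below is a
   hypothetical ``TempCacheSketch`` class, not ``SiteChecker`` itself; that
   ``__enter__`` returns the instance and that the cache is created with
   ``tempfile.mkdtemp`` are assumptions.

   .. code-block:: python

      import shutil
      import tempfile
      from pathlib import Path


      class TempCacheSketch:
          """Own a temporary directory and remove it on close."""

          def __init__(self):
              self.tmp_root = Path(tempfile.mkdtemp(prefix="link_cache_"))

          def close(self):
              # ignore_errors makes close() safe to call more than once.
              shutil.rmtree(self.tmp_root, ignore_errors=True)

          def __enter__(self):
              return self

          def __exit__(self, exc_type, exc, tb):
              # Returning None lets any exception from the block propagate.
              self.close()


      with TempCacheSketch() as cache:
          cache_dir = cache.tmp_root
          (cache_dir / "page.html").write_text("<p>cached</p>", encoding="utf-8")
          existed_inside = cache_dir.exists()

      exists_after = cache_dir.exists()

   Using the object in a ``with`` block ensures the cache is removed even if
   checking raises an exception.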


.. py:function:: main(root_dir, download_timeout = 20, verbose = False, skip_external = False)

   Run the link checker on ``root_dir`` and report broken links and missing anchors.

   :param root_dir: Directory to scan for HTML files.
   :param download_timeout: Timeout (seconds) for downloading external links.
   :param verbose: Enable verbose output (show downloads and skipped links).
   :param skip_external: When True, do not download or validate external HTTP(S) links.


.. py:data:: app

.. py:function:: cli(root_dir, download_timeout = 20, verbose = False, skip_external = False)

   Command-line entry point using Typer and Rich for output.