sec_parser.processing_engine
============================

.. py:module:: sec_parser.processing_engine

.. autoapi-nested-parse::

   The processing_engine subpackage contains the core logic
   for parsing SEC documents. It is designed to work
   in conjunction with the steps from the processing_steps
   subpackage to perform tasks like section
   identification, title parsing, and text extraction.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/sec_parser/processing_engine/core/index
   /autoapi/sec_parser/processing_engine/html_tag/index
   /autoapi/sec_parser/processing_engine/html_tag_parser/index
   /autoapi/sec_parser/processing_engine/processing_log/index
   /autoapi/sec_parser/processing_engine/types/index


Classes
-------

.. autoapisummary::

   sec_parser.processing_engine.AbstractSemanticElementParser
   sec_parser.processing_engine.Edgar10KParser
   sec_parser.processing_engine.Edgar10QParser
   sec_parser.processing_engine.HtmlTag
   sec_parser.processing_engine.HtmlTagParser


Package Contents
----------------

.. py:class:: AbstractSemanticElementParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`abc.ABC`


   Responsible for parsing semantic elements from HTML documents.
   It takes raw HTML and turns it into a list of objects
   representing semantic elements.

   At a High Level:
   ==================
   1. Extract top-level HTML tags from the document.
   2. Transform these tags into a list of more specific semantic
      elements step-by-step.

   Why Focus on Top-Level Tags?
   ============================
   SEC filings usually have a flat HTML structure, which simplifies the
   parsing process. Each top-level HTML tag often directly corresponds
   to a single semantic element. This is different from many websites
   where HTML tags are nested deeply, requiring more complex parsing.
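
   As a rough illustration of the flat structure described above, the
   following sketch collects the top-level tags of a small HTML fragment
   using only the standard library. The ``TopLevelTagCollector`` class and
   the sample markup are hypothetical and are not part of sec_parser, which
   relies on BeautifulSoup4 for this step. The sketch also assumes that
   every tag is explicitly closed.

   .. code-block:: python

      from html.parser import HTMLParser


      class TopLevelTagCollector(HTMLParser):
          """Hypothetical helper: records the names of tags at nesting depth 0."""

          def __init__(self) -> None:
              super().__init__()
              self.depth = 0
              self.top_level: list[str] = []

          def handle_starttag(self, tag, attrs):
              # Only tags opened while no other tag is open are top-level.
              if self.depth == 0:
                  self.top_level.append(tag)
              self.depth += 1

          def handle_endtag(self, tag):
              self.depth -= 1


      collector = TopLevelTagCollector()
      collector.feed(
          "<p>Item 1.</p>"
          "<div><span>Business</span></div>"
          "<table><tr><td>1</td></tr></table>"
      )
      collector.close()
      print(collector.top_level)  # ['p', 'div', 'table']

   In a flat document like this one, each collected tag is a candidate for
   exactly one semantic element.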

   For Advanced Users:
   ====================
   The parsing process is implemented as a sequence of steps and allows for
   customization at each step.

   - Pipeline Pattern: Raw HTML tags are processed in a sequential manner.
     The steps follow an ordered, step-by-step approach, akin to a Finite
     State Machine (FSM). Each element transitions through various states
     defined by the sequence of processing steps.

   - Strategy Pattern: Each step is customizable. You can either replace,
     remove, or extend any of the existing steps with your own or
     inherited implementation. Alternatively, you can replace the entire pipeline
     with your own process.
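
   The two patterns above can be sketched in a few lines of plain Python.
   Everything in this example (``Step``, ``DropEmpty``, ``UppercaseTitles``,
   ``run_pipeline``, ``get_custom_steps``) is hypothetical and only mirrors
   the shape of the ``get_steps`` callable accepted by the constructor. It
   does not use the real sec_parser step classes, and plain strings stand in
   for semantic elements.

   .. code-block:: python

      from typing import Callable, Protocol


      class Step(Protocol):
          """Hypothetical stand-in for a processing step."""

          def process(self, elements: list[str]) -> list[str]: ...


      class DropEmpty:
          def process(self, elements: list[str]) -> list[str]:
              return [e for e in elements if e.strip()]


      class UppercaseTitles:
          def process(self, elements: list[str]) -> list[str]:
              return [e.upper() if e.startswith("Item") else e for e in elements]


      def get_default_steps() -> list[Step]:
          return [DropEmpty(), UppercaseTitles()]


      def run_pipeline(
          elements: list[str],
          get_steps: Callable[[], list[Step]] = get_default_steps,
      ) -> list[str]:
          # Pipeline pattern: each step receives the output of the previous one.
          for step in get_steps():
              elements = step.process(elements)
          return elements


      def get_custom_steps() -> list[Step]:
          # Strategy pattern: start from the defaults and remove one step.
          return [s for s in get_default_steps() if not isinstance(s, UppercaseTitles)]


      raw = ["Item 1. Business", "  ", "We sell widgets."]
      default_result = run_pipeline(raw)
      custom_result = run_pipeline(raw, get_custom_steps)
      print(default_result)  # ['ITEM 1. BUSINESS', 'We sell widgets.']
      print(custom_result)   # ['Item 1. Business', 'We sell widgets.']

   Passing a callable rather than a list means a fresh set of step objects
   is built for every run, so steps that keep internal state do not leak it
   between documents.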


   .. py:attribute:: _get_steps


   .. py:attribute:: _parsing_options


   .. py:attribute:: _html_tag_parser


   .. py:method:: get_default_steps() -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
      :abstractmethod:


   .. py:method:: parse(html: str | bytes, *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]


   .. py:method:: parse_from_tags(root_tags: list[sec_parser.processing_engine.html_tag.HtmlTag], *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]


.. py:class:: Edgar10KParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`AbstractSemanticElementParser`


   The Edgar10KParser class is responsible for parsing SEC EDGAR 10-K
   annual reports. It transforms the HTML documents into a list
   of elements. Each element in this list represents a part of
   the visual structure of the original document.


   .. py:method:: get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]


   .. py:method:: get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]


.. py:class:: Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`AbstractSemanticElementParser`


   The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q
   quarterly reports. It transforms the HTML documents into a list
   of elements. Each element in this list represents a part of
   the visual structure of the original document.


   .. py:method:: get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]


   .. py:method:: get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]


.. py:class:: HtmlTag(bs4_element: bs4.PageElement)

   The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

   It serves three main purposes:

   1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we
      can isolate our application logic from the library specifics. This
      makes it easier to modify or even replace the HTML parsing library in
      the future without extensive codebase changes.

   2. Usability: The HtmlTag class provides a convenient location to add
      extension methods or additional properties not offered by the native
      BeautifulSoup4 Tag class. This enhances the usability of the class.

   3. Caching: The HtmlTag class also caches processing results, improving
      performance by avoiding unnecessary re-computation.


   .. py:attribute:: _bs4
      :type:  bs4.Tag


   .. py:attribute:: _parent
      :type:  HtmlTag | None
      :value: None


   .. py:attribute:: _text
      :type:  str | None
      :value: None


   .. py:attribute:: _children
      :type:  list[HtmlTag] | None
      :value: None


   .. py:attribute:: _is_unary_tree
      :type:  bool | None
      :value: None


   .. py:attribute:: _first_deepest_tag
      :type:  HtmlTag | None | NotSetType


   .. py:attribute:: _text_styles_metrics
      :type:  dict[tuple[str, str], float] | None
      :value: None


   .. py:attribute:: _frozen_dict
      :type:  frozendict.frozendict | None
      :value: None


   .. py:attribute:: _source_code
      :type:  str | None
      :value: None


   .. py:attribute:: _pretty_source_code
      :type:  str | None
      :value: None


   .. py:attribute:: _compatible_source_code
      :type:  str | None
      :value: None


   .. py:attribute:: _approx_table_metrics
      :type:  sec_parser.utils.bs4_.approx_table_metrics.ApproxTableMetrics | None | NotSetType


   .. py:attribute:: _contains_tag
      :type:  dict[tuple[str, bool], bool]


   .. py:attribute:: _without_tags
      :type:  dict[tuple[str, Ellipsis], HtmlTag]


   .. py:attribute:: _count_tags
      :type:  dict[str, int]


   .. py:attribute:: _has_text_outside_tags
      :type:  dict[tuple[str, Ellipsis], bool]


   .. py:attribute:: _contains_words
      :type:  bool | None
      :value: None


   .. py:attribute:: _markdown_table
      :type:  str | None
      :value: None


   .. py:property:: parent
      :type: HtmlTag | None


   .. py:method:: get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) -> str


   .. py:method:: _generate_preview(text: str) -> str

      Generate a preview of the text with a specified length.


   .. py:method:: to_dict() -> frozendict.frozendict

      Return a hashable, immutable dictionary (frozendict) describing the HTML tag.


   .. py:method:: contains_words() -> bool

      Return True if the HTML tag contains text.


   .. py:property:: text
      :type: str


      `text` property recursively extracts text from the child tags.
      The result is cached as the underlying data doesn't change.


   .. py:property:: name
      :type: str


      Returns the tag name, e.g. 'div' for <div>.


   .. py:method:: has_tag_children() -> bool


   .. py:method:: get_children() -> list[HtmlTag]


   .. py:method:: contains_tag(name: str, *, include_self: bool = False) -> bool

      `contains_tag` method checks if the current HTML tag contains a descendant tag
      with the specified name. For example, calling contains_tag("b") on an
      HtmlTag instance representing "<div><p><b>text</b></p></div>" would
      return True, as there is a 'b' tag within the descendants of the 'div' tag.


   .. py:method:: has_text_outside_tags(tags: list[str] | str) -> bool

      `has_text_outside_tags` method checks if the current HTML tag
      has any text outside the specified tags.
      For example, calling has_text_outside_tags(["b"])
      on an HtmlTag instance representing "<div><p><b>text</b>extra text</p></div>"
      would return True, as there is text outside the 'b'
      tag within the descendants of the 'div' tag.
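
      The check can be sketched with the standard library as follows. This is
      an illustration of the semantics only, not the sec_parser
      implementation: the ``has_text_outside`` helper is hypothetical, and the
      sketch assumes the markup is well-formed enough for
      ``xml.etree.ElementTree`` to parse.

      .. code-block:: python

         import xml.etree.ElementTree as ET


         def has_text_outside(element: ET.Element, tags: set[str]) -> bool:
             """Return True if any non-whitespace text lies outside the given tags."""
             if element.tag in tags:
                 # Everything inside this element counts as "inside" the tags.
                 return False
             if element.text and element.text.strip():
                 return True
             for child in element:
                 if has_text_outside(child, tags):
                     return True
                 # ElementTree stores text that follows a child as the child's tail.
                 if child.tail and child.tail.strip():
                     return True
             return False


         mixed = ET.fromstring("<div><p><b>text</b>extra text</p></div>")
         only_bold = ET.fromstring("<div><p><b>text</b></p></div>")
         print(has_text_outside(mixed, {"b"}))      # True
         print(has_text_outside(only_bold, {"b"}))  # False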


   .. py:method:: without_tags(names: collections.abc.Iterable[str]) -> HtmlTag

      `without_tags` method creates a copy of the current HTML tag and removes all
      descendant tags with the specified names. For example, calling
      without_tags(["b", "i"]) on an HtmlTag instance representing
      "<div><b>foo</b><p>bar<i>bax</i></p></div>" would
      return a new HtmlTag instance representing "<div><p>bar</p></div>".
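
      A minimal sketch of the same copy-then-remove behaviour, using only the
      standard library, might look like this. The ``without_tags`` and
      ``_strip`` functions below are hypothetical stand-ins operating on
      ``xml.etree.ElementTree`` elements rather than on HtmlTag, and they
      assume well-formed markup. Keeping the text that follows a removed tag
      is a choice made for this sketch.

      .. code-block:: python

         import copy
         import xml.etree.ElementTree as ET


         def _strip(parent: ET.Element, names: set[str]) -> None:
             previous = None
             for child in list(parent):
                 if child.tag in names:
                     # Keep any text that followed the removed tag.
                     if child.tail:
                         if previous is None:
                             parent.text = (parent.text or "") + child.tail
                         else:
                             previous.tail = (previous.tail or "") + child.tail
                     parent.remove(child)
                 else:
                     _strip(child, names)
                     previous = child


         def without_tags(element: ET.Element, names: set[str]) -> ET.Element:
             """Return a deep copy of element without descendants named in names."""
             clone = copy.deepcopy(element)
             _strip(clone, names)
             return clone


         original = ET.fromstring("<div><b>foo</b><p>bar<i>bax</i></p></div>")
         stripped = without_tags(original, {"b", "i"})
         print(ET.tostring(stripped, encoding="unicode"))  # <div><p>bar</p></div>
         # The original tree is left untouched.
         print(ET.tostring(original, encoding="unicode"))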


   .. py:method:: count_tags(name: str) -> int

      `count_tags` method counts the number of descendant tags with the specified name
      within the current HTML tag. For example, calling count_tags("b") on an
      HtmlTag instance representing "<div><p><b>text</b></p><b>more text</b></div>"
      would return 2, as there are two 'b' tags within the descendants of
      the 'div' tag.
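
      The counting rule can be reproduced with the standard library as a quick
      sanity check. The ``count_descendant_tags`` helper is hypothetical and
      works on ``xml.etree.ElementTree`` elements, assuming well-formed
      markup. It is not the sec_parser implementation.

      .. code-block:: python

         import xml.etree.ElementTree as ET


         def count_descendant_tags(element: ET.Element, name: str) -> int:
             # Element.iter() also yields the element itself, so skip it.
             return sum(1 for node in element.iter(name) if node is not element)


         root = ET.fromstring("<div><p><b>text</b></p><b>more text</b></div>")
         print(count_descendant_tags(root, "b"))    # 2
         print(count_descendant_tags(root, "div"))  # 0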


   .. py:method:: is_unary_tree() -> bool

      `is_unary_tree` determines if a BeautifulSoup tag forms a unary tree.
      In a unary tree, each node has at most one child.

      However, if a non-leaf node contains a non-empty string even without a tag
      surrounding it, the tree is not considered unary.

      Additionally, if the tag is a 'table', the function will return True
      regardless of its children. This is because in the context of this application,
      'table' tags are always considered unary.
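
      The three rules above can be sketched as a short recursive function.
      This is one reading of the description, written against
      ``xml.etree.ElementTree`` rather than BeautifulSoup4. The function below
      is a hypothetical stand-in and assumes well-formed markup.

      .. code-block:: python

         import xml.etree.ElementTree as ET


         def is_unary_tree(element: ET.Element) -> bool:
             if element.tag == "table":
                 return True  # tables are treated as unary regardless of children
             children = list(element)
             if not children:
                 return True  # a leaf is trivially unary
             # A non-leaf node that carries its own non-empty text is not unary.
             # ElementTree stores text after a child as that child's tail.
             has_loose_text = bool(element.text and element.text.strip()) or any(
                 c.tail and c.tail.strip() for c in children
             )
             if has_loose_text or len(children) > 1:
                 return False
             return is_unary_tree(children[0])


         chain = ET.fromstring("<div><p><b>text</b></p></div>")
         siblings = ET.fromstring("<div><p>a</p><p>b</p></div>")
         loose = ET.fromstring("<div>loose<p>text</p></div>")
         table = ET.fromstring("<table><tr><td>1</td><td>2</td></tr></table>")
         print(is_unary_tree(chain))     # True
         print(is_unary_tree(siblings))  # False
         print(is_unary_tree(loose))     # False
         print(is_unary_tree(table))     # True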


   .. py:method:: get_text_styles_metrics() -> dict[tuple[str, str], float]

      Compute the percentage distribution of various CSS styles within the text
      content of a given HTML tag and its descendants.

      This function iterates through all the text nodes within the tag, recursively
      includes text from child elements, and calculates the effective styles applied
      to each text segment.

      It aggregates these styles and computes their percentage distribution based
      on the length of text they apply to.

      The function uses BeautifulSoup's recursive text search and parent traversal
      features. It returns a dictionary containing the aggregated style metrics
      (the percentage distribution of styles).

      Each dictionary entry maps a unique style, expressed as a (property, value)
      tuple, to the percentage of text it affects.
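
      To make the weighting concrete, here is a small standard-library sketch
      of the idea. ``parse_style`` and ``StyleMetrics`` are hypothetical names,
      not part of sec_parser. The sketch treats every CSS property as
      inherited by child tags, reads only inline ``style`` attributes, and
      assumes every tag is explicitly closed, all of which are
      simplifications.

      .. code-block:: python

         from collections import defaultdict
         from html.parser import HTMLParser


         def parse_style(style: str) -> dict[str, str]:
             """Turn 'font-weight: bold; color: red' into a property-to-value dict."""
             pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
             return {prop.strip(): value.strip() for prop, value in pairs}


         class StyleMetrics(HTMLParser):
             """Hypothetical helper: weights each style by the text length it covers."""

             def __init__(self) -> None:
                 super().__init__()
                 self.stack: list[dict[str, str]] = [{}]  # effective styles per open tag
                 self.lengths: dict[tuple[str, str], int] = defaultdict(int)
                 self.total = 0

             def handle_starttag(self, tag, attrs):
                 # A child starts from its parent's styles and may override them.
                 effective = dict(self.stack[-1])
                 effective.update(parse_style(dict(attrs).get("style") or ""))
                 self.stack.append(effective)

             def handle_endtag(self, tag):
                 self.stack.pop()

             def handle_data(self, data):
                 text = data.strip()
                 self.total += len(text)
                 for style in self.stack[-1].items():
                     self.lengths[style] += len(text)

             def percentages(self) -> dict[tuple[str, str], float]:
                 if not self.total:
                     return {}
                 return {s: 100 * n / self.total for s, n in self.lengths.items()}


         metrics = StyleMetrics()
         metrics.feed(
             '<div><span style="font-weight: bold">Item</span><span>123456</span></div>'
         )
         metrics.close()
         print(metrics.percentages())  # {('font-weight', 'bold'): 40.0}

      Here 4 of the 10 text characters sit inside the bold span, so the
      ``('font-weight', 'bold')`` entry is 40.0.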


   .. py:method:: get_approx_table_metrics() -> sec_parser.utils.bs4_.approx_table_metrics.ApproxTableMetrics | None


   .. py:method:: is_table_of_content() -> bool


   .. py:method:: table_to_markdown() -> str


   .. py:method:: _to_tag(element: bs4.PageElement) -> bs4.Tag
      :staticmethod:


   .. py:method:: wrap_tags_in_new_parent(parent_tag_name: str, tags: collections.abc.Iterable[HtmlTag]) -> HtmlTag
      :staticmethod:


   .. py:method:: count_text_matches_in_descendants(predicate: Callable[[str], bool], *, exclude_links: bool | None = None) -> int


.. py:class:: HtmlTagParser(parser_backend: str | None = None)

   Bases: :py:obj:`AbstractHtmlTagParser`


   The HtmlTagParser parses an HTML document using BeautifulSoup4.
   It then wraps the parsed bs4.Tag objects into HtmlTag objects.


   .. py:attribute:: _parser_backend
      :value: ''


   .. py:method:: parse(html: str | bytes) -> list[sec_parser.processing_engine.html_tag.HtmlTag]


   .. py:method:: _parse_to_bs4(html: str | bytes) -> bs4.Tag