sec_parser.processing_engine
============================

.. py:module:: sec_parser.processing_engine

.. autoapi-nested-parse::

   The processing_engine subpackage contains the core logic for parsing SEC documents.
   It is designed to work in conjunction with the steps from the processing_steps subpackage
   to perform tasks like section identification, title parsing, and text extraction.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/sec_parser/processing_engine/core/index
   /autoapi/sec_parser/processing_engine/html_tag/index
   /autoapi/sec_parser/processing_engine/html_tag_parser/index
   /autoapi/sec_parser/processing_engine/processing_log/index
   /autoapi/sec_parser/processing_engine/types/index


Classes
-------

.. autoapisummary::

   sec_parser.processing_engine.AbstractSemanticElementParser
   sec_parser.processing_engine.Edgar10KParser
   sec_parser.processing_engine.Edgar10QParser
   sec_parser.processing_engine.HtmlTag
   sec_parser.processing_engine.HtmlTagParser


Package Contents
----------------

.. py:class:: AbstractSemanticElementParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`abc.ABC`

   Responsible for parsing semantic elements from HTML documents.
   It takes raw HTML and turns it into a list of objects
   representing semantic elements.

   At a High Level:
   ==================

   1. Extract top-level HTML tags from the document.
   2. Transform these tags into a list of more specific semantic elements step-by-step.

   Why Focus on Top-Level Tags?
   ============================

   SEC filings usually have a flat HTML structure, which simplifies the
   parsing process. Each top-level HTML tag often directly corresponds
   to a single semantic element. This is different from many websites
   where HTML tags are nested deeply, requiring more complex parsing.

   For Advanced Users:
   ====================

   The parsing process is implemented as a sequence of steps and allows for
   customization at each step.

   - Pipeline Pattern: Raw HTML tags are processed in a sequential manner.
     The steps follow an ordered, step-by-step approach, akin to a Finite
     State Machine (FSM). Each element transitions through various states
     defined by the sequence of processing steps.
   - Strategy Pattern: Each step is customizable. You can replace, remove,
     or extend any of the existing steps with your own or an inherited
     implementation. Alternatively, you can replace the entire pipeline
     with your own process.

   .. py:attribute:: _get_steps

   .. py:attribute:: _parsing_options

   .. py:attribute:: _html_tag_parser

   .. py:method:: get_default_steps() -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
      :abstractmethod:

   .. py:method:: parse(html: str | bytes, *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

   .. py:method:: parse_from_tags(root_tags: list[sec_parser.processing_engine.html_tag.HtmlTag], *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]


.. py:class:: Edgar10KParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`AbstractSemanticElementParser`

   The Edgar10KParser class is responsible for parsing SEC EDGAR 10-K
   annual reports. It transforms the HTML documents into a list
   of elements. Each element in this list represents a part of
   the visual structure of the original document.

   .. py:method:: get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]

   .. py:method:: get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]


.. py:class:: Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`AbstractSemanticElementParser`

   The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q
   quarterly reports. It transforms the HTML documents into a list
   of elements. Each element in this list represents a part of
   the visual structure of the original document.

   .. py:method:: get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]

   .. py:method:: get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]


.. py:class:: HtmlTag(bs4_element: bs4.PageElement)

   The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

   It serves three main purposes:

   1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we
      can isolate our application logic from the library specifics. This
      makes it easier to modify or even replace the HTML parsing library in
      the future without extensive codebase changes.
   2. Usability: The HtmlTag class provides a convenient location to add
      extension methods or additional properties not offered by the native
      BeautifulSoup4 Tag class. This enhances the usability of the class.
   3. Caching: The HtmlTag class also caches processing results, improving
      performance by avoiding unnecessary re-computation.

   .. py:attribute:: _bs4
      :type: bs4.Tag

   .. py:attribute:: _parent
      :type: HtmlTag | None
      :value: None

   .. py:attribute:: _text
      :type: str | None
      :value: None

   .. py:attribute:: _children
      :type: list[HtmlTag] | None
      :value: None

   .. py:attribute:: _is_unary_tree
      :type: bool | None
      :value: None

   .. py:attribute:: _first_deepest_tag
      :type: HtmlTag | None | NotSetType

   .. py:attribute:: _text_styles_metrics
      :type: dict[tuple[str, str], float] | None
      :value: None

   .. py:attribute:: _frozen_dict
      :type: frozendict.frozendict | None
      :value: None

   .. py:attribute:: _source_code
      :type: str | None
      :value: None

   .. py:attribute:: _pretty_source_code
      :type: str | None
      :value: None

   .. py:attribute:: _compatible_source_code
      :type: str | None
      :value: None

   .. py:attribute:: _approx_table_metrics
      :type: sec_parser.utils.bs4_.approx_table_metrics.ApproxTableMetrics | None | NotSetType

   .. py:attribute:: _contains_tag
      :type: dict[tuple[str, bool], bool]

   .. py:attribute:: _without_tags
      :type: dict[tuple[str, Ellipsis], HtmlTag]

   .. py:attribute:: _count_tags
      :type: dict[str, int]

   .. py:attribute:: _has_text_outside_tags
      :type: dict[tuple[str, Ellipsis], bool]

   .. py:attribute:: _contains_words
      :type: bool | None
      :value: None

   .. py:attribute:: _markdown_table
      :type: str | None
      :value: None

   .. py:property:: parent
      :type: HtmlTag | None

   .. py:method:: get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) -> str

   .. py:method:: _generate_preview(text: str) -> str

      Generate a preview of the text with a specified length.

   .. py:method:: to_dict() -> frozendict.frozendict

      Compute the hash of the HTML tag.

   .. py:method:: contains_words() -> bool

      Return True if the semantic element contains text.

   .. py:property:: text
      :type: str

      `text` property recursively extracts text from the child tags.
      The result is cached as the underlying data doesn't change.

   .. py:property:: name
      :type: str

      Returns tag name, e.g. for
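
The Pipeline and Strategy patterns described for ``AbstractSemanticElementParser`` can be illustrated with a minimal, self-contained sketch. It does not use ``sec_parser`` itself: ``Element``, ``Step``, ``TitleStep``, ``TextStep``, ``default_steps`` and ``SketchParser`` are hypothetical stand-ins invented for this illustration, and the upper-case title heuristic is an assumption, not the library's actual logic. Only the idea of a ``get_steps`` callable that supplies the ordered list of steps is taken from the signatures documented above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a semantic element."""

    kind: str
    text: str


class Step:
    """Hypothetical stand-in for a processing step."""

    def process(self, elements: list[Element]) -> list[Element]:
        raise NotImplementedError


class TitleStep(Step):
    """Reclassify unclassified, fully upper-case elements as titles (assumed heuristic)."""

    def process(self, elements: list[Element]) -> list[Element]:
        return [
            Element("title", e.text)
            if e.kind == "unclassified" and e.text.isupper()
            else e
            for e in elements
        ]


class TextStep(Step):
    """Turn whatever is still unclassified into plain text."""

    def process(self, elements: list[Element]) -> list[Element]:
        return [
            Element("text", e.text) if e.kind == "unclassified" else e
            for e in elements
        ]


def default_steps() -> list[Step]:
    """Default pipeline: the order matters, titles are detected before text."""
    return [TitleStep(), TextStep()]


class SketchParser:
    """Runs each step in order over the whole element list (Pipeline pattern)."""

    def __init__(self, get_steps: Callable[[], list[Step]] | None = None) -> None:
        # Strategy pattern: the caller may supply a different list of steps.
        self._get_steps = get_steps or default_steps

    def parse(self, top_level_texts: list[str]) -> list[Element]:
        elements = [Element("unclassified", t) for t in top_level_texts]
        for step in self._get_steps():
            elements = step.process(elements)
        return elements


raw = ["RISK FACTORS", "Our business depends on suppliers."]

# Default pipeline.
default_result = SketchParser().parse(raw)

# Customized pipeline: remove the title step, keep the rest.
custom_result = SketchParser(
    lambda: [s for s in default_steps() if not isinstance(s, TitleStep)]
).parse(raw)
```

Passing a callable rather than a fixed list means a fresh set of step objects is built for every parse, so steps that keep internal state do not leak it between documents. That is one plausible reason for a ``get_steps``-style design, stated here as an assumption.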