sec_parser.semantic_elements
============================

.. py:module:: sec_parser.semantic_elements

.. autoapi-nested-parse::

   The semantic_elements subpackage provides abstractions for meaningful units in
   SEC EDGAR documents. It converts raw HTML elements into representations that
   carry semantic significance.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/sec_parser/semantic_elements/abstract_semantic_element/index
   /autoapi/sec_parser/semantic_elements/composite_semantic_element/index
   /autoapi/sec_parser/semantic_elements/highlighted_text_element/index
   /autoapi/sec_parser/semantic_elements/mixins/index
   /autoapi/sec_parser/semantic_elements/semantic_elements/index
   /autoapi/sec_parser/semantic_elements/table_element/index
   /autoapi/sec_parser/semantic_elements/title_element/index
   /autoapi/sec_parser/semantic_elements/top_section_start_marker/index
   /autoapi/sec_parser/semantic_elements/top_section_title/index
   /autoapi/sec_parser/semantic_elements/top_section_title_types/index


Exceptions
----------

.. autoapisummary::

   sec_parser.semantic_elements.InvalidLevelError


Classes
-------

.. autoapisummary::

   sec_parser.semantic_elements.AbstractLevelElement
   sec_parser.semantic_elements.AbstractSemanticElement
   sec_parser.semantic_elements.CompositeSemanticElement
   sec_parser.semantic_elements.EmptyElement
   sec_parser.semantic_elements.ImageElement
   sec_parser.semantic_elements.IrrelevantElement
   sec_parser.semantic_elements.NotYetClassifiedElement
   sec_parser.semantic_elements.PageHeaderElement
   sec_parser.semantic_elements.PageNumberElement
   sec_parser.semantic_elements.SupplementaryText
   sec_parser.semantic_elements.TextElement
   sec_parser.semantic_elements.TableElement
   sec_parser.semantic_elements.TitleElement
   sec_parser.semantic_elements.TopSectionTitle


Package Contents
----------------

.. py:class:: AbstractLevelElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, level: int | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`AbstractSemanticElement`

   The AbstractLevelElement class provides a level attribute to semantic elements.
   It represents hierarchical levels in the document structure. For instance, a
   main section title might be at level 1, a subsection at level 2, etc.

   .. py:attribute:: MIN_LEVEL
      :value: 0

   .. py:attribute:: level
      :value: 0

   .. py:method:: create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, level: int | None = None) -> AbstractLevelElement
      :classmethod:

      Convert the semantic element into another semantic element type.

   .. py:method:: to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]

   .. py:method:: __repr__() -> str


.. py:class:: AbstractSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`abc.ABC`

   In the domain of HTML parsing, especially in the context of SEC EDGAR documents,
   a semantic element refers to a meaningful unit within the document that serves a
   specific purpose. For example, a paragraph or a table might be considered a
   semantic element. Unlike syntactic elements, which merely exist to structure the
   HTML, semantic elements carry information that is vital to the understanding of
   the document's content.

   This class serves as a foundational representation of such semantic elements,
   containing an HtmlTag object that stores the raw HTML tag information.
   Subclasses will implement additional behaviors based on the type of the semantic
   element.

   .. py:attribute:: _html_tag

   .. py:attribute:: processing_log

   .. py:method:: log_init(log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None) -> None

      Has to be called at the very end of the ``__init__`` method.

   .. py:property:: html_tag
      :type: sec_parser.processing_engine.html_tag.HtmlTag

   .. py:method:: create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin) -> AbstractSemanticElement
      :classmethod:

      Convert the semantic element into another semantic element type.

   .. py:method:: to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]

   .. py:method:: __repr__() -> str

   .. py:method:: contains_words() -> bool

      Return True if the semantic element contains text.

   .. py:property:: text
      :type: str

      Property text is a passthrough to the HtmlTag text property.

   .. py:method:: get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) -> str

      get_source_code is a passthrough to the HtmlTag method.

   .. py:method:: get_summary() -> str

      Return a human-readable summary of the semantic element.

      This method aims to provide a simplified, human-friendly representation of
      the underlying HtmlTag. In this base implementation, it is a passthrough to
      the HtmlTag's get_text() method.

      Note: Subclasses may override this method to provide a more specific summary
      based on the type of element.


.. py:exception:: InvalidLevelError

   Bases: :py:obj:`sec_parser.exceptions.SecParserValueError`

   Base exception class for sec_parser. All custom exceptions in sec_parser are
   inherited from this class.


.. py:class:: CompositeSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis] | None, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   CompositeSemanticElement acts as a container for other semantic elements,
   especially for cases where a single HTML root tag wraps multiple elements. This
   ensures structural integrity and enables features such as semantic segmentation
   visualization and debugging by comparison with the original document.

   Why is this useful:

   1. Some semantic elements, like XBRL tags, may wrap multiple semantic elements.
      The container ensures that these relationships are not broken during
      parsing.
   2. Enables the parser to fully reconstruct the original HTML document, which
      opens up possibilities for features like semantic segmentation visualization
      (e.g. recreate the original document but put semi-transparent colored boxes
      on top, based on semantic meaning), serialization of parsed documents into
      an augmented HTML, and debugging by comparing to the original document.

   .. py:attribute:: _inner_elements
      :type: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis]
      :value: ()

   .. py:property:: inner_elements
      :type: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis]

   .. py:method:: create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, inner_elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement] | None = None) -> CompositeSemanticElement
      :classmethod:

      Convert the semantic element into another semantic element type.

   .. py:method:: to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]

   .. py:method:: unwrap_elements(elements: collections.abc.Iterable[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, include_containers: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
      :classmethod:

      Recursively flatten a list of AbstractSemanticElement objects. For each
      CompositeSemanticElement encountered, its inner_elements are also recursively
      flattened. The 'include_containers' parameter controls whether the
      CompositeSemanticElement itself is included in the flattened list.


.. py:class:: EmptyElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`IrrelevantElement`

   The EmptyElement class represents an HTML element that does not contain any
   content. It is a subclass of the IrrelevantElement class and is used to identify
   and handle empty HTML tags in the document.


.. py:class:: ImageElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   The ImageElement class represents a standard image within a document.


.. py:class:: IrrelevantElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   The IrrelevantElement class identifies elements in the parsed HTML that do not
   contribute to the content. These elements often include page separators, page
   numbers, and other non-content items. For instance, HTML tags without content
   are deemed irrelevant; they are often used in documents just to add vertical
   space.


.. py:class:: NotYetClassifiedElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   The NotYetClassifiedElement class represents an element whose type has not yet
   been determined. The parsing process aims to classify all instances of this
   class into more specific subclasses of AbstractSemanticElement.


.. py:class:: PageHeaderElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`IrrelevantElement`

   The PageHeaderElement class represents a page header within a document. It is a
   subclass of the IrrelevantElement class and is used to identify and handle page
   headers in the document, such as current section titles and company names.


.. py:class:: PageNumberElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`IrrelevantElement`

   The PageNumberElement class represents a page number within a document. It is a
   subclass of the IrrelevantElement class and is used to identify and handle page
   numbers in the document.


.. py:class:: SupplementaryText(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin`, :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   The SupplementaryText class captures various types of supplementary text within
   a document, such as unit qualifiers, additional notes, and disclaimers.

   For example:

   - "(In millions, except number of shares which are reflected in thousands and
     per share amounts)"
   - "See accompanying Notes to Condensed Consolidated Financial Statements."
   - "Disclaimer: This is not financial advice."


.. py:class:: TextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin`, :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   The TextElement class represents a standard text paragraph within a document.


.. py:class:: TableElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement`

   The TableElement class represents a standard table within a document.

   .. py:method:: get_summary() -> str

      Return a human-readable summary of the semantic element.

      This method aims to provide a simplified, human-friendly representation of
      the underlying HtmlTag.

   .. py:method:: to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]

   .. py:method:: table_to_markdown() -> str


.. py:class:: TitleElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, level: int | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin`, :py:obj:`sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement`

   The TitleElement class represents the title of a paragraph or other content
   object. It serves as a semantic marker, providing context and structure to the
   document.


.. py:class:: TopSectionTitle(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None, level: int | None = None, section_type: sec_parser.semantic_elements.top_section_title_types.TopSectionInFiling | None = None)

   Bases: :py:obj:`sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin`, :py:obj:`sec_parser.semantic_elements.top_section_start_marker.TopSectionStartMarker`

   The TopSectionTitle class represents the title and the beginning of a top-level
   section of a document. For instance, in SEC 10-Q reports, a top-level section
   could be "Part I, Item 3. Quantitative and Qualitative Disclosures About Market
   Risk.".

   .. py:method:: create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, level: int | None = None, section_type: sec_parser.semantic_elements.top_section_title_types.TopSectionInFiling | None = None) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement
      :classmethod:

      Convert the semantic element into another semantic element type.

   .. py:method:: to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]
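

The recursive flattening described for ``CompositeSemanticElement.unwrap_elements`` can be illustrated with a minimal, self-contained sketch. It does not import ``sec_parser``: ``Leaf``, ``Composite``, and ``unwrap`` are hypothetical stand-ins invented for illustration, and treating ``include_containers=None`` the same as ``False`` is an assumption rather than documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Leaf:
    """Hypothetical stand-in for any non-composite semantic element."""

    name: str


@dataclass(frozen=True)
class Composite:
    """Hypothetical stand-in for CompositeSemanticElement."""

    name: str
    inner_elements: tuple[Leaf | Composite, ...] = ()


def unwrap(
    elements: Iterable[Leaf | Composite],
    *,
    include_containers: bool | None = None,
) -> list[Leaf | Composite]:
    """Recursively flatten elements, optionally keeping the containers."""
    # Assumption: None is treated like False (containers are dropped).
    keep_containers = bool(include_containers)
    flat: list[Leaf | Composite] = []
    for element in elements:
        if isinstance(element, Composite):
            if keep_containers:
                # The container is emitted before its contents.
                flat.append(element)
            flat.extend(
                unwrap(element.inner_elements, include_containers=keep_containers)
            )
        else:
            flat.append(element)
    return flat


# A two-level nesting: "outer" wraps a text leaf and another container.
tree = [
    Leaf("title"),
    Composite("outer", (Leaf("text"), Composite("inner", (Leaf("table"),)))),
]

names_without = [e.name for e in unwrap(tree)]
names_with = [e.name for e in unwrap(tree, include_containers=True)]

print(names_without)  # ['title', 'text', 'table']
print(names_with)     # ['title', 'outer', 'text', 'inner', 'table']
```

In this sketch the flattened list preserves document order, and the flag only decides whether the wrapping containers appear alongside their contents.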