sec_parser.semantic_elements

The semantic_elements subpackage provides abstractions for meaningful units in SEC EDGAR documents. It converts raw HTML elements into representations that carry semantic significance.

Exceptions

InvalidLevelError

Exception for invalid level values; inherits from the sec_parser base exception class.

Classes

`AbstractLevelElement`	The AbstractLevelElement class provides a level attribute to semantic elements.
`AbstractSemanticElement`	In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose.
`CompositeSemanticElement`	CompositeSemanticElement acts as a container for other semantic elements, especially for cases where a single HTML root tag wraps multiple elements.
`EmptyElement`	The EmptyElement class represents an HTML element that does not contain any content.
`ImageElement`	The ImageElement class represents a standard image within a document.
`IrrelevantElement`	The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content.
`NotYetClassifiedElement`	The NotYetClassifiedElement class represents an element whose type has not yet been determined.
`PageHeaderElement`	The PageHeaderElement class represents a page header within a document.
`PageNumberElement`	The PageNumberElement class represents a page number within a document.
`SupplementaryText`	The SupplementaryText class captures various types of supplementary text within a document, such as unit qualifiers, additional notes, and disclaimers.
`TextElement`	The TextElement class represents a standard text paragraph within a document.
`TableElement`	The TableElement class represents a standard table within a document.
`TitleElement`	The TitleElement class represents the title of a paragraph or other content object.
`TopSectionTitle`	The TopSectionTitle class represents the title and the beginning of a top-level section of a document.

Package Contents

class sec_parser.semantic_elements.AbstractLevelElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, level: int | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: AbstractSemanticElement

The AbstractLevelElement class provides a level attribute to semantic elements. It represents hierarchical levels in the document structure. For instance, a main section title might be at level 1, a subsection at level 2, etc.

MIN_LEVEL = 0

classmethod create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, level: int | None = None) → AbstractLevelElement: Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

__repr__() → str: Return repr(self).

class sec_parser.semantic_elements.AbstractSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: abc.ABC

In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose. For example, a paragraph or a table might be considered a semantic element. Unlike syntactic elements, which merely exist to structure the HTML, semantic elements carry information that is vital to the understanding of the document’s content.

This class serves as a foundational representation of such semantic elements, containing an HtmlTag object that stores the raw HTML tag information. Subclasses will implement additional behaviors based on the type of the semantic element.

log_init(log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None) → None: Has to be called at the very end of the __init__ method.

property html_tag: sec_parser.processing_engine.html_tag.HtmlTag

classmethod create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin) → AbstractSemanticElement: Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

__repr__() → str: Return repr(self).

contains_words() → bool: Return True if the semantic element contains text.

property text: str: Property text is a passthrough to the HtmlTag text property.

get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) → str: get_source_code is a passthrough to the HtmlTag method.

get_summary() → str

Return a human-readable summary of the semantic element.

This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag. In this base implementation, it is a passthrough to the HtmlTag’s get_text() method.

Note: Subclasses may override this method to provide a more specific summary based on the type of element.
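The passthrough-and-override pattern described for text and get_summary can be sketched with stand-in classes. TagSketch, ElementSketch and TableSketch are hypothetical names, not part of sec_parser, and the rule used by contains_words here (at least one alphabetic character) is an assumption for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagSketch:
    """Stand-in for HtmlTag; only a text attribute is modeled."""

    text: str


class ElementSketch:
    """Stand-in for a semantic element that wraps a tag."""

    def __init__(self, html_tag: TagSketch) -> None:
        self._html_tag = html_tag

    @property
    def html_tag(self) -> TagSketch:
        return self._html_tag

    @property
    def text(self) -> str:
        # Passthrough to the wrapped tag.
        return self._html_tag.text

    def contains_words(self) -> bool:
        # Assumption: "words" means at least one alphabetic character.
        return any(ch.isalpha() for ch in self.text)

    def get_summary(self) -> str:
        # Base behavior: return the tag's text unchanged.
        return self.text


class TableSketch(ElementSketch):
    def get_summary(self) -> str:
        # Hypothetical override: describe the size instead of the content.
        return f"Table with {len(self.text.split())} tokens"


paragraph = ElementSketch(TagSketch("Revenue increased."))
table = TableSketch(TagSketch("2023 2022 100 90"))
```

Callers can then use get_summary() uniformly and let each element type decide how much detail to show.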

exception sec_parser.semantic_elements.InvalidLevelError

Bases: sec_parser.exceptions.SecParserValueError

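Exception for invalid level values on level-bearing elements such as AbstractLevelElement. It derives from SecParserValueError and, through it, from the sec_parser base exception class that all custom exceptions in sec_parser inherit from.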

class sec_parser.semantic_elements.CompositeSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis] | None, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

CompositeSemanticElement acts as a container for other semantic elements, especially for cases where a single HTML root tag wraps multiple elements. This ensures structural integrity and enables features such as semantic segmentation visualization and debugging by comparison with the original document.

Why is this useful:

1. Some semantic elements, like XBRL tags (<ix>), may wrap multiple semantic elements. The container ensures that these relationships are not broken during parsing.
2. Enables the parser to fully reconstruct the original HTML document, which opens up possibilities for features like semantic segmentation visualization (e.g. recreate the original document but put semi-transparent colored boxes on top, based on semantic meaning), serialization of parsed documents into an augmented HTML, and debugging by comparing to the original document.

property inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis]

classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, inner_elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement] | None = None) → CompositeSemanticElement: Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

classmethod unwrap_elements(elements: collections.abc.Iterable[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, include_containers: bool | None = None) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]: Recursively flatten a list of AbstractSemanticElement objects. For each CompositeSemanticElement encountered, its inner_elements are also recursively flattened. The ‘include_containers’ parameter controls whether the CompositeSemanticElement itself is included in the flattened list.
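The recursive flattening that unwrap_elements describes can be sketched as follows. This is a standalone illustration with stand-in classes (Leaf and Composite are hypothetical names), not the library's code. Two details are assumptions: containers are excluded by default, and when included, a container is placed immediately before its own inner elements.

```python
from __future__ import annotations

from collections.abc import Iterable


class Leaf:
    """Stand-in for a plain semantic element."""

    def __init__(self, name: str) -> None:
        self.name = name


class Composite(Leaf):
    """Stand-in for a container holding other elements."""

    def __init__(self, name: str, inner_elements: tuple[Leaf, ...]) -> None:
        super().__init__(name)
        self.inner_elements = inner_elements


def unwrap_elements(
    elements: Iterable[Leaf], *, include_containers: bool = False
) -> list[Leaf]:
    flattened: list[Leaf] = []
    for element in elements:
        if not isinstance(element, Composite):
            flattened.append(element)
            continue
        # Assumption: the container, if kept, precedes its contents.
        if include_containers:
            flattened.append(element)
        # Recurse so nested containers are flattened as well.
        flattened.extend(
            unwrap_elements(
                element.inner_elements,
                include_containers=include_containers,
            )
        )
    return flattened


tree = [
    Leaf("a"),
    Composite("c1", (Leaf("b"), Composite("c2", (Leaf("c"),)))),
    Leaf("d"),
]
names = [e.name for e in unwrap_elements(tree)]
names_with_containers = [
    e.name for e in unwrap_elements(tree, include_containers=True)
]
```

With this input, names contains only the leaves in document order, while names_with_containers also lists c1 and c2 ahead of their contents.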

class sec_parser.semantic_elements.EmptyElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: IrrelevantElement

The EmptyElement class represents an HTML element that does not contain any content. It is a subclass of the IrrelevantElement class and is used to identify and handle empty HTML tags in the document.

class sec_parser.semantic_elements.ImageElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The ImageElement class represents a standard image within a document.

class sec_parser.semantic_elements.IrrelevantElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content. These elements often include page separators, page numbers, and other non-content items. For instance, HTML tags without content like <p></p> or <div></div> are deemed irrelevant, often used in documents just to add vertical space.

class sec_parser.semantic_elements.NotYetClassifiedElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The NotYetClassifiedElement class represents an element whose type has not yet been determined. The parsing process aims to classify all instances of this class into more specific subclasses of AbstractSemanticElement.
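How an unclassified element might be converted into a more specific type through create_from_element can be sketched with stand-in classes. The classification rule (whitespace-only text becomes an empty element, anything else a text element), the class names, and the use of a plain string as log_origin are all hypothetical and chosen only to keep the example short.

```python
from __future__ import annotations


class ElementSketch:
    """Stand-in for a semantic element with a tiny processing history."""

    def __init__(self, text: str, history: tuple[str, ...] = ()) -> None:
        self.text = text
        self.history = history  # stand-in for a processing log

    @classmethod
    def create_from_element(
        cls, source: ElementSketch, log_origin: str
    ) -> ElementSketch:
        # Build an instance of the target class from the source's data and
        # record which step performed the conversion.
        return cls(source.text, source.history + (log_origin,))


class NotYetClassifiedSketch(ElementSketch):
    pass


class TextSketch(ElementSketch):
    pass


class EmptySketch(ElementSketch):
    pass


def classify(element: ElementSketch) -> ElementSketch:
    # Already-classified elements pass through unchanged.
    if not isinstance(element, NotYetClassifiedSketch):
        return element
    # Hypothetical rule: whitespace-only text means an empty element.
    target = TextSketch if element.text.strip() else EmptySketch
    return target.create_from_element(element, log_origin="classify")


raw = [NotYetClassifiedSketch("Net sales rose 4%."), NotYetClassifiedSketch("   ")]
classified = [classify(e) for e in raw]
```

Because the conversion is a classmethod on the target type, each step only needs to pick the destination class and hand over the source element.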

class sec_parser.semantic_elements.PageHeaderElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: IrrelevantElement

The PageHeaderElement class represents a page header within a document. It is a subclass of the IrrelevantElement class and is used to identify and handle page headers in the document, such as current section titles and company names.

class sec_parser.semantic_elements.PageNumberElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: IrrelevantElement

The PageNumberElement class represents a page number within a document. It is a subclass of the IrrelevantElement class and is used to identify and handle page numbers in the document.
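Since EmptyElement, PageHeaderElement and PageNumberElement all derive from IrrelevantElement, a single isinstance check is enough to drop every non-content element. The sketch below mirrors that hierarchy with stand-in classes; the names and the sample data are hypothetical and the code does not import sec_parser.

```python
class ElementSketch:
    """Stand-in for a semantic element."""

    def __init__(self, text: str) -> None:
        self.text = text


class IrrelevantSketch(ElementSketch):
    pass


class EmptySketch(IrrelevantSketch):
    pass


class PageHeaderSketch(IrrelevantSketch):
    pass


class PageNumberSketch(IrrelevantSketch):
    pass


class TextSketch(ElementSketch):
    pass


elements = [
    PageHeaderSketch("Example Corp | Form 10-Q"),
    TextSketch("Net sales increased during the quarter."),
    EmptySketch(""),
    TextSketch("Gross margin was stable."),
    PageNumberSketch("12"),
]

# One check against the shared base class removes all three subtypes.
content = [e for e in elements if not isinstance(e, IrrelevantSketch)]
```

Filtering on the base class means new irrelevant subtypes are excluded automatically, without changing the filter.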

class sec_parser.semantic_elements.SupplementaryText(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The SupplementaryText class captures various types of supplementary text within a document, such as unit qualifiers, additional notes, and disclaimers.

For example:

- “(In millions, except number of shares which are reflected in thousands and per share amounts)”
- “See accompanying Notes to Condensed Consolidated Financial Statements.”
- “Disclaimer: This is not financial advice.”

class sec_parser.semantic_elements.TextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The TextElement class represents a standard text paragraph within a document.

class sec_parser.semantic_elements.TableElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The TableElement class represents a standard table within a document.

get_summary() → str

Return a human-readable summary of the semantic element.

This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

table_to_markdown() → str

class sec_parser.semantic_elements.TitleElement

Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement

The TitleElement class represents the title of a paragraph or other content object. It serves as a semantic marker, providing context and structure to the document.

class sec_parser.semantic_elements.TopSectionTitle(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None, level: int | None = None, section_type: sec_parser.semantic_elements.top_section_title_types.TopSectionType | None = None)

Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.top_section_start_marker.TopSectionStartMarker

The TopSectionTitle class represents the title and the beginning of a top-level section of a document. For instance, in SEC 10-Q reports, a top-level section could be “Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk”.

classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, level: int | None = None, section_type: sec_parser.semantic_elements.top_section_title_types.TopSectionType | None = None) → sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]