sec_parser.processing_engine.core

Classes

`AbstractHtmlTagParser`	Helper class that provides a standard way to create an ABC using
`HtmlTagParser`	The HtmlTagParser parses an HTML document using BeautifulSoup4.
`ParsingOptions`
`EmptyElementClassifier`	EmptyElementClassifier class for converting elements
`HighlightedTextClassifier`	HighlightedTextClassifier class for converting elements into HighlightedTextElement instances.
`ImageClassifier`	ImageClassifier class for converting elements into ImageElement instances.
`IndividualSemanticElementExtractor`	Responsible for splitting a single HTML element representing multiple semantic elements
`ImageCheck`
`TableCheck`	Helper class that provides a standard way to create an ABC using
`TopSectionTitleCheck`	Helper class that provides a standard way to create an ABC using
`XbrlTagCheck`
`IntroductorySectionElementClassifier`	The IntroductorySectionElementClassifier is a processing step designed
`PageHeaderClassifier`
`PageNumberClassifier`
`SupplementaryTextClassifier`	SupplementaryTextClassifier class for converting elements into
`TableClassifier`	TableClassifier class for converting elements into TableElement instances.
`TableOfContentsClassifier`	TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.
`TextClassifier`	TextClassifier class for converting elements into TextElement instances.
`TextElementMerger`	TextElementMerger is a processing step that merges adjacent text elements
`TitleClassifier`	TitleClassifier converts elements into TitleElement instances by scanning a list
`TopSectionManagerFor10Q`	Documents are divided into sections, subsections, and so on.
`CompositeSemanticElement`	CompositeSemanticElement acts as a container for other semantic elements,
`HighlightedTextElement`	The HighlightedTextElement class, among other uses,
`IrrelevantElement`	The IrrelevantElement class identifies elements in the parsed HTML that do not
`NotYetClassifiedElement`	The NotYetClassifiedElement class represents an element whose type
`TextElement`	The TextElement class represents a standard text paragraph within a document.
`TableElement`	The TableElement class represents a standard table within a document.
`AbstractSemanticElementParser`	Responsible for parsing semantic elements from HTML documents.
`Edgar10QParser`	The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q

Module Contents

class sec_parser.processing_engine.core.AbstractHtmlTagParser

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

abstract parse(html: str | bytes) → list[sec_parser.processing_engine.html_tag.HtmlTag]

class sec_parser.processing_engine.core.HtmlTagParser(parser_backend: str | None = None)

Bases: AbstractHtmlTagParser

The HtmlTagParser parses an HTML document using BeautifulSoup4. It then wraps the parsed bs4.Tag objects into HtmlTag objects.

parse(html: str | bytes) → list[sec_parser.processing_engine.html_tag.HtmlTag]

_parse_to_bs4(html: str | bytes) → bs4.Tag

class sec_parser.processing_engine.core.ParsingOptions

html_integrity_checks: bool = False

class sec_parser.processing_engine.core.EmptyElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

EmptyElementClassifier class for converting elements into EmptyElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with EmptyElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Transform a single semantic element into an EmptyElement if applicable.

class sec_parser.processing_engine.core.HighlightedTextClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

HighlightedTextClassifier class for converting elements into HighlightedTextElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with HighlightedTextElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_engine.core.ImageClassifier

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

ImageClassifier class for converting elements into ImageElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with ImageElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_engine.core.IndividualSemanticElementExtractor(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Responsible for splitting a single HTML element representing multiple semantic elements into multiple semantic element instances with a shared parent instance of type CompositeSemanticElement. This ensures structural integrity during parsing, which is crucial for accurately reconstructing the original HTML document and for semantic analysis where the relationship between elements can hold significant meaning.

_create_composite_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool

class sec_parser.processing_engine.core.ImageCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

class sec_parser.processing_engine.core.TableCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

Designed to work as a series of subsequent checks:

- Returning None means that the check is inconclusive, and the next check should be performed.
- Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.
- Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.
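The three-valued return protocol described above can be sketched as a chain that stops at the first conclusive answer. This is a minimal illustration, not the library's implementation: elements are plain dictionaries, the two check functions and their rules are hypothetical, and the fallback value used when every check is inconclusive is an assumption.

```python
from typing import Callable, Optional

# Hypothetical stand-in: a check takes an element (here a plain dict) and
# returns True, False, or None, mirroring contains_single_element.
Check = Callable[[dict], Optional[bool]]


def table_check(element: dict) -> Optional[bool]:
    # Assumption for illustration: a table is conclusively a single element.
    if element.get("tag") == "table":
        return True
    return None  # inconclusive, defer to the next check


def multiple_children_check(element: dict) -> Optional[bool]:
    # Assumption for illustration: more than one child means a split is needed.
    if len(element.get("children", [])) > 1:
        return False
    return None  # inconclusive


def contains_single_element(element: dict, checks: list) -> bool:
    """Run the checks in order and stop at the first conclusive answer."""
    for check in checks:
        result = check(element)
        if result is not None:
            return result
    # Assumed default when every check is inconclusive.
    return True


checks = [table_check, multiple_children_check]
print(contains_single_element({"tag": "table"}, checks))
print(contains_single_element({"tag": "div", "children": ["a", "b"]}, checks))
print(contains_single_element({"tag": "div"}, checks))
```

Because earlier checks win, the order of the list matters: a table with several children is still treated as a single element in this sketch.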

class sec_parser.processing_engine.core.TopSectionTitleCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

Designed to work as a series of subsequent checks:

- Returning None means that the check is inconclusive, and the next check should be performed.
- Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.
- Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.

class sec_parser.processing_engine.core.XbrlTagCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

class sec_parser.processing_engine.core.IntroductorySectionElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.

For example, consider an SEC EDGAR 10-Q report. This processing step will mark all elements that appear before the ‘part1’ section.

_NUM_ITERATIONS = 2

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_engine.core.PageHeaderClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_find_page_header_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_get_most_common_candidates() → dict[PageHeaderCandidate, int]

class sec_parser.processing_engine.core.PageNumberClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_find_page_number_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_get_most_common_candidate() → PageNumberCandidate | None

class sec_parser.processing_engine.core.SupplementaryTextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

SupplementaryTextClassifier class for converting elements into SupplementaryText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with SupplementaryText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Transform a single semantic element into a SupplementaryText instance if applicable.

class sec_parser.processing_engine.core.TableClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableClassifier class for converting elements into TableElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_engine.core.TableOfContentsClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableOfContentsElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_engine.core.TextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TextClassifier class for converting elements into TextElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TextElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Transform a single semantic element into a TextElement if applicable.

class sec_parser.processing_engine.core.TextElementMerger

Bases: sec_parser.processing_steps.abstract_classes.abstract_element_batch_processing_step.AbstractElementBatchProcessingStep

TextElementMerger is a processing step that merges adjacent text elements, for example TextElement() and TextElement(), into a single TextElement().

It is intended to fix formatting artifacts such as:

    <ix:nonnumeric contextref="c-1" name="us-gaap:PropertyPlantAndEquipmentTextBlock" id="f-989" escape="true">
        <span>Property and equipment, net, co</span>
        <span>nsisted of the following (in millions):</span>
    </ix:nonnumeric>

Notice how the text is split into two spans even though it is a single sentence. Source: https://www.sec.gov/Archives/edgar/data/1652044/000165204423000094/goog-20230930.htm

_process_elements(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], _: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

classmethod _merge(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
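The merging behaviour described for TextElementMerger can be sketched as a single pass that collapses each run of consecutive text elements. This is only an illustration under stated assumptions: the `Element` dataclass and `merge_adjacent_text` function are hypothetical stand-ins, and joining the text without a separator is an assumption chosen to rejoin the split word from the example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical stand-in for a semantic element.
    kind: str  # e.g. "text" or "table"
    text: str


def merge_adjacent_text(elements: list) -> list:
    """Collapse each run of consecutive "text" elements into one element."""
    merged = []
    for element in elements:
        previous = merged[-1] if merged else None
        if previous is not None and previous.kind == "text" and element.kind == "text":
            # Join without a separator so a word split across spans is rejoined.
            merged[-1] = Element("text", previous.text + element.text)
        else:
            merged.append(element)
    return merged


parts = [
    Element("text", "Property and equipment, net, co"),
    Element("text", "nsisted of the following (in millions):"),
    Element("table", "..."),
]
result = merge_adjacent_text(parts)
print(result[0].text)
```

Non-text elements act as boundaries, so text before and after a table stays in separate elements.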

class sec_parser.processing_engine.core.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

The “_unique_styles_by_order” tuple:

Represents an ordered set of unique styles found in the document.
Preserves the order of insertion, which determines the hierarchical level of each style.
Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.

_add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) → None: Add a new unique style if not already present.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Process each element and convert to TitleElement if necessary.
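The role of the "_unique_styles_by_order" tuple can be illustrated with a small sketch in which the position of a style in the tuple serves as its heading level. The `TextStyle` fields, the `StyleLevels` class, and the `level_of` helper are hypothetical and simplified; only the idea of an insertion-ordered tuple of unique styles comes from the description above.

```python
from typing import NamedTuple


class TextStyle(NamedTuple):
    # Hypothetical, simplified stand-in for the library's TextStyle.
    bold: bool
    italic: bool


class StyleLevels:
    def __init__(self) -> None:
        # Ordered set of unique styles, stored as an immutable tuple.
        self._unique_styles_by_order = ()

    def add_unique_style(self, style: TextStyle) -> None:
        """Append the style only if it has not been seen before."""
        if style not in self._unique_styles_by_order:
            self._unique_styles_by_order = (*self._unique_styles_by_order, style)

    def level_of(self, style: TextStyle) -> int:
        """Return the hierarchy level: 0 for the first style encountered."""
        return self._unique_styles_by_order.index(style)


levels = StyleLevels()
for style in [TextStyle(True, False), TextStyle(False, True), TextStyle(True, False)]:
    levels.add_unique_style(style)

print(levels.level_of(TextStyle(True, False)))
print(levels.level_of(TextStyle(False, True)))
```

The repeated bold style is ignored on its second appearance, so it keeps the level it was given when first seen.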

class sec_parser.processing_engine.core.TopSectionManagerFor10Q(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Documents are divided into sections, subsections, and so on. Top level sections are the highest level of sections and are standardized across each type of document.

An example of a Top Level Section in a 10-Q report is “Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk.”

_NUM_ITERATIONS = 2

classmethod is_match_part_or_item(text: str) → bool

static match_part(text: str) → str | None

static match_item(text: str) → str | None
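To give a feel for what matching a part or an item in a heading might involve, here is a sketch built on regular expressions. The patterns and the returned identifier formats ("1", "2", "1a") are assumptions made for illustration; the library's actual expressions and return values may differ.

```python
import re
from typing import Optional

# Hypothetical patterns; the library's actual regular expressions may differ.
_PART_PATTERN = re.compile(r"^\s*part\s+(i{1,2})\b", re.IGNORECASE)
_ITEM_PATTERN = re.compile(r"^\s*item\s+(\d+a?)\b", re.IGNORECASE)


def match_part(text: str) -> Optional[str]:
    """Return the part as "1" or "2", or None when the text names no part."""
    match = _PART_PATTERN.match(text)
    if match is None:
        return None
    # Roman numeral "I" has length 1 and "II" has length 2.
    return str(len(match.group(1)))


def match_item(text: str) -> Optional[str]:
    """Return the item identifier in lowercase (e.g. "1a"), or None."""
    match = _ITEM_PATTERN.match(text)
    return match.group(1).lower() if match else None


def is_match_part_or_item(text: str) -> bool:
    return match_part(text) is not None or match_item(text) is not None


print(match_part("Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk."))
print(match_item("Item 1A. Risk Factors"))
print(is_match_part_or_item("Revenue increased during the quarter."))
```

Anchoring both patterns at the start of the text keeps ordinary sentences that merely mention a part or an item from being treated as section titles.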

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_iteration_0(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_process_iteration_1(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_identify_candidate(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_get_section_type(identifier: str) → sec_parser.semantic_elements.top_section_title_types.TopSectionType

_select_candidates() → tuple[_Candidate, ...]

_process_selected_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_update_last_order_number(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) → None

_log_order_number_not_greater(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) → None

_create_top_section_title(candidate: _Candidate) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_engine.core.CompositeSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, ...] | None, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

CompositeSemanticElement acts as a container for other semantic elements, especially for cases where a single HTML root tag wraps multiple elements. This ensures structural integrity and enables various features like semantic segmentation visualization, and debugging by comparison with the original document.

Why is this useful:

1. Some semantic elements, like XBRL tags (<ix>), may wrap multiple semantic elements. The container ensures that these relationships are not broken during parsing.
2. Enables the parser to fully reconstruct the original HTML document, which opens up possibilities for features like semantic segmentation visualization (e.g. recreate the original document but put semi-transparent colored boxes on top, based on semantic meaning), serialization of parsed documents into an augmented HTML, and debugging by comparing to the original document.

property inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, ...]

classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, inner_elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement] | None = None) → CompositeSemanticElement: Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

classmethod unwrap_elements(elements: collections.abc.Iterable[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, include_containers: bool | None = None) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]: Recursively flatten a list of AbstractSemanticElement objects. For each CompositeSemanticElement encountered, its inner_elements are also recursively flattened. The ‘include_containers’ parameter controls whether the CompositeSemanticElement itself is included in the flattened list.
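The recursive flattening performed by unwrap_elements can be sketched as follows. The `Leaf` and `Composite` classes are hypothetical stand-ins for semantic elements, and defaulting `include_containers` to False is an assumption (the documented signature defaults to None).

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    # Hypothetical stand-in for any non-composite semantic element.
    name: str


@dataclass
class Composite:
    # Hypothetical stand-in for CompositeSemanticElement.
    name: str
    inner_elements: list = field(default_factory=list)


def unwrap_elements(elements, *, include_containers: bool = False) -> list:
    """Recursively flatten elements, optionally keeping the containers."""
    flattened = []
    for element in elements:
        if isinstance(element, Composite):
            if include_containers:
                # The container precedes its own inner elements.
                flattened.append(element)
            flattened.extend(
                unwrap_elements(
                    element.inner_elements,
                    include_containers=include_containers,
                )
            )
        else:
            flattened.append(element)
    return flattened


tree = [Leaf("a"), Composite("c", [Leaf("b"), Composite("d", [Leaf("e")])])]
print([e.name for e in unwrap_elements(tree)])
print([e.name for e in unwrap_elements(tree, include_containers=True)])
```

Document order is preserved in both modes; the flag only decides whether the wrapping containers appear in the result.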

class sec_parser.processing_engine.core.HighlightedTextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, style: TextStyle | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The HighlightedTextElement class, among other uses, is an intermediate step in identifying title elements.

For example:

First, elements with specific styles (like bold or italic text) are classified as HighlightedTextElements. These are later examined to determine if they should be considered TitleElements.

classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, style: TextStyle | None = None) → HighlightedTextElement: Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

class sec_parser.processing_engine.core.IrrelevantElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content. These elements often include page separators, page numbers, and other non-content items. For instance, HTML tags without content, such as <div></div>, are deemed irrelevant; they are often used in documents just to add vertical space.

class sec_parser.processing_engine.core.NotYetClassifiedElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The NotYetClassifiedElement class represents an element whose type has not yet been determined. The parsing process aims to classify all instances of this class into more specific subclasses of AbstractSemanticElement.

class sec_parser.processing_engine.core.TextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The TextElement class represents a standard text paragraph within a document.

class sec_parser.processing_engine.core.TableElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The TableElement class represents a standard table within a document.

get_summary() → str

Return a human-readable summary of the semantic element.

This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]

table_to_markdown() → str

class sec_parser.processing_engine.core.AbstractSemanticElementParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

Bases: abc.ABC

Responsible for parsing semantic elements from HTML documents. It takes raw HTML and turns it into a list of objects representing semantic elements.

At a High Level:

Extract top-level HTML tags from the document.
Transform these tags into a list of more specific semantic elements step-by-step.

Why Focus on Top-Level Tags?

SEC filings usually have a flat HTML structure, which simplifies the parsing process. Each top-level HTML tag often directly corresponds to a single semantic element. This is different from many websites where HTML tags are nested deeply, requiring more complex parsing.

For Advanced Users:

The parsing process is implemented as a sequence of steps and allows for customization at each step.

Pipeline Pattern: Raw HTML tags are processed in a sequential manner. The steps follow an ordered, step-by-step approach, akin to a Finite State Machine (FSM). Each element transitions through various states defined by the sequence of processing steps.
Strategy Pattern: Each step is customizable. You can either replace, remove, or extend any of the existing steps with your own or inherited implementation. Alternatively, you can replace the entire pipeline with your own process.
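The two patterns described above can be sketched together in a few lines. This is a simplified illustration, not the library's code: elements are plain dictionaries, steps are plain functions rather than processing-step classes, and every name below except the idea of a `get_steps` callable is hypothetical.

```python
from typing import Callable, Optional


def classify_tables(elements: list) -> list:
    """Label still-unclassified <table> tags as tables."""
    return [
        {**e, "type": "table"} if e["type"] is None and e["tag"] == "table" else e
        for e in elements
    ]


def classify_text(elements: list) -> list:
    """Label everything still unclassified as text."""
    return [{**e, "type": "text"} if e["type"] is None else e for e in elements]


def get_default_steps() -> list:
    return [classify_tables, classify_text]


def parse(tags: list, get_steps: Optional[Callable[[], list]] = None) -> list:
    """Pipeline: each element starts unclassified and passes through every step in order."""
    steps = (get_steps or get_default_steps)()
    elements = [{"tag": tag, "type": None} for tag in tags]
    for step in steps:
        elements = step(elements)
    return elements


def get_steps_without_tables() -> list:
    """Strategy: reuse the default steps but remove one of them."""
    return [step for step in get_default_steps() if step is not classify_tables]


default_result = parse(["table", "p"])
custom_result = parse(["table", "p"], get_steps=get_steps_without_tables)
print([e["type"] for e in default_result])
print([e["type"] for e in custom_result])
```

Passing a callable instead of a ready-made list means each parse run builds fresh steps, which matters if steps hold state between elements.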

abstract get_default_steps() → list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]

parse(html: str | bytes, *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

parse_from_tags(root_tags: list[sec_parser.processing_engine.html_tag.HtmlTag], *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

class sec_parser.processing_engine.core.Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

Bases: AbstractSemanticElementParser

The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports. It transforms the HTML documents into a list of elements. Each element in this list represents a part of the visual structure of the original document.

get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) → list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]

get_default_single_element_checks() → list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]