sec_parser.processing_steps

The processing_steps subpackage provides a collection of steps designed to work with parser engines from the parsing_engine subpackage. These steps carry out specific tasks such as section identification, title parsing, and text extraction.

Classes

AbstractElementwiseProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all semantic elements, with or without applying transformations.

AbstractProcessingStep

AbstractProcessingStep class for transforming a list of elements.

EmptyElementClassifier

EmptyElementClassifier class for converting elements into EmptyElement instances.

HighlightedTextClassifier

HighlightedTextClassifier class for converting elements into HighlightedText instances.

ImageClassifier

ImageClassifier class for converting elements into ImageElement instances.

IndividualSemanticElementExtractor

Responsible for splitting a single HTML element representing multiple semantic elements into multiple semantic element instances.

ImageCheck

TableCheck

Helper class that provides a standard way to create an ABC using inheritance.

TopSectionTitleCheck

Helper class that provides a standard way to create an ABC using inheritance.

XbrlTagCheck

IntroductorySectionElementClassifier

The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.

PageHeaderClassifier

PageNumberClassifier

SupplementaryTextClassifier

SupplementaryTextClassifier class for converting elements into SupplementaryText instances.

TableClassifier

TableClassifier class for converting elements into TableElement instances.

TableOfContentsClassifier

TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.

TextClassifier

TextClassifier class for converting elements into TextElement instances.

TextElementMerger

TextElementMerger is a processing step that merges adjacent text elements.

TitleClassifier

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

TopSectionManagerFor10Q

Documents are divided into sections, subsections, and so on.

Package Contents

class sec_parser.processing_steps.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all semantic elements, with or without applying transformations.

_NUM_ITERATIONS = 1
abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
_process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
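The following is a minimal, self-contained sketch of how an elementwise step with type filters and multiple passes might behave. It does not import sec_parser: Element, TitleLike, SketchContext, SketchElementwiseStep, and ShoutStep are hypothetical stand-ins, and the filtering rules are an assumption inferred from the types_to_process, types_to_exclude, and _NUM_ITERATIONS names above.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    text: str


@dataclass
class TitleLike(Element):
    """Hypothetical subclass used to demonstrate type filtering."""


@dataclass
class SketchContext:
    """Hypothetical context object carrying the current pass number."""

    iteration: int


class SketchElementwiseStep:
    """Illustrative base class; not the library's implementation."""

    _NUM_ITERATIONS = 1

    def __init__(self, *, types_to_process=None, types_to_exclude=None):
        self._types_to_process = types_to_process or set()
        self._types_to_exclude = types_to_exclude or set()

    def _should_process(self, element):
        # Assumption: an empty filter set means "no restriction".
        if self._types_to_process and not isinstance(
            element, tuple(self._types_to_process)
        ):
            return False
        if self._types_to_exclude and isinstance(
            element, tuple(self._types_to_exclude)
        ):
            return False
        return True

    def _process_element(self, element, context):
        raise NotImplementedError

    def process(self, elements):
        # Run the per-element hook once per pass over the whole list.
        for iteration in range(self._NUM_ITERATIONS):
            context = SketchContext(iteration=iteration)
            elements = [
                self._process_element(e, context) if self._should_process(e) else e
                for e in elements
            ]
        return elements


class ShoutStep(SketchElementwiseStep):
    def _process_element(self, element, context):
        # Return a new element of the same type with upper-cased text.
        return type(element)(element.text.upper())


step = ShoutStep(types_to_exclude={TitleLike})
out = step.process([Element("body"), TitleLike("Heading")])
```

In this sketch the excluded TitleLike element passes through unchanged, while the plain Element is transformed.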
class sec_parser.processing_steps.AbstractProcessingStep

Bases: abc.ABC

AbstractProcessingStep class for transforming a list of elements. Chaining multiple steps together allows for complex transformations while keeping the code modular.

Each instance of a step is designed to be used for a single transformation operation. This ensures that any internal state maintained during a transformation is isolated to the processing of a single document.

process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

Transform the list of semantic elements.

Note: The elements argument may be mutated in place for performance reasons.

abstract _process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

Implement the actual transformation logic in child classes.

This method is intended to be overridden by child classes to provide specific transformation logic.
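As a rough illustration of the process/_process split and of chaining single-use steps, here is a self-contained sketch. SketchProcessingStep, AlreadyProcessedError, UppercaseStep, DropEmptyStep, and run_pipeline are hypothetical names, plain strings stand in for semantic elements, and raising on reuse is an assumption about how "single transformation operation" could be enforced.

```python
from abc import ABC, abstractmethod


class AlreadyProcessedError(RuntimeError):
    """Hypothetical error raised when a step instance is reused."""


class SketchProcessingStep(ABC):
    """Stand-in for a processing step; not the library's actual class."""

    def __init__(self) -> None:
        self._already_processed = False

    def process(self, elements: list[str]) -> list[str]:
        # Guard against reuse so per-document state cannot leak between documents.
        if self._already_processed:
            raise AlreadyProcessedError("Create a new step instance per document.")
        self._already_processed = True
        return self._process(elements)

    @abstractmethod
    def _process(self, elements: list[str]) -> list[str]:
        """Child classes put the actual transformation here."""


class UppercaseStep(SketchProcessingStep):
    def _process(self, elements: list[str]) -> list[str]:
        return [e.upper() for e in elements]


class DropEmptyStep(SketchProcessingStep):
    def _process(self, elements: list[str]) -> list[str]:
        return [e for e in elements if e.strip()]


def run_pipeline(steps: list[SketchProcessingStep], elements: list[str]) -> list[str]:
    # Chain the steps: the output of one step is the input of the next.
    for step in steps:
        elements = step.process(elements)
    return elements


result = run_pipeline([DropEmptyStep(), UppercaseStep()], ["item 1", "  ", "notes"])
```

Keeping the public process method in the base class and the logic in _process lets every step share the same reuse guard without repeating it.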

class sec_parser.processing_steps.EmptyElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

EmptyElementClassifier class for converting elements into EmptyElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with EmptyElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Transform a single semantic element into an EmptyElement if applicable.

class sec_parser.processing_steps.HighlightedTextClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

HighlightedTextClassifier class for converting elements into HighlightedText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with HighlightedText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.ImageClassifier

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

ImageClassifier class for converting elements into ImageElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with ImageElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.IndividualSemanticElementExtractor(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Responsible for splitting a single HTML element representing multiple semantic elements into multiple semantic element instances with a shared parent instance of type CompositeSemanticElement. This ensures structural integrity during parsing, which is crucial for accurately reconstructing the original HTML document and for semantic analysis where the relationship between elements can hold significant meaning.

_create_composite_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool
class sec_parser.processing_steps.ImageCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None
class sec_parser.processing_steps.TableCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None

Designed to work as a series of consecutive checks:

  • Returning None means that the check is inconclusive, and the next check should be performed.

  • Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.

  • Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.
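The tri-state protocol can be sketched as a small chain of functions. This example works on raw HTML strings instead of semantic elements; table_check, multiple_paragraphs_check, and the default returned when every check is inconclusive are all hypothetical choices made for illustration.

```python
from typing import Callable, Optional

# Hypothetical check signature: None = inconclusive, True = keep whole, False = split.
Check = Callable[[str], Optional[bool]]


def table_check(html: str) -> Optional[bool]:
    # Conclusive only when a table is present: treat it as one unit.
    return True if "<table" in html else None


def multiple_paragraphs_check(html: str) -> Optional[bool]:
    # Conclusive only when there are several paragraphs: request a split.
    return False if html.count("<p>") > 1 else None


def contains_single_element(
    html: str, checks: list[Check], default: bool = True
) -> bool:
    # The first conclusive verdict wins; later checks are not consulted.
    for check in checks:
        verdict = check(html)
        if verdict is not None:
            return verdict
    # Assumption: if every check is inconclusive, keep the element whole.
    return default


checks = [table_check, multiple_paragraphs_check]
keep_table = contains_single_element("<div><table></table><p>a</p><p>b</p></div>", checks)
split_paragraphs = contains_single_element("<div><p>a</p><p>b</p></div>", checks)
```

Because the first conclusive check wins, the order of the list matters: here the table check runs before the paragraph check, so a container holding a table is kept whole even though it also holds two paragraphs.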

class sec_parser.processing_steps.TopSectionTitleCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None

Designed to work as a series of consecutive checks:

  • Returning None means that the check is inconclusive, and the next check should be performed.

  • Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.

  • Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.

class sec_parser.processing_steps.XbrlTagCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None
class sec_parser.processing_steps.IntroductorySectionElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.

For example, in an SEC EDGAR 10-Q report, this processing step marks all elements that appear before the ‘part1’ section.

_NUM_ITERATIONS = 2
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
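One way to picture "marking everything before part1" is a two-pass scan, which would be consistent with _NUM_ITERATIONS = 2. The sketch below is an assumption, not the library's code: mark_introductory is a hypothetical helper, each element is reduced to an optional section identifier, and marking nothing when no boundary exists is an illustrative choice.

```python
from typing import Optional


def mark_introductory(
    section_ids: list[Optional[str]], first_section: str = "part1"
) -> list[bool]:
    # Pass 1: locate the first element that opens the main content.
    try:
        boundary = section_ids.index(first_section)
    except ValueError:
        # Assumption: without a boundary, nothing is marked as introductory.
        return [False] * len(section_ids)
    # Pass 2: every element before the boundary is introductory.
    return [index < boundary for index in range(len(section_ids))]


flags = mark_introductory([None, None, "part1", None, "part1item1"])
```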
class sec_parser.processing_steps.PageHeaderClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_find_page_header_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) None
_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_get_most_common_candidates() dict[PageHeaderCandidate, int]
class sec_parser.processing_steps.PageNumberClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_find_page_number_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) None
_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_get_most_common_candidate() PageNumberCandidate | None
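The method names above (_find_page_number_candidates, _get_most_common_candidate, _classify_elements) together with _NUM_ITERATIONS = 2 suggest a collect-then-classify design. The sketch below illustrates that idea only; the candidate rule, the "#" template, and the min_repeats threshold are hypothetical and not taken from the library.

```python
import re
from collections import Counter
from typing import Optional


def page_number_template(text: str) -> Optional[str]:
    """Hypothetical candidate rule: short text holding exactly one number."""
    stripped = text.strip()
    if len(stripped) > 20 or len(re.findall(r"\d+", stripped)) != 1:
        return None
    # Replace the number with a placeholder so "Page 1" and "Page 2" compare equal.
    return re.sub(r"\d+", "#", stripped)


def classify_page_numbers(texts: list[str], min_repeats: int = 2) -> list[bool]:
    # Pass 1: collect candidate templates across the whole document.
    counts = Counter(t for t in map(page_number_template, texts) if t is not None)
    if not counts:
        return [False] * len(texts)
    template, repeats = counts.most_common(1)[0]
    if repeats < min_repeats:
        return [False] * len(texts)
    # Pass 2: flag every element that matches the most common template.
    return [page_number_template(text) == template for text in texts]


sample = [
    "Revenue grew 5% in 2023 compared with the prior year.",
    "Page 1",
    "Some text",
    "Page 2",
    "12",
]
page_flags = classify_page_numbers(sample)
```

Two passes are needed because the most common template is only known after the whole document has been seen.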
class sec_parser.processing_steps.SupplementaryTextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

SupplementaryTextClassifier class for converting elements into SupplementaryText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with SupplementaryText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Transform a single semantic element into a SupplementaryText if applicable.

class sec_parser.processing_steps.TableClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableClassifier class for converting elements into TableElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.TableOfContentsClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableOfContentsElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.TextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TextClassifier class for converting elements into TextElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TextElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Transform a single semantic element into a TextElement if applicable.

class sec_parser.processing_steps.TextElementMerger

Bases: sec_parser.processing_steps.abstract_classes.abstract_element_batch_processing_step.AbstractElementBatchProcessingStep

TextElementMerger is a processing step that merges adjacent text elements. For example, TextElement(<span></span>) and TextElement(<span></span>) are merged into a single TextElement(<span></span><span></span>).

Intended to fix formatting artifacts such as:
<ix:nonnumeric contextref="c-1" name="us-gaap:PropertyPlantAndEquipmentTextBlock" id="f-989" escape="true">

<span style="background-color:#ffffff;color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:120%">Property and equipment, net, co</span> <span style="color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:120%">nsisted of the following (in millions):</span>

</ix:nonnumeric>

Notice how the text is split into two spans even though it is a single sentence. Source: https://www.sec.gov/Archives/edgar/data/1652044/000165204423000094/goog-20230930.htm

_process_elements(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], _: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
classmethod _merge(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
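The merging behavior described above can be sketched without the library. Node and merge_adjacent_text are hypothetical stand-ins: each node carries a kind label and its HTML, and merging is modeled as plain string concatenation, which is a simplification of whatever the real _merge does with HTML tags.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a semantic element wrapping some HTML."""

    kind: str
    html: str


def merge_adjacent_text(elements: list[Node]) -> list[Node]:
    merged: list[Node] = []
    for element in elements:
        previous = merged[-1] if merged else None
        if previous is not None and previous.kind == "text" and element.kind == "text":
            # Fold this text node into the previous one instead of appending it.
            merged[-1] = Node("text", previous.html + element.html)
        else:
            merged.append(element)
    return merged


nodes = [
    Node("text", "<span>Property and equipment, net, co</span>"),
    Node("text", "<span>nsisted of the following (in millions):</span>"),
    Node("table", "<table></table>"),
    Node("text", "<span>Total</span>"),
]
merged_nodes = merge_adjacent_text(nodes)
```

Only runs of neighboring text nodes are merged; the table in the middle keeps the trailing text node separate.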
class sec_parser.processing_steps.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

The “_unique_styles_by_order” tuple:

  • Represents an ordered set of unique styles found in the document.

  • Preserves the order of insertion, which determines the hierarchical level of each style.

  • Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.

_add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) None

Add a new unique style if not already present.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Process each element and convert to TitleElement if necessary.
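The ordered-style bookkeeping can be illustrated with a small sketch in which the position of a style in the tuple acts as its heading level. Style and StyleLevels are hypothetical stand-ins; the real TextStyle class and the way levels are attached to TitleElement instances may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Style:
    """Hypothetical stand-in for a text style; frozen so it can be compared and hashed."""

    bold: bool = False
    italic: bool = False


class StyleLevels:
    def __init__(self) -> None:
        # Ordered, duplicate-free record of styles in order of first appearance.
        self._unique_styles_by_order: tuple[Style, ...] = ()

    def add_unique_style(self, style: Style) -> None:
        if style not in self._unique_styles_by_order:
            self._unique_styles_by_order = (*self._unique_styles_by_order, style)

    def level_of(self, style: Style) -> int:
        # Earlier styles map to lower numbers, meaning higher-level headings.
        return self._unique_styles_by_order.index(style)


levels = StyleLevels()
for seen_style in [Style(bold=True), Style(italic=True), Style(bold=True)]:
    levels.add_unique_style(seen_style)
bold_level = levels.level_of(Style(bold=True))
italic_level = levels.level_of(Style(italic=True))
```

A tuple is used here because it preserves insertion order and, being immutable, cannot be reordered by accident once a level has been assigned.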

class sec_parser.processing_steps.TopSectionManagerFor10Q(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Documents are divided into sections, subsections, and so on. Top level sections are the highest level of sections and are standardized across each type of document.

An example of a Top Level Section in a 10-Q report is “Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk”.

_NUM_ITERATIONS = 2
classmethod is_match_part_or_item(text: str) bool
static match_part(text: str) str | None
static match_item(text: str) str | None
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_process_iteration_0(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) None
_process_iteration_1(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_identify_candidate(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) None
_get_section_type(identifier: str) sec_parser.semantic_elements.top_section_title_types.TopSectionType
_select_candidates() tuple[_Candidate, ...]
_process_selected_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_update_last_order_number(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) None
_log_order_number_not_greater(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) None
_create_top_section_title(candidate: _Candidate) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
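To give a feel for what match_part, match_item, and is_match_part_or_item could look like, here is a regex-based sketch. The patterns and the returned identifiers ("1", "2", "1a", and so on) are hypothetical; the library's actual expressions and return values may differ. The sketch only recognizes Part I and Part II, on the assumption that a 10-Q has no further parts.

```python
import re
from typing import Optional

# Hypothetical patterns, anchored at the start of the text.
_PART_PATTERN = re.compile(r"^\s*part\s+(i{1,2})\b", re.IGNORECASE)
_ITEM_PATTERN = re.compile(r"^\s*item\s+(\d+a?)\b", re.IGNORECASE)


def match_part(text: str) -> Optional[str]:
    match = _PART_PATTERN.match(text)
    if match is None:
        return None
    # Convert the Roman numeral by length: "I" gives "1", "II" gives "2".
    return str(len(match.group(1)))


def match_item(text: str) -> Optional[str]:
    match = _ITEM_PATTERN.match(text)
    return match.group(1).lower() if match else None


def is_match_part_or_item(text: str) -> bool:
    return match_part(text) is not None or match_item(text) is not None


part_id = match_part("PART II - OTHER INFORMATION")
item_id = match_item("Item 1A. Risk Factors")
```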