sec_parser.processing_steps

The processing_steps subpackage provides a collection of steps designed to work with parser engines from the parsing_engine subpackage. These steps carry out specific tasks such as section identification, title parsing, and text extraction.

Classes

`AbstractElementwiseProcessingStep`	AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.
`AbstractProcessingStep`	AbstractProcessingStep class for transforming a list of elements.
`EmptyElementClassifier`	EmptyElementClassifier class for converting elements into EmptyElement instances.
`HighlightedTextClassifier`	HighlightedTextClassifier class for converting elements into HighlightedText instances.
`ImageClassifier`	ImageClassifier class for converting elements into ImageElement instances.
`IndividualSemanticElementExtractor`	Responsible for splitting a single HTML element representing multiple semantic elements into multiple Semantic Element instances.
`ImageCheck`
`TableCheck`	Helper class that provides a standard way to create an ABC using inheritance.
`TopSectionTitleCheck`	Helper class that provides a standard way to create an ABC using inheritance.
`XbrlTagCheck`
`IntroductorySectionElementClassifier`	The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.
`PageHeaderClassifier`
`PageNumberClassifier`
`SupplementaryTextClassifier`	SupplementaryTextClassifier class for converting elements into SupplementaryText instances.
`TableClassifier`	TableClassifier class for converting elements into TableElement instances.
`TableOfContentsClassifier`	TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.
`TextClassifier`	TextClassifier class for converting elements into TextElement instances.
`TextElementMerger`	TextElementMerger is a processing step that merges adjacent text elements.
`TitleClassifier`	TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.
`TopSectionManagerFor10Q`	Documents are divided into sections, subsections, and so on.

Package Contents

class sec_parser.processing_steps.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.

_NUM_ITERATIONS = 1

abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

_process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
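The element-by-element pattern above can be sketched without importing sec_parser. Everything in this block is a hypothetical stand-in: `Element`, `Title`, `Table`, `ElementwiseStep`, and `StripStep` are illustrative names, the type-filter semantics of `types_to_process` and `types_to_exclude` are assumed rather than taken from the library, and the processing context is reduced to a bare iteration index.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Element:
    """Hypothetical stand-in for a semantic element."""

    def __init__(self, text: str) -> None:
        self.text = text


class Title(Element):
    pass


class Table(Element):
    pass


class ElementwiseStep(ABC):
    """Stand-in that visits every element _NUM_ITERATIONS times."""

    _NUM_ITERATIONS = 1

    def __init__(self, *, types_to_process=None, types_to_exclude=None):
        # Assumption: empty or None means "no restriction".
        self._types_to_process = types_to_process or set()
        self._types_to_exclude = types_to_exclude or set()

    def _should_process(self, element: Element) -> bool:
        if any(isinstance(element, t) for t in self._types_to_exclude):
            return False
        if self._types_to_process:
            return any(isinstance(element, t) for t in self._types_to_process)
        return True

    def process(self, elements: list[Element]) -> list[Element]:
        for iteration in range(self._NUM_ITERATIONS):
            # Filtered-out elements pass through unchanged.
            elements = [
                self._process_element(e, iteration) if self._should_process(e) else e
                for e in elements
            ]
        return elements

    @abstractmethod
    def _process_element(self, element: Element, iteration: int) -> Element:
        ...


class StripStep(ElementwiseStep):
    def _process_element(self, element: Element, iteration: int) -> Element:
        element.text = element.text.strip()
        return element


processed = StripStep(types_to_exclude={Table}).process(
    [Title("  Part I "), Table("  1 | 2  ")]
)
print([e.text for e in processed])
```

Steps that need to gather document-wide information before deciding anything would raise `_NUM_ITERATIONS` and branch on the iteration index inside `_process_element`.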

class sec_parser.processing_steps.AbstractProcessingStep

Bases: abc.ABC

AbstractProcessingStep class for transforming a list of elements. Chaining multiple steps together allows for complex transformations while keeping the code modular.

Each instance of a step is designed to be used for a single transformation operation. This ensures that any internal state maintained during a transformation is isolated to the processing of a single document.

process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

Transform the list of semantic elements.

Note: The elements argument could potentially be mutated for performance reasons.

abstract _process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

Implement the actual transformation logic in child classes.

This method is intended to be overridden by child classes to provide specific transformation logic.
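A minimal sketch of the chaining idea described for AbstractProcessingStep. It does not use sec_parser itself: `Element`, `Step`, `DropEmptyStep`, `UppercaseStep`, and `run_pipeline` are hypothetical names, and `Step` mirrors only the public `process` / abstract `_process` split documented above.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    text: str


class Step(ABC):
    """Stand-in base class: public process() delegates to _process()."""

    def process(self, elements: list[Element]) -> list[Element]:
        return self._process(elements)

    @abstractmethod
    def _process(self, elements: list[Element]) -> list[Element]:
        ...


class DropEmptyStep(Step):
    def _process(self, elements: list[Element]) -> list[Element]:
        # Remove elements whose text is only whitespace.
        return [e for e in elements if e.text.strip()]


class UppercaseStep(Step):
    def _process(self, elements: list[Element]) -> list[Element]:
        return [Element(e.text.upper()) for e in elements]


def run_pipeline(steps: list[Step], elements: list[Element]) -> list[Element]:
    # The output of each step becomes the input of the next one.
    for step in steps:
        elements = step.process(elements)
    return elements


# Fresh step instances are created for each document, in line with the
# single-use design described above.
result = run_pipeline(
    [DropEmptyStep(), UppercaseStep()],
    [Element("Item 1"), Element("   "), Element("Risk factors")],
)
print([e.text for e in result])
```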

class sec_parser.processing_steps.EmptyElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

EmptyElementClassifier class for converting elements into EmptyElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with EmptyElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Transform a single semantic element into an EmptyElement if applicable.

class sec_parser.processing_steps.HighlightedTextClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

HighlightedTextClassifier class for converting elements into HighlightedText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with HighlightedText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_steps.ImageClassifier

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

ImageClassifier class for converting elements into ImageElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with ImageElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_steps.IndividualSemanticElementExtractor(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Responsible for splitting a single HTML element representing multiple semantic elements into multiple Semantic Element instances with a shared parent instance of type CompositeSemanticElement. This ensures structural integrity during parsing, which is crucial for accurately reconstructing the original HTML document and for semantic analysis where the relationship between elements can hold significant meaning.

_create_composite_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool

class sec_parser.processing_steps.ImageCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

class sec_parser.processing_steps.TableCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

Designed to work as a series of subsequent checks:

- Returning None means that the check is inconclusive, and the next check should be performed.
- Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.
- Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.
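The three-valued return convention can be illustrated with a small standalone chain. This is only a sketch: the checks operate on raw HTML strings rather than semantic elements, `table_check`, `multiple_paragraph_check`, and `run_checks` are hypothetical names, and the fallback used when every check is inconclusive is an assumption.

```python
from __future__ import annotations

from typing import Callable, Optional

Check = Callable[[str], Optional[bool]]


def table_check(html: str) -> Optional[bool]:
    # Hypothetical rule: markup containing a table is kept whole.
    return True if "<table" in html else None


def multiple_paragraph_check(html: str) -> Optional[bool]:
    # Hypothetical rule: more than one paragraph means "split it".
    return False if html.count("<p>") > 1 else None


def run_checks(html: str, checks: list[Check], default: bool = True) -> bool:
    for check in checks:
        verdict = check(html)
        if verdict is not None:
            # The first conclusive answer wins; later checks are skipped.
            return verdict
    # Assumed fallback when every check returns None.
    return default


checks = [table_check, multiple_paragraph_check]
print(run_checks("<div><table><tr><td><p>a</p><p>b</p></td></tr></table></div>", checks))
print(run_checks("<div><p>a</p><p>b</p></div>", checks))
print(run_checks("<div><p>a</p></div>", checks))
```

Because the first conclusive verdict ends the chain, the order of the checks matters: in the first call the table rule returns True before the paragraph rule is consulted.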

class sec_parser.processing_steps.TopSectionTitleCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

Designed to work as a series of subsequent checks:

- Returning None means that the check is inconclusive, and the next check should be performed.
- Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.
- Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.

class sec_parser.processing_steps.XbrlTagCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → bool | None

class sec_parser.processing_steps.IntroductorySectionElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.

For example, consider an SEC EDGAR 10-Q report. This processing step will mark all elements that appear before the ‘part1’ section.

_NUM_ITERATIONS = 2

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
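One way to picture why IntroductorySectionElementClassifier needs two iterations: the position of the first top-level section has to be known before earlier elements can be marked. The sketch below is an assumption-laden illustration, not the library's code; `Element`, its `kind` labels, and `classify_introductory` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in: kind is a plain label instead of a class."""

    kind: str
    text: str


def classify_introductory(elements: list[Element]) -> list[Element]:
    # First pass: find where the first top-level section title appears.
    first_section = next(
        (i for i, e in enumerate(elements) if e.kind == "top_section_title"),
        None,
    )
    if first_section is None:
        # No section title found, so nothing can be called introductory.
        return list(elements)
    # Second pass: relabel everything located before that title.
    return [
        Element("introductory", e.text) if i < first_section else e
        for i, e in enumerate(elements)
    ]


doc = [
    Element("text", "FORM 10-Q"),
    Element("text", "Commission file number"),
    Element("top_section_title", "Part I"),
    Element("text", "Item 1. Financial Statements"),
]
print([e.kind for e in classify_introductory(doc)])
```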

class sec_parser.processing_steps.PageHeaderClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_find_page_header_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_get_most_common_candidates() → dict[PageHeaderCandidate, int]

class sec_parser.processing_steps.PageNumberClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_find_page_number_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_get_most_common_candidate() → PageNumberCandidate | None

class sec_parser.processing_steps.SupplementaryTextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

SupplementaryTextClassifier class for converting elements into SupplementaryText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with SupplementaryText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Transform a single semantic element into a SupplementaryText if applicable.

class sec_parser.processing_steps.TableClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableClassifier class for converting elements into TableElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_steps.TableOfContentsClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableOfContentsElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_steps.TextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TextClassifier class for converting elements into TextElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TextElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Transform a single semantic element into a TextElement if applicable.

class sec_parser.processing_steps.TextElementMerger

Bases: sec_parser.processing_steps.abstract_classes.abstract_element_batch_processing_step.AbstractElementBatchProcessingStep

TextElementMerger is a processing step that merges adjacent text elements, for example TextElement() and TextElement(), into a single TextElement().

Intended to fix weird formatting artifacts, such as:

<ix:nonnumeric contextref="c-1" name="us-gaap:PropertyPlantAndEquipmentTextBlock" id="f-989" escape="true">
<span>Property and equipment, net, co</span><span>nsisted of the following (in millions):</span>
</ix:nonnumeric>

Notice how the text is split into two spans even though it is a single sentence. Source: https://www.sec.gov/Archives/edgar/data/1652044/000165204423000094/goog-20230930.htm
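The merging idea can be sketched on simplified data. This is an illustration under stated assumptions rather than the library's implementation: `Element` and `merge_adjacent_text` are hypothetical, elements carry a plain `kind` label, and fragments are joined without a separator because the example above is split in the middle of a word.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import groupby


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    kind: str
    text: str


def merge_adjacent_text(elements: list[Element]) -> list[Element]:
    merged: list[Element] = []
    # groupby only groups consecutive items, which is exactly "adjacent".
    for kind, run in groupby(elements, key=lambda e: e.kind):
        items = list(run)
        if kind == "text" and len(items) > 1:
            merged.append(Element("text", "".join(e.text for e in items)))
        else:
            merged.extend(items)
    return merged


parts = [
    Element("text", "Property and equipment, net, co"),
    Element("text", "nsisted of the following (in millions):"),
    Element("table", "<table>...</table>"),
    Element("text", "See accompanying notes."),
]
for element in merge_adjacent_text(parts):
    print(element.kind, "|", element.text)
```

Text elements separated by another element type, such as the table here, are left as separate elements.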

_process_elements(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], _: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) → list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

classmethod _merge(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

class sec_parser.processing_steps.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

The “_unique_styles_by_order” tuple:

Represents an ordered set of unique styles found in the document.
Preserves the order of insertion, which determines the hierarchical level of each style.
Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.
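The ordered-unique-styles idea can be sketched as follows. `StyleLevels` is a hypothetical helper, and styles are represented as plain strings here for brevity, whereas the library works with TextStyle objects; treating the tuple index as the heading level is an assumption drawn from the description above.

```python
from __future__ import annotations


class StyleLevels:
    """Hypothetical helper tracking styles in first-seen order."""

    def __init__(self) -> None:
        self._unique_styles_by_order: tuple[str, ...] = ()

    def add_unique_style(self, style: str) -> None:
        # Append only if the style has not been seen, preserving order.
        if style not in self._unique_styles_by_order:
            self._unique_styles_by_order = (*self._unique_styles_by_order, style)

    def level_of(self, style: str) -> int:
        # Earlier styles get lower numbers, i.e. higher-level headings.
        return self._unique_styles_by_order.index(style)


levels = StyleLevels()
for seen in ["bold-14pt", "bold-italic-10pt", "bold-14pt", "underline-10pt"]:
    levels.add_unique_style(seen)

print(levels.level_of("bold-14pt"), levels.level_of("underline-10pt"))
```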

_add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) → None: Add a new unique style if not already present.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement: Process each element and convert to TitleElement if necessary.

class sec_parser.processing_steps.TopSectionManagerFor10Q(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Documents are divided into sections, subsections, and so on. Top level sections are the highest level of sections and are standardized across each type of document.

An example of a Top Level Section in a 10-Q report is “Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk”.

_NUM_ITERATIONS = 2

classmethod is_match_part_or_item(text: str) → bool

static match_part(text: str) → str | None

static match_item(text: str) → str | None
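To give a feel for what part and item matching might involve, here is a standalone sketch. The regular expressions, the returned identifier format, and the function names `sketch_match_part`, `sketch_match_item`, and `sketch_is_part_or_item` are all illustrative assumptions; the expressions actually used by TopSectionManagerFor10Q are not shown in this reference.

```python
from __future__ import annotations

import re
from typing import Optional

# Assumed patterns: a heading starts with "Part <roman numeral>" or
# "Item <number><optional letter>".
_PART = re.compile(r"^\s*part\s+(i{1,3}|iv)\b", re.IGNORECASE)
_ITEM = re.compile(r"^\s*item\s+(\d+[a-z]?)\b", re.IGNORECASE)


def sketch_match_part(text: str) -> Optional[str]:
    match = _PART.match(text)
    return match.group(1).lower() if match else None


def sketch_match_item(text: str) -> Optional[str]:
    match = _ITEM.match(text)
    return match.group(1).lower() if match else None


def sketch_is_part_or_item(text: str) -> bool:
    return sketch_match_part(text) is not None or sketch_match_item(text) is not None


print(sketch_match_part("Part I, Item 3. Quantitative and Qualitative Disclosures"))
print(sketch_match_item("Item 1A. Risk Factors"))
print(sketch_is_part_or_item("Participants in the plan"))
```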

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_iteration_0(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_process_iteration_1(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_identify_candidate(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → None

_get_section_type(identifier: str) → sec_parser.semantic_elements.top_section_title_types.TopSectionType

_select_candidates() → tuple[_Candidate, ...]

_process_selected_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_update_last_order_number(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) → None

_log_order_number_not_greater(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) → None

_create_top_section_title(candidate: _Candidate) → sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement