sec_parser.processing_steps

The processing_steps subpackage provides a collection of steps designed to work with parser engines from the parsing_engine subpackage. These steps carry out specific tasks such as section identification, title parsing, and text extraction.

Submodules

Classes

AbstractElementwiseProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.

AbstractProcessingStep

AbstractProcessingStep class for transforming a list of elements.

EmptyElementClassifier

EmptyElementClassifier class for converting elements into EmptyElement instances.

HighlightedTextClassifier

HighlightedTextClassifier class for converting elements into HighlightedText instances.

ImageClassifier

ImageClassifier class for converting elements into ImageElement instances.

IndividualSemanticElementExtractor

Responsible for splitting a single HTML element representing multiple semantic elements into multiple Semantic Element instances.

ImageCheck

Helper class that provides a standard way to create an ABC using inheritance.

TableCheck

Helper class that provides a standard way to create an ABC using inheritance.

TopSectionTitleCheck

XbrlTagCheck

Helper class that provides a standard way to create an ABC using inheritance.

IntroductorySectionElementClassifier

The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.

PageHeaderClassifier

PageNumberClassifier

SupplementaryTextClassifier

SupplementaryTextClassifier class for converting elements into SupplementaryText instances.

TableClassifier

TableClassifier class for converting elements into TableElement instances.

TableOfContentsClassifier

TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.

TextClassifier

TextClassifier class for converting elements into TextElement instances.

TextElementMerger

TextElementMerger is a processing step that merges adjacent text elements.

TitleClassifier

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

TopSectionManagerFor10K

Specialized version of TopSectionManagerForFiling for handling 10-K filings.

TopSectionManagerFor10Q

Specialized version of TopSectionManagerForFiling for handling 10-Q filings.

Package Contents

class sec_parser.processing_steps.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.

_NUM_ITERATIONS = 1
_types_to_process
_types_to_exclude
abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
_process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
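As a rough illustration of the elementwise pattern described for AbstractElementwiseProcessingStep, the sketch below uses hypothetical stand-in classes (Element, Text, Table, ElementwiseStep, UppercaseText) rather than the real sec_parser types. It assumes that types_to_process restricts a step to the listed types and that types_to_exclude skips the listed types; the exact filtering rules should be checked against the library source.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Element:
    """Hypothetical stand-in for AbstractSemanticElement."""

    def __init__(self, text: str) -> None:
        self.text = text


class Text(Element):
    pass


class Table(Element):
    pass


class ElementwiseStep(ABC):
    """Illustrative stand-in, not the sec_parser implementation."""

    def __init__(self, *, types_to_process=None, types_to_exclude=None):
        self._types_to_process = types_to_process or set()
        self._types_to_exclude = types_to_exclude or set()

    @abstractmethod
    def _process_element(self, element: Element) -> Element:
        """Return the (possibly replaced) element."""

    def process(self, elements: list[Element]) -> list[Element]:
        result = []
        for element in elements:
            # Assumed filtering: skip excluded types, and skip anything
            # outside types_to_process when that set is non-empty.
            excluded = isinstance(element, tuple(self._types_to_exclude))
            included = not self._types_to_process or isinstance(
                element, tuple(self._types_to_process)
            )
            if excluded or not included:
                result.append(element)
            else:
                result.append(self._process_element(element))
        return result


class UppercaseText(ElementwiseStep):
    def _process_element(self, element: Element) -> Element:
        # Replace the element with a new instance of the same type.
        return type(element)(element.text.upper())


elements = [Text("net sales"), Table("q3 table")]
processed = UppercaseText(types_to_process={Text}).process(elements)
texts = [e.text for e in processed]
```

Here only the Text element is transformed; the Table element passes through unchanged because it is outside types_to_process.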
class sec_parser.processing_steps.AbstractProcessingStep

Bases: abc.ABC

AbstractProcessingStep class for transforming a list of elements. Chaining multiple steps together allows for complex transformations while keeping the code modular.

Each instance of a step is designed to be used for a single transformation operation. This ensures that any internal state maintained during a transformation is isolated to the processing of a single document.

_already_processed = False
process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

Transform the list of semantic elements.

Note: The elements argument could potentially be mutated for performance reasons.

abstract _process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]

Implement the actual transformation logic in child classes.

This method is intended to be overridden by child classes to provide specific transformation logic.
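The single-use contract and the chaining idea described for AbstractProcessingStep might look roughly like the sketch below. SingleUseStep, DropEmpty, Strip, and run_pipeline are hypothetical names used only for illustration, and raising RuntimeError on reuse is an assumption; the library may signal reuse differently.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SingleUseStep(ABC):
    """Hypothetical stand-in for AbstractProcessingStep."""

    def __init__(self) -> None:
        self._already_processed = False

    def process(self, elements: list[str]) -> list[str]:
        # Assumption: reusing an instance is treated as an error here.
        if self._already_processed:
            raise RuntimeError("step instance was already used")
        self._already_processed = True
        return self._process(elements)

    @abstractmethod
    def _process(self, elements: list[str]) -> list[str]:
        """Child classes put the transformation logic here."""


class DropEmpty(SingleUseStep):
    def _process(self, elements: list[str]) -> list[str]:
        return [e for e in elements if e.strip()]


class Strip(SingleUseStep):
    def _process(self, elements: list[str]) -> list[str]:
        return [e.strip() for e in elements]


def run_pipeline(step_factories, elements: list[str]) -> list[str]:
    # Build fresh step instances for every document so that no
    # internal state leaks from one document to the next.
    for factory in step_factories:
        elements = factory().process(elements)
    return elements


result = run_pipeline([DropEmpty, Strip], ["  Item 1 ", "   ", "Risk Factors"])
```

Passing factories (here, the classes themselves) instead of instances keeps each document's state isolated, which is the property the description above asks for.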

class sec_parser.processing_steps.EmptyElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

EmptyElementClassifier class for converting elements into EmptyElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with EmptyElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Transform a single semantic element into an EmptyElement if applicable.

class sec_parser.processing_steps.HighlightedTextClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

HighlightedTextClassifier class for converting elements into HighlightedText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with HighlightedText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.ImageClassifier

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

ImageClassifier class for converting elements into ImageElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with ImageElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.IndividualSemanticElementExtractor(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

Responsible for splitting a single HTML element representing multiple semantic elements into multiple Semantic Element instances with a shared parent instance of type CompositeSemanticElement. This ensures structural integrity during parsing, which is crucial for accurately reconstructing the original HTML document and for semantic analysis where the relationship between elements can hold significant meaning.

_contains_single_element_checks
_create_composite_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool
class sec_parser.processing_steps.ImageCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None

Designed to work as a series of subsequent checks:

  • Returning None means that the check is inconclusive, and the next check should be performed.

  • Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.

  • Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.

class sec_parser.processing_steps.TableCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None

Designed to work as a series of subsequent checks:

  • Returning None means that the check is inconclusive, and the next check should be performed.

  • Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.

  • Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.

class sec_parser.processing_steps.TopSectionTitleCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None
class sec_parser.processing_steps.XbrlTagCheck

Bases: sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck

Helper class that provides a standard way to create an ABC using inheritance.

contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool | None

Designed to work as a series of subsequent checks:

  • Returning None means that the check is inconclusive, and the next check should be performed.

  • Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.

  • Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.
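The three-valued protocol shared by the checks above (None, True, False) can be sketched as a simple loop in which the first conclusive answer wins. The check functions and their rules below are hypothetical and operate on plain HTML strings, and the default used when every check is inconclusive is an assumption, not documented behaviour.

```python
from __future__ import annotations

from typing import Callable, Optional

Check = Callable[[str], Optional[bool]]


def table_check(html: str) -> Optional[bool]:
    # Hypothetical rule: an element that is a <table> is kept whole.
    return True if html.startswith("<table") else None


def multi_paragraph_check(html: str) -> Optional[bool]:
    # Hypothetical rule: several <p> tags suggest a split is needed.
    return False if html.count("<p>") > 1 else None


def contains_single_element(
    html: str, checks: list[Check], default: bool = True
) -> bool:
    for check in checks:
        verdict = check(html)
        if verdict is not None:
            # First conclusive answer wins; later checks are skipped.
            return verdict
    # Assumed fallback when every check is inconclusive.
    return default


checks = [table_check, multi_paragraph_check]
whole_table = contains_single_element("<table><tr><td><p>a</p><p>b</p></td></tr></table>", checks)
needs_split = contains_single_element("<div><p>a</p><p>b</p></div>", checks)
undecided = contains_single_element("<div><p>a</p></div>", checks)
```

Because the checks run in order, placing table_check first means a table is never split, even if it contains several paragraphs.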

class sec_parser.processing_steps.IntroductorySectionElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.

For example, consider an SEC EDGAR 10-Q report. This processing step will mark all elements that appear before the ‘part1’ section.

_NUM_ITERATIONS = 2
_part1_exists = False
_part1_found = False
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
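The attributes _NUM_ITERATIONS = 2, _part1_exists, and _part1_found suggest a two-pass scan. The sketch below is one way such a scan could work, using a hypothetical mark_introductory function over plain section labels. It assumes that nothing is marked when no 'part1' section exists; the real step may behave differently.

```python
from __future__ import annotations


def mark_introductory(labels: list[str]) -> list[bool]:
    """Return True for each item that precedes the first 'part1' label."""
    # Pass 1: find out whether a 'part1' section exists at all.
    part1_exists = any(label == "part1" for label in labels)
    if not part1_exists:
        # Assumption: without a 'part1' section, nothing is marked.
        return [False] * len(labels)

    # Pass 2: flag everything seen before 'part1' is reached.
    flags = []
    part1_found = False
    for label in labels:
        if label == "part1":
            part1_found = True
        flags.append(not part1_found)
    return flags


flags = mark_introductory(["cover", "toc", "part1", "part1item1"])
```

The first pass is what makes the second pass safe: without it, a document lacking 'part1' would have every element marked as introductory.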
class sec_parser.processing_steps.PageHeaderClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2
_element_to_page_header_candidate: dict[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, PageHeaderCandidate]
_candidate_count: collections.Counter[PageHeaderCandidate]
_most_common_candidates: dict[PageHeaderCandidate, int] | None = None
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_find_page_header_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) None
_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_get_most_common_candidates() dict[PageHeaderCandidate, int]
class sec_parser.processing_steps.PageNumberClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2
_element_to_page_number_candidate: dict[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, PageNumberCandidate]
_candidate_count: collections.Counter[PageNumberCandidate]
_most_common_candidate: PageNumberCandidate | None = None
_most_common_candidate_count: int = 0
_search_status: MostCommonCandidateSearchStatus
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_find_page_number_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) None
_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_get_most_common_candidate() PageNumberCandidate | None
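PageHeaderClassifier and PageNumberClassifier both list _NUM_ITERATIONS = 2 and a collections.Counter of candidates, which points to a count-then-classify design. The following is a minimal sketch of that idea on plain strings. The to_candidate normalisation, the 12-character limit, and the min_count threshold are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Optional


def to_candidate(text: str) -> Optional[str]:
    """Replace digit runs with '#'; only short texts with digits qualify."""
    stripped = text.strip()
    if len(stripped) > 12 or not re.search(r"\d", stripped):
        return None
    return re.sub(r"\d+", "#", stripped)


def classify(texts: list[str], min_count: int = 2) -> list[str]:
    # Pass 1: count how often each candidate pattern occurs.
    counts = Counter(c for c in map(to_candidate, texts) if c is not None)
    top = counts.most_common(1)
    winner = top[0][0] if top and top[0][1] >= min_count else None

    # Pass 2: label the elements whose pattern matches the winner.
    return [
        "page_number" if winner is not None and to_candidate(t) == winner else "other"
        for t in texts
    ]


labels = classify(["Page 1", "Revenue grew 5%", "Page 2", "2023", "Page 3"])
```

Counting first means an isolated number such as "2023" is not mistaken for a page number, because its pattern does not recur often enough.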
class sec_parser.processing_steps.SupplementaryTextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

SupplementaryTextClassifier class for converting elements into SupplementaryText instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with SupplementaryText instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Transform a single semantic element into a SupplementaryText if applicable.

class sec_parser.processing_steps.TableClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableClassifier class for converting elements into TableElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableElement instances.

_row_count_threshold = 1
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.TableOfContentsClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableOfContentsElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
class sec_parser.processing_steps.TextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TextClassifier class for converting elements into TextElement instances.

This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TextElement instances.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Transform a single semantic element into a TextElement if applicable.

class sec_parser.processing_steps.TextElementMerger

Bases: sec_parser.processing_steps.abstract_classes.abstract_element_batch_processing_step.AbstractElementBatchProcessingStep

TextElementMerger is a processing step that merges adjacent text elements. For example, it merges TextElement(<span></span>) and TextElement(<span></span>) into a single TextElement(<span></span><span></span>).

Intended to fix weird formatting artifacts, such as:
<ix:nonnumeric contextref="c-1" name="us-gaap:PropertyPlantAndEquipmentTextBlock" id="f-989" escape="true">

<span style="background-color:#ffffff;color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:120%">Property and equipment, net, co</span> <span style="color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:120%">nsisted of the following (in millions):</span>

</ix:nonnumeric>

Notice how the text is split into two spans even though it is a single sentence. Source: https://www.sec.gov/Archives/edgar/data/1652044/000165204423000094/goog-20230930.htm

_process_elements(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], _: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
classmethod _merge(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
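The merging behaviour described for TextElementMerger can be approximated as follows. Node and merge_adjacent_text are hypothetical stand-ins that work on HTML strings instead of real semantic elements, so this is a sketch of the idea rather than the library's logic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a semantic element."""

    kind: str
    html: str


def merge_adjacent_text(nodes: list[Node]) -> list[Node]:
    merged: list[Node] = []
    for node in nodes:
        if node.kind == "text" and merged and merged[-1].kind == "text":
            # Extend the previous text node instead of adding a new one.
            merged[-1] = Node("text", merged[-1].html + node.html)
        else:
            merged.append(node)
    return merged


nodes = [
    Node("text", "<span>Property and equipment, net, co</span>"),
    Node("text", "<span>nsisted of the following (in millions):</span>"),
    Node("table", "<table></table>"),
    Node("text", "<span>Total</span>"),
]
merged = merge_adjacent_text(nodes)
```

Only neighbouring text nodes are combined; the table in the middle acts as a boundary, so the trailing text node stays separate.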
class sec_parser.processing_steps.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

The “_unique_styles_by_order” tuple:

  • Represents an ordered set of unique styles found in the document.

  • Preserves the order of insertion, which determines the hierarchical level of each style.

  • Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.
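The ordered-set behaviour in the list above can be sketched with a plain tuple. The add_unique_style helper and the string style labels are hypothetical, and mapping a style's position in the tuple directly to a heading level is an assumption drawn from the description.

```python
from __future__ import annotations


def add_unique_style(styles: tuple[str, ...], style: str) -> tuple[str, ...]:
    # Append only unseen styles so the insertion order is preserved.
    return styles if style in styles else styles + (style,)


styles: tuple[str, ...] = ()
for seen in ["bold-16pt", "bold-12pt", "bold-16pt", "italic-10pt"]:
    styles = add_unique_style(styles, seen)

# Assumed mapping: a style's index in the tuple is its heading level,
# so the first style encountered gets level 0 (the highest level).
levels = {style: level for level, style in enumerate(styles)}
```

A tuple is used here because it keeps insertion order and is immutable, matching the default value of () shown for _unique_styles_by_order.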

_unique_styles_by_order: tuple[sec_parser.semantic_elements.highlighted_text_element.TextStyle, Ellipsis] = ()
_add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) None

Add a new unique style if not already present.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Process each element and convert to TitleElement if necessary.

class sec_parser.processing_steps.TopSectionManagerFor10K(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: TopSectionManager

Specialized version of TopSectionManagerForFiling for handling 10-K filings. Automatically uses FilingSectionsIn10K while maintaining all the functionality of the base class.

class sec_parser.processing_steps.TopSectionManagerFor10Q(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: TopSectionManager

Specialized version of TopSectionManagerForFiling for handling 10-Q filings. Automatically uses FilingSectionsIn10Q while maintaining all the functionality of the base class.