sec_parser.processing_steps
The processing_steps subpackage provides a collection of steps designed to work with parser engines from the parsing_engine subpackage. These steps carry out specific tasks such as section identification, title parsing, and text extraction.
Subpackages
Submodules
- sec_parser.processing_steps.empty_element_classifier
- sec_parser.processing_steps.highlighted_text_classifier
- sec_parser.processing_steps.image_classifier
- sec_parser.processing_steps.introductory_section_classifier
- sec_parser.processing_steps.page_header_classifier
- sec_parser.processing_steps.page_number_classifier
- sec_parser.processing_steps.supplementary_text_classifier
- sec_parser.processing_steps.table_classifier
- sec_parser.processing_steps.table_of_contents_classifier
- sec_parser.processing_steps.text_classifier
- sec_parser.processing_steps.text_element_merger
- sec_parser.processing_steps.title_classifier
- sec_parser.processing_steps.top_section_manager_for_10q
Classes
- AbstractElementwiseProcessingStep: class used to iterate over all Semantic Elements with or without applying transformations.
- AbstractProcessingStep: class for transforming a list of elements.
- EmptyElementClassifier: class for converting elements into EmptyElement instances.
- HighlightedTextClassifier: class for converting elements into HighlightedText instances.
- ImageClassifier: class for converting elements into ImageElement instances.
- IndividualSemanticElementExtractor: responsible for splitting a single HTML element representing multiple semantic elements into multiple Semantic Element instances.
- TableCheck: helper class that provides a standard way to create an ABC using inheritance.
- TopSectionTitleCheck: helper class that provides a standard way to create an ABC using inheritance.
- IntroductorySectionElementClassifier: processing step designed to classify elements that are located before the actual contents of the document.
- SupplementaryTextClassifier: class for converting elements into SupplementaryText instances.
- TableClassifier: class for converting elements into TableElement instances.
- TableOfContentsClassifier: class for converting elements into TableOfContentsElement instances.
- TextClassifier: class for converting elements into TextElement instances.
- TextElementMerger: processing step that merges adjacent text elements.
- TitleClassifier: converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.
- TopSectionManagerFor10Q: documents are divided into sections, subsections, and so on; this step handles the top-level sections.
Package Contents
- class sec_parser.processing_steps.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
Bases: sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep
AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.
- _NUM_ITERATIONS = 1
- abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_process_element method is responsible for transforming a single semantic element into another.
It can also be utilized to simply iterate over all elements without applying any transformations.
- _process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
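The following is a minimal sketch of how an elementwise step with type filters and a configurable number of iterations might behave. It does not import sec_parser: Element, TextLike, TableLike, Context, ElementwiseStep and UppercaseStep are hypothetical stand-ins, and the filtering rules are an assumption based only on the parameter names types_to_process and types_to_exclude.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


class Element:
    """Hypothetical stand-in for AbstractSemanticElement."""

    def __init__(self, text: str) -> None:
        self.text = text


class TextLike(Element):
    """Hypothetical text-bearing element."""


class TableLike(Element):
    """Hypothetical table element."""


@dataclass
class Context:
    """Hypothetical stand-in for ElementProcessingContext."""

    iteration: int


class ElementwiseStep(ABC):
    """Visits every element once per iteration and lets subclasses transform it."""

    _NUM_ITERATIONS = 1

    def __init__(self, *, types_to_process=None, types_to_exclude=None) -> None:
        self._types_to_process = tuple(types_to_process or ())
        self._types_to_exclude = tuple(types_to_exclude or ())

    def process(self, elements: list[Element]) -> list[Element]:
        for iteration in range(self._NUM_ITERATIONS):
            context = Context(iteration=iteration)
            elements = [self._visit(e, context) for e in elements]
        return elements

    def _visit(self, element: Element, context: Context) -> Element:
        # Assumption: an empty types_to_process means "process every type".
        if self._types_to_process and not isinstance(element, self._types_to_process):
            return element
        if isinstance(element, self._types_to_exclude):
            return element
        return self._process_element(element, context)

    @abstractmethod
    def _process_element(self, element: Element, context: Context) -> Element:
        """Return the element unchanged or a replacement for it."""


class UppercaseStep(ElementwiseStep):
    def _process_element(self, element: Element, context: Context) -> Element:
        return type(element)(element.text.upper())


step = UppercaseStep(types_to_process={TextLike})
result = step.process([TextLike("revenue"), TableLike("rows")])
```

Here only the TextLike element is handed to _process_element; the TableLike element passes through untouched because it is not listed in types_to_process.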
- class sec_parser.processing_steps.AbstractProcessingStep
Bases: abc.ABC
AbstractProcessingStep class for transforming a list of elements. Chaining multiple steps together allows for complex transformations while keeping the code modular.
Each instance of a step is designed to be used for a single transformation operation. This ensures that any internal state maintained during a transformation is isolated to the processing of a single document.
- process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
Transform the list of semantic elements.
Note: The elements argument could potentially be mutated for performance reasons.
- abstract _process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
Implement the actual transformation logic in child classes.
This method is intended to be overridden by child classes to provide specific transformation logic.
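The split between a public process method and an abstract _process hook, together with the single-use rule for step instances, can be sketched as follows. This is an illustration only: ProcessingStep, StripStep, DropEmptyStep and run_pipeline are hypothetical names, plain strings stand in for semantic elements, and raising RuntimeError on reuse is an assumption rather than the library's documented behavior.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ProcessingStep(ABC):
    """Hypothetical stand-in for AbstractProcessingStep."""

    def __init__(self) -> None:
        self._already_processed = False

    def process(self, elements: list[str]) -> list[str]:
        # Assumption: reuse is rejected so that any state collected while
        # processing one document cannot leak into the next one.
        if self._already_processed:
            raise RuntimeError("Create a new step instance for each document.")
        self._already_processed = True
        return self._process(elements)

    @abstractmethod
    def _process(self, elements: list[str]) -> list[str]:
        """Child classes put the actual transformation logic here."""


class StripStep(ProcessingStep):
    def _process(self, elements: list[str]) -> list[str]:
        return [e.strip() for e in elements]


class DropEmptyStep(ProcessingStep):
    def _process(self, elements: list[str]) -> list[str]:
        return [e for e in elements if e]


def run_pipeline(steps: list[ProcessingStep], elements: list[str]) -> list[str]:
    """Chain steps by feeding each step's output into the next one."""
    for step in steps:
        elements = step.process(elements)
    return elements


cleaned = run_pipeline([StripStep(), DropEmptyStep()], ["  Item 1 ", "   ", "Risk Factors"])
```

Each step stays small and testable on its own, and the pipeline is just the ordered list of fresh step instances.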
- class sec_parser.processing_steps.EmptyElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
EmptyElementClassifier class for converting elements into EmptyElement instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with EmptyElement instances.
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
Transform a single semantic element into an EmptyElement if applicable.
- class sec_parser.processing_steps.HighlightedTextClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
HighlightedTextClassifier class for converting elements into HighlightedText instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with HighlightedText instances.
- class sec_parser.processing_steps.ImageClassifier
ImageClassifier class for converting elements into ImageElement instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with ImageElement instances.
- class sec_parser.processing_steps.IndividualSemanticElementExtractor(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None)
Responsible for splitting a single HTML element that represents multiple semantic elements into multiple Semantic Element instances with a shared parent instance of type CompositeSemanticElement. This ensures structural integrity during parsing, which is crucial for accurately reconstructing the original HTML document and for semantic analysis where the relationship between elements can hold significant meaning.
- _create_composite_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_process_element method is responsible for transforming a single semantic element into another.
It can also be utilized to simply iterate over all elements without applying any transformations.
- _contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> bool
- class sec_parser.processing_steps.ImageCheck
- contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> bool | None
- class sec_parser.processing_steps.TableCheck
Helper class that provides a standard way to create an ABC using inheritance.
- contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> bool | None
Designed to work as a series of subsequent checks:
- Returning None means that the check is inconclusive, and the next check should be performed.
- Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.
- Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.
- class sec_parser.processing_steps.TopSectionTitleCheck
Helper class that provides a standard way to create an ABC using inheritance.
- contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> bool | None
Designed to work as a series of subsequent checks:
- Returning None means that the check is inconclusive, and the next check should be performed.
- Returning True means that no further checks are necessary, and the HTML element can later be converted into a semantic element without any splits.
- Returning False means that the HTML element will be split into multiple semantic elements of type NotYetClassifiedElement.
- class sec_parser.processing_steps.XbrlTagCheck
- contains_single_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> bool | None
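The three-valued contract of contains_single_element (None, True, False) lends itself to a simple chain of checks. The sketch below illustrates that idea on raw HTML strings; table_check, multiple_paragraph_check and the fallback used when every check is inconclusive are hypothetical and are not taken from the library.

```python
from __future__ import annotations

from typing import Callable, Optional

# A check returns True or False when it is conclusive, or None to defer.
Check = Callable[[str], Optional[bool]]


def table_check(html: str) -> Optional[bool]:
    # Hypothetical rule: a fragment containing a table is kept as one element.
    return True if "<table" in html else None


def multiple_paragraph_check(html: str) -> Optional[bool]:
    # Hypothetical rule: more than one paragraph means the fragment is split.
    return False if html.count("<p") > 1 else None


def contains_single_element(html: str, checks: list[Check]) -> bool:
    """Run the checks in order and stop at the first conclusive verdict."""
    for check in checks:
        verdict = check(html)
        if verdict is not None:
            return verdict
    # Assumption: if no check is conclusive, treat the fragment as one element.
    return True


checks = [table_check, multiple_paragraph_check]
table_verdict = contains_single_element("<div><table><p>a</p><p>b</p></table></div>", checks)
split_verdict = contains_single_element("<div><p>a</p><p>b</p></div>", checks)
default_verdict = contains_single_element("<div><p>a</p></div>", checks)
```

Because the first conclusive answer wins, the order of the checks matters: in this sketch the table rule takes priority over the paragraph rule.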
- class sec_parser.processing_steps.IntroductorySectionElementClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
The IntroductorySectionElementClassifier is a processing step designed to classify elements that are located before the actual contents of the document.
For example, consider an SEC EDGAR 10-Q report. This processing step will mark all elements that appear before the ‘part1’ section.
- _NUM_ITERATIONS = 2
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- class sec_parser.processing_steps.PageHeaderClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
- _NUM_ITERATIONS = 2
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _find_page_header_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> None
- _classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _get_most_common_candidates() -> dict[PageHeaderCandidate, int]
- class sec_parser.processing_steps.PageNumberClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
- _NUM_ITERATIONS = 2
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _find_page_number_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> None
- _classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _get_most_common_candidate() -> PageNumberCandidate | None
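The method names of the page header and page number classifiers suggest a two-pass design: a first iteration collects candidates, and a second iteration classifies elements against the most common candidate. The sketch below illustrates that general idea on plain strings. The template normalization, the length limit and the minimum occurrence count are all assumptions made for the example, not the library's actual heuristics.

```python
from __future__ import annotations

import re
from collections import Counter


def to_template(text: str) -> str:
    """Replace every run of digits with '#', so 'Page 12' becomes 'Page #'."""
    return re.sub(r"\d+", "#", text.strip())


def find_candidates(texts: list[str], max_length: int = 20) -> Counter:
    """First pass: count the templates of short texts that contain a digit."""
    counts: Counter = Counter()
    for text in texts:
        if len(text) <= max_length and re.search(r"\d", text):
            counts[to_template(text)] += 1
    return counts


def classify(texts: list[str], min_occurrences: int = 2) -> list[bool]:
    """Second pass: flag texts matching the most common candidate template."""
    counts = find_candidates(texts)
    if not counts:
        return [False] * len(texts)
    template, occurrences = counts.most_common(1)[0]
    if occurrences < min_occurrences:
        return [False] * len(texts)
    return [to_template(text) == template for text in texts]


texts = [
    "Page 1",
    "Revenue grew 5% in 2023 compared to the prior year.",
    "Page 2",
    "Note 7",
    "Page 3",
]
labels = classify(texts)
```

Two passes are needed because the decision for any single element depends on statistics gathered over the whole document.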
- class sec_parser.processing_steps.SupplementaryTextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
SupplementaryTextClassifier class for converting elements into SupplementaryText instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with SupplementaryText instances.
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
Transform a single semantic element into a SupplementaryText if applicable.
- class sec_parser.processing_steps.TableClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
TableClassifier class for converting elements into TableElement instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableElement instances.
- class sec_parser.processing_steps.TableOfContentsClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
TableOfContentsClassifier class for converting elements into TableOfContentsElement instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TableOfContentsElement instances.
- class sec_parser.processing_steps.TextClassifier(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
TextClassifier class for converting elements into TextElement instances.
This step scans through a list of semantic elements and changes it, primarily by replacing suitable candidates with TextElement instances.
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
Transform a single semantic element into a TextElement if applicable.
- class sec_parser.processing_steps.TextElementMerger
TextElementMerger is a processing step that merges adjacent text elements. For example, it merges TextElement(<span></span>) and TextElement(<span></span>) into a single TextElement(<span></span><span></span>).
Intended to fix weird formatting artifacts, such as:
<ix:nonnumeric contextref="c-1" name="us-gaap:PropertyPlantAndEquipmentTextBlock" id="f-989" escape="true">
<span style="background-color:#ffffff;color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:120%">Property and equipment, net, co</span> <span style="color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:120%">nsisted of the following (in millions):</span>
</ix:nonnumeric>
Notice how the text is split into two spans, even though it is a single sentence.
Source: https://www.sec.gov/Archives/edgar/data/1652044/000165204423000094/goog-20230930.htm
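Merging adjacent text elements can be sketched with a single pass over the element list. The Text and Table dataclasses and the merge_adjacent_text function below are hypothetical stand-ins that hold raw HTML strings; how the real step combines the underlying HTML tags is not shown in this reference, so concatenating the strings is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Text:
    """Hypothetical stand-in for TextElement, holding its HTML source."""

    html: str


@dataclass
class Table:
    """Hypothetical stand-in for a non-text element."""

    html: str


def merge_adjacent_text(elements: list) -> list:
    """Fold each run of neighbouring Text elements into one Text element."""
    merged: list = []
    for element in elements:
        if isinstance(element, Text) and merged and isinstance(merged[-1], Text):
            # Assumption: merging means concatenating the HTML of both parts.
            merged[-1] = Text(merged[-1].html + element.html)
        else:
            merged.append(element)
    return merged


elements = [
    Text("<span>Property and equipment, net, co</span>"),
    Text("<span>nsisted of the following (in millions):</span>"),
    Table("<table></table>"),
    Text("<span>Total</span>"),
]
merged_elements = merge_adjacent_text(elements)
```

The table in the middle acts as a boundary, so the trailing Text element is not merged with the first two.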
- class sec_parser.processing_steps.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.
The “_unique_styles_by_order” tuple:
Represents an ordered set of unique styles found in the document.
Preserves the order of insertion, which determines the hierarchical level of each style.
Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.
- _add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) -> None
Add a new unique style if not already present.
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
Process each element and convert to TitleElement if necessary.
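The role of the ordered tuple of unique styles can be illustrated with a small sketch: the position at which a style was first seen becomes its heading level. Style and TitleLevels are hypothetical names, and the two boolean fields are an assumption about what a text style might contain.

```python
from __future__ import annotations

from typing import NamedTuple


class Style(NamedTuple):
    """Hypothetical stand-in for TextStyle."""

    bold: bool
    italic: bool


class TitleLevels:
    """Tracks unique styles in first-seen order; the index serves as the level."""

    def __init__(self) -> None:
        self._unique_styles_by_order: tuple[Style, ...] = ()

    def add_unique_style(self, style: Style) -> None:
        # Add the style only if it has not been seen before.
        if style not in self._unique_styles_by_order:
            self._unique_styles_by_order = (*self._unique_styles_by_order, style)

    def level_of(self, style: Style) -> int:
        self.add_unique_style(style)
        return self._unique_styles_by_order.index(style)


bold = Style(bold=True, italic=False)
italic = Style(bold=False, italic=True)
tracker = TitleLevels()
levels = [tracker.level_of(bold), tracker.level_of(italic), tracker.level_of(bold)]
```

The style seen first maps to level 0, which matches the stated assumption that earlier highlight styles correspond to higher-level headings.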
- class sec_parser.processing_steps.TopSectionManagerFor10Q(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
Documents are divided into sections, subsections, and so on. Top level sections are the highest level of sections and are standardized across each type of document.
An example of a Top Level Section in a 10-Q report is “Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk”.
- _NUM_ITERATIONS = 2
- classmethod is_match_part_or_item(text: str) -> bool
- static match_part(text: str) -> str | None
- static match_item(text: str) -> str | None
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _process_iteration_0(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> None
- _process_iteration_1(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _identify_candidate(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> None
- _get_section_type(identifier: str) -> sec_parser.semantic_elements.top_section_title_types.TopSectionType
- _select_candidates() -> tuple[_Candidate, ...]
- _process_selected_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- _update_last_order_number(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) -> None
- _log_order_number_not_greater(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, order: float) -> None
- _create_top_section_title(candidate: _Candidate) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
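Matching "Part" and "Item" headings such as the one in the example above can be sketched with two regular expressions. The patterns below are assumptions written for this illustration; the functions only mirror the names and signatures of match_part, match_item and is_match_part_or_item and do not reproduce the library's actual expressions or return values.

```python
from __future__ import annotations

import re
from typing import Optional

# Assumed patterns: a 10-Q has Part I and Part II, and items such as 1, 1A, 3.
_PART_RE = re.compile(r"^\s*part\s+(i{1,2})\b", re.IGNORECASE)
_ITEM_RE = re.compile(r"^\s*item\s+(\d+a?)\b", re.IGNORECASE)


def match_part(text: str) -> Optional[str]:
    """Return the lowercased part numeral ('i' or 'ii'), or None if absent."""
    match = _PART_RE.match(text)
    return match.group(1).lower() if match else None


def match_item(text: str) -> Optional[str]:
    """Return the lowercased item identifier such as '3' or '1a', or None."""
    match = _ITEM_RE.match(text)
    return match.group(1).lower() if match else None


def is_match_part_or_item(text: str) -> bool:
    return match_part(text) is not None or match_item(text) is not None


part = match_part("PART II - OTHER INFORMATION")
item = match_item("Item 3. Quantitative and Qualitative Disclosures About Market Risk")
```

Anchoring the patterns at the start of the text and requiring whitespace after the keyword keeps ordinary words like "Particulars" or "Items" from being mistaken for section titles.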