sec_parser.processing_steps.title_classifier
Classes
AbstractElementwiseProcessingStep: Iterates over all semantic elements, with or without applying transformations.
ElementProcessingContext: Provides context information for elementwise processing steps.
HighlightedTextElement: Among other uses, an intermediate step in identifying title elements.
TitleElement: Represents the title of a paragraph or other content object.
TitleClassifier: Converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.
Module Contents
- class sec_parser.processing_steps.title_classifier.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
Bases:
sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep
The AbstractElementwiseProcessingStep class is used to iterate over all semantic elements, with or without applying transformations.
- _NUM_ITERATIONS = 1
- abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The _process_element method transforms a single semantic element into another.
It can also be used to simply iterate over all elements without applying any transformations.
- _process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
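The elementwise pattern described above can be illustrated with a minimal, self-contained sketch. It does not import sec_parser: Element, Context, ElementwiseStep and UppercaseStep are hypothetical stand-ins, and the loop over _NUM_ITERATIONS is an assumption about how the iteration counter might be used, not the library's implementation.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    text: str


@dataclass
class Context:
    """Hypothetical stand-in for ElementProcessingContext."""

    iteration: int


class ElementwiseStep(ABC):
    """Hypothetical base class: visits every element once per iteration."""

    _NUM_ITERATIONS = 1

    def process(self, elements: list[Element]) -> list[Element]:
        for iteration in range(self._NUM_ITERATIONS):
            context = Context(iteration=iteration)
            # Each element is replaced by whatever the subclass returns.
            elements = [self._process_element(e, context) for e in elements]
        return elements

    @abstractmethod
    def _process_element(self, element: Element, context: Context) -> Element:
        """Return a replacement element, or the same element unchanged."""


class UppercaseStep(ElementwiseStep):
    """Illustrative subclass: rewrites some elements, passes others through."""

    def _process_element(self, element: Element, context: Context) -> Element:
        if element.text.startswith("item"):
            return Element(element.text.upper())
        return element


result = UppercaseStep().process(
    [Element("item 1. business"), Element("We sell things.")]
)
print([e.text for e in result])
```

A subclass only has to decide what happens to one element at a time; returning the input unchanged gives the "iterate without transforming" behaviour mentioned in the description.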
- class sec_parser.processing_steps.title_classifier.ElementProcessingContext
The ElementProcessingContext class is designed to provide context information for elementwise processing steps.
- iteration: int
- class sec_parser.processing_steps.title_classifier.HighlightedTextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, style: TextStyle | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases:
sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The HighlightedTextElement class, among other uses, is an intermediate step in identifying title elements.
For example:
First, elements with specific styles (like bold or italic text) are classified as HighlightedTextElements. These are later examined to determine if they should be considered TitleElements.
- classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, style: TextStyle | None = None) -> HighlightedTextElement
Convert the semantic element into another semantic element type.
- to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]
- class sec_parser.processing_steps.title_classifier.TextStyle
- PERCENTAGE_THRESHOLD = 80
- BOLD_THRESHOLD = 600
- is_all_uppercase: bool = False
- bold_with_font_weight: bool = False
- italic: bool = False
- centered: bool = False
- underline: bool = False
- __bool__() -> bool
- classmethod from_style_and_text(style_percentage: dict[tuple[str, str], float], text: str) -> TextStyle
- classmethod _is_bold_with_font_weight(key: str, value: str) -> bool
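The listing gives no description of how the two thresholds are applied, so the following is only a sketch under stated assumptions: a CSS property counts when it covers at least PERCENTAGE_THRESHOLD percent of the text, and text is bold when font-weight is "bold" or a number of at least BOLD_THRESHOLD. The functions is_bold and detect_flags are hypothetical names, and the CSS keys used for italic, centered and underline are guesses rather than the library's actual rules.

```python
from __future__ import annotations

PERCENTAGE_THRESHOLD = 80  # value taken from the listing above
BOLD_THRESHOLD = 600  # value taken from the listing above


def is_bold(key: str, value: str) -> bool:
    """Assumed rule: font-weight is 'bold' or a number >= BOLD_THRESHOLD."""
    if key != "font-weight":
        return False
    if value == "bold":
        return True
    return value.isdigit() and int(value) >= BOLD_THRESHOLD


def detect_flags(
    style_percentage: dict[tuple[str, str], float], text: str
) -> dict[str, bool]:
    """Hypothetical helper mirroring the TextStyle flag names."""
    # Keep only (property, value) pairs that cover enough of the text.
    dominant = [
        pair for pair, pct in style_percentage.items()
        if pct >= PERCENTAGE_THRESHOLD
    ]
    return {
        "is_all_uppercase": text.isupper(),
        "bold_with_font_weight": any(is_bold(k, v) for k, v in dominant),
        "italic": ("font-style", "italic") in dominant,
        "centered": ("text-align", "center") in dominant,
        "underline": ("text-decoration", "underline") in dominant,
    }


flags = detect_flags(
    {("font-weight", "700"): 100.0, ("font-style", "italic"): 40.0},
    "RISK FACTORS",
)
print(flags)
```

In this example the italic style covers only 40 percent of the text, so it falls below the assumed threshold and is not reported, while the bold weight of 700 applies to all of it.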
- class sec_parser.processing_steps.title_classifier.TitleElement
Bases:
sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement
The TitleElement class represents the title of a paragraph or other content object. It serves as a semantic marker, providing context and structure to the document.
- class sec_parser.processing_steps.title_classifier.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.
The “_unique_styles_by_order” tuple:
Represents an ordered set of unique styles found in the document.
Preserves the order of insertion, which determines the hierarchical level of each style.
Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.
- _add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) -> None
Add a new unique style if not already present.
- _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
Process each element and convert to TitleElement if necessary.
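The role of the "_unique_styles_by_order" tuple can be sketched as follows. This is an illustration under assumptions, not the library's code: Style and LevelTracker are hypothetical stand-ins, and using the tuple index as the heading level is one plausible reading of "order of insertion determines the hierarchical level".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Style:
    """Hypothetical, hashable stand-in for TextStyle."""

    bold: bool = False
    italic: bool = False


class LevelTracker:
    """Hypothetical helper that maps each style to a level by first-seen order."""

    def __init__(self) -> None:
        self._unique_styles_by_order: tuple[Style, ...] = ()

    def _add_unique_style(self, style: Style) -> None:
        # Append only unseen styles so insertion order is preserved.
        if style not in self._unique_styles_by_order:
            self._unique_styles_by_order += (style,)

    def level_of(self, style: Style) -> int:
        # Assumption: the position in the tuple is the heading level,
        # so styles seen earlier get smaller (higher-ranking) levels.
        self._add_unique_style(style)
        return self._unique_styles_by_order.index(style)


tracker = LevelTracker()
seen = (Style(bold=True), Style(italic=True), Style(bold=True))
levels = [tracker.level_of(s) for s in seen]
print(levels)
```

The third element reuses the first style, so it receives the same level as the first rather than a new one.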