sec_parser.processing_steps.title_classifier

Classes

AbstractElementwiseProcessingStep

AbstractElementwiseProcessingStep iterates over all semantic elements, with or without applying transformations.

ElementProcessingContext

The ElementProcessingContext class is designed to provide context information for elementwise processing steps.

HighlightedTextElement

The HighlightedTextElement class, among other uses, is an intermediate step in identifying title elements.

TextStyle

TitleElement

The TitleElement class represents the title of a paragraph or other content object.

TitleClassifier

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

Module Contents

class sec_parser.processing_steps.title_classifier.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep

AbstractElementwiseProcessingStep iterates over all semantic elements, with or without applying transformations.

_NUM_ITERATIONS = 1
abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.
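The elementwise pattern described above can be sketched in a few lines. This is a simplified, standalone illustration and not the library's implementation: `Element`, `UppercasedElement`, `ElementwiseStep`, and `UppercaseStep` are hypothetical names, and the sketch leaves out the processing context, type filtering, and recursion that the real class provides.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    text: str


@dataclass
class UppercasedElement(Element):
    """Hypothetical element type produced by the transformation below."""


class ElementwiseStep(ABC):
    """Simplified sketch: applies _process_element to each element in order."""

    def process(self, elements: list[Element]) -> list[Element]:
        return [self._process_element(element) for element in elements]

    @abstractmethod
    def _process_element(self, element: Element) -> Element:
        """Return a replacement element, or the same element unchanged."""


class UppercaseStep(ElementwiseStep):
    def _process_element(self, element: Element) -> Element:
        # Transform: replace the element with one of a different type.
        if element.text.islower():
            return UppercasedElement(element.text.upper())
        # No transformation: pass the element through untouched.
        return element


result = UppercaseStep().process([Element("abc"), Element("Xyz")])
```

A subclass only has to implement `_process_element`; returning the input element unchanged is how a step iterates without transforming.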

_process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
_process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
class sec_parser.processing_steps.title_classifier.ElementProcessingContext

The ElementProcessingContext class is designed to provide context information for elementwise processing steps.

iteration: int
class sec_parser.processing_steps.title_classifier.HighlightedTextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, style: TextStyle | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The HighlightedTextElement class, among other uses, is an intermediate step in identifying title elements.

For example:

First, elements with specific styles (like bold or italic text) are classified as HighlightedTextElements. These are later examined to determine if they should be considered TitleElements.

classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, style: TextStyle | None = None) -> HighlightedTextElement

Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]
class sec_parser.processing_steps.title_classifier.TextStyle
PERCENTAGE_THRESHOLD = 80
BOLD_THRESHOLD = 600
is_all_uppercase: bool = False
bold_with_font_weight: bool = False
italic: bool = False
centered: bool = False
underline: bool = False
__bool__() -> bool
classmethod from_style_and_text(style_percentage: dict[tuple[str, str], float], text: str) -> TextStyle
classmethod _is_bold_with_font_weight(key: str, value: str) -> bool
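The listing above gives only signatures and the two threshold constants, so the following is a hedged sketch of how they might fit together, not the library's code. It assumes that each value in `style_percentage` is the percentage of the text covered by a CSS (property, value) pair, that a pair counts only when it reaches `PERCENTAGE_THRESHOLD`, and that bold means the keyword `bold` or a numeric font weight of at least `BOLD_THRESHOLD`. `StyleSketch`, `is_bold`, and `style_from` are hypothetical names, and only two of the five flags are modelled.

```python
from __future__ import annotations

from dataclasses import dataclass

PERCENTAGE_THRESHOLD = 80  # value taken from the listing above
BOLD_THRESHOLD = 600  # value taken from the listing above


@dataclass(frozen=True)
class StyleSketch:
    """Hypothetical stand-in for TextStyle; only two flags are modelled."""

    is_all_uppercase: bool = False
    bold_with_font_weight: bool = False

    def __bool__(self) -> bool:
        # Assumption: a style is truthy when at least one flag is set.
        return self.is_all_uppercase or self.bold_with_font_weight


def is_bold(key: str, value: str) -> bool:
    # Assumption: bold is the keyword "bold" or a numeric weight >= BOLD_THRESHOLD.
    if key != "font-weight":
        return False
    if value == "bold":
        return True
    return value.isdigit() and int(value) >= BOLD_THRESHOLD


def style_from(style_percentage: dict[tuple[str, str], float], text: str) -> StyleSketch:
    # Assumption: keep only CSS pairs that cover enough of the text.
    dominant = [pair for pair, pct in style_percentage.items() if pct >= PERCENTAGE_THRESHOLD]
    return StyleSketch(
        is_all_uppercase=text.isupper(),
        bold_with_font_weight=any(is_bold(key, value) for key, value in dominant),
    )


heading = style_from({("font-weight", "700"): 95.0}, "RISK FACTORS")
body = style_from({("font-weight", "700"): 40.0}, "plain text")
```

Under these assumptions `heading` is truthy and `body` is falsy, which is the kind of check a caller could use to decide whether text is highlighted at all.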
class sec_parser.processing_steps.title_classifier.TitleElement

Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement

The TitleElement class represents the title of a paragraph or other content object. It serves as a semantic marker, providing context and structure to the document.

class sec_parser.processing_steps.title_classifier.TitleClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

TitleClassifier converts elements into TitleElement instances by scanning a list of semantic elements and replacing suitable candidates.

The “_unique_styles_by_order” tuple:

  • Represents an ordered set of unique styles found in the document.

  • Preserves the order of insertion, which determines the hierarchical level of each style.

  • Assumes that earlier “highlight” styles correspond to higher level paragraph or section headings.
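The ordering rule in the list above can be illustrated with a small standalone sketch. `LevelTracker` and `level_of` are hypothetical names, and plain strings stand in for TextStyle objects; the only idea taken from the text is that the position at which a style first appears becomes its level.

```python
from __future__ import annotations


class LevelTracker:
    """Hypothetical helper mirroring the ordered-unique-styles idea."""

    def __init__(self) -> None:
        self._unique_styles_by_order: tuple[str, ...] = ()

    def _add_unique_style(self, style: str) -> None:
        # Append only unseen styles so that insertion order is preserved.
        if style not in self._unique_styles_by_order:
            self._unique_styles_by_order = (*self._unique_styles_by_order, style)

    def level_of(self, style: str) -> int:
        # The index of first appearance serves as the level (0 = highest).
        self._add_unique_style(style)
        return self._unique_styles_by_order.index(style)


tracker = LevelTracker()
styles_in_document_order = ["bold+centered", "bold", "bold+centered", "italic"]
levels = [tracker.level_of(style) for style in styles_in_document_order]
print(levels)  # [0, 1, 0, 2]
```

A style that reappears later in the document keeps the level it was given on first sight, so headings with the same look end up at the same depth.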

_add_unique_style(style: sec_parser.semantic_elements.highlighted_text_element.TextStyle) -> None

Add a new unique style if not already present.

_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, _: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

Process each element and convert it to a TitleElement if necessary.