sec_parser.processing_steps.page_header_classifier

Classes

AbstractElementwiseProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.

ElementProcessingContext

The ElementProcessingContext class is designed to provide context information for elementwise processing steps.

HighlightedTextElement

The HighlightedTextElement class, among other uses, is an intermediate step in identifying title elements.

TextStyle

PageHeaderElement

The PageHeaderElement class represents a page header within a document.

PageHeaderCandidate

PageHeaderClassifier

Module Contents

class sec_parser.processing_steps.page_header_classifier.AbstractElementwiseProcessingStep(*, types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep

AbstractElementwiseProcessingStep class is used to iterate over all Semantic Elements with or without applying transformations.

_NUM_ITERATIONS = 1
abstract _process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

_process_element method is responsible for transforming a single semantic element into another.

It can also be utilized to simply iterate over all elements without applying any transformations.

_process_recursively(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, _context: sec_parser.processing_steps.abstract_classes.processing_context.ElementProcessingContext) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
_process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
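
The elementwise pattern described above can be sketched without the library. This is a minimal illustration, not sec_parser's implementation: Element, UppercaseElement, Context, ElementwiseStep, and UppercaseStep are hypothetical stand-ins, and the sketch skips the recursion into nested elements that _process_recursively suggests.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    text: str


@dataclass
class UppercaseElement(Element):
    """Hypothetical element type produced by the example step."""


@dataclass(frozen=True)
class Context:
    """Stand-in for ElementProcessingContext: carries the iteration index."""

    iteration: int


class ElementwiseStep(ABC):
    """Visits every element once per iteration and lets a subclass transform it."""

    _NUM_ITERATIONS = 1

    @abstractmethod
    def _process_element(self, element: Element, context: Context) -> Element:
        """Return a replacement element, or the same element to leave it unchanged."""

    def process(self, elements: list[Element]) -> list[Element]:
        for iteration in range(self._NUM_ITERATIONS):
            context = Context(iteration=iteration)
            elements = [self._process_element(e, context) for e in elements]
        return elements


class UppercaseStep(ElementwiseStep):
    """Example subclass: re-types elements whose text is entirely uppercase."""

    def _process_element(self, element: Element, context: Context) -> Element:
        if element.text.isupper():
            return UppercaseElement(element.text)
        return element  # pure iteration: nothing is transformed


result = UppercaseStep().process([Element("RISK FACTORS"), Element("plain text")])
```

Returning the input element unchanged is what makes the same hook usable for plain iteration as well as transformation.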
class sec_parser.processing_steps.page_header_classifier.ElementProcessingContext

The ElementProcessingContext class is designed to provide context information for elementwise processing steps.

iteration: int
class sec_parser.processing_steps.page_header_classifier.HighlightedTextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, style: TextStyle | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The HighlightedTextElement class, among other uses, is an intermediate step in identifying title elements.

For example, elements with specific styles (such as bold or italic text) are first classified as HighlightedTextElements. These are later examined to determine whether they should be considered TitleElements.

classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, style: TextStyle | None = None) -> HighlightedTextElement

Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]
class sec_parser.processing_steps.page_header_classifier.TextStyle
PERCENTAGE_THRESHOLD = 80
BOLD_THRESHOLD = 600
is_all_uppercase: bool = False
bold_with_font_weight: bool = False
italic: bool = False
centered: bool = False
underline: bool = False
__bool__() -> bool
classmethod from_style_and_text(style_percentage: dict[tuple[str, str], float], text: str) -> TextStyle
classmethod _is_bold_with_font_weight(key: str, value: str) -> bool
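
The TextStyle members above are undocumented, so the following is only a guess at how they might fit together. It assumes that style_percentage maps (CSS property, value) pairs to the percentage of text they cover, that a style counts only at or above PERCENTAGE_THRESHOLD, and that a numeric font-weight at or above BOLD_THRESHOLD counts as bold. SketchTextStyle is a hypothetical name, and the uppercase check is simplified to str.isupper().

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTextStyle:
    """Hypothetical re-creation of TextStyle; the logic is assumed, not confirmed."""

    PERCENTAGE_THRESHOLD = 80  # assumed: minimum % of text a style must cover
    BOLD_THRESHOLD = 600       # assumed: minimum numeric font-weight for bold

    is_all_uppercase: bool = False
    bold_with_font_weight: bool = False
    italic: bool = False
    centered: bool = False
    underline: bool = False

    def __bool__(self) -> bool:
        # A style is truthy when at least one highlight flag is set.
        return any(
            (
                self.is_all_uppercase,
                self.bold_with_font_weight,
                self.italic,
                self.centered,
                self.underline,
            )
        )

    @classmethod
    def from_style_and_text(
        cls, style_percentage: dict[tuple[str, str], float], text: str
    ) -> SketchTextStyle:
        # Keep only the styles that cover enough of the text.
        dominant = {
            key
            for key, pct in style_percentage.items()
            if pct >= cls.PERCENTAGE_THRESHOLD
        }
        return cls(
            is_all_uppercase=text.isupper(),  # simplification of the real check
            bold_with_font_weight=any(cls._is_bold(k, v) for k, v in dominant),
            italic=("font-style", "italic") in dominant,
            centered=("text-align", "center") in dominant,
            underline=("text-decoration", "underline") in dominant,
        )

    @classmethod
    def _is_bold(cls, key: str, value: str) -> bool:
        if key != "font-weight":
            return False
        if value == "bold":
            return True
        try:
            return int(value) >= cls.BOLD_THRESHOLD
        except ValueError:
            return False


style = SketchTextStyle.from_style_and_text(
    {("font-weight", "700"): 100.0, ("font-style", "italic"): 40.0},
    "Overview",
)
```

In this sketch the italic entry is ignored because it covers only 40% of the text, while the font-weight of 700 clears both thresholds.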
class sec_parser.processing_steps.page_header_classifier.PageHeaderElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: IrrelevantElement

The PageHeaderElement class represents a page header within a document. It is a subclass of the IrrelevantElement class and is used to identify and handle page headers in the document, such as current section titles and company names.

class sec_parser.processing_steps.page_header_classifier.PageHeaderCandidate
TEXT_LENGTH_THRESHOLD = 100
OCCURRENCE_THRESHOLD = 5
MOST_COMMON_CANDIDATE_LIMIT = None
text: str
style: sec_parser.semantic_elements.highlighted_text_element.TextStyle | None
class sec_parser.processing_steps.page_header_classifier.PageHeaderClassifier(types_to_process: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, types_to_exclude: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)

Bases: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.AbstractElementwiseProcessingStep

_NUM_ITERATIONS = 2
_process_element(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, context: sec_parser.processing_steps.abstract_classes.abstract_elementwise_processing_step.ElementProcessingContext) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_find_page_header_candidates(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> None
_classify_elements(element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) -> sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
_get_most_common_candidates() -> dict[PageHeaderCandidate, int]
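
The member names and _NUM_ITERATIONS = 2 suggest a two-pass design: a first pass that counts short, repeated texts as candidates, and a second pass that classifies elements whose text recurs often enough. The sketch below illustrates that idea under assumptions. Candidate and TwoPassHeaderSketch are hypothetical names, the direction of both threshold comparisons is a guess, and style matching and MOST_COMMON_CANDIDATE_LIMIT are left out.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass

TEXT_LENGTH_THRESHOLD = 100  # assumed: only texts shorter than this can be headers
OCCURRENCE_THRESHOLD = 5     # assumed: minimum repeats before text counts as a header


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, hashable stand-in for PageHeaderCandidate (text only)."""

    text: str


class TwoPassHeaderSketch:
    """Counts candidates in pass one, classifies in pass two."""

    def __init__(self) -> None:
        self._counts = Counter()

    def find_candidates(self, text: str) -> None:
        # Pass 1: record every sufficiently short text.
        if len(text) < TEXT_LENGTH_THRESHOLD:
            self._counts[Candidate(text)] += 1

    def classify(self, text: str) -> str:
        # Pass 2: texts that recur often enough are treated as page headers.
        if self._counts[Candidate(text)] >= OCCURRENCE_THRESHOLD:
            return "page_header"
        return "other"

    def run(self, texts: list[str]) -> list[str]:
        for text in texts:
            self.find_candidates(text)
        return [self.classify(text) for text in texts]


texts = ["Acme Corp | Form 10-K"] * 5 + ["Revenue grew."]
labels = TwoPassHeaderSketch().run(texts)
```

Two passes are needed because an element cannot be judged until the whole document has been counted; in the real class, the iteration field of the context presumably tells _process_element which pass is running.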