sec_parser.processing_engine.core
Classes
- AbstractSemanticElementParser: Responsible for parsing semantic elements from HTML documents.
- Edgar10QParser: The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports.
- Edgar10KParser: The Edgar10KParser class is responsible for parsing SEC EDGAR 10-K annual reports.
Module Contents
- class sec_parser.processing_engine.core.AbstractSemanticElementParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)
Bases: abc.ABC
Responsible for parsing semantic elements from HTML documents. It takes raw HTML and turns it into a list of objects representing semantic elements.
At a High Level:
Extract top-level HTML tags from the document.
Transform these tags into a list of more specific semantic elements step-by-step.
For Advanced Users:
The parsing process is implemented as a sequence of steps and allows for customization at each step.
Pipeline Pattern: Raw HTML tags are processed sequentially by an ordered list of steps, akin to a Finite State Machine (FSM): each element transitions through the states defined by that sequence of processing steps.
Strategy Pattern: Each step is customizable. You can replace, remove, or extend any of the existing steps with your own or an inherited implementation. Alternatively, you can replace the entire pipeline with your own process.
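The two patterns above can be illustrated with a minimal, self-contained sketch. It does not import sec_parser: the names `Step`, `StripWhitespace`, `DropEmpty`, `MarkTitles`, `SketchParser`, and `steps_without_titles` are hypothetical stand-ins, and plain strings stand in for semantic elements. Only the general shape (a `get_steps` callable that returns an ordered list of steps, with `get_default_steps()` as the fallback) is taken from the signatures documented here.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Optional


class Step(ABC):
    """Hypothetical stand-in for a processing step (not the sec_parser class)."""

    @abstractmethod
    def process(self, elements: List[str]) -> List[str]:
        """Take the current list of elements and return the transformed list."""


class StripWhitespace(Step):
    def process(self, elements: List[str]) -> List[str]:
        return [e.strip() for e in elements]


class DropEmpty(Step):
    def process(self, elements: List[str]) -> List[str]:
        return [e for e in elements if e]


class MarkTitles(Step):
    def process(self, elements: List[str]) -> List[str]:
        # Toy rule for illustration only: all-uppercase text is treated as a title.
        return [f"TITLE: {e}" if e.isupper() else e for e in elements]


class SketchParser:
    """Hypothetical parser showing the pipeline and strategy patterns."""

    def __init__(self, get_steps: Optional[Callable[[], List[Step]]] = None):
        # Strategy pattern: the caller may supply a whole replacement pipeline.
        self._get_steps = get_steps or self.get_default_steps

    def get_default_steps(self) -> List[Step]:
        return [StripWhitespace(), DropEmpty(), MarkTitles()]

    def parse(self, raw_elements: List[str]) -> List[str]:
        # Pipeline pattern: each step consumes the output of the previous one.
        elements = list(raw_elements)
        for step in self._get_steps():
            elements = step.process(elements)
        return elements


def steps_without_titles() -> List[Step]:
    # Start from the defaults and remove one step.
    defaults = SketchParser().get_default_steps()
    return [s for s in defaults if not isinstance(s, MarkTitles)]


raw = ["  PART I ", "", " Revenue grew. "]
default_result = SketchParser().parse(raw)
custom_result = SketchParser(get_steps=steps_without_titles).parse(raw)
print(default_result)
print(custom_result)
```

Passing a callable rather than a ready-made list means a fresh set of step objects is built for each parse, so steps that keep internal state do not leak it between documents. That motivation is an assumption about the design, not something this page states.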
- _get_steps
- _parsing_options
- _html_tag_parser
- abstract get_default_steps() -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
- parse(html: str | bytes, *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
- parse_from_tags(root_tags: list[sec_parser.processing_engine.html_tag.HtmlTag], *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
- class sec_parser.processing_engine.core.Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)
Bases: AbstractSemanticElementParser
The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports. It transforms the HTML documents into a list of elements. Each element in this list represents a part of the visual structure of the original document.
- get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
- get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]
- class sec_parser.processing_engine.core.Edgar10KParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)
Bases: AbstractSemanticElementParser
The Edgar10KParser class is responsible for parsing SEC EDGAR 10-K annual reports. It transforms the HTML documents into a list of elements. Each element in this list represents a part of the visual structure of the original document.
- get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
- get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]