sec_parser.processing_engine.core
=================================

.. py:module:: sec_parser.processing_engine.core


Classes
-------

.. autoapisummary::

   sec_parser.processing_engine.core.AbstractSemanticElementParser
   sec_parser.processing_engine.core.Edgar10QParser
   sec_parser.processing_engine.core.Edgar10KParser


Module Contents
---------------

.. py:class:: AbstractSemanticElementParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`abc.ABC`

   Responsible for parsing semantic elements from HTML documents.
   It takes raw HTML and turns it into a list of objects
   representing semantic elements.

   At a High Level:
   ==================

   1. Extract top-level HTML tags from the document.
   2. Transform these tags into a list of more specific semantic
      elements step-by-step.

   Why Focus on Top-Level Tags?
   ============================

   SEC filings usually have a flat HTML structure, which simplifies the
   parsing process. Each top-level HTML tag often corresponds directly to
   a single semantic element. This differs from many websites, where HTML
   tags are deeply nested and require more complex parsing.

   For Advanced Users:
   ====================

   The parsing process is implemented as a sequence of steps and
   allows for customization at each step.

   - Pipeline Pattern: Raw HTML tags are processed sequentially. The steps
     follow an ordered, step-by-step approach, akin to a Finite State
     Machine (FSM). Each element transitions through the states defined by
     the sequence of processing steps.

   - Strategy Pattern: Each step is customizable. You can replace, remove,
     or extend any of the existing steps with your own or an inherited
     implementation. Alternatively, you can replace the entire pipeline
     with your own process.


   .. py:attribute:: _get_steps


   .. py:attribute:: _parsing_options


   .. py:attribute:: _html_tag_parser


   .. py:method:: get_default_steps() -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
      :abstractmethod:



   .. py:method:: parse(html: str | bytes, *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]


   .. py:method:: parse_from_tags(root_tags: list[sec_parser.processing_engine.html_tag.HtmlTag], *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) -> list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]


.. py:class:: Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`AbstractSemanticElementParser`

   The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q
   quarterly reports. It transforms the HTML documents into a list
   of elements. Each element in this list represents a part of
   the visual structure of the original document.


   .. py:method:: get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]


   .. py:method:: get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]


.. py:class:: Edgar10KParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

   Bases: :py:obj:`AbstractSemanticElementParser`

   The Edgar10KParser class is responsible for parsing SEC EDGAR 10-K
   annual reports. It transforms the HTML documents into a list
   of elements. Each element in this list represents a part of
   the visual structure of the original document.


   .. py:method:: get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) -> list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]


   .. py:method:: get_default_single_element_checks() -> list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]
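

The pipeline and strategy patterns described for ``AbstractSemanticElementParser`` can be illustrated with a minimal, self-contained sketch. It is not the sec_parser implementation: ``Step``, ``StripStep``, ``UppercaseItemStep``, ``SketchParser``, and ``custom_steps`` are hypothetical names, and elements are modeled as plain strings rather than HTML tags or semantic element objects. The sketch only mirrors the documented shape, in which an optional ``get_steps`` callable overrides ``get_default_steps()``.

.. code-block:: python

   from __future__ import annotations

   from abc import ABC, abstractmethod
   from typing import Callable


   class Step(ABC):
       """Hypothetical stand-in for a processing step."""

       @abstractmethod
       def process(self, elements: list[str]) -> list[str]:
           ...


   class StripStep(Step):
       """Trim whitespace and drop elements that end up empty."""

       def process(self, elements: list[str]) -> list[str]:
           stripped = [e.strip() for e in elements]
           return [e for e in stripped if e]


   class UppercaseItemStep(Step):
       """Uppercase elements that start with 'Item ' (a toy heading rule)."""

       def process(self, elements: list[str]) -> list[str]:
           return [
               e.upper() if e.lower().startswith("item ") else e
               for e in elements
           ]


   class SketchParser:
       """Toy parser: runs every step, in order, over the element list."""

       def __init__(
           self, get_steps: Callable[[], list[Step]] | None = None
       ) -> None:
           # Assumption: fall back to the default steps when none are given.
           self._get_steps = get_steps or self.get_default_steps

       def get_default_steps(self) -> list[Step]:
           return [StripStep(), UppercaseItemStep()]

       def parse_from_tags(self, root_tags: list[str]) -> list[str]:
           elements = list(root_tags)
           # Pipeline pattern: the output of one step feeds the next.
           for step in self._get_steps():
               elements = step.process(elements)
           return elements


   def custom_steps() -> list[Step]:
       # Strategy pattern: reuse the defaults but remove one step.
       defaults = SketchParser().get_default_steps()
       return [s for s in defaults if not isinstance(s, UppercaseItemStep)]


   tags = ["  Item 1. Financial Statements ", "   ", "Revenue increased."]
   default_result = SketchParser().parse_from_tags(tags)
   custom_result = SketchParser(get_steps=custom_steps).parse_from_tags(tags)

Passing a callable rather than a list means a fresh list of steps is built on every parse, so steps that hold state are not shared between documents. Whether the library relies on this for the same reason is an assumption here.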