sec_parser.processing_engine

The processing_engine subpackage contains the core logic for parsing SEC documents. It is designed to work in conjunction with the steps from the processing_steps subpackage to perform tasks like section identification, title parsing, and text extraction.

Classes

AbstractSemanticElementParser

Responsible for parsing semantic elements from HTML documents.

Edgar10QParser

The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports.

HtmlTag

The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

HtmlTagParser

The HtmlTagParser parses an HTML document using BeautifulSoup4.

Package Contents

class sec_parser.processing_engine.AbstractSemanticElementParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

Bases: abc.ABC

Responsible for parsing semantic elements from HTML documents. It takes raw HTML and turns it into a list of objects representing semantic elements.

At a High Level:

  1. Extract top-level HTML tags from the document.

  2. Transform these tags into a list of more specific semantic elements step-by-step.

Why Focus on Top-Level Tags?

SEC filings usually have a flat HTML structure, which simplifies the parsing process. Each top-level HTML tag often corresponds directly to a single semantic element. This differs from many websites, where HTML tags are deeply nested and require more complex parsing.
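The idea of working on top-level tags can be illustrated with a minimal sketch. It uses only the standard library's html.parser rather than sec_parser or BeautifulSoup4, and the names TopLevelTagCollector and top_level_tags are hypothetical helpers invented for this illustration, not part of the package.

```python
from html.parser import HTMLParser


class TopLevelTagCollector(HTMLParser):
    """Hypothetical helper: records the name and text of each top-level tag."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.top_level: list[tuple[str, str]] = []
        self._name = ""
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.depth == 0:
            # A new top-level tag starts: reset the text buffer.
            self._name, self._text = tag, []
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.top_level.append((self._name, "".join(self._text).strip()))

    def handle_data(self, data):
        if self.depth > 0:
            self._text.append(data)


def top_level_tags(html: str) -> list[tuple[str, str]]:
    collector = TopLevelTagCollector()
    collector.feed(html)
    collector.close()
    return collector.top_level


# A flat document: every top-level tag maps to one candidate element.
flat_filing = (
    "<p><b>Item 1.</b> Financial Statements</p>"
    "<table><tr><td>Revenue</td></tr></table>"
    "<p>Notes</p>"
)
elements = top_level_tags(flat_filing)
```

In this flat example each of the three top-level tags becomes one candidate element, which is the property the paragraph above relies on.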

For Advanced Users:

The parsing process is implemented as a sequence of steps and allows for customization at each step.

  • Pipeline Pattern: Raw HTML tags are processed sequentially by an ordered series of steps, akin to a Finite State Machine (FSM). Each element transitions through the states defined by the sequence of processing steps.

  • Strategy Pattern: Each step is customizable. You can replace, remove, or extend any of the existing steps with your own or an inherited implementation. Alternatively, you can replace the entire pipeline with your own process.
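The two patterns above can be sketched in a few lines of plain Python. This is an illustration of the general idea under simplifying assumptions, not the package's implementation: Element, run_pipeline, and the individual step functions are hypothetical, and only the get_steps / get_default_steps naming is borrowed from the signatures on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a semantic element."""
    text: str
    kind: str = "unclassified"


Step = Callable[[list[Element]], list[Element]]


def classify_titles(elements: list[Element]) -> list[Element]:
    # Toy rule (an assumption): all-uppercase text is treated as a title.
    return [
        Element(e.text, "title") if e.kind == "unclassified" and e.text.isupper() else e
        for e in elements
    ]


def classify_text(elements: list[Element]) -> list[Element]:
    # Anything still unclassified falls through to plain text.
    return [Element(e.text, "text") if e.kind == "unclassified" else e for e in elements]


def get_default_steps() -> list[Step]:
    return [classify_titles, classify_text]


def run_pipeline(
    elements: list[Element],
    get_steps: Optional[Callable[[], list[Step]]] = None,
) -> list[Element]:
    # Pipeline: each step receives the output of the previous one.
    for step in (get_steps or get_default_steps)():
        elements = step(elements)
    return elements


def drop_empty(elements: list[Element]) -> list[Element]:
    return [e for e in elements if e.text.strip()]


def get_custom_steps() -> list[Step]:
    # Strategy: extend the default pipeline with an extra step up front.
    return [drop_empty, *get_default_steps()]


raw = [Element("PART I"), Element("  "), Element("Revenue grew.")]
default_result = run_pipeline(raw)
custom_result = run_pipeline(raw, get_custom_steps)
```

Passing a factory function rather than a list of steps means every run builds a fresh pipeline, which is one plausible reason for a get_steps callable in a constructor signature.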

abstract get_default_steps() list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
parse(html: str | bytes, *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
parse_from_tags(root_tags: list[sec_parser.processing_engine.html_tag.HtmlTag], *, unwrap_elements: bool | None = None, include_containers: bool | None = None, include_irrelevant_elements: bool | None = None) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
class sec_parser.processing_engine.Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)

Bases: AbstractSemanticElementParser

The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports. It transforms the HTML documents into a list of elements. Each element in this list represents a part of the visual structure of the original document.

get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
get_default_single_element_checks() list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]
class sec_parser.processing_engine.HtmlTag(bs4_element: bs4.PageElement)

The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

It serves three main purposes:

  1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we can isolate our application logic from the library specifics. This makes it easier to modify or even replace the HTML parsing library in the future without extensive codebase changes.

  2. Usability: The HtmlTag class provides a convenient location to add extension methods or additional properties not offered by the native BeautifulSoup4 Tag class. This enhances the usability of the class.

  3. Caching: The HtmlTag class also caches processing results, improving performance by avoiding unnecessary re-computation.
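A minimal sketch of the wrapper idea, assuming a stand-in backend: it wraps xml.etree.ElementTree elements instead of BeautifulSoup4 tags so that it runs with the standard library alone. TagWrapper and its text_computations counter are hypothetical and exist only to make the decoupling and caching visible.

```python
import xml.etree.ElementTree as ET
from functools import cached_property


class TagWrapper:
    """Hypothetical stand-in for a wrapper class such as HtmlTag."""

    def __init__(self, element: ET.Element) -> None:
        # Decoupling: the backend type is only touched inside this class.
        self._element = element
        self.text_computations = 0  # instrumentation for this example only

    @property
    def name(self) -> str:
        return self._element.tag

    @cached_property
    def text(self) -> str:
        # Caching: computed on first access, then stored on the instance.
        self.text_computations += 1
        return "".join(self._element.itertext())

    def get_children(self) -> list["TagWrapper"]:
        # Usability: callers receive wrappers, never raw backend objects.
        return [TagWrapper(child) for child in self._element]


tag = TagWrapper(ET.fromstring("<div><p><b>Item 1.</b> Business</p></div>"))
first = tag.text
second = tag.text
```

After both accesses, text_computations is still 1, because the second read is served from the cache.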

property parent: HtmlTag | None
get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) str
_generate_preview(text: str) str

Generate a preview of the text with a specified length.

to_dict() frozendict.frozendict

Return an immutable dictionary (frozendict) representation of the HTML tag.

contains_words() bool

Return True if the HTML tag contains text.

property text: str

text property recursively extracts text from the child tags. The result is cached as the underlying data doesn’t change.

property name: str

Returns the tag name, e.g. ‘div’ for <div>.

has_tag_children() bool
get_children() list[HtmlTag]
contains_tag(name: str, *, include_self: bool = False) bool

contains_tag method checks if the current HTML tag contains a descendant tag with the specified name. For example, calling contains_tag(“b”) on an HtmlTag instance representing “<div><p><b>text</b></p></div>” would return True, as there is a ‘b’ tag within the descendants of the ‘div’ tag.
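The descendant check described above can be sketched with the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup4. The free function below is a hypothetical equivalent written for illustration; it mirrors the documented signature but is not the package's code.

```python
import xml.etree.ElementTree as ET


def contains_tag(element: ET.Element, name: str, *, include_self: bool = False) -> bool:
    if include_self and element.tag == name:
        return True
    # Element.iter() also yields the element itself, so it is skipped here.
    return any(node is not element for node in element.iter(name))


div = ET.fromstring("<div><p><b>text</b></p></div>")
has_b = contains_tag(div, "b")
has_div = contains_tag(div, "div")
has_div_including_self = contains_tag(div, "div", include_self=True)
```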

has_text_outside_tags(tags: list[str] | str) bool

has_text_outside_tags checks whether the tag has any text outside the specified tags. For example, calling has_text_outside_tags([“b”]) on an HtmlTag instance representing “<div><p><b>text</b>extra text</p></div>” would return True, as there is text outside the ‘b’ tag within the descendants of the ‘div’ tag.
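One way to implement such a check is a recursive walk that ignores everything inside the excluded tags. The sketch below uses xml.etree.ElementTree as a stand-in backend and treats whitespace-only strings as "no text", which is an assumption; the function is a hypothetical illustration rather than the package's implementation.

```python
import xml.etree.ElementTree as ET
from typing import Union


def has_text_outside_tags(element: ET.Element, tags: Union[list[str], str]) -> bool:
    excluded = {tags} if isinstance(tags, str) else set(tags)

    def walk(node: ET.Element) -> bool:
        if node.tag in excluded:
            return False  # everything inside an excluded tag is ignored
        if node.text and node.text.strip():
            return True
        for child in node:
            if walk(child):
                return True
            # In ElementTree, text that follows a child belongs to the parent.
            if child.tail and child.tail.strip():
                return True
        return False

    return walk(element)


mixed = ET.fromstring("<div><p><b>text</b>extra text</p></div>")
bold_only = ET.fromstring("<div><p><b>text</b></p></div>")
mixed_result = has_text_outside_tags(mixed, ["b"])
bold_only_result = has_text_outside_tags(bold_only, "b")
```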

without_tags(names: collections.abc.Iterable[str]) HtmlTag

without_tags method creates a copy of the current HTML tag and removes all descendant tags with the specified names. For example, calling without_tags([“b”, “i”]) on an HtmlTag instance representing “<div><b>foo</b><p>bar<i>bax</i></p></div>” would return a new HtmlTag instance representing “<div><p>bar</p></div>”.
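The copy-then-prune behaviour can be sketched as follows, again with xml.etree.ElementTree standing in for BeautifulSoup4. The function is hypothetical. Keeping the text that follows a removed tag is a design choice of this sketch, since ElementTree would otherwise drop it together with the tag; whether the package behaves the same way is not stated on this page.

```python
import copy
import xml.etree.ElementTree as ET
from collections.abc import Iterable


def without_tags(element: ET.Element, names: Iterable[str]) -> ET.Element:
    excluded = set(names)
    result = copy.deepcopy(element)  # the original tree is left untouched

    def prune(node: ET.Element) -> None:
        previous = None
        for child in list(node):
            if child.tag in excluded:
                # Re-attach the text that followed the removed tag.
                if child.tail:
                    if previous is None:
                        node.text = (node.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                node.remove(child)
            else:
                prune(child)
                previous = child

    prune(result)
    return result


original = ET.fromstring("<div><b>foo</b><p>bar<i>bax</i></p></div>")
stripped = without_tags(original, ["b", "i"])
stripped_html = ET.tostring(stripped, encoding="unicode")
original_html = ET.tostring(original, encoding="unicode")
```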

count_tags(name: str) int

count_tags method counts the number of descendant tags with the specified name within the current HTML tag. For example, calling count_tags(“b”) on an HtmlTag instance representing “<div><p><b>text</b></p><b>more text</b></div>” would return 2, as there are two ‘b’ tags within the descendants of the ‘div’ tag.
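A short sketch of the counting logic, using xml.etree.ElementTree as a stand-in backend. The function is a hypothetical illustration; excluding the tag itself from the count follows the word "descendant" in the description above.

```python
import xml.etree.ElementTree as ET


def count_tags(element: ET.Element, name: str) -> int:
    # Element.iter() includes the element itself, so it is excluded explicitly.
    return sum(1 for node in element.iter(name) if node is not element)


div = ET.fromstring("<div><p><b>text</b></p><b>more text</b></div>")
bold_count = count_tags(div, "b")
```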

is_unary_tree() bool

is_unary_tree determines if a BeautifulSoup tag forms a unary tree. In a unary tree, each node has at most one child.

However, if a non-leaf node contains a non-empty string even without a tag surrounding it, the tree is not considered unary.

Additionally, if the tag is a ‘table’, the function will return True regardless of its children. This is because in the context of this application, ‘table’ tags are always considered unary.
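The three rules above can be sketched as a short recursive function over xml.etree.ElementTree elements. This is a hypothetical illustration, not the package's code. It assumes that whitespace-only strings do not count as text and that the ‘table’ exception applies at any depth, neither of which is confirmed by this page.

```python
import xml.etree.ElementTree as ET


def is_unary_tree(element: ET.Element) -> bool:
    if element.tag == "table":
        return True  # tables are treated as unary regardless of their children
    children = list(element)
    if not children:
        return True  # a leaf is trivially unary
    if len(children) > 1:
        return False
    # A non-leaf node must not carry loose text next to its single child.
    only_child = children[0]
    if (element.text or "").strip() or (only_child.tail or "").strip():
        return False
    return is_unary_tree(only_child)


chain = ET.fromstring("<div><p><b>text</b></p></div>")
siblings = ET.fromstring("<div><p>a</p><p>b</p></div>")
loose_text = ET.fromstring("<div>loose<p>a</p></div>")
with_table = ET.fromstring("<div><table><tr><td>1</td><td>2</td></tr></table></div>")
results = [is_unary_tree(t) for t in (chain, siblings, loose_text, with_table)]
```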

get_text_styles_metrics() dict[tuple[str, str], float]

Compute the percentage distribution of various CSS styles within the text content of a given HTML tag and its descendants.

This function iterates through all the text nodes within the tag, recursively includes text from child elements, and calculates the effective styles applied to each text segment.

It aggregates these styles and computes their percentage distribution based on the length of text they apply to.

The function uses BeautifulSoup’s recursive text search and parent traversal features. It returns a dictionary containing the aggregated style metrics (the percentage distribution of styles).

Each dictionary entry maps a unique style, given as a (property, value) pair, to the percentage of text it affects.
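The aggregation described above can be sketched as a length-weighted tally. The code below is a simplified, hypothetical illustration built on xml.etree.ElementTree: it reads only inline style attributes, lets every property inherit from parent to child, and measures text length after stripping surrounding whitespace. All three are assumptions and may differ from what the package does.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_style(style: str) -> dict[str, str]:
    """Split an inline style attribute into {property: value}."""
    declarations = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            prop, value = declaration.split(":", 1)
            declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def text_style_metrics(root: ET.Element) -> dict[tuple[str, str], float]:
    char_counts: dict[tuple[str, str], int] = defaultdict(int)
    total_chars = 0

    def add_text(text, styles: dict[str, str]) -> None:
        nonlocal total_chars
        length = len((text or "").strip())
        total_chars += length
        for style in styles.items():  # each item is a (property, value) pair
            char_counts[style] += length

    def walk(node: ET.Element, inherited: dict[str, str]) -> None:
        # Effective styles: the parent's styles overridden by the node's own.
        styles = {**inherited, **parse_style(node.get("style", ""))}
        add_text(node.text, styles)
        for child in node:
            walk(child, styles)
            add_text(child.tail, styles)  # tail text is styled by the parent

    walk(root, {})
    if total_chars == 0:
        return {}
    return {style: 100 * count / total_chars for style, count in char_counts.items()}


root = ET.fromstring(
    '<div style="font-size: 10pt">'
    '<p style="font-weight: bold">Item</p>'
    "<p>Risk Factors</p>"
    "</div>"
)
metrics = text_style_metrics(root)
```

Here the bold paragraph contributes 4 of the 16 counted characters, so (‘font-weight’, ‘bold’) comes out at 25 percent, while the font size set on the outer tag applies to all of the text.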

get_approx_table_metrics() sec_parser.utils.bs4_.approx_table_metrics.ApproxTableMetrics | None
is_table_of_content() bool
table_to_markdown() str
static _to_tag(element: bs4.PageElement) bs4.Tag
static wrap_tags_in_new_parent(parent_tag_name: str, tags: collections.abc.Iterable[HtmlTag]) HtmlTag
count_text_matches_in_descendants(predicate: Callable[[str], bool], *, exclude_links: bool | None = None) int
class sec_parser.processing_engine.HtmlTagParser(parser_backend: str | None = None)

Bases: AbstractHtmlTagParser

The HtmlTagParser parses an HTML document using BeautifulSoup4. It then wraps the parsed bs4.Tag objects into HtmlTag objects.

parse(html: str | bytes) list[sec_parser.processing_engine.html_tag.HtmlTag]
_parse_to_bs4(html: str | bytes) bs4.Tag