sec_parser
Exceptions
- SecParserError: Base exception class for sec_parser.
- SecParserRuntimeError: Base exception class for sec_parser.
- SecParserValueError: Base exception class for sec_parser.
Classes
- Edgar10QParser: The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports.
- HtmlTag: The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.
- AbstractProcessingStep: AbstractProcessingStep class for transforming a list of elements.
- CompositeSemanticElement: CompositeSemanticElement acts as a container for other semantic elements, especially for cases where a single HTML root tag wraps multiple elements.
- AbstractSemanticElement: In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose.
- EmptyElement: The EmptyElement class represents an HTML element that does not contain any content.
- ImageElement: The ImageElement class represents a standard image within a document.
- IrrelevantElement: The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content.
- NotYetClassifiedElement: The NotYetClassifiedElement class represents an element whose type has not yet been determined.
- PageHeaderElement: The PageHeaderElement class represents a page header within a document.
- PageNumberElement: The PageNumberElement class represents a page number within a document.
- SupplementaryText: The SupplementaryText class captures various types of supplementary text within a document, such as unit qualifiers, additional notes, and disclaimers.
- TextElement: The TextElement class represents a standard text paragraph within a document.
- TableElement: The TableElement class represents a standard table within a document.
- TitleElement: The TitleElement class represents the title of a paragraph or other content object.
- TopSectionTitle: The TopSectionTitle class represents the title and the beginning of a top-level section of a document.
- AbstractNestingRule: AbstractNestingRule is a base class for defining rules for nesting semantic elements.
- TreeBuilder: Builds a semantic tree from a list of semantic elements.
- TreeNode: The TreeNode class is a fundamental part of the semantic tree structure.
Functions
- render: render function is used to visualize the structure of the semantic tree.
Package Contents
- exception sec_parser.SecParserError
Bases: Exception
Base exception class for sec_parser. All custom exceptions in sec_parser are inherited from this class.
- exception sec_parser.SecParserRuntimeError
Bases: SecParserError, RuntimeError
Base exception class for sec_parser. All custom exceptions in sec_parser are inherited from this class.
- exception sec_parser.SecParserValueError
Bases: SecParserError, ValueError
Base exception class for sec_parser. All custom exceptions in sec_parser are inherited from this class.
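Because the two specialised exceptions also derive from a built-in exception, a caller can catch them either through SecParserError or through the built-in type. The sketch below re-declares the three classes locally as stand-ins so that it runs without the package installed; only the base classes listed above come from this page, and the classify helper is hypothetical.

```python
# Local stand-ins that mirror the documented base classes. In real code these
# would be imported from sec_parser instead of being redefined here.
class SecParserError(Exception):
    pass


class SecParserRuntimeError(SecParserError, RuntimeError):
    pass


class SecParserValueError(SecParserError, ValueError):
    pass


def classify(exc: Exception) -> str:
    """Hypothetical helper: report which except clause handles the exception."""
    try:
        raise exc
    except SecParserError:
        # Catches every exception that derives from the package base class.
        return "library error"
    except (ValueError, RuntimeError):
        # Built-in errors that did not originate from the package hierarchy.
        return "built-in error"


caught_by_base = classify(SecParserValueError("bad input"))
caught_builtin = classify(ValueError("unrelated"))
```

Catching the base class first keeps library failures separate from unrelated ValueError or RuntimeError instances raised elsewhere.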
- class sec_parser.Edgar10QParser(get_steps: Callable[[], list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]] | None = None, *, parsing_options: sec_parser.processing_engine.types.ParsingOptions | None = None, html_tag_parser: sec_parser.processing_engine.html_tag_parser.AbstractHtmlTagParser | None = None)
Bases: AbstractSemanticElementParser
The Edgar10QParser class is responsible for parsing SEC EDGAR 10-Q quarterly reports. It transforms the HTML documents into a list of elements. Each element in this list represents a part of the visual structure of the original document.
- get_default_steps(get_checks: Callable[[], list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]] | None = None) list[sec_parser.processing_steps.abstract_classes.abstract_processing_step.AbstractProcessingStep]
- get_default_single_element_checks() list[sec_parser.processing_steps.individual_semantic_element_extractor.single_element_checks.abstract_single_element_check.AbstractSingleElementCheck]
- class sec_parser.HtmlTag(bs4_element: bs4.PageElement)
The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.
It serves three main purposes:
Decoupling: By abstracting the underlying BeautifulSoup4 library, we can isolate our application logic from the library specifics. This makes it easier to modify or even replace the HTML parsing library in the future without extensive codebase changes.
Usability: The HtmlTag class provides a convenient location to add extension methods or additional properties not offered by the native BeautifulSoup4 Tag class. This enhances the usability of the class.
Caching: The HtmlTag class also caches processing results, improving performance by avoiding unnecessary re-computation.
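The decoupling and caching purposes above can be illustrated with a small wrapper. This is a minimal sketch and not the HtmlTag implementation: it wraps an xml.etree.ElementTree element from the standard library instead of a BeautifulSoup4 tag, and TagWrapper and text_computations are hypothetical names introduced only for the demonstration.

```python
from functools import cached_property
import xml.etree.ElementTree as ET


class TagWrapper:
    """Hypothetical stand-in for a wrapper class such as HtmlTag."""

    def __init__(self, element: ET.Element) -> None:
        # The wrapped third-party object stays private, so callers depend
        # only on the wrapper's own interface.
        self._element = element
        # Counter that exists only to show the cache at work.
        self.text_computations = 0

    @property
    def name(self) -> str:
        return self._element.tag

    @cached_property
    def text(self) -> str:
        # Runs once per instance; later reads return the stored value.
        self.text_computations += 1
        # itertext() yields the text of the element and all its descendants.
        return "".join(self._element.itertext()).strip()


tag = TagWrapper(ET.fromstring("<div><p><b>text</b> more</p></div>"))
first = tag.text
second = tag.text
```

Caching is safe here only under the same condition the text property states: the underlying data does not change after the wrapper is created.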
- get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) str
- _generate_preview(text: str) str
Generate a preview of the text with a specified length.
- to_dict() frozendict.frozendict
Compute the hash of the HTML tag.
- contains_words() bool
Return True if the semantic element contains text.
- property text: str
text property recursively extracts text from the child tags. The result is cached as the underlying data doesn’t change.
- property name: str
Returns the tag name, e.g. ‘div’ for <div>.
- has_tag_children() bool
- contains_tag(name: str, *, include_self: bool = False) bool
contains_tag method checks if the current HTML tag contains a descendant tag with the specified name. For example, calling contains_tag(“b”) on an HtmlTag instance representing “<div><p><b>text</b></p></div>” would return True, as there is a ‘b’ tag within the descendants of the ‘div’ tag.
- has_text_outside_tags(tags: list[str] | str) bool
has_text_outside_tags method checks if the current HTML tag has any text outside the specified tags. For example, calling has_text_outside_tags([“b”]) on an HtmlTag instance representing “<div><p><b>text</b>extra text</p></div>” would return True, as there is text outside the ‘b’ tag within the descendants of the ‘div’ tag.
- without_tags(names: collections.abc.Iterable[str]) HtmlTag
without_tags method creates a copy of the current HTML tag and removes all descendant tags with the specified names. For example, calling without_tags([“b”, “i”]) on an HtmlTag instance representing “<div><b>foo</b><p>bar<i>bax</i></p></div>” would return a new HtmlTag instance representing “<div><p>bar</p></div>”.
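The copy-then-remove behaviour described for without_tags can be sketched with the standard library. The code below is an illustration under assumptions, not the library's implementation: it operates on xml.etree.ElementTree elements rather than BeautifulSoup4 tags, and keeping the text that follows a removed tag is a choice made here, not something this page specifies.

```python
import copy
import xml.etree.ElementTree as ET


def without_tags(root: ET.Element, names) -> ET.Element:
    """Return a copy of root with every descendant named in names removed."""
    unwanted = set(names)
    clone = copy.deepcopy(root)  # the original tree is never modified
    for parent in list(clone.iter()):
        previous = None
        for child in list(parent):
            if child.tag not in unwanted:
                previous = child
                continue
            # ElementTree stores text that follows a tag on the tag itself
            # (its "tail"), so hand it over before dropping the element.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return clone


original = ET.fromstring("<div><b>foo</b><p>bar<i>bax</i></p></div>")
stripped = without_tags(original, ["b", "i"])
stripped_html = ET.tostring(stripped, encoding="unicode")
original_html = ET.tostring(original, encoding="unicode")
```

Here stripped_html reproduces the documented result, and original_html confirms that the source element was left untouched.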
- count_tags(name: str) int
count_tags method counts the number of descendant tags with the specified name within the current HTML tag. For example, calling count_tags(“b”) on an HtmlTag instance representing “<div><p><b>text</b></p><b>more text</b></div>” would return 2, as there are two ‘b’ tags within the descendants of the ‘div’ tag.
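The contains_tag, has_text_outside_tags and count_tags examples above can be reproduced with a few lines of standard-library code. This is a hedged sketch: the three functions are free-standing stand-ins that work on xml.etree.ElementTree elements, so they show the documented behaviour but say nothing about how HtmlTag implements it.

```python
import xml.etree.ElementTree as ET


def contains_tag(root, name, *, include_self=False):
    # Element.iter(name) also yields the root, so skip it unless requested.
    return any(include_self or el is not root for el in root.iter(name))


def count_tags(root, name):
    # Count matching descendants only, never the root itself.
    return sum(1 for el in root.iter(name) if el is not root)


def has_text_outside_tags(root, tags):
    excluded = {tags} if isinstance(tags, str) else set(tags)

    def walk(el, inside):
        # Text directly inside `el` counts only when no ancestor is excluded.
        if not inside and (el.text or "").strip():
            return True
        for child in el:
            if walk(child, inside or child.tag in excluded):
                return True
            # Tail text follows the child's closing tag, so it belongs to `el`.
            if not inside and (child.tail or "").strip():
                return True
        return False

    return walk(root, root.tag in excluded)


nested = ET.fromstring("<div><p><b>text</b></p></div>")
two_bold = ET.fromstring("<div><p><b>text</b></p><b>more text</b></div>")
mixed = ET.fromstring("<div><p><b>text</b>extra text</p></div>")

bold_found = contains_tag(nested, "b")
bold_count = count_tags(two_bold, "b")
text_outside_bold = has_text_outside_tags(mixed, ["b"])
```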
- is_unary_tree() bool
is_unary_tree determines if a BeautifulSoup tag forms a unary tree. In a unary tree, each node has at most one child.
However, if a non-leaf node contains a non-empty string even without a tag surrounding it, the tree is not considered unary.
Additionally, if the tag is a ‘table’, the function will return True regardless of its children. This is because in the context of this application, ‘table’ tags are always considered unary.
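The three rules above (at most one child per node, no loose text on non-leaf nodes, tables always unary) can be sketched as a short recursive check. This is an illustration on xml.etree.ElementTree elements, not the library's code; in particular, treating a ‘table’ found anywhere along the chain as unary is an assumption made here.

```python
import xml.etree.ElementTree as ET


def is_unary_tree(node: ET.Element) -> bool:
    # Assumption: a table short-circuits the check wherever it is reached.
    if node.tag == "table":
        return True
    children = list(node)
    if len(children) > 1:
        return False
    if not children:
        return True  # a leaf is trivially unary
    only_child = children[0]
    # Text placed directly in a non-leaf node, before or after its single
    # child, means the node carries more than just that child.
    own_text = (node.text or "").strip() or (only_child.tail or "").strip()
    if own_text:
        return False
    return is_unary_tree(only_child)


chain = is_unary_tree(ET.fromstring("<div><p><b>text</b></p></div>"))
siblings = is_unary_tree(ET.fromstring("<div><p>a</p><p>b</p></div>"))
loose_text = is_unary_tree(ET.fromstring("<div>text<p>a</p></div>"))
table = is_unary_tree(
    ET.fromstring("<table><tr><td>1</td><td>2</td></tr></table>")
)
```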
- get_text_styles_metrics() dict[tuple[str, str], float]
Compute the percentage distribution of various CSS styles within the text content of a given HTML tag and its descendants.
This function iterates through all the text nodes within the tag, recursively includes text from child elements, and calculates the effective styles applied to each text segment.
It aggregates these styles and computes their percentage distribution based on the length of text they apply to.
The function uses BeautifulSoup’s recursive text search and parent traversal features. It returns a dictionary containing the aggregated style metrics (the percentage distribution of styles).
Each dictionary entry maps a unique style, given as a (property, value) tuple, to the percentage of text it affects.
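The length-weighted aggregation described above can be worked through on a small example. This sketch assumes the text has already been split into segments with their effective styles resolved, that percentages are on a 0 to 100 scale, and that every character counts equally; style_percentages and the sample segments are hypothetical and not part of the library.

```python
from collections import defaultdict


def style_percentages(segments):
    """segments: iterable of (text, {css_property: value}) pairs."""
    segments = list(segments)
    total = sum(len(text) for text, _ in segments)
    if total == 0:
        return {}
    covered = defaultdict(int)
    for text, styles in segments:
        for prop, value in styles.items():
            # Each style is credited with the length of the text it applies to.
            covered[(prop, value)] += len(text)
    return {style: 100.0 * length / total for style, length in covered.items()}


segments = [
    ("Total revenue", {"font-weight": "bold"}),                     # 13 characters
    ("rose 5%", {"font-weight": "bold", "font-style": "italic"}),   # 7 characters
]
metrics = style_percentages(segments)
```

With 20 characters in total, bold covers all of them and italic covers 7, so the italic entry works out to 35 percent.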
- get_approx_table_metrics() sec_parser.utils.bs4_.approx_table_metrics.ApproxTableMetrics | None
- is_table_of_content() bool
- table_to_markdown() str
- static _to_tag(element: bs4.PageElement) bs4.Tag
- static wrap_tags_in_new_parent(parent_tag_name: str, tags: collections.abc.Iterable[HtmlTag]) HtmlTag
- count_text_matches_in_descendants(predicate: Callable[[str], bool], *, exclude_links: bool | None = None) int
- class sec_parser.AbstractProcessingStep
Bases: abc.ABC
AbstractProcessingStep class for transforming a list of elements. Chaining multiple steps together allows for complex transformations while keeping the code modular.
Each instance of a step is designed to be used for a single transformation operation. This ensures that any internal state maintained during a transformation is isolated to the processing of a single document.
- process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
Transform the list of semantic elements.
Note: The elements argument may be mutated for performance reasons.
- abstract _process(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
Implement the actual transformation logic in child classes.
This method is intended to be overridden by child classes to provide specific transformation logic.
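The split between a public process method and an overridable _process method, together with the one-instance-per-document guideline, can be sketched as follows. All class names here are hypothetical, the elements are plain strings rather than semantic elements, and raising an error on reuse is an assumption about how the guideline might be enforced, not documented behaviour.

```python
from abc import ABC, abstractmethod


class ProcessingStepSketch(ABC):
    """Hypothetical stand-in for a step base class."""

    def __init__(self) -> None:
        self._already_processed = False

    def process(self, elements):
        # Assumption: reuse is rejected so per-document state cannot leak.
        if self._already_processed:
            raise RuntimeError("create a new step instance for each document")
        self._already_processed = True
        return self._process(elements)

    @abstractmethod
    def _process(self, elements):
        """Child classes implement the actual transformation here."""


class StripWhitespace(ProcessingStepSketch):
    def _process(self, elements):
        return [element.strip() for element in elements]


class DropEmpty(ProcessingStepSketch):
    def _process(self, elements):
        return [element for element in elements if element]


def run_pipeline(get_steps, elements):
    # A factory returns fresh step instances for every run.
    for step in get_steps():
        elements = step.process(elements)
    return elements


result = run_pipeline(
    lambda: [StripWhitespace(), DropEmpty()],
    ["  Item 1 ", "   ", "Revenue"],
)

used_step = DropEmpty()
used_step.process(["a"])
try:
    used_step.process(["b"])
    reuse_rejected = False
except RuntimeError:
    reuse_rejected = True
```

Passing a factory instead of a list of instances mirrors the get_steps parameter of Edgar10QParser, which is typed as a callable returning a list of steps.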
- class sec_parser.CompositeSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis] | None, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
CompositeSemanticElement acts as a container for other semantic elements, especially for cases where a single HTML root tag wraps multiple elements. This ensures structural integrity and enables various features like semantic segmentation visualization, and debugging by comparison with the original document.
Why is this useful:
1. Some semantic elements, like XBRL tags (<ix>), may wrap multiple semantic elements. The container ensures that these relationships are not broken during parsing.
2. Enables the parser to fully reconstruct the original HTML document, which opens up possibilities for features like semantic segmentation visualization (e.g. recreate the original document but put semi-transparent colored boxes on top, based on semantic meaning), serialization of parsed documents into an augmented HTML, and debugging by comparing to the original document.
- property inner_elements: tuple[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, Ellipsis]
- classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, inner_elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement] | None = None) CompositeSemanticElement
Convert the semantic element into another semantic element type.
- to_dict(*, include_previews: bool = False, include_contents: bool = False) dict[str, Any]
- classmethod unwrap_elements(elements: collections.abc.Iterable[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, include_containers: bool | None = None) list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]
Recursively flatten a list of AbstractSemanticElement objects. For each CompositeSemanticElement encountered, its inner_elements are also recursively flattened. The ‘include_containers’ parameter controls whether the CompositeSemanticElement itself is included in the flattened list.
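The recursive flattening described for unwrap_elements can be sketched with two small stand-in classes. Leaf and Composite are hypothetical and replace the real semantic element types; placing a container before its inner elements when include_containers is set, and treating None like False, are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    name: str


@dataclass(frozen=True)
class Composite:
    name: str
    inner_elements: tuple = ()


def unwrap_elements(elements, *, include_containers=None):
    flat = []
    for element in elements:
        if isinstance(element, Composite):
            if include_containers:
                flat.append(element)  # assumption: container comes first
            # Recurse so that nested containers are flattened as well.
            flat.extend(
                unwrap_elements(
                    element.inner_elements,
                    include_containers=include_containers,
                )
            )
        else:
            flat.append(element)
    return flat


document = [
    Leaf("title"),
    Composite("ix", (Leaf("text"), Composite("inner", (Leaf("table"),)))),
    Leaf("footer"),
]
names = [e.name for e in unwrap_elements(document)]
names_with_containers = [
    e.name for e in unwrap_elements(document, include_containers=True)
]
```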
- class sec_parser.AbstractSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: abc.ABC
In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose. For example, a paragraph or a table might be considered a semantic element. Unlike syntactic elements, which merely exist to structure the HTML, semantic elements carry information that is vital to the understanding of the document’s content.
This class serves as a foundational representation of such semantic elements, containing an HtmlTag object that stores the raw HTML tag information. Subclasses will implement additional behaviors based on the type of the semantic element.
- log_init(log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None) None
Has to be called at the very end of the __init__ method.
- property html_tag: sec_parser.processing_engine.html_tag.HtmlTag
- classmethod create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin) AbstractSemanticElement
Convert the semantic element into another semantic element type.
- to_dict(*, include_previews: bool = False, include_contents: bool = False) dict[str, Any]
- __repr__() str
Return repr(self).
- contains_words() bool
Return True if the semantic element contains text.
- property text: str
Property text is a passthrough to the HtmlTag text property.
- get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) str
get_source_code is a passthrough to the HtmlTag method.
- get_summary() str
Return a human-readable summary of the semantic element.
This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag. In this base implementation, it is a passthrough to the HtmlTag’s get_text() method.
Note: Subclasses may override this method to provide a more specific summary based on the type of element.
- class sec_parser.EmptyElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: IrrelevantElement
The EmptyElement class represents an HTML element that does not contain any content. It is a subclass of the IrrelevantElement class and is used to identify and handle empty HTML tags in the document.
- class sec_parser.ImageElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The ImageElement class represents a standard image within a document.
- class sec_parser.IrrelevantElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content. These elements often include page separators, page numbers, and other non-content items. For instance, HTML tags without content like <p></p> or <div></div> are deemed irrelevant, often used in documents just to add vertical space.
- class sec_parser.NotYetClassifiedElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The NotYetClassifiedElement class represents an element whose type has not yet been determined. The parsing process aims to classify all instances of this class into more specific subclasses of AbstractSemanticElement.
- class sec_parser.PageHeaderElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: IrrelevantElement
The PageHeaderElement class represents a page header within a document. It is a subclass of the IrrelevantElement class and is used to identify and handle page headers in the document, such as current section titles and company names.
- class sec_parser.PageNumberElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: IrrelevantElement
The PageNumberElement class represents a page number within a document. It is a subclass of the IrrelevantElement class and is used to identify and handle page numbers in the document.
- class sec_parser.SupplementaryText(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The SupplementaryText class captures various types of supplementary text within a document, such as unit qualifiers, additional notes, and disclaimers.
For example:
- “(In millions, except number of shares which are reflected in thousands and per share amounts)”
- “See accompanying Notes to Condensed Consolidated Financial Statements.”
- “Disclaimer: This is not financial advice.”
- class sec_parser.TextElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The TextElement class represents a standard text paragraph within a document.
- class sec_parser.TableElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The TableElement class represents a standard table within a document.
- get_summary() str
Return a human-readable summary of the semantic element.
This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag.
- to_dict(*, include_previews: bool = False, include_contents: bool = False) dict[str, Any]
- table_to_markdown() str
- class sec_parser.TitleElement
Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement
The TitleElement class represents the title of a paragraph or other content object. It serves as a semantic marker, providing context and structure to the document.
- class sec_parser.TopSectionTitle(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None, level: int | None = None, section_type: sec_parser.semantic_elements.top_section_title_types.TopSectionType | None = None)
Bases: sec_parser.semantic_elements.mixins.dict_text_content_mixin.DictTextContentMixin, sec_parser.semantic_elements.top_section_start_marker.TopSectionStartMarker
The TopSectionTitle class represents the title and the beginning of a top-level section of a document. For instance, in SEC 10-Q reports, a top-level section could be “Part I, Item 3. Quantitative and Qualitative Disclosures About Market Risk”.
- classmethod create_from_element(source: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin, *, level: int | None = None, section_type: sec_parser.semantic_elements.top_section_title_types.TopSectionType | None = None) sec_parser.semantic_elements.abstract_semantic_element.AbstractLevelElement
- to_dict(*, include_previews: bool = False, include_contents: bool = False) dict[str, Any]
- class sec_parser.AbstractNestingRule(*, exclude_parents: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None, exclude_children: set[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]] | None = None)
Bases: abc.ABC
AbstractNestingRule is a base class for defining rules for nesting semantic elements. Each rule should ideally mention at most one or two types of semantic elements to reduce coupling and complexity.
In case of conflicts between rules, they should be resolved through parameters like exclude_parents and exclude_children.
- should_be_nested_under(parent: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, child: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool
- abstract _should_be_nested_under(parent: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, child: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement) bool
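The pairing of a public should_be_nested_under method with an abstract _should_be_nested_under method suggests that the public method applies the exclusions before delegating to the subclass. The sketch below shows that reading; it is an assumption about the design, and NestingRuleSketch, NestUnderTitle and the three element classes are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class Title: ...
class Text: ...
class Table: ...


class NestingRuleSketch(ABC):
    def __init__(self, *, exclude_parents=None, exclude_children=None):
        # Tuples are stored because isinstance() accepts a tuple of types.
        self._exclude_parents = tuple(exclude_parents or ())
        self._exclude_children = tuple(exclude_children or ())

    def should_be_nested_under(self, parent, child) -> bool:
        # Exclusions win over the rule's own logic, which lets callers
        # resolve conflicts between rules without editing the rules.
        if isinstance(parent, self._exclude_parents):
            return False
        if isinstance(child, self._exclude_children):
            return False
        return self._should_be_nested_under(parent, child)

    @abstractmethod
    def _should_be_nested_under(self, parent, child) -> bool:
        """Rule-specific decision, implemented by subclasses."""


class NestUnderTitle(NestingRuleSketch):
    def _should_be_nested_under(self, parent, child) -> bool:
        return isinstance(parent, Title) and not isinstance(child, Title)


rule = NestUnderTitle(exclude_children={Table})
text_under_title = rule.should_be_nested_under(Title(), Text())
table_under_title = rule.should_be_nested_under(Title(), Table())
text_under_text = rule.should_be_nested_under(Text(), Text())
```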
- sec_parser.render(tree: list[sec_parser.semantic_tree.tree_node.TreeNode] | sec_parser.semantic_tree.tree_node.TreeNode | sec_parser.semantic_tree.semantic_tree.SemanticTree | list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], Ellipsis] | None = None, char_display_limit: int | None = None, verbose: bool = False, _nodes: list[sec_parser.semantic_tree.tree_node.TreeNode] | None = None, _level: int = 0, _prefix: str = '', _is_root: bool = True) str
render function is used to visualize the structure of the semantic tree. It is primarily used for debugging purposes.
- class sec_parser.SemanticTree(root_nodes: list[sec_parser.semantic_tree.tree_node.TreeNode])
- __iter__() collections.abc.Iterator[sec_parser.semantic_tree.tree_node.TreeNode]
Iterate over the root nodes of the tree.
- __len__() int
- property nodes: collections.abc.Iterator[sec_parser.semantic_tree.tree_node.TreeNode]
Get all nodes in the semantic tree. This includes the root nodes and all their descendants.
- render(*, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], Ellipsis] | None = None, char_display_limit: int | None = None, verbose: bool = False) str
Render the semantic tree as a human-readable string.
Syntactic sugar for a more convenient usage of render.
- print(*, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], Ellipsis] | None = None, char_display_limit: int | None = None, verbose: bool = False, line_limit: int | None = None) None
Print the semantic tree as a human-readable string.
Syntactic sugar for a more convenient usage of render.
- class sec_parser.TreeBuilder(get_rules: Callable[[], list[sec_parser.semantic_tree.nesting_rules.AbstractNestingRule]] | None = None)
Builds a semantic tree from a list of semantic elements.
Why Use a Tree Structure?
Using a tree data structure allows for easier and more robust filtering of sections. With a tree, you can select specific branches to filter, making it straightforward to identify section boundaries. This approach is more maintainable and robust compared to attempting the same operations on a flat list of elements.
Overview:
Takes a list of semantic elements.
Applies nesting rules to these elements.
Customization:
The nesting process is customizable through a list of rules. These rules determine how new elements should be nested under existing ones.
Advanced Customization:
You can supply your own set of rules by providing a callable to get_rules, which should return a list of AbstractNestingRule instances.
- static get_default_rules() list[sec_parser.semantic_tree.nesting_rules.AbstractNestingRule]
- build(elements: list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement]) sec_parser.semantic_tree.semantic_tree.SemanticTree
- _find_parent_node(new_node: sec_parser.semantic_tree.tree_node.TreeNode, stack: list[sec_parser.semantic_tree.tree_node.TreeNode], rules: list[sec_parser.semantic_tree.nesting_rules.AbstractNestingRule]) sec_parser.semantic_tree.tree_node.TreeNode | None
- _should_nest_under(child_node: sec_parser.semantic_tree.tree_node.TreeNode, parent_node: sec_parser.semantic_tree.tree_node.TreeNode, rules: list[sec_parser.semantic_tree.nesting_rules.AbstractNestingRule]) bool
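The overview above (take a flat list, apply nesting rules) can be sketched with a stack of open ancestors. This is one plausible way to build such a tree and is not claimed to be the library's algorithm; Node, deeper_level_nests, build and labels_under are hypothetical names, and rules are plain callables instead of AbstractNestingRule instances.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    level: int
    children: list = field(default_factory=list)


def deeper_level_nests(parent: Node, child: Node) -> bool:
    # Hypothetical rule: an element nests under any shallower element.
    return child.level > parent.level


def build(nodes, rules):
    roots, stack = [], []
    for node in nodes:
        # Close ancestors until one of them accepts the new node.
        while stack and not any(rule(stack[-1], node) for rule in rules):
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


def labels_under(node: Node) -> list:
    # Selecting one branch returns exactly the content of that section.
    found = []
    for child in node.children:
        found.append(child.label)
        found.extend(labels_under(child))
    return found


flat = [
    Node("Part I", 0),
    Node("Item 1", 1),
    Node("paragraph a", 2),
    Node("Item 2", 1),
    Node("paragraph b", 2),
    Node("Part II", 0),
]
roots = build(flat, [deeper_level_nests])
root_labels = [root.label for root in roots]
part_one_contents = labels_under(roots[0])
```

Filtering by branch, as in labels_under, is the operation that would otherwise require scanning the flat list for section boundaries.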
- class sec_parser.TreeNode(semantic_element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, *, parent: TreeNode | None = None, children: collections.abc.Iterable[TreeNode] | None = None)
The TreeNode class is a fundamental part of the semantic tree structure. Each TreeNode represents a node in the tree. It holds a reference to a semantic element, maintains a list of its child nodes, and a reference to its parent node. This class provides methods for managing the tree structure, such as adding and removing child nodes. Importantly, these methods ensure logical consistency as children/parents are being changed. For example, if a parent is removed from a child, the child is automatically removed from the parent.
- property semantic_element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- __repr__() str
Return repr(self).
- property text: str
Property text is a passthrough to the SemanticElement text property.
- get_source_code(*, pretty: bool = False) str
get_source_code is a passthrough to the SemanticElement method.
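The consistency guarantee described for TreeNode (changing one side of a parent/child link updates the other side) can be sketched by routing every change through a single parent setter. NodeSketch and its method names are hypothetical; this page does not list TreeNode's mutation methods, so the sketch shows the idea rather than the real interface.

```python
class NodeSketch:
    """Hypothetical stand-in that keeps both ends of each link in sync."""

    def __init__(self, value, *, parent=None):
        self.value = value
        self._parent = None
        self._children = []
        if parent is not None:
            self.parent = parent

    @property
    def children(self):
        return tuple(self._children)  # read-only view for callers

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        if self._parent is new_parent:
            return
        # Detach from the old parent first so no stale link remains.
        if self._parent is not None:
            self._parent._children.remove(self)
        self._parent = new_parent
        if new_parent is not None:
            new_parent._children.append(self)

    def add_child(self, child):
        child.parent = self

    def remove_child(self, child):
        if child.parent is self:
            child.parent = None


root = NodeSketch("root")
other = NodeSketch("other")
leaf = NodeSketch("leaf", parent=root)
children_before = root.children

other.add_child(leaf)  # re-parenting removes the leaf from root as well
children_after = root.children
other_children = other.children

leaf.parent = None  # clearing the parent also updates the former parent
other_children_final = other.children
```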