sec_parser.semantic_tree.render_
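Attributes
- DEFAULT_CHAR_DISPLAY_LIMIT
Classes
- AbstractSemanticElement: In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose.
- IrrelevantElement: The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content.
- SemanticTree
- TreeNode: The TreeNode class is a fundamental part of the semantic tree structure.
Functions
- render: The render function is used to visualize the structure of the semantic tree.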
Module Contents
- class sec_parser.semantic_tree.render_.AbstractSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: abc.ABC
In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose. For example, a paragraph or a table might be considered a semantic element. Unlike syntactic elements, which merely exist to structure the HTML, semantic elements carry information that is vital to the understanding of the document’s content.
This class serves as a foundational representation of such semantic elements, containing an HtmlTag object that stores the raw HTML tag information. Subclasses will implement additional behaviors based on the type of the semantic element.
- log_init(log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None) -> None
Must be called at the very end of the __init__ method.
- property html_tag: sec_parser.processing_engine.html_tag.HtmlTag
- classmethod create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin) -> AbstractSemanticElement
Convert the semantic element into another semantic element type.
- to_dict(*, include_previews: bool = False, include_contents: bool = False) -> dict[str, Any]
- __repr__() -> str
Return repr(self).
- contains_words() -> bool
Return True if the semantic element contains text.
- property text: str
The text property is a passthrough to the HtmlTag text property.
- get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) -> str
get_source_code is a passthrough to the HtmlTag method of the same name.
- get_summary() -> str
Return a human-readable summary of the semantic element.
This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag. In this base implementation, it is a passthrough to the HtmlTag’s get_text() method.
Note: Subclasses may override this method to provide a more specific summary based on the type of element.
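The passthrough and override behavior described above can be illustrated with a minimal, self-contained sketch. It does not use sec_parser itself: FakeHtmlTag, BaseElement, and TitleElement are hypothetical stand-ins, the log_origin argument is omitted, and the rule used by contains_words here (at least one alphabetic character) is an assumption rather than the library's documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FakeHtmlTag:
    """Hypothetical stand-in for HtmlTag; it only stores plain text."""

    raw_text: str

    @property
    def text(self) -> str:
        return self.raw_text.strip()


class BaseElement:
    """Hypothetical stand-in for AbstractSemanticElement."""

    def __init__(self, html_tag: FakeHtmlTag) -> None:
        self._html_tag = html_tag

    @property
    def html_tag(self) -> FakeHtmlTag:
        return self._html_tag

    @property
    def text(self) -> str:
        # Passthrough: the element delegates to the wrapped tag.
        return self._html_tag.text

    def contains_words(self) -> bool:
        # Assumption: "contains text" means at least one alphabetic character.
        return any(ch.isalpha() for ch in self.text)

    def get_summary(self) -> str:
        # Base behavior: return the tag's text unchanged.
        return self.text

    @classmethod
    def create_from_element(cls, source: BaseElement) -> BaseElement:
        # Re-wrap the same underlying tag in a different element type.
        return cls(source.html_tag)


class TitleElement(BaseElement):
    """Hypothetical subclass that overrides the summary."""

    def get_summary(self) -> str:
        return f"Title: {self.text}"


generic = BaseElement(FakeHtmlTag("  Risk Factors  "))
title = TitleElement.create_from_element(generic)
print(generic.get_summary())
print(title.get_summary())
```

In this sketch the converted element shares the original tag object, so only the element type and its summary change.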
- class sec_parser.semantic_tree.render_.IrrelevantElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)
Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content. These elements often include page separators, page numbers, and other non-content items. For instance, HTML tags without content like <p></p> or <div></div> are deemed irrelevant, often used in documents just to add vertical space.
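To make the "tags without content" case concrete, here is a rough sketch that uses only the Python standard library. The helper looks_irrelevant is hypothetical and is not how sec_parser classifies elements; it covers only the empty-tag case and does not attempt to detect page numbers or page separators.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def looks_irrelevant(html: str) -> bool:
    """Return True when the fragment has no visible text (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # A fragment made only of tags and whitespace carries no content.
    return "".join(collector.chunks).strip() == ""


for fragment in ("<p></p>", "<div></div>", "<p>Item 1. Business</p>"):
    print(fragment, looks_irrelevant(fragment))
```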
- class sec_parser.semantic_tree.render_.SemanticTree(root_nodes: list[sec_parser.semantic_tree.tree_node.TreeNode])
- __iter__() -> collections.abc.Iterator[sec_parser.semantic_tree.tree_node.TreeNode]
Iterate over the root nodes of the tree.
- __len__() -> int
- property nodes: collections.abc.Iterator[sec_parser.semantic_tree.tree_node.TreeNode]
Get all nodes in the semantic tree. This includes the root nodes and all their descendants.
- render(*, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], ...] | None = None, char_display_limit: int | None = None, verbose: bool = False) -> str
Render the semantic tree as a human-readable string.
Syntactic sugar for more convenient use of the module-level render function.
- print(*, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], ...] | None = None, char_display_limit: int | None = None, verbose: bool = False, line_limit: int | None = None) -> None
Print the semantic tree as a human-readable string.
Syntactic sugar for more convenient use of the module-level render function.
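The difference between iterating over the tree (root nodes only) and reading the nodes property (root nodes plus all descendants) can be sketched as follows. Node and Tree are hypothetical stand-ins, and the depth-first, parent-before-children order is an assumption; the documentation above does not state the traversal order.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for TreeNode."""

    label: str
    children: list[Node] = field(default_factory=list)


class Tree:
    """Hypothetical stand-in for SemanticTree."""

    def __init__(self, root_nodes: list[Node]) -> None:
        self._root_nodes = root_nodes

    def __iter__(self) -> Iterator[Node]:
        # Iteration yields only the top-level nodes.
        yield from self._root_nodes

    @property
    def nodes(self) -> Iterator[Node]:
        # Assumed order: depth-first, each parent before its children.
        def walk(node: Node) -> Iterator[Node]:
            yield node
            for child in node.children:
                yield from walk(child)

        for root in self._root_nodes:
            yield from walk(root)


tree = Tree(
    [
        Node("Item 1", [Node("Overview"), Node("Products", [Node("Table")])]),
        Node("Item 2"),
    ]
)
root_labels = [node.label for node in tree]
all_labels = [node.label for node in tree.nodes]
print(root_labels)
print(all_labels)
```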
- class sec_parser.semantic_tree.render_.TreeNode(semantic_element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, *, parent: TreeNode | None = None, children: collections.abc.Iterable[TreeNode] | None = None)
The TreeNode class is a fundamental part of the semantic tree structure. Each TreeNode represents a node in the tree. It holds a reference to a semantic element, a list of its child nodes, and a reference to its parent node. This class provides methods for managing the tree structure, such as adding and removing child nodes. Importantly, these methods keep the tree logically consistent as children and parents change. For example, if a child's parent is removed, the child is automatically removed from that parent's list of children.
- property semantic_element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
- __repr__() -> str
Return repr(self).
- property text: str
Property text is a passthrough to the SemanticElement text property.
- get_source_code(*, pretty: bool = False) -> str
get_source_code is a passthrough to the SemanticElement method.
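The parent/child consistency guarantee described for TreeNode can be sketched with a small stand-in class. Node, add_child, and remove_child below are hypothetical names chosen for illustration; the real TreeNode methods and their signatures are not listed in this section.

```python
from __future__ import annotations


class Node:
    """Hypothetical stand-in for TreeNode that keeps both links in sync."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._parent: Node | None = None
        self._children: list[Node] = []

    @property
    def parent(self) -> Node | None:
        return self._parent

    @parent.setter
    def parent(self, new_parent: Node | None) -> None:
        if new_parent is self._parent:
            return
        old_parent = self._parent
        self._parent = new_parent
        # Detach from the old parent so it no longer lists this node.
        if old_parent is not None and self in old_parent._children:
            old_parent._children.remove(self)
        # Attach to the new parent so both directions agree.
        if new_parent is not None and self not in new_parent._children:
            new_parent._children.append(self)

    @property
    def children(self) -> tuple[Node, ...]:
        # Return a copy so callers cannot bypass the consistency logic.
        return tuple(self._children)

    def add_child(self, child: Node) -> None:
        child.parent = self

    def remove_child(self, child: Node) -> None:
        if child in self._children:
            child.parent = None


section = Node("Item 1")
other = Node("Item 2")
paragraph = Node("Paragraph")

section.add_child(paragraph)
other.add_child(paragraph)  # re-parenting also detaches from "Item 1"
print(paragraph.parent.label, len(section.children), len(other.children))
```

Routing every change through the parent setter is one way to keep a single source of truth; other designs are possible.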
- sec_parser.semantic_tree.render_.DEFAULT_CHAR_DISPLAY_LIMIT = 65
- sec_parser.semantic_tree.render_.render(tree: list[sec_parser.semantic_tree.tree_node.TreeNode] | sec_parser.semantic_tree.tree_node.TreeNode | sec_parser.semantic_tree.semantic_tree.SemanticTree | list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], ...] | None = None, char_display_limit: int | None = None, verbose: bool = False, _nodes: list[sec_parser.semantic_tree.tree_node.TreeNode] | None = None, _level: int = 0, _prefix: str = '', _is_root: bool = True) -> str
The render function visualizes the structure of the semantic tree. It is primarily used for debugging purposes.
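As a rough illustration of what a recursive tree renderer with a prefix and a character limit can look like, consider the sketch below. It is not the library's implementation: Node, render_sketch, ignored_kinds, and the ASCII connector layout are all assumptions, and in this sketch an ignored node also hides its subtree.

```python
from __future__ import annotations

from dataclasses import dataclass, field

CHAR_LIMIT = 65  # same value as DEFAULT_CHAR_DISPLAY_LIMIT above


@dataclass
class Node:
    """Hypothetical stand-in for a tree node with an element type name."""

    kind: str
    text: str
    children: list[Node] = field(default_factory=list)


def render_sketch(
    nodes: list[Node],
    *,
    char_display_limit: int = CHAR_LIMIT,
    ignored_kinds: tuple[str, ...] = (),
    _prefix: str = "",
) -> str:
    """Render nodes as an indented ASCII tree (assumed output layout)."""
    visible = [node for node in nodes if node.kind not in ignored_kinds]
    lines: list[str] = []
    for index, node in enumerate(visible):
        is_last = index == len(visible) - 1
        text = node.text
        # Truncate long text so each node fits on one line.
        if len(text) > char_display_limit:
            text = text[:char_display_limit] + "..."
        connector = "`-- " if is_last else "|-- "
        lines.append(f"{_prefix}{connector}{node.kind}: {text}")
        # Children are indented under their parent via an extended prefix.
        child_prefix = _prefix + ("    " if is_last else "|   ")
        child_block = render_sketch(
            node.children,
            char_display_limit=char_display_limit,
            ignored_kinds=ignored_kinds,
            _prefix=child_prefix,
        )
        if child_block:
            lines.append(child_block)
    return "\n".join(lines)


tree = [
    Node(
        "Title",
        "Item 1. Business",
        [
            Node("Text", "We design and sell widgets."),
            Node("Irrelevant", ""),
            Node("Table", "Revenue by segment"),
        ],
    ),
    Node("Title", "Item 1A. Risk Factors"),
]
output = render_sketch(tree, char_display_limit=25, ignored_kinds=("Irrelevant",))
print(output)
```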