sec_parser.semantic_tree.render_

Attributes

DEFAULT_CHAR_DISPLAY_LIMIT

Classes

AbstractSemanticElement

In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose.

IrrelevantElement

The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content.

SemanticTree

TreeNode

The TreeNode class is a fundamental part of the semantic tree structure.

Functions

render(→ str)

The render function visualizes the structure of the semantic tree.

Module Contents

class sec_parser.semantic_tree.render_.AbstractSemanticElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: abc.ABC

In the domain of HTML parsing, especially in the context of SEC EDGAR documents, a semantic element refers to a meaningful unit within the document that serves a specific purpose. For example, a paragraph or a table might be considered a semantic element. Unlike syntactic elements, which merely exist to structure the HTML, semantic elements carry information that is vital to the understanding of the document’s content.

This class serves as a foundational representation of such semantic elements, containing an HtmlTag object that stores the raw HTML tag information. Subclasses will implement additional behaviors based on the type of the semantic element.

log_init(log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None) → None

Has to be called at the very end of the __init__ method.

property html_tag: sec_parser.processing_engine.html_tag.HtmlTag
classmethod create_from_element(source: AbstractSemanticElement, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin) → AbstractSemanticElement

Convert the semantic element into another semantic element type.

to_dict(*, include_previews: bool = False, include_contents: bool = False) → dict[str, Any]
__repr__() → str

Return repr(self).

contains_words() → bool

Return True if the semantic element contains text.

property text: str

The text property is a passthrough to the HtmlTag text property.

get_source_code(*, pretty: bool = False, enable_compatibility: bool = False) → str

get_source_code is a passthrough to the HtmlTag method of the same name.

get_summary() → str

Return a human-readable summary of the semantic element.

This method aims to provide a simplified, human-friendly representation of the underlying HtmlTag. In this base implementation, it is a passthrough to the HtmlTag’s get_text() method.

Note: Subclasses may override this method to provide a more specific summary based on the type of element.
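
The passthrough-and-override pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's code: Tag, Element, and TableSketch are hypothetical stand-ins for HtmlTag, AbstractSemanticElement, and a subclass, and the line-counting summary is an invented example of a "more specific" summary.

```python
from abc import ABC


class Tag:
    """Hypothetical stand-in for HtmlTag; only get_text() is modeled."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_text(self) -> str:
        return self._text


class Element(ABC):
    """Hypothetical stand-in for AbstractSemanticElement."""

    def __init__(self, tag: Tag) -> None:
        self._tag = tag

    def get_summary(self) -> str:
        # Base behavior: pass the call straight through to the wrapped tag.
        return self._tag.get_text()


class TextSketch(Element):
    """Inherits the passthrough summary unchanged."""


class TableSketch(Element):
    """Overrides the summary with something specific to this element type."""

    def get_summary(self) -> str:
        line_count = len(self._tag.get_text().splitlines())
        return f"Table with {line_count} line(s) of text"


text_summary = TextSketch(Tag("Risk factors")).get_summary()
table_summary = TableSketch(Tag("Revenue\n2022\n2023")).get_summary()
```

Callers can then call get_summary() on any element without knowing its concrete type.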

class sec_parser.semantic_tree.render_.IrrelevantElement(html_tag: sec_parser.processing_engine.html_tag.HtmlTag, *, processing_log: sec_parser.processing_engine.processing_log.ProcessingLog | None = None, log_origin: sec_parser.processing_engine.processing_log.LogItemOrigin | None = None)

Bases: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement

The IrrelevantElement class identifies elements in the parsed HTML that do not contribute to the content. These elements often include page separators, page numbers, and other non-content items. For instance, empty HTML tags such as <p></p> or <div></div>, which documents often use just to add vertical space, are deemed irrelevant.

class sec_parser.semantic_tree.render_.SemanticTree(root_nodes: list[sec_parser.semantic_tree.tree_node.TreeNode])
__iter__() → collections.abc.Iterator[sec_parser.semantic_tree.tree_node.TreeNode]

Iterate over the root nodes of the tree.

__len__() → int
property nodes: collections.abc.Iterator[sec_parser.semantic_tree.tree_node.TreeNode]

Get all nodes in the semantic tree. This includes the root nodes and all their descendants.
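
As a rough sketch of what "root nodes and all their descendants" means, the traversal below walks a small tree. Node and iter_all_nodes are hypothetical stand-ins, and the depth-first, parent-before-children order is an assumption rather than documented behavior of the nodes property.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for TreeNode."""

    label: str
    children: list[Node] = field(default_factory=list)

    def get_descendants(self) -> Iterator[Node]:
        # Assumed order: each child is yielded before its own descendants.
        for child in self.children:
            yield child
            yield from child.get_descendants()


def iter_all_nodes(root_nodes: list[Node]) -> Iterator[Node]:
    """Yield each root node followed by all of its descendants."""
    for root in root_nodes:
        yield root
        yield from root.get_descendants()


roots = [
    Node("Item 1", [Node("Item 1A", [Node("paragraph")])]),
    Node("Item 2"),
]
labels = [node.label for node in iter_all_nodes(roots)]
```

Iterating over the tree itself visits only the two root nodes here, while the full traversal visits all four.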

render(*, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], ...] | None = None, char_display_limit: int | None = None, verbose: bool = False) → str

Render the semantic tree as a human-readable string.

Syntactic sugar for calling the module-level render function on this tree.

print(*, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], ...] | None = None, char_display_limit: int | None = None, verbose: bool = False, line_limit: int | None = None) → None

Print the semantic tree as a human-readable string.

Syntactic sugar for printing the output of render.

class sec_parser.semantic_tree.render_.TreeNode(semantic_element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement, *, parent: TreeNode | None = None, children: collections.abc.Iterable[TreeNode] | None = None)

The TreeNode class is a fundamental part of the semantic tree structure. Each TreeNode represents one node in the tree: it holds a reference to a semantic element, a list of its child nodes, and a reference to its parent node. The class provides methods for managing the tree structure, such as adding and removing child nodes. These methods keep parent and child references consistent as they change. For example, if a child's parent is removed, the child is automatically removed from that parent's list of children.
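
The consistency guarantee described above can be illustrated with a small sketch. MiniNode is a hypothetical stand-in written for this example; it is not the library's implementation and only assumes that changing either side of a parent/child link updates the other side.

```python
from __future__ import annotations


class MiniNode:
    """Hypothetical stand-in for TreeNode; not the library's implementation."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._parent: MiniNode | None = None
        self._children: list[MiniNode] = []

    @property
    def parent(self) -> MiniNode | None:
        return self._parent

    @parent.setter
    def parent(self, new_parent: MiniNode | None) -> None:
        if new_parent is self._parent:
            return  # nothing to change; also stops the mutual recursion
        old_parent = self._parent
        self._parent = new_parent
        # Update both sides of the link so they never disagree.
        if old_parent is not None:
            old_parent.remove_child(self)
        if new_parent is not None:
            new_parent.add_child(self)

    @property
    def children(self) -> list[MiniNode]:
        return list(self._children)  # a copy, so callers cannot bypass the checks

    def add_child(self, child: MiniNode) -> None:
        if child not in self._children:
            self._children.append(child)
        child.parent = self

    def remove_child(self, child: MiniNode) -> None:
        if child in self._children:
            self._children.remove(child)
        if child.parent is self:
            child.parent = None

    def has_child(self, child: MiniNode) -> bool:
        return child in self._children


root = MiniNode("root")
leaf = MiniNode("leaf")
root.add_child(leaf)          # also sets leaf.parent
linked = leaf.parent is root
leaf.parent = None            # also removes leaf from root's children
still_child = root.has_child(leaf)
```

The identity check at the top of the parent setter is what keeps add_child and the setter from calling each other forever.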

property semantic_element: sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement
property children: list[TreeNode]
property parent: TreeNode | None
add_child(child: TreeNode) → None
add_children(children: collections.abc.Iterable[TreeNode]) → None
remove_child(child: TreeNode) → None
has_child(child: TreeNode) → bool
get_descendants() → collections.abc.Iterator[TreeNode]
__repr__() → str

Return repr(self).

property text: str

The text property is a passthrough to the SemanticElement text property.

get_source_code(*, pretty: bool = False) → str

get_source_code is a passthrough to the SemanticElement method of the same name.

sec_parser.semantic_tree.render_.DEFAULT_CHAR_DISPLAY_LIMIT = 65
sec_parser.semantic_tree.render_.render(tree: list[sec_parser.semantic_tree.tree_node.TreeNode] | sec_parser.semantic_tree.tree_node.TreeNode | sec_parser.semantic_tree.semantic_tree.SemanticTree | list[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], *, pretty: bool | None = True, ignored_types: tuple[type[sec_parser.semantic_elements.abstract_semantic_element.AbstractSemanticElement], ...] | None = None, char_display_limit: int | None = None, verbose: bool = False, _nodes: list[sec_parser.semantic_tree.tree_node.TreeNode] | None = None, _level: int = 0, _prefix: str = '', _is_root: bool = True) → str

The render function visualizes the structure of the semantic tree. It is primarily used for debugging purposes.
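
To give a feel for this kind of debugging output, here is a minimal sketch of rendering a tree as indented text with a per-line character limit. Node, truncate, and render_sketch are hypothetical names, and the connector characters and truncation style are assumptions; the real render function may format its output differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field

CHAR_LIMIT = 65  # same value as DEFAULT_CHAR_DISPLAY_LIMIT on this page


@dataclass
class Node:
    """Hypothetical stand-in for a tree node with an element type and text."""

    kind: str
    text: str
    children: list[Node] = field(default_factory=list)


def truncate(text: str, limit: int) -> str:
    """Keep the first `limit` characters and mark any cut with '...'."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


def render_sketch(nodes: list[Node], *, char_limit: int = CHAR_LIMIT, prefix: str = "") -> str:
    """Return one line per node, indenting children under their parent."""
    lines: list[str] = []
    for index, node in enumerate(nodes):
        is_last = index == len(nodes) - 1
        connector = "└── " if is_last else "├── "
        lines.append(f"{prefix}{connector}{node.kind}: {truncate(node.text, char_limit)}")
        if node.children:
            # Continue the vertical guide only while siblings remain below.
            child_prefix = prefix + ("    " if is_last else "│   ")
            lines.append(render_sketch(node.children, char_limit=char_limit, prefix=child_prefix))
    return "\n".join(lines)


tree = [
    Node("Title", "Item 1", [Node("Text", "a" * 10)]),
    Node("Title", "Item 2"),
]
output = render_sketch(tree, char_limit=8)
```

A small char_limit keeps long paragraphs from flooding the output, which is the role char_display_limit appears to play in the signature above.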