Developer Guide: Comprehensive Overview

Welcome to the Comprehensive Developer Guide for sec-parser. This guide is designed to provide an in-depth understanding of the sec-parser project, whether you’re a new developer looking to contribute, or an experienced one seeking to leverage its capabilities. We’ll walk you through the codebase, explaining key components and their interactions, and provide examples to help you get started.

This guide is interactive, allowing you to engage with the code and concepts as you learn. You can run and modify all the code examples shown here for yourself by cloning the repository and running the developer_guide.ipynb in a Jupyter notebook.

Alternatively, you can run the notebook directly in your browser using a cloud-based Jupyter environment such as Google Colab, Binder, Kaggle, or SageMaker Studio Lab.

Let’s dive in!

Environment Setup

To run the example code in this guide, you’ll need the sec_parser package:

[2]:
try:
    import sec_parser
except ImportError:
    !pip install -q sec-parser
    import sec_parser

Working with a Simplified Example

It will be easier to follow along if we have a specific, simplified example in mind. Consider the following HTML:

[3]:
from IPython.display import display, HTML, Code

html = """
<b>Financial Overview</b>
<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>
<div>
    <b>Strategies of Investment</b>
    <p>Investment strategies</font> are plans that guide investors to choose <font color="green" style="color:green">the best investment opportunities</font> that align with their financial goals.</p>
    <img src="https://en.wikipedia.org/static/images/icons/wikipedia.png" width="20" height="20">
</div>
"""

display(Code(html))
display(HTML(html))
<b>Financial Overview</b>
<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>
<div>
    <b>Strategies of Investment</b>
    <p>Investment strategies</font> are plans that guide investors to choose <font color="green" style="color:green">the best investment opportunities</font> that align with their financial goals.</p>
    <img src="https://en.wikipedia.org/static/images/icons/wikipedia.png" width="20" height="20">
</div>
Financial Overview

The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.

Strategies of Investment

Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.

Utilizing BeautifulSoup for Parsing

Many SEC EDGAR filings are available in HTML format. To make these documents easier to work with, we use the BeautifulSoup (“bs4”) library to parse an HTML document into a tree-like structure of HTML Tags (bs4.Tag).

Let’s apply this to our example:

[4]:
import bs4


# Utility function, ignore it
def get_children_tags(source) -> list[bs4.Tag]:
    return [tag for tag in source.children if isinstance(tag, bs4.Tag)]


# Utility function, ignore it
def tag_to_string(tag):
    text = tag.text.strip()
    if len(text) > 0:
        text = text[:10] + "..." if len(text) > 10 else text
        return f"{tag.name} (text: {text})"
    else:
        return f"{tag.name} (no text)"


parse_result = bs4.BeautifulSoup(html, "lxml").html.body
bs4_tags = get_children_tags(parse_result)
for i, tag in enumerate(bs4_tags):
    print(f"Tag {i}: {tag_to_string(tag)}")
Tag 0: b (text: Financial ...)
Tag 1: p (text: The financ...)
Tag 2: div (text: Strategies...)

Notice that children iterates only over the top-level tags. Children of children can be accessed by using the children attribute again:

[5]:
for i, tag in enumerate(get_children_tags(bs4_tags[2])):
    print(f"Tag 2 -> Tag {i}: {tag_to_string(tag)})")
Tag 2 -> Tag 0: b (text: Strategies...))
Tag 2 -> Tag 1: p (text: Investment...))
Tag 2 -> Tag 2: img (no text))
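
If you need to go deeper than one level, a small recursive helper can walk the entire tag tree. This is a minimal sketch built only on the bs4 calls already used above:

# A sketch: recursively walk all nested bs4.Tag objects,
# printing each tag indented by its depth in the tree.
def walk(tag: bs4.Tag, depth: int = 0) -> None:
    print("  " * depth + tag_to_string(tag))
    for child in get_children_tags(tag):
        walk(child, depth + 1)


for top_level_tag in bs4_tags:
    walk(top_level_tag)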

Understanding the Role of HtmlTag

Instead of interacting directly with bs4.Tag, the SEC EDGAR HTML Parser uses HtmlTag, a wrapper around bs4.Tag.

[6]:
from sec_parser.processing_engine import HtmlTag, HtmlTagParser

print(HtmlTag.__doc__)
print(HtmlTagParser.__doc__)

    The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.

    It serves three main purposes:

    1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we
       can isolate our application logic from the library specifics. This
       makes it easier to modify or even replace the HTML parsing library in
       the future without extensive codebase changes.

    2. Usability: The HtmlTag class provides a convenient location to add
       extension methods or additional properties not offered by the native
       BeautifulSoup4 Tag class. This enhances the usability of the class.

    3. Caching: The HtmlTag class also caches processing results, improving
       performance by avoiding unnecessary re-computation.


    The HtmlTagParser parses an HTML document using BeautifulSoup4.
    It then wraps the parsed bs4.Tag objects into HtmlTag objects.

Let’s apply this to our example:

[7]:
bs4_div_tag = bs4_tags[2]
display(HTML(str(bs4_div_tag)))
Strategies of Investment

Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.

By wrapping the bs4.Tag object in an HtmlTag, we gain access to attributes and methods that are not available in bs4. For example, we can get the percentage of text that is colored green:

[8]:
div_tag = HtmlTag(bs4_div_tag)
percentage = div_tag.get_text_styles_metrics()[("color", "green")]
print(f"The fraction of text within this div that is colored green: {percentage:.0f}%")
The fraction of text within this div that is colored green: 21%
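
The same metrics can also be inspected as a whole. A minimal sketch, assuming (as the lookup above suggests) that get_text_styles_metrics() maps (CSS property, value) pairs to the percentage of text they cover:

# A sketch: list every (CSS property, value) pair found in the div,
# together with the percentage of text it covers.
for (css_property, css_value), pct in div_tag.get_text_styles_metrics().items():
    print(f"{css_property}: {css_value} -> {pct:.0f}%")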

Let’s wrap the rest of the tags in our example with HtmlTag:

[9]:
tags = [HtmlTag(bs4_tag) for bs4_tag in bs4_tags]

Defining Semantic Elements

[10]:
from sec_parser.semantic_elements import AbstractSemanticElement

print(AbstractSemanticElement.__doc__)

    In the domain of HTML parsing, especially in the context of SEC EDGAR documents,
    a semantic element refers to a meaningful unit within the document that serves a
    specific purpose. For example, a paragraph or a table might be considered a
    semantic element. Unlike syntactic elements, which merely exist to structure the
    HTML, semantic elements carry information that is vital to the understanding of the
    document's content.

    This class serves as a foundational representation of such semantic elements,
    containing an HtmlTag object that stores the raw HTML tag information. Subclasses
    will implement additional behaviors based on the type of the semantic element.

A few examples of Semantic Elements:

[11]:
from sec_parser.semantic_elements import (
    TextElement,
    TableElement,
    TitleElement,
    TopSectionTitle,
    NotYetClassifiedElement,
)

print(TextElement.__doc__)
print(TableElement.__doc__)
print(TitleElement.__doc__)
print(TopSectionTitle.__doc__)
print(NotYetClassifiedElement.__doc__)
The TextElement class represents a standard text paragraph within a document.
The TableElement class represents a standard table within a document.

    The TitleElement class represents the title of a paragraph or other content object.
    It serves as a semantic marker, providing context and structure to the document.


    The TopSectionTitle class represents the title and the beginning of a top-level
    section of a document. For instance, in SEC 10-Q reports, a
    top-level section could be "Part I, Item 3. Quantitative and Qualitative
    Disclosures About Market Risk.".


    The NotYetClassifiedElement class represents an element whose type
    has not yet been determined. The parsing process aims to
    classify all instances of this class into more specific
    subclasses of AbstractSemanticElement.

To summarize, the purpose of parsing is to produce an ordered list of Semantic Elements from a tree of HTML Tags.
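
To make this concrete, every semantic element wraps an HtmlTag and exposes its text; these are the same attributes the helper code below relies on:

# A sketch: a semantic element keeps the wrapped HtmlTag (html_tag)
# and exposes the underlying text (text).
element = NotYetClassifiedElement(tags[0])
print(type(element).__name__, element.html_tag.name, element.text)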

Let’s apply this to our example:

At the beginning of parsing the example we would have the following Semantic Elements:

[12]:
# Utility function, ignore it
def show(elements):
    for element in elements:
        text = element.text[:10]
        if hasattr(element, "inner_elements"):
            print(f"{element} (has {len(element.inner_elements)} elements inside)")
        elif text:
            print(f"{element} (text: {text}...)")
        else:
            print(f"{element}")


initial_elements = [NotYetClassifiedElement(tag) for tag in tags]
show(initial_elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)

At the end of our parsing we expect to have the following Semantic Elements:

[13]:
from sec_parser import ImageElement

expected_elements: list[AbstractSemanticElement] = [
    TitleElement(tags[0]),
    TextElement(tags[1]),
    TitleElement(tags[2].get_children()[0]),
    TextElement(tags[2].get_children()[1]),
    ImageElement(tags[2].get_children()[2]),
]
show(expected_elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
TitleElement[L0]<b> (text: Strategies...)
TextElement<p> (text: Investment...)
ImageElement<img>

Understanding the Parsing Process

[14]:
from sec_parser.processing_engine import AbstractSemanticElementParser

print(AbstractSemanticElementParser.__doc__)

    Responsible for parsing semantic elements from HTML documents.
    It takes raw HTML and turns it into a list of objects
    representing semantic elements.

    At a High Level:
    ==================
    1. Extract top-level HTML tags from the document.
    2. Transform these tags into a list of more specific semantic
       elements step-by-step.

    Why Focus on Top-Level Tags?
    ============================
    SEC filings usually have a flat HTML structure, which simplifies the
    parsing process. Each top-level HTML tag often directly corresponds
    to a single semantic element. This is different from many websites
    where HTML tags are nested deeply, requiring more complex parsing.

    For Advanced Users:
    ====================
    The parsing process is implemented as a sequence of steps and allows for
    customization at each step.

    - Pipeline Pattern: Raw HTML tags are processed in a sequential manner.
      The steps follow an ordered, step-by-step approach, akin to a Finite
      State Machine (FSM). Each element transitions through various states
      defined by the sequence of processing steps.

    - Strategy Pattern: Each step is customizable. You can either replace,
      remove, or extend any of the existing steps with your own or
      inherited implementation. Alternatively, you can replace the entire pipeline
      with your own process.
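
Before customizing anything, note that Edgar10QParser also works out of the box with its default pipeline of steps. A minimal sketch, assuming the no-argument constructor used in the project README:

from sec_parser import Edgar10QParser

# A sketch: run the parser with its built-in default processing steps
# (no custom get_steps callable supplied).
default_parser = Edgar10QParser()
default_elements = default_parser.parse(html)
show(default_elements)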

Let’s apply this to our example:

Processing is organized in steps. If there are no steps, there will be no processing:

[15]:
from sec_parser import Edgar10QParser


def get_steps():
    return []


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
show(elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)

As you can see, it is exactly the same as just wrapping the tags with NotYetClassifiedElement:

[16]:
show([NotYetClassifiedElement(tag) for tag in tags])
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)

Let’s create our first simple processing step, one that naively identifies title and text tags.

[17]:
from sec_parser.processing_steps import AbstractProcessingStep


class MyClassifier(AbstractProcessingStep):
    def __init__(self):
        super().__init__()
        # You can hold state in your processing steps
        self.processed_tags_count = 0

    # This method must be implemented when inheriting from AbstractProcessingStep
    def _process(self, elements):
        parsed = []
        for e in elements:
            self.processed_tags_count += 1
            if e.html_tag.name == "b":
                parsed.append(TitleElement.create_from_element(e, ""))
            elif e.html_tag.name == "p":
                parsed.append(TextElement.create_from_element(e, ""))
            else:
                parsed.append(e)
        print(
            f"MyClassifier: Successfully processed {self.processed_tags_count} tags!\n"
        )
        return parsed


def get_steps() -> list[AbstractProcessingStep]:
    return [MyClassifier()]


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
MyClassifier: Successfully processed 3 tags!

[18]:
show(elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)

The third tag cannot be identified as a single Semantic Element; let’s see what we can do about it.

Handling Multiple Semantic Elements in a Single HTML Tag

If multiple Semantic Elements share a single HTML tag, we first identify such cases by marking the element as a CompositeSemanticElement.

[19]:
from sec_parser.semantic_elements import CompositeSemanticElement

print(CompositeSemanticElement.__doc__)

    CompositeSemanticElement acts as a container for other semantic elements,
    especially for cases where a single HTML root tag wraps multiple elements.
    This ensures structural integrity and enables various features like
    semantic segmentation visualization, and debugging by comparison with the
    original document.

    Why is this useful:
    ===================
    1. Some semantic elements, like XBRL tags (<ix>), may wrap multiple semantic
    elements. The container ensures that these relationships are not broken
    during parsing.
    2. Enables the parser to fully reconstruct the original HTML document, which
    opens up possibilities for features like semantic segmentation visualization
    (e.g. recreate the original document but put semi-transparent colored boxes
    on top, based on semantic meaning), serialization of parsed documents into
    an augmented HTML, and debugging by comparing to the original document.

Let’s apply this to our example by creating a naive implementation of the identification step:

[20]:
class CompositeElementIdentificationStep(AbstractProcessingStep):
    def _process(self, elements):
        result = []
        for e in elements:
            if e.html_tag.name == "div":
                result.append(
                    CompositeSemanticElement.create_from_element(
                        e,
                        inner_elements=[
                            NotYetClassifiedElement(t)
                            for t in e.html_tag.get_children()
                        ],
                        log_origin="CompositeElementIdentificationStep",
                    )
                )
            else:
                result.append(e)
        return result


parser = Edgar10QParser(lambda: [CompositeElementIdentificationStep()])
elements = parser.parse(html, unwrap_elements=False)
show(elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
CompositeSemanticElement<div> (has 3 elements inside)

We have successfully identified the tag as a CompositeSemanticElement.

However, CompositeSemanticElement is intended for more advanced use cases; normally we won’t even notice it (we had to set the unwrap_elements flag to False to see it):

[21]:
elements = parser.parse(html)
show(elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<b> (text: Strategies...)
NotYetClassifiedElement<p> (text: Investment...)
NotYetClassifiedElement<img>

We can now combine the steps. One step’s output is the next step’s input, so the order is important:

[22]:
def get_steps():
    return [
        CompositeElementIdentificationStep(),
        MyClassifier(),
    ]


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
show(elements)
MyClassifier: Successfully processed 3 tags!

TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
NotYetClassifiedElement<b> (text: Strategies...)
NotYetClassifiedElement<p> (text: Investment...)
NotYetClassifiedElement<img>

Notice that the inner elements of the CompositeSemanticElement did not get processed. This is because they require special handling. A simple way to handle them is to inherit from AbstractElementwiseProcessingStep:

[23]:
from sec_parser.processing_steps import AbstractElementwiseProcessingStep


class BetterClassifier(AbstractElementwiseProcessingStep):
    def _process_element(self, element, context):
        if element.html_tag.name == "b":
            return TitleElement.create_from_element(element, "")
        elif element.html_tag.name == "p":
            return TextElement.create_from_element(element, "")
        elif element.html_tag.name == "img":
            return ImageElement.create_from_element(element, "")
        return element
[24]:
def get_steps():
    return [
        CompositeElementIdentificationStep(),
        BetterClassifier(),
    ]


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
show(elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
TitleElement[L0]<b> (text: Strategies...)
TextElement<p> (text: Investment...)
ImageElement<img>

We have completed the HTML parsing: the result matches what we intended:

[25]:
show(expected_elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
TitleElement[L0]<b> (text: Strategies...)
TextElement<p> (text: Investment...)
ImageElement<img>

Introduction to Semantic Trees

[26]:
from sec_parser.semantic_tree import TreeBuilder

print(TreeBuilder.__doc__)

    Builds a semantic tree from a list of semantic elements.

    Why Use a Tree Structure?
    =========================
    Using a tree data structure allows for easier and more robust filtering of sections.
    With a tree, you can select specific branches to filter, making it straightforward
    to identify section boundaries. This approach is more maintainable and robust
    compared to attempting the same operations on a flat list of elements.

    Overview:
    =========
    1. Takes a list of semantic elements.
    2. Applies nesting rules to these elements.

    Customization:
    ==============
    The nesting process is customizable through a list of rules. These rules determine
    how new elements should be nested under existing ones.

    Advanced Customization:
    =======================
    You can supply your own set of rules by providing a callable to `get_rules`, which
    should return a list of `AbstractNestingRule` instances.

Let’s apply this to our example:

A very similar processing pattern is used here as well:

[27]:
from sec_parser.semantic_tree import AlwaysNestAsParentRule, AbstractNestingRule, render


def get_rules() -> list[AbstractNestingRule]:
    return [
        AlwaysNestAsParentRule(TitleElement),
    ]


builder = TreeBuilder(get_rules)
tree = builder.build(elements)
print(render(list(tree)))
TitleElement: Financial Overview
└── TextElement: The financial sector is a categ...ommercial and retail customers.
TitleElement: Strategies of Investment
├── TextElement: Investment strategies are plans...ign with their financial goals.
└── ImageElement
[28]:
print(render(list(tree)[0]))
TitleElement: Financial Overview
└── TextElement: The financial sector is a categ...ommercial and retail customers.
[29]:
print(render(list(tree)[1]))
TitleElement: Strategies of Investment
├── TextElement: Investment strategies are plans...ign with their financial goals.
└── ImageElement
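
Putting it all together, the full flow from raw HTML to a rendered semantic tree takes only a few lines. This recap sketch reuses the custom steps and nesting rules defined earlier in this guide:

# A recap sketch: parse raw HTML into semantic elements using our custom
# steps, then nest them into a semantic tree using our custom rules.
parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
tree = TreeBuilder(get_rules).build(elements)
print(render(list(tree)))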

For further understanding of sec-parser, refer to the Documentation. If you’re interested in contributing, consider checking out our Contribution Guide.