Developer Guide: Comprehensive Overview
Welcome to the Comprehensive Developer Guide for sec-parser. This guide is designed to provide an in-depth understanding of the sec-parser project, whether you’re a new developer looking to contribute or an experienced one seeking to leverage its capabilities. We’ll walk you through the codebase, explaining key components and their interactions, and provide examples to help you get started.
This guide is interactive, allowing you to engage with the code and concepts as you learn. You can run and modify all the code examples shown here for yourself by cloning the repository and running the developer_guide.ipynb in a Jupyter notebook.
Alternatively, you can run the notebook directly in your browser using a cloud-based Jupyter environment.
Let’s dive in!
Environment Setup
To run the example code in this guide, you’ll need the sec_parser package:
[2]:
try:
    import sec_parser
except ImportError:
    !pip install -q sec-parser
    import sec_parser
Working with a Simplified Example
It is easier to follow along with a specific, simplified example in mind. Consider the following HTML:
[3]:
from IPython.display import display, HTML, Code
html = """
<b>Financial Overview</b>
<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>
<div>
<b>Strategies of Investment</b>
<p>Investment strategies</font> are plans that guide investors to choose <font color="green" style="color:green">the best investment opportunities</font> that align with their financial goals.</p>
<img src="https://en.wikipedia.org/static/images/icons/wikipedia.png" width="20" height="20">
</div>
"""
display(Code(html))
display(HTML(html))
<b>Financial Overview</b>
<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>
<div>
<b>Strategies of Investment</b>
<p>Investment strategies</font> are plans that guide investors to choose <font color="green" style="color:green">the best investment opportunities</font> that align with their financial goals.</p>
<img src="https://en.wikipedia.org/static/images/icons/wikipedia.png" width="20" height="20">
</div>
The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.
Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.
Utilizing BeautifulSoup for Parsing
Many SEC EDGAR filings are available in HTML format. To make these documents easier to read programmatically, we will use the BeautifulSoup (“bs4”) library to parse an HTML document into a tree-like structure of HTML tags (bs4.Tag).
Let’s apply this to our example:
[4]:
import bs4
# Utility function, ignore it
def get_children_tags(source) -> list[bs4.Tag]:
    return [tag for tag in source.children if isinstance(tag, bs4.Tag)]


# Utility function, ignore it
def tag_to_string(tag):
    text = tag.text.strip()
    if len(text) > 0:
        text = text[:10] + "..." if len(text) > 10 else text
        return f"{tag.name} (text: {text})"
    else:
        return f"{tag.name} (no text)"


parse_result = bs4.BeautifulSoup(html, "lxml").html.body
bs4_tags = get_children_tags(parse_result)
for i, tag in enumerate(bs4_tags):
    print(f"Tag {i}: {tag_to_string(tag)}")
Tag 0: b (text: Financial ...)
Tag 1: p (text: The financ...)
Tag 2: div (text: Strategies...)
Notice that children iterates only over the top-level tags. Children of children can be accessed by using the children attribute again:
[5]:
for i, tag in enumerate(get_children_tags(bs4_tags[2])):
    print(f"Tag 2 -> Tag {i}: {tag_to_string(tag)}")
Tag 2 -> Tag 0: b (text: Strategies...)
Tag 2 -> Tag 1: p (text: Investment...)
Tag 2 -> Tag 2: img (no text)
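The "children of children" idea generalizes to a recursive walk over the whole tree. The following is a minimal sketch of that idea, not part of sec-parser: to stay dependency-free it uses the standard library's xml.etree.ElementTree on a simplified, well-formed stand-in for the example, and the names snippet, walk, and outline are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the guide's example (an assumption;
# real filings are HTML and are parsed with BeautifulSoup in this guide).
snippet = """
<body>
  <b>Financial Overview</b>
  <p>The financial sector ...</p>
  <div>
    <b>Strategies of Investment</b>
    <p>Investment strategies ...</p>
    <img src="logo.png" />
  </div>
</body>
"""


def walk(node, depth=0):
    """Yield (depth, tag_name) for every descendant of node, in document order."""
    for child in node:
        yield depth, child.tag
        # Recurse: the children of each child are visited one level deeper.
        yield from walk(child, depth + 1)


root = ET.fromstring(snippet.strip())
outline = list(walk(root))
for depth, name in outline:
    print("  " * depth + name)
```

Depth 0 corresponds to the top-level tags printed earlier, and depth 1 to the tags found inside the div.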
Understanding the Role of HtmlTag
Instead of interacting directly with bs4.Tag, the SEC EDGAR HTML Parser uses HtmlTag, a wrapper around bs4.Tag.
[6]:
from sec_parser.processing_engine import HtmlTag, HtmlTagParser
print(HtmlTag.__doc__)
print(HtmlTagParser.__doc__)
The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.
It serves three main purposes:
1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we
can isolate our application logic from the library specifics. This
makes it easier to modify or even replace the HTML parsing library in
the future without extensive codebase changes.
2. Usability: The HtmlTag class provides a convenient location to add
extension methods or additional properties not offered by the native
BeautifulSoup4 Tag class. This enhances the usability of the class.
3. Caching: The HtmlTag class also caches processing results, improving
performance by avoiding unnecessary re-computation.
The HtmlTagParser parses an HTML document using BeautifulSoup4.
It then wraps the parsed bs4.Tag objects into HtmlTag objects.
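The three purposes named in the HtmlTag docstring (decoupling, usability, caching) can be illustrated with a small stand-alone wrapper. This is only a sketch of the general pattern, not the HtmlTag implementation: the SimpleTag class is hypothetical, and it wraps a standard-library xml.etree element rather than a bs4.Tag so that it runs without extra dependencies.

```python
from functools import cached_property
import xml.etree.ElementTree as ET


class SimpleTag:
    """Hypothetical wrapper, loosely modeled on the three purposes above."""

    def __init__(self, element: ET.Element) -> None:
        # Decoupling: callers never touch the underlying library object directly.
        self._element = element

    @property
    def name(self) -> str:
        return self._element.tag

    @cached_property
    def text(self) -> str:
        # Caching: the full text is computed on first access and then reused.
        return "".join(self._element.itertext()).strip()

    def get_children(self) -> list["SimpleTag"]:
        # Usability: a convenience method that returns already-wrapped children.
        return [SimpleTag(child) for child in self._element]


div = SimpleTag(ET.fromstring("<div><b>Title</b><p>Body text</p></div>"))
child_names = [child.name for child in div.get_children()]
print(div.name, child_names, div.text)
```

Because callers only see SimpleTag, the parsing backend could be swapped by changing this one class, which is the decoupling argument the docstring makes.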
Let’s apply this to our example:
[7]:
bs4_div_tag = bs4_tags[2]
display(HTML(str(bs4_div_tag)))
Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.
By wrapping the bs4.Tag object in HtmlTag, we can now access attributes and methods that are not available in bs4. For example, we can get the percentage of text that is green:
[8]:
div_tag = HtmlTag(bs4_div_tag)
percentage = div_tag.get_text_styles_metrics()[("color", "green")]
print(f"The fraction of text within this div that is colored green: {percentage:.0f}%")
The fraction of text within this div that is colored green: 21%
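To make the meaning of such a metric concrete, here is a rough sketch of how a "share of green text" could be computed by counting characters. It is an illustration under stated assumptions, not how get_text_styles_metrics is implemented: the GreenTextCounter class is hypothetical, it uses the standard library's html.parser, and it only looks at the color attribute of font tags.

```python
from html.parser import HTMLParser


class GreenTextCounter(HTMLParser):
    """Counts text characters inside and outside <font color="green"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._font_stack = []  # one bool per open <font>: is it green?
        self.green_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._font_stack.append(dict(attrs).get("color") == "green")

    def handle_endtag(self, tag):
        # Ignore stray closing tags that have no matching opening tag.
        if tag == "font" and self._font_stack:
            self._font_stack.pop()

    def handle_data(self, data):
        self.total_chars += len(data)
        # Text counts as green if any enclosing <font> tag is green.
        if any(self._font_stack):
            self.green_chars += len(data)


counter = GreenTextCounter()
counter.feed('<p>plain <font color="green">green</font></p>')
counter.close()
share = 100 * counter.green_chars / counter.total_chars
print(f"{share:.0f}% of the text is green")
```

Here 5 of the 11 text characters sit inside the green font tag, so the sketch reports 45%.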
Let’s wrap the rest of the tags in our example with HtmlTag:
[9]:
tags = [HtmlTag(bs4_tag) for bs4_tag in bs4_tags]
Defining Semantic Elements
[10]:
from sec_parser.semantic_elements import AbstractSemanticElement
print(AbstractSemanticElement.__doc__)
In the domain of HTML parsing, especially in the context of SEC EDGAR documents,
a semantic element refers to a meaningful unit within the document that serves a
specific purpose. For example, a paragraph or a table might be considered a
semantic element. Unlike syntactic elements, which merely exist to structure the
HTML, semantic elements carry information that is vital to the understanding of the
document's content.
This class serves as a foundational representation of such semantic elements,
containing an HtmlTag object that stores the raw HTML tag information. Subclasses
will implement additional behaviors based on the type of the semantic element.
A few examples of Semantic Elements:
[11]:
from sec_parser.semantic_elements import (
    TextElement,
    TableElement,
    TitleElement,
    TopSectionTitle,
    NotYetClassifiedElement,
)
print(TextElement.__doc__)
print(TableElement.__doc__)
print(TitleElement.__doc__)
print(TopSectionTitle.__doc__)
print(NotYetClassifiedElement.__doc__)
The TextElement class represents a standard text paragraph within a document.
The TableElement class represents a standard table within a document.
The TitleElement class represents the title of a paragraph or other content object.
It serves as a semantic marker, providing context and structure to the document.
The TopSectionTitle class represents the title and the beginning of a top-level
section of a document. For instance, in SEC 10-Q reports, a
top-level section could be "Part I, Item 3. Quantitative and Qualitative
Disclosures About Market Risk.".
The NotYetClassifiedElement class represents an element whose type
has not yet been determined. The parsing process aims to
classify all instances of this class into more specific
subclasses of AbstractSemanticElement.
To summarize, the purpose of parsing is to produce an ordered list of Semantic Elements from a tree of HTML Tags.
Let’s apply this to our example:
At the beginning of parsing the example we would have the following Semantic Elements:
[12]:
# Utility function, ignore it
def show(elements):
    for element in elements:
        text = element.text[:10]
        if hasattr(element, "inner_elements"):
            print(f"{element} (has {len(element.inner_elements)} elements inside)")
        elif text:
            print(f"{element} (text: {text}...)")
        else:
            print(f"{element}")


initial_elements = [NotYetClassifiedElement(tag) for tag in tags]
show(initial_elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)
At the end of our parsing we expect to have the following Semantic Elements:
[13]:
from sec_parser import ImageElement
expected_elements: list[AbstractSemanticElement] = [
    TitleElement(tags[0]),
    TextElement(tags[1]),
    TitleElement(tags[2].get_children()[0]),
    TextElement(tags[2].get_children()[1]),
    ImageElement(tags[2].get_children()[2]),
]
show(expected_elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
TitleElement[L0]<b> (text: Strategies...)
TextElement<p> (text: Investment...)
ImageElement<img>
Understanding the Parsing Process
[14]:
from sec_parser.processing_engine import AbstractSemanticElementParser
print(AbstractSemanticElementParser.__doc__)
Responsible for parsing semantic elements from HTML documents.
It takes raw HTML and turns it into a list of objects
representing semantic elements.
At a High Level:
==================
1. Extract top-level HTML tags from the document.
2. Transform these tags into a list of more specific semantic
elements step-by-step.
Why Focus on Top-Level Tags?
============================
SEC filings usually have a flat HTML structure, which simplifies the
parsing process.Each top-level HTML tag often directly corresponds
to a single semantic element. This is different from many websites
where HTML tags are nested deeply,requiring more complex parsing.
For Advanced Users:
====================
The parsing process is implemented as a sequence of steps and allows for
customization at each step.
- Pipeline Pattern: Raw HTML tags are processed in a sequential manner.
The steps follow an ordered, step-by-step approach, akin to a Finite
State Machine (FSM). Each element transitions through various states
defined by the sequence of processing steps.
- Strategy Pattern: Each step is customizable. You can either replace,
remove, or extend any of the existing steps with your own or
inherited implementation. Alternatively, you can replace the entire pipeline
with your own process.
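The pipeline and strategy patterns described in the docstring can be sketched in a few lines of plain Python. This is a simplified illustration, not the sec-parser engine: Element, classify_titles, classify_text, and run_pipeline are hypothetical names, and each step is modeled as a plain function from a list of elements to a new list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Element:
    kind: str       # e.g. "unclassified", "title", "text"
    tag_name: str   # name of the underlying HTML tag


# A step takes the current list of elements and returns a new list.
Step = Callable[[list[Element]], list[Element]]


def classify_titles(elements: list[Element]) -> list[Element]:
    return [
        Element("title", e.tag_name)
        if e.kind == "unclassified" and e.tag_name == "b"
        else e
        for e in elements
    ]


def classify_text(elements: list[Element]) -> list[Element]:
    return [
        Element("text", e.tag_name)
        if e.kind == "unclassified" and e.tag_name == "p"
        else e
        for e in elements
    ]


def run_pipeline(
    elements: list[Element], get_steps: Callable[[], list[Step]]
) -> list[Element]:
    # Pipeline: each step's output becomes the next step's input.
    for step in get_steps():
        elements = step(elements)
    return elements


initial = [Element("unclassified", name) for name in ("b", "p", "div")]
# Strategy: the caller decides which steps run, and in what order.
result = run_pipeline(initial, lambda: [classify_titles, classify_text])
kinds = [e.kind for e in result]
print(kinds)
```

Passing a callable that returns an empty list leaves every element unclassified, which mirrors the "no steps, no processing" behavior described in the guide.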
Let’s apply this to our example:
Processing is organized in steps. If there are no steps, there will be no processing:
[15]:
from sec_parser import Edgar10QParser
def get_steps():
    return []


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
show(elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)
As you can see, the result is exactly the same as wrapping the tags with NotYetClassifiedElement:
[16]:
show([NotYetClassifiedElement(tag) for tag in tags])
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)
Let’s create the first simple parsing step that naively identifies title and text tags.
[17]:
from sec_parser.processing_steps import AbstractProcessingStep
class MyClassifier(AbstractProcessingStep):
    def __init__(self):
        super().__init__()
        # You can hold state in your processing steps
        self.processed_tags_count = 0

    # This method must be implemented when inheriting from AbstractProcessingStep
    def _process(self, elements):
        parsed = []
        for e in elements:
            self.processed_tags_count += 1
            if e.html_tag.name == "b":
                parsed.append(TitleElement.create_from_element(e, ""))
            elif e.html_tag.name == "p":
                parsed.append(TextElement.create_from_element(e, ""))
            else:
                parsed.append(e)
        print(
            f"MyClassifier: Successfully processed {self.processed_tags_count} tags!\n"
        )
        return parsed


def get_steps() -> list[AbstractProcessingStep]:
    return [MyClassifier()]


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
MyClassifier: Successfully processed 3 tags!
[18]:
show(elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
NotYetClassifiedElement<div> (text: Strategies...)
The third tag cannot be identified as a single Semantic Element. Let’s see what we can do about it.
Handling Multiple Semantic Elements in a Single HTML Tag
If multiple Semantic Elements are in the same HTML tag, we first identify such cases by classifying the element as a CompositeSemanticElement.
[19]:
from sec_parser.semantic_elements import CompositeSemanticElement
print(CompositeSemanticElement.__doc__)
CompositeSemanticElement acts as a container for other semantic elements,
especially for cases where a single HTML root tag wraps multiple elements.
This ensures structural integrity and enables various features like
semantic segmentation visualization, and debugging by comparison with the
original document.
Why is this useful:
===================
1. Some semantic elements, like XBRL tags (<ix>), may wrap multiple semantic
elements. The container ensures that these relationships are not broken
during parsing.
2. Enables the parser to fully reconstruct the original HTML document, which
opens up possibilities for features like semantic segmentation visualization
(e.g. recreate the original document but put semi-transparent colored boxes
on top, based on semantic meaning), serialization of parsed documents into
an augmented HTML, and debugging by comparing to the original document.
Let’s apply this to our example by creating a naive implementation of the identification step:
[20]:
class CompositeElementIdentificationStep(AbstractProcessingStep):
    def _process(self, elements):
        result = []
        for e in elements:
            if e.html_tag.name == "div":
                result.append(
                    CompositeSemanticElement.create_from_element(
                        e,
                        inner_elements=[
                            NotYetClassifiedElement(t)
                            for t in e.html_tag.get_children()
                        ],
                        log_origin="CompositeElementIdentificationStep",
                    )
                )
            else:
                result.append(e)
        return result


parser = Edgar10QParser(lambda: [CompositeElementIdentificationStep()])
elements = parser.parse(html, unwrap_elements=False)
show(elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
CompositeSemanticElement<div> (has 3 elements inside)
We have successfully identified the tag as a CompositeSemanticElement.
However, CompositeSemanticElement is intended for more advanced use cases. Normally we won’t even notice it (we had to set the unwrap_elements flag to False to see it):
[21]:
elements = parser.parse(html)
show(elements)
NotYetClassifiedElement<b> (text: Financial ...)
NotYetClassifiedElement<p> (text: The financ...)
NotYetClassifiedElement<b> (text: Strategies...)
NotYetClassifiedElement<p> (text: Investment...)
NotYetClassifiedElement<img>
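The effect of unwrapping can be pictured as a recursive flattening of composite containers into their inner elements. The sketch below is only an illustration of that idea and not the library's code: Leaf, Composite, and unwrap are hypothetical stand-ins for the real element classes and for whatever sec-parser does internally when unwrap_elements is left at its default.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    name: str


@dataclass
class Composite:
    name: str
    inner_elements: list = field(default_factory=list)


def unwrap(elements):
    """Flatten composites recursively, preserving document order."""
    flat = []
    for element in elements:
        if isinstance(element, Composite):
            # Replace the container with its (recursively unwrapped) contents.
            flat.extend(unwrap(element.inner_elements))
        else:
            flat.append(element)
    return flat


parsed = [
    Leaf("b"),
    Leaf("p"),
    Composite("div", [Leaf("b"), Leaf("p"), Leaf("img")]),
]
flat_names = [leaf.name for leaf in unwrap(parsed)]
print(flat_names)
```

Three top-level elements become five, in the same order as in the unwrapped output shown above.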
We can now combine the steps. One step’s output is the next step’s input, so the order matters:
[22]:
def get_steps():
    return [
        CompositeElementIdentificationStep(),
        MyClassifier(),
    ]


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
show(elements)
MyClassifier: Successfully processed 3 tags!
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
NotYetClassifiedElement<b> (text: Strategies...)
NotYetClassifiedElement<p> (text: Investment...)
NotYetClassifiedElement<img>
Notice that the inner elements of CompositeSemanticElement did not get processed. This is because they require special handling. A simple way to handle them is to inherit from AbstractElementwiseProcessingStep:
[23]:
from sec_parser.processing_steps import AbstractElementwiseProcessingStep
class BetterClassifier(AbstractElementwiseProcessingStep):
    def _process_element(self, element, context):
        if element.html_tag.name == "b":
            return TitleElement.create_from_element(element, "")
        elif element.html_tag.name == "p":
            return TextElement.create_from_element(element, "")
        elif element.html_tag.name == "img":
            return ImageElement.create_from_element(element, "")
        return element
[24]:
def get_steps():
    return [
        CompositeElementIdentificationStep(),
        BetterClassifier(),
    ]


parser = Edgar10QParser(get_steps)
elements = parser.parse(html)
show(elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
TitleElement[L0]<b> (text: Strategies...)
TextElement<p> (text: Investment...)
ImageElement<img>
Parsing is now complete, and the result matches the elements we expected:
[25]:
show(expected_elements)
TitleElement[L0]<b> (text: Financial ...)
TextElement<p> (text: The financ...)
TitleElement[L0]<b> (text: Strategies...)
TextElement<p> (text: Investment...)
ImageElement<img>
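Why an elementwise step reaches the inner elements while a plain step does not can be sketched as follows. This is a hedged illustration of the general idea (a base class that recurses into containers and delegates single elements to a subclass hook), not the actual AbstractElementwiseProcessingStep code; Node, ElementwiseStep, and TagNameClassifier are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    tag_name: str
    kind: str = "unclassified"
    inner: tuple = ()  # non-empty only for composite nodes


class ElementwiseStep:
    """Hypothetical base class: subclasses implement process_element only."""

    def process(self, nodes):
        result = []
        for node in nodes:
            if node.inner:
                # Recurse so that nested elements are visited as well.
                processed_inner = tuple(self.process(node.inner))
                result.append(replace(node, inner=processed_inner))
            else:
                result.append(self.process_element(node))
        return result

    def process_element(self, node):
        raise NotImplementedError


class TagNameClassifier(ElementwiseStep):
    KINDS = {"b": "title", "p": "text", "img": "image"}

    def process_element(self, node):
        # Unknown tag names keep their current kind.
        return replace(node, kind=self.KINDS.get(node.tag_name, node.kind))


nodes = [
    Node("b"),
    Node("div", kind="composite", inner=(Node("b"), Node("p"), Node("img"))),
]
processed = TagNameClassifier().process(nodes)
inner_kinds = [n.kind for n in processed[1].inner]
print(processed[0].kind, inner_kinds)
```

The subclass only describes how to classify one element; the traversal, including the descent into containers, lives in the base class.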
Introduction to Semantic Trees
[26]:
from sec_parser.semantic_tree import TreeBuilder
print(TreeBuilder.__doc__)
Builds a semantic tree from a list of semantic elements.
Why Use a Tree Structure?
=========================
Using a tree data structure allows for easier and more robust filtering of sections.
With a tree, you can select specific branches to filter, making it straightforward
to identify section boundaries. This approach is more maintainable and robust
compared to attempting the same operations on a flat list of elements.
Overview:
=========
1. Takes a list of semantic elements.
2. Applies nesting rules to these elements.
Customization:
==============
The nesting process is customizable through a list of rules. These rules determine
how new elements should be nested under existing ones.
Advanced Customization:
=======================
You can supply your own set of rules by providing a callable to `get_rules`, which
should return a list of `AbstractNestingRule` instances.
Let’s apply this to our example:
A very similar processing pattern is used here as well:
[27]:
from sec_parser.semantic_tree import AlwaysNestAsParentRule, AbstractNestingRule, render


def get_rules() -> list[AbstractNestingRule]:
    return [
        AlwaysNestAsParentRule(TitleElement),
    ]


builder = TreeBuilder(get_rules)
tree = builder.build(elements)
print(render(list(tree)))
TitleElement: Financial Overview
└── TextElement: The financial sector is a categ...ommercial and retail customers.
TitleElement: Strategies of Investment
├── TextElement: Investment strategies are plans...ign with their financial goals.
└── ImageElement
[28]:
print(render(list(tree)[0]))
TitleElement: Financial Overview
└── TextElement: The financial sector is a categ...ommercial and retail customers.
[29]:
print(render(list(tree)[1]))
TitleElement: Strategies of Investment
├── TextElement: Investment strategies are plans...ign with their financial goals.
└── ImageElement
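The nesting seen above, where each title collects the elements that follow it, can be approximated with a short stand-alone sketch. It assumes one simplified rule (every non-title element is nested under the most recent title, and titles always start a new root) and does not reproduce the real TreeBuilder or AlwaysNestAsParentRule logic; TreeNode, build_forest, and render_forest are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    label: str
    is_title: bool = False
    children: list["TreeNode"] = field(default_factory=list)


def build_forest(nodes):
    """Nest every non-title node under the most recent title node."""
    roots = []
    current_title = None
    for node in nodes:
        if node.is_title:
            current_title = node
            roots.append(node)
        elif current_title is not None:
            current_title.children.append(node)
        else:
            roots.append(node)  # nothing to nest under yet
    return roots


def render_forest(roots, indent=0):
    """Return one indented line per node, depth-first."""
    lines = []
    for node in roots:
        lines.append("  " * indent + node.label)
        lines.extend(render_forest(node.children, indent + 1))
    return lines


flat = [
    TreeNode("Title: Financial Overview", is_title=True),
    TreeNode("Text: The financial sector ..."),
    TreeNode("Title: Strategies of Investment", is_title=True),
    TreeNode("Text: Investment strategies ..."),
    TreeNode("Image"),
]
forest = build_forest(flat)
print("\n".join(render_forest(forest)))
```

With the tree in place, selecting a whole section amounts to picking one root and its children, which is the filtering benefit the TreeBuilder docstring describes.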
For further understanding of sec-parser, refer to the Documentation. If you’re interested in contributing, consider checking out our Contribution Guide.