User Guide: Quick Start

Welcome to the User Guide for sec-parser! This guide is designed to walk you through the fundamental steps needed to install and use the library for parsing SEC EDGAR HTML documents into semantic elements and trees. Whether you’re a financial analyst, a data scientist, or someone interested in SEC filings, this guide provides examples and code snippets to help you get started.

This guide is interactive, allowing you to engage with the code and concepts as you learn. You can run and modify all of the code examples shown here yourself by cloning the repository and opening user_guide.ipynb in a Jupyter notebook.

Alternatively, you can run the notebook directly in your browser on Google Colab, Binder, Kaggle, or SageMaker Studio Lab.

Let’s get started!

Getting Started

This guide will walk you through the process of installing the sec-parser package and using it to extract the “Segment Operating Performance” section as a semantic tree from the latest Apple 10-Q filing.

Installation

First, install the sec-parser package using pip:

[2]:
try:
    import sec_parser
except ImportError:
    !pip install -q sec-parser
    import sec_parser

To run the example code in this guide, you'll also need the sec_downloader package:

[3]:
import os

try:
    import sec_downloader
except ImportError:
    !pip install -q sec-downloader
    import sec_downloader

Usage

Once you’ve installed the necessary packages, you can start by downloading the filing from the SEC EDGAR website. Here’s how you can do it:

[4]:
from sec_downloader import Downloader

# Initialize the downloader with your company name and email
dl = Downloader("MyCompanyName", "email@example.com")

# Download the latest 10-Q filing for Apple
html = dl.get_filing_html(ticker="AAPL", form="10-Q")

[!NOTE] The company name and email address are used to form a user-agent string that adheres to SEC EDGAR's fair access policy for programmatic downloading.

Now, we can parse the filing HTML into a list of semantic elements:

[5]:
# Utility function to make the example code a bit more compact
def print_first_n_lines(text: str, *, n: int):
    print("\n".join(text.split("\n")[:n]), "...", sep="\n")
[6]:
import sec_parser as sp

elements: list = sp.Edgar10QParser().parse(html)

demo_output: str = sp.render(elements)
print_first_n_lines(demo_output, n=7)
TopSectionTitle: PART I  —  FINANCIAL INFORMATION
TopSectionTitle: Item 1.    Financial Statements
TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
SupplementaryText: (In millions, except number of ...ousands, and per-share amounts)
TableElement: Table with ~24 rows, ~40 numbers, and 742 characters.
SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
...
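Once parsed, the flat element list is easy to filter by type. Here is a minimal sketch of that idea; the `elements_of_type` helper is our own, not part of the sec-parser API:

```python
# Hypothetical helper, not part of sec-parser: collect all elements
# of a given semantic type from the flat list returned by the parser.
def elements_of_type(elements, element_type):
    return [e for e in elements if isinstance(e, element_type)]
```

With the filing parsed above, something like `elements_of_type(elements, sp.TableElement)` would return only the tables (assuming `TableElement` is exported at the top level, like the `sp.TextElement` used later in this guide).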

We can also construct a semantic tree to allow for easy filtering by parent sections:

[7]:
tree = sp.TreeBuilder().build(elements)

demo_output: str = sp.render(tree)
print_first_n_lines(demo_output, n=7)
TopSectionTitle: PART I  —  FINANCIAL INFORMATION
├── TopSectionTitle: Item 1.    Financial Statements
│   ├── TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
│   │   ├── SupplementaryText: (In millions, except number of ...ousands, and per-share amounts)
│   │   ├── TableElement: Table with ~24 rows, ~40 numbers, and 742 characters.
│   │   └── SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
│   ├── TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
...
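The tree exposes a flat iterator of nodes (`tree.nodes`, used again further below), which makes it straightforward to look up a section by its title. A small sketch; `find_section` is our own helper name, not a library function:

```python
# Hypothetical helper, not part of sec-parser: return the first tree node
# whose text starts with the given prefix, or None if there is no match.
def find_section(nodes, prefix):
    return next((n for n in nodes if n.text.startswith(prefix)), None)
```

For example, `find_section(tree.nodes, "Segment")` would locate the "Segment Operating Performance" section extracted later in this guide.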

Advanced Usage

Processing is organized into steps. You can modify, add, or remove steps as needed. Each step is a function that takes a list of elements as input and returns a list of elements as output; the output of one step becomes the input of the next.
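The step contract described above — a list of elements in, a list of elements out, chained in order — can be sketched as a simple fold. `run_pipeline` here is an illustration of the idea, not a sec-parser function:

```python
# Illustration only: chain processing steps by feeding each step's
# output list into the next step, as described above.
def run_pipeline(steps, elements):
    for step in steps:
        elements = step(elements)
    return elements
```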

[8]:
steps = sp.Edgar10QParser().get_default_steps()

for i, step in enumerate(steps, 1):
    print(f"Step {i}: {step.__class__.__name__}")
Step 1: IndividualSemanticElementExtractor
Step 2: ImageClassifier
Step 3: EmptyElementClassifier
Step 4: TableClassifier
Step 5: TableOfContentsClassifier
Step 6: TopSectionManagerFor10Q
Step 7: IntroductorySectionElementClassifier
Step 8: TextClassifier
Step 9: HighlightedTextClassifier
Step 10: SupplementaryTextClassifier
Step 11: PageHeaderClassifier
Step 12: PageNumberClassifier
Step 13: TitleClassifier
Step 14: TextElementMerger
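Swapping one step for another, as done in the next cell, follows a simple pattern that can be wrapped in a helper. This is our own sketch, not part of the library:

```python
# Hypothetical helper, not part of sec-parser: return a copy of the
# pipeline with every step of old_type replaced by new_step.
def replace_step(steps, old_type, new_step):
    return [new_step if isinstance(s, old_type) else s for s in steps]
```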

Let’s illustrate an example where we replace the default text classifier with a custom one. This custom classifier identifies which elements match our custom element description:

[9]:
from sec_parser.processing_steps import TextClassifier


# Create a custom element class
class MyElement(sp.TextElement):
    pass


# Create a custom parsing step
class MyClassifier(TextClassifier):
    def _process_element(self, element, context):
        if element.text != "":
            return MyElement.create_from_element(element, log_origin="MyClassifier")

        # Let the parent class handle the other cases
        return super()._process_element(element, context)


# Replace the default text parsing step with our custom one
steps = [MyClassifier() if isinstance(step, TextClassifier) else step for step in steps]
for i, step in enumerate(steps, 1):
    print(f"Step {i}: {step.__class__.__name__}")
Step 1: IndividualSemanticElementExtractor
Step 2: ImageClassifier
Step 3: EmptyElementClassifier
Step 4: TableClassifier
Step 5: TableOfContentsClassifier
Step 6: TopSectionManagerFor10Q
Step 7: IntroductorySectionElementClassifier
Step 8: MyClassifier
Step 9: HighlightedTextClassifier
Step 10: SupplementaryTextClassifier
Step 11: PageHeaderClassifier
Step 12: PageNumberClassifier
Step 13: TitleClassifier
Step 14: TextElementMerger

As demonstrated above, our custom classifier is now integrated into the pipeline.

There’s an additional caveat to consider. Without specifying an “allowlist” of types to process, a TableElement would be re-classified as a text element, since it also contains text. To prevent this, we will process only NotYetClassifiedElement types and bypass processing for all other types.
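The allowlist idea itself is simple: a step transforms only elements whose type it has been told to process and passes everything else through unchanged. A pure-Python sketch of that behavior (an illustration, not the actual sec-parser implementation):

```python
# Illustration of the allowlist behavior described above: transform only
# elements whose type is in types_to_process; pass all others through.
def apply_with_allowlist(elements, types_to_process, transform):
    return [
        transform(e) if isinstance(e, tuple(types_to_process)) else e
        for e in elements
    ]
```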

[10]:
def get_steps():
    return [
        (
            MyClassifier(types_to_process={sp.NotYetClassifiedElement})
            if isinstance(step, TextClassifier)
            else step
        )
        for step in sp.Edgar10QParser().get_default_steps()
    ]


elements = sp.Edgar10QParser(get_steps).parse(html)
tree = sp.TreeBuilder().build(elements)
section = [n for n in tree.nodes if n.text.startswith("Segment")][0]
print("\n".join(sp.render(section).split("\n")[:13]), "...", sep="\n")
TitleElement: Segment Operating Performance
├── MyElement: The following table shows net s...31, 2022 (dollars in millions):
├── TableElement: Table with ~7 rows, ~20 numbers, and 264 characters.
├── TitleElement: Americas
│   └── TextElement: Americas net sales increased 2%...ring the first quarter of 2024.
├── TitleElement: Greater China
│   └── MyElement: Greater China net sales decreas...ring the first quarter of 2024.
├── TitleElement: Japan
│   └── MyElement: Japan net sales increased 15% o...ring the first quarter of 2024.
└── TitleElement: Rest of Asia Pacific
    └── MyElement: Rest of Asia Pacific net sales ...earables, Home and Accessories.
...

For more examples and advanced usage, you can continue learning how to use sec-parser by referring to the Developer Guide and Documentation. If you’re interested in contributing, consider checking out our Contribution Guide.

What’s Next?

You’ve successfully parsed an SEC document into semantic elements and arranged them into a tree structure. To further analyze this data with analytics or AI, you can use any tool of your choice.

For a tailored experience, consider using our free and open-source library for AI-powered financial analysis:

Explore sec-ai on GitHub

pip install sec-ai