{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden" }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This will auto-format your code. You can optionally install 'jupyter-black' using pip.\n", "# Note: this cell is hidden from the HTML output. Read more: https://nbsphinx.readthedocs.io/en/0.2.1/hidden-cells.html\n", "try:\n", " import jupyter_black\n", "\n", " jupyter_black.load()\n", "except ImportError:\n", " pass" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# User Guide: Quick Start" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to the User Guide for `sec-parser`! This guide is designed to walk you through the fundamental steps needed to install and use the library for parsing SEC EDGAR HTML documents into semantic elements and trees. Whether you're a financial analyst, a data scientist, or someone interested in SEC filings, this guide provides examples and code snippets to help you get started.\n", "\n", "This guide is interactive, allowing you to engage with the code and concepts as you learn. You can run and modify all the code examples shown here for yourself by cloning the repository and running the [user_guide.ipynb](https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/user_guide.ipynb) in a Jupyter notebook.\n", "\n", "Alternatively, you can also run the notebook directly in your browser using Google Colab:\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/user_guide.ipynb)\n", "[![My Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/alphanome-ai/sec-parser/main?filepath=docs/source/notebooks/user_guide.ipynb)\n", "[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/kernels/welcome?src=https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/user_guide.ipynb)\n", "[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/user_guide.ipynb)\n", "\n", "Let's get started!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This guide will walk you through the process of installing the `sec-parser` package and using it to extract the \"Segment Operating Performance\" section as a semantic tree from the latest Apple 10-Q filing." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Installation" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "First, install the `sec-parser` package using pip:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "try:\n", " import sec_parser\n", "except ImportError:\n", " !pip install -q sec-parser\n", " import sec_parser" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In order to run the example code in this Guide, you'll also need the `sec_downloader` package:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "try:\n", " import sec_downloader\n", "except ImportError:\n", " !pip install -q sec-downloader\n", " import sec_downloader" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Usage" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Once you've installed the necessary packages, you can start by downloading the filing from the SEC EDGAR website. Here's how you can do it:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sec_downloader import Downloader\n", "\n", "# Initialize the downloader with your company name and email\n", "dl = Downloader(\"MyCompanyName\", \"email@example.com\")\n", "\n", "# Download the latest 10-Q filing for Apple\n", "html = dl.get_filing_html(ticker=\"AAPL\", form=\"10-Q\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> [!NOTE]\n", "The company name and email address are used to form a user-agent string that adheres to the SEC EDGAR's fair access policy for programmatic downloading. [Source](https://www.sec.gov/os/webmaster-faq#code-support)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can parse the filing HTML into a list of semantic elements:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Utility function to make the example code a bit more compact\n", "def print_first_n_lines(text: str, *, n: int):\n", " print(\"\\n\".join(text.split(\"\\n\")[:n]), \"...\", sep=\"\\n\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;34mTopSectionTitle\u001b[0m: PART I — FINANCIAL INFORMATION\n", "\u001b[1;34mTopSectionTitle\u001b[0m: Item 1.    Financial Statements\n", "\u001b[1;34mTitleElement\u001b[0m: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)\n", "\u001b[1;34mSupplementaryText\u001b[0m: (In millions, except number of ...ousands, and per-share amounts)\n", "\u001b[1;34mTableElement\u001b[0m: Table with ~24 rows, ~80 numbers, and 1058 characters.\n", "\u001b[1;34mSupplementaryText\u001b[0m: See accompanying Notes to Conde...solidated Financial Statements.\n", "\u001b[1;34mTitleElement\u001b[0m: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)\n", "...\n" ] } ], "source": [ "import sec_parser as sp\n", "\n", "elements: list = sp.Edgar10QParser().parse(html)\n", "\n", "demo_output: str = sp.render(elements)\n", "print_first_n_lines(demo_output, n=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also construct a semantic tree to allow for easy filtering by parent sections:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;34mTopSectionTitle\u001b[0m: PART I — FINANCIAL INFORMATION\n", "├── \u001b[1;34mTopSectionTitle\u001b[0m: Item 1.    Financial Statements\n", "│ ├── \u001b[1;34mTitleElement\u001b[0m: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)\n", "│ │ ├── \u001b[1;34mSupplementaryText\u001b[0m: (In millions, except number of ...ousands, and per-share amounts)\n", "│ │ ├── \u001b[1;34mTableElement\u001b[0m: Table with ~24 rows, ~80 numbers, and 1058 characters.\n", "│ │ └── \u001b[1;34mSupplementaryText\u001b[0m: See accompanying Notes to Conde...solidated Financial Statements.\n", "│ ├── \u001b[1;34mTitleElement\u001b[0m: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)\n", "...\n" ] } ], "source": [ "tree = sp.TreeBuilder().build(elements)\n", "\n", "demo_output: str = sp.render(tree)\n", "print_first_n_lines(demo_output, n=7)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced Usage" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Processing is organized in steps. You can modify, add, remove steps as needed. Each step is a function that takes a list of elements as input and returns a list of elements as output. The output of one step is the input of the next step." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Step 1: IndividualSemanticElementExtractor\n", "Step 2: ImageClassifier\n", "Step 3: EmptyElementClassifier\n", "Step 4: TableClassifier\n", "Step 5: TableOfContentsClassifier\n", "Step 6: TopSectionManagerFor10Q\n", "Step 7: IntroductorySectionElementClassifier\n", "Step 8: TextClassifier\n", "Step 9: HighlightedTextClassifier\n", "Step 10: SupplementaryTextClassifier\n", "Step 11: PageHeaderClassifier\n", "Step 12: PageNumberClassifier\n", "Step 13: TitleClassifier\n", "Step 14: TextElementMerger\n" ] } ], "source": [ "steps = sp.Edgar10QParser().get_default_steps()\n", "\n", "for i, step in enumerate(steps, 1):\n", " print(f\"Step {i}: {step.__class__.__name__}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's illustrate an example where we replace the text element classifier with our custom classifier. This custom classifier is designed to identify, which elements match our custom element description:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Step 1: IndividualSemanticElementExtractor\n", "Step 2: ImageClassifier\n", "Step 3: EmptyElementClassifier\n", "Step 4: TableClassifier\n", "Step 5: TableOfContentsClassifier\n", "Step 6: TopSectionManagerFor10Q\n", "Step 7: IntroductorySectionElementClassifier\n", "Step 8: MyClassifier\n", "Step 9: HighlightedTextClassifier\n", "Step 10: SupplementaryTextClassifier\n", "Step 11: PageHeaderClassifier\n", "Step 12: PageNumberClassifier\n", "Step 13: TitleClassifier\n", "Step 14: TextElementMerger\n" ] } ], "source": [ "from sec_parser.processing_steps import TextClassifier\n", "\n", "\n", "# Create a custom element class\n", "class MyElement(sp.TextElement):\n", " pass\n", "\n", "\n", "# Create a custom parsing step\n", "class MyClassifier(TextClassifier):\n", " def _process_element(self, element, context):\n", " if element.text != \"\":\n", " return MyElement.create_from_element(element, log_origin=\"MyClassifier\")\n", "\n", " # Let the parent class handle the other cases\n", " return super()._process_element(element, context)\n", "\n", "\n", "# Replace the default text parsing step with our custom one\n", "steps = [MyClassifier() if isinstance(step, TextClassifier) else step for step in steps]\n", "for i, step in enumerate(steps, 1):\n", " print(f\"Step {i}: {step.__class__.__name__}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As demonstrated above, our custom classifier is now integrated into the pipeline. \n", "\n", "There's an additional caveat to consider. Without specifying an \"allowlist\" of types, TableElement will be classified as TextElement, as it contains text. To prevent this, we will process only `NotYetClassifiedElement` types and bypass processing for all other types.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;34mTitleElement\u001b[0m: Segment Operating Performance\n", "├── \u001b[1;34mMyElement\u001b[0m: The following table shows net s... 1, 2023 (dollars in millions):\n", "├── \u001b[1;34mTableElement\u001b[0m: Table with ~7 rows, ~39 numbers, and 408 characters.\n", "├── \u001b[1;34mTitleElement\u001b[0m: Americas\n", "│ └── \u001b[1;34mMyElement\u001b[0m: Americas net sales increased du... the first nine months of 2024.\n", "├── \u001b[1;34mTitleElement\u001b[0m: Europe\n", "│ └── \u001b[1;34mMyElement\u001b[0m: Europe net sales increased duri...earables, Home and Accessories.\n", "├── \u001b[1;34mTitleElement\u001b[0m: Greater China\n", "│ └── \u001b[1;34mMyElement\u001b[0m: Greater China net sales decreas... and first nine months of 2024.\n", "├── \u001b[1;34mTitleElement\u001b[0m: Japan\n", "│ └── \u001b[1;34mMyElement\u001b[0m: Japan net sales increased durin... and first nine months of 2024.\n", "└── \u001b[1;34mTitleElement\u001b[0m: Rest of Asia Pacific\n", " └── \u001b[1;34mMyElement\u001b[0m: Rest of Asia Pacific net sales ... and first nine months of 2024.\n", "...\n" ] } ], "source": [ "def get_steps():\n", " return [\n", " (\n", " MyClassifier(types_to_process={sp.NotYetClassifiedElement})\n", " if isinstance(step, TextClassifier)\n", " else step\n", " )\n", " for step in sp.Edgar10QParser().get_default_steps()\n", " ]\n", "\n", "\n", "elements = sp.Edgar10QParser(get_steps).parse(html)\n", "tree = sp.TreeBuilder().build(elements)\n", "section = [n for n in tree.nodes if n.text.startswith(\"Segment\")][0]\n", "print(\"\\n\".join(sp.render(section).split(\"\\n\")[:13]), \"...\", sep=\"\\n\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For more examples and advanced usage, you can continue learning how to use `sec-parser` by referring to the [**Developer Guide**](https://sec-parser.readthedocs.io/en/latest/notebooks/developer_guide.html) and [**Documentation**](https://sec-parser.rtfd.io). If you're interested in contributing, consider checking out our [**Contribution Guide**](https://github.com/alphanome-ai/sec-parser/blob/main/CONTRIBUTING.md)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## What's Next?\n", "\n", "You've successfully parsed an SEC document into semantic elements and arranged them into a tree structure. To further analyze this data with analytics or AI, you can use any tool of your choice.\n", "\n", "For a tailored experience, consider using our free and open-source library for AI-powered financial analysis: \n", "\n", "[**Explore sec-ai on GitHub**](https://github.com/alphanome-ai/sec-ai)\n", "\n", "```bash\n", "pip install sec-ai\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }