{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden" }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This will auto-format your code. You can optionally install 'jupyter-black' using pip.\n", "# Note: this cell is hidden from the HTML output. Read more: https://nbsphinx.readthedocs.io/en/0.2.1/hidden-cells.html\n", "try:\n", "    import jupyter_black\n", "    jupyter_black.load()\n", "except ImportError:\n", "    pass" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Developer Guide: Comprehensive Overview" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to the Comprehensive Developer Guide for `sec-parser`. This guide is designed to provide an in-depth understanding of the `sec-parser` project, whether you're a new developer looking to contribute, or an experienced one seeking to leverage its capabilities. We'll walk you through the codebase, explaining key components and their interactions, and provide examples to help you get started. \n", "\n", "This guide is interactive, allowing you to engage with the code and concepts as you learn. You can run and modify all the code examples shown here for yourself by cloning the repository and running the [developer_guide.ipynb](https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/developer_guide.ipynb) in a Jupyter notebook. 
\n", "\n", "Alternatively, you can run the notebook directly in your browser using cloud-based Jupyter environments:\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/developer_guide.ipynb)\n", "[![My Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/alphanome-ai/sec-parser/main?filepath=docs/source/notebooks/developer_guide.ipynb)\n", "[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/kernels/welcome?src=https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/developer_guide.ipynb)\n", "[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/https://github.com/alphanome-ai/sec-parser/blob/main/docs/source/notebooks/developer_guide.ipynb)\n", "\n", "Let's dive in!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Environment Setup" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "To run the example code in this guide, you'll need the `sec_parser` package:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "try:\n", "    import sec_parser\n", "except ImportError:\n", "    !pip install -q sec-parser\n", "    import sec_parser" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Working with a Simplified Example\n", "\n", "It is easier to follow along with a specific, simplified example in mind. Consider the following HTML:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<b>Financial Overview</b>\n",
       "<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>\n",
       "<div>\n",
       "    <b>Strategies of Investment</b>\n",
       "    <p>Investment strategies</font> are plans that guide investors to choose <font color="green" style="color:green">the best investment opportunities</font> that align with their financial goals.</p>\n",
       "    <img src="https://en.wikipedia.org/static/images/icons/wikipedia.png" width="20" height="20">\n",
       "</div>\n",
       "
\n" ], "text/latex": [ "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", "\\PY{n+nt}{\\PYZlt{}b}\\PY{n+nt}{\\PYZgt{}}Financial\\PY{+w}{ }Overview\\PY{n+nt}{\\PYZlt{}/b\\PYZgt{}}\n", "\\PY{n+nt}{\\PYZlt{}p}\\PY{n+nt}{\\PYZgt{}}The\\PY{+w}{ }financial\\PY{+w}{ }sector\\PY{+w}{ }is\\PY{+w}{ }a\\PY{+w}{ }category\\PY{+w}{ }of\\PY{+w}{ }the\\PY{+w}{ }economy\\PY{+w}{ }made\\PY{+w}{ }up\\PY{+w}{ }of\\PY{+w}{ }firms\\PY{+w}{ }that\\PY{+w}{ }provide\\PY{+w}{ }financial\\PY{+w}{ }services\\PY{+w}{ }to\\PY{+w}{ }commercial\\PY{+w}{ }and\\PY{+w}{ }retail\\PY{+w}{ }customers.\\PY{n+nt}{\\PYZlt{}/p\\PYZgt{}}\n", "\\PY{n+nt}{\\PYZlt{}div}\\PY{n+nt}{\\PYZgt{}}\n", "\\PY{+w}{ }\\PY{n+nt}{\\PYZlt{}b}\\PY{n+nt}{\\PYZgt{}}Strategies\\PY{+w}{ }of\\PY{+w}{ }Investment\\PY{n+nt}{\\PYZlt{}/b\\PYZgt{}}\n", "\\PY{+w}{ }\\PY{n+nt}{\\PYZlt{}p}\\PY{n+nt}{\\PYZgt{}}Investment\\PY{+w}{ }strategies\\PY{n+nt}{\\PYZlt{}/font\\PYZgt{}}\\PY{+w}{ }are\\PY{+w}{ }plans\\PY{+w}{ }that\\PY{+w}{ }guide\\PY{+w}{ }investors\\PY{+w}{ }to\\PY{+w}{ }choose\\PY{+w}{ }\\PY{n+nt}{\\PYZlt{}font}\\PY{+w}{ }\\PY{n+na}{color=}\\PY{l+s}{\\PYZdq{}green\\PYZdq{}}\\PY{+w}{ }\\PY{n+na}{style=}\\PY{l+s}{\\PYZdq{}color:green\\PYZdq{}}\\PY{n+nt}{\\PYZgt{}}the\\PY{+w}{ }best\\PY{+w}{ }investment\\PY{+w}{ }opportunities\\PY{n+nt}{\\PYZlt{}/font\\PYZgt{}}\\PY{+w}{ }that\\PY{+w}{ }align\\PY{+w}{ }with\\PY{+w}{ }their\\PY{+w}{ }financial\\PY{+w}{ }goals.\\PY{n+nt}{\\PYZlt{}/p\\PYZgt{}}\n", "\\PY{+w}{ }\\PY{n+nt}{\\PYZlt{}img}\\PY{+w}{ }\\PY{n+na}{src=}\\PY{l+s}{\\PYZdq{}https://en.wikipedia.org/static/images/icons/wikipedia.png\\PYZdq{}}\\PY{+w}{ }\\PY{n+na}{width=}\\PY{l+s}{\\PYZdq{}20\\PYZdq{}}\\PY{+w}{ }\\PY{n+na}{height=}\\PY{l+s}{\\PYZdq{}20\\PYZdq{}}\\PY{n+nt}{\\PYZgt{}}\n", "\\PY{n+nt}{\\PYZlt{}/div\\PYZgt{}}\n", "\\end{Verbatim}\n" ], "text/plain": [ "\n", "Financial Overview\n", "

The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.

\n", "
\n", " Strategies of Investment\n", "

Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.

\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "Financial Overview\n", "

The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.

\n", "
\n", " Strategies of Investment\n", "

Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.

\n", " \n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import display, HTML, Code\n", "\n", "html = \"\"\"\n", "<b>Financial Overview</b>\n", "<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>\n", "<div>\n", "    <b>Strategies of Investment</b>\n", "    <p>Investment strategies</font> are plans that guide investors to choose <font color=\"green\" style=\"color:green\">the best investment opportunities</font> that align with their financial goals.</p>\n", "    <img src=\"https://en.wikipedia.org/static/images/icons/wikipedia.png\" width=\"20\" height=\"20\">\n", "</div>\n", "\"\"\"\n", "\n", "display(Code(html))\n", "display(HTML(html))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Utilizing BeautifulSoup for Parsing\n", "Many SEC EDGAR filings are available as HTML documents. To make them easier to read, we use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (\"bs4\") library to parse an HTML document into a tree-like structure of HTML Tags (`bs4.Tag`)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***Let's apply this to our example:***" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag 0: b (text: Financial ...)\n", "Tag 1: p (text: The financ...)\n", "Tag 2: div (text: Strategies...)\n" ] } ], "source": [ "import bs4\n", "\n", "\n", "# Utility function, ignore it\n", "def get_children_tags(source) -> list[bs4.Tag]:\n", "    return [tag for tag in source.children if isinstance(tag, bs4.Tag)]\n", "\n", "\n", "# Utility function, ignore it\n", "def tag_to_string(tag):\n", "    text = tag.text.strip()\n", "    if len(text) > 0:\n", "        text = text[:10] + \"...\" if len(text) > 10 else text\n", "        return f\"{tag.name} (text: {text})\"\n", "    else:\n", "        return f\"{tag.name} (no text)\"\n", "\n", "\n", "parse_result = bs4.BeautifulSoup(html, \"lxml\").html.body\n", "bs4_tags = get_children_tags(parse_result)\n", "for i, tag in enumerate(bs4_tags):\n", "    print(f\"Tag {i}: {tag_to_string(tag)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `children` iterates only over the top-level tags. 
Children of children can be accessed by using the `children` attribute again:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tag 2 -> Tag 0: b (text: Strategies...)\n", "Tag 2 -> Tag 1: p (text: Investment...)\n", "Tag 2 -> Tag 2: img (no text)\n" ] } ], "source": [ "for i, tag in enumerate(get_children_tags(bs4_tags[2])):\n", "    print(f\"Tag 2 -> Tag {i}: {tag_to_string(tag)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding the Role of HtmlTag\n", "Instead of interacting directly with `bs4.Tag`, the SEC EDGAR HTML Parser uses `HtmlTag`, a wrapper around `bs4.Tag`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " The HtmlTag class is a wrapper for BeautifulSoup4 Tag objects.\n", "\n", " It serves three main purposes:\n", "\n", " 1. Decoupling: By abstracting the underlying BeautifulSoup4 library, we\n", " can isolate our application logic from the library specifics. This\n", " makes it easier to modify or even replace the HTML parsing library in\n", " the future without extensive codebase changes.\n", "\n", " 2. Usability: The HtmlTag class provides a convenient location to add\n", " extension methods or additional properties not offered by the native\n", " BeautifulSoup4 Tag class. This enhances the usability of the class.\n", "\n", " 3. 
Caching: The HtmlTag class also caches processing results, improving\n", " performance by avoiding unnecessary re-computation.\n", " \n", "\n", " The HtmlTagParser parses an HTML document using BeautifulSoup4.\n", " It then wraps the parsed bs4.Tag objects into HtmlTag objects.\n", " \n" ] } ], "source": [ "from sec_parser.processing_engine import HtmlTag, HtmlTagParser\n", "\n", "print(HtmlTag.__doc__)\n", "print(HtmlTagParser.__doc__)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***Let's apply this to our example:***" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "Strategies of Investment\n", "

Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.

\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bs4_div_tag = bs4_tags[2]\n", "display(HTML(str(bs4_div_tag)))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "By applying `HtmlTag` to the `bs4.Tag` object, we can now access the `HtmlTag` attributes and methods that are not available in `bs4`. For example, we can get a percentage of green text:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The fraction of text within this div that is colored green: 21%\n" ] } ], "source": [ "div_tag = HtmlTag(bs4_div_tag)\n", "percentage = div_tag.get_text_styles_metrics()[(\"color\", \"green\")]\n", "print(f\"The fraction of text within this div that is colored green: {percentage:.0f}%\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's wrap the rest of the tags in our example with `HtmlTag`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "tags = [HtmlTag(bs4_tag) for bs4_tag in bs4_tags]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Semantic Elements" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " In the domain of HTML parsing, especially in the context of SEC EDGAR documents,\n", " a semantic element refers to a meaningful unit within the document that serves a\n", " specific purpose. For example, a paragraph or a table might be considered a\n", " semantic element. 
Unlike syntactic elements, which merely exist to structure the\n", " HTML, semantic elements carry information that is vital to the understanding of the\n", " document's content.\n", "\n", " This class serves as a foundational representation of such semantic elements,\n", " containing an HtmlTag object that stores the raw HTML tag information. Subclasses\n", " will implement additional behaviors based on the type of the semantic element.\n", " \n" ] } ], "source": [ "from sec_parser.semantic_elements import AbstractSemanticElement\n", "\n", "print(AbstractSemanticElement.__doc__)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A few examples of Semantic Elements:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The TextElement class represents a standard text paragraph within a document.\n", "The TableElement class represents a standard table within a document.\n", "\n", " The TitleElement class represents the title of a paragraph or other content object.\n", " It serves as a semantic marker, providing context and structure to the document.\n", " \n", "\n", " The TopSectionTitle class represents the title and the beginning of a top-level\n", " section of a document. For instance, in SEC 10-Q reports, a\n", " top-level section could be \"Part I, Item 3. Quantitative and Qualitative\n", " Disclosures About Market Risk.\".\n", " \n", "\n", " The NotYetClassifiedElement class represents an element whose type\n", " has not yet been determined. 
The parsing process aims to\n", " classify all instances of this class into more specific\n", " subclasses of AbstractSemanticElement.\n", " \n" ] } ], "source": [ "from sec_parser.semantic_elements import (\n", " TextElement,\n", " TableElement,\n", " TitleElement,\n", " TopSectionTitle,\n", " NotYetClassifiedElement,\n", ")\n", "\n", "print(TextElement.__doc__)\n", "print(TableElement.__doc__)\n", "print(TitleElement.__doc__)\n", "print(TopSectionTitle.__doc__)\n", "print(NotYetClassifiedElement.__doc__)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To summarize, the purpose of parsing is to produce an ordered list of Semantic Elements from a tree of HTML Tags." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***Let's apply this to our example:***" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "At the beginning of parsing the example we would have the following Semantic Elements:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NotYetClassifiedElement (text: Financial ...)\n", "NotYetClassifiedElement

(text: The financ...)\n", "NotYetClassifiedElement

(text: Strategies...)\n" ] } ], "source": [ "# Utility function, ignore it\n", "def show(elements):\n", "    for element in elements:\n", "        text = element.text[:10]\n", "        if hasattr(element, \"inner_elements\"):\n", "            print(f\"{element} (has {len(element.inner_elements)} elements inside)\")\n", "        elif text:\n", "            print(f\"{element} (text: {text}...)\")\n", "        else:\n", "            print(f\"{element}\")\n", "\n", "\n", "initial_elements = [NotYetClassifiedElement(tag) for tag in tags]\n", "show(initial_elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "At the end of our parsing we expect to have the following Semantic Elements:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TitleElement[L0] (text: Financial ...)\n", "TextElement

(text: The financ...)\n", "TitleElement[L0] (text: Strategies...)\n", "TextElement

(text: Investment...)\n", "ImageElement\n" ] } ], "source": [ "from sec_parser import ImageElement\n", "\n", "expected_elements: list[AbstractSemanticElement] = [\n", " TitleElement(tags[0]),\n", " TextElement(tags[1]),\n", " TitleElement(tags[2].get_children()[0]),\n", " TextElement(tags[2].get_children()[1]),\n", " ImageElement(tags[2].get_children()[2]),\n", "]\n", "show(expected_elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding the Parsing Process" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Responsible for parsing semantic elements from HTML documents.\n", " It takes raw HTML and turns it into a list of objects\n", " representing semantic elements.\n", "\n", " At a High Level:\n", " ==================\n", " 1. Extract top-level HTML tags from the document.\n", " 2. Transform these tags into a list of more specific semantic\n", " elements step-by-step.\n", "\n", " Why Focus on Top-Level Tags?\n", " ============================\n", " SEC filings usually have a flat HTML structure, which simplifies the\n", " parsing process. Each top-level HTML tag often directly corresponds\n", " to a single semantic element. This is different from many websites\n", " where HTML tags are nested deeply,requiring more complex parsing.\n", "\n", " For Advanced Users:\n", " ====================\n", " The parsing process is implemented as a sequence of steps and allows for\n", " customization at each step.\n", "\n", " - Pipeline Pattern: Raw HTML tags are processed in a sequential manner.\n", " The steps follow an ordered, step-by-step approach, akin to a Finite\n", " State Machine (FSM). Each element transitions through various states\n", " defined by the sequence of processing steps.\n", "\n", " - Strategy Pattern: Each step is customizable. 
You can either replace,\n", " remove, or extend any of the existing steps with your own or\n", " inherited implementation. Alternatively, you can replace the entire pipeline\n", " with your own process.\n", " \n" ] } ], "source": [ "from sec_parser.processing_engine import AbstractSemanticElementParser\n", "\n", "print(AbstractSemanticElementParser.__doc__)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***Let's apply this to our example:***" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Processing is organized in steps. If there are no steps, there will be no processing:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NotYetClassifiedElement (text: Financial ...)\n", "NotYetClassifiedElement

(text: The financ...)\n", "NotYetClassifiedElement

(text: Strategies...)\n" ] } ], "source": [ "from sec_parser import Edgar10QParser\n", "\n", "\n", "def get_steps():\n", "    return []\n", "\n", "\n", "parser = Edgar10QParser(get_steps)\n", "elements = parser.parse(html)\n", "show(elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the result is exactly the same as wrapping the tags with `NotYetClassifiedElement`:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NotYetClassifiedElement (text: Financial ...)\n", "NotYetClassifiedElement

(text: The financ...)\n", "NotYetClassifiedElement

(text: Strategies...)\n" ] } ], "source": [ "show([NotYetClassifiedElement(tag) for tag in tags])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's create the first simple parsing step that naively identifies title and text tags." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MyClassifier: Successfully processed 3 tags!\n", "\n" ] } ], "source": [ "from sec_parser.processing_steps import AbstractProcessingStep\n", "\n", "\n", "class MyClassifier(AbstractProcessingStep):\n", "    def __init__(self):\n", "        super().__init__()\n", "        # You can hold state in your processing steps\n", "        self.processed_tags_count = 0\n", "\n", "    # This method must be implemented when inheriting from AbstractProcessingStep\n", "    def _process(self, elements):\n", "        parsed = []\n", "        for e in elements:\n", "            self.processed_tags_count += 1\n", "            if e.html_tag.name == \"b\":\n", "                parsed.append(TitleElement.create_from_element(e, \"\"))\n", "            elif e.html_tag.name == \"p\":\n", "                parsed.append(TextElement.create_from_element(e, \"\"))\n", "            else:\n", "                parsed.append(e)\n", "        print(\n", "            f\"MyClassifier: Successfully processed {self.processed_tags_count} tags!\\n\"\n", "        )\n", "        return parsed\n", "\n", "\n", "def get_steps() -> list[AbstractProcessingStep]:\n", "    return [MyClassifier()]\n", "\n", "\n", "parser = Edgar10QParser(get_steps)\n", "elements = parser.parse(html)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TitleElement[L0] (text: Financial ...)\n", "TextElement

(text: The financ...)\n", "NotYetClassifiedElement

(text: Strategies...)\n" ] } ], "source": [ "show(elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The third tag cannot be identified as a single Semantic Element. Let's see what we can do about it." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Multiple Semantic Elements in a Single HTML Tag" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If multiple Semantic Elements are in the same HTML tag, we first identify such cases by marking the element as a `CompositeSemanticElement`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " CompositeSemanticElement acts as a container for other semantic elements,\n", " especially for cases where a single HTML root tag wraps multiple elements.\n", " This ensures structural integrity and enables various features like\n", " semantic segmentation visualization, and debugging by comparison with the\n", " original document.\n", "\n", " Why is this useful:\n", " ===================\n", " 1. Some semantic elements, like XBRL tags (), may wrap multiple semantic\n", " elements. The container ensures that these relationships are not broken\n", " during parsing.\n", " 2. Enables the parser to fully reconstruct the original HTML document, which\n", " opens up possibilities for features like semantic segmentation visualization\n", " (e.g. 
recreate the original document but put semi-transparent colored boxes\n", " on top, based on semantic meaning), serialization of parsed documents into\n", " an augmented HTML, and debugging by comparing to the original document.\n", " \n" ] } ], "source": [ "from sec_parser.semantic_elements import CompositeSemanticElement\n", "\n", "print(CompositeSemanticElement.__doc__)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***Let's apply this to our example:***" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We start with a naive implementation of the identification step:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NotYetClassifiedElement (text: Financial ...)\n", "NotYetClassifiedElement

(text: The financ...)\n", "CompositeSemanticElement

(has 3 elements inside)\n" ] } ], "source": [ "class CompositeElementIdentificationStep(AbstractProcessingStep):\n", "    def _process(self, elements):\n", "        result = []\n", "        for e in elements:\n", "            if e.html_tag.name == \"div\":\n", "                result.append(\n", "                    CompositeSemanticElement.create_from_element(\n", "                        e,\n", "                        inner_elements=[\n", "                            NotYetClassifiedElement(t)\n", "                            for t in e.html_tag.get_children()\n", "                        ],\n", "                        log_origin=\"CompositeElementIdentificationStep\",\n", "                    )\n", "                )\n", "            else:\n", "                result.append(e)\n", "        return result\n", "\n", "\n", "parser = Edgar10QParser(lambda: [CompositeElementIdentificationStep()])\n", "elements = parser.parse(html, unwrap_elements=False)\n", "show(elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We have successfully identified the tag as a `CompositeSemanticElement`.\n", "\n", "However, `CompositeSemanticElement` is intended for more advanced use cases. Normally we won't even notice it (we had to set the `unwrap_elements` flag to `False` to see it):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NotYetClassifiedElement (text: Financial ...)\n", "NotYetClassifiedElement

(text: The financ...)\n", "NotYetClassifiedElement (text: Strategies...)\n", "NotYetClassifiedElement

(text: Investment...)\n", "NotYetClassifiedElement\n" ] } ], "source": [ "elements = parser.parse(html)\n", "show(elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can now combine the steps. One step's output is the next step's input, so the order matters:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MyClassifier: Successfully processed 3 tags!\n", "\n", "TitleElement[L0] (text: Financial ...)\n", "TextElement

(text: The financ...)\n", "NotYetClassifiedElement (text: Strategies...)\n", "NotYetClassifiedElement

(text: Investment...)\n", "NotYetClassifiedElement\n" ] } ], "source": [ "def get_steps():\n", "    return [\n", "        CompositeElementIdentificationStep(),\n", "        MyClassifier(),\n", "    ]\n", "\n", "\n", "parser = Edgar10QParser(get_steps)\n", "elements = parser.parse(html)\n", "show(elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the inner elements of the `CompositeSemanticElement` did not get processed. This is because composite elements require special handling. A simple way to handle them is to inherit from `AbstractElementwiseProcessingStep`:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sec_parser.processing_steps import AbstractElementwiseProcessingStep\n", "\n", "\n", "class BetterClassifier(AbstractElementwiseProcessingStep):\n", "    def _process_element(self, element, context):\n", "        if element.html_tag.name == \"b\":\n", "            return TitleElement.create_from_element(element, \"\")\n", "        elif element.html_tag.name == \"p\":\n", "            return TextElement.create_from_element(element, \"\")\n", "        elif element.html_tag.name == \"img\":\n", "            return ImageElement.create_from_element(element, \"\")\n", "        return element" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TitleElement[L0] (text: Financial ...)\n", "TextElement

(text: The financ...)\n", "TitleElement[L0] (text: Strategies...)\n", "TextElement

(text: Investment...)\n", "ImageElement\n" ] } ], "source": [ "def get_steps():\n", "    return [\n", "        CompositeElementIdentificationStep(),\n", "        BetterClassifier(),\n", "    ]\n", "\n", "\n", "parser = Edgar10QParser(get_steps)\n", "elements = parser.parse(html)\n", "show(elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The HTML parsing is now complete, and the result matches the elements we expected:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TitleElement[L0] (text: Financial ...)\n", "TextElement

(text: The financ...)\n", "TitleElement[L0] (text: Strategies...)\n", "TextElement

(text: Investment...)\n", "ImageElement\n" ] } ], "source": [ "show(expected_elements)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to Semantic Trees" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Builds a semantic tree from a list of semantic elements.\n", "\n", " Why Use a Tree Structure?\n", " =========================\n", " Using a tree data structure allows for easier and more robust filtering of sections.\n", " With a tree, you can select specific branches to filter, making it straightforward\n", " to identify section boundaries. This approach is more maintainable and robust\n", " compared to attempting the same operations on a flat list of elements.\n", "\n", " Overview:\n", " =========\n", " 1. Takes a list of semantic elements.\n", " 2. Applies nesting rules to these elements.\n", "\n", " Customization:\n", " ==============\n", " The nesting process is customizable through a list of rules. 
These rules determine\n", " how new elements should be nested under existing ones.\n", "\n", " Advanced Customization:\n", " =======================\n", " You can supply your own set of rules by providing a callable to `get_rules`, which\n", " should return a list of `AbstractNestingRule` instances.\n", " \n" ] } ], "source": [ "from sec_parser.semantic_tree import TreeBuilder\n", "\n", "print(TreeBuilder.__doc__)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***Let's apply this to our example:***" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Tree building follows a pattern similar to parsing. Instead of a list of processing steps, the builder takes a callable that returns a list of nesting rules:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;34mTitleElement\u001b[0m: Financial Overview\n", "└── \u001b[1;34mTextElement\u001b[0m: The financial sector is a categ...ommercial and retail customers.\n", "\u001b[1;34mTitleElement\u001b[0m: Strategies of Investment\n", "├── \u001b[1;34mTextElement\u001b[0m: Investment strategies are plans...ign with their financial goals.\n", "└── \u001b[1;34mImageElement\u001b[0m\n" ] } ], "source": [ "from sec_parser.semantic_tree import AlwaysNestAsParentRule, AbstractNestingRule, render\n", "\n", "\n", "def get_rules() -> list[AbstractNestingRule]:\n", "    return [\n", "        AlwaysNestAsParentRule(TitleElement),\n", "    ]\n", "\n", "\n", "builder = TreeBuilder(get_rules)\n", "tree = builder.build(elements)\n", "print(render(list(tree)))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;34mTitleElement\u001b[0m: Financial Overview\n", "└── \u001b[1;34mTextElement\u001b[0m: The financial sector is a categ...ommercial and retail customers.\n" ] } ], "source": [ "print(render(list(tree)[0]))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "\u001b[1;34mTitleElement\u001b[0m: Strategies of Investment\n", "├── \u001b[1;34mTextElement\u001b[0m: Investment strategies are plans...ign with their financial goals.\n", "└── \u001b[1;34mImageElement\u001b[0m\n" ] } ], "source": [ "print(render(list(tree)[1]))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For further understanding of `sec-parser`, refer to the [**Documentation**](https://sec-parser.rtfd.io). If you're interested in contributing, consider checking out our [**Contribution Guide**](https://github.com/alphanome-ai/sec-parser/blob/main/CONTRIBUTING.md)." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }