<b>Financial Overview</b>\n",
"<p>The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.</p>\n",
"<div>\n",
" <b>Strategies of Investment</b>\n",
" <p>Investment strategies</font> are plans that guide investors to choose <font color="green" style="color:green">the best investment opportunities</font> that align with their financial goals.</p>\n",
" <img src="https://en.wikipedia.org/static/images/icons/wikipedia.png" width="20" height="20">\n",
"</div>\n",
"The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.
\n", "Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.
\n", "The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.
\n", "Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.
\n", "The financial sector is a category of the economy made up of firms that provide financial services to commercial and retail customers.
\n", "Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.
\n", "Investment strategies are plans that guide investors to choose the best investment opportunities that align with their financial goals.
\n", "(text: The financ...)\n", "NotYetClassifiedElement
(text: The financ...)\n", "TitleElement[L0] (text: Strategies...)\n", "TextElement
(text: Investment...)\n",
"ImageElement\n"
]
}
],
"source": [
"from sec_parser import ImageElement\n",
"\n",
"expected_elements: list[AbstractSemanticElement] = [\n",
"    TitleElement(tags[0]),\n",
"    TextElement(tags[1]),\n",
"    TitleElement(tags[2].get_children()[0]),\n",
"    TextElement(tags[2].get_children()[1]),\n",
"    ImageElement(tags[2].get_children()[2]),\n",
"]\n",
"show(expected_elements)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding the Parsing Process"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Responsible for parsing semantic elements from HTML documents.\n",
" It takes raw HTML and turns it into a list of objects\n",
" representing semantic elements.\n",
"\n",
" At a High Level:\n",
" ==================\n",
" 1. Extract top-level HTML tags from the document.\n",
" 2. Transform these tags into a list of more specific semantic\n",
" elements step-by-step.\n",
"\n",
" Why Focus on Top-Level Tags?\n",
" ============================\n",
" SEC filings usually have a flat HTML structure, which simplifies the\n",
" parsing process. Each top-level HTML tag often directly corresponds\n",
" to a single semantic element. This is different from many websites\n",
" where HTML tags are nested deeply,requiring more complex parsing.\n",
"\n",
" For Advanced Users:\n",
" ====================\n",
" The parsing process is implemented as a sequence of steps and allows for\n",
" customization at each step.\n",
"\n",
" - Pipeline Pattern: Raw HTML tags are processed in a sequential manner.\n",
" The steps follow an ordered, step-by-step approach, akin to a Finite\n",
" State Machine (FSM). Each element transitions through various states\n",
" defined by the sequence of processing steps.\n",
"\n",
" - Strategy Pattern: Each step is customizable. You can either replace,\n",
" remove, or extend any of the existing steps with your own or\n",
" inherited implementation. Alternatively, you can replace the entire pipeline\n",
" with your own process.\n",
" \n"
]
}
],
"source": [
"from sec_parser.processing_engine import AbstractSemanticElementParser\n",
"\n",
"print(AbstractSemanticElementParser.__doc__)"
]
},
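{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline idea from the docstring above can be sketched in plain Python. This is only an illustration under stated assumptions, not how `sec-parser` is implemented: `Element`, `classify_titles`, `classify_rest_as_text`, and `run_pipeline` are hypothetical names invented for this cell. The sketch shows why step order matters when each step consumes the previous step's output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical stand-ins for illustration only, not the sec_parser API.\n",
"from dataclasses import dataclass\n",
"\n",
"\n",
"@dataclass(frozen=True)\n",
"class Element:\n",
"    tag: str\n",
"    kind: str = 'NotYetClassified'\n",
"\n",
"\n",
"def classify_titles(elements):\n",
"    # Turn every still-unclassified <b> tag into a title.\n",
"    return [\n",
"        Element(e.tag, 'Title') if e.tag == 'b' and e.kind == 'NotYetClassified' else e\n",
"        for e in elements\n",
"    ]\n",
"\n",
"\n",
"def classify_rest_as_text(elements):\n",
"    # Treat everything that is still unclassified as plain text.\n",
"    return [\n",
"        Element(e.tag, 'Text') if e.kind == 'NotYetClassified' else e\n",
"        for e in elements\n",
"    ]\n",
"\n",
"\n",
"def run_pipeline(elements, steps):\n",
"    # Each step receives the output of the step before it.\n",
"    for step in steps:\n",
"        elements = step(elements)\n",
"    return elements\n",
"\n",
"\n",
"tags = [Element('b'), Element('p')]\n",
"in_order = run_pipeline(tags, [classify_titles, classify_rest_as_text])\n",
"reversed_order = run_pipeline(tags, [classify_rest_as_text, classify_titles])\n",
"print([e.kind for e in in_order])\n",
"print([e.kind for e in reversed_order])"
]
},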
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"***Let's apply this to our example:***"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Processing is organized in steps. If there are no steps, there will be no processing:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NotYetClassifiedElement (text: Financial ...)\n",
"NotYetClassifiedElement (text: The financ...)\n",
"NotYetClassifiedElement (text: The financ...)\n",
"NotYetClassifiedElement (text: The financ...)\n",
"NotYetClassifiedElement (text: The financ...)\n",
"CompositeSemanticElement (text: The financ...)\n",
"NotYetClassifiedElement (text: Strategies...)\n",
"NotYetClassifiedElement (text: Investment...)\n",
"NotYetClassifiedElement\n"
]
}
],
"source": [
"elements = parser.parse(html)\n",
"show(elements)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now combine the steps. One step's output is the next step's input, so the order matters:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MyClassifier: Successfully processed 3 tags!\n",
"\n",
"TitleElement[L0] (text: Financial ...)\n",
"TextElement (text: The financ...)\n",
"NotYetClassifiedElement (text: Strategies...)\n",
"NotYetClassifiedElement (text: Investment...)\n",
"NotYetClassifiedElement\n"
]
}
],
"source": [
"def get_steps():\n",
"    return [\n",
"        CompositeElementIdentificationStep(),\n",
"        MyClassifier(),\n",
"    ]\n",
"\n",
"\n",
"parser = Edgar10QParser(get_steps)\n",
"elements = parser.parse(html)\n",
"show(elements)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the inner elements of `CompositeSemanticElement` were not processed, because composite elements require special handling. A simple way to handle them is to inherit from `AbstractElementwiseProcessingStep`:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"from sec_parser.processing_steps import AbstractElementwiseProcessingStep\n",
"\n",
"\n",
"class BetterClassifier(AbstractElementwiseProcessingStep):\n",
"    # Called for each element in turn, including the elements nested\n",
"    # inside a CompositeSemanticElement.\n",
"    def _process_element(self, element, context):\n",
"        if element.html_tag.name == \"b\":\n",
"            return TitleElement.create_from_element(element, \"\")\n",
"        elif element.html_tag.name == \"p\":\n",
"            return TextElement.create_from_element(element, \"\")\n",
"        elif element.html_tag.name == \"img\":\n",
"            return ImageElement.create_from_element(element, \"\")\n",
"        # Leave every other tag unchanged.\n",
"        return element"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TitleElement[L0] (text: Financial ...)\n",
"TextElement (text: The financ...)\n",
"TitleElement[L0] (text: Strategies...)\n",
"TextElement (text: Investment...)\n",
"ImageElement\n"
]
}
],
"source": [
"def get_steps():\n",
"    return [\n",
"        CompositeElementIdentificationStep(),\n",
"        BetterClassifier(),\n",
"    ]\n",
"\n",
"\n",
"parser = Edgar10QParser(get_steps)\n",
"elements = parser.parse(html)\n",
"show(elements)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The HTML parsing is now complete, since the result matches the elements we expected:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TitleElement[L0] (text: Financial ...)\n",
"TextElement (text: The financ...)\n",
"TitleElement[L0] (text: Strategies...)\n",
"TextElement (text: Investment...)\n",
"ImageElement\n"
]
}
],
"source": [
"show(expected_elements)"
]
},
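{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why container elements need special handling, here is a minimal standalone sketch of the difference between processing only the top level and processing element by element. It does not use `sec-parser`: `Node`, `KIND_BY_TAG`, `classify`, `process_top_level_only`, and `process_elementwise` are hypothetical names, and returning one flat list is a simplification chosen for this sketch rather than a statement about the library's internals:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical stand-ins for illustration only, not the sec_parser API.\n",
"from dataclasses import dataclass, field\n",
"\n",
"\n",
"@dataclass\n",
"class Node:\n",
"    tag: str\n",
"    kind: str = 'NotYetClassified'\n",
"    inner: list = field(default_factory=list)\n",
"\n",
"\n",
"KIND_BY_TAG = {'b': 'Title', 'p': 'Text', 'img': 'Image'}\n",
"\n",
"\n",
"def classify(node):\n",
"    # Unknown tags keep whatever kind they already had.\n",
"    return Node(node.tag, KIND_BY_TAG.get(node.tag, node.kind))\n",
"\n",
"\n",
"def process_top_level_only(nodes):\n",
"    # Containers are passed through untouched, so their inner\n",
"    # nodes are never classified.\n",
"    return [n if n.inner else classify(n) for n in nodes]\n",
"\n",
"\n",
"def process_elementwise(nodes):\n",
"    # Descend into containers and classify every inner node too.\n",
"    result = []\n",
"    for n in nodes:\n",
"        if n.inner:\n",
"            result.extend(process_elementwise(n.inner))\n",
"        else:\n",
"            result.append(classify(n))\n",
"    return result\n",
"\n",
"\n",
"document = [\n",
"    Node('b'),\n",
"    Node('p'),\n",
"    Node('div', 'Composite', [Node('b'), Node('p'), Node('img')]),\n",
"]\n",
"print([n.kind for n in process_top_level_only(document)])\n",
"print([n.kind for n in process_elementwise(document)])"
]
},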
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to Semantic Trees"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Builds a semantic tree from a list of semantic elements.\n",
"\n",
" Why Use a Tree Structure?\n",
" =========================\n",
" Using a tree data structure allows for easier and more robust filtering of sections.\n",
" With a tree, you can select specific branches to filter, making it straightforward\n",
" to identify section boundaries. This approach is more maintainable and robust\n",
" compared to attempting the same operations on a flat list of elements.\n",
"\n",
" Overview:\n",
" =========\n",
" 1. Takes a list of semantic elements.\n",
" 2. Applies nesting rules to these elements.\n",
"\n",
" Customization:\n",
" ==============\n",
" The nesting process is customizable through a list of rules. These rules determine\n",
" how new elements should be nested under existing ones.\n",
"\n",
" Advanced Customization:\n",
" =======================\n",
" You can supply your own set of rules by providing a callable to `get_rules`, which\n",
" should return a list of `AbstractNestingRule` instances.\n",
" \n"
]
}
],
"source": [
"from sec_parser.semantic_tree import TreeBuilder\n",
"\n",
"print(TreeBuilder.__doc__)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"***Let's apply this to our example:***"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Tree building follows a similar pattern. Instead of a list of processing steps, `TreeBuilder` takes a callable that returns a list of nesting rules:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1;34mTitleElement\u001b[0m: Financial Overview\n",
"└── \u001b[1;34mTextElement\u001b[0m: The financial sector is a categ...ommercial and retail customers.\n",
"\u001b[1;34mTitleElement\u001b[0m: Strategies of Investment\n",
"├── \u001b[1;34mTextElement\u001b[0m: Investment strategies are plans...ign with their financial goals.\n",
"└── \u001b[1;34mImageElement\u001b[0m\n"
]
}
],
"source": [
"from sec_parser.semantic_tree import AlwaysNestAsParentRule, AbstractNestingRule, render\n",
"\n",
"\n",
"def get_rules() -> list[AbstractNestingRule]:\n",
"    return [\n",
"        AlwaysNestAsParentRule(TitleElement),\n",
"    ]\n",
"\n",
"\n",
"builder = TreeBuilder(get_rules)\n",
"tree = builder.build(elements)\n",
"print(render(list(tree)))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1;34mTitleElement\u001b[0m: Financial Overview\n",
"└── \u001b[1;34mTextElement\u001b[0m: The financial sector is a categ...ommercial and retail customers.\n"
]
}
],
"source": [
"print(render(list(tree)[0]))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1;34mTitleElement\u001b[0m: Strategies of Investment\n",
"├── \u001b[1;34mTextElement\u001b[0m: Investment strategies are plans...ign with their financial goals.\n",
"└── \u001b[1;34mImageElement\u001b[0m\n"
]
}
],
"source": [
"print(render(list(tree)[1]))"
]
},
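{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The nesting shown in the rendered trees above can be approximated with a short standalone sketch. It assumes a single rule of the form \"elements of one kind always become parents\" and is not the `TreeBuilder` implementation: `TreeNode` and `build_tree` are hypothetical names, and the shortened texts in `flat` are placeholders:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical stand-ins for illustration only, not the sec_parser API.\n",
"from dataclasses import dataclass, field\n",
"\n",
"\n",
"@dataclass\n",
"class TreeNode:\n",
"    kind: str\n",
"    text: str = ''\n",
"    children: list = field(default_factory=list)\n",
"\n",
"\n",
"def build_tree(elements, parent_kind='Title'):\n",
"    # Every element of `parent_kind` starts a new branch. All other\n",
"    # elements attach to the most recent branch, if there is one.\n",
"    roots = []\n",
"    current_parent = None\n",
"    for kind, text in elements:\n",
"        node = TreeNode(kind, text)\n",
"        if kind == parent_kind:\n",
"            roots.append(node)\n",
"            current_parent = node\n",
"        elif current_parent is not None:\n",
"            current_parent.children.append(node)\n",
"        else:\n",
"            # No parent seen yet, so the element stays at the top level.\n",
"            roots.append(node)\n",
"    return roots\n",
"\n",
"\n",
"flat = [\n",
"    ('Title', 'Financial Overview'),\n",
"    ('Text', 'The financial sector ...'),\n",
"    ('Title', 'Strategies of Investment'),\n",
"    ('Text', 'Investment strategies ...'),\n",
"    ('Image', ''),\n",
"]\n",
"for root in build_tree(flat):\n",
"    print(root.kind, root.text, [child.kind for child in root.children])"
]
},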
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For further understanding of `sec-parser`, refer to the [**Documentation**](https://sec-parser.rtfd.io). If you're interested in contributing, consider checking out our [**Contribution Guide**](https://github.com/alphanome-ai/sec-parser/blob/main/CONTRIBUTING.md)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}