Methodology
Introduction
Parsifex is a platform for extracting, classifying, and analysing risk factor disclosures from US public company 10-K filings. It combines automated text extraction from SEC filings with a machine-learning topic model and a suite of natural language processing (NLP) tools to produce structured, analysis-ready datasets.
The platform is built around three core capabilities that work together in a pipeline:
- Extraction. The filing parser reads the HTML source of a 10-K filing as published on EDGAR, locates the Item 1A (Risk Factors) section, and splits it into individual risk factors. It also extracts metadata such as the company name, fiscal year end date, and CIK (Central Index Key).
- Classification. Each extracted risk factor is fed into a topic model that assigns it to a risk topic (e.g. Regulatory & Compliance, Cybersecurity & Technology) along with a confidence score and a set of representative keywords.
- Linguistic analysis. A suite of NLP metrics is computed for each risk factor, covering sentiment, readability, verb tense orientation, word count, and named entity recognition. These attributes allow researchers and analysts to quantify the language and tone of corporate risk disclosures.
This documentation explains what each analytical component does, how to interpret its output, and how to use each of the platform’s three services: the Risk Classifier, the Report Processor, and the Bulk Data Download.
Topic Modelling
Parsifex uses BERTopic, a transformer-based topic modelling framework, to classify risk factors into topics. Unlike traditional bag-of-words models (e.g. LDA), BERTopic leverages contextual sentence embeddings to capture the semantic meaning of text, then clusters those embeddings and extracts representative terms for each cluster.
The model was trained on a large corpus of risk factor texts extracted from US 10-K filings. The training pipeline followed these steps:
- Embedding. Each risk factor text is cleaned, tokenised, and then encoded into a dense vector using a sentence-transformer model. These embeddings capture the semantic meaning of each risk factor, not just surface-level word overlap.
- Dimensionality reduction. The high-dimensional embedding space is compressed using UMAP so that semantically similar risk factors are placed close together in a lower-dimensional space.
- Clustering. HDBSCAN groups the reduced embeddings into clusters. Each cluster becomes a candidate topic.
- Topic labelling. Each cluster is described using c-TF-IDF (class-based TF-IDF) to extract its most representative keywords. An LLM is then used to generate a human-readable label for each topic based on those keywords and representative documents.
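To illustrate the keyword-extraction part of the topic labelling step, the following is a minimal pure-Python sketch of class-based TF-IDF. It is not the platform's code: the toy clusters, the whitespace tokeniser, and the function names c_tf_idf and top_keywords are assumptions made for illustration. Each cluster is treated as one large document, and a term scores highly when it is frequent inside the cluster but rare across all clusters.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Return {topic: {term: score}} using a simplified class-based TF-IDF."""
    # Merge each topic's documents into one bag of lowercase tokens.
    term_counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of each term across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # Average number of tokens per topic.
    avg_tokens = sum(total_counts.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_tokens = sum(counts.values())
        scores[topic] = {
            # Within-topic frequency, weighted by rarity across topics.
            term: (freq / n_tokens) * math.log(1 + avg_tokens / total_counts[term])
            for term, freq in counts.items()
        }
    return scores


def top_keywords(scores, topic, k=3):
    """Return the k highest-scoring terms for a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy clusters (illustrative text, not taken from real filings).
toy_clusters = {
    0: ["data breach could expose customer data", "cyber attack on our data systems"],
    1: ["new regulations could increase compliance costs", "failure to comply with regulations"],
}
scores = c_tf_idf(toy_clusters)
print(top_keywords(scores, 0, k=1))
print(top_keywords(scores, 1, k=1))
```

In this toy example the word "could" appears in both clusters, so it receives a lower weight than terms that are specific to one cluster.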
Linguistic Attributes
Beyond topic classification, Parsifex computes a set of linguistic attributes for each risk factor. These metrics are widely used in accounting and finance research to quantify the characteristics of corporate disclosures.
Word Count
The number of whitespace-separated tokens in the risk factor text. Longer risk factors may indicate more detailed disclosure or greater complexity.
Sentiment (VADER)
Sentiment is measured using VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool. Although it was designed for social media and other informal text, it is also effective on corporate disclosures.
VADER returns a compound score between −1 (most negative) and +1 (most positive). The score can be interpreted as follows:
- Above +0.05: positive tone
- Between −0.05 and +0.05: neutral
- Below −0.05: negative tone
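The interpretation bands above can be expressed as a small helper. This is a sketch only: the function name tone_label is hypothetical, and treating a score of exactly ±0.05 as neutral is an assumption based on the "above" and "below" wording.

```python
def tone_label(compound, threshold=0.05):
    """Map a VADER compound score to a tone label using the bands above."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie between -1 and +1")
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    # Assumption: scores exactly on a boundary count as neutral.
    return "neutral"


print(tone_label(-0.62))
print(tone_label(0.01))
```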
Readability (Gunning Fog Index)
The Gunning Fog Index estimates the number of years of formal education a reader needs to understand the text on a first reading. It is calculated based on average sentence length and the proportion of complex words (words with three or more syllables).
Typical ranges:
- 6–10: fairly easy to read (mainstream prose)
- 10–14: standard difficulty (business writing)
- 14–17: difficult (academic or legal text)
- 17+: very difficult (highly specialised or dense prose)
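For reference, the standard Gunning Fog formula is 0.4 × (average words per sentence + percentage of complex words). The sketch below applies it with deliberately crude assumptions: sentences are split on ., ! and ?, and syllables are estimated by counting vowel groups. It also skips the usual exclusions (proper nouns, common suffixes), so its scores will not match the platform's output exactly.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Gunning Fog Index: 0.4 * (words per sentence + percent complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # Complex words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)


simple = gunning_fog("The cat sat. The dog ran.")
dense = gunning_fog("Regulatory developments could materially affect our operations.")
print(round(simple, 2), round(dense, 2))
```

The second sentence scores far higher because more than half of its words have three or more syllables.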
Tense Orientation
Tense orientation measures the distribution of verb forms across three temporal categories: past, present, and future. It is computed using spaCy’s part-of-speech tagger. Risk factors with a high future-tense share tend to be forward-looking; those dominated by past tense may describe historical events or precedents.
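As a rough illustration of how tense shares could be derived from tagger output, the sketch below works on (token, tag) pairs that use Penn Treebank tags, which spaCy's English models expose as token.tag_. The mapping from tags to tenses is an assumption made for this example (VBD as past, VBP and VBZ as present, the modals "will" and "shall" as future). The platform's actual rules are not documented here.

```python
from collections import Counter

# Assumed tag-to-tense mapping (illustrative, not the platform's rules).
PAST_TAGS = {"VBD"}
PRESENT_TAGS = {"VBP", "VBZ"}
FUTURE_MODALS = {"will", "shall", "'ll"}


def tense_shares(tagged_tokens):
    """Return the share of past, present and future verb forms."""
    counts = Counter()
    for token, tag in tagged_tokens:
        if tag in PAST_TAGS:
            counts["past"] += 1
        elif tag in PRESENT_TAGS:
            counts["present"] += 1
        elif tag == "MD" and token.lower() in FUTURE_MODALS:
            counts["future"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"past": 0.0, "present": 0.0, "future": 0.0}
    return {tense: counts[tense] / total for tense in ("past", "present", "future")}


# Hand-tagged example sentence, standing in for tagger output.
tagged = [
    ("We", "PRP"), ("experienced", "VBD"), ("outages", "NNS"), ("and", "CC"),
    ("expect", "VBP"), ("that", "IN"), ("disruptions", "NNS"),
    ("will", "MD"), ("recur", "VB"), ("and", "CC"), ("will", "MD"), ("increase", "VB"),
]
shares = tense_shares(tagged)
print(shares)
```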
Named Entity Recognition (NER)
Named entity recognition identifies and counts real-world entities mentioned in the text. Parsifex uses spaCy’s pre-trained NER model to detect entity types.
Product Guides
Risk Classifier
The Risk Classifier is a single-factor analysis tool. You paste one risk factor and receive its topic classification plus all linguistic attributes in real time.
The interface includes pre-loaded examples (Regulatory, Cybersecurity, Supply chain, Interest rate, Competition) you can use to test the service.
Input
- Risk factor text: paste any risk factor disclosure from a US firm’s 10-K filing. The text must be between 100 and 5,000 characters. It should be the full text of one risk factor. Pasting multiple risk factors or a short fragment of a risk factor can result in inaccurate classification.
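If you prepare inputs programmatically before pasting them, a pre-check along these lines can catch texts outside the length limits. The function name validate_risk_factor is hypothetical, and stripping surrounding whitespace before counting is an assumption. The service may count characters differently.

```python
MIN_CHARS = 100
MAX_CHARS = 5000


def validate_risk_factor(text):
    """Return (is_valid, message) for the documented 100 to 5,000 character limit."""
    # Assumption: leading and trailing whitespace does not count.
    length = len(text.strip())
    if length < MIN_CHARS:
        return False, f"too short: {length} characters (minimum {MIN_CHARS})"
    if length > MAX_CHARS:
        return False, f"too long: {length} characters (maximum {MAX_CHARS})"
    return True, "ok"


print(validate_risk_factor("We face intense competition."))
```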
Output
When a new risk factor is submitted, the classifier returns two result panels:
Topic classification:
- Topic label: a short, human-readable description of the risk category (e.g. “Regulatory & Compliance Risk”).
- Topic ID: the numeric identifier of the assigned cluster.
- Confidence (probability): a score between 0 and 1 indicating how strongly the risk factor matches its assigned topic. Higher values indicate a better fit (see note 1, Interpreting Confidence Scores).
- Top keywords: the most representative terms for the assigned topic, extracted via c-TF-IDF.
- Similar risk factors: example risk factors from the training corpus that belong to the same topic, giving context for what the topic encompasses.
- Similar topics: other topics that are semantically close to the risk factor, each with a lower probability than the primary assignment. Useful when a risk factor touches on more than one theme.
Text attributes:
- Word count: total number of words.
- Sentiment: VADER compound score with a color-coded bar (green = positive, red = negative, grey = neutral).
- Readability: Gunning Fog Index with a gradient gauge (green = easy, red = difficult).
- Tense orientation: dominant tense with a stacked bar showing past / present / future percentages.
- Named entities: total count plus a per-type breakdown with horizontal bars.
1 Interpreting Confidence Scores
A confidence score of 0.90 means the model is highly certain the risk factor belongs to the assigned topic. Scores below 0.60 suggest the text may touch on multiple themes or may not fit neatly into any single trained topic. Parsifex reports the score transparently so users can decide their own threshold for downstream analysis.
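One way to apply your own threshold in downstream analysis is to separate confidently classified risk factors from ambiguous ones. The sketch below uses 0.60 as the default cut-off, following the guidance above. The field names risk_factor_index and topic_probability are assumed for illustration and may differ from the column names in your export.

```python
def split_by_confidence(rows, threshold=0.60):
    """Split rows into (confident, ambiguous) by topic probability."""
    confident = [row for row in rows if row["topic_probability"] >= threshold]
    ambiguous = [row for row in rows if row["topic_probability"] < threshold]
    return confident, ambiguous


# Illustrative rows, not real platform output.
rows = [
    {"risk_factor_index": 0, "topic_probability": 0.90},
    {"risk_factor_index": 1, "topic_probability": 0.45},
    {"risk_factor_index": 2, "topic_probability": 0.60},
]
confident, ambiguous = split_by_confidence(rows)
print(len(confident), len(ambiguous))
```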
Report Processor
The Report Processor handles full annual reports end to end. You upload one or more 10-K filings in HTML format, select which output attributes you want, and receive a downloadable structured file plus a visual summary.
Input
File upload: Upload 10-K filings in HTML format (.html or .htm), as downloaded directly from EDGAR. Each file can be up to 10 MB. You can drag and drop files into the upload area or click to browse.
Analysis options — select which attribute groups to compute:
Output format: choose between Excel (.xlsx), CSV, or JSON.
Output
The output file contains one row per extracted risk factor. The following columns are always present regardless of which options you select:
Bulk Data Download
The Bulk Data Download service lets you build custom queries across the full database of pre-processed filings, firms, risk factors, and topics. It is designed for researchers who need large, filtered datasets for quantitative analysis.
Start by configuring your filters, then select the columns you need and choose an output format. Before committing, click Preview to check the row count, estimated file size, and a five-row sample. When you are ready, submit the query — large jobs are processed asynchronously and can be tracked in the Activity tab. Once complete, download the zipped output file; files remain available for 24 hours. You can also save your queries for quick reuse.
Filters
All filters are optional. Leaving a filter empty means no restriction on that dimension. Available filters:
Output Columns
Columns are organized into bundles. Select the bundles relevant to your research:
- Identifiers: CIK, accession number, company name, ticker.
- Filing metadata: form type, report date, report year, filing date, filing year, filing URL, primary document URL, entity type, SIC code and descriptions, exchange, category, owner organization.
- Risk factor content: risk factor index, extraction path (see note 2), full risk factor text.
- Topic: topic ID, topic label, topic probability, topic keywords, factor count (when aggregated).
When you select only filing-level columns (no risk-factor-level content), the output is automatically aggregated to one row per filing per topic, with a factor count. When you include risk-factor-level columns (e.g. the full text), the output contains one row per individual risk factor.
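The aggregation rule can be pictured as a group-by over filing and topic. The sketch below reproduces that shape on a handful of made-up rows. The field names accession_number, topic_id, and factor_count and the placeholder filing identifiers are assumptions for illustration only.

```python
from collections import Counter


def aggregate_by_filing_topic(rows):
    """Collapse risk-factor rows to one row per filing per topic with a count."""
    counts = Counter((row["accession_number"], row["topic_id"]) for row in rows)
    return [
        {"accession_number": accession, "topic_id": topic, "factor_count": n}
        for (accession, topic), n in sorted(counts.items())
    ]


# Made-up risk-factor-level rows with placeholder filing identifiers.
factor_rows = [
    {"accession_number": "filing-A", "topic_id": 3, "text": "..."},
    {"accession_number": "filing-A", "topic_id": 3, "text": "..."},
    {"accession_number": "filing-A", "topic_id": 7, "text": "..."},
    {"accession_number": "filing-B", "topic_id": 3, "text": "..."},
]
aggregated = aggregate_by_filing_topic(factor_rows)
for row in aggregated:
    print(row)
```

Four risk-factor rows collapse to three aggregated rows because two factors in the first filing share a topic.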
Output format: choose between CSV, Parquet (for Python), or Excel (.xlsx). The downloadable file is always delivered as a zipped file.
2 The platform locates the Item 1A (Risk Factors) section in the HTML filing by searching for the standard heading pattern. Within that section, individual risk factors are split using a tiered strategy:
- Primary path: bold and italic headings. Most 10-K filings use bold+italic formatting for risk factor titles.
- Secondary path: bold-only headings. Some filers use bold text (without italic) for their risk factor titles.
- Fallback path: text-shape heuristics. When no consistent typographic structure is detected, the parser falls back to analysing text patterns to estimate factor boundaries.
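The tiered strategy can be sketched as a chain of fallbacks. Everything in this example is an assumption made for illustration: the block representation (a dict with text, bold, and italic keys), the minimum of two headings needed to accept a path, and the text-shape rule (a block of at most 12 words is treated as a heading). The real parser's heuristics are not described in this documentation.

```python
def looks_like_heading(text, max_words=12):
    """Assumed text-shape rule: short blocks are treated as headings."""
    return 0 < len(text.split()) <= max_words


def choose_path(blocks, min_headings=2):
    """Pick the first path that finds enough headings; return (path, indices)."""
    bold_italic = [i for i, b in enumerate(blocks) if b["bold"] and b["italic"]]
    if len(bold_italic) >= min_headings:
        return "bold_italic", bold_italic
    bold_only = [i for i, b in enumerate(blocks) if b["bold"] and not b["italic"]]
    if len(bold_only) >= min_headings:
        return "bold", bold_only
    shaped = [i for i, b in enumerate(blocks) if looks_like_heading(b["text"])]
    return "text_shape", shaped


def split_risk_factors(blocks):
    """Split blocks into risk factors, each running from one heading to the next."""
    path, heading_idx = choose_path(blocks)
    factors = []
    for pos, start in enumerate(heading_idx):
        end = heading_idx[pos + 1] if pos + 1 < len(heading_idx) else len(blocks)
        factors.append(" ".join(b["text"] for b in blocks[start:end]))
    return path, factors


HEADINGS = ["We face intense competition.", "Our systems may be breached."]
BODIES = [
    "Competitors with greater resources could reduce our market share and harm "
    "our operating results materially.",
    "A breach of our information systems could expose confidential data and "
    "damage our reputation.",
]


def demo_blocks(bold, italic):
    """Build a toy Item 1A section with the given heading formatting."""
    blocks = []
    for heading, body in zip(HEADINGS, BODIES):
        blocks.append({"text": heading, "bold": bold, "italic": italic})
        blocks.append({"text": body, "bold": False, "italic": False})
    return blocks


for bold, italic in [(True, True), (True, False), (False, False)]:
    path, factors = split_risk_factors(demo_blocks(bold, italic))
    print(path, len(factors))
```

Recording which path produced each split, as the extraction path column does, lets you filter out filings that relied on the less reliable fallback.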