Documentation | Parsifex

Introduction

Parsifex is a platform for extracting, classifying, and analysing risk factor disclosures from US public company 10-K filings. It combines automated text extraction from SEC filings with a machine-learning topic model and a suite of natural language processing (NLP) tools to produce structured, analysis-ready datasets.

The platform is built around three core capabilities that work together in a pipeline:

Extraction. The filing parser reads the HTML source of a 10-K filing as published on EDGAR, locates the Item 1A (Risk Factors) section, and splits it into individual risk factors. It also extracts metadata such as the company name, fiscal year end date, and CIK (Central Index Key).
Classification. Each extracted risk factor is fed into a topic model that assigns it to a risk topic (e.g. Regulatory & Compliance, Cybersecurity & Technology) along with a confidence score and a set of representative keywords.
Linguistic analysis. A suite of NLP metrics is computed for each risk factor, covering sentiment, readability, verb tense orientation, word count, and named entity recognition. These attributes allow researchers and analysts to quantify the language and tone of corporate risk disclosures.

This documentation explains what each analytical component does, how to interpret its output, and how to use each of the platform’s three services: the Risk Classifier, the Report Processor, and the Bulk Data Download.

Topic Modeling

Parsifex uses BERTopic, a transformer-based topic modelling framework, to classify risk factors into topics. Unlike traditional bag-of-words models (e.g. LDA), BERTopic leverages contextual sentence embeddings to capture the semantic meaning of text, then clusters those embeddings and extracts representative terms for each cluster.

The model was trained on a large corpus of risk factor texts extracted from US 10-K filings. The training pipeline followed these steps:

Embedding. Each risk factor text is cleaned, tokenised, and then encoded into a dense vector using a sentence-transformer model. These embeddings capture the semantic meaning of each risk factor, not just surface-level word overlap.
Dimensionality reduction. The high-dimensional embedding space is compressed using UMAP so that semantically similar risk factors are placed close together in a lower-dimensional space.
Clustering. HDBSCAN groups the reduced embeddings into clusters. Each cluster becomes a candidate topic.
Topic labelling. Each cluster is described using c-TF-IDF (class-based TF-IDF) to extract its most representative keywords. An LLM is then used to generate a human-readable label for each topic based on those keywords and representative documents.
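
The keyword-extraction step can be illustrated with a minimal, dependency-free sketch of c-TF-IDF. It uses the weighting described in the BERTopic documentation: the normalised frequency of a term within a cluster, multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. The toy clusters, the whitespace tokenisation, and the function names are assumptions made for illustration; this is not the Parsifex training code.

```python
import math
from collections import Counter


def c_tf_idf(clusters: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each term per cluster: normalised tf * log(1 + A / f_t)."""
    # Treat every cluster as one long document (the "class-based" part).
    class_counts = {
        name: Counter(word for doc in docs for word in doc.lower().split())
        for name, docs in clusters.items()
    }
    # f_t: frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for name, counts in class_counts.items():
        class_size = sum(counts.values())
        scores[name] = {
            term: (tf / class_size) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores


def top_keywords(scores, name, k=3):
    """Return the k highest-scoring terms of one cluster (ties broken alphabetically)."""
    ranked = sorted(scores[name].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy clusters standing in for HDBSCAN output.
toy_clusters = {
    "cyber": ["data breach could harm us", "cyber attack could expose data"],
    "regulatory": ["new regulation could harm us", "regulation changes could raise costs"],
}
scores = c_tf_idf(toy_clusters)
print(top_keywords(scores, "cyber", k=1))       # ['data']
print(top_keywords(scores, "regulatory", k=1))  # ['regulation']
```

Terms shared by both clusters, such as "could", are down-weighted because their frequency across all clusters is high, which is why cluster-specific terms rise to the top.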

Linguistic Attributes

Beyond topic classification, Parsifex computes a set of linguistic attributes for each risk factor. These metrics are widely used in accounting and finance research to quantify the characteristics of corporate disclosures.

Word Count

The number of whitespace-separated tokens in the risk factor text. Longer risk factors may indicate more detailed disclosure or greater complexity.
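
As a small illustration of the definition above, whitespace-separated tokens can be counted with Python's built-in string splitting. The function name is hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens; punctuation stays attached to its word."""
    return len(text.split())


print(word_count("We may be unable to protect our intellectual property."))  # 9
```

Under this definition a hyphenated term such as "state-of-the-art" counts as one token.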

Sentiment (VADER)

Sentiment is measured using VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool. VADER was designed for social media and other informal text, but it is also effective on corporate disclosures.

VADER returns a compound score between −1 (most negative) and +1 (most positive). The score can be interpreted as follows:

Above +0.05: positive tone
Between −0.05 and +0.05: neutral
Below −0.05: negative tone
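
The interpretation bands above can be expressed as a short helper. This is a sketch: the function name is hypothetical, and treating scores of exactly ±0.05 as neutral is an assumption based on the "above" and "below" wording.

```python
def sentiment_label(compound: float) -> str:
    """Map a VADER compound score to a tone label using the bands above."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie between -1 and +1")
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    # Assumption: the boundary values themselves count as neutral.
    return "neutral"


print(sentiment_label(-0.62))  # negative
print(sentiment_label(0.01))   # neutral
```

The compound score itself would come from a VADER implementation, for example the `polarity_scores` method of `SentimentIntensityAnalyzer` in the `vaderSentiment` package.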

Readability (Gunning Fog Index)

The Gunning Fog Index estimates the number of years of formal education a reader needs to understand the text on a first reading. It is calculated based on average sentence length and the proportion of complex words (words with three or more syllables).

Typical ranges:

6–10: fairly easy to read (mainstream prose)
10–14: standard difficulty (business writing)
14–17: difficult (academic or legal text)
17+: very difficult (highly specialised or dense prose)
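
To make the calculation concrete, the sketch below applies the standard Gunning Fog formula, 0.4 × (average sentence length + percentage of complex words). The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple regular expression; both are assumptions for illustration, so the result will differ somewhat from what a dedicated readability library or Parsifex itself reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text: str) -> float:
    """0.4 * (words per sentence + 100 * share of words with 3+ syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)


# 6 words, 2 sentences, 1 complex word ("Regulation").
print(round(gunning_fog("We face risk. Regulation is changing."), 2))  # 7.87
```

The full definition of the index also excludes proper nouns, familiar jargon, and compound words from the complex-word count, which this sketch does not attempt.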

Tense Orientation

Tense orientation measures the distribution of verb forms across three temporal categories: past, present, and future. It is computed using spaCy’s part-of-speech tagger. Risk factors with a high future-tense share tend to be forward-looking; those dominated by past tense may describe historical events or precedents.
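
One way to turn tagger output into tense shares is sketched below. It takes (token, tag) pairs carrying Penn Treebank tags, which is what spaCy exposes as `token.tag_` for English models. The mapping from tags to tenses (VBD and VBN as past, VBP and VBZ as present, the modals "will" and "shall" as future) is an assumption for illustration; the exact rules Parsifex applies are not specified here.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tags to tense categories.
PAST_TAGS = {"VBD", "VBN"}
PRESENT_TAGS = {"VBP", "VBZ"}
FUTURE_MODALS = {"will", "shall", "'ll"}


def tense_shares(tagged_tokens):
    """Return the percentage of past, present, and future verb forms."""
    counts = Counter()
    for token, tag in tagged_tokens:
        if tag in PAST_TAGS:
            counts["past"] += 1
        elif tag in PRESENT_TAGS:
            counts["present"] += 1
        elif tag == "MD" and token.lower() in FUTURE_MODALS:
            counts["future"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"past": 0.0, "present": 0.0, "future": 0.0}
    return {tense: 100 * counts[tense] / total for tense in ("past", "present", "future")}


# Hand-tagged example: "We experienced outages and will face new threats."
tagged = [
    ("We", "PRP"), ("experienced", "VBD"), ("outages", "NNS"), ("and", "CC"),
    ("will", "MD"), ("face", "VB"), ("new", "JJ"), ("threats", "NNS"),
]
print(tense_shares(tagged))  # {'past': 50.0, 'present': 0.0, 'future': 50.0}
```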

Named Entity Recognition (NER)

Named entity recognition identifies and counts real-world entities mentioned in the text. Parsifex uses spaCy’s pre-trained NER model to detect entity types such as organisations, geopolitical entities, monetary amounts, dates, and people.
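
The counting step can be sketched as follows, starting from a list of entity labels such as the `label_` values of the entities in a spaCy `doc.ents`. Restricting the breakdown to ORG, GPE, MONEY, DATE, and PERSON mirrors the `ner_*` columns of the Report Processor output; the function name and the handling of other labels are assumptions.

```python
from collections import Counter

# Entity types reported as ner_* columns; other spaCy labels are ignored here.
REPORTED_TYPES = ("ORG", "GPE", "MONEY", "DATE", "PERSON")


def entity_counts(labels):
    """Count entities per reported type, keyed by output column name."""
    counts = Counter(labels)
    return {f"ner_{label.lower()}": counts[label] for label in REPORTED_TYPES}


# Hypothetical labels for one risk factor.
print(entity_counts(["ORG", "ORG", "DATE", "LAW"]))
# {'ner_org': 2, 'ner_gpe': 0, 'ner_money': 0, 'ner_date': 1, 'ner_person': 0}
```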

Risk Classifier

The Risk Classifier is a single-factor analysis tool. You paste one risk factor and receive its topic classification plus all linguistic attributes in real time.

The interface includes pre-loaded examples (Regulatory, IT, Supply chain, Interest rate, Competition) you can use to test the service.

Input

Risk factor text: paste any risk factor disclosure from a US firm’s 10-K filing. The text must be between 100 and 5,000 characters. It should be the full text of one risk factor. Pasting multiple risk factors or a short fragment of a risk factor can result in inaccurate classification.
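
If you prepare inputs programmatically, a length check along these lines can catch texts the classifier would reject. It is a sketch: the function name is hypothetical, and whether the service trims surrounding whitespace before counting characters is an assumption.

```python
MIN_CHARS = 100
MAX_CHARS = 5000


def validate_risk_factor(text: str) -> str:
    """Return the trimmed text, or raise if it falls outside 100 to 5,000 characters."""
    cleaned = text.strip()  # Assumption: surrounding whitespace does not count.
    if len(cleaned) < MIN_CHARS:
        raise ValueError(f"text has {len(cleaned)} characters; at least {MIN_CHARS} required")
    if len(cleaned) > MAX_CHARS:
        raise ValueError(f"text has {len(cleaned)} characters; at most {MAX_CHARS} allowed")
    return cleaned


try:
    validate_risk_factor("We face competition.")
except ValueError as err:
    print(err)  # text has 20 characters; at least 100 required
```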

Output

When a new risk factor is submitted, the classifier returns two result panels:

Topic classification:

Topic label: a short, human-readable description of the risk category (e.g. “Regulatory & Compliance Risk”).
Topic ID: the numeric identifier of the assigned cluster.
Confidence (probability): a score between 0 and 1 indicating how strongly the risk factor matches its assigned topic. Higher values indicate a better fit¹.
Top keywords: the most representative terms for the assigned topic, extracted via c-TF-IDF.
Similar risk factors: example risk factors from the training corpus that belong to the same topic, giving context for what the topic encompasses.
Similar topics: other topics that are semantically close to the risk factor, each with a lower probability than the primary assignment. Useful when a risk factor touches on more than one theme.

Text attributes:

Word count: total number of words.
Sentiment: VADER compound score with a color-coded bar (green = positive, red = negative, grey = neutral).
Readability: Gunning Fog Index with a gradient gauge (green = easy, red = difficult).
Tense orientation: dominant tense with a stacked bar showing past / present / future percentages.
Named entities: total count plus a per-type breakdown with horizontal bars.

¹ Interpreting Confidence Scores
A confidence score of 0.90 means the model is highly certain the risk factor belongs to the assigned topic. Scores below 0.60 suggest the text may touch on multiple themes or may not fit neatly into any single trained topic. Parsifex reports the score transparently so users can decide their own threshold for downstream analysis.
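
Applying your own threshold downstream can be as simple as the sketch below. The 0.60 default follows the guidance above, the `topic_probability` key matches the Report Processor column of that name, and the sample rows are invented for illustration.

```python
def split_by_confidence(rows, threshold=0.60):
    """Separate confidently classified risk factors from ambiguous ones."""
    confident = [row for row in rows if row["topic_probability"] >= threshold]
    ambiguous = [row for row in rows if row["topic_probability"] < threshold]
    return confident, ambiguous


# Hypothetical classifier output.
rows = [
    {"factor_index": 0, "topic_probability": 0.91},
    {"factor_index": 1, "topic_probability": 0.47},
    {"factor_index": 2, "topic_probability": 0.60},
]
confident, ambiguous = split_by_confidence(rows)
print(len(confident), len(ambiguous))  # 2 1
```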

Report Processor

The Report Processor handles full annual reports end to end. You upload one or more 10-K filings in HTML format, select which output attributes you want, and receive a downloadable structured file plus a visual summary.

Input

File upload: Upload 10-K filings in HTML format (.html or .htm), as downloaded directly from EDGAR. Each file can be up to 10 MB. You can drag and drop files into the upload area or click to browse.
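
Before uploading a batch, you may want to check files locally against these limits. The sketch below is one way to do so; the function name is hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".html", ".htm"}
MAX_BYTES = 10 * 1024 * 1024  # Assumption: 10 MB interpreted as 10 MiB.


def upload_problems(name: str, size_bytes: int) -> list[str]:
    """Return a list of reasons a file would be rejected (empty if none)."""
    problems = []
    if Path(name).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("file must be .html or .htm")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 10 MB")
    return problems


print(upload_problems("filing.htm", 2_000_000))          # []
print(upload_problems("filing.pdf", 11 * 1024 * 1024))   # both problems reported
```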

Analysis options — select which attribute groups to compute:

| Option | Columns added to output |
| --- | --- |
| Topic | topic_label, topic_probability, topic_keywords |
| Full risk factor text | full_text |
| Word count | word_count |
| Sentiment | sentiment_vader (VADER compound score) |
| Readability | readability_fog (Gunning Fog Index) |
| Tense | tense_past_pct, tense_present_pct, tense_future_pct |
| Named entities | ner_org, ner_gpe, ner_money, ner_date, ner_person |

Output format: choose between Excel (.xlsx), CSV, or JSON.

Output

The output file contains one row per extracted risk factor. The following columns are always present regardless of which options you select:

| Column | Description |
| --- | --- |
| factor_id | Unique identifier for the risk factor (CIK-year-index) |
| cik | 10-digit SEC Central Index Key |
| cik_source | How the CIK was resolved: “xbrl” (from the filing’s XBRL tags) or “firm_data” (matched by company name against SEC records) |
| company_name | Company name extracted from the uploaded file |
| fiscal_year_end | Fiscal year end date extracted from the uploaded file |
| factor_index | Position of this risk factor within the filing (0-based) |
| headline | The risk factor’s headline (or first sentence if no typeset headline was detected) |
| extraction_warnings | Any issues encountered during extraction (semicolon-separated) |
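
When working with the output file in code, a few small helpers cover the most common conversions. This is a sketch under stated assumptions: the helper names are hypothetical, the exact `factor_id` layout (zero-padded CIK, hyphen separators, unpadded index) is inferred from the "CIK-year-index" description rather than confirmed, and the warning strings are invented.

```python
def format_cik(cik) -> str:
    """Zero-pad a CIK to the 10-digit form used in the cik column."""
    return str(int(cik)).zfill(10)


def build_factor_id(cik, fiscal_year: int, factor_index: int) -> str:
    """Assumed layout: <10-digit CIK>-<year>-<0-based index>."""
    return f"{format_cik(cik)}-{fiscal_year}-{factor_index}"


def parse_warnings(field: str) -> list[str]:
    """Split the semicolon-separated extraction_warnings field into a list."""
    return [part.strip() for part in field.split(";") if part.strip()]


print(format_cik(1234567))                 # 0001234567
print(build_factor_id(1234567, 2023, 0))   # 0001234567-2023-0
print(parse_warnings("no headline detected; fallback path used"))
```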

Bulk Data Download

The Bulk Data Download service lets you build custom queries across the full database of pre-processed filings, firms, risk factors, and topics. It is designed for researchers who need large, filtered datasets for quantitative analysis.

Start by configuring your filters, then select the columns you need and choose an output format. Before committing, click Preview to check the row count, estimated file size, and a five-row sample. When you are ready, submit the query — large jobs are processed asynchronously and can be tracked in the Activity tab. Once complete, download the zipped output file; files remain available for 24 hours. You can also save your queries for quick reuse.

Filters

All filters are optional. Leaving a filter empty means no restriction on that dimension. Available filters:

| Filter | Description |
| --- | --- |
| Year range | Filing report year (inclusive). |
| SIC division | Standard Industrial Classification division. Filters firms by broad industry sector. |
| SIC 2-digit group | Two-digit SIC major group codes (1–99). Can be combined with division filters for finer granularity. |
| Exchange | Stock exchange (e.g. NYSE, Nasdaq, CBOE, OTC). Select one or more. |
| Risk topics | Restrict the output to risk factors assigned to specific risk categories. |
| CIK / Ticker list | Paste a list of CIKs or tickers to restrict the output to specific firms. These are mutually exclusive: use one or the other, not both. |
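
The filter rules can be summarised in a small validation sketch. The class and field names are hypothetical and do not correspond to a documented Parsifex API; the checks simply encode the constraints listed above (inclusive year range, SIC groups between 1 and 99, and CIKs or tickers but not both).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueryFilters:
    """Hypothetical container for a Bulk Data Download query; empty means unrestricted."""
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    sic_groups: list = field(default_factory=list)
    ciks: list = field(default_factory=list)
    tickers: list = field(default_factory=list)

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the filters are consistent."""
        issues = []
        if self.ciks and self.tickers:
            issues.append("use either CIKs or tickers, not both")
        if (self.year_from is not None and self.year_to is not None
                and self.year_from > self.year_to):
            issues.append("year range start is after its end")
        if any(not 1 <= group <= 99 for group in self.sic_groups):
            issues.append("SIC 2-digit groups must be between 1 and 99")
        return issues


print(QueryFilters().problems())                                      # []
print(QueryFilters(ciks=["0001234567"], tickers=["XYZ"]).problems())  # one issue
```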

Output Columns

Columns are organized into bundles. Select the bundles relevant to your research:

Identifiers: CIK, accession number, company name, ticker.
Filing metadata: form type, report date, report year, filing date, filing year, filing URL, primary document URL, entity type, SIC code and descriptions, exchange, category, owner organization.
Risk factor content: risk factor index, extraction path², full risk factor text.
Topic: topic ID, topic label, topic probability, topic keywords, factor count (when aggregated).

When you select only filing-level columns (no risk-factor-level content), the output is automatically aggregated to one row per filing per topic, with a factor count. When you include risk-factor-level columns (e.g. the full text), the output contains one row per individual risk factor.
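
The aggregation rule can be pictured with the following sketch, which collapses factor-level rows into one row per filing per topic with a factor count. The key names (`accession_number`, `topic_id`, `factor_count`) and the sample rows are assumptions for illustration, not the exact column names of the download.

```python
from collections import Counter


def aggregate_by_filing_topic(factor_rows):
    """Collapse one-row-per-risk-factor data to one row per filing per topic."""
    counts = Counter((row["accession_number"], row["topic_id"]) for row in factor_rows)
    return [
        {"accession_number": accession, "topic_id": topic, "factor_count": n}
        for (accession, topic), n in sorted(counts.items())
    ]


# Hypothetical factor-level rows: filing A has three risk factors, filing B has one.
factor_rows = [
    {"accession_number": "A", "topic_id": 1},
    {"accession_number": "A", "topic_id": 1},
    {"accession_number": "A", "topic_id": 2},
    {"accession_number": "B", "topic_id": 1},
]
for row in aggregate_by_filing_topic(factor_rows):
    print(row)
```

Here four factor-level rows become three aggregated rows, because the two topic-1 factors in filing A are merged into a single row with a factor count of 2.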

Output format: choose between CSV, Parquet (for Python), or Excel (.xlsx). The downloadable file is always delivered as a zipped file.

² The platform locates the Item 1A (Risk Factors) section in the HTML filing by searching for the standard heading pattern. Within that section, individual risk factors are split using a tiered strategy:

Primary path: bold and italic headings. Most 10-K filings use bold+italic formatting for risk factor titles.
Secondary path: bold-only headings. Some filers use bold text (without italic) for their risk factor titles.
Fallback path: text-shape heuristics. When no consistent typographic structure is detected, the parser falls back to analysing text patterns to estimate factor boundaries.
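
The tiered strategy can be sketched as a chain of heading detectors tried in order. The example assumes the Item 1A section has already been reduced to text blocks with bold and italic flags. The block structure, the function names, the rule of picking a tier as soon as any matching block exists, and in particular the fallback heuristic (a block of at most 20 words counts as a heading) are all hypothetical simplifications, not the parser's actual rules.

```python
def choose_path(blocks):
    """Pick the first tier whose heading style occurs in the section."""
    if any(b["bold"] and b["italic"] for b in blocks):
        return "primary", lambda b: b["bold"] and b["italic"]
    if any(b["bold"] for b in blocks):
        return "secondary", lambda b: b["bold"]
    # Hypothetical text-shape heuristic: short blocks are treated as headings.
    return "fallback", lambda b: len(b["text"].split()) <= 20


def split_risk_factors(blocks):
    """Group blocks into risk factors, starting a new one at each detected heading."""
    path, is_heading = choose_path(blocks)
    factors = []
    for block in blocks:
        if is_heading(block) or not factors:
            factors.append({"headline": block["text"], "body": []})
        else:
            factors[-1]["body"].append(block["text"])
    return path, factors


# Hypothetical section that uses bold-only headings.
blocks = [
    {"text": "We face intense competition.", "bold": True, "italic": False},
    {"text": "Competitors may have greater resources than we do.", "bold": False, "italic": False},
    {"text": "Our systems may fail.", "bold": True, "italic": False},
    {"text": "An outage could interrupt our operations.", "bold": False, "italic": False},
]
path, factors = split_risk_factors(blocks)
print(path, len(factors))  # secondary 2
```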
