Academic Research

Peer-relative textual analysis
of corporate disclosures

A novel framework for measuring disclosure behavior as competitive deltas, grounded in established literature and validated across six years of out-of-sample data. We welcome collaboration, co-authorship, and data access inquiries.

The Dataset
Where it is. Where it is going.
The ContextQuant dataset is a structured panel of corporate textual and financial data designed for empirical research in accounting, finance, and NLP. It currently covers 185 US-listed and ADR-traded companies across seven years. The effective test sample is 130 US-domestic 10-K filers.
17,207 SEC Filings
73,660 Parsed Text Sections
198,192 NLP Features
7M+ Words Analyzed
311,692 Daily Price Rows
53,579 Macro Observations
406,000+ Political Contributions
2,907 Exec Comp Entries
24,078 Corporate Events
The dataset is expanding along three dimensions. Coverage is extending to the full US-listed market (~4,000 companies via EDGAR) and to Canadian-listed companies (~3,500 via SEDAR+), with European and Asian filing repositories planned. Data sources are expanding to include patent filings (USPTO), insider transactions, and additional transcript coverage. NLP methods are extending from Loughran-McDonald dictionaries to transformer-based models (FinBERT, GPT-4 calibrated sentiment).
The underlying database uses SQLite with an 18-table schema covering filings, sections, features, price data, macro time series, competitive pairs, corporate events, executive compensation, and political contributions. The full pipeline is reproducible from raw EDGAR downloads through feature extraction and hypothesis testing.
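As a rough illustration of how filings, sections, and features could relate in such a schema, the sketch below builds three tables in an in-memory SQLite database using Python's built-in sqlite3 module. All table names, column names, and the sample row are hypothetical; the actual 18-table schema is not reproduced here and may differ.

```python
import sqlite3

# Hypothetical table and column names; the real schema may differ.
SCHEMA = """
CREATE TABLE filings (
    filing_id   INTEGER PRIMARY KEY,
    ticker      TEXT NOT NULL,
    form_type   TEXT NOT NULL,      -- e.g. '10-K', '8-K'
    filed_date  TEXT NOT NULL       -- ISO date string
);
CREATE TABLE sections (
    section_id  INTEGER PRIMARY KEY,
    filing_id   INTEGER NOT NULL REFERENCES filings(filing_id),
    item        TEXT NOT NULL,      -- e.g. '1A', '7'
    text        TEXT NOT NULL
);
CREATE TABLE features (
    section_id  INTEGER NOT NULL REFERENCES sections(section_id),
    name        TEXT NOT NULL,
    value       REAL NOT NULL,
    PRIMARY KEY (section_id, name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Illustrative sample data only.
conn.execute("INSERT INTO filings VALUES (1, 'XYZ', '10-K', '2023-02-15')")
conn.execute("INSERT INTO sections VALUES (1, 1, '1A', 'Risk factor text')")
conn.execute("INSERT INTO features VALUES (1, 'risk_specificity', 0.42)")

# Join a feature back to the filing it was extracted from.
row = conn.execute(
    """
    SELECT f.ticker, f.filed_date, s.item, x.name, x.value
    FROM features x
    JOIN sections s ON s.section_id = x.section_id
    JOIN filings  f ON f.filing_id  = s.filing_id
    """
).fetchone()
print(row)
```

Storing features in a long (section, name, value) layout is one possible design choice; it lets new NLP features be added without altering the table definition.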
Tested Hypotheses
Seven hypotheses tested in the first wave
All hypotheses are tested using a rolling out-of-sample framework with six annual evaluation windows (2020-2025). Spearman rank IC, quintile spreads, t-statistics, and hit rates are reported. Additional hypotheses on transcript analysis, patent innovation, and insider transactions are in development.
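The reported metrics can be sketched in a few lines of standard-library Python. This is a minimal illustration on synthetic data, not the project's evaluation code: the function names are hypothetical, the quintile spread is assumed to be top-quintile minus bottom-quintile mean return, and the hit rate is assumed to be the share of annual windows whose IC has the hypothesized sign.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_ic(signal, fwd_returns):
    """Spearman rank IC: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(signal), average_ranks(fwd_returns))

def quintile_spread(signal, fwd_returns):
    """Mean return of the top signal quintile minus the bottom quintile."""
    pairs = sorted(zip(signal, fwd_returns))
    n = len(pairs) // 5
    bottom = mean(r for _, r in pairs[:n])
    top = mean(r for _, r in pairs[-n:])
    return top - bottom

def hit_rate(yearly_ics, expected_sign):
    """Share of evaluation windows whose IC has the hypothesized sign."""
    hits = sum(1 for ic in yearly_ics if ic * expected_sign > 0)
    return hits / len(yearly_ics)

# Synthetic cross-section: higher signal, lower forward return.
signal = list(range(10))
fwd_returns = [0.09 - 0.01 * i for i in range(10)]
ic = spearman_ic(signal, fwd_returns)
spread = quintile_spread(signal, fwd_returns)
# Made-up yearly ICs, scored against a hypothesized negative sign.
hr = hit_rate([-0.1, -0.2, 0.05, -0.3, 0.02, -0.1], expected_sign=-1)
print(ic, spread, hr)
```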
H1
Risk Factor Specificity Predicts Forward Returns
TF-IDF scoring of Item 1A risk factors measures company-specific vs. boilerplate language. Higher specificity predicts 6-12 month underperformance. Mean IC = -0.121, spread = -6.9%, 4/6 years. 2023: IC = -0.275, t = -3.48 (p < 0.01). Extends Campbell et al. (2014, RAS) and Kravet & Muslu (2013, RAS).
Strong
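One way to picture a TF-IDF specificity score is sketched below. The exact scoring used for H1 is not described on this page, so the formula here is an assumption: each filer's score is the sum of term frequency times inverse document frequency over its Item 1A vocabulary, so language shared with other filers is down-weighted and company-specific language raises the score. The function names and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def specificity_scores(docs):
    """Map each ticker to the sum of TF-IDF weights over its terms.

    docs: dict of ticker -> Item 1A text. A term used by every filer has
    idf = log(N / N) = 0, so pure boilerplate contributes nothing.
    """
    tokens = {ticker: tokenize(text) for ticker, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))  # count each term once per document
    scores = {}
    for ticker, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks)
        scores[ticker] = sum(
            (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        )
    return scores

# Toy corpus: two filers share boilerplate, one names concrete exposures.
docs = {
    "AAA": "we face competition and economic conditions",
    "BBB": "we face competition and economic conditions",
    "CCC": "we face lithium supply disruption in chile",
}
scores = specificity_scores(docs)
print(scores)
```

In this toy corpus the filer with concrete, company-specific wording ("CCC") receives the highest score, while the two filers with identical generic wording receive the same lower score.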
H2
MD&A Sentiment Drift Predicts Short-Term Returns
YoY changes in Loughran-McDonald net sentiment from Item 7. Perfect consistency: 6/6 years at the 0-30 day horizon (p = 0.016, one-sided sign test). Excessive optimism is a contrarian indicator. Extends Jiang et al. (2019, JFE) at the firm level and Huang et al. (2014, TAR) on tone management.
Strong
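The sign-test p-value for a 6/6 record can be checked directly. Under the null hypothesis that each year's sign is a fair coin flip, the probability of six hits in six windows is 0.5^6 = 1/64, which rounds to the reported 0.016 as a one-sided value. The helper name below is hypothetical.

```python
from math import comb

def sign_test_p(hits, n):
    """One-sided binomial sign test: P(X >= hits) with X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(hits, n + 1)) / 2 ** n

print(round(sign_test_p(6, 6), 3))  # 0.016
print(round(sign_test_p(4, 6), 3))  # 0.344
```

The second line shows why a 4/6 record on its own carries little weight under this test: with only six windows, hit rates below 6/6 cannot reach conventional significance.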
H3
Combined Signal Contains Independent Information
Composite of H1 and H2 achieves 75% hit rate, exceeding both individual signals. Complementary time horizons: sentiment at 0-30d, risk specificity at 6-12m, composite dominates at 90-180d (IC = -0.106). Composite does not generate false signals in dormant environments.
Strong
H10
Macro Regime Determines Which Signal Dominates
Risk specificity 2.7x more informative in high-uncertainty regimes; sentiment 2.1x in low-uncertainty. Composite at 90-180d in high uncertainty: IC = -0.137, 6/6 hit rate. Regime classification: VIX > 20 or fed funds change > 50bps/6mo. Novel finding of regime-complementary textual signals.
Confirmed
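The regime rule stated above can be written as a small predicate. This is a sketch under two assumptions that the text does not settle: the fed funds condition is applied to the absolute six-month change (so large cuts also count), and both thresholds are strict inequalities. The function and parameter names are hypothetical.

```python
def is_high_uncertainty(vix, fed_funds_change_6m_bps):
    """Classify a date as high-uncertainty.

    Rule from the text: VIX > 20 or fed funds change > 50bps over 6 months.
    Assumption: the rate condition uses the absolute change.
    """
    return vix > 20 or abs(fed_funds_change_6m_bps) > 50

print(is_high_uncertainty(25.0, 0))    # True: VIX above 20
print(is_high_uncertainty(15.0, 25))   # False: neither condition met
print(is_high_uncertainty(15.0, -75))  # True under the absolute-change assumption
```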
H5
8-K Event Clustering Predicts Short-Term Returns
Executive change flag (Item 5.02) shows a directional signal at 0-30d but is not robust across years. The event frequency and materiality surge measures give inconsistent results. Requires a larger sample for statistical power.
Exploratory
H6
Compensation Structure Predicts Disclosure Behavior
Limited by data availability (15 tickers with CEO-level comp). Supplementary test: IC = -0.149 between equity comp increases and subsequent risk specificity changes. Direction interesting but sample insufficient. Relates to Core et al. (1999, JFE) excess compensation framework.
Data-Limited
H7
Political Spending Patterns and Returns
Original hypothesis rejected, then revised with expanded data (134 tickers, up from 49, and 406K contributions). Raw total contributions show positive ICs (avg +0.159, peak IC = +0.273), consistent with large political spenders outperforming (regulatory capture). However, contribution intensity (spending normalized by dollar volume) at 90-180d shows negative spreads in all 3 cycles (3/3 hit rate): smaller companies that spend disproportionately underperform. The direction depends on absolute vs. size-relative measurement. This size-conditioning effect is a novel finding.
Size-Dependent
Literature Positioning
Three novel contributions
First, we introduce a peer-relative framework measuring each company's textual characteristics against direct competitive peers rather than in isolation or against broad industry benchmarks. Second, we demonstrate that the specificity dimension of risk disclosures (TF-IDF) contains return-predictive information distinct from aggregate disclosure volume. Third, we document regime-complementary behavior of textual signals: different disclosure dimensions become informative under opposing macroeconomic conditions.
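The peer-relative idea in the first contribution can be sketched as a simple delta: a company's feature value minus the average of the same feature across its direct competitors. The aggregation used in the framework (mean, median, or a weighted scheme) is not specified here, so the plain mean below is an assumption, and the function name, tickers, and values are hypothetical.

```python
from statistics import mean

def peer_relative_delta(feature, peers):
    """Return ticker -> (own value minus mean value of listed peers).

    feature: ticker -> feature value (e.g. a risk specificity score)
    peers:   ticker -> list of direct competitor tickers
    Tickers with no peers that have a feature value are omitted.
    """
    deltas = {}
    for ticker, value in feature.items():
        peer_values = [feature[p] for p in peers.get(ticker, []) if p in feature]
        if peer_values:
            deltas[ticker] = value - mean(peer_values)
    return deltas

# Illustrative values only.
feature = {"AAA": 0.30, "BBB": 0.10, "CCC": 0.20}
peers = {"AAA": ["BBB", "CCC"], "BBB": ["AAA"]}
deltas = peer_relative_delta(feature, peers)
print(deltas)
```

Measuring against named competitors rather than an industry average means two companies with the same raw score can receive deltas of opposite sign, depending on how their own peers disclose.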
Key References
Campbell, Chen, Dhaliwal, Lu, Steele (2014). The Information Content of Mandatory Risk Factor Disclosures. Review of Accounting Studies, 19(1), 396-455.
Jiang, Lee, Martin, Zhou (2019). Manager Sentiment and Stock Returns. Journal of Financial Economics, 132(1), 126-149.
Kravet, Muslu (2013). Textual Risk Disclosures and Investors' Risk Perceptions. Review of Accounting Studies, 18(4), 1088-1122.
Huang, Teoh, Zhang (2014). Tone Management. The Accounting Review, 89(3), 1083-1113.
Loughran, McDonald (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.
Price, Doran, Peterson, Bliss (2012). Earnings Conference Calls and Stock Returns. Journal of Banking & Finance, 36(4), 992-1011.
Tetlock (2007). Giving Content to Investor Sentiment. Journal of Finance, 62(3), 1139-1168.
Feldman, Govindaraj, Livnat, Segal (2010). Management's Tone Change. Review of Accounting Studies, 15(4), 915-953.
Working Paper
[PDF]
Peer-Relative Textual Analysis of Corporate Disclosures: Evidence from Risk Specificity, Management Sentiment, and Macro Regime Conditioning
Full methodology, robustness checks, and portfolio simulation results. Includes complete hypothesis development, data description, and discussion of limitations.
Forthcoming — March 2026
Collaboration
How to work with us
Data Access
We are open to providing dataset access to researchers working on related questions in disclosure quality, textual analysis, or market microstructure. Contact us to discuss scope and terms.
Co-Authorship
We welcome collaboration with faculty working in accounting, finance, or NLP. The dataset and framework lend themselves to multiple publishable studies beyond our initial hypotheses.
Extension Research
Open questions: transformer-based sentiment models, international filer analysis, mid/small-cap generalizability, Fama-French alpha estimation, and dynamic signal weighting optimization.
Interested in the research?
Whether you are exploring data access, considering co-authorship, or have questions about the methodology, we would welcome the conversation.
info@contextquant.com