GAAP & SEC Compliance Dataset

A comprehensive dataset for financial AI applications

Dataset Overview

This dataset contains 470,151 documents covering US GAAP (Generally Accepted Accounting Principles) standards and SEC (Securities and Exchange Commission) filing requirements. It's designed for training and evaluating AI systems for financial compliance, accounting Q&A, and regulatory analysis.

Key Statistics

  • Total Documents: 470,151
  • Average Length: 363 characters
  • Unique Companies: 6,573
  • Date Range: 2007-01-31 to 2025-12-01
  • Dataset Size: ~296MB
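
The derived figures can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the total character count of 171,055,320 reported under Dataset Statistics; the constant names are illustrative and not part of the dataset.

```python
from datetime import date

# Figures copied from this card; the constant names are illustrative only.
TOTAL_DOCUMENTS = 470_151
TOTAL_CHARACTERS = 171_055_320  # reported under Dataset Statistics
FIRST_DATE = date(2007, 1, 31)
LAST_DATE = date(2025, 12, 1)

# Mean document length in characters
average_length = TOTAL_CHARACTERS / TOTAL_DOCUMENTS

# Number of calendar days between the earliest and latest document dates
span_days = (LAST_DATE - FIRST_DATE).days

print(f"Average length: {average_length:.1f} characters")
print(f"Date span: {span_days} days")
```

The average is a plain mean over all documents, so it says nothing about how lengths are distributed across sources.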

Content Distribution

By Source

  • XBRL: 445,211 (94.7%)
  • SEC_FILING: 24,935 (5.3%)
  • GAAP_STANDARD: 5 (<0.1%)

By Document Type

  • tag: 445,211 (94.7%)
  • financial_data: 24,935 (5.3%)
  • standard: 5 (<0.1%)

By Category

  • Other: 294,775 (62.7%)
  • Expenses: 55,303 (11.8%)
  • Assets: 35,592 (7.6%)
  • Liabilities: 32,958 (7.0%)
  • Income: 24,658 (5.2%)
  • Equity: 19,732 (4.2%)
  • Revenue: 7,133 (1.5%)
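
Breakdowns like the ones above can be recomputed from the metadata fields. The following is a sketch rather than the script used to build this card: `field_distribution` is a hypothetical helper, and the small hand-made `sample` list stands in for the real records.

```python
from collections import Counter

def field_distribution(records, field):
    """Return {value: (count, percent)} for one metadata field, largest first."""
    counts = Counter(rec["metadata"].get(field, "Unknown") for rec in records)
    total = sum(counts.values())
    return {
        value: (count, round(100 * count / total, 1))
        for value, count in counts.most_common()
    }

# Tiny hand-made sample standing in for the real records (assumption)
sample = [
    {"metadata": {"source": "XBRL", "category": "Assets"}},
    {"metadata": {"source": "XBRL", "category": "Other"}},
    {"metadata": {"source": "XBRL", "category": "Other"}},
    {"metadata": {"source": "SEC_FILING", "category": "Revenue"}},
]

for value, (count, pct) in field_distribution(sample, "source").items():
    print(f"{value}: {count} ({pct}%)")
```

Passing `"type"` or `"category"` as the field gives the other two breakdowns.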

Use Cases

AI Chatbots

Build intelligent assistants for:

  • GAAP compliance questions
  • SEC filing analysis
  • Accounting standard lookup
  • Financial regulation guidance

Information Retrieval

Power search engines for:

  • Financial document discovery
  • Regulatory text mining
  • Compliance research
  • Academic studies

Machine Learning

Train models for:

  • Financial text classification
  • Accounting Q&A systems
  • Regulatory NLP tasks
  • Domain adaptation
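
For text classification, one possible starting point is to use the `category` metadata field as the label. The sketch below assumes the nested `metadata` layout described under Dataset Structure; `to_classification_pairs` is a hypothetical helper, and skipping the catch-all "Other" category is a design choice, not something the dataset requires.

```python
def to_classification_pairs(records, skip_labels=("Other",)):
    """Turn records into (text, label) pairs using metadata['category'] as the label."""
    pairs = []
    for rec in records:
        label = rec.get("metadata", {}).get("category")
        text = rec.get("content", "").strip()
        # Skip records with no text, no label, or an excluded label
        if not text or label is None or label in skip_labels:
            continue
        pairs.append((text, label))
    return pairs

# Hand-made records for illustration only
sample = [
    {"content": "Cash and cash equivalents", "metadata": {"category": "Assets"}},
    {"content": "Miscellaneous tag", "metadata": {"category": "Other"}},
    {"content": "   ", "metadata": {"category": "Revenue"}},
    {"content": "Accounts payable, current", "metadata": {"category": "Liabilities"}},
]

print(to_classification_pairs(sample))
```

Because "Other" makes up 62.7% of the documents, keeping it would leave the label distribution heavily imbalanced.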

Quick Start

Load Dataset

from datasets import load_dataset

# Load full dataset
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset")

# Or stream for memory efficiency
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset", streaming=True)

# Access examples
for example in dataset["train"]:
    print(f"Title: {example['metadata']['title']}")
    print(f"Source: {example['metadata']['source']}")
    print(f"Content: {example['content'][:200]}...")
    break
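
When streaming, it is often enough to pull a handful of matching records without reading the whole split. The sketch below works on any iterable of records; `take_matching` is a hypothetical helper, and the hand-made `sample` list stands in for the streamed split.

```python
from itertools import islice

def take_matching(records, source, limit=3):
    """Return up to `limit` records whose metadata source matches, reading lazily."""
    matches = (rec for rec in records if rec["metadata"]["source"] == source)
    # islice stops pulling from the iterable once `limit` matches are found
    return list(islice(matches, limit))

# Hand-made records for illustration; with the real data this would be
# take_matching(dataset["train"], "SEC_FILING") on the streamed split.
sample = [
    {"content": "us-gaap:Assets", "metadata": {"source": "XBRL"}},
    {"content": "10-K excerpt A", "metadata": {"source": "SEC_FILING"}},
    {"content": "us-gaap:Revenues", "metadata": {"source": "XBRL"}},
    {"content": "10-Q excerpt B", "metadata": {"source": "SEC_FILING"}},
]

for rec in take_matching(iter(sample), "SEC_FILING", limit=2):
    print(rec["content"])
```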

Build RAG System

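from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss

# Load dataset
docs = load_dataset("aanshshah/gaap-sec-compliance-dataset")["train"]

# Create embeddings (returned as a float32 NumPy array, which FAISS expects)
encoder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = encoder.encode([doc["content"] for doc in docs])

# Build FAISS index sized to the embedding width (384 for all-MiniLM-L6-v2)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

def search_docs(query, k=5):
    """Return the k documents closest to the query in embedding space."""
    query_vec = encoder.encode([query])
    _, indices = index.search(query_vec, k)
    # Cast NumPy integers to plain ints before indexing the dataset
    return [docs[int(i)] for i in indices[0]]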

# Example usage
results = search_docs("What is ASC 606?")
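
To make the retrieval step less of a black box: a flat L2 index performs an exhaustive search, comparing the query against every stored vector and keeping the k smallest squared Euclidean distances. The pure-Python sketch below illustrates that idea on toy two-dimensional vectors; it is not FAISS code, and `squared_l2` and `flat_l2_search` are hypothetical names.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k=5):
    """Exhaustive nearest-neighbour search; returns [(distance, index), ...]."""
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Toy 2-dimensional "embeddings" for illustration only
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

for distance, idx in flat_l2_search(vectors, [0.9, 0.1], k=2):
    print(f"index={idx} distance={distance:.2f}")
```

Exhaustive search is exact but scales linearly with the number of documents, which is worth keeping in mind at 470,151 vectors.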

Use with LangChain

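# In recent LangChain releases these integrations live in langchain_community
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings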

# Load documents
loader = HuggingFaceDatasetLoader(
    path="aanshshah/gaap-sec-compliance-dataset",
    page_content_column="content"
)
documents = loader.load()

# Create vector store
embeddings = HuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)

# Query
results = vectorstore.similarity_search("revenue recognition", k=5)

Dataset Structure

Document Schema

Each document contains:

  • id: Unique identifier
  • content: Full text content
  • metadata: Structured information including:
    • source: Origin (XBRL, SEC_FILING, GAAP_STANDARD)
    • type: Document type (tag, financial_data, standard)
    • category: Financial category (Assets, Revenue, etc.)
    • code: Standard code (e.g., "ASC 606", "us-gaap:Assets")
    • title: Human-readable title
    • date: Date in YYYY-MM-DD format
    • company: Company name (for SEC filings)
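
A record can be checked against this schema before it is used downstream. The sketch below is one possible validator, not the dataset's own tooling: `validate_document` is a hypothetical helper, and treating `id`, `content`, and `metadata` as mandatory while leaving `date` optional is an assumption.

```python
from datetime import datetime

# Allowed values taken from the schema description above
ALLOWED_SOURCES = {"XBRL", "SEC_FILING", "GAAP_STANDARD"}
ALLOWED_TYPES = {"tag", "financial_data", "standard"}

def validate_document(doc):
    """Return a list of problems found; an empty list means the document passed."""
    problems = []
    for key in ("id", "content", "metadata"):  # assumed to be mandatory
        if key not in doc:
            problems.append(f"missing field: {key}")
    meta = doc.get("metadata") or {}
    if meta.get("source") not in ALLOWED_SOURCES:
        problems.append(f"unexpected source: {meta.get('source')!r}")
    if meta.get("type") not in ALLOWED_TYPES:
        problems.append(f"unexpected type: {meta.get('type')!r}")
    date_value = meta.get("date")
    if date_value is not None:  # date is treated as optional here
        try:
            datetime.strptime(date_value, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"bad date: {date_value!r}")
    return problems

doc = {
    "id": "gaap_standard_67a64e72e3390f7e",
    "content": "# ASC 606: Revenue from Contracts with Customers...",
    "metadata": {"source": "GAAP_STANDARD", "type": "standard", "date": "2025-01-01"},
}
print(validate_document(doc))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a record at once.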

Example Document

{
  "id": "gaap_standard_67a64e72e3390f7e",
  "content": "# ASC 606: Revenue from Contracts with Customers...",
  "metadata": {
    "source": "GAAP_STANDARD",
    "type": "standard",
    "category": "Revenue",
    "code": "ASC 606",
    "title": "ASC 606: Revenue from Contracts with Customers",
    "date": "2025-01-01"
  }
}

Data Creation Process

Sources

  1. XBRL US GAAP Taxonomy (94.7%)
    • Complete standardized accounting tags
    • Hierarchical relationships preserved
  2. SEC EDGAR Database (5.3%)
    • Real company 10-K/10-Q filings
    • Quarterly data from 2007-2025
  3. FASB Standards (<0.1%)
    • Core GAAP standards (ASC)
    • Implementation guidance

Processing Pipeline

  1. Extraction: Parse XBRL, HTML, PDF sources
  2. Standardization: Convert to consistent JSON format
  3. Cleaning: Remove duplicates and invalid entries
  4. Enrichment: Add metadata and categories
  5. Validation: Ensure quality and completeness
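
Steps 2 to 5 can be sketched in a few functions. This is an illustration under stated assumptions, not the pipeline that produced the dataset: the raw record layout (`text`, `source`, `title`), the hash-based id scheme, and the keyword rules in `CATEGORY_KEYWORDS` are all hypothetical.

```python
import hashlib

# Hypothetical keyword rules; the real categorization logic is not documented here
CATEGORY_KEYWORDS = {
    "Revenue": ("revenue", "sales"),
    "Assets": ("asset", "cash"),
    "Liabilities": ("liabilit", "payable"),
}

def standardize(raw):
    """Step 2: map a raw record onto the id/content/metadata layout."""
    content = raw["text"].strip()
    normalized = " ".join(content.lower().split())  # whitespace- and case-insensitive key
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return {
        "id": f"{raw['source'].lower()}_{digest}",  # assumed id scheme
        "content": content,
        "metadata": {"source": raw["source"], "title": raw.get("title", "")},
    }

def enrich(doc):
    """Step 4: attach a category based on simple keyword matching."""
    text = doc["content"].lower()
    doc["metadata"]["category"] = next(
        (cat for cat, words in CATEGORY_KEYWORDS.items() if any(w in text for w in words)),
        "Other",
    )
    return doc

def run_pipeline(raw_records):
    """Steps 2 to 5: standardize, clean, enrich, and validate raw records."""
    seen, output = set(), []
    for raw in raw_records:
        if not raw.get("text", "").strip():
            continue  # step 3: drop invalid (empty) entries
        doc = standardize(raw)
        if doc["id"] in seen:
            continue  # step 3: drop duplicates
        seen.add(doc["id"])
        output.append(enrich(doc))
    # Step 5: every surviving document must carry the required fields
    assert all(d["id"] and d["content"] and "category" in d["metadata"] for d in output)
    return output

raw_records = [
    {"text": "Revenue from contracts with customers", "source": "XBRL"},
    {"text": "revenue from  contracts with customers ", "source": "XBRL"},
    {"text": "   ", "source": "SEC_FILING"},
    {"text": "Accounts payable, current", "source": "SEC_FILING"},
]
for doc in run_pipeline(raw_records):
    print(doc["id"], doc["metadata"]["category"])
```

Hashing a normalized copy of the text lets near-identical entries that differ only in case or spacing collapse into one document while the stored content keeps its original form.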

Applications in Production

Financial Institutions

  • Compliance monitoring systems
  • Risk assessment tools
  • Regulatory report generation
  • Audit automation

FinTech Companies

  • AI-powered accounting assistants
  • Automated bookkeeping
  • Financial analysis platforms
  • Investment research tools

Education & Training

  • Interactive learning platforms
  • Professional certification prep
  • Academic research
  • Student Q&A systems

Quality & Coverage

Quality Metrics

  • Deduplicated: No duplicate documents
  • Validated: All required fields present
  • Cleaned: Invalid entries removed
  • Structured: Consistent schema
  • Current: Up-to-date as of December 2025

Coverage Areas

  • Complete US GAAP taxonomy
  • Major public company filings
  • All accounting categories
  • Historical and current standards
  • Multiple filing types (10-K, 10-Q, 8-K)

Legal & Ethics

Data Sources

  • All data from public sources
  • No proprietary information
  • SEC EDGAR publicly available filings
  • XBRL taxonomy open standard

Use Restrictions

  • Not for investment advice
  • Educational/research purposes
  • Verify critical information with official sources
  • Comply with applicable regulations

Privacy

  • No personal identifying information
  • No material non-public information
  • Only public company data
  • Anonymized where appropriate

Updates & Maintenance

Version History

  • v1.0.0 (December 2025): Initial release with 470K documents

Update Schedule

  • Quarterly updates planned
  • New SEC filings added
  • GAAP standard updates included
  • Community feedback incorporated

Support & Community

Getting Help

Contributing

  • Report data quality issues
  • Suggest additional sources
  • Share use cases and applications
  • Submit improvements

Citation

If you use this dataset in your research or applications, please cite:

@dataset{gaap_sec_compliance_2025,
  title={GAAP & SEC Compliance Dataset},
  author={Shah, Aansh},
  year={2025},
  month={12},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset},
  note={A comprehensive dataset of 470,151 financial documents for AI applications}
}

Acknowledgments

  • XBRL US for taxonomy data
  • SEC EDGAR for public filings
  • FASB for accounting standards
  • HuggingFace for hosting platform
  • FlexAI for compute resources

Dataset Statistics

| Metric | Value |
|---|---|
| Documents | 470,151 |
| Characters | 171,055,320 |
| Companies | 6,573 |
| Date Span | 6,879 days (2007-01-31 to 2025-12-01) |
| Storage | ~296MB |

Built for the financial AI community

Ready to build the next generation of financial AI? Start with this dataset!
