Language Model Training Data: Sources and Preparation Guide
The Hugging Face Revolution: 1 Million Models and Counting
It is worth highlighting the incredible growth of the Hugging Face platform, which now hosts over 1 million AI models. This milestone reflects not just the platform's success but the explosive growth in AI development worldwide.
Beyond models, Hugging Face maintains an impressive collection of over 75,000 datasets spanning more than 100 languages. These datasets support a wide range of tasks across natural language processing, computer vision, and audio processing, making it an invaluable resource for training and fine-tuning AI models.
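You can also browse this catalog programmatically. The snippet below is a minimal sketch using the huggingface_hub client library (assumed to be installed separately); the search term and result limit are arbitrary examples.
# List a few datasets on the Hub matching a search term
from huggingface_hub import list_datasets

for ds in list_datasets(search="wikipedia", limit=5):
    print(ds.id)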
Public Datasets for Language Model Training
General Text Corpora
The Pile
- Description: A diverse 825GB English text corpus designed specifically for LLM training
- Access: https://pile.eleuther.ai/
- Usage:
# Using the Hugging Face datasets library
pip install datasets
python -c "from datasets import load_dataset; dataset = load_dataset('the_pile')"
C4 (Colossal Clean Crawled Corpus)
- Description: 750GB of clean English web text data
- Access: https://huggingface.co/datasets/c4
- Usage:
# Load English subset
python -c "from datasets import load_dataset; dataset = load_dataset('c4', 'en')"
WikiText
- Description: Long-term dependency language modeling dataset with clean Wikipedia articles
- Access: https://huggingface.co/datasets/wikitext
- Best for: Smaller models and initial testing (low resource requirement)
- Usage:
# Load WikiText-103
python -c "from datasets import load_dataset; dataset = load_dataset('wikitext', 'wikitext-103-v1')"
BookCorpus
- Description: Collection of free books
- Access: https://huggingface.co/datasets/bookcorpus
- Usage:
python -c "from datasets import load_dataset; dataset = load_dataset('bookcorpus')"
Popular Hugging Face Datasets
Some of the most widely used datasets on Hugging Face include the following; a loading example appears after the list:
- IMDB: A collection of 50,000 movie reviews labeled as positive or negative, commonly used for sentiment analysis tasks.
- Common Voice: A multilingual speech dataset featuring over 9,000 hours of recordings, essential for speech recognition development.
- Amazon Polarity: Contains over 3 million Amazon product reviews labeled as positive or negative, used for sentiment analysis.
- Emotion: Texts labeled with six primary emotions (anger, fear, joy, love, sadness, surprise) for emotion classification.
- Yahoo Answers Topics: Over 1 million questions categorized by topic, useful for text classification tasks.
- Hate Speech18: Specialized dataset for detecting hate speech and offensive language in online forum posts.
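All of these can be loaded with the same one-liner pattern used throughout this guide. For example, assuming the datasets library is installed as above, the following loads IMDB and prints the first training review:
python -c "from datasets import load_dataset; dataset = load_dataset('imdb'); print(dataset['train'][0])"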
Code and Technical Data
GitHub Code
- Description: Code from public GitHub repositories
- Access: https://huggingface.co/datasets/codeparrot/github-code
- Usage:
python -c "from datasets import load_dataset; dataset = load_dataset('codeparrot/github-code')"
The Stack
- Description: 3TB dataset with 6 million GitHub repositories
- Access: https://huggingface.co/datasets/bigcode/the-stack
- Usage:
# Load specific language subset (e.g., Python)
python -c "from datasets import load_dataset; dataset = load_dataset('bigcode/the-stack', data_dir='data/python')"
Multilingual Data
OSCAR
- Description: Large multilingual corpus obtained from Common Crawl
- Access: https://huggingface.co/datasets/oscar
- Usage:
# Load French subset
python -c "from datasets import load_dataset; dataset = load_dataset('oscar', 'unshuffled_deduplicated_fr')"
mC4
- Description: Multilingual version of C4 with 101 languages
- Access: https://huggingface.co/datasets/mc4
- Usage:
# Load Spanish subset
python -c "from datasets import load_dataset; dataset = load_dataset('mc4', 'es')"
Specialized Datasets
PubMed Abstracts
- Description: Medical and biomedical research papers
- Access: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
- Best for: Medical AI models (a parsing sketch follows)
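Unlike the Hugging Face datasets above, the PubMed baseline is distributed as gzipped MEDLINE XML files, so a small amount of parsing is needed. The sketch below is a minimal, hedged example that pulls abstract text out of one already-downloaded file; the filename is a placeholder and only AbstractText elements are handled.
# Extract abstract text from a downloaded PubMed baseline file (filename is a placeholder)
import gzip
import xml.etree.ElementTree as ET

def extract_abstracts(path):
    abstracts = []
    with gzip.open(path, 'rb') as f:
        tree = ET.parse(f)
        for node in tree.iter('AbstractText'):
            if node.text:
                abstracts.append(node.text)
    return abstracts

abstracts = extract_abstracts('pubmed_baseline_sample.xml.gz')
print(f"Extracted {len(abstracts)} abstracts")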
ArXiv Dataset
- Description: Scientific papers across multiple disciplines
- Access: https://huggingface.co/datasets/arxiv_dataset
- Usage:
python -c "from datasets import load_dataset; dataset = load_dataset('arxiv_dataset')"
Creating Custom Training Datasets
Web Scraping
Web scraping is an effective method for creating custom datasets. See the appendix for a comprehensive web scraper implementation that respects website permissions.
Local Text Sources
eBooks
- Project Gutenberg offers over 60,000 free books in the public domain
- Access: https://www.gutenberg.org/
- Download books in plain text format (see the download sketch below)
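Gutenberg's plain-text files include the project's own header and license footer, which you will usually want to strip before adding them to a training corpus. A minimal sketch, assuming the usual "*** START OF" / "*** END OF" boilerplate markers; the URL is only an example, so check each book's page for its actual plain-text link.
# Download one Project Gutenberg book and strip its boilerplate header/footer
import requests

def fetch_gutenberg_text(url):
    response = requests.get(url, headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'})
    response.raise_for_status()
    text = response.text
    start = text.find('*** START OF')
    end = text.find('*** END OF')
    if start != -1 and end != -1:
        start = text.find('\n', start) + 1  # skip past the START marker line
        text = text[start:end]
    return text.strip()

book = fetch_gutenberg_text('https://www.gutenberg.org/cache/epub/1342/pg1342.txt')
print(book[:200])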
PDF Conversion
PDF documents can be converted to text for inclusion in training datasets. See the appendix for a PDF conversion implementation.
Data Augmentation Techniques
Text substitutions
Synonym replacement and other text substitution techniques can expand your dataset with variations. See the appendix for implementation.
Data Cleaning and Processing
Text Cleaning Functions
Effective text cleaning is crucial for model performance. The appendix contains implementations for:
- Basic text cleaning
- Removing lines that are too short or too long
- Filtering non-language content
Deduplication
Removing duplicate content prevents models from giving undue weight to repeated information. The appendix includes a paragraph-level deduplication implementation.
Quality Filtering
Quality filtering removes low-value content that could degrade model performance. See the appendix for implementation details.
Creating Train/Validation/Test Splits
Proper dataset splits are essential for effective model evaluation. The appendix includes a function for creating dataset splits with configurable ratios.
The Challenges of Web Crawl Datasets
When planning your language model training, it's important to understand both the practical and ethical challenges of working with large-scale web datasets that are often described as "freely available."
The Infrastructure Barrier
Common web crawl datasets present several significant challenges that aren't immediately obvious:
- Enormous file sizes: Individual files often measure in multiple gigabytes, with complete datasets reaching petabyte scale
- Raw, unprocessed formats: Data typically comes in specialized formats like WARC (Web ARChive) that require additional processing (see the reading sketch after this list)
- Significant storage requirements: Even working with a small subset can quickly consume terabytes of storage
- Bandwidth limitations: Downloading the data becomes a major hurdle, even with good internet connections
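To make the WARC point concrete: even reading a single crawl file requires a dedicated parser. The sketch below uses the third-party warcio package (assumed to be installed) to count HTML responses in one already-downloaded WARC file; the filename is a placeholder.
# Count HTML response records in a single WARC file (filename is a placeholder)
from warcio.archiveiterator import ArchiveIterator

def count_html_responses(warc_path):
    count = 0
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            content_type = record.http_headers.get_header('Content-Type') if record.http_headers else None
            if content_type and 'text/html' in content_type:
                count += 1
    return count

print(count_html_responses('example.warc.gz'))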
The Real-World Cost of "Free" Data
While these datasets are technically free to access, processing them requires:
- Substantial computing resources: Distributed computing clusters or high-end machines are often necessary
- Cloud computing costs: Storage, bandwidth, and compute time on cloud platforms can quickly add up to thousands of dollars
- Technical expertise: Working with these datasets requires familiarity with big data tools, distributed systems, and specialized file formats
- Time investment: The learning curve and processing time represent a significant hidden cost
Who Can Realistically Use These Datasets?
The infrastructure requirements effectively limit access to:
- Organizations with substantial computing resources
- Academic institutions with research computing facilities
- Companies with significant technology budgets
- Individuals with specialized technical knowledge and access to computing resources
This creates an accessibility gap that contradicts the "open for anyone" messaging often associated with these datasets.
Problematic Content in Common Crawl
Beyond the technical challenges, Common Crawl and similar web crawl datasets present significant ethical concerns:
Deliberate Lack of Curation
Common Crawl maintains a minimal curation approach to its data collection:
- The organization intentionally does not remove hate speech, pornography, violent content, or other problematic material from its dataset
- Responsibility for filtering is shifted to downstream users (like AI builders)
- This philosophical stance aims to provide raw web data for various research purposes, but creates significant challenges for AI development
Types of Problematic Content Present
Research has identified several categories of concerning content in Common Crawl data:
- Hate Speech and Toxic Content: Studies have found significant amounts of hate speech that can perpetuate harmful stereotypes and biases
- Sexually Explicit Material: Despite filtering efforts, sexually explicit content remains prevalent
- Violent Content: Descriptions of violence and disturbing content are present throughout the dataset
- Racist and Discriminatory Content: Content that promotes racism, xenophobia, and other forms of discrimination is present
- Misinformation: The dataset contains various forms of misinformation and factually incorrect information
Representation and Bias Issues
Common Crawl data suffers from significant representation issues:
- English-Language Dominance: English is the primary language for 46% of documents
- Western Cultural Bias: The dataset overrepresents Western perspectives
- Digital Divide Reflection: The crawling process is less likely to include domains serving digitally marginalized communities
- Incomplete Web Coverage: Despite claims that Common Crawl contains the "entire web," it represents only a small fraction of existing web content
Growing Access Restrictions
The use of Common Crawl for AI training is facing increasing challenges:
- Content Creator Pushback: More platforms and publishers are blocking or charging for access to their data
- Legal Challenges: The dataset includes copyrighted work distributed under fair use claims, a position that is increasingly being challenged in court
- API Restrictions: Platforms are increasingly limiting unpaid API access to their data
Alternative Approaches
Rather than attempting to process entire web crawl datasets with their inherent problems, consider these more practical and ethical approaches:
- Work with curated subsets: Many organizations offer filtered, topic-specific extractions from larger datasets
- Use pre-processed versions: Look for already cleaned and filtered derivatives of web crawls
- Supplement with diverse sources: Combine web crawl data with more carefully curated datasets that address representation gaps
- Implement robust filtering: Develop comprehensive filtering pipelines that address the full range of problematic content
- Be transparent: Document your filtering methods and their limitations
- Pool resources: Consider collaborative approaches where processing costs and filtering expertise can be shared
- Start small: Begin with manageable portions to validate your approach before scaling up (see the streaming sketch after this list)
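Starting small can be as simple as streaming a dataset instead of downloading it: the datasets library can iterate over records on the fly without pulling terabytes to disk. A minimal sketch using the C4 identifier from earlier in this guide:
# Stream a small sample of C4 instead of downloading the full dataset
from datasets import load_dataset

dataset = load_dataset('c4', 'en', split='train', streaming=True)
sample = list(dataset.take(100))  # only the first 100 records are fetched
print(sample[0]['text'][:200])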
Commercial Data Sources
For commercial projects, consider these sources:
- Common Crawl: Petabytes of web crawl data (Access: https://commoncrawl.org/)
- Google Books Ngrams: Statistical information about word usage (a parsing sketch follows this list)
- Licensed Content: academic journals, newspaper archives, specialized industry texts
- Data Providers: LightTag, Scale AI, Appen
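Of these, Google Books Ngrams is the easiest to experiment with directly: the exports are large tab-separated files you can download and scan locally. The sketch below is a hedged example that tallies yearly counts for one n-gram; the filename is a placeholder, and the column layout assumed here is ngram, year, match count, volume count.
# Tally yearly counts for one n-gram from a downloaded Ngrams export (filename is a placeholder)
import gzip
from collections import defaultdict

def yearly_counts(path, target):
    counts = defaultdict(int)
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 3 or fields[0] != target:
                continue
            counts[int(fields[1])] += int(fields[2])
    return dict(counts)

print(yearly_counts('googlebooks-eng-1gram-sample.gz', 'language'))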
Ethical and Legal Considerations for AI Training Data
When developing AI systems, it's crucial to be aware of the evolving legal and ethical landscape surrounding training data. The AI industry has seen significant legal challenges, with publishers, authors, and media organizations filing lawsuits alleging copyright infringement and unauthorized use of their content for AI training.
Legal Precedents and Challenges
Recent court cases have established precedents that could significantly impact AI development:
- Several publishing companies have filed lawsuits against AI companies alleging systematic copyright infringement
- Authors have sued over book content being used without permission for AI training
- Programmers have brought class action lawsuits over AI coding assistants trained on billions of lines of code
- Courts have rejected some "fair use" claims for training data
- News publishers have initiated legal action, claiming articles were used for training without authorization
These challenges are pushing the industry toward more transparent and ethical practices, with some companies now offering licensing deals with publishers and emphasizing copyright controls.
Responsible Training Data Practices
To protect yourself and your organization while creating ethical AI systems, consider these best practices:
Legal Protection
- Document all data sources and maintain detailed records
- Respect website crawling instructions and honor robots.txt (see the check after this list)
- Consider licensing for commercial applications
- For third-party models, look for providers that offer legal protection
- Clearly define intended uses and limitations in your terms of service
- Consult specialized legal counsel as the field is complex and rapidly evolving
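Honoring robots.txt is straightforward to automate before any scraping run. A minimal sketch using Python's standard-library robot parser; the URL and user-agent string are examples:
# Check robots.txt before fetching a page
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent='Custom Dataset Builder Bot'):
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

print(allowed_to_fetch('https://www.gutenberg.org/ebooks/1342'))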
Mitigating Harmful Content
- Implement comprehensive filtering to address all types of problematic content, not just explicit material
- Use more sophisticated filtering techniques rather than simplistic keyword approaches (a classifier-based sketch follows this list)
- Be transparent about your filtering methods and their limitations
- Regularly audit your system's outputs for bias, toxicity, and other harmful content
- Create channels for reporting problematic outputs
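One more sophisticated option is scoring text with a learned toxicity classifier rather than keyword lists. The sketch below assumes the third-party detoxify package is installed; treat it as an illustration and substitute whichever classifier you have evaluated for your own use case.
# Filter out texts that a toxicity classifier scores above a threshold
from detoxify import Detoxify

model = Detoxify('original')

def filter_toxic(texts, threshold=0.5):
    kept = []
    for text in texts:
        scores = model.predict(text)
        if scores['toxicity'] < threshold:
            kept.append(text)
    return kept

print(filter_toxic(["Have a wonderful day!", "You are worthless and everyone hates you"]))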
Addressing Representation Issues
- Supplement web crawl data with more diverse and representative datasets
- Consider the biases inherent in your training data and take steps to mitigate them
- Be mindful of language and cultural representation in your datasets (see the language audit sketch after this list)
- Design systems that can trace outputs back to source material
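A concrete first audit is simply measuring the language mix of a corpus sample before training. A minimal sketch using langdetect, the same library used by the cleaning functions in the appendix:
# Measure the language distribution of a corpus sample
from collections import Counter
from langdetect import detect

def language_distribution(texts):
    counts = Counter()
    for text in texts:
        try:
            counts[detect(text)] += 1
        except Exception:
            counts['unknown'] += 1
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}

print(language_distribution(["Hello world", "Bonjour tout le monde", "Hola, ¿cómo estás?"]))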
By implementing these practices, you can develop AI systems that are not only effective but also ethically and legally sound. The future of AI depends on responsible development that respects copyright, prevents harmful content generation, and ensures fair representation across different languages and cultures.
Appendix: Code Implementation Samples
# Web Scraping Implementation
import requests
from bs4 import BeautifulSoup
import time
import random
import markdown
import re
import json
import os
def check_llms_txt(base_url):
# Check for llms.txt and parse its contents if available
try:
llms_url = f"{base_url.rstrip('/')}/llms.txt"
response = requests.get(
llms_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
if response.status_code == 200:
print(f"Found llms.txt at {llms_url}")
return response.text
return None
except Exception as e:
print(f"Error checking for llms.txt: {e}")
return None
def parse_llms_txt(content):
# Parse llms.txt content to extract relevant directives
directives = {
'content_endpoints': [],
'content_selectors': [],
'markdown_sources': [],
'rate_limits': {},
'allowed_training': False,
'attribution': None
}
# Extract markdown links
md_links = re.findall(r'\[(.*?)\]\((.*?\.md)\)', content)
for name, url in md_links:
directives['markdown_sources'].append(url)
# Extract content endpoints
if "ContentEndpoint:" in content:
endpoints = re.findall(r'ContentEndpoint:\s*(\S+)', content)
directives['content_endpoints'].extend(endpoints)
# Extract content selectors
if "ContentSelector:" in content:
selectors = re.findall(r'ContentSelector:\s*(\S+)', content)
directives['content_selectors'].extend(selectors)
# Check training permissions
if re.search(r'AllowAITraining:\s*true', content, re.IGNORECASE):
directives['allowed_training'] = True
# Extract rate limits
rate_limit_match = re.search(r'Rate limit:\s*(\d+)\s+requests\s+per\s+(\w+)', content, re.IGNORECASE)
if rate_limit_match:
amount, period = rate_limit_match.groups()
directives['rate_limits'] = {'amount': int(amount), 'period': period}
# Extract attribution requirements
attribution_match = re.search(r'Attribution.*?format:?\s*"([^"]+)"', content)
if attribution_match:
directives['attribution'] = attribution_match.group(1)
return directives
def scrape_website(urls, output_file, delay_range=(1, 3)):
# Scrape text content from a list of URLs with llms.txt support
with open(output_file, 'w', encoding='utf-8') as f:
for base_url in urls:
try:
# Check for llms.txt first
llms_content = check_llms_txt(base_url)
llms_directives = parse_llms_txt(llms_content) if llms_content else None
# Apply rate limiting from llms.txt if available
if llms_directives and 'rate_limits' in llms_directives and llms_directives['rate_limits']:
# Here you would implement dynamic rate limiting based on the directives
# For this example, we'll just use our default delay
print(f"Respecting rate limit: {llms_directives['rate_limits']['amount']} requests per {llms_directives['rate_limits']['period']}")
# Be a good web citizen with delays between requests
time.sleep(random.uniform(*delay_range))
# If llms.txt specifies markdown sources, prioritize those
if llms_directives and llms_directives['markdown_sources']:
for md_url in llms_directives['markdown_sources']:
# Ensure the URL is absolute
if not md_url.startswith('http'):
md_url = f"{base_url.rstrip('/')}/{md_url.lstrip('/')}"
print(f"Fetching markdown content from {md_url}")
md_response = requests.get(
md_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
if md_response.status_code == 200:
# For Markdown files, we can directly save the content
md_content = md_response.text
f.write(f"# Content from {md_url}\n\n")
f.write(md_content)
f.write("\n\n")
print(f"Saved markdown content from {md_url}")
# If llms.txt specifies content endpoints (e.g., static JSON), use those
if llms_directives and llms_directives['content_endpoints']:
for endpoint in llms_directives['content_endpoints']:
# Ensure the URL is absolute
if not endpoint.startswith('http'):
endpoint_url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
else:
endpoint_url = endpoint
print(f"Fetching content from endpoint {endpoint_url}")
endpoint_response = requests.get(
endpoint_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
if endpoint_response.status_code == 200:
try:
# Try to parse as JSON
json_data = endpoint_response.json()
# Extract text content from JSON (depends on structure)
if 'content' in json_data:
f.write(f"# Content from {endpoint_url}\n\n")
f.write(json_data['content'])
f.write("\n\n")
print(f"Saved JSON content from {endpoint_url}")
except ValueError:
# Not JSON, treat as regular text
f.write(f"# Content from {endpoint_url}\n\n")
f.write(endpoint_response.text)
f.write("\n\n")
print(f"Saved text content from {endpoint_url}")
# Fall back to traditional scraping if needed
print(f"Scraping {base_url}")
response = requests.get(
base_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
response.raise_for_status()
# Parse content
soup = BeautifulSoup(response.text, 'html.parser')
# If llms.txt specifies content selectors, use those
if llms_directives and llms_directives['content_selectors']:
content_text = ""
for selector in llms_directives['content_selectors']:
for element in soup.select(selector):
content_text += element.get_text(separator='\n') + "\n\n"
if content_text:
f.write(f"# Content from {base_url} using selectors\n\n")
f.write(content_text)
f.write("\n\n")
print(f"Scraped {base_url} using content selectors")
continue # Skip traditional extraction if we used selectors
# Traditional extraction if no selectors or they didn't yield content
# Remove scripts, styles, and other non-content elements
for element in soup(['script', 'style', 'header', 'footer', 'nav']):
element.decompose()
# Extract text
text = soup.get_text(separator='\n')
# Clean text (remove extra whitespace, etc.)
lines = [line.strip() for line in text.split('\n')]
text = '\n'.join(line for line in lines if line)
# Add attribution if required by llms.txt
if llms_directives and llms_directives['attribution']:
text += f"\n\nSource: {llms_directives['attribution']}"
# Write to file
f.write(f"# Content from {base_url}\n\n")
f.write(text)
f.write("\n\n")
print(f"Scraped {base_url}")
except Exception as e:
print(f"Error scraping {base_url}: {e}")
# PDF Conversion
import os
import glob
import PyPDF2
def pdf_to_text(pdf_dir, output_dir):
# Convert PDF files to text files.
os.makedirs(output_dir, exist_ok=True)
for pdf_path in glob.glob(os.path.join(pdf_dir, "*.pdf")):
try:
pdf_name = os.path.basename(pdf_path).replace('.pdf', '')
output_path = os.path.join(output_dir, f"{pdf_name}.txt")
# Open the PDF
with open(pdf_path, 'rb') as pdf_file:
reader = PyPDF2.PdfReader(pdf_file)
# Extract text from each page
text = ""
for page_num in range(len(reader.pages)):
text += reader.pages[page_num].extract_text()
# Write to text file
with open(output_path, 'w', encoding='utf-8') as text_file:
text_file.write(text)
print(f"Converted {pdf_path} to {output_path}")
except Exception as e:
print(f"Error converting {pdf_path}: {e}")
#Text Substitutions
import random
import nltk
from nltk.corpus import wordnet
# Download wordnet
nltk.download('wordnet')
def synonym_replacement(text, n=1):
# Replace n random words with their synonyms.
words = text.split()
new_words = words.copy()
# Find random words with synonyms
for _ in range(min(n, len(words))):
random_idx = random.randint(0, len(words) - 1)
random_word = words[random_idx]
# Get synonyms
synonyms = []
for syn in wordnet.synsets(random_word):
for lemma in syn.lemmas():
synonyms.append(lemma.name())
# Replace if synonyms exist
if len(synonyms) > 0:
synonym = random.choice(synonyms)
new_words[random_idx] = synonym
return ' '.join(new_words)
#Text Cleaning Functions
import re
import html
import unicodedata
def clean_text(text):
# Basic text cleaning function.
# Normalize unicode characters
text = unicodedata.normalize('NFKC', text)
# Decode HTML entities
text = html.unescape(text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove email addresses
text = re.sub(r'\S*@\S*\s?', '', text)
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
def remove_long_short_lines(text, min_length=3, max_length=10000):
# Filter out too short or too long lines.
lines = text.split('\n')
filtered_lines = [
line for line in lines
if min_length <= len(line.split()) <= max_length
]
return '\n'.join(filtered_lines)
def remove_non_language_lines(text, language_threshold=0.7):
# Remove lines that are likely not natural language (requires langdetect)
from langdetect import detect_langs
lines = text.split('\n')
filtered_lines = []
for line in lines:
if not line.strip():
filtered_lines.append(line)
continue
try:
# Check if the dominant language probability is high enough
langs = detect_langs(line)
if langs and langs[0].prob >= language_threshold:
filtered_lines.append(line)
except Exception:
# If detection fails, keep the line
filtered_lines.append(line)
return '\n'.join(filtered_lines)
#Deduplication
import hashlib
def deduplicate_paragraphs(texts):
#Remove duplicate paragraphs from a list of texts
seen_hashes = set()
unique_texts = []
for text in texts:
# Split into paragraphs
paragraphs = text.split('\n\n')
unique_paragraphs = []
for paragraph in paragraphs:
if not paragraph.strip():
continue
# Create a hash of the normalized paragraph
paragraph_norm = ' '.join(paragraph.lower().split())
paragraph_hash = hashlib.md5(paragraph_norm.encode()).hexdigest()
# Only keep unique paragraphs
if paragraph_hash not in seen_hashes:
seen_hashes.add(paragraph_hash)
unique_paragraphs.append(paragraph)
# Rejoin paragraphs
if unique_paragraphs:
unique_texts.append('\n\n'.join(unique_paragraphs))
return unique_texts
#Quality Filtering
def filter_by_quality(texts, min_words=5, min_chars=20, max_repeated_chars=4):
#Filter texts based on quality heuristics
filtered_texts = []
for text in texts:
# Check minimum words
if len(text.split()) < min_words:
continue
# Check minimum characters
if len(text) < min_chars:
continue
# Check for excessive repeated characters
if re.search(r'(.)\1{%d,}' % max_repeated_chars, text):
continue
# Additional quality heuristics can be added
filtered_texts.append(text)
return filtered_texts
#Creating Train/Validation/Test Splits
import numpy as np
def create_dataset_splits(files, train_ratio=0.9, val_ratio=0.05, test_ratio=0.05, seed=42):
#Split files into train, validation, and test set
# Ensure ratios sum to 1
total_ratio = train_ratio + val_ratio + test_ratio
train_ratio /= total_ratio
val_ratio /= total_ratio
test_ratio /= total_ratio
# Shuffle files
np.random.seed(seed)
np.random.shuffle(files)
# Calculate split indices
n_files = len(files)
train_end = int(n_files * train_ratio)
val_end = train_end + int(n_files * val_ratio)
# Split files
train_files = files[:train_end]
val_files = files[train_end:val_end]
test_files = files[val_end:]
return {
'train': train_files,
'validation': val_files,
'test': test_files
}
Thank you for reading