Language Model Training Data: Sources and Preparation Guide

Author: Tom Cranstoun
High-quality training data is fundamental to successful language model development. This comprehensive guide covers how to acquire, prepare, and manage training data for your language models, with emphasis on leveraging the vast resources available in the Hugging Face ecosystem.

The Hugging Face Revolution: 1 Million Models and Counting

The Hugging Face platform now hosts over 1 million AI models. This milestone reflects not just the platform's success but the explosive growth in AI development worldwide.

Beyond models, Hugging Face maintains an impressive collection of over 75,000 datasets spanning more than 100 languages. These datasets support a wide range of tasks across natural language processing, computer vision, and audio processing, making it an invaluable resource for training and fine-tuning AI models.
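With that many datasets on the Hub, programmatic discovery is often easier than browsing. The following is a minimal sketch, assuming a recent version of the huggingface_hub package (pip install huggingface_hub); the search term is purely illustrative.

# Browse the Hub programmatically; the search term is illustrative
from huggingface_hub import list_datasets

for ds in list_datasets(search="wikipedia", limit=5):
    print(ds.id)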

Public Datasets for Language Model Training

General Text Corpora

The Pile

# Using the Hugging Face datasets library

pip install datasets

python -c "from datasets import load_dataset; dataset = load_dataset('the_pile')"

C4 (Colossal Clean Crawled Corpus)

# Load English subset

python -c "from datasets import load_dataset; dataset = load_dataset('c4', 'en')"

WikiText

# Load WikiText-103

python -c "from datasets import load_dataset; dataset = load_dataset('wikitext', 'wikitext-103-v1')"

BookCorpus

python -c "from datasets import load_dataset; dataset = load_dataset('bookcorpus')"

Beyond these general corpora, the following sections cover other widely used Hugging Face datasets for code, multilingual text, and specialized domains.

Code and Technical Data

GitHub Code

python -c "from datasets import load_dataset; dataset = load_dataset('codeparrot/github-code')"

The Stack

# Load a specific language subset (e.g., Python)

python -c "from datasets import load_dataset; dataset = load_dataset('bigcode/the-stack', data_dir='data/python')"

Multilingual Data

OSCAR

# Load French subset

python -c "from datasets import load_dataset; dataset = load_dataset('oscar', 'unshuffled_deduplicated_fr')"

mC4

# Load Spanish subset

python -c "from datasets import load_dataset; dataset = load_dataset('mc4', 'es')"

Specialized Datasets

PubMed Abstracts

ArXiv Dataset

python -c "from datasets import load_dataset; dataset = load_dataset('arxiv_dataset')"

Creating Custom Training Datasets

Web Scraping

Web scraping is an effective method for creating custom datasets. See the appendix for a comprehensive web scraper implementation that respects website permissions.
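Alongside llms.txt, a scraper should also consult a site's robots.txt before fetching pages. The following is a minimal sketch using Python's standard library; the user agent string and URLs are illustrative.

# Check robots.txt with the standard library before scraping;
# the user agent and URLs are illustrative
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if rp.can_fetch("Custom Dataset Builder Bot", "https://example.com/some-page"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")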

Local Text Sources

eBooks
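eBooks in EPUB format can also be converted to plain text. The following is a minimal sketch, assuming the ebooklib and beautifulsoup4 packages are installed; epub_to_text is a hypothetical helper and is not part of the appendix code.

# Minimal EPUB-to-text sketch; epub_to_text is a hypothetical helper
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def epub_to_text(epub_path):
    # Read the EPUB and extract visible text from each document item
    book = epub.read_epub(epub_path)
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), 'html.parser')
        parts.append(soup.get_text(separator='\n'))
    return '\n\n'.join(parts)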

PDF Conversion

PDF documents can be converted to text for inclusion in training datasets. See the appendix for a PDF conversion implementation.

Data Augmentation Techniques

Text substitutions

Synonym replacement and other text substitution techniques can expand your dataset with variations. See the appendix for implementation.

Data Cleaning and Processing

Text Cleaning Functions

Effective text cleaning is crucial for model performance. The appendix contains text cleaning implementations covering Unicode normalization, HTML unescaping, URL and email removal, and whitespace cleanup, along with line-length and language filters.

Deduplication

Removing duplicate content prevents models from giving undue weight to repeated information. The appendix includes a paragraph-level deduplication implementation.

Quality Filtering

Quality filtering removes low-value content that could degrade model performance. See the appendix for implementation details.

Creating Train/Validation/Test Splits

Proper dataset splits are essential for effective model evaluation. The appendix includes a function for creating dataset splits with configurable ratios.
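As a usage sketch, assuming a directory of prepared .txt files (the path is illustrative), the appendix helper can be called like this.

# Usage sketch for the create_dataset_splits helper defined in the appendix
import glob

files = glob.glob('prepared_texts/*.txt')  # illustrative path
splits = create_dataset_splits(files, train_ratio=0.9, val_ratio=0.05, test_ratio=0.05)
print(len(splits['train']), len(splits['validation']), len(splits['test']))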

The Challenges of Web Crawl Datasets

When planning your language model training, it's important to understand both the practical and ethical challenges of working with large-scale web datasets that are often described as "freely available."

The Infrastructure Barrier

Common web crawl datasets present several significant challenges that aren't immediately obvious.

The Real-World Cost of "Free" Data

While these datasets are technically free to access, processing them requires significant storage capacity, bandwidth, compute, and engineering time.

Who Can Realistically Use These Datasets?

The infrastructure requirements effectively limit access to organizations with substantial compute, storage, and engineering resources.

This creates an accessibility gap that contradicts the "open for anyone" messaging often associated with these datasets.

Problematic Content in Common Crawl

Beyond the technical challenges, Common Crawl and similar web crawl datasets present significant ethical concerns:

Deliberate Lack of Curation

Common Crawl takes a deliberately minimal approach to curating the data it collects.

Types of Problematic Content Present

Research has identified several categories of concerning content in Common Crawl data.

Representation and Bias Issues

Common Crawl data suffers from significant representation issues, with some languages, regions, and communities far better represented than others.

Growing Access Restrictions

The use of Common Crawl for AI training faces increasing restrictions as more publishers block crawlers and limit how their content may be used.

Alternative Approaches

Rather than attempting to process entire web crawl datasets with their inherent problems, consider more practical and ethical approaches, such as the commercial sources described below.

Commercial Data Sources

For commercial projects, consider licensed data providers and direct licensing agreements with publishers and other content owners.

When developing AI systems, it's crucial to be aware of the evolving legal and ethical landscape surrounding training data. The AI industry has seen significant legal challenges, with publishers, authors, and media organizations filing lawsuits alleging copyright infringement and unauthorized use of their content for AI training.

Recent court cases are establishing precedents that could significantly impact AI development.

These challenges are pushing the industry toward more transparent and ethical practices, with some companies now offering licensing deals with publishers and emphasizing copyright controls.

Responsible Training Data Practices

To protect yourself and your organization while creating ethical AI systems, consider the practices outlined below.

Mitigating Harmful Content

Apply the cleaning, deduplication, and quality-filtering steps described earlier so that toxic, explicit, or low-value material is screened out before it reaches your training corpus.

Addressing Representation Issues

Balance your corpus across languages, regions, and domains, for example by drawing on multilingual resources such as OSCAR and mC4 introduced earlier.

By implementing these practices, you can develop AI systems that are not only effective but also ethically and legally sound. The future of AI depends on responsible development that respects copyright, prevents harmful content generation, and ensures fair representation across different languages and cultures.

Appendix: Code Implementation Samples

# Web Scraping Implementation
import requests
from bs4 import BeautifulSoup
import time
import random
import re

def check_llms_txt(base_url):
    # Check for llms.txt and parse its contents if available
    try:
        llms_url = f"{base_url.rstrip('/')}/llms.txt"
        response = requests.get(
            llms_url,
            headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
        )
    

        if response.status_code == 200:
            print(f"Found llms.txt at {llms_url}")
            return response.text
        return None
    except Exception as e:
        print(f"Error checking for llms.txt: {e}")
        return None

def parse_llms_txt(content):
    # Parse llms.txt content to extract relevant directives
    directives = {
        'content_endpoints': [],
        'content_selectors': [],
        'markdown_sources': [],
        'rate_limits': {},
        'allowed_training': False,
        'attribution': None
    }

    # Extract markdown links

    md_links = re.findall(r'\[(.*?)\]\((.*?\.md)\)', content)
    for name, url in md_links:
        directives['markdown_sources'].append(url)
  
    # Extract content endpoints

    if "ContentEndpoint:" in content:
        endpoints = re.findall(r'ContentEndpoint:\s*(\S+)', content)
        directives['content_endpoints'].extend(endpoints)
    

    # Extract content selectors

    if "ContentSelector:" in content:
        selectors = re.findall(r'ContentSelector:\s*(\S+)', content)
        directives['content_selectors'].extend(selectors)
    

    # Check training permissions

    if re.search(r'AllowAITraining:\s*true', content, re.IGNORECASE):
        directives['allowed_training'] = True
  

    # Extract rate limits

    rate_limit_match = re.search(r'Rate limit:\s*(\d+)\s+requests\s+per\s+(\w+)', content, re.IGNORECASE)
    if rate_limit_match:
        amount, period = rate_limit_match.groups()
        directives['rate_limits'] = {'amount': int(amount), 'period': period}
    

    # Extract attribution requirements

    attribution_match = re.search(r'Attribution.*?format:?\s*"([^"]+)"', content)
    if attribution_match:
        directives['attribution'] = attribution_match.group(1)
    return directives

def scrape_website(urls, output_file, delay_range=(1, 3)):
    # Scrape text content from a list of URLs with llms.txt support

    with open(output_file, 'w', encoding='utf-8') as f:

        for base_url in urls:

            try:

                # Check for llms.txt first
                llms_content = check_llms_txt(base_url)
                llms_directives = parse_llms_txt(llms_content) if llms_content else None

                # Skip sites whose llms.txt does not explicitly allow AI training
                if llms_directives and not llms_directives['allowed_training']:
                    print(f"Skipping {base_url}: llms.txt does not permit AI training")
                    continue

                # Apply rate limiting from llms.txt if available
                if llms_directives and llms_directives['rate_limits']:
                    # A production scraper would honor these limits dynamically;
                    # this example simply keeps its default delay
                    print(f"Respecting rate limit: {llms_directives['rate_limits']['amount']} requests per {llms_directives['rate_limits']['period']}")
                

                # Be a good web citizen with delays between requests
                time.sleep(random.uniform(*delay_range))
                

                # If llms.txt specifies markdown sources, prioritize those

                if llms_directives and llms_directives['markdown_sources']:
                    for md_url in llms_directives['markdown_sources']:
                        # Ensure the URL is absolute
                        if not md_url.startswith('http'):
                            md_url = f"{base_url.rstrip('/')}/{md_url.lstrip('/')}"
                  

                        print(f"Fetching markdown content from {md_url}")
                        md_response = requests.get(
                            md_url,
                            headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
                        )

                        if md_response.status_code == 200:

                           # For Markdown files, we can directly save the content

                            md_content = md_response.text
                            f.write(f"# Content from {md_url}\n\n")
                            f.write(md_content)
                            f.write("\n\n")
                            print(f"Saved markdown content from {md_url}")

                # If llms.txt specifies content endpoints (e.g., static JSON), use those

                if llms_directives and llms_directives['content_endpoints']:
                    for endpoint in llms_directives['content_endpoints']:
                        # Ensure the URL is absolute
                        if not endpoint.startswith('http'):
                            endpoint_url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
                        else:
                            endpoint_url = endpoint                       

                        print(f"Fetching content from endpoint {endpoint_url}")
                        endpoint_response = requests.get(
                            endpoint_url,
                            headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
                        )

                        
                        if endpoint_response.status_code == 200:
                            try:
                                # Try to parse as JSON
                                json_data = endpoint_response.json()
                                # Extract text content from JSON (depends on structure)
                                if 'content' in json_data:
                                    f.write(f"# Content from {endpoint_url}\n\n")
                                    f.write(json_data['content'])
                                    f.write("\n\n")
                                    print(f"Saved JSON content from {endpoint_url}")
                            except ValueError:
                                # Not JSON, treat as regular text
                                f.write(f"# Content from {endpoint_url}\n\n")
                                f.write(endpoint_response.text)
                                f.write("\n\n")
                                print(f"Saved text content from {endpoint_url}")
                # Fall back to traditional scraping only if llms.txt provided no content sources
                if llms_directives and (llms_directives['markdown_sources'] or llms_directives['content_endpoints']):
                    continue

                print(f"Scraping {base_url}")
                response = requests.get(
                    base_url, 
                    headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
                )

                response.raise_for_status()

                
                # Parse content

                soup = BeautifulSoup(response.text, 'html.parser')               

                # If llms.txt specifies content selectors, use those
                if llms_directives and llms_directives['content_selectors']:
                    content_text = ""
                    for selector in llms_directives['content_selectors']:
                        for element in soup.select(selector):
                            content_text += element.get_text(separator='\n') + "\n\n"
                    if content_text:

                        f.write(f"# Content from {base_url} using selectors\n\n")
                        f.write(content_text)
                        f.write("\n\n")
                        print(f"Scraped {base_url} using content selectors")
                        continue  # Skip traditional extraction if we used selectors

                

                # Traditional extraction if no selectors or they didn't yield content
                # Remove scripts, styles, and other non-content elements
                for element in soup(['script', 'style', 'header', 'footer', 'nav']):
                    element.decompose()
                # Extract text
                text = soup.get_text(separator='\n')
                # Clean text (remove extra whitespace, etc.)
                lines = [line.strip() for line in text.split('\n')]
                text = '\n'.join(line for line in lines if line)
                

                # Add attribution if required by llms.txt
                if llms_directives and llms_directives['attribution']:
                    text += f"\n\nSource: {llms_directives['attribution']}"
                

                # Write to file
                f.write(f"# Content from {base_url}\n\n")
                f.write(text)
                f.write("\n\n")
                print(f"Scraped {base_url}")
            except Exception as e:
                print(f"Error scraping {base_url}: {e}")
# PDF Conversion
import os
import glob
import PyPDF2
def pdf_to_text(pdf_dir, output_dir):
    # Convert PDF files to text files.
    os.makedirs(output_dir, exist_ok=True)
    

    for pdf_path in glob.glob(os.path.join(pdf_dir, "*.pdf")):
        try:
            pdf_name = os.path.basename(pdf_path).replace('.pdf', '')
            output_path = os.path.join(output_dir, f"{pdf_name}.txt")

            # Open the PDF

            with open(pdf_path, 'rb') as pdf_file:
                reader = PyPDF2.PdfReader(pdf_file)

                

                # Extract text from each page (extract_text may return an
                # empty result for image-only or scanned pages)
                text = ""
                for page in reader.pages:
                    text += (page.extract_text() or "") + "\n"
            # Write to text file

            with open(output_path, 'w', encoding='utf-8') as text_file:
                text_file.write(text)
            print(f"Converted {pdf_path} to {output_path}")
        except Exception as e:
            print(f"Error converting {pdf_path}: {e}")
# Text Substitutions
import random
import nltk
from nltk.corpus import wordnet

# Download wordnet
nltk.download('wordnet')
def synonym_replacement(text, n=1):

    # Replace n random words with their synonyms.

    words = text.split()
    new_words = words.copy()
    # Find random words with synonyms
    for _ in range(min(n, len(words))):
        random_idx = random.randint(0, len(words) - 1)
        random_word = words[random_idx]
        # Get synonyms from WordNet, excluding the original word and
        # converting multi-word lemmas like 'travel_along' to plain text
        synonyms = []
        for syn in wordnet.synsets(random_word):
            for lemma in syn.lemmas():
                candidate = lemma.name().replace('_', ' ')
                if candidate.lower() != random_word.lower():
                    synonyms.append(candidate)

        # Replace if a synonym exists
        if synonyms:
            new_words[random_idx] = random.choice(synonyms)
    return ' '.join(new_words)
# Text Cleaning Functions
import re
import html
import unicodedata
def clean_text(text):
    # Basic text cleaning function.

    # Normalize unicode characters
    text = unicodedata.normalize('NFKC', text)

    # Decode HTML entities
    text = html.unescape(text)

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove email addresses
    text = re.sub(r'\S*@\S*\s?', '', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def remove_long_short_lines(text, min_length=3, max_length=10000):

    # Filter out too short or too long lines.

    lines = text.split('\n')
    filtered_lines = [
        line for line in lines 
        if min_length <= len(line.split()) <= max_length
    ]
    return '\n'.join(filtered_lines)

def remove_non_language_lines(text, language_threshold=0.7):
    # Remove lines that are likely not natural language (requires langdetect)

    from langdetect import detect_langs
    lines = text.split('\n')
    filtered_lines = []
    for line in lines:
        if not line.strip():
            filtered_lines.append(line)
            continue

        try:
            # Check if the dominant language probability is high enough
            langs = detect_langs(line)
            if langs and langs[0].prob >= language_threshold:
                filtered_lines.append(line)
        except Exception:
            # If detection fails, keep the line
            filtered_lines.append(line)
    return '\n'.join(filtered_lines)
# Deduplication
import hashlib
def deduplicate_paragraphs(texts):
    #Remove duplicate paragraphs from a list of texts 
    seen_hashes = set()
    unique_texts = []
    for text in texts:
        # Split into paragraphs

        paragraphs = text.split('\n\n')
        unique_paragraphs = []

        for paragraph in paragraphs:
            if not paragraph.strip():
                continue

            # Create a hash of the normalized paragraph

            paragraph_norm = ' '.join(paragraph.lower().split())
            paragraph_hash = hashlib.md5(paragraph_norm.encode()).hexdigest()

            # Only keep unique paragraphs

            if paragraph_hash not in seen_hashes:
                seen_hashes.add(paragraph_hash)
                unique_paragraphs.append(paragraph)

        # Rejoin paragraphs

        if unique_paragraphs:
            unique_texts.append('\n\n'.join(unique_paragraphs))

    return unique_texts
# Quality Filtering
import re

def filter_by_quality(texts, min_words=5, min_chars=20, max_repeated_chars=4):

    #Filter texts based on quality heuristics

    filtered_texts = []

    for text in texts:
        # Check minimum words

        if len(text.split()) < min_words:
            continue

        # Check minimum characters

        if len(text) < min_chars:
            continue

        # Check for excessive repeated characters
        if re.search(r'(.)\1{%d,}' % max_repeated_chars, text):
            continue

        # Additional quality heuristics can be added
       

        filtered_texts.append(text)
    return filtered_texts
# Creating Train/Validation/Test Splits
import numpy as np

def create_dataset_splits(files, train_ratio=0.9, val_ratio=0.05, test_ratio=0.05, seed=42):

    # Split files into train, validation, and test sets

    # Ensure ratios sum to 1

    total_ratio = train_ratio + val_ratio + test_ratio
    train_ratio /= total_ratio
    val_ratio /= total_ratio
    test_ratio /= total_ratio
    

    # Shuffle a copy so the caller's list is not modified in place
    files = list(files)
    np.random.seed(seed)
    np.random.shuffle(files)

    # Calculate split indices

    n_files = len(files)
    train_end = int(n_files * train_ratio)
    val_end = train_end + int(n_files * val_ratio)
   

    # Split files

    train_files = files[:train_end]
    val_files = files[train_end:val_end]
    test_files = files[val_end:]

    return {
        'train': train_files,
        'validation': val_files,
        'test': test_files
    }

Thank you for reading
