Language Model Training Data: Sources and Preparation Guide
The Hugging Face Revolution: 1 Million Models and Counting
It is worth highlighting the incredible growth of the Hugging Face platform, which now hosts over 1 million AI models. This milestone reflects not just the platform's success but the explosive growth in AI development worldwide.
Beyond models, Hugging Face maintains an impressive collection of over 75,000 datasets spanning more than 100 languages. These datasets support a wide range of tasks across natural language processing, computer vision, and audio processing, making it an invaluable resource for training and fine-tuning AI models.
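You can also browse this catalog programmatically. The snippet below is a minimal sketch using the huggingface_hub client library (assumed to be installed separately); the search term and result limit are arbitrary examples.
# List a few datasets on the Hub matching a search term
from huggingface_hub import list_datasets

for ds in list_datasets(search="wikipedia", limit=5):
    print(ds.id)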
Public Datasets for Language Model Training
General Text Corpora
The Pile
- Description: A diverse 825GB English text corpus designed specifically for LLM training
- Access: https://pile.eleuther.ai/
- Usage:
# Using the Hugging Face datasets library
pip install datasets
python -c "from datasets import load_dataset; dataset = load_dataset('the_pile')"
C4 (Colossal Clean Crawled Corpus)
- Description: 750GB of clean English web text data
- Access: https://huggingface.co/datasets/c4
- Usage:
# Load English subset
python -c "from datasets import load_dataset; dataset = load_dataset('c4', 'en')"
WikiText
- Description: Long-term dependency language modeling dataset with clean Wikipedia articles
- Access: https://huggingface.co/datasets/wikitext
- Best for: Smaller models and initial testing (low resource requirement)
- Usage:
# Load WikiText-103
python -c "from datasets import load_dataset; dataset = load_dataset('wikitext', 'wikitext-103-v1')"
BookCorpus
- Description: Collection of free books
- Access: https://huggingface.co/datasets/bookcorpus
- Usage:
python -c "from datasets import load_dataset; dataset = load_dataset('bookcorpus')"
Popular Hugging Face Datasets
Some of the most widely used datasets on Hugging Face include the following; a loading example appears after the list:
- IMDB: A collection of 50,000 movie reviews labeled as positive or negative, commonly used for sentiment analysis tasks.
- Common Voice: A multilingual speech dataset featuring over 9,000 hours of recordings, essential for speech recognition development.
- Amazon Polarity: Contains over 3 million Amazon product reviews labeled as positive or negative, used for sentiment analysis.
- Emotion: Texts labeled with six primary emotions (anger, fear, joy, love, sadness, surprise) for emotion classification.
- Yahoo Answers Topics: Over 1 million questions categorized by topic, useful for text classification tasks.
- Hate Speech18: Specialized dataset for detecting hate speech and offensive language in online forum posts.
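All of these can be loaded with the same one-liner pattern used throughout this guide. For example, assuming the datasets library is installed as above, the following loads IMDB and prints the first training review:
python -c "from datasets import load_dataset; dataset = load_dataset('imdb'); print(dataset['train'][0])"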
Code and Technical Data
GitHub Code
- Description: Code from public GitHub repositories
- Access: https://huggingface.co/datasets/codeparrot/github-code
- Usage:
python -c "from datasets import load_dataset; dataset = load_dataset('codeparrot/github-code')"
The Stack
- Description: 3TB dataset with 6 million GitHub repositories
- Access: https://huggingface.co/datasets/bigcode/the-stack
- Usage:
# Load specific language subset (e.g., Python)
python -c "from datasets import load_dataset; dataset = load_dataset('bigcode/the-stack', data_dir='data/python')"
Multilingual Data
OSCAR
- Description: Large multilingual corpus obtained from Common Crawl
- Access: https://huggingface.co/datasets/oscar
- Usage:
# Load French subset
python -c "from datasets import load_dataset; dataset = load_dataset('oscar', 'unshuffled_deduplicated_fr')"
mC4
- Description: Multilingual version of C4 with 101 languages
- Access: https://huggingface.co/datasets/mc4
- Usage:
# Load Spanish subset
python -c "from datasets import load_dataset; dataset = load_dataset('mc4', 'es')"
Specialized Datasets
PubMed Abstracts
- Description: Medical and biomedical research papers
- Access: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
- Best for: Medical AI models (a parsing sketch follows)
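Unlike the Hugging Face datasets above, the PubMed baseline is distributed as gzipped MEDLINE XML files, so a small amount of parsing is needed. The sketch below is a minimal, hedged example that pulls abstract text out of one already-downloaded file; the filename is a placeholder and only AbstractText elements are handled.
# Extract abstract text from a downloaded PubMed baseline file (filename is a placeholder)
import gzip
import xml.etree.ElementTree as ET

def extract_abstracts(path):
    abstracts = []
    with gzip.open(path, 'rb') as f:
        tree = ET.parse(f)
        for node in tree.iter('AbstractText'):
            if node.text:
                abstracts.append(node.text)
    return abstracts

abstracts = extract_abstracts('pubmed_baseline_sample.xml.gz')
print(f"Extracted {len(abstracts)} abstracts")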
ArXiv Dataset
- Description: Scientific papers across multiple disciplines
- Access: https://huggingface.co/datasets/arxiv_dataset
- Usage:
python -c "from datasets import load_dataset; dataset = load_dataset('arxiv_dataset')"
Creating Custom Training Datasets
Web Scraping
Web scraping is an effective method for creating custom datasets. See the appendix for a comprehensive web scraper implementation that respects website permissions.
Local Text Sources
eBooks
- Project Gutenberg offers over 60,000 free books in the public domain
- Access: https://www.gutenberg.org/
- Download books in plain text format (see the download sketch below)
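Gutenberg's plain-text files include the project's own header and license footer, which you will usually want to strip before adding them to a training corpus. A minimal sketch, assuming the usual "*** START OF" / "*** END OF" boilerplate markers; the URL is only an example, so check each book's page for its actual plain-text link.
# Download one Project Gutenberg book and strip its boilerplate header/footer
import requests

def fetch_gutenberg_text(url):
    response = requests.get(url, headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'})
    response.raise_for_status()
    text = response.text
    start = text.find('*** START OF')
    end = text.find('*** END OF')
    if start != -1 and end != -1:
        start = text.find('\n', start) + 1  # skip past the START marker line
        text = text[start:end]
    return text.strip()

book = fetch_gutenberg_text('https://www.gutenberg.org/cache/epub/1342/pg1342.txt')
print(book[:200])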
PDF Conversion
PDF documents can be converted to text for inclusion in training datasets. See the appendix for a PDF conversion implementation.
Data Augmentation Techniques
Text substitutions
Synonym replacement and other text substitution techniques can expand your dataset with variations. See the appendix for implementation.
Data Cleaning and Processing
Text Cleaning Functions
Effective text cleaning is crucial for model performance. The appendix contains implementations for:
- Basic text cleaning
- Removing lines that are too short or too long
- Filtering non-language content
Deduplication
Removing duplicate content prevents models from giving undue weight to repeated information. The appendix includes a paragraph-level deduplication implementation.
Quality Filtering
Quality filtering removes low-value content that could degrade model performance. See the appendix for implementation details.
Creating Train/Validation/Test Splits
Proper dataset splits are essential for effective model evaluation. The appendix includes a function for creating dataset splits with configurable ratios.
The Challenges of Web Crawl Datasets
When planning your language model training, it's important to understand both the practical and ethical challenges of working with large-scale web datasets that are often described as "freely available."
The Infrastructure Barrier
Common web crawl datasets present several significant challenges that aren't immediately obvious:
- Enormous file sizes: Individual files often measure in multiple gigabytes, with complete datasets reaching petabyte scale
- Raw, unprocessed formats: Data typically comes in specialized formats like WARC (Web ARChive) that require additional processing (see the reading sketch after this list)
- Significant storage requirements: Even working with a small subset can quickly consume terabytes of storage
- Bandwidth limitations: Downloading the data becomes a major hurdle, even with good internet connections
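To make the WARC point concrete: even reading a single crawl file requires a dedicated parser. The sketch below uses the third-party warcio package (assumed to be installed) to count HTML responses in one already-downloaded WARC file; the filename is a placeholder.
# Count HTML response records in a single WARC file (filename is a placeholder)
from warcio.archiveiterator import ArchiveIterator

def count_html_responses(warc_path):
    count = 0
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            content_type = record.http_headers.get_header('Content-Type') if record.http_headers else None
            if content_type and 'text/html' in content_type:
                count += 1
    return count

print(count_html_responses('example.warc.gz'))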
The Real-World Cost of "Free" Data
While these datasets are technically free to access, processing them requires:
- Substantial computing resources: Distributed computing clusters or high-end machines are often necessary
- Cloud computing costs: Storage, bandwidth, and compute time on cloud platforms can quickly add up to thousands of dollars
- Technical expertise: Working with these datasets requires familiarity with big data tools, distributed systems, and specialized file formats
- Time investment: The learning curve and processing time represent a significant hidden cost
Who Can Realistically Use These Datasets?
The infrastructure requirements effectively limit access to:
- Organizations with substantial computing resources
- Academic institutions with research computing facilities
- Companies with significant technology budgets
- Individuals with specialized technical knowledge and access to computing resources
This creates an accessibility gap that contradicts the "open for anyone" messaging often associated with these datasets.
Problematic Content in Common Crawl
Beyond the technical challenges, Common Crawl and similar web crawl datasets present significant ethical concerns:
Deliberate Lack of Curation
Common Crawl maintains a minimal curation approach to its data collection:
- The organization intentionally does not remove hate speech, pornography, violent content, or other problematic material from its dataset
- Responsibility for filtering is shifted to downstream users (like AI builders)
- This philosophical stance aims to provide raw web data for various research purposes, but creates significant challenges for AI development
Types of Problematic Content Present
Research has identified several categories of concerning content in Common Crawl data:
- Hate Speech and Toxic Content: Studies have found significant amounts of hate speech that can perpetuate harmful stereotypes and biases
- Sexually Explicit Material: Despite filtering efforts, sexually explicit content remains prevalent
- Violent Content: Descriptions of violence and disturbing content are present throughout the dataset
- Racist and Discriminatory Content: Content that promotes racism, xenophobia, and other forms of discrimination is present
- Misinformation: The dataset contains various forms of misinformation and factually incorrect information
Representation and Bias Issues
Common Crawl data suffers from significant representation issues:
- English-Language Dominance: English is the primary language for 46% of documents
- Western Cultural Bias: The dataset overrepresents Western perspectives
- Digital Divide Reflection: The crawling process is less likely to include domains serving digitally marginalized communities
- Incomplete Web Coverage: Despite claims that Common Crawl contains the "entire web," it represents only a small fraction of existing web content
Growing Access Restrictions
The use of Common Crawl for AI training is facing increasing challenges:
- Content Creator Pushback: More platforms and publishers are blocking or charging for access to their data
- Legal Challenges: The dataset includes copyrighted work distributed under fair use claims, a position that is increasingly being challenged in court
- API Restrictions: Platforms are increasingly limiting unpaid API access to their data
Alternative Approaches
Rather than attempting to process entire web crawl datasets with their inherent problems, consider these more practical and ethical approaches:
- Work with curated subsets: Many organizations offer filtered, topic-specific extractions from larger datasets
- Use pre-processed versions: Look for already cleaned and filtered derivatives of web crawls
- Supplement with diverse sources: Combine web crawl data with more carefully curated datasets that address representation gaps
- Implement robust filtering: Develop comprehensive filtering pipelines that address the full range of problematic content
- Be transparent: Document your filtering methods and their limitations
- Pool resources: Consider collaborative approaches where processing costs and filtering expertise can be shared
- Start small: Begin with manageable portions to validate your approach before scaling up (see the streaming sketch after this list)
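Starting small can be as simple as streaming a dataset instead of downloading it: the datasets library can iterate over records on the fly without pulling terabytes to disk. A minimal sketch using the C4 identifier from earlier in this guide:
# Stream a small sample of C4 instead of downloading the full dataset
from datasets import load_dataset

dataset = load_dataset('c4', 'en', split='train', streaming=True)
sample = list(dataset.take(100))  # only the first 100 records are fetched
print(sample[0]['text'][:200])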
Commercial Data Sources
For commercial projects, consider these sources:
- Common Crawl: Petabytes of web crawl data (Access: https://commoncrawl.org/)
- Google Books Ngrams: Statistical information about word usage (a parsing sketch follows this list)
- Licensed Content: academic journals, newspaper archives, specialized industry texts
- Data Providers: LightTag, Scale AI, Appen
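Of these, Google Books Ngrams is the easiest to experiment with directly: the exports are large tab-separated files you can download and scan locally. The sketch below is a hedged example that tallies yearly counts for one n-gram; the filename is a placeholder, and the column layout assumed here is ngram, year, match count, volume count.
# Tally yearly counts for one n-gram from a downloaded Ngrams export (filename is a placeholder)
import gzip
from collections import defaultdict

def yearly_counts(path, target):
    counts = defaultdict(int)
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 3 or fields[0] != target:
                continue
            counts[int(fields[1])] += int(fields[2])
    return dict(counts)

print(yearly_counts('googlebooks-eng-1gram-sample.gz', 'language'))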
Ethical and Legal Considerations for AI Training Data
When developing AI systems, it's crucial to be aware of the evolving legal and ethical landscape surrounding training data. The AI industry has seen significant legal challenges, with publishers, authors, and media organizations filing lawsuits alleging copyright infringement and unauthorized use of their content for AI training.
Legal Precedents and Challenges
Recent court cases have established precedents that could significantly impact AI development:
- Several publishing companies have filed lawsuits against AI companies alleging systematic copyright infringement
- Authors have sued over book content being used without permission for AI training
- Programmers have brought class action lawsuits over AI coding assistants trained on billions of lines of code
- Courts have rejected some "fair use" claims for training data
- News publishers have initiated legal action, claiming articles were used for training without authorization
These challenges are pushing the industry toward more transparent and ethical practices, with some companies now offering licensing deals with publishers and emphasizing copyright controls.
Responsible Training Data Practices
To protect yourself and your organization while creating ethical AI systems, consider these best practices:
Legal Protection
- Document all data sources and maintain detailed records
- Respect website crawling instructions and honor robots.txt (see the check after this list)
- Consider licensing for commercial applications
- For third-party models, look for providers that offer legal protection
- Clearly define intended uses and limitations in your terms of service
- Consult specialized legal counsel as the field is complex and rapidly evolving
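Honoring robots.txt is straightforward to automate before any scraping run. A minimal sketch using Python's standard-library robot parser; the URL and user-agent string are examples:
# Check robots.txt before fetching a page
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent='Custom Dataset Builder Bot'):
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

print(allowed_to_fetch('https://www.gutenberg.org/ebooks/1342'))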
Mitigating Harmful Content
- Implement comprehensive filtering to address all types of problematic content, not just explicit material
- Use more sophisticated filtering techniques rather than simplistic keyword approaches (a classifier-based sketch follows this list)
- Be transparent about your filtering methods and their limitations
- Regularly audit your system's outputs for bias, toxicity, and other harmful content
- Create channels for reporting problematic outputs
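One more sophisticated option is scoring text with a learned toxicity classifier rather than keyword lists. The sketch below assumes the third-party detoxify package is installed; treat it as an illustration and substitute whichever classifier you have evaluated for your own use case.
# Filter out texts that a toxicity classifier scores above a threshold
from detoxify import Detoxify

model = Detoxify('original')

def filter_toxic(texts, threshold=0.5):
    kept = []
    for text in texts:
        scores = model.predict(text)
        if scores['toxicity'] < threshold:
            kept.append(text)
    return kept

print(filter_toxic(["Have a wonderful day!", "You are worthless and everyone hates you"]))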
Addressing Representation Issues
- Supplement web crawl data with more diverse and representative datasets
- Consider the biases inherent in your training data and take steps to mitigate them
- Be mindful of language and cultural representation in your datasets (see the language audit sketch after this list)
- Design systems that can trace outputs back to source material
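A concrete first audit is simply measuring the language mix of a corpus sample before training. A minimal sketch using langdetect, the same library used by the cleaning functions in the appendix:
# Measure the language distribution of a corpus sample
from collections import Counter
from langdetect import detect

def language_distribution(texts):
    counts = Counter()
    for text in texts:
        try:
            counts[detect(text)] += 1
        except Exception:
            counts['unknown'] += 1
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}

print(language_distribution(["Hello world", "Bonjour tout le monde", "Hola, ¿cómo estás?"]))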
By implementing these practices, you can develop AI systems that are not only effective but also ethically and legally sound. The future of AI depends on responsible development that respects copyright, prevents harmful content generation, and ensures fair representation across different languages and cultures.
Appendix: Code Implementation Samples
# Web Scraping Implementation
import requests
from bs4 import BeautifulSoup
import time
import random
import markdown
import re
import json
import os
def check_llms_txt(base_url):
# Check for llms.txt and parse its contents if available
try:
llms_url = f"{base_url.rstrip('/')}/llms.txt"
response = requests.get(
llms_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
if response.status_code == 200:
print(f"Found llms.txt at {llms_url}")
return response.text
return None
except Exception as e:
print(f"Error checking for llms.txt: {e}")
return None
def parse_llms_txt(content):
# Parse llms.txt content to extract relevant directives
directives = {
'content_endpoints': [],
'content_selectors': [],
'markdown_sources': [],
'rate_limits': {},
'allowed_training': False,
'attribution': None
}
# Extract markdown links
md_links = re.findall(r'\[(.*?)\]\((.*?\.md)\)', content)
for name, url in md_links:
directives['markdown_sources'].append(url)
# Extract content endpoints
if "ContentEndpoint:" in content:
endpoints = re.findall(r'ContentEndpoint:\s*(\S+)', content)
directives['content_endpoints'].extend(endpoints)
# Extract content selectors
if "ContentSelector:" in content:
selectors = re.findall(r'ContentSelector:\s*(\S+)', content)
directives['content_selectors'].extend(selectors)
# Check training permissions
if re.search(r'AllowAITraining:\s*true', content, re.IGNORECASE):
directives['allowed_training'] = True
# Extract rate limits
rate_limit_match = re.search(r'Rate limit:\s*(\d+)\s+requests\s+per\s+(\w+)', content, re.IGNORECASE)
if rate_limit_match:
amount, period = rate_limit_match.groups()
directives['rate_limits'] = {'amount': int(amount), 'period': period}
# Extract attribution requirements
attribution_match = re.search(r'Attribution.*?format:?\s*"([^"]+)"', content)
if attribution_match:
directives['attribution'] = attribution_match.group(1)
return directives
def scrape_website(urls, output_file, delay_range=(1, 3)):
# Scrape text content from a list of URLs with llms.txt support
with open(output_file, 'w', encoding='utf-8') as f:
for base_url in urls:
try:
# Check for llms.txt first
llms_content = check_llms_txt(base_url)
llms_directives = parse_llms_txt(llms_content) if llms_content else None
# Apply rate limiting from llms.txt if available
if llms_directives and 'rate_limits' in llms_directives and llms_directives['rate_limits']:
# Here you would implement dynamic rate limiting based on the directives
# For this example, we'll just use our default delay
print(f"Respecting rate limit: {llms_directives['rate_limits']['amount']} requests per {llms_directives['rate_limits']['period']}")
# Be a good web citizen with delays between requests
time.sleep(random.uniform(*delay_range))
# If llms.txt specifies markdown sources, prioritize those
if llms_directives and llms_directives['markdown_sources']:
for md_url in llms_directives['markdown_sources']:
# Ensure the URL is absolute
if not md_url.startswith('http'):
md_url = f"{base_url.rstrip('/')}/{md_url.lstrip('/')}"
print(f"Fetching markdown content from {md_url}")
md_response = requests.get(
md_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
if md_response.status_code == 200:
# For Markdown files, we can directly save the content
md_content = md_response.text
f.write(f"# Content from {md_url}\n\n")
f.write(md_content)
f.write("\n\n")
print(f"Saved markdown content from {md_url}")
# If llms.txt specifies content endpoints (e.g., static JSON), use those
if llms_directives and llms_directives['content_endpoints']:
for endpoint in llms_directives['content_endpoints']:
# Ensure the URL is absolute
if not endpoint.startswith('http'):
endpoint_url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
else:
endpoint_url = endpoint
print(f"Fetching content from endpoint {endpoint_url}")
endpoint_response = requests.get(
endpoint_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
if endpoint_response.status_code == 200:
try:
# Try to parse as JSON
json_data = endpoint_response.json()
# Extract text content from JSON (depends on structure)
if 'content' in json_data:
f.write(f"# Content from {endpoint_url}\n\n")
f.write(json_data['content'])
f.write("\n\n")
print(f"Saved JSON content from {endpoint_url}")
except ValueError:
# Not JSON, treat as regular text
f.write(f"# Content from {endpoint_url}\n\n")
f.write(endpoint_response.text)
f.write("\n\n")
print(f"Saved text content from {endpoint_url}")
# Fall back to traditional scraping if needed
print(f"Scraping {base_url}")
response = requests.get(
base_url,
headers={'User-Agent': 'Custom Dataset Builder Bot (your@email.com)'}
)
response.raise_for_status()
# Parse content
soup = BeautifulSoup(response.text, 'html.parser')
# If llms.txt specifies content selectors, use those
if llms_directives and llms_directives['content_selectors']:
content_text = ""
for selector in llms_directives['content_selectors']:
for element in soup.select(selector):
content_text += element.get_text(separator='\n') + "\n\n"
if content_text:
f.write(f"# Content from {base_url} using selectors\n\n")
f.write(content_text)
f.write("\n\n")
print(f"Scraped {base_url} using content selectors")
continue # Skip traditional extraction if we used selectors
# Traditional extraction if no selectors or they didn't yield content
# Remove scripts, styles, and other non-content elements
for element in soup(['script', 'style', 'header', 'footer', 'nav']):
element.decompose()
# Extract text
text = soup.get_text(separator='\n')
# Clean text (remove extra whitespace, etc.)
lines = [line.strip() for line in text.split('\n')]
text = '\n'.join(line for line in lines if line)
# Add attribution if required by llms.txt
if llms_directives and llms_directives['attribution']:
text += f"\n\nSource: {llms_directives['attribution']}"
# Write to file
f.write(f"# Content from {base_url}\n\n")
f.write(text)
f.write("\n\n")
print(f"Scraped {base_url}")
except Exception as e:
print(f"Error scraping {base_url}: {e}")
# PDF Conversion
import os
import glob
import PyPDF2
def pdf_to_text(pdf_dir, output_dir):
# Convert PDF files to text files.
os.makedirs(output_dir, exist_ok=True)
for pdf_path in glob.glob(os.path.join(pdf_dir, "*.pdf")):
try:
pdf_name = os.path.basename(pdf_path).replace('.pdf', '')
output_path = os.path.join(output_dir, f"{pdf_name}.txt")
# Open the PDF
with open(pdf_path, 'rb') as pdf_file:
reader = PyPDF2.PdfReader(pdf_file)
# Extract text from each page
text = ""
for page_num in range(len(reader.pages)):
text += reader.pages[page_num].extract_text()
# Write to text file
with open(output_path, 'w', encoding='utf-8') as text_file:
text_file.write(text)
print(f"Converted {pdf_path} to {output_path}")
except Exception as e:
print(f"Error converting {pdf_path}: {e}")
#Text Substitutions
import random
import nltk
from nltk.corpus import wordnet
# Download wordnet
nltk.download('wordnet')
def synonym_replacement(text, n=1):
# Replace n random words with their synonyms.
words = text.split()
new_words = words.copy()
# Find random words with synonyms
for _ in range(min(n, len(words))):
random_idx = random.randint(0, len(words) - 1)
random_word = words[random_idx]
# Get synonyms
synonyms = []
for syn in wordnet.synsets(random_word):
for lemma in syn.lemmas():
synonyms.append(lemma.name())
# Replace if synonyms exist
if len(synonyms) > 0:
synonym = random.choice(synonyms)
new_words[random_idx] = synonym
return ' '.join(new_words)
#Text Cleaning Functions
import re
import html
import unicodedata
def clean_text(text):
# Basic text cleaning function.
# Normalize unicode characters
text = unicodedata.normalize('NFKC', text)
# Decode HTML entities
text = html.unescape(text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove email addresses
text = re.sub(r'\S*@\S*\s?', '', text)
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
def remove_long_short_lines(text, min_length=3, max_length=10000):
# Filter out too short or too long lines.
lines = text.split('\n')
filtered_lines = [
line for line in lines
if min_length <= len(line.split()) <= max_length
]
return '\n'.join(filtered_lines)
def remove_non_language_lines(text, language_threshold=0.7):
# Remove lines that are likely not natural language (requires langdetect)
from langdetect import detect_langs
lines = text.split('\n')
filtered_lines = []
for line in lines:
if not line.strip():
filtered_lines.append(line)
continue
try:
# Check if the dominant language probability is high enough
langs = detect_langs(line)
if langs and langs[0].prob >= language_threshold:
filtered_lines.append(line)
except Exception:
# If detection fails, keep the line
filtered_lines.append(line)
return '\n'.join(filtered_lines)
#Deduplication
import hashlib
def deduplicate_paragraphs(texts):
#Remove duplicate paragraphs from a list of texts
seen_hashes = set()
unique_texts = []
for text in texts:
# Split into paragraphs
paragraphs = text.split('\n\n')
unique_paragraphs = []
for paragraph in paragraphs:
if not paragraph.strip():
continue
# Create a hash of the normalized paragraph
paragraph_norm = ' '.join(paragraph.lower().split())
paragraph_hash = hashlib.md5(paragraph_norm.encode()).hexdigest()
# Only keep unique paragraphs
if paragraph_hash not in seen_hashes:
seen_hashes.add(paragraph_hash)
unique_paragraphs.append(paragraph)
# Rejoin paragraphs
if unique_paragraphs:
unique_texts.append('\n\n'.join(unique_paragraphs))
return unique_texts
#Quality Filtering
def filter_by_quality(texts, min_words=5, min_chars=20, max_repeated_chars=4):
#Filter texts based on quality heuristics
filtered_texts = []
for text in texts:
# Check minimum words
if len(text.split()) < min_words:
continue
# Check minimum characters
if len(text) < min_chars:
continue
# Check for excessive repeated characters
if re.search(r'(.)\1{%d,}' % max_repeated_chars, text):
continue
# Additional quality heuristics can be added
filtered_texts.append(text)
return filtered_texts
#Creating Train/Validation/Test Splits
import numpy as np
def create_dataset_splits(files, train_ratio=0.9, val_ratio=0.05, test_ratio=0.05, seed=42):
#Split files into train, validation, and test set
# Ensure ratios sum to 1
total_ratio = train_ratio + val_ratio + test_ratio
train_ratio /= total_ratio
val_ratio /= total_ratio
test_ratio /= total_ratio
# Shuffle files
np.random.seed(seed)
np.random.shuffle(files)
# Calculate split indices
n_files = len(files)
train_end = int(n_files * train_ratio)
val_end = train_end + int(n_files * val_ratio)
# Split files
train_files = files[:train_end]
val_files = files[train_end:val_end]
test_files = files[val_end:]
return {
'train': train_files,
'validation': val_files,
'test': test_files
}
Thank you for reading