Disadvantages and Problematic Content in Common Crawl

While Common Crawl provides an invaluable resource for researchers and AI developers, it also presents significant challenges and concerns, particularly regarding problematic content. This document examines the disadvantages of using Common Crawl data, with a specific focus on unsafe material that may be present in the dataset: pornography, violence, hate speech, and racist content.

Introduction

Common Crawl is a massive, freely available web archive maintained by the Common Crawl non-profit organization, which was founded in 2007. It contains petabytes of data collected through regular web crawls since 2008 and currently encompasses billions of web pages. This open repository has become a cornerstone resource for researchers, data scientists, and AI developers, most notably serving as the primary training data source for many large language models, including OpenAI's GPT-3. The data is hosted on Amazon Web Services' Public Data Sets and is accessible through various methods detailed on Common Crawl's Get Started page. While Common Crawl has democratized access to web-scale data, it also presents significant challenges regarding content quality and problematic material, requiring careful filtering by downstream users.

Deliberate Lack of Curation

Common Crawl deliberately maintains a minimal curation approach to its data collection:

Philosophical Stance: Common Crawl's mission is to provide raw web data for various research purposes, including studies on hate speech and problematic content. As stated by the Common Crawl director: "If you say that a human is allowed to read a webpage, but a machine isn't, I think that's a disparity that we would challenge."
Preservation of Problematic Content: The organization intentionally does not remove hate speech, pornography, violent content, or other problematic material from its dataset, believing that researchers should have access to this content for legitimate research purposes.
Shifting Responsibility: Common Crawl places the responsibility for filtering problematic content on downstream users (like AI builders), rather than taking on this responsibility itself.

Types of Problematic Content Present

Research has identified several categories of problematic content in Common Crawl data:

Hate Speech and Toxic Content: Studies have found "significant amounts of undesirable content, including hate speech" in Common Crawl datasets. This content can perpetuate harmful stereotypes and biases when used to train AI models.
Sexually Explicit Material: Despite filtering efforts by AI builders, sexually explicit content remains prevalent in Common Crawl-derived datasets. This includes both pornographic material and other forms of adult content.
Violent Content: Descriptions of violence, violent imagery, and other disturbing content are present throughout the dataset.
Racist and Discriminatory Content: Content that promotes racism, xenophobia, and other forms of discrimination is present in the dataset, reflecting the presence of such content on the web.
Misinformation and Conspiracy Theories: The dataset contains various forms of misinformation, conspiracy theories, and factually incorrect information that can be propagated by AI models trained on this data.

Inadequate Filtering by AI Builders

The filtering techniques employed by AI builders when using Common Crawl data are often insufficient:

Simplistic Filtering Approaches: According to the Mozilla Foundation report, "the filtering techniques AI builders use are often too simplistic to seriously address concerns around toxic and biased training data — something Common Crawl does not provide guidance or leadership on."
Focus on Limited Content Types: Most filtering efforts focus primarily on removing pornography or boilerplate text (like navigation menus), while leaving other types of problematic content untouched.
Automated Filtering Limitations: AI builders often rely on rudimentary automated filtering techniques that cannot effectively identify all forms of problematic content, especially more subtle forms of bias or harmful material.
Lack of Transparency: There is often limited transparency about how Common Crawl's massive data was filtered for harmful content before pre-training, making it difficult to assess the effectiveness of these filtering efforts.
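
To illustrate why simplistic filtering falls short, here is a minimal sketch of a keyword blocklist filter. It is not any AI builder's actual pipeline: the blocklist, the function name passes_blocklist, and the sample documents are all hypothetical and chosen only to show the failure modes described above.

```python
import re

# Hypothetical blocklist for illustration; real lists are far longer.
BLOCKLIST = {"porn", "xxx"}

def passes_blocklist(text: str) -> bool:
    """Return True if no blocklisted word appears as a whole token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return not any(tok in BLOCKLIST for tok in tokens)

# Hypothetical documents, not taken from Common Crawl.
docs = [
    # Legitimate research text, dropped because it mentions a blocked word.
    "Research on how porn filters affect access to sexual health education.",
    # Dehumanizing text, kept because it contains no blocked word.
    "Those people are vermin and do not belong here.",
    # Benign text, kept.
    "A recipe for lentil soup with cumin and lemon.",
]

kept = [d for d in docs if passes_blocklist(d)]
for d in kept:
    print(d)
```

In this sketch the filter removes the research sentence and keeps the dehumanizing one, which is the pattern the list above describes: keyword matching targets a narrow set of content and misses harm that is expressed without flagged words.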

Representation and Bias Issues

Common Crawl data suffers from significant representation issues:

English-Language Dominance: English is the primary language for 46% of documents (as of March 2023), with other languages like German, Russian, Japanese, French, Spanish, and Chinese each representing less than 6% of documents.
Western Cultural Bias: The dataset overrepresents Western perspectives and underrepresents content from other cultural contexts, leading to AI models that may not perform well for non-Western users.
Digital Divide Reflection: The crawling process prioritizes pages on domains that are frequently linked to, which makes domains related to digitally marginalized communities less likely to be included.
Incomplete Web Coverage: Despite claims that Common Crawl contains the "entire web," it represents only a small fraction of existing web content. As stated by a main crawl engineer at Common Crawl: "Often it is claimed that Common Crawl contains the entire web, but that's absolutely not true. Based on what I know about how many URLs exist, it's very, very small."
Common Crawl is neither curated nor balanced. Somewhat less than half of its content is in English, and no other single language exceeds 6%. German, Spanish, French, Russian, Japanese, and Chinese are present, but only minimally. Most other languages, including many spoken across Eastern Europe, the Global South, and Indigenous communities, are virtually absent.
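
For readers who want to see how a figure like "English is the primary language for 46% of documents" is defined, the sketch below computes per-language document shares from a list of primary-language labels. The function name language_shares and the sample counts are hypothetical; the sample is constructed to resemble the proportions quoted above and is not real Common Crawl data.

```python
from collections import Counter

def language_shares(labels):
    """Map each language label to its fraction of all documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

# Hypothetical sample of 100 documents, one primary-language label each.
# "other" is a bucket standing in for every remaining language combined.
sample = ["en"] * 46 + ["de"] * 5 + ["ru"] * 5 + ["ja"] * 5 + ["other"] * 39

shares = language_shares(sample)
for lang, share in shares.items():
    print(f"{lang}: {share:.0%}")
```

Note that the share is counted per document, so it says nothing about document length or quality, and a document containing several languages is counted only under its primary one.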

[Figure: "Language Distribution in AI Training Data" (based on the Common Crawl dataset). Categories labeled: English, German, Other Majors (Russian, Chinese, Japanese, etc.), Spanish, French, and Others. Caption: AI models "think" in English: linguistically, culturally, and contextually.]

Implications for AI Models

Training AI models on Common Crawl data without adequate filtering and curation leads to several negative outcomes:

Perpetuation of Harmful Stereotypes: Models trained on unfiltered or inadequately filtered Common Crawl data may generate content that perpetuates harmful stereotypes and biases.
Generation of Toxic Content: These models may produce hate speech, racist content, or other harmful outputs when prompted in certain ways.
Misinformation Spread: Models may confidently present misinformation or conspiracy theories as factual information.
Cultural Insensitivity: Due to representation biases, models may produce content that is culturally insensitive or inappropriate for users from underrepresented regions or language groups.
Unsafe Content Generation: Models may generate sexually explicit, violent, or otherwise unsafe content, particularly when safeguards are insufficient.
Unnatural Non-English Output: Models may be unable to express ideas naturally in non-English languages, producing text that sounds stilted and legalistic rather than like a native speaker.

Growing Access Restrictions

The use of Common Crawl for AI training is facing increasing challenges:

Content Creator Pushback: More platforms, online communities, and news media want to block or charge money for access to their data, with an increasing number of relevant domains like Facebook and the New York Times blocking Common Crawl from crawling most (or all) of their pages.
Legal Challenges: The dataset includes copyrighted work distributed from the US under fair use claims, but this is increasingly being challenged through lawsuits (like The New York Times suing OpenAI and Microsoft).
API Restrictions: Platforms are increasingly shutting down or limiting unpaid API access to their data, making it harder to gather comprehensive web content.
Ignorance of Global Culture: A significant disadvantage of Common Crawl is its inherent bias. It disproportionately reflects dominant online populations (Western, English-speaking) and lacks diverse global perspectives, languages, and cultures. This underrepresentation hinders the development of globally aware and equitable AI models, risks propagating biases, and limits multilingual AI applications. Addressing this cultural and linguistic imbalance is crucial for building inclusive AI systems for a global user base.
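
As an illustration of the blocking mentioned under Content Creator Pushback, sites commonly express crawler restrictions in a robots.txt file, and Common Crawl's crawler identifies itself with the user agent CCBot. The sketch below uses Python's standard urllib.robotparser to check a hypothetical robots.txt. The file contents and the example.com URL are assumptions for illustration and are not copied from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only Common Crawl's crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page URL on the example site.
url = "https://example.com/articles/some-story"

ccbot_allowed = parser.can_fetch("CCBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

A rule like this only has an effect if the crawler chooses to honor it, and it applies to future crawls, so pages captured before the rule was added may still be present in earlier archives.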

Conclusion

While Common Crawl provides an invaluable resource for AI research and development, its use comes with significant challenges related to problematic content. The deliberate lack of curation by Common Crawl, combined with inadequate filtering by AI builders, results in models that may perpetuate harmful stereotypes, generate toxic content, and spread misinformation.

Addressing these issues requires a multi-faceted approach:

Better filtering techniques that address a wider range of problematic content
Greater transparency about filtering methods and their limitations
Supplementing Common Crawl data with more carefully curated datasets
Development of industry standards and best practices for filtering training data
Creation of dedicated intermediaries tasked with filtering Common Crawl in transparent and accountable ways

Long-term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways.
