Dataset · 2025

Dataset Wikipedia EN

Open-source English Wikipedia dataset for NLP model training

Completed project

Detailed description

A large-scale dataset extracted from English Wikipedia, the world's largest encyclopedic corpus. The text is cleaned, deduplicated, and structured for language-model training, and includes full articles, summaries, infoboxes, and bibliographic references.
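The deduplication step described above can be sketched with Pandas, which is part of the project's stack. The sample rows, column names, and hashing strategy here are illustrative assumptions, not the project's actual pipeline:

```python
import hashlib

import pandas as pd

# Hypothetical sample of extracted articles; the real schema may differ.
articles = pd.DataFrame(
    {
        "title": ["Alan Turing", "Alan Turing", "Ada Lovelace"],
        "text": [
            "Alan Turing was a British mathematician.",
            "Alan Turing was a British mathematician.",
            "Ada Lovelace was an English mathematician.",
        ],
    }
)

# Hash normalized article text so exact duplicates collapse to a single row.
articles["text_hash"] = (
    articles["text"]
    .str.strip()
    .str.lower()
    .map(lambda t: hashlib.sha256(t.encode("utf-8")).hexdigest())
)
deduplicated = articles.drop_duplicates(subset="text_hash").drop(columns="text_hash")

print(len(deduplicated))  # → 2
```

Hashing the normalized text keeps memory use flat even on millions of articles, since only the digests need to be compared.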

Key features

  • One of the largest openly available English Wikipedia datasets
  • Deduplication and quality validation
  • Structured sections (intro, body, references)
  • Multi-format export (Parquet, JSON, CSV)
  • Documentation and reproduction scripts
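The "structured sections" feature above can be illustrated with a minimal splitter for plain-text wiki markup. The article text and the `== ... ==` heading convention shown here are assumptions for the sketch; the project's actual extraction runs through its Scrapy pipeline:

```python
# Hypothetical raw article text; real input comes from the extraction pipeline.
raw_article = """Alan Turing was a British mathematician and computer scientist.

== Early life ==
Turing was born in London in 1912.

== References ==
Hodges, Andrew. Alan Turing: The Enigma.
"""


def split_sections(text: str) -> dict:
    """Split a plain-text Wikipedia article into intro, body, and references."""
    # Everything before the first "== " heading is the introduction.
    intro, _, rest = text.partition("\n== ")
    # Everything after the References heading is the bibliography.
    body, _, references = rest.partition("== References ==\n")
    return {
        "intro": intro.strip(),
        "body": ("== " + body).strip() if body else "",
        "references": references.strip(),
    }


sections = split_sections(raw_article)
print(sorted(sections))  # → ['body', 'intro', 'references']
```

Storing each section as its own field lets downstream users train on summaries (intros) alone or filter out reference lists without re-parsing.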

Technologies used

Python · Scrapy · Pandas · Parquet · HuggingFace
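The multi-format export listed in the key features maps directly onto Pandas writers. The two-row DataFrame and file names below are placeholders, not the dataset's real schema or paths:

```python
import pandas as pd

# Hypothetical mini-batch of processed articles; the real schema may differ.
df = pd.DataFrame(
    {
        "title": ["Alan Turing", "Ada Lovelace"],
        "summary": [
            "British mathematician and computer scientist.",
            "English mathematician and early computing pioneer.",
        ],
    }
)

# CSV and JSON Lines exports need only Pandas itself.
df.to_csv("wikipedia_en_sample.csv", index=False)
df.to_json("wikipedia_en_sample.jsonl", orient="records", lines=True)

# Parquet export uses the same pattern but requires the pyarrow extra:
# df.to_parquet("wikipedia_en_sample.parquet", index=False)
```

Parquet is typically the primary format for training pipelines (columnar, compressed), while CSV and JSON Lines serve inspection and interoperability.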