Dataset · 2025
Dataset Wikipedia EN
Open-source English Wikipedia dataset for NLP model training
Completed project
Detailed description
A large-scale dataset extracted from English Wikipedia, the largest edition of the world's largest encyclopedic corpus. The text is cleaned, deduplicated, and structured for language model training, and includes full articles, summaries, infoboxes, and bibliographic references.
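As a rough sketch of the sectioning step (the actual parsing rules of the pipeline are not shown here; the function name, heading conventions, and sample article below are purely illustrative):

```python
def split_article(text: str) -> dict:
    """Split a plain-text article into intro, body, and references.

    Simplified illustration of the structured-sections idea: the intro is
    everything before the first `== Heading ==` line, and a `== References ==`
    heading switches collection to the references bucket.
    """
    sections = {"intro": [], "body": [], "references": []}
    current = "intro"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("==") and stripped.endswith("=="):
            heading = stripped.strip("= ").lower()
            current = "references" if heading == "references" else "body"
            continue
        sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}

# Illustrative article text, not taken from the dataset itself.
article = """Alan Turing was a British mathematician.

== Career ==
He worked at Bletchley Park.

== References ==
1. Hodges, A. Alan Turing: The Enigma."""

parts = split_article(article)
print(parts["intro"])  # Alan Turing was a British mathematician.
```

A real pipeline would parse wikitext or the HTML dumps instead of plain text, but the intro/body/references partition follows the same pattern.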
Key features
- Largest available Wikipedia dataset
- Deduplication and quality validation
- Structured sections (intro, body, references)
- Multi-format export (Parquet, JSON, CSV)
- Documentation and reproduction scripts
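The deduplication and multi-format export steps above can be sketched with pandas (the column names, file names, and sample records are illustrative assumptions, not the dataset's actual schema):

```python
import pandas as pd

# Hypothetical sample of cleaned article records (illustrative only).
records = [
    {"title": "Alan Turing", "text": "Alan Turing was a mathematician."},
    {"title": "Alan Turing", "text": "Alan Turing was a mathematician."},  # duplicate
    {"title": "Ada Lovelace", "text": "Ada Lovelace was a mathematician."},
]

df = pd.DataFrame(records)

# Exact deduplication on article text, one possible validation step.
df = df.drop_duplicates(subset="text").reset_index(drop=True)

# Multi-format export, mirroring the Parquet/JSON/CSV outputs listed above.
try:
    df.to_parquet("wikipedia_en.parquet", index=False)  # needs pyarrow or fastparquet
except ImportError:
    pass
df.to_json("wikipedia_en.jsonl", orient="records", lines=True)
df.to_csv("wikipedia_en.csv", index=False)

print(len(df))  # 2 articles remain after deduplication
```

Production-scale deduplication of a full Wikipedia dump would typically use hashing or near-duplicate detection rather than exact string matching, but the export pattern is the same.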
Technologies used
Python, Scrapy, Pandas, Parquet, Hugging Face