Dataset · 2025
Dataset Wikipedia FR
Open-source French Wikipedia dataset for NLP model training
Completed project
Detailed description
A complete dataset extracted from French Wikipedia, cleaned and structured for training natural language processing (NLP) models. It contains French encyclopedic articles with their metadata, categories and internal links preserved, and is well suited to fine-tuning French-language LLMs.
Key features
- Complete French Wikipedia extraction
- Text cleaning and normalization
- ML-optimized Parquet format
- Preserved metadata and categories
- HuggingFace Datasets compatible
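The cleaning and normalization step listed above could look roughly like the sketch below. It is an illustration only, not the project's actual pipeline: the function name `clean_article`, the regular expressions and the output field names are all assumptions, and the handling of wikitext is deliberately simplified (for example, file links are treated like ordinary internal links).

```python
import re
import unicodedata

# Hypothetical patterns for a simplified wikitext cleanup (assumptions, not the project's code).
_REF_TAG = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", re.DOTALL)
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
_CATEGORY = re.compile(r"\[\[Catégorie:([^\]|]+)(?:\|[^\]]*)?\]\]")
_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def clean_article(title: str, wikitext: str) -> dict:
    """Return one record with cleaned text plus preserved categories and links."""
    # Drop footnote references.
    text = _REF_TAG.sub("", wikitext)

    # Drop templates; loop so that nested templates are removed from the inside out.
    previous = None
    while previous != text:
        previous = text
        text = _TEMPLATE.sub("", text)

    # Collect categories as metadata, then remove them from the body.
    categories = [name.strip() for name in _CATEGORY.findall(text)]
    text = _CATEGORY.sub("", text)

    # Collect internal link targets, then keep only the visible label in the text.
    links = [match.group(1).strip() for match in _LINK.finditer(text)]
    text = _LINK.sub(lambda m: m.group(2) or m.group(1), text)

    # Strip bold/italic quote markup, normalize Unicode to NFC, tidy whitespace.
    text = re.sub(r"'{2,}", "", text)
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()

    return {"title": title, "text": text, "categories": categories, "links": links}


sample = (
    "'''Paris''' est la [[capitale]] de la [[France|République française]]."
    "<ref>Source</ref> {{Infobox}}\n\n[[Catégorie:Capitale]]"
)
record = clean_article("Paris", sample)
```

Records shaped like this could then be collected into a Pandas DataFrame and written with `DataFrame.to_parquet`, which is one plausible way to produce Parquet files that HuggingFace Datasets can read.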
Technologies used
Python, Scrapy, Pandas, Parquet, HuggingFace