Dataset · 2025

Dataset Wikipedia FR

Open-source French Wikipedia dataset for NLP model training

Completed project

Detailed description

Complete dataset extracted from French Wikipedia, cleaned and structured for training natural language processing (NLP) models. It contains French encyclopedic articles with their metadata, categories and internal links preserved, and is well suited to fine-tuning French-language LLMs.

Key features

  • Complete French Wikipedia extraction
  • Text cleaning and normalization
  • ML-optimized Parquet format
  • Preserved metadata and categories
  • HuggingFace Datasets compatible
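
As an illustration of the text cleaning and normalization step listed above, here is a minimal sketch using only the Python standard library. The function name `clean_text` and the specific rules (NFC normalization, removal of numeric footnote markers such as `[1]`, whitespace collapsing) are assumptions made for this example; the project's actual cleaning rules are not documented here and may differ.

```python
import re
import unicodedata

# Hypothetical helper, not taken from the project: these patterns are
# illustrative assumptions about what "cleaning" might involve.
_REF_MARKER = re.compile(r"\[\d+\]")  # numeric footnote markers such as [12]
_WHITESPACE = re.compile(r"\s+")      # any whitespace run, incl. no-break spaces


def clean_text(raw: str) -> str:
    """Normalize Unicode to NFC, drop footnote markers, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)  # compose accents: e + U+0301 -> é
    text = _REF_MARKER.sub("", text)          # remove [1], [23], ...
    text = _WHITESPACE.sub(" ", text)         # newlines, tabs, U+00A0 -> one space
    return text.strip()


if __name__ == "__main__":
    sample = "Paris\u00a0est la capitale[1]  de la France.\n"
    print(clean_text(sample))  # prints: Paris est la capitale de la France.
```

Normalizing to NFC before any other step keeps accented French characters in a single composed form, which avoids the same word being tokenized in two different ways.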

Technologies used

Python, Scrapy, Pandas, Parquet, HuggingFace