Dataset Creation for LLMs

Python, Scrapy, BeautifulSoup, Pandas, Parquet, HuggingFace Datasets

Objective

Create high-quality datasets for training and fine-tuning language models. The focus is on French (an underrepresented language in most training corpora) and on multi-language code data.

Methodology

1. Ethical scraping with robots.txt compliance and rate limiting
2. Data cleaning and deduplication
3. Structuring in an optimized format (Parquet)
4. Quality validation with sampling
5. Publication on the HuggingFace Hub
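Step 1 can be sketched with the standard library alone. The `dataset-bot` user agent and the sample robots.txt below are illustrative, not the project's actual crawler configuration:

```python
import time
import urllib.robotparser

def build_policy(robots_txt: str, user_agent: str = "dataset-bot"):
    """Parse a robots.txt body into a can-fetch checker for our user agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(user_agent, url)

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        pause = self.delay - (time.monotonic() - self._last)
        if pause > 0:
            time.sleep(pause)
        self._last = time.monotonic()

# Illustrative robots.txt: everything allowed except /private/.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
allowed = build_policy(ROBOTS)
print(allowed("https://example.com/wiki/Page"))   # True
print(allowed("https://example.com/private/x"))   # False
```

Calling `RateLimiter.wait()` before each request keeps the crawl at or below one request per `delay` seconds, regardless of how fast pages are processed.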

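Steps 2 and 3 amount to hashing normalized text to drop duplicates, then writing the survivors to Parquet. A minimal pandas sketch; the sample records and the `corpus.parquet` filename are made up for illustration:

```python
import hashlib
import pandas as pd

def dedupe(records: list[dict]) -> pd.DataFrame:
    """Drop records whose normalized text is an exact duplicate."""
    df = pd.DataFrame(records)
    norm = df["text"].str.strip().str.lower()
    df["fingerprint"] = norm.map(lambda t: hashlib.sha256(t.encode()).hexdigest())
    return (df.drop_duplicates(subset="fingerprint")
              .drop(columns="fingerprint")
              .reset_index(drop=True))

records = [
    {"id": 1, "text": "Bonjour le monde.", "source": "wiki_fr"},
    {"id": 2, "text": "bonjour le monde. ", "source": "wiki_fr"},  # near-duplicate
    {"id": 3, "text": "Hello world.", "source": "wiki_en"},
]
df = dedupe(records)
print(len(df))  # 2 — the near-duplicate is gone
try:
    df.to_parquet("corpus.parquet", index=False)  # columnar, compressed storage
except ImportError:
    pass  # writing Parquet requires an engine (pyarrow or fastparquet)
```

Hashing a normalized copy of the text catches trivial duplicates (case, surrounding whitespace) while leaving the original text column untouched.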
Results

Wikipedia FR: 2M+ encyclopedia articles (addressing French language underrepresentation)
Wikipedia EN: 6M+ high-quality encyclopedia articles
StackOverflow: 32.5M+ technical Q&A entries (code generation, debugging)

All datasets are structured in optimized Parquet format with preserved metadata, cross-referenced links, and category hierarchies, ready for immediate fine-tuning with HuggingFace Transformers.
