Building My Own LLM
Objective
Deeply understand how Large Language Models work by building one from scratch. The goal isn't to compete with GPT or Claude, but to gain an intimate understanding of each component: tokenization, embeddings, attention mechanisms, and training loops.
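To make the attention component concrete, here is a minimal, illustrative sketch of causal scaled dot-product attention in PyTorch. The shapes and dimensions are arbitrary placeholders, not the ones used in the model described below.

```python
# Minimal sketch of causal scaled dot-product self-attention.
# Dimensions are illustrative only.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=True):
    """q, k, v: tensors of shape (batch, seq_len, d_head)."""
    d_head = q.size(-1)
    # Similarity between every query and every key, scaled so the
    # softmax stays well-conditioned as d_head grows.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    if causal:
        # Mask future positions so each token only attends to the past.
        seq_len = q.size(-2)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # weighted sum of the value vectors

# Toy usage: batch of 2 sequences, 8 tokens, 16-dim heads.
q = k = v = torch.randn(2, 8, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16])
```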
Architecture & Dataset
Model v0.1 Specifications
• Parameters: 58M (compact architecture for fast iteration; see the estimate sketched below)
• Training tokens: 1B
• Dataset: custom mix from my open-source datasets:
  - Wikipedia FR (2M+ articles)
  - Wikipedia EN (6M+ articles)
  - StackOverflow (32.5M+ Q&A)
  - GitHub (popular repositories with high star counts)

Infrastructure
• GPU: NVIDIA A100 40GB
• Cloud: Modal.com (serverless GPU)
• Framework: PyTorch + Transformers
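For context on where a 58M figure can come from, here is a back-of-the-envelope parameter counter for a GPT-style decoder. Every hyperparameter below (vocabulary size, model width, depth, context length) is an assumption chosen purely to land near 58M; the project's actual configuration is not specified here.

```python
# Rough parameter count for a GPT-style decoder with learned positional
# embeddings, tied input/output embeddings, and a 4x MLP expansion.
def gpt_param_count(vocab_size, d_model, n_layers, context_len, tied_embeddings=True):
    token_emb = vocab_size * d_model                 # token embedding table
    pos_emb = context_len * d_model                  # learned positional embeddings
    attn = 4 * d_model * d_model + 4 * d_model       # QKV + output projections (with biases)
    mlp = 8 * d_model * d_model + 5 * d_model        # 4x-expansion MLP (with biases)
    layer_norms = 4 * d_model                        # two LayerNorms per block
    block = attn + mlp + layer_norms
    head = 0 if tied_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * block + 2 * d_model + head

# Hypothetical config that happens to land near 58M parameters.
print(gpt_param_count(vocab_size=32_000, d_model=576, n_layers=10, context_len=1_024))
# ~58.9M
```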
Results & Challenges
Training Metrics
• Final loss: 6.5 (stable convergence)
• Compute: A100 40GB on Modal.com

Challenges Encountered
• Hyperparameter tuning: finding the right balance between learning rate, batch size, and warmup steps
• Learning rate scheduling: implementing a progressive warmup schedule that ramps the learning rate up at the start of training (see the sketch at the end of this section)
• Training optimization: iterative adjustments to avoid overfitting and improve generalization

Next Steps
• Scale to 150M+ parameters
• Add structured code data
• Experiment with different architectures (GPT-style vs. LLaMA-style)
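As a concrete illustration of the warmup scheduling mentioned under Challenges, here is a minimal sketch using PyTorch's LambdaLR: the learning rate ramps up linearly over the first steps and then, as an added assumption, decays with a cosine schedule. The model, optimizer, and step counts are placeholders, not the project's actual settings.

```python
# Sketch of linear warmup followed by cosine decay, driven by LambdaLR.
import math
import torch

model = torch.nn.Linear(10, 10)                        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 1_000, 20_000              # illustrative values

def lr_lambda(step):
    if step < warmup_steps:
        # Progressive warmup: scale the LR from ~0 up to its peak value.
        return (step + 1) / warmup_steps
    # Cosine decay after warmup (a common choice, assumed here).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # In a real loop: forward pass, loss.backward(), then the two calls below.
    optimizer.step()
    scheduler.step()
    if step % 5_000 == 0:
        print(step, scheduler.get_last_lr()[0])
```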