Building My Own LLM
Objective
Deeply understand how Large Language Models work by building one from scratch. The goal isn't to compete with GPT or Claude, but to gain an intimate understanding of each component: tokenization, embeddings, attention mechanisms, and training loops.
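To make the attention component concrete, here is a minimal, illustrative sketch of causal scaled dot-product attention in PyTorch. The shapes and dimensions are arbitrary placeholders, not the ones used in the model described below.

```python
# Minimal sketch of causal scaled dot-product self-attention.
# Dimensions are illustrative only.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=True):
    """q, k, v: tensors of shape (batch, seq_len, d_head)."""
    d_head = q.size(-1)
    # Similarity between every query and every key, scaled so the
    # softmax stays well-conditioned as d_head grows.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    if causal:
        # Mask future positions so each token only attends to the past.
        seq_len = q.size(-2)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # weighted sum of the value vectors

# Toy usage: batch of 2 sequences, 8 tokens, 16-dim heads.
q = k = v = torch.randn(2, 8, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16])
```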
Architecture & Dataset
Model v0.1 Specifications
• Parameters: 58M (compact architecture for fast iteration; see the estimate sketched below)
• Training tokens: 1B
• Dataset: custom mix from my open-source datasets:
  - Wikipedia FR (2M+ articles)
  - Wikipedia EN (6M+ articles)
  - StackOverflow (32.5M+ Q&A)
  - GitHub (popular repositories with high star counts)

Infrastructure
• GPU: NVIDIA A100 40GB
• Cloud: Modal.com (serverless GPU)
• Framework: PyTorch + Transformers
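For context on where a 58M figure can come from, here is a back-of-the-envelope parameter counter for a GPT-style decoder. Every hyperparameter below (vocabulary size, model width, depth, context length) is an assumption chosen purely to land near 58M; the project's actual configuration is not specified here.

```python
# Rough parameter count for a GPT-style decoder with learned positional
# embeddings, tied input/output embeddings, and a 4x MLP expansion.
def gpt_param_count(vocab_size, d_model, n_layers, context_len, tied_embeddings=True):
    token_emb = vocab_size * d_model                 # token embedding table
    pos_emb = context_len * d_model                  # learned positional embeddings
    attn = 4 * d_model * d_model + 4 * d_model       # QKV + output projections (with biases)
    mlp = 8 * d_model * d_model + 5 * d_model        # 4x-expansion MLP (with biases)
    layer_norms = 4 * d_model                        # two LayerNorms per block
    block = attn + mlp + layer_norms
    head = 0 if tied_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * block + 2 * d_model + head

# Hypothetical config that happens to land near 58M parameters.
print(gpt_param_count(vocab_size=32_000, d_model=576, n_layers=10, context_len=1_024))
# ~58.9M
```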
Results & Challenges
Training Metrics
• Final loss: 6.5 (stable convergence)
• Compute: A100 40GB on Modal.com

Challenges Encountered
• Hyperparameter tuning: finding the right balance between learning rate, batch size, and warmup steps
• Learning rate scheduling: implementing a progressive warmup schedule that ramps the learning rate up at the start of training (see the sketch at the end of this section)
• Training optimization: iterative adjustments to avoid overfitting and improve generalization

Next Steps
• Scale to 150M+ parameters
• Add structured code data
• Experiment with different architectures (GPT-style vs. LLaMA-style)
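As a concrete illustration of the warmup scheduling mentioned under Challenges, here is a minimal sketch using PyTorch's LambdaLR: the learning rate ramps up linearly over the first steps and then, as an added assumption, decays with a cosine schedule. The model, optimizer, and step counts are placeholders, not the project's actual settings.

```python
# Sketch of linear warmup followed by cosine decay, driven by LambdaLR.
import math
import torch

model = torch.nn.Linear(10, 10)                        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 1_000, 20_000              # illustrative values

def lr_lambda(step):
    if step < warmup_steps:
        # Progressive warmup: scale the LR from ~0 up to its peak value.
        return (step + 1) / warmup_steps
    # Cosine decay after warmup (a common choice, assumed here).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # In a real loop: forward pass, loss.backward(), then the two calls below.
    optimizer.step()
    scheduler.step()
    if step % 5_000 == 0:
        print(step, scheduler.get_last_lr()[0])
```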