
Writing an LLM from scratch, part 22 -- finally training our LLM!

Giles Thomas
October 16, 2025 at 01:42 AM

Key Takeaways

  • The author completed Chapter 5 of Sebastian Raschka's LLM book, finding the implementation phase exciting despite initial difficulty with cross-entropy loss (a rough loss sketch follows this list).
  • A small model trained on 20,000 characters produced surprisingly coherent text, which improved dramatically after loading pre-trained GPT-2 weights.
  • The author recommends typing all code manually but warns that replicating exact random outputs from the book is difficult due to unseeded randomness in auxiliary functions.
  • For validation, the author suggests focusing on the general trend of decreasing training loss rather than exact numerical matches.
  • The chapter moves into practical implementation details, including the use of optimizers beyond basic stochastic gradient descent.
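
The loss the chapter builds around is ordinary next-token cross-entropy, with perplexity as its exponential. As a rough illustration only, not the book's code, a minimal PyTorch sketch using made-up shapes and random tensors might look like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 2 sequences of 4 tokens, GPT-2's 50,257-token vocabulary.
batch_size, seq_len, vocab_size = 2, 4, 50257

# Random stand-ins for model logits and the shifted target token IDs.
logits = torch.randn(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# cross_entropy expects (N, C) logits and (N,) class indices, so flatten the
# batch and sequence dimensions together.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Perplexity is simply the exponential of the cross-entropy loss.
perplexity = torch.exp(loss)
print(loss.item(), perplexity.item())
```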

The author wraps up their notes on Chapter 5 of Sebastian Raschka's "Build a Large Language Model (from Scratch)", finding the concepts of cross-entropy loss and perplexity hard going but the subsequent coding highly rewarding once the model begins to generate text. After training a small model on a tiny dataset drawn from Edith Wharton's "The Verdict", the author observed surprisingly coherent output, and loading the 124M-parameter GPT-2 weights improved the generated text dramatically, suggesting the implementation works. The author strongly advises readers to type in and run all the code themselves, but cautions that matching the book's exact random outputs is difficult because various helper functions introduce hidden randomness that is hard to sequence identically. Nevertheless, as long as the training loss decreases steadily and the validation loss plateaus as shown in the book, the implementation can be considered successful. The chapter closes with a brief look at optimizers beyond simple SGD.
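
On the reproducibility and optimizer points, the usual pattern is to seed PyTorch's global RNG up front and to swap plain SGD for an optimizer such as AdamW. The sketch below is only an illustration of that pattern: the toy stand-in model, hyperparameters, and random data are invented here, not the book's GPT model or its loader over "The Verdict".

```python
import torch
import torch.nn.functional as F

# Seeding the global RNG makes weight init and shuffling repeatable, though, as
# the post notes, unseeded draws inside helper functions can still prevent an
# exact match with the book's printed outputs.
torch.manual_seed(123)

# Toy stand-in for the GPT model and data; the hyperparameters are made up.
vocab_size = 50257
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)
inputs = torch.randint(0, vocab_size, (2, 4))   # (batch, seq) of token IDs
targets = torch.randint(0, vocab_size, (2, 4))  # shifted-by-one targets

# AdamW: adaptive per-parameter learning rates plus decoupled weight decay,
# a common step up from plain stochastic gradient descent.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

for step in range(10):
    optimizer.zero_grad()
    logits = model(inputs)  # (batch, seq, vocab)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    loss.backward()
    optimizer.step()
    print(f"step {step}: training loss {loss.item():.3f}")
```

Even with the seed set, differences in how many random draws happen before and during training can still shift the exact numbers, which is why the author suggests checking the overall loss trend rather than chasing an exact numerical match.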
