Building a GPT from scratch: decoder for text generation

In this post, I discuss the decoding process of a simple GPT model that I trained for 150 epochs. After analyzing the training results, I demonstrate how I generate text from an input sequence using tokenization together with sampling techniques such as top-k sampling and temperature scaling. I explain how the model predicts the next token and how the process continues iteratively. Despite the limitations of this basic model, the article serves as a foundation for future improvements, especially as I explore adding attention mechanisms to enhance performance.

Model loading

I begin by loading the trained GPT model, the tokenized vocabulary, and other parameters required for text generation. Here is the code snippet for setting up the model:


import torch

# Load the tokenizer data and rebuild the id -> token mapping.
tokenizer_data = torch.load('./build/tokenized_data.pkl', weights_only=False)
vocab = tokenizer_data['vocab']
id_to_token = {idx: token for token, idx in vocab.items()}

# Run on GPU if available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the checkpoint together with the hyperparameters needed to rebuild the model.
checkpoint = torch.load('./build/gpt_model.pth', weights_only=False)
vocab_size = checkpoint['vocab_size']
d_model = checkpoint['d_model']

# Restore the trained weights (GPT is the model class from my training code)
# and switch to evaluation mode.
model = GPT(vocab_size=vocab_size, d_model=d_model).to(device)
model.load_state_dict(checkpoint['state_dict'])
model.eval()
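
For reference, the loading code above expects the two files to contain specific keys. A hypothetical sketch of the training-side save calls that would produce this layout (the actual training code may differ) is:

# Hypothetical sketch of the save calls that would produce the files and keys
# expected by the loading code above; the real training-side code may differ.
torch.save({'vocab': vocab}, './build/tokenized_data.pkl')
torch.save({
    'vocab_size': vocab_size,          # size of the vocabulary
    'd_model': d_model,                # embedding dimension
    'state_dict': model.state_dict(),  # trained weights
}, './build/gpt_model.pth')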

Tokenization and text generation

The generate_text function encodes the starting sequence into token IDs using my custom tokenizer and then drives the model's generate method, which produces the text token by token. Below is the key part of the generation logic:


def generate(self, input_ids, max_new_tokens, temperature=1.0, top_k=10):
    # Start from the prompt tokens and extend the sequence one token at a time.
    generated_ids = input_ids.squeeze(0).tolist()
    context_length = input_ids.size(1)

    for _ in range(max_new_tokens):
        # Feed only the most recent context_length tokens back into the model.
        input_tensor = torch.tensor(generated_ids[-context_length:], dtype=torch.long).unsqueeze(0).to(input_ids.device)
        logits, _ = self.forward(input_tensor)

        # Take the logits of the last position and apply temperature scaling.
        next_token_logits = logits[:, -1, :] / temperature

        # Keep only the top-k candidates, then sample from the resulting distribution.
        next_token_logits = top_k_logits(next_token_logits, top_k)
        probs = F.softmax(next_token_logits, dim=-1)  # F is torch.nn.functional
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        generated_ids.append(next_token_id)

    return torch.tensor(generated_ids, dtype=torch.long).unsqueeze(0).to(input_ids.device)
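
The top_k_logits call above refers to a helper that is not shown in this excerpt. A minimal sketch, assuming it simply masks every logit outside the k largest with -inf so the subsequent softmax gives those tokens zero probability, could look like this:

def top_k_logits(logits, k):
    # Hypothetical sketch of the top-k filter assumed above: keep the k largest
    # logits and push the rest to -inf so softmax assigns them zero probability.
    if k is None or k <= 0:
        return logits  # no filtering requested
    values, _ = torch.topk(logits, k, dim=-1)
    min_keep = values[..., -1, None]  # smallest logit that is still kept
    return torch.where(logits < min_keep,
                       torch.full_like(logits, float('-inf')),
                       logits)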

Together, the sampling loop and the top-k filter provide a structured way to pick the next token from the model's probability distribution: temperature scaling controls how random the choice is, and top-k filtering discards unlikely candidates.
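
The generate method itself is wrapped by the generate_text function used in the next section. Its body is not shown in this post; a minimal sketch, assuming a word-level vocabulary and the vocab / id_to_token mappings loaded earlier, could look like this:

def generate_text(model, start_sequence, max_new_tokens=50, temperature=1.0, top_k=10):
    # Hypothetical sketch consistent with how generate_text is called below:
    # encode the prompt with the word-level vocab, delegate sampling to
    # model.generate, and decode the ids back into a list of token strings.
    token_ids = [vocab[token] for token in start_sequence.split()]  # assumes every prompt word is in the vocab
    input_ids = torch.tensor(token_ids, dtype=torch.long).unsqueeze(0).to(device)

    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                temperature=temperature, top_k=top_k)

    return [id_to_token[idx] for idx in output_ids.squeeze(0).tolist()]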

Sample output

Here is an example starting sequence and generated output:


start_sequence = "Nel mezzo del cammin di nostra vita"
generated_sequence = generate_text(model, start_sequence, max_new_tokens=50)
# Drop the prompt tokens so that only the newly generated ones remain.
generated_only = generated_sequence[len(start_sequence.split()):]

print("Start sequence:", start_sequence)
print("Generated sequence:", " ".join(generated_only))

The model generated a continuation based on the famous line from Dante’s Divine Comedy:


Nel mezzo del cammin di nostra vita
e con le sue picciole onde piegava l'erba e con la sua mi 'ntrassi;
ed è quel ch'elli avea membro e 'l ciel per non ti sia da la mente
a la terra per che tu chi vide mei ciò sia, parole tue fami".
Virgilio li occhi dolenti ne li

Although the generated text lacks coherence, it provides insight into the model’s current limitations, which I plan to address by integrating attention mechanisms in the future.
