Building a GPT from scratch: model introspection and loss visualization

In this final part of my GPT model implementation, I focus on adding model introspection to better understand the architecture and its parameters. To monitor training, I visualize the loss using both simple and exponential moving averages, which smooth the curves and make it easier to evaluate performance.

Model introspection

To better understand the structure and configuration of the model, I implemented a model introspection function. It prints the key hyperparameters, such as the number of layers, the embedding dimension, and whether multi-head attention is used, and it also computes and displays the total number of trainable parameters.


class MyGPT(nn.Module):
    ...

    def introspect(self, compile_model=True, device='cpu'):
        # Move the model to the specified device
        self.to(device)

        # Optionally compile the model; the compiled wrapper is local to
        # this method and is only used for printing the architecture
        model = torch.compile(self) if compile_model else self

        # Print the model architecture
        print(model)

        # Print model configuration
        print(f"Vocab Size: {self.vocab_size}")
        print(f"Embedding Dim (d_model): {self.d_model}")
        print(f"Max Sequence Length: {self.max_len}")
        print(f"Hidden Dimension: {self.hidden_dim}")
        print(f"Dropout Probability: {self.dropout_prob}")
        print(f"Number of Attention Heads: {self.num_heads}")
        print(f"Number of Layers: {self.n_layers}")
        print(f"Using Multi-Head Attention: {self.use_multiple_head}")

        # Calculate total trainable parameters
        total_params = sum(p.numel() for p in self.parameters() if p.requires_grad)
        print(f"Total Parameters: {total_params:,} ({round(total_params / 1_000_000)}M)")

Loss visualization

Once one or more training runs have completed, I visualize the training and validation loss using either a simple or an exponential moving average to smooth the data. This gives a clearer view of the trend in the loss values by reducing step-to-step noise.


def simple_moving_average(data, window_size):
    # Average each point with the preceding (window_size - 1) points;
    # early points use a shorter window until enough data is available.
    sma = []
    for i in range(len(data)):
        if i < window_size:
            sma.append(sum(data[:i + 1]) / (i + 1))
        else:
            sma.append(sum(data[i - window_size + 1:i + 1]) / window_size)
    return sma

def exponential_moving_average(data, alpha):
    # Blend each new value with the running average:
    # ema[i] = alpha * data[i] + (1 - alpha) * ema[i - 1]
    ema = [data[0]]
    for i in range(1, len(data)):
        ema.append(alpha * data[i] + (1 - alpha) * ema[i - 1])
    return ema
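
To put these helpers to use, the smoothed curves can be plotted alongside the raw loss. The sketch below assumes matplotlib is available; the loss values, window size, and alpha are placeholders rather than values from an actual run.


import matplotlib.pyplot as plt

# Hypothetical per-step training losses recorded during a run
train_losses = [2.9, 2.7, 2.8, 2.5, 2.6, 2.3, 2.2, 2.1, 2.0, 1.9]

# Overlay the raw losses with both smoothed versions
plt.plot(train_losses, alpha=0.3, label='raw loss')
plt.plot(simple_moving_average(train_losses, window_size=3), label='SMA (window=3)')
plt.plot(exponential_moving_average(train_losses, alpha=0.2), label='EMA (alpha=0.2)')
plt.xlabel('training step')
plt.ylabel('loss')
plt.legend()
plt.show()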

This section concludes my GPT model implementation, in which I refined the attention mechanisms to handle more complex text generation tasks and, in this final part, added introspection and training loss visualization to make the model easier to understand and monitor.

You can access the full source on GitHub here.
