gpt2.v to-do list

This is a concrete, end-to-end hardware design plan that follows the steps in inf_hw.py. The goal is to replicate, bit for bit, our Python inference pipeline (integer-only, Q4.4 + Q8.8) in SystemVerilog. The plan covers how to time-multiplex shared resources (a single MAC unit, etc.) to meet area constraints, and shows how each function in the Python code maps to a specialized or shared hardware block. It ends with a directory tree showing how to organize the SystemVerilog source files and Cocotb testbenches.

1. Overall Approach: Replicate inf_hw.py in RTL

We have these fundamental stages in our single-layer GPT forward pass:

Embeddings: WTE (token embedding) and WPE (positional embedding), both Q4.4
Add: Sum Q4.4 embeddings → Q4.4
LN1: LayerNorm computed in Q8.8 → output Q8.8
Downshift: Q8.8 → Q4.4 for the next block
Attention: single-head, D_MODEL = 16, with approximate Softmax
LN2: LayerNorm in Q8.8 → downshift to Q4.4
MLP: FC1 → ReLU → FC2 + Residual, Q4.4
LNf: final LayerNorm in Q8.8 → downshift to Q4.4
LM Head: final linear layer → Q4.4

Then we do an ArgMax over the final vocab output. In hardware, each of these steps must replicate the integer math, saturations, shifts, and expansions exactly as in the Python code.
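
To make the saturation and shift requirements concrete, here is a minimal Python sketch of the fixed-point conversions the stages above rely on. It is not taken from inf_hw.py: the helper names are hypothetical, and it assumes Q4.4 is a signed 8-bit value, Q8.8 is a signed 16-bit value, downshifts truncate with an arithmetic right shift, and ArgMax ties resolve to the lowest index. If inf_hw.py rounds or breaks ties differently, the RTL must follow inf_hw.py, not this sketch.

```python
# Hypothetical helpers for illustration; names, rounding mode, and tie-break
# rule are assumptions and are not taken from inf_hw.py.

Q44_MIN, Q44_MAX = -128, 127        # signed 8-bit, 4 fractional bits
Q88_MIN, Q88_MAX = -32768, 32767    # signed 16-bit, 8 fractional bits


def saturate(value, lo, hi):
    """Clamp an integer to [lo, hi], as a saturating adder would."""
    return max(lo, min(hi, value))


def add_q44(a, b):
    """Saturating Q4.4 add (for example, token + positional embedding)."""
    return saturate(a + b, Q44_MIN, Q44_MAX)


def q44_to_q88(x):
    """Expand Q4.4 to Q8.8: four extra fractional bits, so shift left by 4."""
    return saturate(x << 4, Q88_MIN, Q88_MAX)


def q88_to_q44(x):
    """Downshift Q8.8 to Q4.4: arithmetic right shift by 4, then saturate.

    Python's >> floors toward negative infinity, which matches an
    arithmetic shift (>>>) on a signed value in SystemVerilog.
    """
    return saturate(x >> 4, Q44_MIN, Q44_MAX)


def argmax(values):
    """Index of the largest value; ties go to the lowest index (assumed)."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best
```

In this encoding the value 1.0 is 16 in Q4.4 and 256 in Q8.8, so `q44_to_q88(16)` and `q88_to_q44(256)` round-trip exactly, while a large Q8.8 value saturates at the Q4.4 maximum when downshifted.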

2. Key Hardware Modules

We break each piece of inf_hw.py into a SystemVerilog module or sub-module that can be heavily time-multiplexed to save area. The design uses one MAC (multiply-accumulate) unit, or a small number of them, plus a top-level state machine that configures the MAC for each step (Linear, LN multiply, etc.). Sharing a single MAC keeps area minimal, which suits tight budgets such as TinyTapeout's.
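
The cost of the single-MAC approach can be sketched with a small Python model of a time-multiplexed linear layer. This is an illustration only: `linear_single_mac` is a hypothetical name, and the sketch assumes Q4.4 inputs, weights, and biases, a wide accumulator at the Q8.8 product scale, and a truncating shift followed by saturation on output. Each inner-loop iteration stands for one use of the shared MAC, so the step count approximates the cycles the state machine would spend on that layer.

```python
def linear_single_mac(x, weights, bias, frac_bits=4, lo=-128, hi=127):
    """Compute y = W @ x + b in Q4.4 using one multiply-accumulate per step.

    Returns (outputs, mac_steps). All values are plain Python ints holding
    Q4.4 codes. The rounding and saturation policy here is an assumption.
    """
    outputs = []
    mac_steps = 0
    for row, b in zip(weights, bias):
        # Bias is Q4.4; align it to the Q8.8 product scale before accumulating.
        acc = b << frac_bits
        for w, xi in zip(row, x):
            acc += w * xi          # Q4.4 * Q4.4 gives a Q8.8-scaled product
            mac_steps += 1         # one pass through the shared MAC
        # Rescale back to Q4.4 (truncating shift) and saturate to 8 bits.
        y = acc >> frac_bits
        outputs.append(max(lo, min(hi, y)))
    return outputs, mac_steps


# Usage: a 2x2 layer with x = [1.0, 2.0] encoded as Q4.4.
y, steps = linear_single_mac([16, 32], [[16, 16], [8, -16]], [0, 16])
```

With D_MODEL = 16, a square 16x16 linear layer takes 256 MAC steps in this model, which is the latency traded for having only one multiplier.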