This is a concrete, end-to-end hardware design plan that directly references the steps in inf_hw.py. The goal is to replicate, bit for bit, our Python inference pipeline (integer-only, Q4.4 + Q8.8) in SystemVerilog. The plan below addresses how to handle time multiplexing (a single MAC unit, etc.) under area constraints, and shows how each function in the Python code maps to a specialized or shared hardware block. It ends with a directory tree showing exactly how to organize your SystemVerilog source files and Cocotb testbenches.
We have these fundamental stages in our single-layer GPT forward pass:
Then we do an ArgMax over the final vocab output. In hardware, each of these steps must replicate the integer math, saturations, shifts, and expansions exactly as in the Python code.
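To make "integer math, saturations, shifts" concrete, here is a minimal Python sketch of the kind of helpers the hardware has to match. It is not taken from inf_hw.py: the helper names (`saturate`, `mul_q4_4`, `q8_8_to_q4_4`, `argmax`) are hypothetical, and the truncating shift and the lowest-index tie-break are assumptions that should be checked against the actual Python code.

```python
# Hypothetical fixed-point helpers; names and rounding choices are assumptions,
# not the actual contents of inf_hw.py.

def saturate(value: int, bits: int) -> int:
    """Clamp value to the signed two's-complement range of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mul_q4_4(a: int, b: int) -> int:
    """Multiply two Q4.4 values; the raw product has 8 fractional bits (Q8.8)."""
    return saturate(a * b, 16)


def q8_8_to_q4_4(x: int) -> int:
    """Drop 4 fractional bits with an arithmetic shift, then saturate to 8 bits.

    Python's >> on ints floors toward negative infinity, which matches a
    signed arithmetic right shift in hardware (assumed truncation, no rounding).
    """
    return saturate(x >> 4, 8)


def argmax(values: list[int]) -> int:
    """Index of the largest value; ties go to the lowest index (assumption)."""
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i
    return best


if __name__ == "__main__":
    a = 24  # 1.5 in Q4.4 (1.5 * 16)
    b = 32  # 2.0 in Q4.4 (2.0 * 16)
    p = mul_q4_4(a, b)
    print(p, q8_8_to_q4_4(p), argmax([3, 9, 9, -1]))
```

Whatever conventions inf_hw.py actually uses for rounding and ties, the SystemVerilog must use the same ones, since a single differing bit in any stage can change the ArgMax result.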
We break each piece of inf_hw.py into a SystemVerilog module or sub-module that we can time-multiplex heavily for area savings. The approach uses one MAC (multiply-accumulate) unit, or a small number of them, plus a top-level state machine that configures them for each step (Linear, LN multiply, etc.). A single-MAC design keeps area minimal, which suits tight budgets such as TinyTapeout constraints.
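The time-multiplexing idea can be sketched as a small behavioral model in Python, which could also serve as a reference for a Cocotb testbench. Everything here is illustrative: `SharedMac` and `linear` are hypothetical names, and saturating the 16-bit accumulator on every step (rather than once at the end) is an assumption that must match whatever inf_hw.py does.

```python
# Behavioral sketch of one shared MAC reused across a whole Linear layer.
# Class/function names and the per-step saturation policy are assumptions.

def saturate16(value: int) -> int:
    """Clamp to the signed 16-bit range used for the Q8.8 accumulator."""
    return max(-32768, min(32767, value))


class SharedMac:
    """One multiply-accumulate unit; each step() models one clock cycle."""

    def __init__(self) -> None:
        self.acc = 0
        self.cycles = 0

    def clear(self) -> None:
        """Reset the accumulator before starting a new output element."""
        self.acc = 0

    def step(self, a: int, b: int) -> None:
        """Accumulate a*b (Q4.4 * Q4.4 -> Q8.8) with saturation."""
        self.acc = saturate16(self.acc + a * b)
        self.cycles += 1


def linear(mac: SharedMac, weights: list[list[int]], x: list[int]) -> list[int]:
    """Compute y = W @ x one product at a time, as a controller FSM would."""
    out = []
    for row in weights:
        mac.clear()
        for w, xi in zip(row, x):
            mac.step(w, xi)
        out.append(mac.acc)
    return out


if __name__ == "__main__":
    mac = SharedMac()
    w = [[16, 32], [-16, 16]]  # Q4.4: [[1.0, 2.0], [-1.0, 1.0]]
    x = [16, 16]               # Q4.4: [1.0, 1.0]
    print(linear(mac, w, x), mac.cycles)
```

The cycle counter makes the trade-off visible: a layer with R rows and C columns costs R*C MAC cycles on a single unit, which is the latency we pay for the area savings.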