Wednesday, August 12, 2009

on the way to optimized - a state machine step

So I've had to write a very optimized processing block. This seemed pretty daunting so I chose to ignore performance and begin with a very structured hand-written state machine. Got the process working and now onto the optimization.

Taking a state machine that is well written and making it fully pipelined is very easy! I can't give a wizard-like step-by-step process for this, but some simple ideas come to mind. If the process can be optimized, then its states must not be mutually exclusive, i.e. some of the processing steps can run in parallel with other steps. To create such a situation, take code out of your state machine and put it in separate flag-enabled blocks. Use these flags to control which steps run when. Sometimes you'll find that many of the steps can run without any flags at all, and all that matters is some process start flag or end flag.
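To make the "flag-enabled blocks" idea a bit more concrete, here is a minimal sketch. It is not taken from any real design: the module name, the two made-up steps (add one, then shift left) and all signal names are hypothetical and only there to show the shape of the code.

```verilog
// Hypothetical sketch: every name here is illustrative only.
module flag_enabled_steps(
    input clk,
    input rst,
    input start,             // process start flag
    input [7:0] din,         // sampled one clock after start
    output reg [7:0] dout,
    output reg done          // process end flag, high when dout is valid
);

reg step_a_en, step_b_en;    // one enable flag per step
reg [7:0] a_result;

// Flag chain: each flag hands off to the next step one clock later.
always @(posedge clk) begin
    if (rst) begin
        step_a_en <= 1'b0;
        step_b_en <= 1'b0;
        done      <= 1'b0;
    end
    else begin
        step_a_en <= start;
        step_b_en <= step_a_en;
        done      <= step_b_en;
    end
end

// Step A runs whenever its flag is set, no matter what step B is doing.
always @(posedge clk)
    if (step_a_en)
        a_result <= din + 8'd1;

// Step B can work on the previous item while step A takes the next one.
always @(posedge clk)
    if (step_b_en)
        dout <= a_result << 1;

endmodule
```

Because no step waits for a "state" to come around, start can be raised on back-to-back clocks and both steps stay busy at the same time.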

I have written a very simple example module to show how to do this. read_enable takes one clock to bring data in. The hardware_manipulation block takes 2 clocks to process and send data out.

When you look at the state machine you will see these steps:
0. request data (read_enable <= 1)
1. wait one clock while data is being retrieved
2. put received data into hardware_manipulation block (data_in <= data)
3. wait one clock while hardware_manipulation block is calculating
4. wait one more clock while hardware_manipulation block is outputting data
5. put hardware manipulated data out to result (result <= data_out) and go back to start

This complete process takes 6 clocks per result. The optimized version still takes 6 steps to produce one result, but its throughput is 3x higher because a new pass is started every 2 clocks.

When you look at the optimized version you will see these steps:
0. request data (read_enable <= 1)
1. wait one clock while data is being retrieved
2. put received data into hardware_manipulation block (data_in <= data) AND in parallel start from 0 again (ie request data again)
3. wait one clock while hardware_manipulation block is calculating
4. wait one more clock while hardware_manipulation block is outputting data
5. put hardware manipulated data out to result (result <= data_out)

Once the stages get going you will have stage 0 running with stages 2 and 4, and stage 1 running with stages 3 and 5.

Here's the example:

module process(
input clk,
input rst,

output reg read_enable,
input [7:0] data,

output reg [11:0] result
);

reg [7:0] data_in;
wire [7:0] data_out;

hardware_manipulation hw_manip(
.clk(clk),

.data_in(data_in),
.data_out(data_out)
);

/*
reg [31:0] state;
localparam READ = 0;
localparam READ_WAIT = 1;
localparam HW = 2;
localparam HW_WAIT_1 = 3;
localparam HW_WAIT_2 = 4;
localparam OUT = 5;

always @(posedge clk) begin
if(rst) begin
state <= READ;
result <= 0;
read_enable <= 0;
end
else begin
read_enable <= 0;
case (state)
READ: begin
read_enable <= 1;
state <= READ_WAIT;
end
READ_WAIT: begin
state <= HW;
end
HW: begin
data_in <= data;
state <= HW_WAIT_1;
end
HW_WAIT_1: begin
state <= HW_WAIT_2;
end
HW_WAIT_2: begin
state <= OUT;
end
OUT: begin
result <= data_out;
state <= READ;
end
endcase
end
end*/

reg [5:0] stages;
always @(posedge clk) begin
if(rst) begin
stages <= 1;
result <= 0;
read_enable <= 0;
end
else begin
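read_enable <= 0;
// advance every active stage flag by one position
stages <= {stages[4:0], 1'b0};
if(stages[0])
read_enable <= 1;
// re-launch stage 0 every 2 clocks (overrides bit 0 of the shift above)
if(stages[1])
stages[0] <= 1;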
if(stages[2])
data_in <= data;
if(stages[5])
result <= data_out;
end
end

endmodule


I have not simulated this block but it's simple so it should be correct... The idea comes across which is what's most important.
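If you do want to simulate it, something along these lines could serve as a starting point. This is only a sketch under several assumptions: the post never shows hardware_manipulation, so the stub below (add 1, two clocks of latency) is made up, the data source that answers read_enable one clock later is also made up, and it assumes the process module above compiles in your simulator in plain Verilog-2001 mode.

```verilog
`timescale 1ns/1ps

// Hypothetical stand-in for hardware_manipulation: it just adds 1 and
// takes 2 clocks from data_in to data_out.
module hardware_manipulation(
    input clk,
    input [7:0] data_in,
    output reg [7:0] data_out
);
reg [7:0] calc;
always @(posedge clk) begin
    calc     <= data_in + 8'd1;  // clock 1: "calculate"
    data_out <= calc;            // clock 2: "output"
end
endmodule

module process_tb;
reg clk = 0;
reg rst = 1;
reg [7:0] data = 0;
reg [7:0] next_data = 8'd10;
wire read_enable;
wire [11:0] result;

process dut(
    .clk(clk), .rst(rst),
    .read_enable(read_enable), .data(data),
    .result(result)
);

always #5 clk = ~clk;            // 10 ns clock period

// Assumed data source: presents a new value one clock after it sees read_enable.
always @(posedge clk)
    if (read_enable) begin
        data      <= next_data;
        next_data <= next_data + 8'd1;
    end

// Report every change of result together with the simulation time.
always @(result)
    $display("t=%0d ns: result = %0d", $time, result);

initial begin
    repeat (3) @(posedge clk);
    rst <= 0;
    repeat (30) @(posedge clk);
    $finish;
end
endmodule
```

With the pipelined version, result should change every 2 clocks once the stages are full; with the commented-out state machine it should change every 6.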

good luck with the optimization.

Thursday, August 6, 2009

How you use a DSP48E slice... or DSP48E tile...

Once again I find myself reading and tirelessly paging through Xilinx documentation in order to understand how to properly implement a DSP48E block. Of course, before I did this I just wrote my code and let the tools figure out what to do. This time I wanted to instantiate the block myself and perhaps get some added value from doing so. I can happily report that I've done it, and lowered the FF (Flip Flop) and LUT (Look Up Table) usage by a significant amount! Here are a few tips that might help you get started:

On the Virtex 5 chips you have columns of DSP48E tiles. A tile is 2 DSP48E slices arranged vertically. A slice is a single DSP48E block. The V5 syntax for location constraints (LOC) is DSP48_XcYr where c is the column and r is the row. Each Virtex 5 chip can have a different number of DSP48E columns. The DSP48E numbering is not related to the usual SLICE columns or rows; DSP48Es are counted separately. The bottom left DSP48E is DSP48_X0Y0, and the top right DSP48E for the SX95T is DSP48_X9Y63. That is 10 columns of 64 rows, which equates to 640 DSP48E slices (in 320 DSP48E tiles).

A DSP48E has a lot (emphasized) of functionality. Refer to ug193.pdf from Xilinx for detailed descriptions.

The embedded registers in the DSP48E and its ability to change its operation on a clock-by-clock basis save a lot of fabric FFs and LUTs. A lot of functionality that would typically be taken out of the DSP48E block can be kept inside by using its registers and different modes of operation.

Another very nice feature is PCIN/PCOUT. A lower DSP48E in a tile can transfer its output, without going out to the fabric, to the higher DSP48E in the same tile for a joint calculation. That calculation then does not have to be done in the fabric.

A few caveats:
PCIN/PCOUT must be connected via a wire bus of the FULL 48-bit width. The tools will give an error if you attempt to connect only a part of the bus. This is of course completely logical, but a more descriptive error and explanation would be nice. I'm sure the same applies to all the other dedicated silicon buses between DSP48E blocks, for the same reasons. Once PCIN and PCOUT are connected (and of course only between 2 DSP48E blocks, as these buses run directly between 2 adjacent DSP48E blocks), the tools will attempt to place the blocks so that the connection is valid. This means that if the tools cannot find a single tile to place these two DSP48E blocks into, in the correct order, Map will fail. You can force the location of DSP48E blocks using the LOC constraint, or the relative location using the RLOC constraint. U_SET is useful if you are trying to use RLOC and want that constraint to be relative to only a specific group of DSP48E blocks.
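As a rough illustration of the PCOUT to PCIN connection, here is a sketch of two cascaded DSP48E instances computing a0*b0 + a1*b1. Treat it as a starting point only: the module and signal names are hypothetical, the LOC values are examples, the port and parameter names are written as I understand them from ug193 and should be checked against the library guide, and LOC as a Verilog attribute is assumed to be honored by your synthesis tool (a UCF constraint is the alternative).

```verilog
// Sketch only, not a verified design: p = a0*b0 + a1*b1.
module dsp48e_cascade_sketch(
    input clk,
    input  [24:0] a0, a1,    // signed 25-bit multiplier inputs
    input  [17:0] b0, b1,    // signed 18-bit multiplier inputs
    output [47:0] p
);

wire [47:0] pc;              // PCOUT -> PCIN link, must be the FULL 48 bits

// Lower slice: P = a0 * b0 (OPMODE: X = M, Y = M, Z = 0)
(* LOC = "DSP48_X0Y0" *)     // example location only
DSP48E #(
    .AREG(1), .BREG(1), .MREG(1), .PREG(1),
    .USE_MULT("MULT_S"),
    .USE_PATTERN_DETECT("NO_PATDET")
) dsp_lower (
    .CLK(clk),
    .A({{5{a0[24]}}, a0}), .B(b0), .C(48'd0),
    .OPMODE(7'b0000101), .ALUMODE(4'b0000),
    .CARRYIN(1'b0), .CARRYINSEL(3'b000),
    .CEA1(1'b1), .CEA2(1'b1), .CEB1(1'b1), .CEB2(1'b1), .CEC(1'b1),
    .CEM(1'b1), .CEP(1'b1), .CECTRL(1'b1), .CEALUMODE(1'b1),
    .CECARRYIN(1'b1), .CEMULTCARRYIN(1'b1),
    .RSTA(1'b0), .RSTB(1'b0), .RSTC(1'b0), .RSTM(1'b0), .RSTP(1'b0),
    .RSTCTRL(1'b0), .RSTALUMODE(1'b0), .RSTALLCARRYIN(1'b0),
    .PCOUT(pc)
);

// Upper slice: P = a1 * b1 + PCIN (OPMODE: X = M, Y = M, Z = PCIN).
// One extra input register stage so its product lines up with the PCIN value.
(* LOC = "DSP48_X0Y1" *)     // example location only
DSP48E #(
    .AREG(2), .BREG(2), .MREG(1), .PREG(1),
    .USE_MULT("MULT_S"),
    .USE_PATTERN_DETECT("NO_PATDET")
) dsp_upper (
    .CLK(clk),
    .A({{5{a1[24]}}, a1}), .B(b1), .C(48'd0),
    .PCIN(pc),
    .OPMODE(7'b0010101), .ALUMODE(4'b0000),
    .CARRYIN(1'b0), .CARRYINSEL(3'b000),
    .CEA1(1'b1), .CEA2(1'b1), .CEB1(1'b1), .CEB2(1'b1), .CEC(1'b1),
    .CEM(1'b1), .CEP(1'b1), .CECTRL(1'b1), .CEALUMODE(1'b1),
    .CECARRYIN(1'b1), .CEMULTCARRYIN(1'b1),
    .RSTA(1'b0), .RSTB(1'b0), .RSTC(1'b0), .RSTM(1'b0), .RSTP(1'b0),
    .RSTCTRL(1'b0), .RSTALUMODE(1'b0), .RSTALLCARRYIN(1'b0),
    .P(p)
);

endmodule
```

The point to notice is that pc is declared as a full [47:0] wire and connected whole on both ends; the sum never leaves the DSP48E column until it comes out on p.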

Thumbs up to Xilinx for some excellent DSP blocks in the Virtex 5!

PS - Be aware of 2 errors in the Virtex 5 HDL Documentation:
The port is not CEMULTCARRY-IN but rather CEMULTCARRYIN.
The string value is not "NO_PAT_DET" but rather "NO_PATDET". - This error currently only comes out at the Map stage so will only be caught after the long Synthesis and Translate steps.
I've had Xilinx create 2 CRs to fix the documentation errors and the error reporting issue relating to this.

Good luck,