So I've had to write a very optimized processing block. This seemed pretty daunting so I chose to ignore performance and begin with a very structured hand-written state machine. Got the process working and now onto the optimization.
Taking a state machine that is well written and making it fully pipelined is very easy! I can't give a wizard like step-by-step process to do this, but some simple ideas come to mind. If the process can be optimized then that means that your states must not be exclusive. Ie: some of the processing steps will run in parallel with other steps. In order to create such a situation you want to take code out of your state machine and put it in separate flag enabled blocks. Use these flags to control which steps run when. Sometimes you'll find that many of the steps can run without the need for flags and all that's important is some process start flag or end flag.
I have written a very simple example module to show how to do this. read_enable takes one clock to bring data in. The hardware_manipulation block takes 2 clocks to process and send data out.
When you look at the state machine you will see these steps:
0. request data (read_enable <= 1)
1. wait one clock while data is being retrieved
2. put received data into hardware_manipulation block (data_in <= data)
3. wait one clock while hardware_manipulation block is calculating
4. wait one more clock while hardware_manipulation block is outputting data
5. put hardware manipulated data out to result (result <= data_out) and go back to start
This complete process takes 6 clocks. By using the optimized version you can see that this process takes 6 steps to complete but runs 3x faster b/c it's restarted every 2 clocks.
When you look at the optimized version you will see these steps:
0. request data (read_enable <= 1)
1. wait one clock while data is being retrieved
2. put received data into hardware_manipulation block (data_in <= data) AND in parallel start from 0 again (ie request data again)
3. wait one clock while hardware_manipulation block is calculating
4. wait one more clock while hardware_manipulation block is outputting data
5. put hardware manipulated data out to result (result <= data_out)
Once the stages get going you will have stage 0 running with stages 2 and 4, and stage 1 running with stages 3 and 5.
Here's the example:
module process(
input clk,
input rst,
output reg read_enable,
input [7:0] data,
output reg [11:0] result
);
reg [7:0] data_in;
wire [7:0] data_out;
hardware_manipulation(
.clk(clk),
.data_in(data_in),
.data_out(data_out)
);
/*
reg [31:0] state;
localparam READ = 0;
localparam READ_WAIT = 1;
localparam HW = 2;
localparam HW_WAIT_1 = 3;
localparam HW_WAIT_2 = 4;
localparam OUT = 5;
always @(posedge clk) begin
if(rst) begin
state <= READ;
result <= 0;
read_enable <= 0;
end
else begin
read_enable <= 0;
case (state)
READ: begin
read_enable <= 1;
state <= READ_WAIT;
end
READ_WAIT: begin
state <= HW;
end
HW: begin
data_in <= data;
state <= HW_WAIT_1;
end
HW_WAIT_1: begin
state <= HW_WAIT_2;
end
HW_WAIT_2: begin
state <= OUT;
end
OUT: begin
result <= data_out;
state <= READ;
end
endcase
end
end*/
reg [5:0] stages;
always @(posedge clk) begin
if(rst) begin
stages <= 1;
result <= 0;
read_enable <= 0;
end
else begin
read_enable <= 0;
stages <= {stages[4:0], 0};
if(stages[0])
read_enable <= 1;
if(stages[1])
stages[0] <= 1;
if(stages[2])
data_in <= data;
if(stages[5])
result <= data_out;
end
end
endmodule
I have not simulated this block but it's simple so it should be correct... The idea comes across which is what's most important.
good luck with the optimization.
No comments:
Post a Comment