Friday, December 10, 2010

dual clock fifo / ram - gray codes

A long time ago I tried to write a dual clock fifo. It was fun, and I somewhat succeeded. I have now done it again, only this time with a lot more knowledge and expertise.

The main issues are how to create a dual clock ram that is not dependent on the existence or speeds of the clocks involved, and how to keep track of full/empty/... signals in the different clock domains.

Creating a dual clock ram / dpr (dual port ram) is an interesting challenge. To do this properly, you can't just transfer requests from one domain to the other, b/c that's ugly... and the other clock might be off, it might be really slow in comparison, and bunches of other reasons. Perhaps you could create one really high speed clock, but that's ugly too, and costs more power, and is wasteful and more difficult to implement in my opinion.

So how do you do it? Use the wea (write enable port a), and web (write enable port b) as the inputs of an or gate which will be your clock to the ram elements. My code samples the write enables in their individual clock domains at the falling edge, and outputs a clean wea and web. Make sure the address decode logic is an input to the d of the wea and web, and not after. If the decode logic is after you will have lots of glitchy writes to addresses you didn't mean to write to. Use the we (wea or'ed with web) to clock the ram elements. Of course you will need a separate we signal for each ram element. So far this seems to work fairly reliably. You will need to check timing too...

Now onto the fifo full/empty signals. Obviously calculate the pointers against each other. But how do you get the pointer from the other side? Use a gray code!!! Convert the pointers to gray code in the originating clock domain (and sample them), then sample them twice in the destination clock domain. Then convert them back. That's the whole story. Pretty easy I'd say.
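
Here is a minimal sketch of that pointer crossing in SystemVerilog. It is not my actual fifo code; the module name, port names, and the width parameter W are made up for illustration, and there is no reset, so treat it as a starting point only.

```systemverilog
// Sketch of a gray-coded pointer crossing. All names here are hypothetical.
module gray_ptr_sync #(
  parameter int W = 4                 // pointer width in bits
) (
  input  logic         src_clk,       // clock of the domain that owns the pointer
  input  logic [W-1:0] src_bin,       // binary pointer in the source domain
  input  logic         dst_clk,       // clock of the domain that wants to see it
  output logic [W-1:0] dst_bin        // binary pointer as seen in the destination domain
);
  logic [W-1:0] src_gray;             // registered gray code (source domain)
  logic [W-1:0] meta, dst_gray;       // two sampling stages (destination domain)

  // Convert to gray and sample it, so the crossing wires come straight from flops.
  always_ff @(posedge src_clk)
    src_gray <= src_bin ^ (src_bin >> 1);

  // Sample twice in the destination domain.
  always_ff @(posedge dst_clk) begin
    meta     <= src_gray;
    dst_gray <= meta;
  end

  // Convert back: binary bit i is the XOR of gray bits W-1 down to i.
  always_comb begin
    for (int i = 0; i < W; i++)
      dst_bin[i] = ^(dst_gray >> i);
  end
endmodule
```

You would use one of these per direction: the write pointer crossing into the read domain for the empty calculation, and the read pointer crossing into the write domain for the full calculation.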

The reason you need a gray code is so that for a given increment of 1, the change in the gray code is always guaranteed to be only 1 bit of difference. The gray code I use is:
gray = (binary >> 1) ^ binary.

For example (binary -> gray):
000 -> (000 >> 1) ^ 000 = 000 ^ 000 = 000
001 -> (001 >> 1) ^ 001 = 000 ^ 001 = 001
010 -> (010 >> 1) ^ 010 = 001 ^ 010 = 011
011 -> (011 >> 1) ^ 011 = 001 ^ 011 = 010

Note that from binary 1 (001) to binary 2 (010) there are 2 bits that must change. If they are sampled without converting to gray code then you might sample the change from 1 to 2 as 011 (3). This happens when bit 1 arrives at the destination clock domain right before the rising clock edge and bit 0 arrives right after it:
0 0 0 0 0 0 0 0 0      bit 2
0 0 0 0 0 0 1 1 1      bit 1
1 1 1 1 1 1 1 1 0      bit 0
......................|...      clock

This is an example of how you could clock in a completely incorrect value. Of course if the clock had risen a few moments later, or even a few moments earlier, you wouldn't have gotten this value, but when dealing with unrelated clock domains, you have no way of guaranteeing when the clock will rise. Also note that this issue will come up regardless of how close you make the wires, b/c there will ALWAYS be a point in the middle where one of the bits has updated and the other hasn't.
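
If you want to see this happen in a simulator, here is a small sketch. The wire delays (1 ns for bits 2 and 1, 3 ns for bit 0) and the sample time are made-up numbers chosen to land the sample in the bad window; nothing here comes from a real design.

```systemverilog
`timescale 1ns/1ns

module binary_skew_demo;
  logic [2:0] bin = 3'b001;   // binary value 1 in the source domain
  wire  [2:0] at_dest;        // the same bits as they arrive at the destination
  logic [2:0] sampled;

  // Hypothetical wire delays: bits 2 and 1 take 1 ns, bit 0 takes 3 ns.
  assign #1 at_dest[2:1] = bin[2:1];
  assign #3 at_dest[0]   = bin[0];

  initial begin
    #10 bin = 3'b010;         // source increments from 1 to 2 at t = 10 ns
    #2  sampled = at_dest;    // destination clock edge at t = 12 ns
    $display("sampled %b (%0d)", sampled, sampled);   // sampled 011 (3)
    $finish;
  end
endmodule
```

At 12 ns bit 1 has already arrived (11 ns) but bit 0 has not (13 ns), so the destination sees 011 even though the source only ever held 001 and 010.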

When using gray code you get:
1 = 001
2 = 011

0 0 0 0 0 0 0 0 0      bit 2
0 0 0 0 0 0 1 1 1      bit 1
1 1 1 1 1 1 1 1 1      bit 0
......................|...      clock

No matter where the clock might rise in the change between value 1 and value 2, you will either get the previous (value 1) or the new (value 2) value. You will never sample a completely incorrect value since only one bit changes.
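
A quick way to convince yourself of the one-bit property is to check every increment, including the wrap from the last value back to 0. This is just a sketch of a simulation check; the 4 bit width and the names are assumptions for illustration.

```systemverilog
module gray_step_check;
  localparam int W = 4;   // assumed pointer width

  function automatic logic [W-1:0] to_gray(input logic [W-1:0] b);
    return b ^ (b >> 1);
  endfunction

  initial begin
    logic [W-1:0] cur, nxt;
    for (int i = 0; i < (1 << W); i++) begin
      cur = to_gray(W'(i));
      nxt = to_gray(W'(i + 1));   // the cast wraps 16 back to 0
      // Exactly one bit may differ between consecutive gray codes.
      assert ($countones(cur ^ nxt) == 1)
        else $error("step %0d -> %0d changes more than one bit", i, i + 1);
    end
    $display("gray step check done");
    $finish;
  end
endmodule
```

The wrap case matters for a fifo, since the pointers roll over constantly. It only holds when the pointer range is a power of 2.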

For those like me looking for an easy bit of code to create a gray code:

Gray Code Functions in VHDL: (from memory)

function to_gray ( b : std_logic_vector(3 downto 0)) return std_logic_vector is
 variable g : std_logic_vector(3 downto 0);
begin
 -- each gray bit is the binary bit xor'ed with the next higher binary bit
 g := b xor ('0' & b(3 downto 1));
 return g;
end function;

function from_gray ( g : std_logic_vector(3 downto 0)) return std_logic_vector is
 variable b : std_logic_vector(3 downto 0);
begin
 -- the top bit passes through, then work downwards one bit at a time
 b(3) := g(3);
 for i in 0 to 2 loop
  b(2 - i) := b(3 - i) xor g(2 - i);
 end loop;
 return b;
end function;

Although a more efficient way to do this would be to build a table of some sort...
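
For what it's worth, here is one way the table idea could look, written in SystemVerilog like the other posts rather than VHDL. It is a simulation-only sketch with made-up names; whether a table actually beats the xor chain depends on your tools and target, so check the synthesis results.

```systemverilog
module gray_table_demo;
  localparam int W     = 4;        // assumed pointer width
  localparam int DEPTH = 1 << W;

  logic [W-1:0] from_gray_tab [DEPTH];  // index: gray code, entry: binary value
  logic [W-1:0] bin, gray;

  initial begin
    // Fill the table by running the cheap direction (binary -> gray) over every value.
    for (int b = 0; b < DEPTH; b++) begin
      bin  = W'(b);
      gray = bin ^ (bin >> 1);
      from_gray_tab[gray] = bin;
    end
    // Round trip: decoding the gray code of each value must give the value back.
    for (int b = 0; b < DEPTH; b++) begin
      bin = W'(b);
      assert (from_gray_tab[bin ^ (bin >> 1)] == bin)
        else $error("round trip failed for %0d", b);
    end
    $display("gray 0110 decodes to %0d", from_gray_tab[4'b0110]);   // 4
    $finish;
  end
endmodule
```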

Sunday, September 12, 2010

Sadly Modelsim 6.6c has come out without fixing the bug I reported...

I reported a blatant bug in Modelsim's dealing with libraries. See:

And it's too bad that they haven't had the foresight to actually fix this problem.

What is nice is that they fixed a bunch of other bugs. There was one that was causing bad results and that had to do with queue slices when sent into tasks within other modules. That is now working correctly.

Saturday, August 28, 2010

Lattice / Xilinx FIFO resource issues

If you are like me, then when you design logic to interact with a FIFO, you make sure to carefully control your reads and writes. Since I do this for all FIFOs, it is very wasteful to have logic internal to the FIFO that checks whether I'm writing to a full FIFO, or reading from an empty FIFO.

Altera's tools allow you to choose whether or not you want such overflow and underflow logic in the FIFO. Xilinx and Lattice's tools do not. What this means is that every time you create and instantiate a FIFO with Lattice or Xilinx chips, there is logic that will negatively affect timing and use more resources, internal to the FIFO. This logic is so wasteful (and dangerous since it hides bugs), that I feel it important to point this out to the general public.

If you are using Lattice's IPExpress, the logic is shown in the Verilog or VHDL source files. This is great because it allows you to disable the logic with some simple comments.

For Xilinx, there doesn't appear to be a way to get at the source code that creates the logic, so I would strongly advise asking your FAE to request that Xilinx adds an option to Core Generator to allow for lighter weight FIFO creation that excludes this 'safety' feature.

In my system, I found significant savings by hacking the (Lattice) generated Verilog FIFO and disabling the empty and full checks. Since my system depends on correct read and write logic, these checks were useless to me. If I were reading when there was no data, or writing when there was no room, the system wouldn't work anyhow, so I MAKE sure that doesn't happen in my logic.
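
To make it concrete what kind of logic I'm talking about, here is a generic single clock FIFO sketch with the protection behind a parameter. This is NOT the Lattice, Xilinx, or Altera implementation; the module, the parameter name PROTECT, and everything else in it are my own illustration.

```systemverilog
module simple_fifo #(
  parameter int W       = 8,    // data width
  parameter int AW      = 4,    // address width, depth = 2**AW
  parameter bit PROTECT = 0     // 1 = ignore writes when full and reads when empty
) (
  input  logic         clk,
  input  logic         rst,
  input  logic         wr_en,
  input  logic [W-1:0] wr_data,
  input  logic         rd_en,
  output logic [W-1:0] rd_data,
  output logic         full,
  output logic         empty
);
  logic [W-1:0] mem [2**AW];
  logic [AW:0]  wr_ptr, rd_ptr;   // one extra bit to tell full from empty

  assign empty = (wr_ptr == rd_ptr);
  assign full  = (wr_ptr == {~rd_ptr[AW], rd_ptr[AW-1:0]});

  // With PROTECT = 0 these reduce to plain wr_en / rd_en, so full and empty
  // drop out of the enable paths entirely.
  wire do_wr = wr_en && !(PROTECT && full);
  wire do_rd = rd_en && !(PROTECT && empty);

  always_ff @(posedge clk) begin
    if (rst) begin
      wr_ptr <= '0;
      rd_ptr <= '0;
    end else begin
      if (do_wr) begin
        mem[wr_ptr[AW-1:0]] <= wr_data;
        wr_ptr <= wr_ptr + 1'b1;
      end
      if (do_rd) begin
        rd_data <= mem[rd_ptr[AW-1:0]];
        rd_ptr  <= rd_ptr + 1'b1;
      end
    end
  end
endmodule
```

With PROTECT = 1 the full and empty flags sit in front of the memory write enable and both pointer increments, which is the extra path I would rather not pay for when the surrounding logic already guarantees legal accesses.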

Thursday, August 19, 2010

gvim 7.3 64-bit (x64) Windows 7, Vista, XP - build instructions and installer

This is an update to:

You can download the actual installer from: (64 bit only, includes all patches up to and including 7.3.003).

To create the 7.3 version, you must download:

There are no extra or lang archives. Everything is included in the main archive.

Download the 7.3 patches:

Follow all of the previous instructions (except for what relates to extra and lang archives).

Also modify the gvim.nsi file to change all instances of gvimext64.dll to gvimext.dll.

Other than that, it is all the same.

Monday, August 16, 2010

WRT54GL - GoogleWiFiSecure - OpenWRT

It had to be an eventuality that someone would try to get OpenWRT onto Google's secure wifi network, and that ended up being me. As I couldn't find any other information I had to figure this out for myself.

Here it is:
Use Backfire trunk (currently it's approximately 10.03.1 RC1). Use wpa_supplicant (instead of wpad or wpad-mini), and set it to use libopenssl. Add libopenssl. Add luci-medium too. Use the 47xx Broadcom as opposed to the old 2.4 kernel. Compile. Flash. Look for excellent compilation instructions on OpenWRT's website. I used Debian Lenny in a virtual machine (Oracle VirtualBox) and it went very smoothly.

After compiling and flashing, reboot and then telnet into the system and edit /lib/wifi/ line 84 from:

It's a bug and hopefully they will fix it. The bug is either in LUCI or wpa_supplicant (probably depends who you ask), but this will get it working either way.

Now go into luci and set the settings. "Path-to-certificate" can be left blank. It will not affect anything (other than to verify server, but Google's WIFI doesn't support that). Set all the other settings as Google instructs:

To connect to GoogleWiFiSecure, create a new network connections profile with the following settings:

SSID: GoogleWiFiSecure (case-sensitive)

WPA/WPA2 Settings
Encryption method: AES (preferred) or TKIP
Authentication Protocol: MS-CHAP-V2 or CHAP
Trusted Server Name for Authentication:

Under the 802.1x settings be sure to enter the username and password that you obtained earlier.

Reboot router and it should work.

Currently this is a NAT configuration. I'm sure with more work you can configure it to be a bridge, but as I passed the router back to my friend, and I have my own internet, I don't care anymore...

Here is a link to the compiled .trx file. You'll need the ability to burn the trx file.
Here is a link to the .config file I used: (rename to .config)

PS - For the config, you just do make defconfig, and then choose the Broadcom BCM947xx/953xx (WITHOUT the 2.4!). If you use 2.4 then you won't have the 802.1x authentication which is required for Google's secure Wifi.
Then add wpa-supplicant (remove wpad-mini), and set wpa_supplicant to use libopenssl instead of internal. Then add libopenssl. Also add luci-medium if you want graphical configuration.
Everything else will work fine.

And YES - BCM947xx is the correct target, I know, I know, but the WRT54GL uses the BCM5352. Don't worry, it is the same chipset or something as the BCM947xx series...

This is it.

I'm sure that can be changed when creating the image, but I never bothered with it, cause I don't care enough. If you make that change and would like to share how then by all means post a comment. Also note that it seems that after a change is made you will have to power cycle the router to get it to get back onto the wifi network. I don't know why, but I'm happy enough that it gets there once so I didn't bother trying to debug this. A mostly impossible step given my limited knowledge of how OpenWRT works anyhow.

As I don't have this router anymore, I'm probably not gonna update this blog much more unless someone posts an interesting comment which should be added.

Hopefully this will help someone.

Friday, August 13, 2010

SystemVerilog - Interface with modports - master with multiple slaves

Playing around with Quartus 10.0, and so far this is synthesizable. This code uses a generate block to create multiple modports such that I can easily connect many slaves to a single interface programmatically where the interface has one bit per slave in a single bus.

This creates a 6-to-1 multiplexer sourced from 6 latches. Each latch has its own preset, clear, and select, and they all use a shared data and enable input.

Note the use of $size. It could have been done in other ways I'm sure.
Note the use of 'slave_modport_gen' in the instantiation of the slaves. This is how to reference the separate modports that are created in the interface. The need for separate modports is b/c I need to control different bits of q_bus for each modport.
Note the use of .sel(sel_in[i]) in the modport port direction descriptions. This is an alias (aliased to sel inside the module), also required to have the same module be able to control the desired bit of q_bus.

`timescale 1ns/1ns

interface bus # (
 parameter NUM_SLAVES = 1
) (
 input [NUM_SLAVES - 1:0] sel_in,
 input e_in,
 input d_in,
 input [NUM_SLAVES - 1:0] s_in,
 input [NUM_SLAVES - 1:0] r_in,
 output q_out
);

 logic q_bus [NUM_SLAVES - 1:0];

 modport master(
  input sel_in,
  output q_out,

  input q_bus
 );

 // one modport per slave, each aliased to its own bit of the buses
 genvar i;
 for(i = 0; i < NUM_SLAVES; i++) begin : slave_modport_gen
  wire e_comb;
  assign e_comb = sel_in[i] && e_in;
  modport slave(
   input .e(e_comb),
   input .d(d_in),
   input .s(s_in[i]),
   input .r(r_in[i]),
   output .q(q_bus[i])
  );
 end
endinterface

module master(
 bus.master m
);
 integer i;
 // mux: pass through the latch output of the selected slave
 always_comb begin
  m.q_out = 0;
  for(i = 0; i < $size(m.sel_in); i++) begin
   if(m.sel_in[i])
    m.q_out = m.q_bus[i];
  end
 end
endmodule

module slave (
 interface s
);
 // latch with enable, preset and clear
 always_latch begin
  if(s.e)
   s.q <= s.d;
  if(s.s)
   s.q <= 1;
  if(s.r)
   s.q <= 0;
 end
endmodule

module top # (
 parameter NUM_SLAVES = 6
) (
 input [NUM_SLAVES - 1:0] sel_in,
 input e_in,
 input d_in,
 input [NUM_SLAVES - 1:0] s_in,
 input [NUM_SLAVES - 1:0] r_in,
 output q_out
);

 bus # (
  .NUM_SLAVES(NUM_SLAVES)
 ) bus_inst(
  .*
 );

 master master_inst(bus_inst);

 genvar i;
 for(i = 0; i < NUM_SLAVES; i++) begin : slave_gen
  slave slave_inst(bus_inst.slave_modport_gen[i].slave);
 end
endmodule

Wednesday, August 11, 2010

Modelsim - simulating from 2 vendors at once - Verilog

Here's an interesting problem. You want to simulate code from Lattice along with code from Xilinx. (For example when you are simulating a Xilinx PCIe Root Complex against a Lattice PCIe Endpoint!)

Of course you couldn't attempt this without partitioning into libraries (because of name conflicts). So perhaps you have a library called lattice_work and another called xilinx_work. When compiling the code there are primitives like NAND, NOR, XOR... that are used in both sets of code. The Lattice code will need to use Lattice's primitives, and the Xilinx code will need to use Xilinx's primitives.

You can use the -v and -y options to have the primitives found and compiled into the lattice_work and xilinx_work libraries, and then you can use the -L work trick (see previous blog post).

The problem with this method is that now whenever you run vlog with the -incr flag, it will not only check if the Lattice and Xilinx code is up to date, it will also go and check the primitives! That's a huge waste of time! Also, your lattice_work and xilinx_work libraries will get filled with lots of primitives that you don't care to see.

What I like to do is to compile all the primitives into separate libraries. For example I have for Lattice, ecp3_work and pmi_work. For Xilinx I have unisims_work and secureip_work. The primitives are in ecp3_work and unisims_work. This is a nice method, and of course saves time since I never have to recompile or check for updates on these files. The drawback is that the Lattice code and Xilinx code will arbitrarily pick out the NAND, NOR, XOR, etc primitives and will of course cause lots of errors.

I prefer to not have to add source code modifiers, i.e. I'd like the source code to not have to change in order to fix this. `uselib is a great solution, but it normally has to be placed in the source, ahead of the code it applies to. My solution has been to create uselib_xilinx.v and uselib_lattice.v files and add the matching file to the vlog command ahead of each source file.

For example, instead of calling
vlog lattice_code.v
I now call
vlog uselib_lattice.v lattice_code.v

uselib_lattice.v looks like this:
`uselib dir=ecp3_work dir=pmi_work

This works because the `uselib directive persists through the vlog command. So even if you call vlog with many files, the `uselib will stay in effect. This of course would be problematic if you have additional `uselib directives in your code, since they will override the one from uselib_lattice.v or uselib_xilinx.v.

I wish Modelsim would fix their library issues. The solution of -L work is very problematic, and leads to the bug I describe elsewhere on this blog. If they would only add some nice switches into the vlog and vsim commands it would make life much easier.

SystemVerilog disable fork

The short of it is that 'disable fork' is a pain. The long of it is that you have no choice, and perhaps an example might make your lives easier:

disable fork;

Why wrap it in another fork-join pair? This is due to how disable fork works. disable fork kills ALL of the processes that the calling process has forked, along with their descendants, not just the most recent fork. If the above code is inside of a task and calls disable fork without being wrapped, it will kill every fork that the calling thread has spawned so far, including ones started long before the task was called. Wrapping it in its own fork-join gives it a fresh process, so only the forks inside the wrapper get killed. I KNOW! It's ludicrous! Yes, but this is how it works.
What about named forks? Also problematic. This is because when you use a named fork and then call disable on that name, it will disable all instances of the same named fork. If you have multiple threads running the same task and both are in the named fork, then one disable will kill both of them. I KNOW! It's ludicrous! ...
Any other options? I don't know. This seems to be the only way to do the job in SystemVerilog. Perhaps you could avoid semaphores and do everything with wait statements or events, but for now I prefer to use semaphores as they are an item I can 'new', and therefore they allow for much more flexibility.
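
Here is a self-contained sketch of the wrapped disable fork described above, using a timeout race instead of semaphores to keep it short. The task name, the event, and the delays are all made up for illustration.

```systemverilog
`timescale 1ns/1ns

module disable_fork_demo;
  event done;   // never triggered in this demo, so the timeout always wins

  task automatic wait_done_or_timeout(input int timeout, output bit timed_out);
    fork begin            // wrapper: gives the code below a process of its own
      fork
        begin @(done);  timed_out = 0; end
        begin #timeout; timed_out = 1; end
      join_any
      disable fork;       // kills only the loser of the race above
    end join
  endtask

  initial begin
    bit timed_out;
    // A thread forked by the same process that later calls the task.
    fork
      forever #7 $display("%0t: background thread still alive", $time);
    join_none
    #1;
    wait_done_or_timeout(10, timed_out);
    $display("%0t: timed_out = %0b", $time, timed_out);
    #20 $finish;
  end
endmodule
```

The background thread keeps printing after the task returns. If you delete the outer fork begin / end join, the disable fork runs in the initial block's own process and takes the background thread down with it.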

Saturday, July 31, 2010

DDR3 Fly-by-topology and Write-leveling

The key is that the DQ/DQS signals are directly connected to each rank (or set of chips). So where the controls (clock, command, address, etc...) are connected in fly-by topology, the data and strobes are connected directly. This is why write-leveling is important. It indicates the skew between when the clock (and other control signals) arrive(s) and when the data (and strobes) arrive(s).

Also this is a nice explanation of rank and other stuff:

Thursday, July 29, 2010

SystemVerilog - Interfaces vs Classes

So I've been designing Verilog for a while and I recently started playing with SystemVerilog. As a programmer with C++ experience, I'm very familiar with OOP (Object Oriented Programming). But this didn't give me a complete understanding of SystemVerilog's paradigm. Here are the key differences in ideas between Interfaces and Classes.

Interfaces are like pointers to the actual signals. Interfaces live in the same realm as modules. Just like modules must be instantiated statically and not within a procedural block, so too must interfaces. Say you have a UART module (which contains the low level UART RX (receiver) and TX (transmitter)), and you have 3 different sets of UART wires (for 3 different UART ports). You may also have a UARTTest class which takes a UART and runs some tests on it. How do you build this?

Keep in mind that you don't want to always have to pass signals along. So in theory I want to create a UARTTest instance and have that instance store inside of it the signals required to control an individual UART.

You can't pass module instances around... This is where Interfaces come in. Here's an attempt without interfaces:

reg uart_tx [2:0]; //3 separate uart tx lines for 3 separate uarts
reg uart_rx [2:0]; //3 separate uart rx lines for 3 separate uarts
//not bothering to declare the rest of the signals... or anything else

UART uart_inst0(.tx(uart_tx[0]), .rx(uart_rx[0]), ...);
UART uart_inst1(.tx(uart_tx[1]), .rx(uart_rx[1]), ...);
UART uart_inst2(.tx(uart_tx[2]), .rx(uart_rx[2]), ...);

initial begin
UARTTest t0;
UARTTest t1;
UARTTest t2;
t0 = new;
t0.runTest(uart_tx[0], uart_rx[0], ...);
t1 = new;
t1.runTest(uart_tx[1], uart_rx[1], ...);
t2 = new;
t2.runTest(uart_tx[2], uart_rx[2], ...);
end

The reason I have to send each signal above is because I have no way to store the signals in the class. Of course the main problems here are the verbosity of such calls. (I'm pretty sure what I've written above will work (with ref of course)... but even that's kind of unclear... I'll have to test this later.) Anyhow, the nicer and more proper way to do this is:

Create an interface with an rx and tx signal, and then instantiate them.

//signals are not declared at the top level, they're stuck in the interface.
uart_interface uart_interf_inst [2:0] ();

UART uart_inst0(.tx(uart_interf_inst[0].tx), .rx(uart_interf_inst[0].rx), ...);
UART uart_inst1(.tx(uart_interf_inst[1].tx), .rx(uart_interf_inst[1].rx), ...);
UART uart_inst2(.tx(uart_interf_inst[2].tx), .rx(uart_interf_inst[2].rx), ...);

initial begin
UARTTest t0;
UARTTest t1;
UARTTest t2;
t0 = new(uart_interf_inst[0]);
t1 = new(uart_interf_inst[1]);
t2 = new(uart_interf_inst[2]);
end

Now the interface is stored in the class, and can be used. Since it's an interface, it actually controls real signals to the UART module. Remember to use the virtual keyword on the parameter to new, as in:

function new(
virtual uart_interface interf
);

virtual for an interface implies something like a pointer to an interface. So it ends up being a pointer to the structure that contains all the pointers to the signals. Exactly what you need.
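
Putting the pieces together, here is a complete sketch that should compile on its own. There is no UART module in it, the class just wiggles rx through the virtual interface, and the body of runTest is made up purely to show that the class reaches the real signals.

```systemverilog
interface uart_interface;
  logic tx;
  logic rx;
endinterface

class UARTTest;
  virtual uart_interface vif;   // the "pointer" to one interface instance

  function new(virtual uart_interface interf);
    vif = interf;               // store it, no need to pass signals around later
  endfunction

  task runTest();
    vif.rx = 1'b0;              // drives the real signal inside the interface
    #10 vif.rx = 1'b1;
  endtask
endclass

module top_tb;
  uart_interface uart_interf_inst [2:0] ();   // 3 instances, one per UART

  initial begin
    UARTTest t0, t1, t2;
    t0 = new(uart_interf_inst[0]);
    t1 = new(uart_interf_inst[1]);
    t2 = new(uart_interf_inst[2]);
    fork
      t0.runTest();
      t1.runTest();
      t2.runTest();
    join
    $display("uart_interf_inst[1].rx = %b", uart_interf_inst[1].rx);
    $finish;
  end
endmodule
```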

Monday, July 26, 2010

Help!!! - DDR3 ODT (On Die Termination)???

Gosh, figuring this stuff out is difficult. Anyone know how this actually works? Why is Rtt Nom variable? When is it in use? If Dynamic ODT is on then Rtt Nom is really only in use when nothing is going across the wires. Rtt Wr is used during a write, and Rtt is disabled during a read. So what is Rtt Nom really used for? It can't be for when Dynamic ODT is off, because that still wouldn't explain why there are 2 settings. Is it needed to keep the wires clean and not bouncing when nothing is going across them? Perhaps, but then why are there 6 possible values? What am I missing????

I can't continue until I completely understand this stuff. Perhaps I need to take some analog classes, or read some analog for dummies books, and then this would become clear... Although I doubt it.

Saddened by my lack of understanding here.

It makes me so happy to learn something new, and that happened in spades tonight! Well at least I learned a bunch of analog stuff, and got a great explanation for Rtt Nom.

The key is that when there are multiple DDR3 chips hooked up, and you are only writing to one of them, then Rtt Wr could be used for the one you are writing to, but the other one which shares the same DQ/DQS/DM lines needs to be terminated with Rtt Nom which must be calibrated because it will affect the signal integrity of the data being written. This is the explanation. I can't claim complete understanding or comprehension since I've not got near enough analog knowledge or experience, but for my simplistic understanding, this suffices very well!

So much happier now :)

Friday, July 23, 2010

modelsim scripting - tcl

Spent some time scripting with Modelsim, and here are the lessons:

quietly / transcript off | on | quiet ...

Modelsim by default echoes all set commands to the console. This is b/c of Modelsim's transcript setting. You can either call 'quietly set ...' or just set 'transcript off' at the top. echo will stop working, but you shouldn't use that anyhow, use 'puts ...' in place of 'echo ...' for compatibility across operating systems.

onerror {resume} ... doesn't actually do everything you want it to...

When I use a script with nested procs, and loops, and far down in the stack I call 'eval vlog ...', and the vlog fails, Modelsim will NOT continue from where it failed. This is pretty useless to me as I don't need it to continue further on, but right there. So one file failed vlog, continue compiling the rest! The solution is to 'catch' the error. Wrap 'eval vlog' in a catch. Something like 'catch {eval vlog ...}'. Use braces and not square brackets: with brackets Tcl runs the vlog before catch ever sees it, so the error still escapes. You can also use onerror to perform any error level logging and stuff, but this way it will still continue right where you were.

argv doesn't exist

As most people know, and as is written up on the web a lot, Modelsim hijacks argv. They give you argc and all the arguments appear as %1 ... %n where n is argc. argv is set to '-gui', and that's it.

Wednesday, July 21, 2010

Altera - Ease of use leader? Not on your life!

I've used Altera a few times in the last 10 years. Wrote a few FPGAs, used Quartus, SignalTap, and Megawizard. But I've never had the chance to really compare Xilinx and Altera. Now I've actually had the opportunity to do a side by side comparison of Xilinx's and Altera's PCI Express cores.

Here it is: I was comparing Xilinx and Altera's PCIe root complexes to see which one simulates faster. First challenge was to create the cores, and learn how to instantiate them. I used Virtex 6 and Stratix IV as the FPGAs to instantiate for. I used the vendor provided cores, and not the 3rd party ones. I started with Xilinx.

Xilinx's instantiation wizards are a bit ugly, but they create the core well enough. For Xilinx I started by checking if there were any compile time defines I needed, and it looked like I didn't need anything special. So I found the top level file for the core, and started to manually instantiate it in a basic top_tb.v file. I then started to compile just to see what I'd need to get the simulation moving. There were a bunch (if not all) of the files in one directory which I needed, and I also needed the secureip and unisims libraries. Once I got the simulation to load, I went in to my top_tb, and added in the reference clock as needed. I then saw the tx p/n ports start wiggling as expected. For Xilinx it was easy to see that I needed a parameter to put it into simulation mode - a mode which shortens certain processes in the physical level training algorithm. At this point, I connected my endpoint and saw the link come up very quickly.

Now on to Altera.

Altera's instantiation wizards are much nicer. They also create the core as expected. I began in the same way as Xilinx by going through and checking if I needed special compile time defines. Like Xilinx, there weren't any. What I did notice which surprised me was the use of defparams, but that's not a problem b/c it's inside their code and doesn't affect me. I now took the top level module and began instantiating it in a new top_tb.v. Altera's core required me to compile altera_mf, sgate, 220model, stratixiv_hssi_atoms, stratixiv_pcie_hip, and some others which I can't recall at the moment. I also had to compile a number of files that were in the core's tree. Once I got all the modules needed to load the simulation, I was able to run. Altera's core loads up with a large number of mis-matched port connections. Altera's core didn't bother to connect many output ports in many places (presumably b/c they weren't using them). They also connected some ports with incorrect sizes. This fills the Modelsim console with a bunch of warnings. At this point I added the clock to top_tb.v, and set the lowest bit of test_in to 1 as appropriate for simulations. I expected to quickly see tx p/n start wiggling. Instead, I saw the tx p/n pins wiggle only after a very long time. I didn't bother to connect the endpoint.

Now for the important stuff:

Altera's simulations are slow as molasses!

Altera's PCIe core runs many times slower than Xilinx's or Lattice's. I have run all three and I was shocked. Truth is at first I thought Altera's core wasn't working until I realized that I had to give it more time. That is how slow it was. At least 2x slower if not closer to 4x slower than either of the other 2 vendors that I've worked with.

Altera's core is confusing and unintuitive!

I was trying to find the signal that says that the transaction layer is ready. Something like Lattice's dl_up, and Xilinx's trn_dl_up. There is no such thing in Altera's core. To see that Altera's core is up you must look at the ltssm status. The core provides multiple clocks that you need to set, many different resets, a test_in and test_out array, and in general lots of signals that are shockingly unclear.

Altera's port list is very old style, and separates the ports based on input and output!

The top level port list is old Verilog style. I don't have any problem with that of course. What is disconcerting is that the list has 2 sections, inputs and outputs. This is completely unhelpful. Xilinx's core has the ports separated by function: Common, TX, RX, CFG ... This is how I would have liked Altera to do it. Perhaps there's a reason for this, but it seems to follow the same paradigm of providing a very unfriendly RTL.

Altera's provided simulation is a mess!

I tried to understand the paradigm of Altera's PCIe core by looking at the simulation, and I can only use the word appalled. The method they use to reset the simulation, is some form of counter at which point they start things running. I was looking for this to find Altera's recommended method of seeing if the transaction layer is up. There was nothing to find.

I finally gave up on Altera, not because I couldn't get it working, but because it was not nearly worth the effort. Their core is a huge mess. I'm sure it works well enough in synthesis, but I wouldn't bother trying to run it in simulation. Even when using their Modelsim script to run simulation, you get the Modelsim warnings of unconnected ports. They should be embarrassed to release something like this. If only their IP team that provides these cores could learn from the GUI team. Very nice UI, horrible RTL.

I'm sure I could go on and on, but suffice it to say that for the advanced user who likes to read the code created, Xilinx is much better. Also for anyone who desires to simulate efficiently, Xilinx is much better. I have never been so shocked by a vendor before, especially one that is purported to be the most user friendly of the vendors. Both Lattice and Xilinx win against Altera.

As a side-note relating to Altera, it seems these types of issues have existed long before their PCI Express core. I have spoken with some very long-time Altera users, and it is known that their FIFOs simulate horrifically slowly, and their old PCI core user-side was a mess too. Perhaps it's time for Altera to heed this call and take the necessary steps to remedy this situation.


Sunday, July 18, 2010

Modelsim, working with multiple libraries - bug - not finding work

In short,

if you have the work library and another library named test, and you have:


and top instantiated test_mod which instantiates work_mod, you will get an error at vsim time that modelsim can't find work_mod. This is due to how Modelsim determines the work library when instantiating the test_mod. The typical command you would use is:
vsim -L test - this produces an error
vsim -L work -L test - also an error
The only way to get around this issue for now is to use:
vsim -L ./work -L test

What is occurring from what I can tell is that Modelsim treats the work library as special, and in this way creates all sorts of buggy behavior. Another recommendation from Model Tech is to not use the work library and therefore you would have a command like:
vsim -L mywork -L test
which would work fine. This is a very untenable solution for all the obvious reasons, so for now I'm sticking with the first workaround.

Error from Modelsim:

# vsim -L test
# Loading
# Loading test.test_mod
# ** Error: (vsim-3033) test_module.v(2): Instantiation of 'work_mod' failed. The design unit was not found.
# Region: /top/test_mod_inst
# Searched libraries:
# C:\tmp\modelsim_test\test (<-This is actually Modelsim mixing up the work library)
# C:\tmp\modelsim_test\test
# Error loading design

Example files:

work.v:
module work();
 _library library_inst();
endmodule
module work_sub();
endmodule

library.v:
module _library();
 work_sub work_sub_inst();
endmodule

Commands:
vlib work
vlog work.v
vlib library
vlog -work library library.v
vsim -L library work

#vsim -L ./work -L library work - this will compile and run as noted above.

Friday, July 2, 2010

Simulating 2 Lattice PCIe cores...

I start off every post with some vocal exclamation, so gah!

(All this is from memory, so forgive any unintentional mistakes.)

Lattice IP Express allows you to create PCIe cores. For compilation, you are given 2 NGO files. For simulation you are given an obfuscated behavioral model of PCIe and the sources for the pcs NGO. (Why do they compile those sources to an NGO? I guess it's to save the user from complications, and to save them from using the define files discussed next.) The simulation files are all controlled through files with various `defines in them. These `defines control the compilation of both the beh and pcs files. This is a major problem in Modelsim!

First off, you run into the issue that you can't have 2 cores with different simulation `defines running in the same design. For example I have a DUT (Device Under Test) with Wishbone and a BFM (Bus Functional Model) without it. This can't work b/c the same module names are used in both instances, even when you gave the cores different top level names during IP Express creation!

The next issue is that you can try to change the module names, but the behavioral model has a large number of tiny modules with obfuscated names and no easy way to go over the file and rename them all. You can do it, but it would take a nice sed script or some other method, and it would be annoying... The other modules of course could be renamed much more easily. The next problem is that the defines of course have the same names, so you'd have to rename those, or compile the files from a top level .v file so the defines stay local, at least with the -sv switch. All sorts of headaches involved.

Create separate libraries for the simulation core and the DUT core. Use Modelsim's trick to get vsim to work properly... What is that trick you ask? By setting the first search library as work in the vsim command, Modelsim will perform a search for modules beginning with the library that the module is instantiated inside of.
This won't work:
vsim top_tb -L pcie_sim_work -L pcie_endpoint_work

This will work:
vsim top_tb -L work -L pcie_sim_work -L pcie_endpoint_work

(Note: Lattice requires pcsd_mti_work also... so add -L pcsd_mti_work after mapping the library)

Modelsim discusses this special case in their reference manual under "Handling sub-modules with common names":
"When you specify -L work first in the search library arguments you are directing
vsim to search for the instantiated module or UDP in the library that contains the module
that does the instantiation."

Thursday, June 10, 2010

verilog & parallel case for case statements

So I was coding Verilog and I had a little circular buffer bringing in a mux selection on a per clock basis. I wanted this mux selection value to be treated as one-hot in the case statement.

Quickly mashed up example...

reg [31:0] circ_buffer [3:0]; //4 copies of the selector which is some size (no bigger than 32 bits)

always @(posedge clk) begin
  if(wr_a)
    circ_buffer[wr_index] <= A;
  else if(wr_b)
    circ_buffer[wr_index] <= B;
  if(wr_a || wr_b)
    wr_index <= wr_index + 1;
end

always @(posedge clk) begin
  case (circ_buffer[rd_index])
    A: ...
    B: ...
  endcase
end

So I want A and B to be defined as one-hot (ie A is 01 and B is 10). And then I want its interpretation to be parallel. This would happen if circ_buffer was a clearly defined state machine, but as it is, it's not. I just read up on parallel case and how it works, and I've seen lots of information... albeit bad information for those who like the most intense and clear control of what they are doing. An example is the document by Sunburst Design.

Here's what I get from most of what people have written. "parallel case" should never be given as a synthesis option b/c it will create hardware that does not behave the way the RTL simulates. The 'proper' usage given is completely useless. Read the document by Sunburst Design and hopefully you'll understand my meaning.

Anyhow, parallel case is a very strong synthesis option that should be used carefully. I would probably advise against it simply b/c the same thing can be accomplished using carefully written if statements. For me, the solution is to simply swap the second part to:

UPDATE (July 25, 2010):
What I've written below is not correct for ASIC design. I've just recently learned that Synopsys treats the 'case' with 'parallel case' option in a special way. It basically instantiates a one-hot even if it could result in 2 sources driving a single wire. This can't happen in an FPGA as an FPGA will always resolve who drives what, but for ASIC design this isn't necessarily the case. To clarify - for ASIC design you may very well want to use the parallel case statement. Using the ifs as I show below would produce a resolved version using gates.
END UPDATE (July 25, 2010)

always @(posedge clk) begin
  if(circ_buffer[rd_index][0]) ...
  if(circ_buffer[rd_index][1]) ...
end

Yeah, it's a less pretty way, but it's the way you get the same simulation results as synthesis results. It's much smarter this way. But don't be fooled by the papers and forget that parallel case most definitely has a use, just one that's ill advised, or at the very least not capable of being simulated by any current tools that I know of.

I don't know what I'm gonna do yet, but perhaps leave a parallel case state machine with a check for valid states above it and blow up if not... That takes care of the simulation issue, right? Otherwise I'll do the if statements, one after another.
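Here's a rough Python model of why a selector that isn't strictly one-hot bites you. It's only a sketch: priority_case stands in for how a simulator walks the case items in order, and parallel_case stands in for one AND-OR mux that a synthesis tool might plausibly build when told the items never overlap. Both function names, the 8-bit width and the 0x0F / 0xF0 data values are made up for illustration and are not taken from any tool.

```python
def priority_case(sel, a, b):
    """Ordered evaluation: the first matching item wins (simulation-like)."""
    if sel & 0b01:
        return a
    if sel & 0b10:
        return b
    return 0


def parallel_case(sel, a, b, width=8):
    """Assumed AND-OR mux with no priority between the two select bits."""
    mask = (1 << width) - 1
    pick_a = mask if sel & 0b01 else 0
    pick_b = mask if sel & 0b10 else 0
    return (pick_a & a) | (pick_b & b)


if __name__ == "__main__":
    # Compare both models for the two one-hot values and the illegal 11.
    for sel in (0b01, 0b10, 0b11):
        print(format(sel, "02b"),
              hex(priority_case(sel, 0x0F, 0xF0)),
              hex(parallel_case(sel, 0x0F, 0xF0)))
```

For the one-hot selectors 01 and 10 the two models agree. For 11 the ordered version picks the first input while the AND-OR version ORs both inputs together, which is the kind of simulation/synthesis mismatch discussed above.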

Saturday, May 22, 2010

vim plugins - seriously

I don't like indentation for Verilog (or VHDL), so I've decided to try and fix it. The best way to do that would be to understand how it works. I opened the ftplugin file verilog.vim and started reading. That's the why. Hopefully someone else will get something from the few useful gems I've gleaned.

DISCLAIMER: I do not know that I'm even understanding this correctly. I'm reading and gleaning as best as I can. It's strange code. Comments with corrections are appreciated.

s: (s colon) - script variables
a: (a colon) - argument variables
b: (b colon) - buffer variables
g: (g colon) - global variables
v: (v colon) - predefined Vim variables (v:version, v:count, ...), not visual mode variables

=~ (equal tilde) - regexp match operator (true if the string on the left matches the pattern on the right)
: (colon) - type of dictionary or list
%( (percent open-parenthesis) - start of a group that doesn't store in back-reference variables (written \%( unless the pattern is in very magic mode)
%5c, %<5c, %>5c, %5l, %<5l, %>5l (percent number c), (percent number less-than c), (percent number greater-than c), (percent number l), (percent number less-than l), (percent number greater-than l) - search modifier to match at a specific column (c) or line (l), or with less-than / greater-than, before / after the given column or line.

Look at this guy's page, finally stumbled onto it:
and this page:
and this one too:

Tuesday, April 27, 2010

Build gvim 7.2 x64 and create installer for Windows 7, Vista, XP

Update with new installer and changes for building 7.3:

I've been getting kinda tired of not having a decent installer for gvim 7.2 on my Vista x64 system. I have figured out all of these steps, and I'm sharing them with you. You can download the actual installer from: (64 bit only, includes all patches up to and including 7.2.411)

This version is built exactly as I've shown below. The printing works in Windows, and so do the context menus. It also installs to the Program Files directory as it should. This is only tested on Vista x64.

Here are the steps: (I'm sure they can be somewhat simplified and I'd be happy to hear about that.)

Create a folder called building, eg c:\building

Download these files:
Unzip them using 7-zip or alternative program.
Each of these archives contains a vim72 folder, extract them all into the c:\building\vim72 folder.

Download these files, and extract them to c:\building\patches

Download these files into c:\building\patches

Download and install these programs:
Optionally install make, although we will be bypassing the make with a manual command due to incorrect syntax in the Makefile when running on Windows.

Add "C:\Program Files (x86)\GnuWin32\bin\" to your path variable and reopen any command prompts.

Download and extract diff.exe from:
Copy diff.exe to c:\building\

Download and install Windows SDK for AMD64:
If this link doesn't work, then go to
and download the GRMSDKX_EN_DVD.iso (which is the AMD64 version)

Download this package and copy upx.exe to c:\building\vim72\nsis

Open a command prompt and run
C:\Windows\System32\cmd.exe /E:ON /V:ON /T:0E /K "C:\Program Files\Microsoft SDKs\Windows\v7.0\Bin\SetEnv.cmd" /release
Use this command prompt for building as it is now set to 64 bit release.

Install patches by changing directory to c:\building\vim72 and running these commands in this order:
patch --binary -p0 < ..\patches\7.2.001-100
patch --binary -p0 < ..\patches\7.2.101-200
patch --binary -p0 < ..\patches\7.2.201-300
patch --binary -p0 < ..\patches\7.2.301-400
patch --binary -p0 < ..\patches\7.2.401
patch --binary -p0 < ..\patches\7.2.402
patch --binary -p0 < ..\patches\7.2.403
patch --binary -p0 < ..\patches\7.2.404
patch --binary -p0 < ..\patches\7.2.405
patch --binary -p0 < ..\patches\7.2.406
patch --binary -p0 < ..\patches\7.2.407
patch --binary -p0 < ..\patches\7.2.408
patch --binary -p0 < ..\patches\7.2.409
patch --binary -p0 < ..\patches\7.2.410
patch --binary -p0 < ..\patches\7.2.411

Change directory to c:\building\vim72\src, and build using these commands:
nmake -f Make_mvc.mak OLE=yes
nmake -f Make_mvc.mak OLE=yes GUI=yes
(First command produces vim.exe and second one produces gvim.exe)

Change directory to c:\building\vim72\runtime\doc and run
sed -e "s/[ \t]*\*[-a-zA-Z0-9.]*\*//g" -e "s/vim:tw=78://" uganda.txt | uniq >uganda.nsis.txt
This is the line that should be run from the Makefile in this directory, but Windows make.exe has trouble with it.

Change directory to c:\building\vim72\nsis and open gvim.nsi in a text editor (preferably not notepad or wordpad...)
At line 22 change (comment out)
!define HAVE_NLS
to
#!define HAVE_NLS
(I don't have a 64-bit libintl.dll)

Replace all instances of PROGRAMFILES with PROGRAMFILES64
(Forces the default installation directory to be the Program Files directory instead of Program Files (x86).)

At line 248 change
   File /oname=vim.exe ${VIMSRC}\vimd32.exe
to
   File /oname=vim.exe ${VIMSRC}\vimw32.exe
(The check for WINNT isn't working on my Vista x64 system... so I force it)

Copy all files and directories from c:\building\vim72\runtime to c:\building\vim72

Rename these files:
c:\building\vim72\src\gvim.exe -> c:\building\vim72\src\gvim_ole.exe
c:\building\vim72\src\vim.exe -> c:\building\vim72\src\vimw32.exe
c:\building\vim72\src\install.exe -> c:\building\vim72\src\installw32.exe
c:\building\vim72\src\uninstal.exe -> c:\building\vim72\src\uninstalw32.exe

Copy and rename this file:
c:\building\vim72\src\xxd\xxd.exe -> c:\building\vim72\src\xxdw32.exe

Change directory to c:\building\vim72\nsis and run:
"c:\Program Files (x86)\NSIS\makensis.exe" gvim.nsi

Nachum Kanovsky
FPGA & Embedded Software Expert

Friday, April 23, 2010

Modelsim, scripts getting locked?

So I was working a lot with Modelsim yesterday simulating DDR2 cores, but I kept having to change the script. But Modelsim wouldn't let me!!! So here's the key, when Modelsim is running a script and you interrupt it while it's running, it may show:
and then you're just screwed! OK OK, type 'abort' and hit enter. This will finally end the script and allow you to modify your script without closing down Modelsim. Shockingly 'quit -sim' doesn't work, nor are there any menu options that help.

Perhaps this'll save someone else problems.

Lattice DDR2 cores - read_data_valid doesn't come up...?

So as is usual, I'm suspicious of crummy simulation models and that's of course the reason why Lattice's ECP3 DDR2 (6.7 and 7.0 beta) cores don't give read_data_valid during simulation... They output the data, but not the read_data_valid signal.

After 2 hours of debugging, I found that it was all in the module hookup. read_tap_delay! The signal that they don't bother documenting, or explaining was causing the problem. I accidentally hooked up read_tap_delay to 2'b0 instead of 6'b0 (shame on me). This killed it. I guess I learn 2 things here, first that lack of documentation doesn't indicate lack of importance! and second that 2'b0 doesn't translate in all situations to fill the port with zeroes!

On the other hand, I finally took the plunge and hooked up Micron's DDR2 model. I have to say that it was the easiest model ever to hook up! Just define sg3 and x16 and voila, it's done. OK OK, there were two small snafus...

Lattice provides their cores with a simulation model, if you try to swap out their DDR2 model (also Micron's), then there is at least one problem. They have a file where they define x8 for all cores. I don't know why, I'd guess it's got something to do with Micron's old model. Anyhow for those of you swapping models for simulation, be aware of that. It will explain why Micron's model suddenly insists on a data bus of incorrect size even though you defined everything properly!

Lattice doesn't include negative differential ports in their top level definition, so if you have a clk_p and clk_n, all that appears in the top level netlist is clk_p, and the lpf file (constraints file) indicates the differential nature and automatically connects the differential pair. (Yes, pairs are fixed in the ECP3 and many other Lattice FPGAs.)

So now I have a DDR2 model requiring clk_n and dqs_n. clk_n is simple, just:
wire clk_n;
assign clk_n = ~clk_p;
easy shmeezy

On the other hand, dqs_n is an experience in learning. To save you the time that you've probably already spent trying to figure this out, here it is:
wire dqs_n;
assign dqs_n = (dqs_p === 1'bz) ? 1'bz : ~dqs_p;

Yes yes, the 3 equal signs are necessary. Shockingly a part of the language I'd never needed to use before, now comes in handy.

Live to learn another day,

Monday, March 29, 2010

DNS issues with Netgear WGR614v5

Sooo... here's the story.

I keep getting stuck on "Resolving host" when browsing the web. I developed a simple test. Open 9 tabs at the same time and see which get through and which don't. My Godaddy sites never come through, and most of the time neither do Twitter and Blogger.

What's really going on? Since it was stuck on "Resolving host" I reverted to some manual nslookup debugging. Even nslookup showed "DNS request timed out", and "timeout was 2 seconds".

I then updated my local LAN card to use Comcast's server instead of relying on the router's ability to do lookups. No change.

Convert UDP DNS requests to TCP!! It seems that this router gets overwhelmed by routing the DNS requests and drops them.

When this happens, the WGR614v5 loses the ability to resolve DNS for about 5 - 10 seconds. During this time, I tested a custom DNS lookup using both UDP and TCP. The UDP lookup failed, and the TCP lookup worked. This prompted me to write a simple server which listens for DNS requests, wraps the data with the customary 2 byte size header and sends it on as a TCP request. Once the response comes back, I remove the 2 byte size header from the response, and I send it back to the requester.
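The relay described above could be sketched roughly like this in Python. This is not the server I wrote, just a minimal illustration of the 2 byte size header trick. The function names (wrap_for_tcp, unwrap_from_tcp, recv_exact, relay_once) are made up for this sketch, and the upstream address is whatever DNS server you pass in.

```python
import socket
import struct


def wrap_for_tcp(udp_payload: bytes) -> bytes:
    """Prefix a DNS message with the 2 byte big-endian size header used over TCP."""
    return struct.pack("!H", len(udp_payload)) + udp_payload


def unwrap_from_tcp(tcp_data: bytes) -> bytes:
    """Strip the 2 byte size header and return exactly one DNS message."""
    (length,) = struct.unpack("!H", tcp_data[:2])
    message = tcp_data[2:2 + length]
    if len(message) != length:
        raise ValueError("truncated DNS message")
    return message


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly count bytes from a TCP socket (recv may return less)."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("upstream closed the connection early")
        buf += chunk
    return buf


def relay_once(udp_sock: socket.socket, upstream: tuple) -> None:
    """Take one UDP DNS request, forward it over TCP, send the answer back over UDP."""
    query, client = udp_sock.recvfrom(4096)
    with socket.create_connection(upstream, timeout=5) as tcp:
        tcp.sendall(wrap_for_tcp(query))
        (length,) = struct.unpack("!H", recv_exact(tcp, 2))
        response = recv_exact(tcp, length)
    udp_sock.sendto(response, client)
```

To use it you would bind a UDP socket on port 53 and call relay_once in a loop, passing the (address, 53) of the DNS server you want to reach over TCP.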

The router may have a general UDP problem, as opposed to a UDP DNS problem, but all I care about at the moment is my ability to get to my webpages. As HTTP is TCP and it just relies on the DNS requests to get the address, this solves my problem well enough for now.

A new router is being purchased in the next few days...

Friday, March 5, 2010

HTML stuff...

Never put a block-level element inside a paragraph! This is illegal in HTML 4.01:

"The P element represents a paragraph. It cannot contain block-level elements (including P itself)."


IE8 - JSON works, but only if you have:

otherwise IE8's JSON object is undefined...


IE8 - eval function - Accepts a string only, and the string should be enclosed in parentheses! So if you are eval'ing an object then the object name is put in quotes. For example:

function parse(jsonString) {
  return eval('(' + jsonString + ')');
  //return eval(jsonString); -- Illegal!
}


PHP - isset should be used to see if a variable exists, and the same goes for array indexes and the like.
ie: isset($myObj['testExist']) will return true if testExist exists and is non-null


Chrome and Safari (guessing it's a webkit issue) won't keep the text selected if you just use an onFocus callback with a select() call. You must cancel the mouseup event, otherwise the text gets deselected most of the time. Argh

Thursday, February 18, 2010

Downloaded NVIDIA driver? Screensaver may still turn on when playing a movie in full screen

I'm running Vista x64 SP2 and Windows Media Player 11. I have an NVIDIA GeForce 8600 graphics card. I also make sure that the "Allow screen saver during playback" is unchecked in WMP -> Tools -> Options. After 20 minutes my screen saver kicks in.

Anyhow I've solved the problem with the screen saver. I like to keep my system as updated as possible, so I install NVIDIA's drivers directly from their website. Yesterday I removed NVIDIA's drivers, and I let Windows install its own drivers for the graphics card. Of course Microsoft is just packaging a form of NVIDIA's drivers, but now my screen saver stays disabled when I play a movie.

Saturday, February 6, 2010

Interesting HW interview question... Serial binary input divisible by 5

I'm in California at the third round of interviews and I'm asked a question that baffled me. Given a serial input of a binary number, write a state machine that tells if that value is divisible by 5.

I first solved the problem thinking the input was MSB first. This is pretty trivial. 5 states and some very simple movements. Then it was clarified that they wanted it solved LSB first. I thought about it but I could not come up with a solution.
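For reference, the MSB-first version really is just 5 states, one per remainder. A minimal Python sketch (the function name is mine): each new bit doubles the value seen so far and adds the bit, so the remainder mod 5 is all you need to carry.

```python
def divisible_by_5_msb_first(bits):
    """bits is an iterable of 0/1 values, most significant bit first."""
    state = 0  # remainder mod 5 of the bits seen so far
    for bit in bits:
        # Shifting in a new LSB doubles the value and adds the bit.
        state = (2 * state + bit) % 5
    return state == 0


if __name__ == "__main__":
    print(divisible_by_5_msb_first([1, 1, 0, 0, 1]))  # 25
    print(divisible_by_5_msb_first([1, 1, 1]))        # 7
```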

It is now the next day and I approached the problem again. I have a good and clear solution, but I'd love to know if there is a better one.

WARNING - Solution below:

A single bit, in any location, when converted to decimal will always end in either 2, 4, 8, or 6 except of course the least significant bit which is 1.
1, 2, 4, 8, 16, 32, 64, 128, 256...
1, 2, 4, 8, 6, 2, 4, 8, 6...
A divide by 5 problem at any given time can be in one of five situations. It's either 0, 1, 2, 3, or 4 from being divisible by 5. Ie: it is 0 and is divisible, or 1 and needs another 4, or 2 and needs another 3...

By combining both of these ideas you can build a state machine with 20 states that easily solves this problem. It's like 5 small 4-stated machines. The 4-stated machine keeps moving between 2, 4, 8, and 6 when there's a 0 input, and moves to one of the other machines when there is a 1. Of course you need an extra start state to deal with the first bit. And anytime you are in the 0 4-stated machine then the value is divisible.

Best way to explain is by example:
current value, add value (c, a) where c goes from 0 to 4 and a is 2, 4, 8, or 6.

Value that will be given is 25 (11001), fed in LSB first, starting in the first bit state:
1: First bit state moves to (1, 2) since current value is 1 and next bit will add 2.
0: (1, 2) moves to (1, 4) since current value is still 1 and next bit will add 4.
0: (1, 4) moves to (1, 8) since current value is still 1 and next bit will add 8.
1: (1, 8) moves to (4, 6) since current value is now 4 (9 divided by 5 leaves a remainder of 4) and next bit will add 6.
1: (4, 6) moves to (0, 2) since current value is now 0 (10 divided by 5 leaves a remainder of 0) and next bit will add 2.

That's it.
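The same machine written out as a small Python sketch, assuming the (c, a) state pair described above. The names NEXT_ADD and divisible_by_5_lsb_first are made up here; the start state is modeled as c = 0 with an add value of 1 for the first bit.

```python
# Last decimal digit of successive powers of two: 1, then 2, 4, 8, 6 repeating.
NEXT_ADD = {1: 2, 2: 4, 4: 8, 8: 6, 6: 2}


def divisible_by_5_lsb_first(bits):
    """bits is an iterable of 0/1 values, least significant bit first."""
    current, add = 0, 1  # start state: nothing seen yet, first bit is worth 1
    for bit in bits:
        if bit:
            # Only the last decimal digit matters, because 10 is a multiple of 5.
            current = (current + add) % 5
        add = NEXT_ADD[add]
    return current == 0


if __name__ == "__main__":
    # 25 is 11001, so LSB first it arrives as 1, 0, 0, 1, 1.
    print(divisible_by_5_lsb_first([1, 0, 0, 1, 1]))
```

Counting the reachable (current, add) pairs gives the 5 x 4 = 20 states plus the start state.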

Tuesday, January 12, 2010

Spice circuit simulator - part 2

I did it! I made a circuit and I got it to run. I have returned to Berkeley's 3f5 spice. Using cspice I can input the file circuit.txt:

My first circuit
vcc high 0 dc 10
r1 high bw 3
r2 bw 0 4

This is a very simple circuit with 10v going to a 3 ohm resistor and then a 4 ohm resistor and then back to ground.  10v -- 3ohm -- 4ohm -- gnd
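As a hand check of what the simulator should report for this netlist (r1 = 3 ohm from high to bw, r2 = 4 ohm from bw to ground), here is the series divider worked out in a few lines of Python. The function name divider is just for this sketch.

```python
def divider(vcc, r1, r2):
    """Return (loop current in amps, voltage at the node between r1 and r2)."""
    current = vcc / (r1 + r2)  # one loop, so the same current flows in both resistors
    v_mid = current * r2       # drop across the lower resistor, measured from ground
    return current, v_mid


if __name__ == "__main__":
    current, v_bw = divider(10.0, 3.0, 4.0)
    print(current, v_bw)
```

That works out to 10/7 A (about 1.43 A) through the loop and 40/7 V (about 5.71 V) at node bw.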

I run this simulation with this command on Windows: cspice.exe < circuit.txt

This results in actual output! Much of this I learned from Berkeley's documentation at under Interactive User Guide. I also referenced Kevin Cosgrove's article 'Analyzing Circuits with SPICE on Linux' at

I am not anywhere near the point of running spice to do transient analyses. I only want very simplistic static results of basic digital circuits. This is what I now feel confident that I can do.

Monday, January 11, 2010

Spice circuit simulator

I'm learning electrical engineering... Pretty tough to understand this stuff, but really helps. Another tool I'm trying to use is a spice simulator. This way I can write tiny circuits and see if I understand what they're supposed to do, and spice can verify that.

Anyhow, I started searching for spice, and that was a bit of a challenge just because of its name... Google doesn't bring you to what you're looking for directly, so here's a good starting point: - Homepage - List of available spice simulators

OK. So which one to download. I like to work as manually as possible to increase my understanding as much as possible, so this meant I wanted a tool that was as much like the UNIX version (I'm on Windows) as possible. I also want a text based tool for now. This led me to download Berkeley's spice3f5.tar.gz, and to figure out how to compile this with Visual C++ 2008. A good starting point is I did some things differently from what he listed, but I believe his directions will work too. A tip would be to use GnuWin32's find, xargs, and sed tools to perform the necessary changes to the files. Also when following the /OUT: directions you have to add that line to only the files he listed, otherwise the lib files won't get added correctly. I chose a more generic method of modifying the batch files with something like this in the relevant msc51.bat files:
set LIBNAME=..\..\dev1.lib
if not exist %LIBNAME% set LIBPARAM=/OUT:
lib %LIBPARAM%%LIBNAME% @response.lib >> ..\..\..\msc.out

... After getting it to compile I realized that it doesn't create spice3.exe, and only really creates bspice.exe and cspice.exe. I don't currently know much about spice so I don't really know what those files do for you, but from what I can see they aren't what I was looking for.

At this point I started looking for another alternative. I found ngspice: The download page is I know it says that they only release sources, but that's not true, download the zip file and you'll see it has Windows binaries. I'm guessing that they never bothered to update that information.

Last thing... Seems that spice uses something referred to as decks (input decks). Trouble is no one bothers to tell you what that is. I am once again positing that this term comes from very old systems that accepted commands on actual cards. Each command was put on a card and the cards were fed into a machine. The combined cards were referred to as decks. I found a reference to cards and decks here: So an input deck seems to just be a set of input commands.

Will update as I learn more. At this point I'm still just a fish flapping around on dry land with regards to spice.