Muxing clocks is nothing new to me; I've done it many times. Until now, though, I've never had to keep both sides of the mux synchronous with each other.
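-- Clock mux: clkAB follows clkA by default, clkB when clkSel = '1'
process(clkA, clkB, clkSel)
begin
    clkAB <= clkA;
    if(clkSel = '1') then
        clkAB <= clkB;
    end if;
end process;
-- Source flip-flop in the clkA domain
process(clkA)
begin
    if(clkA'event and clkA = '1') then
        ffA <= sig_tmpA;
    end if;
end process;
-- Source flip-flop in the clkB domain
process(clkB)
begin
    if(clkB'event and clkB = '1') then
        ffB <= sig_tmpB;
    end if;
end process;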
-- Destination flip-flop on the muxed clock
process(clkAB)
begin
    if(clkAB'event and clkAB = '1') then
        if(grabA) then
            ffAB <= ffA;
        elsif(grabB) then
            ffAB <= ffB;
        end if;
    end if;
end process;
Here's the problem. I want ffAB to take the value of ffA when grabA is true, and the value of ffB when grabB is true. That requires clkAB to be in phase with clkA when clkSel is 0, and in phase with clkB when clkSel is 1. The obstacle is clock skew: clkAB is a delayed version of whichever source clock is selected. As a result, the flip-flops ahead of the clock mux (ffA and ffB) change their outputs before the flip-flop behind the mux (ffAB) has sampled the old value. This is a standard hold-time problem: the hold check fails whenever the skew on clkAB exceeds the clock-to-Q delay of the source flop plus the data-path delay, minus the hold requirement of ffAB.
For an ASIC/ASSP, the fix is handled by the tools. It is pretty automatic from what I've been led to believe. Perhaps some constraints or switches are required, but in general, if the two clocks are related, the tools will balance the clock trees so that setup and hold times are met.
An FPGA, on the other hand, doesn't work this way. Clocks are routed on a dedicated clock network, and a clock mux of this sort adds significant skew. As I am currently using Actel tools, I am very aware that neither synthesis nor place & route performs any clock tree synthesis (CTS).
The key to solving these issues (other than manually placing individual flip-flops into regions) is to carry the data across the delayed-clock boundary through buffers. BUFD is Actel's buffer primitive. Putting buffers on the data line trades setup margin for hold margin: each buffer delays the data, so it arrives later relative to the skewed clock edge. In practice, take the outputs of ffA and ffB and insert buffers between their Qs and the D of ffAB.
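The buffered data path could be sketched roughly as follows. This is only a sketch under assumptions: the BUFD port names (A in, Y out) should be checked against the macro library guide for your device family, syn_keep is the Synplify way of keeping the chain from being optimized away (other tools may need a different attribute), and the entity name hold_fix, the generic N_BUF, and the port names are made up for illustration. The right buffer count has to come from your timing report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: delays one data bit through a chain of BUFD cells.
entity hold_fix is
  generic (
    N_BUF : positive := 2   -- illustrative default; tune from the timing report
  );
  port (
    d_in  : in  std_logic;  -- Q of ffA (or ffB)
    d_out : out std_logic   -- delayed data, feeds the D side of ffAB
  );
end entity hold_fix;

architecture rtl of hold_fix is
  -- Assumed port names for Actel's BUFD macro; verify against your library.
  component BUFD
    port (A : in std_logic; Y : out std_logic);
  end component;

  -- chain(0) is the undelayed input, chain(N_BUF) the fully delayed output
  signal chain : std_logic_vector(0 to N_BUF);

  -- Assumption: Synplify-style attribute so the buffers are not removed
  attribute syn_keep : boolean;
  attribute syn_keep of chain : signal is true;
begin
  chain(0) <= d_in;

  -- One BUFD per stage; each stage adds one buffer delay to the data path
  gen_buf : for i in 0 to N_BUF - 1 generate
    u_buf : BUFD port map (A => chain(i), Y => chain(i + 1));
  end generate gen_buf;

  d_out <= chain(N_BUF);
end architecture rtl;
```

Instantiate one hold_fix per source flop, then use the delayed outputs in place of ffA and ffB inside the clkAB process.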
Voila, you have solved the hold time issue.
Another option is to create a copy of clkA and clkB, and delay those copies using a buffer. Then use these copies in your original clkA and clkB processes. This would create a smaller skew. The problem with this mechanism is that you may also want to sample data back from clkAB to clkA/clkB. If you play with the clock only, then it will be difficult to solve the hold problems in both directions. As soon as you solve it in one direction, you most likely will have created it in the other direction.
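As a rough illustration of the delayed-clock idea for the clkA side (the clkB side would look the same), here is a sketch under assumptions: the BUFD port names (A, Y) are assumed, and the entity name delayed_src_clk and the signal clkA_dly are invented for this example. Depending on the device, a clock that passes through a buffer may leave the global clock network, so check the resulting skew in the timing analyzer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: clocks the source flop from a delayed copy of clkA.
entity delayed_src_clk is
  port (
    clkA     : in  std_logic;
    sig_tmpA : in  std_logic;
    ffA      : out std_logic
  );
end entity delayed_src_clk;

architecture rtl of delayed_src_clk is
  -- Assumed port names for Actel's BUFD macro; verify against your library.
  component BUFD
    port (A : in std_logic; Y : out std_logic);
  end component;

  signal clkA_dly : std_logic;  -- delayed copy of clkA
begin
  -- A single buffer is shown; the amount of delay needed is design specific
  u_dly : BUFD port map (A => clkA, Y => clkA_dly);

  -- Same flop as the original clkA process, now on the delayed clock
  process(clkA_dly)
  begin
    if rising_edge(clkA_dly) then
      ffA <= sig_tmpA;
    end if;
  end process;
end architecture rtl;
```

Delaying the source clock moves ffA's launch edge closer to the capture edge on clkAB, which is why the remaining skew gets smaller in this direction.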
A good solution would use a combination of shifting the clock a bit, and using buffers to cover any remaining skew. Whatever solution you choose to implement should be done in direct conjunction with the timing analyzer tools. Those tools are your best way of knowing where and when you have setup or hold problems on your system.