How to design a better FPGA (or ASIC)

Introduction

This page is not directed at people designing with FPGAs but at those who are designing FPGAs and ASICs. Those developing with FPGAs may, however, find some of the information useful in terms of what to watch out for, but may get the most long-term benefit by pointing their FPGA or ASIC vendor to this page. Much of the information here also applies to microcontrollers.

FPGAs, like many other integrated circuits, are often designed with insufficient thought given to total system impact.

Don't be silicon-wise and system foolish. -- Vishwani Agrawal, AT&T

This page may offend some FPGA designers and managers who put personal ego ahead of the interests of their company, their customers, and the public interest. So be it.

I think it is a safe bet that the problems described here are cutting FPGA sales by at least a factor of two. A factor of ten would not surprise me in the least. There is a huge difference between the promise of FPGAs and the reality, and most of that is due to failure to pay attention to system level details.

I am going to call Xilinx, in particular, to task because I am currently working primarily with their chips. The reader should not assume that this means that Xilinx chips are worse than other brands. I see evidence of many of these problems in other manufacturers' parts as well. If there were another brand that had chips competitive with the ones I am using and didn't have a lot of these problems, I would have switched in a heartbeat. There are even a few problems listed here that Xilinx got right or almost got right.

208-QFP Package

The 208-QFP package is crucial. This is the highest pin count package in common use that does not require BGA technology. Many companies cannot use BGA technology, and even for those that can, a QFP has substantial benefits. It is vitally important to provide devices in this package. Xilinx crippled the Spartan-3AN and 3A families by not providing PQ208 versions.

And, shocking revelation, there are still companies that use through-hole parts. So there is actually a market for a 208-PGA (preferably 2 rows of pins). A carrier board is OK, just keep the cost reasonable.

A good pinout is essential

Pinouts should be designed to facilitate PCB layout. It is much easier to reroute a signal to a pin inside an IC once than to have every PCB designer do it repeatedly (and possibly need more layers or run into signal integrity problems).

A 208-PQFP package has 52 pins on each side. Let's allocate them. 32 user I/O, every pin full I/O and half of a differential pair. 8 grounds, interspersed with I/O. 5 config (including Vcc), near corner. 2 Vref, one in the middle of each set of 16 I/O lines. 2 Vccint, near corner. 2 VccIO, one for each bank of 16 I/O lines, preferably near corner. 1 extra.

Except for the config pins, all 4 sides are identical. All the rarely used dual function config pins (parallel config) are put on the same bank and arranged so that differential pairs are either both used for config or both for I/O. When you assign pins for pin-hungry configuration modes, start at one end of a single bank and work your way toward the other end, leaving as many adjacent fully functional I/O pins as possible for each mode.

Of the config pins in each corner, 1 corner has all the JTAG pins: TDI, TCK, TMS, and TDO. One corner has the serial flash pins: CCLK, D0 (serial data), Config CE, and config reset (INIT_B). One corner has the mode and reset pins: reset, M0, M1, and M2. The remaining pins go in the last corner. Each set of configuration pins has its own Vcc and should work from at least 1.2 to 3.3V.
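
The per-side allocation above can be tallied with a few lines of Python. This is only a sanity-check sketch; the dictionary keys are labels invented here for illustration, and the counts are the ones given in the text.

```python
# Per-side pin budget for a 208-pin QFP, using the counts from the text.
PINS_PER_SIDE = 208 // 4  # 52 pins on each of the 4 sides

side_budget = {
    "user_io": 32,  # every pin full I/O and half of a differential pair
    "ground": 8,    # interspersed with the I/O
    "config": 5,    # includes the config Vcc, near the corner
    "vref": 2,      # one per set of 16 I/O lines
    "vccint": 2,    # near the corner
    "vccio": 2,     # one per bank of 16 I/O lines
    "spare": 1,     # the "1 extra" pin
}


def summarize(budget, pins_per_side=PINS_PER_SIDE):
    """Return (pins used, pins left unallocated, user I/O pins per ground)."""
    used = sum(budget.values())
    return used, pins_per_side - used, budget["user_io"] / budget["ground"]


used, unallocated, io_per_ground = summarize(side_budget)
print("pins used per side:", used)
print("pins unallocated:", unallocated)
print("user I/O per ground pin:", io_per_ground)
print("user I/O on the whole package:", 4 * side_budget["user_io"])
print("differential pairs per side:", side_budget["user_io"] // 2)
```

The budget fills all 52 pins on a side exactly, with one ground for every four user I/O pins.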

Avoid dysfunctional I/O pins

The Xilinx XC3S500E-PQ208, for example, has a staggering number of dysfunctional I/O pins. Since it presumably uses the same die as the much higher pin count BGA packages, you would think it would be real easy to wire bond to fully functional I/O pins. But, no, they had to waste valuable pins on the QFP package by bringing out the worst pins when better pins were available. The situation was made worse, before this happened, by thoughtlessly assigning special functions to I/O pins inappropriately. A pin is dysfunctional if:

Finding just 24 usable pins on each side turned out to be quite difficult. It should have taken minutes, not days.

Design for prototyping

Remember, a company's engineers need to be able to produce a proof of concept prototype quickly and cheaply before they can work on a production version. Such a prototype will also raise a number of issues that would otherwise require multiple iterations of a "production" designed board. On a production board, you can probably work around some crappy I/O pins; you can, for example, use input-only pins for input signals. But on a development board, you don't know how the pins will ultimately be used, so they need to be 100% functional. And yet, the same features that help a development board will also simplify layout of a production board. A decent prototyping system is modular and has many interchangeable connectors for I/O modules with the same exact pinout. Anything less is not suitable for rapid prototyping.

Symmetry

You should have as much 4 sided symmetry as possible. I/O pins with the same capabilities in the same positions on each side. Pins should be so similar that you can document the capabilities of most of them in a single table with the corresponding pins from each side listed together. The more symmetry and fully capable pins you have, the easier it is to document and, much more importantly, the easier it is to use the documentation.

This includes global clocks on all 4 sides. Xilinx has global clocks top and bottom and left half and right half clocks on the two sides. Worse, the only bank that has enough I/O pins to support a DRAM does not have a global clock. A symmetrical design would have had two global clocks on each of 4 sides and 2 left/right/top/bottom half clocks.

Internal JTAG chain access

Let us create many internal JTAG chains. Xilinx provides USER1 and USER2 JTAG instructions (and in a few cases USER3 and USER4) to provide for internal JTAG chains. That is a start but you can do better. The instruction register should be 8 bits (the XC3S500E is currently 6). Give us just one input on the TDO mux, as opposed to the current two. All undecoded opcodes, which currently select the bypass register, now select that input. Put a two input mux on it. Depending on the state of a configuration bit, that mux either selects the bypass register or our internal logic. Now, bring all bits from the instruction register into routable points on the chip, plus TDI, TMS, and TCK. Also bring out the output of the bypass register. We can do without the instruction register bits by making our own parallel register, if necessary, and this is even desirable in some cases. The internal TAP state machine states would also be nice but not essential. This allows us to add as many JTAG chains as we want, up to about 191, and also gives us the ability to come close to 100% ASIC emulation (as far as JTAG is concerned), just with different instruction numbers provided.

One other useful state would be to let us insert a chain between the output that would normally drive TDO and the actual TDO driver. One JTAG instruction would nullify this connection, allowing recovery in case the user JTAG chain is fubar. This is easily combined with the above and gives us 100% ASIC emulation.
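
A small Python model may help make the opcode decoding concrete. It is a sketch under stated assumptions, not the vendor's decoder: it assumes the 64 opcodes of the existing 6-bit instruction space stay reserved as the low 64 codes of the 8-bit space, and that the all-ones opcode remains BYPASS as IEEE 1149.1 requires. Under those assumptions the count of free opcodes happens to match the "about 191" figure. The names `VENDOR_OPCODES`, `user_chain_enable`, and `tdo_source` are hypothetical.

```python
IR_BITS = 8          # proposed instruction register width
LEGACY_IR_BITS = 6   # current XC3S500E width, per the text

# ASSUMPTION: the existing 6-bit opcodes map onto the low 64 codes.
VENDOR_OPCODES = frozenset(range(1 << LEGACY_IR_BITS))
ALL_ONES = (1 << IR_BITS) - 1  # must decode as BYPASS (IEEE 1149.1)


def tdo_source(opcode, user_chain_enable):
    """Model of the proposed TDO mux selection for one opcode.

    user_chain_enable stands in for the configuration bit that chooses
    between the bypass register and user logic for undecoded opcodes.
    """
    if not 0 <= opcode <= ALL_ONES:
        raise ValueError("opcode does not fit in the instruction register")
    if opcode in VENDOR_OPCODES:
        return "vendor"   # decoded by the chip's own JTAG logic
    if opcode == ALL_ONES or not user_chain_enable:
        return "bypass"   # mandatory BYPASS, or user chains switched off
    return "user"         # undecoded opcode routed to internal logic


user_opcodes = [op for op in range(1 << IR_BITS) if tdo_source(op, True) == "user"]
print("opcodes available for user chains:", len(user_opcodes))
```

With the configuration bit cleared, every undecoded opcode falls back to the bypass register, which is the behaviour of the current parts.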

With these two changes, we have the ability to construct our own JTAG subsystems that are either in parallel with the Xilinx system or in series with it.

Configurable I/O driver/receivers

A lot of applications require reconfiguring the I/O characteristics. A logic analyzer is one example. GPIO is another. There are many more. Give us macros that provide control over all the I/O drive characteristics (current drive, slew rate, pull ups, single ended/differential, input threshold, DDR, etc.). You presumably have a register which is latched from the configuration bits. Put a mux on the data input lines to that register and the strobe line, and give us one input to the mux. Or, if it is treated as RAM, let us write to it. Either option lets us treat it as a register accessible to an internal micro or other logic. And definitely give us the offsets in the configuration stream of the configuration fuses in case we want to manipulate that at download time instead of run time. The following macro, for example, would provide access to the full features of a pair of I/O drivers including differential I/O and DDR:


  library ieee;
  use ieee.std_logic_1164.all;

  entity dual_configurable_io is
  port(
    -- vector(0) drives/reads positive pin, vector(1) drives/reads negative pin
    -- in diff mode, vector(0) drives both
    -- in ddr mode, ddrin and ddrout provide the other half.
    -- "in", "out", and "configuration" are VHDL reserved words, so the
    -- ports are named data_in, data_out, and conf_data instead.
    data_in:     out std_logic_vector(1 downto 0);   -- data in (from the pins)
    ddrin:       out std_logic_vector(1 downto 0);   -- DDR data in
    data_out:    in  std_logic_vector(1 downto 0);   -- data out (to the pins)
    ddrout:      in  std_logic_vector(1 downto 0);   -- DDR data out
    enable:      in  std_logic_vector(1 downto 0);   -- output enable
    conf_data:   in  std_logic_vector(??? downto 0); -- configuration register (width not specified)
    conf_strobe: in  std_logic                       -- clock data into conf register
  );
  end dual_configurable_io;

5V tolerance

Is 5V tolerant I/O, or even 5V capable I/O, that hard? Other companies do it. So you need a thicker oxide layer on the edge of the chip. That is one extra mask and one extra step at the fab. Maybe the result is the I/O stages are a little slower. I would think you should be able to design the chip so you can produce 5V and 3.3V versions from the same mask set (including the extra mask), just with a variation in the process. And while most chips used in a new design are likely to be 3.3V or less, the design itself in many cases has to interact with the outside world, and a lot of the outside world is 5V. There are also applications in noisy environments (like motor controls) where you want the extra noise resistance and may still use 5V HC (not HCT) parts. Level translation outside the chip is a train wreck. The selection of level translation parts available leaves a lot to be desired, and you have lost a lot of flexibility and reliability once you go from a three pin driver interface (in, out, enable) to a single pin. And external level translation means more parts.

Whatever happened to the idea of FPGAs reducing parts count?

Four power supplies: 5V, 3.3V, 2.5V, and 1.2V. 24 decoupling capacitors. External level translators.

RAM

It is way past time to offer a stacked-chip package with the customer's choice of DRAM (for high capacity) or SRAM (when standby mode is needed) stacked on top. From a system level perspective, this should be cheaper than putting the RAM chip on the board. And it saves 32 valuable I/O pins in a QFP package compared to a RAM chip on the board. You can use the extra I/O buffers that are only bonded out in the larger packages to talk to the RAM; just use a separate VccIO so you don't cripple all 4/8 banks of I/O by limiting them to the RAM VccIO. Set the VccIO by changing one metallization layer or using bond wires. Stacked parallel flash would be nice but much less important with a big RAM, since we can download a serial flash into RAM at boot time. This will finally put you in a position to offer microcontrollers some serious competition.

SERDES

SERDES isn't just for expensive high end FPGAs for exotic niche markets. You have done that, now bring it to mass market chips like the Spartan-3E series. FPGAs are seriously falling behind off the shelf components here. PCI Express, 1394, SATA, SAS, Fibre Channel, SSA, Gigabit Ethernet, and RapidIO use variations of 8B/10B SERDES. SERDES is becoming common, and cheap. An engineer shouldn't have to use a multi-hundred dollar FPGA to make a PCI Express card or an ExpressCard, or to use a SATA hard drive.

SD/MMC card support

The ability to use an external SPI flash to load configuration is far less interesting than the ability to load the configuration from a partition on an MMC/SD card in SPI mode.

User settable fuses

It should be very easy to make very simple changes to the design at load time. The user should be able to identify configuration bits of interest in the VHDL/Verilog/schematic/netlist source by name and have the development tool chain spit out a list of the names and the fuse offsets in the configuration stream. This includes I/O port configuration.
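
As an illustration of what load-time patching could look like once the tool chain emits such a list, here is a minimal Python sketch. Everything about the format is an assumption made for the example: the map from names to bit offsets, the MSB-first bit numbering, and the names `uart_divisor` and `led_invert` are hypothetical, and a real configuration stream may also carry a checksum that would have to be updated after patching.

```python
def set_bit(image, bit_offset, value):
    """Set or clear one bit in a bytearray.

    ASSUMPTION: bit offset 0 is the most significant bit of byte 0.
    """
    byte_index, bit_index = divmod(bit_offset, 8)
    mask = 0x80 >> bit_index
    if value:
        image[byte_index] |= mask
    else:
        image[byte_index] &= ~mask & 0xFF


def apply_settings(image, fuse_map, settings):
    """Return a patched copy of image.

    fuse_map maps a name to its list of bit offsets, least significant
    bit of the setting first. settings maps a name to an integer value.
    An unknown name raises KeyError.
    """
    patched = bytearray(image)
    for name, value in settings.items():
        for i, offset in enumerate(fuse_map[name]):
            set_bit(patched, offset, (value >> i) & 1)
    return patched


# Hypothetical list as a tool chain might report it: name -> bit offsets.
fuse_map = {"uart_divisor": [8, 9, 10, 11], "led_invert": [3]}
original = bytearray(4)  # stand-in for a real configuration stream
patched = apply_settings(original, fuse_map, {"uart_divisor": 0b1010, "led_invert": 1})
print(patched.hex())
```

Working on a copy keeps the original image intact, so the same base configuration can be patched differently for each board.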

Non-volatile configuration "fuses" setting the state of each internal pin

The state of each output pin needs to be defined right from power up.

Power issues

Three or four power supplies is a burden in many, if not most, applications. Make everything run on either the core supply or a single I/O supply of 2.5V or 3.3V (user's choice). The user should be able to eliminate at least one of those two I/O supplies if they can use all 3.3V or all 2.5V parts. No power sequencing. And no requirement that power supplies increase monotonically. Linear regulators, if used, like to fold back, and USB has inrush current limits.

WinDriver

Jungo WinDriver (used for pod support) is not acceptable. It not only creates problems at install time and every time you upgrade the kernel, it creates gigantic security holes. As far as Xilinx is concerned, there has been a little bit of progress here.

Open JTAG pods

You are in the chip business, not the pod business. Full documentation including schematics, firmware source, CPLD source, and protocol documentation is an absolute, non-negotiable requirement. The current situation of needing a different vendor pod for each brand of chip, and not being able to program any of them directly, is grossly unacceptable. If we have an Atmel micro, a Xilinx FPGA, a National Ethernet MAC, etc. on a board, we should be able to use one pod for all of them. We should be able to use any pod with any software. It does not matter whether you make a profit on your pod designs or a loss. You are in the chip business. It is not a problem if people clone your pods - it is a good thing. The more pods in people's hands that support your software and your brand of chip, the more chips you can sell. If a clone pod can be had for half your price, then that just lowers the cost of entry for potential customers. You will still sell enough pods to cover the development cost, since many people will just buy your pods rather than shop around. If your pod is open, you might actually sell more of them. As it is, people are avoiding the Xilinx Platform Cable USB like the plague.

And while you might think it is a good thing to lock in your customers, that is not only psychopathic corporate behavior, it is also counterproductive. You are also locking out your competitors' customers plus many prospective FPGA users.

There is nothing special about the Platform Cable USB hardware design to protect. It is bloody obvious. A high percentage of engineers trying to come up with a solution are going to end up with essentially the same design. 20Mbit/second JTAG requires USB 2.0. That seriously limits your selection of micros. The selection is so bad, that the engineer is likely to select the EZ-USB FX2 because it is the best known and there is very little competition (and even less that can download code). The design of the FX2 is so bad, you need an external CPLD to use the FIFOs in bidirectional mode and to handle the shifting (no SPI port). The CPLD logic is pretty obvious. So anyone who tries to make a fast pod is going to consider the basic design of that pod. No breakthroughs there. Nothing worth hiding, even if hiding was a sensible thing to do.

When you withhold technical documentation, you are hurting your customers, and your own company, more than your competitors. Competitors have the economy of scale to reverse engineer; your customers don't.

Linux

Linux support is required. Not just one distribution, all of them. And the software needs to be free. Set up a multiboot machine and compile away. BTW, when I get around to it, I plan to automate this. If I didn't have to deal with all the problems described on this page, I probably would have already. Yet another way that, by wasting my time, you are shooting yourself in the foot.

Support the little guy

Don't ignore small developers. Big markets start small.

Remember that up front costs often come out of the engineer's pocket

Company management is often scared to try new technology. Often it is hard to get new tech through management without the engineer buying a development board, a JTAG pod or other ICE, and any development software out of his or her own pocket to prove it can be done. Rinse, lather, repeat until you find a chip that isn't crap. Prototype PCBs may also be involved. I have seen this time and time again at many different companies and even major R&D labs. Anything that raises costs of prototyping or evaluation can kill a project.

Hire Systems level people

Creating an FPGA chip is hard work. Your VLSI people are too busy looking at trees to see the forest. You need people who have the skills, experience, aptitude, inclination, and the priority to take a systems level view. Not just application engineers who spend most of their time doing tech support, but people who have experience designing complete complex systems, including PCB layout, micro firmware, interface, and programmable logic. Also, listen to your application engineers, who should in turn be listening to the customers. Well, if comp.arch.fpga is an indication, they should be doing a better job of listening.

Some background

I am designing an FPGA development board (not a toy "evaluation board") for rapid prototyping. It has eight 12 bit I/O ports with identical pinouts, plus a few extra 4 bit ports where I can scrounge some extra I/O pins. The same pinout is used on micro and FPGA pins. I call this duodecaport. I/O modules can plug into any of the duodecaports, and many of them can also stack. Many of the modules will work with either a micro or an FPGA.

Why 12 pins? Well, the least of the reasons is tradition. The 8255 chip grouped its I/O into sets of 12 pins; unfortunately, it was a lousy design in which the individual pins were not programmable as input or output (no data direction register). Many others have followed suit. 12 pins provides 8 data lines plus 4 handshake, 12 I/O pins, or 6 differential pairs. Two ports can combine to give 16 data lines plus 8 handshake. 12 lines is enough to provide a multiplexed address/data bus, a double pumped PXPIPE (Parallel PCI Express) style full duplex stream connection, an LPC bus, an SPI bus with 4 board select bits and 4 chip select bits, or a number of other bus configurations. An Ethernet RMII PHY interface fits in 12 pins, as does a USB ULPI or On-The-Go PHY. A PCI Express PXPIPE fits in 12 pins once you double pump it, with the help of a CPLD. Fewer pins and you frequently fall short; more pins and you frequently waste valuable pins. 12 pins will work with most micros and FPGAs.

To help with level shifting, to allow the use of different bus formats (or straight through connections to prototype FPGA applications that drive the chip directly), and to translate pins where different micros put different special functions on different pins, most I/O boards have a CPLD. Bus and star configurations are supported. You can stack a bunch of low speed boards on a bus to save pins and use star configuration for the high speed I/O devices. Also, more pins is incompatible with the trend towards narrower, higher speed buses and a star configuration.

A rapid prototyping system such as this has the potential to make a big difference in FPGA sales. This is not a toy evaluation board where someone can blink some LEDs, or yet another crappy "development" board with inadequate grounds and poorly thought out pinouts or connectors that are unsuitable for lab use (such as the Spartan-3E Starter Kit). This is a serious modular rapid prototyping development system that can be used in a very wide range of applications and where I/O modules can be reused across a number of different micros and FPGAs.
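
The 12 pin argument can be checked with a quick pin budget in Python. This is a rough sketch: the first three entries come straight from the text, while the ULPI and RMII signal lists are tallies of the standard signal names added here for illustration, not figures from the board design itself.

```python
PORT_PINS = 12  # pins on one duodecaport

# Pins needed by each interface. Comments list what is being counted.
interfaces = {
    "8 data + 4 handshake": 8 + 4,
    "6 differential pairs": 6 * 2,
    # SCK, MOSI, MISO plus the select lines described in the text
    "SPI + 4 board selects + 4 chip selects": 3 + 4 + 4,
    # DATA[7:0], CLK, DIR, NXT, STP
    "USB ULPI PHY": 8 + 4,
    # REF_CLK, CRS_DV, RX_ER, RXD[1:0], TX_EN, TXD[1:0], then MDC and MDIO
    "Ethernet RMII PHY with MDIO": 8 + 2,
}


def spare_pins(needed, port_pins=PORT_PINS):
    """Return how many port pins are left over, or raise if it does not fit."""
    if needed > port_pins:
        raise ValueError("interface needs more pins than one port provides")
    return port_pins - needed


for name, needed in interfaces.items():
    print(f"{name}: uses {needed}, spare {spare_pins(needed)}")
```

Each of these fits in a single port with at most two pins left over, which is the point of the "fewer pins falls short, more pins wastes them" argument.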

A case history

I tried to use FPGAs at a former employer for production systems. The stumbling blocks for management: no one had FPGA experience (my first FPGA design, done after I left the company, worked as expected the very first time I downloaded it, without simulation), surface mount, lack of 5V I/O or 5V tolerance, need for lots of power supplies, etc. The company did wave soldering in house, but the technician who did most of the PCB layout didn't know surface mount, the assembly workers didn't know surface mount, the repair technicians didn't know surface mount, the factory foreman didn't know surface mount, and they didn't have a reflow oven.
