Overview
High-level design tools offer field-programmable gate array (FPGA) technology to engineers and scientists who have little or no digital hardware design expertise. Whether you use graphical programming, C, or VHDL, the synthesis process is quite complex and can leave you wondering how FPGAs really work. What actually happens inside the chip to make programs execute within configurable blocks of silicon? This white paper is intended for the nondigital designer who wants to understand the fundamental parts of an FPGA and how it all works “under the hood.” This information is still helpful when using high-level design tools, and can hopefully shed some light on the inner workings of an extraordinary technology.
Field Programmable Gate Arrays
Every FPGA chip is made up of a finite number of predefined resources with programmable interconnects to implement a reconfigurable digital circuit.
Figure 1. The Different Parts of an FPGA
FPGA chip specifications include the number of configurable logic blocks, the number of fixed-function logic blocks, such as multipliers, and the size of memory resources like embedded block RAM. There are many other parts to an FPGA chip, but these are typically the most important when selecting and comparing FPGAs for a particular application.
At the lowest level, configurable blocks of logic, such as slices or logic cells, are made up of two basic things: flip-flops and look-up tables (LUTs). This is important to note because the various FPGA families differ in the way flip-flops and LUTs are packaged together. Virtex-II FPGAs, for example, have slices with two LUTs and two flip-flops, whereas Virtex-5 FPGAs have slices with four LUTs and four flip-flops. The LUT architecture itself may also differ (4-input versus 6-input); a later section describes how LUTs work in more detail.
Table 1 lists the specifications of the FPGAs used in LabVIEW FPGA hardware targets. The number of gates has traditionally been a way to compare FPGA chips to ASIC technology, but it does not truly describe the number of individual components inside an FPGA. This is one of the reasons why Xilinx did not specify the number of gates for the new Virtex-5 family.
| | Virtex-II 1000 | Virtex-II 3000 | Spartan-3 1000 | Spartan-3 2000 | Virtex-5 LX30 | Virtex-5 LX50 | Virtex-5 LX85 |
|---|---|---|---|---|---|---|---|
| Gates | 1 million | 3 million | 1 million | 2 million | ----- | ----- | ----- |
| Flip-Flops | 10,240 | 28,672 | 15,360 | 40,960 | 19,200 | 28,800 | 51,840 |
| LUTs | 10,240 | 28,672 | 15,360 | 40,960 | 19,200 | 28,800 | 51,840 |
| Multiplier | 40 | 96 | 24 | 40 | 32 | 48 | 48 |
| Block RAM (kb) | 720 | 1,728 | 432 | 720 | 1,152 | 1,728 | 3,456 |
Table 1. FPGA Resource Specifications for Various Families
To understand these specifications better, consider the way code is synthesized into digital circuitry. Synthesis is the process of translating high-level programming languages into true hardware implementations. For any given piece of synthesizable code, either graphical or textual, there is a corresponding circuit schematic that describes how logic blocks should be wired together. The LabVIEW FPGA Module adds additional logic around every block diagram function before sending the final schematic to the compiler. Let’s examine a small section of block diagram code to see what the corresponding schematic looks like. Figure 2 shows an example of five Boolean signals being passed into a grouping of Boolean functions to graphically calculate a single binary value.

Figure 2. Small Section of a LabVIEW Block Diagram with Simple Boolean Logic
Under normal conditions (outside the LabVIEW single-cycle timed loop), the corresponding circuit schematic that results from the Figure 2 block diagram section looks like Figure 3.

Figure 3. Circuit Schematic Corresponding to Boolean Logic in Figure 2
It may be difficult to see, but there are actually two parallel branches of circuitry that are created. The five topmost black wires feed into the first branch, which adds a flip-flop between each Boolean operation. The five bottommost black wires go to a second chain of logic with the same number of flip-flops, which is created to keep track of the number of clock cycles needed to propagate data through the digital circuit. In total, 12 flip-flops and 12 LUTs are used to implement this schematic. The upper branch and each component are analyzed in the following sections.
Flip-Flops

Figure 4. Flip-Flop Symbol
Flip-flops are single-bit registers used to synchronize logic and save logical states between clock cycles. On every clock edge, a flip-flop latches the 1 or 0 (TRUE or FALSE) value on its input and holds that value constant until the next clock edge. Under normal conditions, LabVIEW FPGA places a flip-flop between every single operation to maximize the propagation time available for each operation to execute. The exception to this rule is when code is placed into a single-cycle timed loop structure. In this special loop structure, flip-flops are added only at the beginning and end of the loop iteration, and it is up to the programmer to understand timing considerations. Further details on how code within a single-cycle timed loop is synthesized are discussed in a later section. Figure 5 shows the upper branch of Figure 3, with flip-flops highlighted in red.
Figure 5. Schematic Drawing with Flip-Flops Highlighted in Red
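The clocked behavior described above can be modeled in a few lines of Python. This is only a behavioral sketch, not what LabVIEW FPGA generates: the `DFlipFlop` class, its `clock_edge` method, and the `run_pipeline` helper are hypothetical names chosen for illustration. It shows why a chain of flip-flops delays data by one clock cycle per stage.

```python
class DFlipFlop:
    """Behavioral model of a single flip-flop (hypothetical helper)."""

    def __init__(self, initial=0):
        self.q = initial  # value held between clock edges

    def clock_edge(self, d):
        """Capture input d on a clock edge; return the value held before the edge."""
        previous = self.q
        self.q = d
        return previous


def run_pipeline(stages, samples):
    """Clock a chain of flip-flops once per input sample and collect the output."""
    outputs = []
    for sample in samples:
        value = sample
        # Each stage passes on the value it held before this edge,
        # so data advances exactly one stage per clock cycle.
        for ff in stages:
            value = ff.clock_edge(value)
        outputs.append(value)
    return outputs


stages = [DFlipFlop() for _ in range(3)]
result = run_pipeline(stages, [1, 0, 1, 1, 0, 0])
print(result)  # [0, 0, 0, 1, 0, 1]: the input delayed by three clock cycles
```

With three stages, the first input value appears at the output on the fourth clock edge, which is the kind of latency the second branch of logic in Figure 3 is described as tracking.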
Look-Up Tables (LUTs)

Figure 6. Four-Input LUT
The remaining logic in the schematic shown in Figure 5 is implemented using very small amounts of RAM in the form of LUTs. It is easy to assume that the number of system gates in an FPGA refers to the number of NAND gates and NOR gates in a particular chip, but, in reality, all combinatorial logic (ANDs, ORs, NANDs, XORs, and so on) is implemented as truth tables within LUT memory. A truth table is a predefined list of outputs for every combination of inputs. (Faded visions of Karnaugh maps might be flashing through your head right now.)
Here is the quick refresher from digital logic class:
The Boolean AND operation, for example, is shown in Figure 7:

Figure 7. Boolean AND Operation
The corresponding truth table for the two inputs of an AND operation is shown in Table 2.
| Input 1 | Input 2 | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Table 2. Truth Table for Boolean AND Operation
You also can think of the inputs as the numerical index for all possible outputs, as shown in Table 3.
| LUT Index | Output |
|---|---|
| 0 (00) | 0 |
| 1 (01) | 0 |
| 2 (10) | 0 |
| 3 (11) | 1 |
Table 3. LUT Implementation of Truth Table for Boolean AND Operation
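The idea of using the inputs as an index into a stored list of outputs can be sketched in Python. This is an illustrative model only; `build_lut` and `lut_read` are hypothetical helper names, and treating the first input as the most significant index bit is an assumption that matches the ordering in Table 3.

```python
from itertools import product


def build_lut(func, num_inputs):
    """Precompute func for every input combination, in index order."""
    return [int(func(*bits)) for bits in product((0, 1), repeat=num_inputs)]


def lut_read(lut, *bits):
    """Form the index from the input bits (first input is the most significant bit)."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return lut[index]


and_lut = build_lut(lambda a, b: a and b, 2)
print(and_lut)                   # [0, 0, 0, 1]
print(lut_read(and_lut, 1, 1))   # 1
```

Once the table is filled in, evaluating the logic is a single memory read; swapping the lambda for an OR or XOR changes only the stored contents, not the lookup mechanism.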
Virtex-II and Spartan-3 FPGAs have four-input LUTs to implement truth tables with up to 16 combinations of four input signals. Figure 8 is an example of a four-input circuit implementation.

Figure 8. Circuit of Four Input Signals to Boolean Logic
Table 4 shows the corresponding truth table you would implement within a four-input LUT.
| LUT Index | Output |
|---|---|
| 0 (0000) | 1 |
| 1 (0001) | 1 |
| 2 (0010) | 1 |
| 3 (0011) | 0 |
| 4 (0100) | 0 |
| 5 (0101) | 0 |
| 6 (0110) | 0 |
| 7 (0111) | 1 |
| 8 (1000) | 0 |
| 9 (1001) | 0 |
| 10 (1010) | 0 |
| 11 (1011) | 1 |
| 12 (1100) | 0 |
| 13 (1101) | 0 |
| 14 (1110) | 0 |
| 15 (1111) | 1 |
Table 4. Corresponding Truth Table for Circuit Shown in Figure 8
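Because a four-input LUT stores only 16 one-bit outputs, its whole contents fit in a single 16-bit word. The sketch below packs the outputs from Table 4 into such a word and reads them back by index. The function names and the bit ordering (bit i holds the output for index i) are assumptions made for illustration, not a description of any particular tool's format.

```python
# Outputs from Table 4, listed by LUT index 0 through 15.
TABLE_4_OUTPUTS = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]


def pack_lut(outputs):
    """Pack the outputs into one integer; bit i holds the output for index i."""
    word = 0
    for index, bit in enumerate(outputs):
        word |= bit << index
    return word


def lut4(word, i3, i2, i1, i0):
    """Look up the output for four input bits (i3 is the most significant)."""
    index = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (word >> index) & 1


lut_word = pack_lut(TABLE_4_OUTPUTS)
print(hex(lut_word))                 # 0x8887
print(lut4(lut_word, 0, 1, 1, 1))    # index 7 -> 1
print(lut4(lut_word, 0, 0, 1, 1))    # index 3 -> 0
```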
FPGAs in the Virtex-5 family use six-input LUTs, which implement truth tables with up to 64 combinations of six different input signals. This becomes increasingly important when using single-cycle timed loops in LabVIEW FPGA, as the combinatorial logic between flip-flops can become very complex. The next section describes how single-cycle timed loops optimize FPGA resource usage in LabVIEW.
Single-Cycle Timed Loops
The example code used in previous sections assumed that code was placed outside a single-cycle timed loop, and additional circuitry was synthesized to ensure synchronous dataflow execution. The single-cycle timed loop is a special structure in LabVIEW FPGA that generates a much more optimized circuit schematic, with the expectation that all branches of logic can execute within a single clock cycle. If a single-cycle timed loop is configured to run at 40 MHz, for example, all branches of logic must execute within a clock tick of 25 ns.
If the same Boolean logic from the previous example is placed inside a single-cycle timed loop, as shown in Figure 9, the generated circuit schematic looks like Figure 10.
Figure 9. Simple Boolean Logic within a Single-Cycle Timed Loop

Figure 10. Circuit Schematic Corresponding to Boolean Logic in Figure 9
When compared to the previous schematic shown in Figure 3, it is clear that this implementation is much simpler. The logic between the flip-flops would require at least two 4-input LUTs on a Virtex-II or Spartan-3 FPGA (shown in Figure 11).

Figure 11. Four-Input LUT Implementation of Circuit Schematic in Figure 10
Since Virtex-5 FPGAs have 6-input LUTs, the exact same logic could be implemented within a single LUT (shown in Figure 12).

Figure 12. Six-Input LUT Implementation of Circuit Schematic in Figure 10
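The LUT counts quoted for this five-input example follow from a simple pin-counting argument, sketched below. The `min_luts` function is a hypothetical helper that gives only a lower bound for logic that reduces several signals to a single output; the number a real synthesis tool produces depends on the specific logic function.

```python
import math


def min_luts(num_signals, lut_inputs):
    """Lower bound on LUTs needed to combine num_signals into one output."""
    if num_signals <= 1:
        return 0
    # m LUTs offer m * k input pins, which must absorb the n source
    # signals plus the m - 1 internal LUT-to-LUT connections.
    return math.ceil((num_signals - 1) / (lut_inputs - 1))


print(min_luts(5, 4))  # 2
print(min_luts(5, 6))  # 1
```

For five input signals this gives two 4-input LUTs or one 6-input LUT, in line with Figures 11 and 12.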
The single-cycle timed loop used in this example (Figure 9) is configured to run at 40 MHz, which means that the logic between any two flip-flops must execute within one clock tick of 25 ns. The maximum speed at which code can execute depends on how long signals take to propagate through the circuit. The branch of logic with the longest propagation delay is known as the critical path, and it determines the theoretical maximum clock speed for that part of the circuit. The six-input LUTs on Virtex-5 FPGAs not only reduce the total number of LUTs needed to implement a given piece of logic but also reduce the propagation delay through that piece of logic. This means that you can configure the same single-cycle timed loop for faster clock rates simply by choosing a Virtex-5-based hardware target.
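The relationship between clock rate, clock period, and critical path can be worked through numerically. In the sketch below, the function names and the path delays are hypothetical values chosen for illustration; they are not measurements from any FPGA.

```python
def clock_period_ns(frequency_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


def max_clock_mhz(path_delays_ns):
    """Theoretical maximum clock rate set by the slowest (critical) path."""
    critical_path_ns = max(path_delays_ns)
    return 1000.0 / critical_path_ns


# Hypothetical propagation delays for three branches of logic.
path_delays_ns = [6.5, 12.5, 9.0]

period = clock_period_ns(40)
print(period)                          # 25.0
print(max(path_delays_ns) <= period)   # True: every branch fits in one 40 MHz tick
print(max_clock_mhz(path_delays_ns))   # 80.0
```

Shortening the slowest branch, for example by fitting it into fewer LUTs, is what raises the achievable clock rate; the faster branches do not matter.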
For more information on the benefits of Virtex-5 FPGAs, please see the white paper listed in the resources below.
Multipliers and DSP Slices

Figure 13. Multiply Function
The seemingly simple task of multiplying two numbers together can get extremely resource-intensive and complex to implement in digital circuitry. To provide some frame of reference, Figure 14 is the schematic drawing of one way to implement a 4-bit by 4-bit multiplier using combinatorial logic.

Figure 14. Schematic Drawing of a 4-Bit by 4-Bit Multiplier
Now imagine multiplying two 32-bit numbers together, and you end up with more than 2,000 operations for a single multiply. Because of this, FPGAs have prebuilt multiplier circuitry to save on LUT and flip-flop usage in math and signal processing applications. Virtex-II and Spartan-3 FPGAs have 18-bit by 18-bit multipliers, so multiplying two 32-bit numbers together actually requires three multipliers for a single operation. Many signal processing algorithms involve keeping the running total of numbers being multiplied, and, as a result, higher-performance FPGAs like Virtex-5 have prebuilt multiply-accumulate circuitry. These prebuilt processing blocks, also known as DSP48 slices, integrate a 25-bit by 18-bit multiplier with adder circuitry. LabVIEW FPGA, however, uses the multiplier functionality independently. Table 5 shows multiplier resources for various FPGA families.
| | Virtex-II 1000 | Virtex-II 3000 | Spartan-3 1000 | Spartan-3 2000 | Virtex-5 LX30 | Virtex-5 LX50 | Virtex-5 LX85 |
|---|---|---|---|---|---|---|---|
| Number of Multipliers | 40 | 96 | 24 | 40 | 32 | 48 | 48 |
| Type | 18x18 | 18x18 | 18x18 | 18x18 | DSP48 Slices | DSP48 Slices | DSP48 Slices |
Table 5. Multiplier Resources for Various FPGAs
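To give a feel for why multiplication is expensive in plain logic, the sketch below multiplies two unsigned numbers using only shifts and adds, and counts the cells in a textbook array multiplier. The function names are hypothetical, and the cell-count formula (one AND gate per bit pair, plus one single-bit adder cell for every bit of every partial product after the first) is one common way to count; it is an assumption that the figure of more than 2,000 operations quoted above was derived in a similar way.

```python
def shift_and_add_multiply(a, b, width):
    """Multiply two unsigned width-bit integers using only shifts and adds."""
    product = 0
    for i in range(width):
        bit = (b >> i) & 1
        # Partial product: every bit of a ANDed with bit i of b.
        partial = a if bit else 0
        product += partial << i
    return product


def array_multiplier_cells(width):
    """Cell counts for a textbook width-by-width array multiplier."""
    and_gates = width * width            # one AND per pair of input bits
    adder_cells = width * (width - 1)    # single-bit adders summing the partial products
    return and_gates, adder_cells


print(shift_and_add_multiply(13, 11, 4))   # 143
print(array_multiplier_cells(4))           # (16, 12)
print(sum(array_multiplier_cells(32)))     # 2016
```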
Block RAM
Memory resources are another key specification to consider when selecting FPGAs. User-defined RAM, embedded throughout the FPGA chip, is useful for storing datasets or passing values between parallel loops. Depending on the FPGA family, you can configure the onboard RAM in blocks of 16 or 36 kb. You still have the option to implement datasets as arrays using flip-flops; however, large arrays quickly become expensive for FPGA logic resources. A 100-element array of 32-bit numbers could consume more than 30 percent of the flip-flops in a Virtex-II 1000 FPGA or take up less than 1 percent of the embedded block RAM. Digital signal processing algorithms often need to keep track of an entire block of data, or the coefficients of a complex equation, and without onboard memory, many processing functions do not fit within the configurable logic of an FPGA chip. Figure 15 shows the graphical functions for reading and writing to memory using block RAM.
Figure 15. Block RAM Functions for Writing and Reading to Memory
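The percentages quoted for the 100-element array can be checked with the Virtex-II 1000 figures from Table 1. The short calculation below counts bits only, and it assumes that 1 kb means 1,024 bits; the conclusion is the same if 1 kb is taken as 1,000 bits.

```python
ARRAY_ELEMENTS = 100
BITS_PER_ELEMENT = 32
VIRTEX_II_1000_FLIP_FLOPS = 10_240   # from Table 1
VIRTEX_II_1000_BLOCK_RAM_KB = 720    # from Table 1

array_bits = ARRAY_ELEMENTS * BITS_PER_ELEMENT

# One flip-flop stores one bit.
flip_flop_percent = 100 * array_bits / VIRTEX_II_1000_FLIP_FLOPS
# Assumption: 1 kb is taken as 1,024 bits.
block_ram_percent = 100 * array_bits / (VIRTEX_II_1000_BLOCK_RAM_KB * 1024)

print(array_bits)                     # 3200
print(flip_flop_percent)              # 31.25
print(round(block_ram_percent, 2))    # 0.43
```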
You can also use memory blocks to hold periodic waveform data for onboard signal generation by storing one complete period as a table of values and indexing through the table sequentially. The ultimate frequency of the output signal is determined by the rate at which values are indexed, and you can use this method for dynamically changing the output frequency without introducing sharp transitions in the waveform.
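The table-based signal generation described above can be sketched as follows. The table size, the update rate, and the function names are hypothetical. The `step` parameter, which skips entries to raise the output frequency at a fixed update rate, is an addition for illustration; with `step` equal to 1 the table is read strictly sequentially, as in the text.

```python
import math

TABLE_SIZE = 256  # hypothetical number of samples in one stored period

# One complete sine period stored as a table, as block RAM would hold it.
sine_table = [math.sin(2 * math.pi * n / TABLE_SIZE) for n in range(TABLE_SIZE)]


def generate(table, step, num_samples):
    """Read the table in order, advancing by `step` entries per update."""
    index = 0
    samples = []
    for _ in range(num_samples):
        samples.append(table[index])
        index = (index + step) % len(table)  # wrap at the end of the period
    return samples


def output_frequency_hz(update_rate_hz, step, table_size):
    """Output frequency when `step` entries are consumed per update."""
    return update_rate_hz * step / table_size


print(output_frequency_hz(1_000_000, 1, TABLE_SIZE))  # 3906.25
print(output_frequency_hz(1_000_000, 4, TABLE_SIZE))  # 15625.0
```

Because the index always continues from its current position, changing the update rate or the step alters the frequency without a jump in the output value.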
Figure 16. Block RAM Functions for FIFO Buffers
The inherent parallel execution of FPGAs allows for independent pieces of hardware logic to be driven by different clocks. Passing data between logic running at different rates can be tricky, and onboard memory is often used to smooth out the transfer using first-in-first-out (FIFO) buffers. You can configure FIFO buffers, shown in Figure 16, for different sizes to help ensure that data is not lost between asynchronous parts of the FPGA chip. Table 6 shows the user-configurable block RAM embedded in various FPGA families.
| | Virtex-II 1000 | Virtex-II 3000 | Spartan-3 1000 | Spartan-3 2000 | Virtex-5 LX30 | Virtex-5 LX50 | Virtex-5 LX85 |
|---|---|---|---|---|---|---|---|
| Total RAM (kbits) | 720 | 1,728 | 432 | 720 | 1,152 | 1,728 | 3,456 |
| Block Size (kbits) | 16 | 16 | 16 | 16 | 36 | 36 | 36 |
Table 6. Memory Resources for Various FPGAs
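The role of a FIFO buffer between two pieces of logic running at different rates can be modeled in software. The sketch below is a behavioral illustration only: the `Fifo` class is a hypothetical helper, and the producer and consumer schedule (one write per tick, a burst of three reads every third tick) is invented for the example.

```python
from collections import deque


class Fifo:
    """Fixed-depth first-in-first-out buffer (hypothetical behavioral model)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def write(self, value):
        """Store a value; return False if the buffer is full and the value is rejected."""
        if len(self.items) >= self.depth:
            return False
        self.items.append(value)
        return True

    def read(self):
        """Return the oldest value, or None if the buffer is empty."""
        return self.items.popleft() if self.items else None


fifo = Fifo(depth=4)
received = []
for tick in range(12):
    fifo.write(tick)              # producer writes on every tick
    if tick % 3 == 2:             # consumer wakes every third tick...
        for _ in range(3):        # ...and drains up to three values
            value = fifo.read()
            if value is not None:
                received.append(value)

print(received)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

With a depth of 4 nothing is lost, because at most three values are waiting at any time. A depth of 2 would reject every third write, which is why buffer size has to be chosen to match the worst-case gap between reads.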
Conclusion
The adoption of FPGA technology continues to increase as higher-level tools evolve and further abstract the concepts described in this white paper. It is still important, however, to look inside the FPGA and appreciate how much is actually happening when block diagrams are compiled down to execute in silicon. Comparing and selecting hardware targets based on flip-flops, LUTs, multipliers, and block RAM is the best way to choose the right FPGA chip for your application. Understanding resource usage is extremely helpful during development, especially when optimizing for size and speed. These fundamental building blocks are not meant to be a comprehensive list of all resources, and there are many other parts to an FPGA that were not discussed. You can continue to learn more about FPGAs and digital hardware design through the recommended resources below:
Additional Resources
The Design Warrior's Guide to FPGAs - by Clive "Max" Maxfield
Browse Customer Solutions Using FPGA Technology
Video: Introduction to LabVIEW FPGA
An Introduction to FPGA Technology: Read the Top Five Benefits
Learn about R Series Intelligent DAQ
National Instruments LabVIEW FPGA Module
Legal
This tutorial (this "tutorial") was developed by National Instruments ("NI"). Although technical support of this tutorial may be made available by National Instruments, the content in this tutorial may not be completely tested and verified, and NI does not guarantee its quality in any way or that NI will continue to support this content with each new revision of related products and drivers. THIS TUTORIAL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND AND SUBJECT TO CERTAIN RESTRICTIONS AS MORE SPECIFICALLY SET FORTH IN NI.COM'S TERMS OF USE (http://ni.com/legal/termsofuse/unitedstates/us/).





