
Document Type: Tutorial
NI Supported: Yes
Publish Date: Aug 2, 2009

High Data Rate Filtering on the NI PXIe-5641R RIO IF Transceiver


Overview

This application note discusses the critical issues in creating a LabVIEW FPGA filter structure for Finite Impulse Response (FIR) filters. It looks specifically at generic FIR filter structures and methods of optimizing them. Attention is given to maximizing the clock rate of the filter clock domain, maximizing the data rate, and minimizing FPGA resource utilization. Methods include reuse of multipliers, optimizing the number of bits (precision) of the data, and use of symmetric filters.

Introduction

The introduction of the NI PCI-5640R IF Transceiver greatly increased the data rate available for processing data using LabVIEW FPGA. The NI PCI-5640R module has two ADCs with a maximum IQ rate of 25 MSps and two DACs supporting a maximum IQ rate of 50 MSps. Recently, the PXI Express version, the NI PXIe-5641R, has been released. Algorithms designed using LabVIEW FPGA must be able to support these data rates to implement communication schemes such as FM, QAM, and PSK modulation. A critical part of implementing a communication scheme is filtering. The National Instruments Digital Filter Design Toolkit is a powerful tool for creating filters and generating filter code for LabVIEW FPGA. Unfortunately, at this time, the supported filter structure implementations use a single multiplier in a loop, which limits the throughput of the filters and of the overall processing.

This application note discusses the critical issues in creating a LabVIEW FPGA filter structure for Finite Impulse Response (FIR) filters. It looks specifically at generic FIR filter structures and methods of optimizing them. Attention is given to maximizing the clock rate of the filter clock domain, maximizing the data rate, and minimizing FPGA resource utilization. Methods include reuse of multipliers, optimizing the number of bits (precision) of the data, and use of symmetric filters. Some topics are discussed only briefly, with a reference to more information.

Fixed-Point Filter Design Process using the Digital Filter Design Toolkit

Fixed-point signal processing platforms, such as field-programmable gate arrays (FPGAs), are typically more power-efficient and less expensive than floating-point alternatives. However, fixed-point systems are generally more difficult to design. For example, you must consider the effects of coarse quantization in fixed-point systems.

To design a fixed-point filter using the LabVIEW Digital Filter Design Toolkit, you first must design a floating-point filter, also known as a reference filter, that meets the target specifications. In some cases you need to design a reference filter that exceeds the target specifications. The excess margin ensures a smooth conversion from a floating-point representation to a fixed-point representation. You then must modify the floating-point filter to accommodate the finite-precision constraints of the target platform while still trying to meet the target specifications. The following figure illustrates the fixed-point filter design process. The grey boxes illustrate the floating-point filter design process, the dotted lines represent optional steps, and the arrows on the left indicate which return steps occur if the filter design fails to meet the requirements in the current step.

Designing a fixed-point filter from a reference floating-point filter involves the following steps:

1.    Selecting a filter structure - In floating-point filter design, after you select a design method, the LabVIEW Digital Filter Design Toolkit uses a default filter structure according to the specified design method. However, in fixed-point implementations, different filter structures can have different memory and multiplier requirements and might cause different finite word length effects. To obtain the best filtering results, you must convert the default filter structure to an appropriate structure. This paper discusses the use of the standard FIR Direct Form filter structure as well as the FIR Symmetric Form filter structure.

2.    Scaling the filter coefficients - Every filter structure contains many accumulators, each of which might use a different data range. You can scale the filter coefficients by using the DFD Scale Filter VI to ensure that all of the accumulators use the same data range. Scaling the filter coefficients can help you obtain a better filtering result.  This step is optional.

3.    Quantizing the floating-point filter - Quantization is the process of approximating a fixed-point value for each reference floating-point value. You then can use the fixed-point values in fixed-point mathematical computation or a hardware implementation. By quantizing the coefficients of the reference floating-point filter, you convert a floating-point filter to a fixed-point filter.

4.    Analyzing the fixed-point filter - To determine how the characteristics of the realized fixed-point filter deviate from the characteristics of the reference floating-point filter, you must analyze the fixed-point filter.

5.    Creating a fixed-point filter model - To create the fixed-point filter model, you must configure the quantizers for the input and output signals and specify the settings for internal computation.

6.    Simulating the fixed-point filter - Before applying the fixed-point filter model in real-world applications, it is a good idea to simulate the behavior of the filter to verify if the fixed-point filter model works as you require in a simulation. If the fixed-point filter does not provide the required performance in the simulation, you can change the implementation structure, modify quantization settings, or redefine the filter specifications for the reference floating-point filter.

7.    Generating code from the fixed-point filter - The LabVIEW Digital Filter Design Toolkit can export filter coefficients and automatically generate integer LabVIEW FPGA code but, at this time, does not support parallel filter structures with a data rate of 1 clock cycle per sample. Included with this paper are a few LabVIEW FPGA examples that can be used to implement filters in your LabVIEW FPGA project.
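The quantization step (step 3) can be illustrated outside of LabVIEW with a short Python sketch. This is not the Digital Filter Design Toolkit's algorithm; it is a minimal model that assumes signed coefficients with magnitude below 1, round-to-nearest quantization, and a hypothetical 5-tap reference filter. The function names are illustrative only.

```python
def quantize_coefficients(coeffs, word_length=18):
    """Round each coefficient to a signed fixed-point integer of word_length bits.

    Assumes every |coefficient| < 1, so all bits except the sign bit are
    fractional bits. Values are clamped to the representable range.
    """
    frac_bits = word_length - 1
    scale = 1 << frac_bits
    lo, hi = -scale, scale - 1
    ints = [max(lo, min(hi, round(c * scale))) for c in coeffs]
    return ints, frac_bits


def max_quantization_error(coeffs, ints, frac_bits):
    """Largest absolute difference between reference and quantized coefficients."""
    scale = 1 << frac_bits
    return max(abs(c - q / scale) for c, q in zip(coeffs, ints))


# Hypothetical symmetric reference (floating-point) filter, for illustration only.
reference = [0.1, 0.2, 0.4, 0.2, 0.1]

ints18, frac18 = quantize_coefficients(reference, word_length=18)
ints8, frac8 = quantize_coefficients(reference, word_length=8)

print("18-bit error:", max_quantization_error(reference, ints18, frac18))
print(" 8-bit error:", max_quantization_error(reference, ints8, frac8))
```

Comparing the two error figures shows why the analysis step (step 4) matters: a shorter word length moves the realized coefficients further from the reference filter.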

For more information on using the NI LabVIEW Digital Filter Design Toolkit for creating the filter taps for LV FPGA filters, see the Digital Filter Design Toolkit Help file.

Finite Impulse Response Filters

Finite Impulse Response (FIR) filters are digital filters with finite impulse responses. FIR filters are also known as nonrecursive filters, convolution filters, or moving-average (MA) filters, because you can express the output of an FIR filter as a finite convolution:

$$y_i = \sum_{k=0}^{n-1} h_k \, x_{i-k}$$

where $x_i$ represents the input sequence to be filtered, $y_i$ represents the output filtered sequence, $h_k$ represents the FIR filter coefficients, and n is the number of coefficients.

FIR filters have the following characteristics:

• They can be designed to have linear phase by ensuring coefficient symmetry.

• They are always stable.

• You can perform the filtering function using the convolution. A delay generally is associated with the output sequence:

$$\text{delay} = \frac{n-1}{2}$$

where n is the number of FIR filter coefficients.
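As a minimal sketch of the finite convolution described above, the following Python function computes each output sample from the current and previous inputs. The tap values are hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(x, h):
    """Return y[i] = sum over k of h[k] * x[i - k], with x[j] = 0 for j < 0."""
    y = []
    for i in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            if i - k >= 0:
                acc += hk * x[i - k]
        y.append(acc)
    return y


h = [1, 2, 3, 2, 1]               # hypothetical symmetric (linear-phase) taps
impulse = [1, 0, 0, 0, 0, 0, 0]

response = fir_filter(impulse, h)  # the taps, followed by zeros
delay = (len(h) - 1) / 2           # delay in samples for a symmetric filter

print(response, delay)
```

Feeding in an impulse returns the coefficients themselves and then zeros, which is what makes the impulse response finite. For these symmetric taps, the peak of the response sits at the computed delay of two samples.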

LabVIEW FPGA Concepts, Structures and VIs

LabVIEW FPGA Concept: Combinatorial Path

A combinatorial path is the path through logic between the output of a register and the input of another register on an FPGA. A register stores data on an FPGA and updates the data on the rising edge of a clock. Long combinatorial paths take more time to execute and limit the maximum clock rate of the clock domain.

LabVIEW FPGA Concept: Throughput

Throughput specifies the minimum number of cycles between two successive samples of valid input data.  High sample rate processing may require a throughput of one cycle per valid data input.  This is normally written as cycle/sample and will be used this way on LabVIEW FPGA configuration pages.

LabVIEW FPGA Single-Cycle Timed Loop

The Single-Cycle Timed Loop (SCTL) is the most important element in enabling high sample rate processing in LabVIEW FPGA.  The SCTL is a special use of the LabVIEW Timed Loop structure when used in an FPGA VI. This loop executes all functions inside within one tick of the FPGA clock you have selected.  You can use the SCTL with derived clocks to clock the loop at a rate other than the default clock. 

Using a traditional While Loop in your FPGA VI takes an absolute minimum of 3 ticks to execute each iteration. This is because of the enable chain used in the compiled FPGA VI. An explanation of the enable chain is beyond the scope of this document, but is used to ensure dataflow when the FPGA VI is compiled into a bitfile.  Additionally, each function inside the While Loop will require at least one tick to execute, although functions will execute in parallel if there is no data dependency. With the SCTL, all functions inside the loop must execute within a single clock tick.

The performance benefits of using an SCTL in an FPGA VI vary depending on what is in the loop. Unlike a normal loop, the code within an SCTL must complete in one clock tick to compile successfully, but the performance improvement is marked. Because logic is implemented combinatorially in hardware, the FPGA configuration generated by the code uses fewer resources. Instead of doing an add, saving the result, and then doing a multiply and saving the result as in a normal While Loop, the SCTL does both in one tick and does not have to save the result in between. This conserves FPGA resources because no flip-flop is needed between operations to save the result of each previous operation. Shift registers or feedback nodes are used to allow logic to execute in parallel and pass data between subsequent iterations of the SCTL; thus, the entire logic chain is implemented over multiple SCTL iterations. As with any parallel implementation in an FPGA VI, this uses additional FPGA resources.

The benefit of the SCTL in high sample rate applications is that the data throughput can be higher than in a normal While loop.  The goal is to run the FIR filters in an SCTL with a loop rate of 100 MHz.

Long combinatorial paths are typically a problem in SCTLs because the logic between the input register and the output register must execute within one period of the clock rate you specify. In the SCTL, registers within and between components are removed, increasing the length of the combinatorial path between registers. If the code in a combinatorial path does not execute within a clock cycle, LabVIEW returns a timing violation in the Compilation Failure dialog box.

LabVIEW FPGA Multiply VIs

When using the SCTL, the maximum rate at which the loop can run is determined by the slowest combinatorial logic implemented in the SCTL. For example, suppose there are five parallel logic operations inside an SCTL. Four of the operations take 7 ns or less to complete, and the fifth operation, the normal LabVIEW FPGA Multiply, takes 11.1 ns. Since the operations are all parallel, the shortest time in which the SCTL can run once is 11.1 ns. The inverse of this loop time is the fastest possible loop rate, or about 90 MHz.
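The arithmetic in this example can be sketched in a few lines of Python. Only the 11.1 ns multiply delay and the 7 ns upper bound come from the text; the four individual path delays below are made-up values for illustration, and the function names are hypothetical.

```python
def max_loop_rate_mhz(path_delays_ns):
    """Fastest clock rate (MHz) at which the slowest parallel path fits in one period."""
    return 1000.0 / max(path_delays_ns)


def meets_timing(path_delays_ns, clock_mhz):
    """True if every parallel path completes within one period of the given clock."""
    period_ns = 1000.0 / clock_mhz
    return max(path_delays_ns) <= period_ns


# Four illustrative operations at 7 ns or less, plus the 11.1 ns Multiply.
delays_ns = [7.0, 6.5, 5.0, 4.2, 11.1]

print(round(max_loop_rate_mhz(delays_ns), 1))   # about 90.1 MHz
print(meets_timing(delays_ns, 100))             # False: 11.1 ns exceeds the 10 ns period
```

The faster paths do not matter here. Only the single slowest path sets the limit, which is why the multiplier is the operation to optimize.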

Most of the time, it is extremely beneficial to run the SCTL as fast as possible, so that data is processed at a rate that allows for other resource-saving implementations, such as reuse of multipliers. The basic LabVIEW Multiply takes about 11.1 ns to complete on a Virtex-5 SX95 FPGA. Presently there are no options that allow this multiply operation to complete faster on the FPGA. There is an alternative multiply VI in LabVIEW, the FXP Multiply, which is part of the FPGA Fixed-Point Math Library included with LabVIEW. Its timing performance on the FPGA can be improved by pipelining the multiplier. The functionality of a pipelined multiplier is equivalent to a normal multiplier followed by a cascade of registers, as shown below. The number of registers is equal to the number of pipelining stages. While the output of the multiply operation is delayed by one or two clock cycles, the combinatorial logic of each multiplier stage completes in less than 10 ns (the period of a 100 MHz clock), allowing an SCTL rate of at least 100 MHz.
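The functional equivalence described here, a multiplier followed by a cascade of registers, can be modeled behaviorally in Python. This is only a sketch of the latency behavior, not of the FXP Multiply VI itself; the class name and interface are hypothetical, and the registers are assumed to reset to zero.

```python
class PipelinedMultiplier:
    """Multiplier followed by `stages` registers; each clock() call is one clock tick."""

    def __init__(self, stages):
        self.regs = [0] * stages      # register chain, assumed to reset to zero

    def clock(self, a, b):
        product = a * b
        if not self.regs:             # zero stages: purely combinatorial multiply
            return product
        out = self.regs[-1]           # value leaving the last register this tick
        self.regs = [product] + self.regs[:-1]
        return out


mult = PipelinedMultiplier(stages=2)
outputs = [mult.clock(a, 3) for a in [1, 2, 3, 4, 5]]
print(outputs)   # [0, 0, 3, 6, 9]: each product appears two ticks after its inputs
```

A new pair of inputs is still accepted on every tick, so throughput stays at one result per cycle. Only the latency grows with the number of stages.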

FPGA Multiplier Resources

Multipliers are a requirement for DSP applications on FPGAs. Without these components, the seemingly simple task of multiplying two numbers together can become extremely resource-intensive. The Virtex-II has an 18 x 18 bit multiplier, but with the release of the Virtex-4, Xilinx added a specialized logic block called the DSP48 slice. These blocks, specifically designed for DSP data and signal analysis operations, include built-in multiply and adder circuitry. The multiplier on the original DSP48 slice was 18 x 18 bit. Xilinx improved this by extending the multiplier capacity to 25 x 18 bit on the Virtex-5. These slices are called DSP48E slices. This slice supports more than 40 dynamically controlled operating modes, including multiplier, multiplier-accumulator, multiplier-adder, subtracter, three-input adder, barrel shifter, wide bus multiplexers, wide counters, and comparators. These slices make all of these functions available without consuming normal logic resources.

The improved size of 25 x 18 bits on the newer multipliers improves the processing of data over a sequence of operations.  Typically, data acquired by an ADC for signal processing application is at least 14 or 16 bits.  Older FPGAs using the 18 x 18 multipliers, while good at the time, could introduce small rounding errors that would build up over a series of processing stages.  For example, in implementing a FPGA filter, the data in is 16 bits.  The filter taps are implemented as 18 bit numbers, maximizing the number of bits for the FPGA multiplier.  If there is a second operation after the filtering operation, such as shifting the frequency of the signal with more multipliers, the data is limited to 18 bits at the input.  With the new 25 x 18 bit multipliers, the filter taps will still be 18 bits, but the data can now be processed over a number of steps maintaining a 25 bit precision until the end of the series of operations.  This will increase overall accuracy of the operations.

Each FPGA has a set number of multipliers available for use. The Virtex-II Pro P30 FPGA on the NI PCI-5640R has 136 multipliers, and the Virtex-5 FPGA on the NI PXIe-5641R has 640 multipliers. The width of the multiplier inputs must be chosen judiciously. In LabVIEW FPGA, or VHDL, inputs to a multiplier can be set to have more bits than a single multiplier supports. For example, if the FPGA has 18 x 18 bit multipliers and the two values to be multiplied are 20 bits and 24 bits wide, the compiler carries out the operation if resources allow, but in doing so it uses two FPGA multipliers. If the operation is a filter with 49 taps, and therefore 49 multiply operations, the compiler ends up using at least 98 FPGA multipliers, because each multiply operation in the filter uses two FPGA multiply resources.

Another area of concern is the use of multipliers when implementing a filter.  A filter processing data at a sample rate of 25 MSps with a large number of filter taps takes up a significant amount of resources. It is very possible that the complete FPGA application that is implemented may use up a greater number of multipliers than are available on the FPGA.  In this case, filter design in terms of the number of filter taps must be optimized as well to use a minimum number of multipliers.  A tradeoff may need to be made in reference to the quality of the desired filter response with respect to the number of taps/multipliers that are available.  It may take a few iterations in the filter design process to achieve the best quality filter given the number of multipliers that are available in the entire application.

Creating Filter Structures in LabVIEW FPGA

FIR Direct Form Structure

For FIR filters, the FIR Direct Form structure is the most straightforward structure from a filter transfer function perspective. The following figure represents the FIR Direct Form structure. The number of delays equals the filter order, M.  In terms of FPGA resources, the FIR Direct Form structure will contain M+1 multipliers. 



The expanded form of the equation for the 5-tap FIR filter in the image is as follows.

$$y(n) = h(0)\,x(n) + h(1)\,x(n-1) + h(2)\,x(n-2) + h(3)\,x(n-3) + h(4)\,x(n-4)$$

Currently, the NI Digital Filter Design Toolkit implements LabVIEW FPGA filters in a loop with a single fixed-point multiplier. Each iteration of the loop processes one of the multiplication steps of the filtering operation. For example, the filter in the previous figure requires 5 multiplications, so it takes at least 5 clock cycles to process one valid input data sample. The throughput is at least 5 cycles per sample. This may not seem like a large number, but if the application requires a much larger number of filter taps, the throughput becomes much lower. A filter that requires 41 taps has a throughput of at least 41 cycles per sample. If the maximum clock rate is 20 MHz, the maximum sample rate is less than about 488 kHz (20 MHz / 41). This may be adequate for many applications, but if the IQ rate is 1 MSps or higher, the effective sample rate is not high enough.
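The relationship between clock rate, throughput in cycles per sample, and achievable sample rate is a single division. The following Python sketch reproduces the 41-tap figure from the text; the function name is hypothetical.

```python
def max_sample_rate_hz(clock_hz, cycles_per_sample):
    """Highest sample rate a filter can sustain at the given throughput."""
    return clock_hz / cycles_per_sample


# Single-multiplier loop: one tap per cycle, so 41 taps need at least 41 cycles.
serial_41_taps = max_sample_rate_hz(20e6, 41)

# Fully parallel structure in a 100 MHz clock domain: 1 cycle per sample.
parallel = max_sample_rate_hz(100e6, 1)

print(round(serial_41_taps / 1e3, 1), "kHz")   # about 487.8 kHz
print(parallel / 1e6, "MSps")
```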

The FIR Direct Form structure can be implemented in LabVIEW FPGA in a way that supports a throughput of 1 cycle per sample for high sample rate applications. In this case, the structure accepts a new data sample on every clock cycle, but there is a pipeline delay of 5 clock cycles from an initial sample to a fully primed filter output. The layout of a LabVIEW FPGA implementation of an FIR filter is very similar to the FIR Direct Form structure above.



The LabVIEW FPGA example uses Fixed Point Discrete Delays to maintain and shift the current data from the Data In terminal from one cycle to the next. The Valid Data In terminal controls this process. Fixed Point Discrete Delays are also used to maintain the values of the filter taps. A third use for the Fixed Point Discrete Delays is to set up a delay chain that corresponds to the pipeline delay, or number of clock cycles, that it takes to process the data through the filter and provide a valid data out signal for the output. Prior to operation, the filter taps are loaded into the delays from the FIR Filter Coefficient terminal and are held there for the duration of the filtering operation. Fixed Point Multipliers are used with an internal 2-stage pipeline configured in the FXP Multiply configuration page. This is critical for the filter to meet the timing constraints of an SCTL running at 100 MHz. The multipliers are followed by stages of adders that sum the products into the filter output. This FIR filter structure can run in an SCTL that is clocked at 100 MHz. At this rate, the filter structure can support the processing of data at 100 MSps.
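A behavioral Python model can make the data flow of this structure easier to follow. The sketch below is not the LabVIEW FPGA example itself: it accepts one sample per call, uses one multiply per tap, and gates the delay line with a valid flag, but it deliberately ignores the pipeline latency of the multipliers and adder stages. The class and method names are hypothetical.

```python
class DirectFormFIR:
    """Direct Form FIR: one sample per call, one multiplier per tap."""

    def __init__(self, taps):
        self.taps = list(taps)                   # coefficients loaded once, then held
        self.delay_line = [0] * len(self.taps)   # x(n), x(n-1), ..., x(n-M)

    def clock(self, sample, valid=True):
        """Advance one clock cycle; return the output, or None if the input is not valid."""
        if not valid:
            return None                          # delay line holds its state
        self.delay_line = [sample] + self.delay_line[:-1]
        products = [h * x for h, x in zip(self.taps, self.delay_line)]  # M+1 multiplies
        return sum(products)                     # adder stages


fir = DirectFormFIR([1, 2, 3, 4, 5])             # hypothetical taps
outputs = [fir.clock(s) for s in [1, 0, 0, 0, 0, 0]]
print(outputs)   # [1, 2, 3, 4, 5, 0]
```

In hardware, each output would appear a fixed number of cycles later, and a delayed copy of the valid flag would mark when the output is meaningful.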

FIR Symmetric Form Structure

If the FIR filter being implemented is a linear-phase FIR filter designed for the FIR Direct Form structure, the number of multipliers can be almost halved. Because the coefficients of a linear-phase FIR filter are symmetric, the FIR Direct Form structure can be modified to reduce the number of multipliers from M+1 to M/2+1 (for even filter order M). In our 5-tap filter case, two multipliers can be eliminated. There are also two additional adders used in this implementation.



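The expanded form of the equation for the FIR filter in the image now simplifies as follows, because h(0) = h(4) and h(1) = h(3).

$$y(n) = h(0)\,[x(n) + x(n-4)] + h(1)\,[x(n-1) + x(n-3)] + h(2)\,x(n-2)$$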

The following is a LabVIEW FPGA implementation of the FIR Symmetric structure. There are now only three FPGA multipliers required for this filter implementation, instead of five.



The only difference between this filter structure implementation and the previous implementation is the addition of two adder blocks that sum x(n) with x(n-4) and x(n-1) with x(n-3).  This allows the savings of two FPGA multiply resources for other logic applications.
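The saving can be checked numerically with a small Python sketch that computes one output sample both ways. The tap and sample values are hypothetical, and the functions model only the arithmetic, not the timing of the LabVIEW FPGA implementation.

```python
def direct_output(window, h):
    """window = [x(n), x(n-1), x(n-2), x(n-3), x(n-4)]; uses five multiplies."""
    return sum(hk * xk for hk, xk in zip(h, window))


def symmetric_output(window, h):
    """Same result for symmetric h, using three multiplies and two pre-adders."""
    pre0 = window[0] + window[4]      # x(n)   + x(n-4), shares h(0) = h(4)
    pre1 = window[1] + window[3]      # x(n-1) + x(n-3), shares h(1) = h(3)
    return h[0] * pre0 + h[1] * pre1 + h[2] * window[2]


h = [3, -5, 8, -5, 3]                 # hypothetical symmetric taps
window = [7, -2, 4, 9, 1]             # hypothetical delay-line contents

print(direct_output(window, h), symmetric_output(window, h))   # 21 21
```

Note that each pre-adder output is one bit wider than its inputs, which is worth keeping in mind when choosing data widths for the multiplier inputs.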

This FIR filter structure can run in a SCTL that is clocked at 100 MHz.  At this rate, the filter structure can support the processing of data at 100 MSps.

FPGA Multiplier Reuse

The FIR Symmetric filter structure enabled the savings of almost half the number of FPGA multipliers as used in the FIR Direct Form structure.  Both structures support the processing of data at 100 MSps.  The NI PXIe-5641R ADCs have a max data rate of 25 MSps which can be taken advantage of to reduce the number of multipliers even more.  The following image is a modified form of the FIR Direct Form structure, but in this case, instead of taking advantage of FIR filter Symmetry to reduce the number of multipliers, it is taking advantage of the fact that incoming data rate is no more than half the clock rate of the filter clock domain.



This structure adds a number of multiplexors, or switches, to control the routing of data and filter taps to the multipliers. Each multiplexor or switch has only two positions, so the clock domain rate has to be at least twice the data rate. Valid data is only allowed to be shifted into the delay network on every other clock cycle (for example, on cycle i but not on cycle i + 1). On the first clock cycle, x(n) and h(0) are routed through one pair of switches to the first multiplier, x(n-2) and h(2) are routed through the next pair of switches to the second multiplier, and x(n-4) and h(4) are routed through the last pair of switches to the third multiplier. On the next clock cycle, no new data is allowed to be shifted in. On this cycle, x(n-1) and h(1) are routed through one pair of switches to the first multiplier, x(n-3) and h(3) are routed through the next pair of switches to the second multiplier, and zeros are routed through the last pair of switches to the third multiplier.

The outputs of each multiplier are all summed together by the adder on each clock cycle.  After the adder there is a discrete delay that holds the sum of the first set of operations one clock cycle.  On the next clock cycle, the current summation from the adder is added to the previous sum for the filter output.  In order to maintain correct timing, there is an internal enable chain implemented that will only be true at the output of the filter operation when valid data is available at the output of the second adder.
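The two-cycle schedule described above can be written out as a Python sketch. It models only the routing and summation for one output sample of the 5-tap case. The valid-signal enable chain and multiplier pipeline latency are left out, and the function names and values are hypothetical.

```python
def direct_output(window, h):
    """Reference result: five multiplies in a single cycle."""
    return sum(hk * xk for hk, xk in zip(h, window))


def reuse_output(window, h):
    """5-tap output computed over two clock cycles with three shared multipliers.

    window = [x(n), x(n-1), x(n-2), x(n-3), x(n-4)]
    """
    # Cycle 1: the multiplexors select the even-indexed samples and taps.
    cycle1 = [(window[0], h[0]), (window[2], h[2]), (window[4], h[4])]
    # Cycle 2: the multiplexors select the odd-indexed pairs; the third multiplier gets zeros.
    cycle2 = [(window[1], h[1]), (window[3], h[3]), (0, 0)]

    held_sum = sum(x * c for x, c in cycle1)          # held in the discrete delay
    return held_sum + sum(x * c for x, c in cycle2)   # second adder forms the output


h = [1, 2, 3, 4, 5]          # hypothetical taps (no symmetry required)
window = [7, -2, 4, 9, 1]    # hypothetical delay-line contents

print(reuse_output(window, h), direct_output(window, h))   # 56 56
```

Each cycle uses at most three multiplies, so three physical multipliers cover all five taps as long as a new sample arrives no more often than every second clock cycle.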

The following is a LabVIEW FPGA implementation of the FIR Direct Form structure with multiplexors to reduce the number of FPGA multiply resources required for the filtering operation.  There are now only three FPGA multipliers required for this filter implementation.



The LabVIEW FPGA example uses the Valid Data In Boolean for two purposes in this case. The first purpose is to clock the new valid data sample into the shift registers, and the second purpose is to control the multiplexors (grey colored subVIs) so that they route the correct data to the multipliers on the correct clock cycle. Valid data is clocked in on the appropriate clock cycle while previous data is shifted along the shift register chain. On the next clock cycle, which is the first clock cycle on which the data is processed, the TRUE value appears at the output of the shift register that delays the valid signal (a 0-to-1 transition) and sets the multiplexors to route the correct data to the multipliers. On this first clock cycle, x(n) and h(0) are routed through one pair of switches to the first multiplier, x(n-2) and h(2) are routed through the next pair of switches to the second multiplier, and x(n-4) and h(4) are routed through the last pair of switches to the third multiplier. On the next clock cycle, no new data is allowed to be shifted in and the output of the first data valid shift register is now FALSE, which changes the state of the multiplexors. On this cycle, x(n-1) and h(1) are routed through one pair of switches to the first multiplier, x(n-3) and h(3) are routed through the next pair of switches to the second multiplier, and zeros are routed through the last pair of switches to the third multiplier.

The concept of using multiplexors to control the flow of data to the FPGA multipliers can also be applied to the FIR Symmetric case to further save FPGA multiply resources. The following block diagram shows the previous FIR Symmetric block diagram modified with multiplexors to control the routing of data to the multiply logic. In this case, the number of multipliers has been reduced from the original 5 to 2 by combining the symmetric filter structure with multiplexor-controlled routing of data to the multipliers.



The following is a LabVIEW FPGA implementation of the FIR Symmetric Form structure with multiplexors to reduce the number of FPGA multiply resources required for the filtering operation.  There are now only two FPGA multipliers required for this filter implementation.



The concept of FPGA multiplier reuse can be used to save even more multipliers. The maximum reduction depends on the ratio of the clock rate of the clock domain to the sample rate of the data. The two previous cases resulted in a savings of about 2x and, when running in a clock domain of 100 MHz, can handle a sample rate of 50 MSps. If the sample rate of the data is only 1 MSps, there are 100 clock cycles available per sample, so it is possible to scale the concept all the way down to reusing a single multiplier for every tap of an FIR filter, assuming the filter has fewer than 100 taps.
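As a rough planning aid, the relationship between clock rate, sample rate, and multiplier count can be estimated in Python. This is a simplified model: it assumes every multiply fits in a single FPGA multiplier, counts one product per tap (or per symmetric tap pair), and ignores any extra control logic. The function name and parameters are hypothetical.

```python
import math


def multipliers_needed(num_taps, clock_hz, sample_rate_hz, symmetric=False):
    """Estimate the multipliers required when each one is reused across clock cycles."""
    if sample_rate_hz > clock_hz:
        raise ValueError("sample rate cannot exceed the clock rate")
    reuse = clock_hz // sample_rate_hz            # whole clock cycles available per sample
    products = math.ceil(num_taps / 2) if symmetric else num_taps
    return math.ceil(products / reuse)


CLOCK = 100_000_000  # 100 MHz clock domain

print(multipliers_needed(5, CLOCK, 100_000_000))                # 5: Direct Form, no reuse
print(multipliers_needed(5, CLOCK, 50_000_000))                 # 3: Direct Form, 2x reuse
print(multipliers_needed(5, CLOCK, 50_000_000, symmetric=True)) # 2: Symmetric, 2x reuse
print(multipliers_needed(99, CLOCK, 1_000_000))                 # 1: one multiplier for all taps
```

The first three results match the 5-tap structures discussed above. Actual compiled multiplier counts can differ, for example when input widths exceed what a single multiplier supports.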

The following table shows some benchmarks in compiling a FIR Symmetric filter using 127 taps.  There is a baseline compile with essentially an empty VI, and two other compiles of the FIR Symmetric filter where the multiplies are reused 2X and 4X.  As expected, the number of multipliers has dropped by almost one half from 68 to 36 multipliers.



Conclusion

Creating filters in LabVIEW FPGA for data rates as high as 50 MSps or more is possible with careful FPGA resource planning. Succeeding requires creating optimized LabVIEW FPGA filter structures that are capable of working in FPGA clock domains running as high as 100 MHz. The throughput of individual logic blocks must be as close to one clock cycle per sample as possible. Each logic process in an SCTL must be analyzed and implemented so that its combinatorial logic runs as fast as possible on the FPGA, with a goal of loop times of 10 ns or less. The design of the filter itself, in terms of the desired filter response, is just as important as creating a filter in LabVIEW FPGA that is capable of running at 100 MHz. It is very possible to create such a filter when keeping in mind the parts that are most likely to be the weakest link in terms of time and/or resources.

References

Pipelining to Optimize FPGA VIs (FPGA Module)

Optimizing your LabVIEW FPGA VIs: Parallel Execution and Pipelining

Single-Cycle Timed Loop FAQ for the LabVIEW FPGA Module

Using Clusters and Arrays in LabVIEW FPGA

Advantages of the Xilinx Virtex-5 FPGA

 

 

 


 

Legal
This tutorial (this "tutorial") was developed by National Instruments ("NI"). Although technical support of this tutorial may be made available by National Instruments, the content in this tutorial may not be completely tested and verified, and NI does not guarantee its quality in any way or that NI will continue to support this content with each new revision of related products and drivers. THIS TUTORIAL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND AND SUBJECT TO CERTAIN RESTRICTIONS AS MORE SPECIFICALLY SET FORTH IN NI.COM'S TERMS OF USE (http://ni.com/legal/termsofuse/unitedstates/us/).