Deterministic Data Streaming in Distributed Data Acquisition Systems
Overview
For Aerospace design engineers who build high-performance data acquisition systems using the LabVIEW Real-Time Module, third party distributed shared memory is a mechanism for high-speed data transfer between nodes of a distributed computing system. Unlike TCP/IP, distributed shared memory guarantees deterministic data transfer performance.
Table of Contents
Background and Scope
High-performance data acquisition systems are increasingly common in the aerospace industry in wind tunnel instrumentation, large-scale vibration test, and similar applications. One growing trend is the development of measurement systems that distribute the computational load of these demanding applications among multiple computers.
The objective of this document is to describe an approach to implementing deterministic data streaming between computers in this type of systems. Certain bus architectures (such as VXI and VME) support multiple CPU configurations. It is also possible to purchase PCs with multiple processors installed. This document, however, focuses on highly scalable system architectures with hard real-time performance requirements that harness the data bandwidth and computing power of multiple independent computers.
See Also:
Real-Time tutorial
Introduction to Distributed Data Acquisition Systems
At the highest level, distributed and single-CPU data acquisition systems have similar architectures. Both have distinct application components (consisting of a combination of hardware and software) that perform specific actions. Each of these components shares data with others according to its given task.
Each component is also an independent module that can be scaled or modified without affecting the other parts of the application. For example, in LabVIEW, a modular application can be created with function-specific VIs that are used as building blocks by higher level modules. Figure 1 illustrates some components typically found in a traditional data acquisition application.
Each component corresponds to an independent thread or process running in a single CPU computer, and data sharing between components is achieved using buffers or other data structures in shared system memory.

Figure 1. Data Acquisition Application Components on a Single Computer System
In a distributed architecture, components (or component subsets) are hosted by independent computers. The following figure shows an example of distributing the components shown in Figure 1.

Figure 2. Data Acquisition Application Components Distributed in Multiple Computer Nodes
As mentioned previously, the goal of distributed applications is to share the processing load among multiple computers. The following paragraphs discuss some of the factors that can drive the need for a distributed architecture.
Performance
By definition, high performance data acquisition applications require a high level of computing power. In the single-computer configuration illustrated in Figure 1, the application components (data acquisition, user interface and data logging) share the computing bandwidth of the CPU. In a distributed architecture, each CPU is dedicated to a specific function.
Scalability
The complexity of single-CPU data acquisition systems increases exponentially with performance requirements. A distributed configuration, provides an alternative architecture that can be scaled to meet future needs. In a distributed system, individual components can be upgraded or modified without affecting the other parts of the system. Computing capacity and functionality can be added where needed by upgrading hardware components or by adding computer nodes. For example, the channel count of a distributed data acquisition system could be expanded either by upgrading existing data acquisition nodes or by adding new data acquisition nodes, without affecting the user interface or data logging components.
Determinism
When multiple execution threads are scheduled to run in parallel on a single CPU, they tend to interact randomly in a way that increases jitter in their individual loop execution times. Conversely, reducing the number of tasks a CPU needs to perform (by distributing the tasks among multiple computers) improves the determinism of time-critical tasks.
Compatibility with a legacy system
In situations where a distributed system is already in use, expanding functionality by adding computing nodes can be more attractive than altering the existing software. Distributed systems can be heterogeneous, so newer technology such as PCI and PXI can be mixed with older instrumentation busses such as VXI.
Communication Requirements
Just as in a single-computer system, the components of a distributed system need to share data with each other in an easy and transparent way. The challenge is to implement a communication scheme that is best suited to link distributed components together. Depending on the functionality of the component, its communication requirements will fall into one of the following categories:
Non-Deterministic Communication
Some components do not require data to arrive at deterministic rates in order to work properly. A computer that simply monitors the data and displays it on a graphical user interface is an example of this type of component. Some data can even be skipped without affecting the monitor and display function. In this scenario, TCP and UDP are the two most popular communication mechanisms.
Deterministic Communication
Continuous data logging is a time-critical task typically found in distributed data acquisition systems. In addition to high speed and low latency, deterministic data transfer requires the ability to signal events on the computer receiving the data. For these applications, distributed shared memory is the most common solution.
A distributed system may have a mix of components with deterministic and non-deterministic interfaces. The following figure shows a typical example for high-performance data acquisition.

[+] Enlarge Image
Figure 3. A Distributed System Using Both Deterministic and Non-Deterministic Communication
The remainder of this paper focuses on techniques for implementing deterministic data streaming from the data acquisition component to the data logging component using distributed shared memory in the LabVIEW Real-Time environment. If you are not familiar with distributed shared memory or how it works, we suggest that you read the article found at the following link.
See Also:
Communication mechanisms for distributed real-time applications
Deterministic Data Sharing
In our example, the data acquisition component must transfer large amounts of data to the data logging component. As you will see, the communication mechanism must be optimized for the size and transfer rate. In order to avoid data loss, the data logging component must be able to keep up with the acquisition component by retrieving the data as soon as it is ready.
The circular buffer is a time-honored technique that has proven effective in this type of situation. The idea is simple - the application allocates a buffer of memory that is divided into sections. The producer (data acquisition component) fills a section of the buffer while the consumer (the data logging component) reads the data as soon as possible. A circular buffer allows both consumer and producer to access data in the buffer simultaneously, because at any time they read and write data in different buffer sections.
The technique is called circular buffer because when the producer reaches the end of the buffer, it continues writing at the beginning. It is the responsibility of the consumer to keep up with the producer so that data is never overwritten. A synchronization mechanism (described in the following sections) is needed so that the producer informs the consumer when new data is available.

Figure 4. Circular Buffer Example.
The overall buffer size and the number of sections must be determined from the application requirements. In general, faster data generation rates require larger buffers to avoid overwriting the data before it is consumed. The case where the buffer contains 2 sections is called double buffering. When using double buffers, the synchronization between consumer and producer must be tight, since there is no margin for the consumer to fall behind. The general design trade-off is that the data update rate is the transfer rate divided by the section size. If the buffer sections are small and the data generation rate is high, the consumer must retrieve the data very quickly. On the other hand if the buffer sections are large they can take a long time to fill, which leaves the consumer without new data for extended periods of time.
A good technique for balancing the buffer size and the data generation rate is to segment the buffer in more than 2 sections, as shown in Figure 4. By using this approach you create smaller sections, so the data will reach the consumer more quickly. Since there are more segments, the consumer also has some margin to process the data while keeping up with the producer.
Deterministic Data Streaming Using Distributed Shared Memory
In a single CPU machine, the circular buffer is located in local memory so that both components can access it. Synchronization is generally implemented using interrupts or software-generated events.
In a distributed system, a circular buffer can be implemented using a section of distributed shared memory. Each time the producer writes to the buffer on the local distributed shared memory card, the data is automatically replicated to buffers on all the nodes in the network. Consumers and producers only need to focus on reading from and writing to their local distributed shared memory.
There are two ways in which synchronization can be implemented in distributed systems: polling and interrupts. Polling is easier to implement since the consumer simply monitors a semaphore variable located in distributed shared memory to find out when it can access new data. The drawbacks of this approach are that it is only as fast as the polling speed of the consumer, and it wastes valuable CPU time. Interrupt-based synchronization is somewhat more complicated, but it reduces latency and doesn't consume CPU time.
Circular Buffer Algorithm
The following section describes a circular buffer implementation example with an arbitrary number of sections, one consumer and one producer. This algorithm addresses the basic data transfer scheme, but not the application-specific error handling or the logic for stopping the transfer process when it is completed.
Control Variables
The following control variables are used for communication between the producer and the consumer. They reside in the distributed shared memory, in addition to the data buffer.
The distributed shared memory drivers available in LabVIEW Real-Time specify the memory access in terms of offsets starting at 0x0 (hexadecimal zero). For this example, the variables are allocated in shared memory at arbitrary offsets.
Tip: In LabVIEW, it is a good practice to create strict typedefs to store parameter offset values. Using a Ring control typedef, you can specify each variable name along with its corresponding offset (including non-sequential offsets). When programming the application, use the typedef so that you don't need to remember the offsets. Typedefs also help with maintenance and scalability, since the constant values are stored in a centralized location instead of being hardcoded throughout the application. If you need to change an offset in the future, updating the typedef definition will automatically change it wherever the typedef is used.
In this implementation the buffer size is the section size multiplied by the number of sections.
Note: To ensure that the buffer fits in the available distributed shared memory, the buffer start offset plus the buffer size must be less than the total amount of memory on the distributed memory card.
The value "-1" in the 'writingOnSection' and 'readingOnSection' variables indicates that the producer or the consumer is idle.
Note that with this choice of variables you can configure your buffer with an arbitrary number of sections. This example uses a double buffer. This set of variables also simplifies navigation through the buffer. Instead of tracking individual data value offsets, the producer and consumer can move in multiples of 'sectionSize', starting at 'bufferStart' offset. For example the calculation for each section is given by:
Producer Data Transfer Algorithm
The goal of the following example algorithm is to allow the producer to post data as quickly as possible without affecting its determinism. When a new data section is available, it signals the consumer, indicating what section of the buffer has been updated. The consumer sleeps until an interrupt from the producer arrives.
The basic execution flow of the producer is shown in Figure 5. The numbers in the diagram correspond to the producer detail comments below.

Figure 5. Producer Flow Diagram
Producer Details:
(1) If 'writingOnSection' is ever equal to 'readingOnSection' , it indicates that the consumer did not keep up with the producer. Here are some possible ways the producer might handle this error:
a) Generate an error signal (via an interrupt or polled variable) to inform the consumer that it is not keeping up. Both the consumer and producer could be programmed to stop on this type of event.
b) Generate an error signal to inform the consumer it is not keeping up. The consumer could increase its update rate to catch up with the producer.
c) If the consumer component tolerates data loss (for example a data monitoring GUI), the producer can overwrite the buffer.
d) Perform some other custom action based on application specific requirements.
(2) The most efficient method of transferring a block of data to a distributed shared memory board is using DMA. The control variables in this example are compatible with DMA transfer VIs available for distributed shared memory drivers. In this algorithm, the producer writes one buffer section per DMA write operation.
(3) In this algorithm, the producer tells the consumer which section is ready using the 'sectionReady' variable. Before signaling the consumer, the producer must update this variable. The sectionReady variable is shown here for conceptual purposes. Depending on your selection of distributed shared memory (SCRAMNet or VMIC) this parameter may not be necessary.
(4) SCRAMNet and VMIC provide proprietary methods for generating interrupts. Please refer to the proper documentation to learn more about each vendor's application programming interface (API).
Tip for SCRAMNet users: SCRAMNet hardware allows you to configure specific memory locations to generate an interrupt when data is written to or read from that location. You can configure the last byte of each section to generate an interrupt when new data is written to it. This approach saves execution time because the producer does not need to signal the consumer as a separate operation. When the DMA transfer reaches the last byte, it will automatically generate an interrupt informing the consumer that a new section is ready. The consumer can determine which section to read by identifying which memory location caused the interrupt. While this approach is faster, it may be more difficult for somebody who is unfamiliar with the code to understand. The approach that uses the 'sectionReady' variable requires more API calls but may be easier to maintain and understand.
Tip for VMIC users: VMIC boards implement interrupts as a messaging scheme. The API provides mechanisms for 'sending' interrupts to different nodes in the distributed network. The API allows sending data along with an interrupt. On this implementation, the producer can send the section number as the interrupt data to the consumer. This approach also eliminates the need for the 'sectionReady' variable.
(5) After the producer signals the consumer, it increments the 'writingOnSection' number. If it has reached the end of the buffer, it should reset the section number so that it points to the first section.
Consumer Data Transfer Algorithm
The execution flow of the consumer is shown in Figure 6.

Figure 6. Consumer Flow Diagram
Consumer Details:
(1) After receiving the signal from the producer (via the 'sectionReady' variable or any of the vendor-specific methods explained previously), the consumer must set the 'readingOnSection' variable so that it can reference the correct segment to read. If the consumer is slower than the producer, it may receive multiple interrupts during a single loop through this algorithm. VISA (the underlying implementation of the distributed shared memory drivers for LabVIEW RT) queues the interrupts to ensure they are not lost. As mentioned previously, the current section to read may be overwritten before the consumer reads all of the data if the producer is faster than the consumer. If this is the scenario, you may want to implement some other logic (such as "read latest" instead of "read next") in order to keep ahead.
Note: Be sure to utilize the timeout feature when waiting for an interrupt. Waiting forever may cause your program to hang, since there is no guarantee that the producer will ever generate an interrupt. When the 'wait for interrupt' call wakes up, analyze the cause. If it timed out, the consumer must handle this situation differently (since no data arrived), and go back to sleep until the next interrupt.
(2) As with the producer, the most efficient way to read a block of data from the distributed shared memory is via DMA. After reading a block of data, you can decide to process it in-line or send it to a parallel process for further analysis, logging or display. In general, sending the data to a parallel process provides optimum performance because it does not slow down the consumer loop
(3) Setting 'readingOnSection' to -1 indicates that the consumer is idle. The consumer being faster than producer doesn't affect the producer's performance or determinism. Since the consumer is waiting for an interrupt, it will sleep until new data is available, releasing CPU power for other critical tasks.
Conclusion
Distributing a data acquisition system in multiple computers can maximize the throughput and capabilities of the system. Implementing circular buffers over distributed shared memory provides a convenient and scalable mechanism for data streaming between one or more nodes in the system.
The algorithm presented here serves as a basis for developing a distributed data streaming application. Additional communication parameters and system states will surface in your application. Take some time to identify as many scenarios and requirements of your application as possible before starting your implementation.
For more information on distributed shared memory applications, please refer to the link below.
Reader Comments | Submit a comment »
Legal
This tutorial (this "tutorial") was developed by National Instruments ("NI"). Although technical support of this tutorial may be made available by National Instruments, the content in this tutorial may not be completely tested and verified, and NI does not guarantee its quality in any way or that NI will continue to support this content with each new revision of related products and drivers. THIS TUTORIAL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND AND SUBJECT TO CERTAIN RESTRICTIONS AS MORE SPECIFICALLY SET FORTH IN NI.COM'S TERMS OF USE (http://ni.com/legal/termsofuse/unitedstates/us/).

