Academic Company Events NI Developer Zone Support Solutions Products & Services Contact NI MyNI

Document Type: Example Program
NI Supported: Yes
Publish Date: Aug 8, 2008


Feedback


Yes No

Related Categories

Related Links - Developer Zone

Related Links - Products and Services

Examples of Multiprocessing and Hyperthreading in LabVIEW

0 ratings | 0.00 out of 5
Print

Overview

**NOTE: The example code in this document was run in LabVIEW 7.1. The results might not reflect the performance of newer versions of LabVIEW.

The primes code example shows how to write a multithreaded program in a text-based programming language by rewriting the primes parallelism example in C++. The C++ code example demonstrates the kind of effort required to write thread-handling code and illustrates the special coding necessary to protect data that threads share.

Primes Parallelism Programming Example in C++

The following sample code was written and tested in Microsoft Visual C++ 6.0. The single-threaded primes parallelism example, following the same algorithm as in the LabVIEW primes parallelism example, would look something like the following:

// Single-threaded version.

void __cdecl CalculatePrimes(int numTerms) {

       bool *resultArray = new bool[numTerms/2];

       for (int i=0; i<numTerms/2; ++i) {

              int primeToTest = i*2+3;// Start with 3, then add 2 each iteration.

              bool isPrime = true;

              for (int j=2; j<=sqrt(primeToTest); ++j) {

                     if (primeToTest % j == 0) {

                          isPrime = false;

                          break;

                          }

                     }

              resultArray[i] = isPrime;

              }

       ReportResultsCallback(numTerms, resultArray);

       delete [] resultArray;

       }

No parallelism exists in this single-threaded version of the code, which would consume 100 percent of one virtual processor without using the other processor. To use any of the bandwidth on the other processor, the application must initiate additional threads and distribute the work.

The following code is an example of a preliminary multithreaded version of the primes parallelism example:

struct ThreadInfo {

       int numTerms;

       bool done;

       bool *resultArray;

       };

static void __cdecl CalculatePrimesThread1(void*);

static void __cdecl CalculatePrimesThread2(void*);

void __cdecl CalculatePrimesMultiThreaded(int numTerms) {

       // Initialize the information to pass to the threads.

       ThreadInfo threadInfo1, threadInfo2;

       threadInfo1.done = threadInfo2.done = false;

       threadInfo1.numTerms = threadInfo2.numTerms = numTerms;

       threadInfo1.resultArray = threadInfo2.resultArray = NULL;

       // Start two threads

       _beginthread(CalculatePrimesThread1, NULL, &threadInfo1);

       _beginthread(CalculatePrimesThread2, NULL, &threadInfo2);

       // Wait for the threads to finish executing.

       while (!threadInfo1.done || !threadInfo2.done)

              Sleep(5);

       // Collate the results.

       bool *resultArray = new bool[numTerms/2];

       for (int i=0; i<numTerms/4; ++i) {

              resultArray[2*i] = threadInfo1.resultArray[i];

              resultArray[2*i+1] = threadInfo2.resultArray[i];

              }

       ReportResultsCallback(numTerms, resultArray);

       delete [] resultArray;

       }

static void __cdecl CalculatePrimesThread1(void *ptr) {

       ThreadInfo* tiPtr = (ThreadInfo*)ptr;

       tiPtr->resultArray = new bool[tiPtr->numTerms/4];

       for (int i=0; i<tiPtr->numTerms/4; ++i) {

              int primeToTest = (i+1)*4+1;

              bool isPrime = true;

              for (int j=2; j<=sqrt(primeToTest); ++j) {

                     if (primeToTest % j == 0) {

                          isPrime = false;

                          break;

                          }

                     }

              tiPtr->resultArray[i] = isPrime;

              }

       tiPtr->done=true;

       }

static void __cdecl CalculatePrimesThread2(void *ptr) {

       ThreadInfo* tiPtr = (ThreadInfo*)ptr;

       tiPtr->resultArray = new bool[tiPtr->numTerms/4];

       for (int i=0; i<tiPtr->numTerms/4; ++i) {

              int primeToTest = (i+1)*4+3;

              bool isPrime = true;

              for (int j=2; j<=sqrt(primeToTest); ++j) {

                     if (primeToTest % j == 0) {

                          isPrime = false;

                          break;

                          }

                     }

              tiPtr->resultArray[i] = isPrime;

              }

       tiPtr->done=true;

       }

In this example, the CalculatePrimesMultiThreaded() function creates two threads using the _beginthread() function. The first thread calls the CalculatePrimesThread1() function, which tests half of the odd numbers. The second thread calls the CalculatePrimesThread2 function and tests the other half of the odd numbers.

The original thread, which is still running the CalculatePrimesMultiThreaded() function, has to wait for the two worker threads to finish by creating the data structure ThreadInfo for each thread and passing it into the _beginthread() function. When a thread finishes executing, it writes true into ThreadInfo::done. The original thread must continually poll ThreadInfo::done until the original thread reads a true value for each computation thread, at which time it is safe for the original thread to access the results of the calculation. The program then collates the values into a single array, so they are identical with the output of the single-threaded version of this example.

**NOTE: The previous step was not shown in the LabVIEW example, but is a trivial task to accomplish.

When you write multithreading code in a sequential programming language like C++, you must protect any data locations multiple threads can access. If you do not protect the data locations, the consequences are extremely unpredictable and often difficult to reproduce and locate. In any circumstance where the assignment of tiPtr->done to true is not atomic, you must use a mutex to protect both the assignment and the access.

You can improve the previous example, however. In particular, there is no reason to initiate two additional threads and leave one idle. Because this example contains code for a computer with two virtual processors, you can initiate one additional thread instead and use the original to perform half the computation. Also, you could pass both threads through the same function instead of writing what is basically the same function twice. You need to pass in an additional parameter to the function to indicate in which thread it can run.

**NOTE: You can make the same optimization to the LabVIEW example by creating a reentrant subVI that contains the For Loop. The calling VI would have two instances of the subVI instead of two For Loops.

The following code is a more efficient multithreaded application:

struct ThreadInfo2 {

       int threadNum;

       int numTerms;

       bool done;

       bool *resultArray;

       };

static void __cdecl CalculatePrimesThread(void*);

void __cdecl CalculatePrimesMultiThreadedSingleFunc(int numTerms) {

       // Initialize the information to pass to the threads.

       ThreadInfo2 threadInfo1, threadInfo2;

       threadInfo1.done = threadInfo2.done = false;

       threadInfo1.numTerms = threadInfo2.numTerms = numTerms;

       threadInfo1.resultArray = threadInfo2.resultArray = NULL;

       threadInfo1.threadNum = 1;

       threadInfo2.threadNum = 2;

       // Start a thread

       _beginthread(CalculatePrimesThread, NULL, &threadInfo1);

       // Use this thread for the other branch instead of spawning another thread.

       CalculatePrimesThread(&threadInfo2);

       // Maybe this thread finished first. If so, wait for the other.

       while (!threadInfo1.done)

              Sleep(5);

       // Collate the results.

       bool *resultArray = new bool[numTerms/2];

       for (int i=0; i<numTerms/4; ++i) {

              resultArray[2*i] = threadInfo1.resultArray[i];

              resultArray[2*i+1] = threadInfo2.resultArray[i];

              }

       ReportResultsCallback(numTerms, resultArray);

       delete [] resultArray;

       }

static void __cdecl CalculatePrimesThread(void *ptr) {

       ThreadInfo2* tiPtr = (ThreadInfo2*)ptr;

       int offset = (tiPtr->threadNum==1) ? 1 : 3;

       tiPtr->resultArray = new bool[tiPtr->numTerms/4];

       for (int i=0; i<tiPtr->numTerms/4; ++i) {

              int primeToTest = (i+1)*4+offset;

              bool isPrime = true;

              for (int j=2; j<=sqrt(primeToTest); ++j) {

                     if (primeToTest % j == 0) {

                          isPrime = false;

                          break;

                          }

                     }

              tiPtr->resultArray[i] = isPrime;

              }

       tiPtr->done=true;

       }

This version of the example code is more efficient because you initiate fewer threads and spend less time waiting for the worker threads to complete. Additionally, because the code performs the computation inside a single function, there is less duplicated code, which makes the application easier to maintain. However, a large portion of code handles thread management, which is inherently unsafe and difficult to test and debug.

 

Multiprocessing Programming Example in LabVIEW

The primes parallelism example is a simple example that demonstrates many of the concepts involved in multiprocessing. Most real-world applications are not so simple. Even if you needed to write a prime number generator, the algorithm the examples in this document use is not an efficient method.

**NOTE: Refer to Multiprocessing and Hyperthreading in LabVIEW in the LabVIEW Help to view the primes parallelism example in LabVIEW.

The following section describes another computationally intensive algorithm that can benefit from the LabVIEW multithreaded execution system.

The block diagram in the following illustration calculates pi to any number of digits you specify.


[+] Enlarge Image

The pink wires are clusters that serve as arbitrary-precision numeric values. If you want to compute pi to 1,000 digits, you need more than an extended-precision, floating-point value. The operations that look like LabVIEW functions but operate on the pink clusters are VIs that perform computations on the arbitrary-precision numbers.

This VI also is computationally intensive. It computes pi based on the following formula:

Given the numerical complexity of this equation, it would be difficult in most text-based programming languages to write a program that uses both logical processors on a hyperthreaded computer. LabVIEW, however, can analyze the previous equation and recognize it as an opportunity for multiprocessing.

When LabVIEW analyzes the following block diagram, it identifies some inherent parallelism. Notice that no dataflow dependency exists between the sections highlighted in red and those in green.


[+] Enlarge Image

At first, it appears that little parallelism exists on this block diagram. It seems that only two operations can execute in parallel with the rest of the VI, and those two operations run only once. Actually, the square root is time-consuming, and this VI spends about half the total execution time running the square root operation in parallel with the For Loop. If you split the VI into two unrelated loops, LabVIEW can recognize the parallelism and handle the threads in parallel using both logical processors, which results in a large performance gain. This solution works on a hyperthreaded computer similarly to the way it works on a multiprocessor computer.

It can be difficult to analyze an application to identify parallelism. If you write an application in a text-based programming language such as C++, you must identify opportunities for parallelism and explicitly create and run threads to take advantage of hyperthreaded computers. The advantage to using LabVIEW is that in many instances, like this one, LabVIEW automatically identifies parallelism so the execution system can use both logical processors. In certain applications, it is beneficial to use an algorithm that creates opportunities for parallelism so you can take advantage of the multithreading capabilities of LabVIEW.

In addition to the operations highlighted in red and green on this block diagram, more parallelism exists in the For Loop. LabVIEW executes those operations in parallel, which saves time on a hyperthreaded computer. However, because there are dataflow dependencies later in the data stream, you gain more benefit when LabVIEW separates the highlighted sections.

The following table lists the execution time the VI needs to calculate pi to 1,500 digits without any performance optimization:

Hyperthreading

27.9 s

No Hyperthreading

25.3 s

Notice that the hyperthreaded computer executes slower. Because both logical processors on a hyperthreaded computer share a cache, it is easier to overwhelm cache lines and cause thrashing. LabVIEW stores information necessary for debugging in an area in memory that all execution threads need to access, both to write to and to read from. In some situations, two threads execute simultaneously on different logical processors, and one thread needs to read from the debug information at the same time that another thread is writing to the debug information. If two threads perform operations to bytes in the same cache line on a hyperthreaded computer, the loss in performance can be substantial.

You can improve the performance of the VI by disabling debugging. You can disable debugging for a performance-critical section of the VI and then enable it again if you need to debug that section of the VI.

The following table lists the execution time the VI needs to calculate pi to 1,500 digits after disabling debugging for the entire VI:

Hyperthreading

21.6 s

No Hyperthreading

21.5 s

Notice that the hyperthreaded computer has almost the same performance as the computer that is not hyperthreaded. The overhead of splitting the work does not seem to improve performance even though LabVIEW executes in parallel.

The following table displays the results of running the primes parallelism example in LabVIEW 7.1. The VI found all the prime numbers in the first 400,000 natural numbers on a 3.06 GHz Pentium 4 Windows XP computer.

Computer Configuration Time in Seconds

Hyperthreading without Debugging

1.97 s

Hyperthreading with Debugging

10.47 s

The difference with debugging enabled is significant, but this difference occurs only when two parallel sections execute in the same VI. If you wrote the application with more than just one VI or if the VI were complex enough to handle most of the debugging information for separate pieces in different cache lines, the difference with debugging enabled would not be as significant.

Another optimization that can improve performance on a hyperthreaded computer is to make some subVIs reentrant. Multiple threads can make simultaneous calls to the same subVI if the subVI is reentrant. It is important to understand reentrant execution in the LabVIEW execution system when you optimize a VI for a multiprocessor or hyperthreaded computer. Incorrectly marking a VI as reentrant could cause unnecessary delays, especially in a multiprocessor environment.

To determine which subVIs can be reentrant, find VIs that both branches of parallel execution call. Select View»VI Hierarchy to display the VI Hierarchy window for the main VI to find the VIs and functions both execution branches use.

In the previous example, the Square Root function and the For Loop operate in parallel.


[+] Enlarge Image

The top-level VI calls the Square Root, Add, and Multiply VIs and the Multiply Scalar and Divide Scalar VIs in a For Loop. These are two ideal sections to make reentrant because LabVIEW calls these sections from different threads. The following table lists the execution time the VI needs to calculate pi to 1,500 digits after marking the Add, Multiply, Multiply Scalar and Divide Scalar VIs as reentrant:

Hyperthreading 20.5 s
No Hyperthreading 21.9 s

Marking the VIs reentrant made the VI execute in a slightly faster time on the hyperthreading computer. The execution time for the computer without hyperthreading was slightly slower after making the four VIs reentrant because LabVIEW creates additional dataspace when it calls reentrant VIs.

You can continue to optimize an application by selecting Tools»Profile»Performance and Memory and then specifying some VIs to execute at subroutine priority. You also can mark more VIs as reentrant. The following table lists the execution time the VI needs to calculate pi to 1,500 digits after several rounds of optimization:

Hyperthreading 18.0 s
No Hyperthreading 20.4 s

These results display a 10 percent performance improvement on the hyperthreaded computer. Notice that you did not have to change any of the code of the VI to take advantage of the improvement. You only had to disable debugging, make some VIs reentrant, and execute several VIs at subroutine priority. In total, these changes make about a 35 percent performance improvement on the hyperthreaded computer.

0 ratings | 0.00 out of 5
Print

Reader Comments | Submit a comment »

 

Legal
This example program (this "program") was developed by a National Instruments ("NI") Applications Engineer. Although technical support of this program may be made available by National Instruments, this program may not be completely tested and verified, and NI does not guarantee its quality in any way or that NI will continue to support this program with each new revision of related products and drivers. THIS EXAMPLE PROGRAM IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND AND SUBJECT TO CERTAIN RESTRICTIONS AS MORE SPECIFICALLY SET FORTH IN NI.COM'S TERMS OF USE (http://ni.com/legal/termsofuse/unitedstates/us/).