Academic Company Events NI Developer Zone Support Solutions Products & Services Contact NI MyNI
LabVIEW Zone LabVIEW Tools Network Resources LabVIEW Champions Student Corner User Groups Discussion Forum Code Sharing Learning Center

Coding Challenge - Dictionary Challenge - Results

 Challenge Results:

 Dictionary Challenge

 Here are the results of the Dictionary Coding Challenge:

(Score is indicated in letters per millisecond.)
Konstantin Shifershteyn 93.771
Bruce Ammons 57.024
Bruce Ammons 52.478
Ben Rayner 51.120
Ben Rayner 49.930
Daniel Harding* 47.601
Skip Collins 39.794
Greg Falcon* 28.813
Adam Russell 27.250
Adam Russell 27.064

xxx 1.203

A * next to the name means the submission is from NI and is there just for the competition.

The name of the challenge was the Word Finder, and like the previous challenge (BitTwiddler) the goal was to see different approaches in LabVIEW for searching and pattern matching. Entries were scored based on the number of unique words returned during a fixed period of time. One point was awarded for each letter in each word. Each entry was run with the same fixed set of dictionaries and random word subsets. Each dictionary had a unique letter frequency. Scores ranged from over 93 letters per millisecond to just over 1 letter per millisecond, with the majority of entries returning between 5 and 30 letters per millisecond.

With the Bit Twiddler challenge, a priority was placed on fine tuning bit operations. Often times the problem at hand is not so easy to optimize. In this challenge, the organization of a processed dictionary proved to be the most important factor in overall speed. Let's look at a few approaches and compare them.

 
enlarge
Virtually every solution used a structure called a LabVIEW 2 style global. This is a while loop with the continuation terminal set in such a way that the loop only iterates once. A shift register is used to store data from one execution to the next. This allows the VI to save information about the dictionary during the first iteration, and then reuse that data from the shift register when the dictionary is no longer passed in.

This solution executed with a score of 1.203. During initialization, the average length of words in the dictionary was calculated. Words over the average length were removed from the dictionary. This allowed for faster searches due to a smaller search set, but also reduced the score of each word returned. During the search phase, the random letters were checked against each word in the dictionary using the Regular Expression and Search And Replace primitives. This would remove one letter out of the random set for each matching letter in the dictionary word. If all the letters matched up, the word would be returned.

The use of regular expressions and search/replace functionality was used in almost half of the submissions. The remaining entries used array based histograms, such as in the next solution:

 
enlarge
This entry executed with a score of 5.875. This entry used histograms of words for comparison, with each element in the array representing the number of times a particular character occurred in a word. If one array histogram is less than or equal to another histogram across all elements, it is a subset. Not only did this VI use histograms, it also trimmed the dictionary by removing any words that had any letters that occurred more than once.

Several entries used histograms, and then would do a linear search through the histograms looking for a match. The performance of these entries varied depending on how the histograms were ordered. The top entries took unique approaches to organizing the dictionary histograms.

 
enlarge
This entry by Bruce Ammons executed with a score of 52.478. During initialization, the histograms of each word in the dictionary were generated. A fail table was also generated. The unique aspect of this design was that in the case of a random set's histogram not matching a selected dictionary word's histogram, the fail table would give a hint to where to look next. This increased performance quite a bit.

Another interesting design aspect was the use of a histogram that varied with letter distribution. The contest entries were scored with a variety of letter distributions, so entries that were hard-coded for English letter distribution fared poorly on other languages. A few entries would analyze the dictionary during the initialization phase to determine the letter distribution. That distribution was then later used to generate histograms with the most commonly occurring letter representing the first element in the histogram array.

 
enlarge
This is the diagram of the winning entry by Konstantin Shifershteyn. It won the contest with a score of 93.771 letters per millisecond. This VI used a two-tiered comparison approach. During initialization, words were categorized by how many unique letters they contained. A U32 variable was assigned with 26 bits representing the presence of certain letters in the word. Comparisons of a single U32 can be performed very quickly - so if the comparison showed that the word might match then it would be worth doing the exhaustive 26-element comparison.

This entry kept several bins, and placed the dictionary words into various bins based on a portion of their histogram. When a random letter set was to be checked, the "Word Stats" VI would generate a histogram, the index of which bin to look in, and how many unique characters the set contained. If the random set only contained four unique letters, then it was immediately known that dictionary words with five or more unique letters would never match. Within each bin, dictionary words were sorted by largest number of unique characters. Then a hint array contained information for each bin to tell where in the bin to begin looking.

Each entry pointed to the next entry by providing the index of the next histogram to check. This is very similar to Bruce Ammons fail table approach. When a dictionary word was successfully returned the pointer to that entry was increased. This allowed the VI to completely skip dictionary words that had already been returned.

Congratulations to the winners of this contest. Each entry had a very unique approach to the problem. Some entries were easier to decipher than others. The key to this challenge was taking advantage of characteristics of the data. When processing needs to be done on data, often times unique characteristics of the data can be used to dramatically speed up algorithms. Hopefully this challenge showed everyone that even seemingly obscure data sets can be viewed in such a manner so that brute force approaches are not always necessary.

Go here for further discussion.