文章基本信息

标题：Optimization techniques for data sorting algorithms.
作者：Pirjan, Alexandru
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2011
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：Key words: parallelism, radix sort, CUDA, algorithms, shared memory
关键词：Algorithms;Data processing;Electronic data processing;Mathematical optimization;Optimization theory

Optimization techniques for data sorting algorithms.

Pirjan, Alexandru

Abstract: This paper describes the designing of high performance parallel radix sort routines for many cores graphics processing units, benefitting from the high computational power offered by the Compute Unified Device Architecture (CUDA). I address issues of great importance when optimizing a classical data-sorting algorithm regarding fine-grain parallelism and memory management techniques. One particular interest was to research how well the optimization techniques, applied to data sorting algorithms written in CUDA, scale to the latest generation of general-purpose graphic processors units (GPGPU), like the Fermi architecture implemented in the GTX580 and the previous architecture implemented in GTX285.

Key words: parallelism, radix sort, CUDA, algorithms, shared memory

1. INTRODUCTION

Sorting is one of the most studied algorithmic problems. Its importance has led to the designing of efficient sorting algorithms for a variety of parallel architectures (Cormen et al., 2001). A wide class of efficient sorting algorithms is based on optimized computing routines. Sorting operations are extensively used in database systems and are essential in computer graphics and geographic information systems, representing the basis for building spatial data structures. Efficient sorting routines are of paramount importance in implementing algorithms such as parallel programming models based on MapReduce (Dean & Ghemawat, 2008). Therefore, it is very important to implement efficient sorting routines on any programming platform taking into account the continuous evolution of computer hardware architectures that provide an increased computational processing power.

The continuous increase in the level of implemented parallelism has been one of the dominant trends in microprocessor architectures in recent years. Many-cores central processing units are becoming more and more widespread and thus there is an evident tendency of increasing the level of parallelism by using multiple processing units. Based on multiple processing cores, the graphics processing units represent viable solutions for increasing the level of parallelism. For example, the current NVIDIA Fermi graphics processing units contain up to 512 processing cores per chip and in contrast to previous generations of GPUs, they can be programmed directly in C using CUDA (Dean & Ghemawat, 2008). In this paper, I describe the design of efficient sorting algorithms for GPUs that implement the Compute Unified Device Architecture.

CUDA offers a flexible programming model, thus the current generation of GPUs allows us to take into account a wider range of algorithms than it was previously possible on GPU architectures. In particular, I focus on optimizing a radix sort algorithm (with binary represented keys) for the GTX580 architecture. The graphic processing unit GPU is a multithread processor and therefore, in order to design efficient algorithms on such architectures one has to involve large amounts of fine-grained parallelism.

The described radix sort algorithm implements multiple parallel scan operations through efficient fine grain parallelism (Nickolls et al., 2008); (Dusseau et al., 1996).

2. THE CUDA PARALLEL PROGRAMMING MODEL

In the following, I briefly present the most important aspects regarding the architecture of current GPUs and NVIDIA's CUDA parallel programming model.

For a long time, graphics processing units (GPUs) have been used solely for graphics rendering purposes on computers. In recent years, the GPU has evolved from an initial specialized architecture to a complex one, able to solve a wider range of complex problems and not only the video rendering. The introduction of the NVIDIA Compute Unified Device Architecture made possible to accelerate and improve a large suite of applications. A briefly description of this architecture and its main features are depicted in Figure 1.

[FIGURE 1 OMITTED]

NVIDIA graphics processor is enabled by the CUDA hardware and software architecture to execute programs written in different languages including C, C++ or FORTRAN. A CUDA program is designed to invoke a set of parallel program kernels, that are processed in parallel by multiple threads, grouped into thread blocks and grids of thread blocks.

3. THE RADIX SORT ALGORITHM

In the category of sorting algorithms, the radix sort is among the oldest, probably the best known and the most efficient in sorting small keys on sequential machines. One of the most important advantages of the radix sort algorithm is the fact that it is easy to parallelize, based on the reduction of the counting sort used at each pass to a parallel prefix or sum operation (Satish et al., 2009). Another major advantage is that scans can be efficiently implemented on a GPU or on other many-cores processors, being a fundamental operation in managing parallel data (Nickolls et al., 2008). Therefore, radix sort may be considered one of the easiest parallel sort algorithms from the implementation point of view and one of the most efficient ones, comparable with some more sophisticated algorithms (Satish et al., 2009). In order to optimize the radix sort, I have divided the sequence into tiles that have been assigned to a number of t thread blocks. I have used memory optimization techniques and global synchronization between each separate parallel kernel invocation.

For every block, its tile is loaded and sorted in the on-chip memory, and then each block writes its histogram and the stored data tile to global memory. Then, a prefix sum is performed over the histogram table; it is stored in column-major order, for computing global digit offsets. The prefix sum results are used by each block for copying the elements to their correct output position (Satish et al., 2009); (Dusseau et al., 1996).

4. EXPERIMENTAL RESULTS

In the following, I describe several experimental results regarding the implemented sorting algorithm. The performance tests sorted sequences of 32-bit words key-value pairs, this length being suitable for building irregular data structures and sorting points. The keys were produced using a uniform random number generator.

I recorded the execution time that also contains the necessary amount of time for transferring input data from the host memory to the GPU's on-board memory across the peripheral component interconnect express bus. The algorithm described in this article is designed so that it scales across a significant range of parallel hardware architectures that implement the Compute Unified Device Architecture. This scaling is illustrated through the performance measured on the GeForce GTX285 and the GeForee GTX580 NVIDIA GPUs, whose technical specifications are described in Table 1.

One important aspect is to choose the right data partitioning in order to ensure enough workload for the complete utilization of the available hardware parallel resources. I have computed the number of key-values pairs sorted per second by dividing the input size by total running time. I have also adopted the technique developed in (Satish et al., 2009) and evaluated how the sorting performance was influenced by using different key distributions. The chosen sizes for the number of random bits generated for obtaining the keys were 8, 16, 24 or 32 bits.

In Figure 2 is represented the radix sorting time obtained for different sequence sizes on GTX285 and GTX580.

[FIGURE 2 OMITTED]

5. CONCLUSIONS

One particular interest in this article was to research how well the optimization techniques available in the literature for data sorting algorithms written in CUDA, scale to the latest generation of general-purpose graphics processors units (GPGPU), like the Fermi architecture implemented in the GTX580 and the previous architecture implemented in GTX285. Lately, there has been a lot of interest in the literature for optimizing data sorting algorithms using CUDA. None of those works so far (to my best knowledge) tried to validate if the optimization techniques can apply to the GTX580 (the high-end GPU of the Fermi architecture) and how well does the Fermi architecture scale to these data sorting algorithms performance improving techniques.

The obtained numerical results have confirmed and revealed some important aspects regarding the studied problem. The best way to obtain algorithmic efficiency on the Fermi architecture is to use the fast on-chip memory of the GPU and to employ fine-grained parallelism in order to benefit from the computing power of thousands of parallel threads offered by this architecture. Fast memory spaces have a significant influence on the overall performance and must be properly managed. These techniques, along with the increased level of parallelism provided by the 512 CUDA processing cores of the GTX580, have determined significant improved execution times. Future work involves a more thorough optimization using a larger selection of CUDA applications and an exhaustive benchmarking process. I intend to analyze how the new CUDA architecture can optimize and parallelize other types of sorting algorithms.

6. REFERENCES

Cormen T. H.; Leiserson C. E.; Rivest R. L. & Stein C. (2001). Introduction to Algorithms, MIT Press, ISBN-10: 0-262-03293-7, New York

Dean J.; Ghemawat S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, Vol. 51, No. 1, (ianuary 2008) pp. 107-113, ISSN 00010782

Dusseau A. C.; Culler D. E.; Schauser K. E.; Martin R. P. (1996). Fast parallel sorting under LogP: Experience with the CM-5. IEEE Trans. Parallel Distrib. Syst., Vol. 7, No. 8, (august 1996)pp. 791-805, ISSN 1045-9219

Nickolls J.; Buck I.; Garland M.; Skadron K. (2008). Scalable parallel programming with CUDA, ACM Queue, Vol. 6, No. 2, (march/april 2008) pp. 40-53, ISSN 1542-7730

Satish N.; Harris M.; Garland M. (2009). Designing efficient sorting algorithms for manycore GPUs, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, 23-29 may 2009, Rome, ISSN: 1530-2075, ISBN:978-1-4244-3751-1, Satish, N. (Ed.), pp. 1-10, IEEE Publisher, Washington

Tab. 1. Technical specifications of GTX285 and GTX580

NVIDIA GPU GTX285 GTX580
Release Year 2009 2010
CUDA Cores 240 512
Graphics Clock (MHz) 648 MHz 772 MHz
Processor Clock MHz 1476 MHz 1544 MHz
Texture Fill Rate
 billion/sec 51.8 49.4
Memo Clock (MHz) 1242 2004
Standard Memory Config 1 GB 1536 MB
 GDDR3 GDDR5
Memory Interface Width 512-bit 384-bit
Memory Bandwidth
 GB/sec 159.0 192.4
Fabrication Process 55 nm 40 nm
Number of Transistors 1.4 billion 3.0 billion
Streaming Multiprocessors
 (SM) 30 16
Streaming Processors
 (ver SM) 8 32