Article information

Title: Dynamic Parallelization and Vectorization of Binary Executables on Hierarchical Platforms
Authors: Efe Yardimci; Michael Franz
Journal: The Journal of Instruction-Level Parallelism
Electronic ISSN: 1942-9525
Publication year: 2008
Volume: 10
Pages: 1-24
Publisher: International Symposium on Microarchitecture
Abstract: As performance improvements are being increasingly sought via coarse-grained parallelism, established expectations of continued sequential performance increases are not being met. Current trends in computing point toward platforms seeking performance improvements through various degrees of parallelism, with coarse-grained parallelism features becoming commonplace in even entry-level systems. Yet the broad variety of multiprocessor configurations that will be available that differ in the number of processing elements will make it difficult to statically create a single parallel version of a program that performs well on the whole range of such hardware. As a result, there will soon be a vast number of multiprocessor systems that are significantly under-utilized for lack of software that harnesses their power effectively. This problem is exacerbated by the growing inventory of legacy programs in binary executable form with possibly unreachable source code. We present a system that improves the performance of optimized sequential binaries through dynamic recompilation. Leveraging observations made at runtime, a thin software layer recompiles executing code compiled for a uniprocessor and generates parallelized and/or vectorized code segments that exploit available parallel resources. Among the techniques employed are control speculation, loop distribution across several threads, and automatic parallelization of recursive routines. Our solution is entirely software-based and can be ported to existing hardware platforms that have parallel processing capabilities. Our performance results are obtained on real hardware without using simulation. In preliminary benchmarks on only modestly parallel (2-way) hardware, our system already provides speedups of up to 40% on SpecCPU benchmarks, and near-optimal speedups on more obviously parallelizable benchmarks.