Publisher: SISSA, Scuola Internazionale Superiore di Studi Avanzati
Abstract: Optimized code has to be tuned to the CPU architecture. A current trend in modern CPUs is the increasing number of cores per socket, with several levels of cache. It is therefore natural to use different parallelization “granularities” (multithreading and multiprocessing), which are characterized by very different bandwidths and latencies. We present several strategies for implementing the Wilson Dirac operator that aim to maximize performance on the Aurora architecture.