Optimization towards IBM POWER

Our current P7 server is P7curielin2, which has 8 cores (POWER7 (architected), 3.1 GHz, revision 2.1, pvr 003f 0201). With 4-way SMT, this gives 8 * 4 = 32 hardware threads. The total memory is 64 GB. The OS is Red Hat Enterprise Linux Server release 6.4 (Santiago). The P755 server has two 32-core processors running at 3.3 GHz. The P7 has a 32 KB L1 data cache and a 32 MB L3 cache.

Our current Intel processors are the Intel(R) Xeon(R) CPU E7-4830 (8-core, Nehalem) running at 2.13 GHz in an IBM Blade Server, and the Xeon(R) E5-2687W (Sandy Bridge) running at 3.78 GHz. There are 4 processors with 32 cores in total. The OS sees 64 logical cores due to Intel Hyper-Threading. The memory is 256 GB and the OS is CentOS release 6.3. The cache size is 8 MB. Each core has 2 hardware threads that share the core's cache.

We illustrate, at a high level, the code segment that performs the computation for Bayesian inference in graphical models. The red text gives the complexity analysis; the parallelism is handled by the System G Middleware programming framework. Further optimizations were applied, and it is worth noting that they apply to both the POWER and Xeon architectures. The result on POWER is shown below: it scales well and achieves performance comparable to Intel Xeon Nehalem/Sandy Bridge.

Tuning on POWER

* Initial sequential code: ~3x single-thread performance gap
* Perform optimizations to address well-known, documented POWER7 weaknesses
  - Compiler flags: -mcpu=power7 -funroll-loops --param max-unroll-times=2 ….
  - Manual algorithmic changes (~10x performance improvement)
    - Improve CPI / IPC:
      - Reduce or avoid the use of integer division and modulo on P7
    - Convert a compute-intensive problem into a memory-bound problem
      - Precompute division/modulo results and perform table look-ups at run time
      - Spend more memory to store results that would otherwise need to be recomputed later
    - Improve branch handling (branch misses, taken branches):
      - Reduce the number of short loops (unrolling)
      - Potentially use SIMD to reduce the number of iterations
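
The table look-up idea from the list above can be sketched in C as follows. This is a minimal illustration, not the code that was actually tuned: the names `divmod_table`, `divmod_table_init`, `checksum_divmod` and `checksum_lookup` are hypothetical, and the sketch assumes a single fixed divisor and index values bounded by a known table size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: precomputed quotient/remainder for one fixed divisor. */
typedef struct {
    uint32_t *quot;  /* quot[i] == i / divisor */
    uint32_t *rem;   /* rem[i]  == i % divisor */
    uint32_t  size;  /* valid indices are 0 .. size-1 */
} divmod_table;

/* Build the table once, outside the hot loop. Returns 0 on success, -1 on
 * allocation failure. */
int divmod_table_init(divmod_table *t, uint32_t size, uint32_t divisor)
{
    assert(size > 0 && divisor > 0);
    t->quot = malloc((size_t)size * sizeof *t->quot);
    t->rem  = malloc((size_t)size * sizeof *t->rem);
    t->size = size;
    if (t->quot == NULL || t->rem == NULL) {
        free(t->quot);
        free(t->rem);
        return -1;
    }
    /* Fill incrementally, so even the setup needs no divide or modulo. */
    uint32_t q = 0, r = 0;
    for (uint32_t i = 0; i < size; ++i) {
        t->quot[i] = q;
        t->rem[i]  = r;
        if (++r == divisor) { r = 0; ++q; }
    }
    return 0;
}

void divmod_table_free(divmod_table *t)
{
    free(t->quot);
    free(t->rem);
    t->quot = t->rem = NULL;
    t->size = 0;
}

/* Hot loop, original form: one divide and one modulo per element. */
uint64_t checksum_divmod(const uint32_t *idx, size_t n, uint32_t divisor)
{
    uint64_t acc = 0;
    for (size_t k = 0; k < n; ++k)
        acc += (uint64_t)(idx[k] / divisor) * 31u + idx[k] % divisor;
    return acc;
}

/* Hot loop, table form: two loads per element replace the divide/modulo. */
uint64_t checksum_lookup(const divmod_table *t, const uint32_t *idx, size_t n)
{
    uint64_t acc = 0;
    for (size_t k = 0; k < n; ++k) {
        assert(idx[k] < t->size);
        acc += (uint64_t)t->quot[idx[k]] * 31u + t->rem[idx[k]];
    }
    return acc;
}
```

Whether the table form is faster depends on the table staying resident in cache; for large index ranges the extra memory traffic can outweigh the saved divide latency, so the trade-off should be measured on the target machine.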

Sample Performance Hotspot on POWER and Solutions

* Integer divide: 80% of execution time is spent in one compute-intensive loop
  - CPI 2.02 (Intel) versus 2.78 (POWER)
  - 51,851,849,281 completion stall cycles out of 93,422,290,992 (about 55.5%) are caused by FXU (fixed-point unit) instructions
* Solutions
  - Change the code to perform integer divisions by alternative means
  - Address it in POWER9
    - Faster integer divide and modulo
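
One possible form of such a code change, sketched in C under stated assumptions, is to replace per-element divide/modulo with an incrementally updated counter. The example assumes the loop visits flat table indices in order and needs the mixed-radix digits of each index (for instance, the state of each variable in a potential table). The function names `decode_divmod` and `odometer_next` are hypothetical and are not taken from the tuned code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference form: decode a flat index into mixed-radix digits.
 * card[v] is the number of states of variable v (least significant first).
 * Costs one divide and one modulo per variable, per index. */
void decode_divmod(uint32_t flat, const uint32_t *card, size_t nvars,
                   uint32_t *digits)
{
    for (size_t v = 0; v < nvars; ++v) {
        digits[v] = flat % card[v];
        flat /= card[v];
    }
}

/* Odometer form: advance digits from the encoding of index i to that of
 * index i + 1 using only increments and compares. After the last index the
 * digits wrap around to all zeros. */
void odometer_next(const uint32_t *card, size_t nvars, uint32_t *digits)
{
    for (size_t v = 0; v < nvars; ++v) {
        if (++digits[v] < card[v])
            return;        /* no carry: done */
        digits[v] = 0;     /* carry into the next variable */
    }
}
```

In a sequential sweep, the digits start at all zeros and `odometer_next` is called once per iteration; most calls touch only the first digit, so the fixed-point divider is not used in the loop at all.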

POWER Competitiveness

* Have well-documented code tuning manuals available for developers
* Compiler research to work around certain limitations (e.g., short loops)
* For critical codes, pair a domain-specific developer with a performance tuning expert