The last project of the Measurement Tools and Techniques subject dealt with simulation. Two tools were used for this purpose: Dimemas and a Pintool.
Analysis of high-performance parallel applications is very important for designing and tuning both applications and hardware. Simulation tools help to analyze how an application behaves in a different environment or how hardware would perform with different configurations. In this project two simulation tools are described and used for performance analysis. The first simulates different environments for an application from the NAS benchmarks in order to find the optimal environment configuration on which the application would still run comparably to an environment similar to the MareNostrum supercomputer. The second simulates three levels of cache in order to find the optimal cache configuration for a matrix multiplication application.
The Dimemas tool was used for performance analysis of message-passing programs. Its main purpose is to allow parallel applications to be developed and tuned on a workstation while providing an accurate prediction of their performance on the parallel target machine. Dimemas generates trace files that are suitable for further analysis and visualization with the Paraver tool, so an even more detailed examination of the application can be done. Analyses of Paraver traces are presented in this work as the result of the Dimemas simulations.
The analysis was performed with Dimemas for different numbers of processors (2, 4, 8, 16, …) and with various values of latency, network bandwidth, number of buses and relative CPU performance. The main goal was to find:
- the maximum acceptable latency at which system performance does not drop significantly in comparison to the "ideal" run
- the minimum acceptable bandwidth at which system performance does not drop significantly
- the minimum acceptable number of buses (connectivity) at which the application still performs comparably to the "ideal" run
- the minimum relative CPU performance at which the application is still comparable to the "ideal" version of the application
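One way to make "not significantly" concrete is to fix a tolerance on the slowdown relative to the ideal run and take the least favourable parameter value that still stays within it. The sketch below is only an illustration of that idea: the function name `worst_acceptable`, the 10% tolerance and the timings are assumptions, not values from this project.

```python
def worst_acceptable(results, ideal_time, tolerance=0.10):
    """Return the last parameter value, in sweep order, whose simulated time
    stays within `tolerance` of the ideal time.

    `results` is a list of (parameter_value, simulated_time) pairs ordered
    from the most favourable setting to the least favourable one.
    Returns None if even the first setting is too slow.
    """
    limit = ideal_time * (1.0 + tolerance)
    accepted = None
    for value, time in results:
        if time > limit:
            break  # degraded beyond the tolerance, stop scanning
        accepted = value
    return accepted


# Hypothetical latency sweep: (latency in s, simulated run time in s).
latency_sweep = [(8e-6, 1.00), (1.6e-5, 1.01), (3.2e-5, 1.04), (6.4e-5, 1.12)]
print(worst_acceptable(latency_sweep, ideal_time=1.00))
```

The same function applies to the bandwidth, bus and CPU sweeps as long as each list is ordered from the best setting to the worst.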
Performing the analysis and the simulations requires choosing the right tool. For the simulation, the initial configuration of the system was taken from the MareNostrum supercomputer, with the following parameters:
- Number of buses: 0 (that is, infinite)
- Network bandwidth: 250 Mbyte/sec
- Startup on remote communication (latency): 0.000008 s
- Relative CPU speed: 1.0
- 128 cores, 1 core per processor
To find the best tuning of the system, the environment parameters were varied. The ranges of the parameters were the following:
- Number of buses: (0 .. 128], exponential step size 2
- Network Bandwidth: (250 .. 100] MBytes/sec, step size 10
- Startup on remote communication (latency): (0.000008..0.524288] s, exponential step size 2
- Relative CPU performance: [3.0 .. 0.2], step size 0.2
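The ranges above can be enumerated explicitly, as in the following sketch. It assumes that an open bound means the baseline value itself is excluded and that the bus sweep starts at 1 (since 0 stands for "infinite" and cannot be doubled); both are interpretations of the interval notation, and the variable names are illustrative.

```python
# Number of buses: 1, 2, 4, ..., 128 (doubling each step).
buses = [2 ** k for k in range(0, 8)]

# Network bandwidth in MB/s: 240, 230, ..., 100 (baseline 250 excluded).
bandwidth = list(range(240, 90, -10))

# Latency in seconds: doubles from the 0.000008 s baseline up to 0.524288 s
# (baseline excluded, so the first value is 0.000016 s).
latency = [8e-6 * 2 ** k for k in range(1, 17)]

# Relative CPU performance: 3.0, 2.8, ..., 0.2; rounded to avoid float drift.
cpu = [round(3.0 - 0.2 * i, 1) for i in range(15)]

for name, values in [("buses", buses), ("bandwidth", bandwidth),
                     ("latency", latency), ("cpu", cpu)]:
    print(f"{name}: {len(values)} values, {values[0]} .. {values[-1]}")
```

Each list gives the settings for one simulation series per processor count.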
Pin is a dynamic binary instrumentation framework that enables the creation of dynamic program analysis tools. Tools built with Pin are called Pintools and are used to perform program analysis on user-space applications.
This part of the project contains the results of a multilevel cache simulation with varying per-processor L1 data caches, cluster-shared L2 data caches and a globally shared L3 data cache. To analyze the application, the following parameters were varied:
- number of application threads (or CPUs)
- sizes of the L1, L2 and L3 caches
- number of processors per cluster that share the same cluster-shared cache (L2)
For the simulation, a Pintool was created that simulates a multilevel cache hierarchy with a per-processor L1 data cache, a cluster-shared L2 data cache and a globally shared L3 data cache.
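The structure of such a hierarchy can be illustrated with a small model. The sketch below is not the Pintool used in this project (Pintools are written in C++ against the Pin API); it is a simplified Python model that assumes fully associative LRU caches, a 64-byte line and no coherence or write handling. The class names `LRUCache` and `Hierarchy` are hypothetical.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed)


class LRUCache:
    """Fully associative LRU cache that only tracks which lines are present."""

    def __init__(self, size_bytes):
        self.capacity = size_bytes // LINE_SIZE
        self.lines = OrderedDict()
        self.accesses = 0
        self.misses = 0

    def access(self, addr):
        """Touch the line holding `addr`; return True on a hit."""
        line = addr // LINE_SIZE
        self.accesses += 1
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return False

    def miss_rate(self):
        return self.misses / self.accesses if self.accesses else 0.0


class Hierarchy:
    """Per-CPU L1, one L2 per cluster of CPUs, one global L3."""

    def __init__(self, n_cpus, cpus_per_cluster, l1_size, l2_size, l3_size):
        self.cpus_per_cluster = cpus_per_cluster
        self.l1 = [LRUCache(l1_size) for _ in range(n_cpus)]
        n_clusters = -(-n_cpus // cpus_per_cluster)  # ceiling division
        self.l2 = [LRUCache(l2_size) for _ in range(n_clusters)]
        self.l3 = LRUCache(l3_size)

    def access(self, cpu, addr):
        """Return the level that served the access; lower levels are only
        consulted when the level above misses."""
        if self.l1[cpu].access(addr):
            return "L1"
        if self.l2[cpu // self.cpus_per_cluster].access(addr):
            return "L2"
        if self.l3.access(addr):
            return "L3"
        return "memory"


# 4 CPUs, 2 CPUs per cluster, tiny caches for illustration.
h = Hierarchy(n_cpus=4, cpus_per_cluster=2,
              l1_size=128, l2_size=256, l3_size=1024)
for cpu, addr in [(0, 0), (0, 8), (1, 0), (2, 0)]:
    print(cpu, addr, h.access(cpu, addr))
```

Varying `n_cpus`, `cpus_per_cluster` and the three sizes corresponds to the parameters listed above, and `miss_rate()` on each cache gives the per-level miss rate.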
The first simulation ran the Integer Sort application from the NAS benchmarks on an environment similar to that of the MareNostrum supercomputer. The goal was to find the minimum environment configuration values at which execution would still be comparable to the original simulated execution on the MareNostrum environment. The simulation was done with Dimemas. The results show that such a configuration is: 64 buses, latency up to 1 ms and bandwidth of 250 MB/s; choosing a faster CPU does not make much difference.
The second simulation was a multilevel cache simulation. A matrix multiplication application was executed with different configurations of the L1, L2 and L3 caches in order to find the optimal cache configuration for this execution. For the simulation, a Pintool was created that gathers the miss rates of the caches. The gathered results show that the optimal configuration is 64 kB for L1 and 2 MB for L2, with 8 processors sharing one L2 cache; an L3 cache is not necessary, but if one is used, 16 MB is the better choice.