CESGA Alliance
UPC Operations Microbenchmarking Suite 1.0
User’s manual

Authors: PhD. Guillermo López Taboada, Damián Álvarez Mallón
E-mail: [email protected], [email protected]

Contents
1 Contact
2 Files in this benchmarking suite
3 Operations tested
4 Customizable parameters
  4.1 Compile time
  4.2 Run time
5 Compilation
6 Timers used
7 Output explanation

UOMS User’s Manual April / 2010

1 Contact

You can contact us at:

Galicia Supercomputing Center (CESGA)
http://www.cesga.es
Santiago de Compostela, Spain
[email protected]

PhD. Guillermo López Taboada
Computer Architecture Group (CAG)
http://gac.des.udc.es/index_en.html
University of A Coruña, Spain
[email protected]

2 Files in this benchmarking suite

• doc/manual.pdf: This file. User’s manual.
• COPYING and COPYING.LESSER: Files containing the use and redistribution terms (license).
• changelog.txt: File with the changes in each release.
• src/affinity.upc: UPC code with affinity-related tests.
• src/config/make.def.template.*: Makefile templates for HP UPC and Berkeley UPC.
• src/config/parameters.h: Header with some customizable parameters.
• src/defines.h: Header with needed definitions.
• src/headers.h: Header with HUCB functions headers.
• src/mem_manager.upc: Memory-related functions for allocation and freeing.
• src/UOMS.upc: Main file. It contains the actual benchmarking code.
• src/init.upc: Code to initialize some structures and variables.
• src/Makefile: Makefile to build the benchmarking suite.
• src/timers/timers.c: Timing functions.
• src/timers/timers.h: Timing functions headers.
• src/utils/data_print.upc: Functions to output the results.
• src/utils/utilities.c: Auxiliary functions.

3 Operations tested

• upc_barrier
• upc_all_broadcast
• upc_all_scatter
• upc_all_gather
• upc_all_gather_all
• upc_all_permute
• upc_all_exchange
• upc_all_reduceC
• upc_all_prefix_reduceC
• upc_all_reduceUC
• upc_all_prefix_reduceUC
• upc_all_reduceS
• upc_all_prefix_reduceS
• upc_all_reduceUS
• upc_all_prefix_reduceUS
• upc_all_reduceI
• upc_all_prefix_reduceI
• upc_all_reduceUI
• upc_all_prefix_reduceUI
• upc_all_reduceL
• upc_all_prefix_reduceL
• upc_all_reduceUL
• upc_all_prefix_reduceUL
• upc_all_reduceF
• upc_all_prefix_reduceF
• upc_all_reduceD
• upc_all_prefix_reduceD
• upc_all_reduceLD
• upc_all_prefix_reduceLD
• upc_memcpy (remote)
• upc_memget (remote)
• upc_memput (remote)
• upc_memcpy (local)
• upc_memget (local)
• upc_memput (local)
• memcpy (local)
• memmove (local)
• upc_memcpy_asynci (remote)
• upc_memget_asynci (remote)
• upc_memput_asynci (remote)
• upc_memcpy_asynci (local)
• upc_memget_asynci (local)
• upc_memput_asynci (local)
• upc_all_alloc
• upc_free

Bulk memory transfer operations have two modes: remote and local. Remote mode copies data from one thread to another, whereas local mode copies data from one thread to another memory region with affinity to the same thread.

4 Customizable parameters

4.1 Compile time

In the src/config/parameters.h file you can customize some parameters at compile time. They are:

• NUMCORES: If defined, it overrides the detection of the number of cores. If not defined, the number of cores is obtained through the sysconf(_SC_NPROCESSORS_ONLN) system call.
• ASYNC_MEM_TEST: If defined, the asynchronous memory transfer tests are built. It is defined by default.
• MINSIZE: The minimum message size to be used in the benchmarking. Default is 4 bytes.
• MAXSIZE: The maximum message size to be used in the benchmarking. Default is 16 megabytes.
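
The NUMCORES behaviour described above can be sketched in plain C as follows. This is only an illustration, not the code in the UOMS sources: the function name uoms_core_count is hypothetical, NUMCORES is assumed to be defined with a numeric value, and the fallback to 1 when sysconf fails is an assumption of this sketch.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not taken from the UOMS sources): returns the
 * number of cores following the NUMCORES description above. */
static long uoms_core_count(void)
{
    long cores;
#ifdef NUMCORES
    /* Compile-time override; assumes NUMCORES carries a value,
     * e.g. the code was built with -DNUMCORES=16. */
    cores = (long)NUMCORES;
#else
    /* Number of processors currently online, as reported by the system. */
    cores = sysconf(_SC_NPROCESSORS_ONLN);
    /* sysconf returns -1 on error; falling back to 1 is an assumption. */
    if (cores < 1)
        cores = 1;
#endif
    assert(cores >= 1);
    return cores;
}
```

Calling uoms_core_count() from a C program returns either the overridden value or the number of online processors. Note that _SC_NPROCESSORS_ONLN is a widely available extension (e.g. in glibc) rather than a name required by POSIX.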

4.2 Run time

The following flags can be used at run time on the command line:

• -help: Prints usage information and exits.
• -version: Prints the UOMS version and exits.
• -off_cache: Enables cache invalidation. Be aware that cache invalidation greatly increases memory consumption. Also note that it will not work for block sizes smaller than the cache line size.
• -warmup: Enables a warmup iteration.
• -reduce_op OP: Chooses the reduce operation to be performed by upc_all_reduceD and upc_all_prefix_reduceD. Valid operations are:
  – UPC_ADD (default)
  – UPC_MULT
  – UPC_LOGAND
  – UPC_LOGOR
  – UPC_AND
  – UPC_OR
  – UPC_XOR
  – UPC_MIN
  – UPC_MAX
• -sync_mode MODE: Chooses the synchronization mode for the collective operations. Valid modes are:
  – UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC (default)
  – UPC_IN_ALLSYNC|UPC_OUT_MYSYNC
  – UPC_IN_ALLSYNC|UPC_OUT_NOSYNC
  – UPC_IN_MYSYNC|UPC_OUT_ALLSYNC
  – UPC_IN_MYSYNC|UPC_OUT_MYSYNC
  – UPC_IN_MYSYNC|UPC_OUT_NOSYNC
  – UPC_IN_NOSYNC|UPC_OUT_ALLSYNC
  – UPC_IN_NOSYNC|UPC_OUT_MYSYNC
  – UPC_IN_NOSYNC|UPC_OUT_NOSYNC
• -msglen FILE: Reads user-defined problem sizes (in bytes) from FILE. If specified, it overrides -minsize and -maxsize.
• -minsize SIZE: Specifies the minimum block size (in bytes). Sizes increase by a factor of 2.
• -maxsize SIZE: Specifies the maximum block size (in bytes).
• -input FILE: Reads a user-defined list of benchmarks to run from FILE.
  Valid benchmark names are:
  – upc_barrier
  – upc_all_broadcast
  – upc_all_scatter
  – upc_all_gather
  – upc_all_gather_all
  – upc_all_exchange
  – upc_all_permute
  – upc_memget
  – upc_memput
  – upc_memcpy
  – local_upc_memget
  – local_upc_memput
  – local_upc_memcpy
  – memcpy
  – memmove
  – upc_all_alloc
  – upc_free
  – upc_all_reduceC
  – upc_all_prefix_reduceC
  – upc_all_reduceUC
  – upc_all_prefix_reduceUC
  – upc_all_reduceS
  – upc_all_prefix_reduceS
  – upc_all_reduceUS
  – upc_all_prefix_reduceUS
  – upc_all_reduceI
  – upc_all_prefix_reduceI
  – upc_all_reduceUI
  – upc_all_prefix_reduceUI
  – upc_all_reduceL
  – upc_all_prefix_reduceL
  – upc_all_reduceUL
  – upc_all_prefix_reduceUL
  – upc_all_reduceF
  – upc_all_prefix_reduceF
  – upc_all_reduceD
  – upc_all_prefix_reduceD
  – upc_all_reduceLD
  – upc_all_prefix_reduceLD
  – upc_memget_asynci
  – upc_memput_asynci
  – upc_memcpy_asynci
  – local_upc_memget_asynci
  – local_upc_memput_asynci
  – local_upc_memcpy_asynci

5 Compilation

To compile the suite you have to set up a correct src/config/make.def file. Templates are provided for this purpose. The needed parameters are:

• CC: Defines the C compiler used to compile the C code. Please note that this does not involve the C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
• CFLAGS: Defines the C flags used to compile the C code. Please note that this does not involve the C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
• UPCC: Defines the UPC compiler used to compile the suite.
• UPCFLAGS: Defines the UPC compiler flags used to compile the suite. Please note that you should not specify any number-of-threads flag at this point.
• UPCLINK: Defines the UPC linker used to link the suite.
• UPCLINKFLAGS: Defines the UPC linker flags used to link the suite.
• THREADS_SWITCH: Defines the correct switch to set the desired number of threads.
  It is compiler dependent, and also includes any blank space after the switch.

Once you have set up your make.def file you can compile the suite as follows:

make NTHREADS=NUMBER_OF_UPC_THREADS

E.g., for 128 threads:

make NTHREADS=128

6 Timers used

This suite uses high-resolution timers on the IA64 architecture. In particular, it uses the Interval Timer Counter (AR.ITC). For other architectures it uses hpupc_ticks_now if you are using HP UPC, or bupc_ticks_now if you are using Berkeley UPC, whose precision depends on the specific architecture. If none of these requirements is met, the suite uses the default gettimeofday function. However, the granularity of this function only allows measuring microseconds, rather than nanoseconds.

7 Output explanation

This is an example of the output for the broadcast:

#--------------------------------------------------
# Benchmarking upc_all_broadcast
# #processes = 2
#--------------------------------------------------
   #bytes  #repetitions  t_min[nsec]  t_max[nsec]  t_avg[nsec]  BW_aggregated[MB/sec]
        4            20        19942     48820275   2463315.85                   0.00
        8            20        19942        22922     21457.25                   0.70
       16            20        19942        22397     21420.10                   1.43
       32            20        19942        22235     21626.35                   2.88
       64            20        20277        33610     22886.00                   3.81
      128            20        20285        22812     21676.60                  11.22
      256            20        20767        22845     22230.50                  22.41
      512            20        20767        23020     22314.85                  44.48
     1024            20        22777        29255     24169.85                  70.01
     2048            20        23705        25425     24603.85                 161.10
     4096            20        24562        27097     26437.60                 302.32
     8192            20        29885        33205     32174.35                 493.42
    16384            20        42492        44735     43919.35                 732.49
    32768            10        68317        70052     69490.00                 935.53
    65536            10       121610       123837    122635.00                1058.42
   131072            10       227550       231515    229323.50                1132.30
   262144            10       437645       444740    441354.00                1178.86
   524288            10       861287       871700    867619.70                1202.91
  1048576             5      1702722      1704420   1703642.40                1230.42
  2097152             5      3417170      3435637   3429128.40                1220.82
  4194304             5      6830267      6839535   6834224.40                1226.49
  8388608             2     13434382     13469047  13451715.00                1245.61
 16777216             2     27310152     27343357  27326755.00                1227.15
 33554432             1     54294385     54294385  54294385.00                1236.02

The header indicates the benchmarked function and the number of processes involved. The first column shows the size used for each particular row: the size of the data at the root thread, or at any thread in a non-rooted operation. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth, calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved across all the repetitions.

Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality in NUMA systems, if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, e.g., on machines with non-uniform access to the network interface, like quad-socket Opteron/Nehalem-based machines, or cell-based machines like HP Integrity servers.

The output of these tests is preceded by something like:

#---------------------------------------------------------
# using #cores = 0 and 1 (Number of cores per node: 16)
# CPU Mask: 1000000000000000 (core 0), 0100000000000000 (core 1)
#---------------------------------------------------------

All tests after these lines are performed using core 0 (thread 0) and core 1 (thread 1) until another affinity header is shown.
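
As a rough check of how the last column relates to the others, the sketch below recomputes the aggregated bandwidth from the message size and the maximum latency. The formula is inferred from the sample broadcast output, not taken from the UOMS sources: the sample values are reproduced when the size is multiplied by 2 and 1 MB is taken as 10^6 bytes. Treating that factor of 2 as the number of threads is an assumption, and the function name uoms_bw_aggregated is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the UOMS sources).
 * Returns the aggregated bandwidth in MB/sec (1 MB = 10^6 bytes).
 *   bytes      message size of the row, in bytes
 *   factor     multiplier for the aggregate; 2 reproduces the sample
 *              output (assumed here to be the number of threads)
 *   t_max_nsec maximum latency of the row, in nanoseconds */
static double uoms_bw_aggregated(size_t bytes, unsigned factor,
                                 double t_max_nsec)
{
    assert(t_max_nsec > 0.0);
    /* 1 byte/nsec equals 10^9 bytes/sec, i.e. 1000 MB/sec. */
    return (double)bytes * (double)factor / t_max_nsec * 1000.0;
}
```

For the 1024-byte row of the sample (t_max = 29255 nsec), uoms_bw_aggregated(1024, 2, 29255.0) gives about 70.01 MB/sec, which matches the value in the last column.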