Contents
1 Curie's advanced usage manual
2 Optimization
2.1 Compilation options
2.1.1 Intel
2.1.1.1 Intel Sandy Bridge processors
2.1.2 GNU
3 Submission
3.1 Choosing or excluding nodes
4 MPI
4.1 Embarrassingly parallel jobs and MPMD jobs
4.2 BullxMPI
4.2.1 MPMD jobs
4.2.2 Tuning BullxMPI
4.2.3 Optimizing with BullxMPI
4.2.4 Debugging with BullxMPI
5 Process distribution, affinity and binding
5.1 Introduction
5.1.1 Hardware topology
5.1.2 Definitions
5.1.3 Process distribution
5.1.4 Why is affinity important for improving performance?
5.1.5 CPU affinity mask
5.2 SLURM
5.2.1 Process distribution
5.2.1.1 Curie hybrid node
5.2.2 Process binding
5.3 BullxMPI
5.3.1 Process distribution
5.3.2 Process binding
5.3.3 Manual process management
6 Using GPU
6.1 Two sequential GPU runs on a single hybrid node
7 Profiling
7.1 PAPI
7.2 VampirTrace/Vampir
7.2.1 Basics
7.2.2 Tips
7.2.3 Vampirserver
7.2.4 CUDA profiling
7.3 Scalasca
7.3.1 Standard utilization
7.3.2 Scalasca + Vampir
7.3.3 Scalasca + PAPI
7.4 Paraver
7.4.1 Trace generation
7.4.2 Converting traces to Paraver format
7.4.3 Launching Paraver
Curie's advanced usage manual
If you have suggestions or remarks, please contact us: hotline.tgcc@cea.fr
Optimization
Compilation options
Compilers provide many options to optimize a code. These options are described in the following sections.
Intel
-opt_report : generates a report describing the optimizations performed, written to stderr (-O3 required)
-ip, -ipo : inter-procedural optimizations (single and multiple files). The command xiar must be used instead of ar to generate a static library from objects compiled with the -ipo option.
-fast : default high optimization level (-O3 -ipo -static). Careful: this option cannot be used with MPI, because the MPI context needs to call libraries which only exist as dynamic versions, which is incompatible with the -static option. You need to replace -fast by -O3 -ipo.
-ftz : flushes denormalized numbers to zero at runtime.
-fp-relaxed : mathematical optimization functions. Leads to a small loss of accuracy.
-pad : enables changing the memory layout of variables (ifort only)
Some options enable specific instructions of Intel processors in order to optimize the code. These options are compatible with most Intel processors. The compiler will try to generate these instructions if the processor allows it.
-xSSE4.2 : May generate Intel® SSE4 Efficient Accelerated String and Text Processing instructions. May generate Intel® SSE4 Vectorizing Compiler and Media Accelerator, Intel® SSSE3, SSE3, SSE2, and SSE instructions.
-xSSE4.1 : May generate Intel® SSE4 Vectorizing Compiler and Media Accelerator instructions for Intel processors. May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions.
-xSSSE3 : May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel processors.
-xSSE3 : May generate Intel® SSE3, SSE2, and SSE instructions for Intel processors.
-xSSE2 : May generate Intel® SSE2 and SSE instructions for Intel processors.
-xHost : applies one of the previous options depending on the processor on which the compilation is performed. This option is recommended for optimizing your code.
None of these options are used by default. The SSE instructions use the vectorization capability of Intel processors.
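As an illustration, a typical compilation line combining these options might look like the following sketch (the source file names are hypothetical):
# Serial code: let the compiler pick the instruction set of the build host
ifort -O3 -xHost -opt_report -o my_code.exe my_code.f90
# MPI code: replace -fast by -O3 -ipo as explained above
mpif90 -O3 -ipo -o my_mpi_code.exe my_mpi_code.f90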
Intel Sandy Bridge processors
Curie thin nodes use the latest Intel processors, based on the Sandy Bridge architecture. This architecture provides new vectorization instructions called AVX, for Advanced Vector eXtensions. The option -xAVX generates code specific to Curie thin nodes.
Be careful: a code generated with the -xAVX option runs only on Intel Sandy Bridge processors. Otherwise, you will get this error message:
Fatal Error: This program was not built to run in your system.
Please verify that both the operating system and the processor support Intel(R) AVX.
Curie login nodes are Curie large nodes with Nehalem-EX processors. AVX code can be generated on these nodes through cross-compilation by adding the -xAVX option. On Curie large nodes, the -xHost option will not generate AVX code. If you need to compile with -xHost or if the installation requires some tests (like autotools/configure), you can submit a job which will compile on the Curie thin nodes.
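A minimal sketch of such a compilation job is given below; the request name, time limit, queue name and build commands are only placeholders to adapt to your project:
#!/bin/bash
#MSUB -r compile_avx            # Request name (placeholder)
#MSUB -n 1                      # A single task is enough for a compilation
#MSUB -T 600                    # Elapsed time limit in seconds
#MSUB -q <thin_nodes_queue>     # Replace with the actual queue of the thin nodes
#MSUB -A paxxxx                 # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
./configure && make             # or, for a single file: ifort -O3 -xHost -c my_code.f90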
GNU
Some options enable specific instruction sets of Intel processors, in order to optimize code behavior. These options are compatible with most Intel processors. The compiler will try to use these instructions if the processor allows it.
-mmmx / -mno-mmx : switch on or off the usage of the MMX instruction set.
-msse / -mno-sse : idem for SSE.
-msse2 / -mno-sse2 : idem for SSE2.
-msse3 / -mno-sse3 : idem for SSE3.
-mssse3 / -mno-ssse3 : idem for SSSE3.
-msse4.1 / -mno-sse4.1 : idem for SSE4.1.
-msse4.2 / -mno-sse4.2 : idem for SSE4.2.
-msse4 / -mno-sse4 : idem for SSE4.
-mavx / -mno-avx : idem for AVX, for the Curie thin nodes partition only.
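For example, a GNU compilation line targeting the Curie thin nodes might look like this sketch (the source file name is hypothetical):
gcc -O3 -mavx -o my_code.exe my_code.c      # AVX code: runs on the thin nodes only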
Submission
Choosing or excluding nodes
SLURM provides the possibility to choose or exclude specific nodes in the reservation for your job.
To choose nodes:
#!/bin/bash
#MSUB -r MyJob_Para              # Request name
#MSUB -n 32                      # Number of tasks to use
#MSUB -T 1800                    # Elapsed time limit in seconds
#MSUB -o example_%I.o            # Standard output. %I is the job id
#MSUB -e example_%I.e            # Error output. %I is the job id
#MSUB -A paxxxx                  # Project ID
#MSUB -E '-w curie[1000-1003]'   # Include 4 nodes (curie1000 to curie1003)
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
To exclude nodes:
#!/bin/bash
#MSUB -r MyJob_Para              # Request name
#MSUB -n 32                      # Number of tasks to use
#MSUB -T 1800                    # Elapsed time limit in seconds
#MSUB -o example_%I.o            # Standard output. %I is the job id
#MSUB -e example_%I.e            # Error output. %I is the job id
#MSUB -A paxxxx                  # Project ID
#MSUB -E '-x curie[1000-1003]'   # Exclude 4 nodes (curie1000 to curie1003)
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
MPI
Embarrassingly parallel jobs and MPMD jobs
An embarrassingly parallel job is a job which launches independent processes. These processes need few or no communications.
A MPMD (Multiple Program Multiple Data) job is a parallel job which launches different executables over the processes. A MPMD job can be parallel with MPI and can do many communications.
These two concepts are separate, but we present them together because the way to launch them on Curie is similar. A simple example was already given in the Curie info page.
In the following example, we use ccc_mprun to launch the job. srun can be used too. We want to launch bin0 on MPI rank 0, bin1 on MPI rank 1 and bin2 on MPI rank 2. We first have to write a shell script which describes the topology of our job:
launch_exe.sh:
#!/bin/bash
if [ $SLURM_PROCID -eq 0 ]
then
  ./bin0
fi
if [ $SLURM_PROCID -eq 1 ]
then
  ./bin1
fi
if [ $SLURM_PROCID -eq 2 ]
then
  ./bin2
fi
We can then launch our job with 3 processes:
ccc_mprun -n 3 ./launch_exe.sh
The script launch_exe.sh must have execute permission. When ccc_mprun launches the job, it initializes some environment variables. Among them, SLURM_PROCID defines the current MPI rank.
BullxMPI
MPMD jobs
BullxMPI (or OpenMPI) jobs can be launched with the mpirun launcher. In this case, there are other ways to launch MPMD jobs (see the embarrassingly parallel jobs section).
We take the same example as in the embarrassingly parallel jobs section. There are then two ways of launching MPMD scripts.
We don't need launch_exe.sh anymore: we can launch the job directly with the mpirun command:
mpirun -np 1 ./bin0 : -np 1 ./bin1 : -np 1 ./bin2
Or, in launch_exe.sh, we can replace SLURM_PROCID by OMPI_COMM_WORLD_RANK:
launch_exe.sh:
#!/bin/bash
if [ ${OMPI_COMM_WORLD_RANK} -eq 0 ]
then
  ./bin0
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 1 ]
then
  ./bin1
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 2 ]
then
  ./bin2
fi
We can then launch our job with 3 processes:
mpirun -np 3 ./launch_exe.sh
Tuning BullxMPI
BullxMPI is based on OpenMPI. It can be tuned with parameters. The command ompi_info -a gives you the list of all parameters and their descriptions:
curie50$ ompi_info -a
(...)
MCA mpi: parameter "mpi_show_mca_params" (current value: <none>, data source: default value)
Whether to show all MCA parameter values during MPI_INIT or not (good for reproducibility of MPI jobs for debug purposes). Accepted values are all, default, file, api, and environment - or a comma delimited combination of them
(...)
These parameters can be modified with environment variables set before the ccc_mprun command. The corresponding environment variable has the form OMPI_MCA_xxxxx, where xxxxx is the parameter name.
#!/bin/bash
#MSUB -r MyJob_Para        # Request name
#MSUB -n 32                # Number of tasks to use
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -o example_%I.o      # Standard output. %I is the job id
#MSUB -e example_%I.e      # Error output. %I is the job id
#MSUB -A paxxxx            # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMPI_MCA_mpi_show_mca_params=all
ccc_mprun ./a.out
Optimizing with BullxMPI
You can try these parameters in order to optimize BullxMPI:
export OMPI_MCA_mpi_leave_pinned=1
This setting improves the communication bandwidth if the code uses the same buffers for communication throughout the execution.
export OMPI_MCA_btl_openib_use_eager_rdma=1
This parameter optimizes the latency for short messages on the Infiniband network, but the code will use more memory.
Be careful, these parameters are not set by default. They can influence the behaviour of your codes.
Debugging with BullxMPI
Sometimes, BullxMPI codes can hang in a collective communication for large jobs. If you find yourself in this case, you can try this parameter:
export OMPI_MCA_coll="^ghc,tuned"
This setting disables optimized collective communications: it can slow down your code if it uses many collective operations.
Process distribution, affinity and binding
Introduction
Hardware topology
Hardware topology of a Curie fat node
The hardware topology is the organization of cores, processors, sockets and memory in a node. The previous image was created with hwloc. You can get access to hwloc on Curie with the command module load hwloc.
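For instance, once the module is loaded, the topology of the node you are logged on can be displayed with the standard hwloc tool lstopo:
bash-4.00 $ module load hwloc
bash-4.00 $ lstopo          # prints the sockets, cores and memory of the current node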
Definitions
We define here some vocabulary:
Binding: a Linux process can be bound (or stuck) to one or many cores. It means that a process and its threads can run only on a given selection of cores. For example, a process which is bound to a socket on a Curie fat node can run on any of the 8 cores of that processor.
Affinity: the resource management policy (cores and memory) applied to processes.
Distribution: the distribution of MPI processes describes how these processes are spread across cores, sockets or nodes.
On Curie, the default behaviour for distribution, affinity and binding is managed by SLURM, more precisely by the ccc_mprun command.
Process distribution
We present here some examples of MPI process distributions.
block or round: this is the standard distribution. From the SLURM manpage: the block distribution method will distribute tasks to a node such that consecutive tasks share a node. For example, consider an allocation of two nodes, each with 8 cores. A block distribution request will distribute those tasks with tasks 0 to 7 on the first node and tasks 8 to 15 on the second node.
Block distribution by core
cyclic by socket: from the SLURM manpage, the cyclic distribution method will distribute tasks to a socket such that consecutive tasks are distributed over consecutive sockets (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by socket request will distribute those tasks with tasks 0,2,4,6 on the first socket and tasks 1,3,5,7 on the second socket. In the following image, the distribution is cyclic by socket and block by node.
Cyclic distribution by socket
cyclic by node: from the SLURM manpage, the cyclic distribution method will distribute tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by node request will distribute those tasks with tasks 0,2,4,6,8,10,12,14 on the first node and tasks 1,3,5,7,9,11,13,15 on the second node. In the following image, the distribution is cyclic by node and block by socket.
Cyclic distribution by node
Why is affinity important for improving performance?
Curie nodes are NUMA (Non-Uniform Memory Access) nodes. It means that it takes longer to access some regions of memory than others, because all memory regions are not physically on the same bus.
NUMA node: Curie hybrid node
In this picture, we can see that if a piece of data is in the memory module 0, a process running on the second socket (like the 4th process) will take more time to access it. We can introduce the notion of local data vs. remote data. In our example, if we consider a process running on socket 0, a piece of data is local if it is in the memory module 0 and remote if it is in the memory module 1.
We can then deduce the reasons why tuning the process affinity is important:
Data locality improves performance. If your code uses shared memory (like pthreads or OpenMP), the best choice is to regroup your threads on the same socket. The shared data should be local to the socket and, moreover, the data will potentially stay in the processor's cache.
System processes can interrupt your process running on a core. If your process is not bound to a core or to a socket, it can be moved to another core or socket. In this case, all the data of this process has to be moved with it, which can take some time.
MPI communications are faster between processes which are on the same socket. If you know that two processes communicate a lot, you can bind them to the same socket.
On Curie hybrid nodes, the GPUs are connected to buses which are local to a socket. A process takes longer to access a GPU which is not connected to its socket.
NUMA node: Curie hybrid node with GPU
For all these reasons, it is better to know the NUMA configuration of the Curie nodes (fat, hybrid and thin). In the following sections, we present some ways to tune the process affinity of your jobs.
CPU affinity mask
The affinity of a process is defined by a mask. A mask is a binary value whose length is defined by the number of cores available on a node. For example, Curie hybrid nodes have 8 cores: the binary mask value will have 8 digits. Each digit is 0 or 1. The process will run only on the cores which have 1 as value. A binary mask must be read from right to left.
For example, a process which runs on the cores 0, 4, 6 and 7 will have 11010001 as affinity binary mask.
SLURM and BullxMPI use these masks, but converted to hexadecimal numbers.
To convert a binary value to hexadecimal:
$ echo "obase=16;ibase=2;11010001" | bc
D1
To convert a hexadecimal value to binary:
$ echo "ibase=16;obase=2;D1" | bc
11010001
The numbering of the cores is the PU number from the output of hwloc.
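As another example, the mask selecting cores 0 to 3 (the first four cores of a Curie hybrid node) can be computed the same way:
$ echo "obase=16;ibase=2;00001111" | bc
F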
SLURM
SLURM is the default launcher for jobs on Curie. SLURM manages the processes even for sequential jobs. We recommend you to use ccc_mprun. By default, SLURM binds each process to a core. The distribution is block by node and by core.
The option -E '--cpu_bind=verbose' of ccc_mprun gives you a report about the binding of processes before the run:
$ ccc_mprun -E '--cpu_bind=verbose' -q hybrid -n 8 ./a.out
cpu_bind=MASK - curie7054, task  3  3 [3534]: mask 0x8 set
cpu_bind=MASK - curie7054, task  0  0 [3531]: mask 0x1 set
cpu_bind=MASK - curie7054, task  1  1 [3532]: mask 0x2 set
cpu_bind=MASK - curie7054, task  2  2 [3533]: mask 0x4 set
cpu_bind=MASK - curie7054, task  4  4 [3535]: mask 0x10 set
cpu_bind=MASK - curie7054, task  5  5 [3536]: mask 0x20 set
cpu_bind=MASK - curie7054, task  7  7 [3538]: mask 0x80 set
cpu_bind=MASK - curie7054, task  6  6 [3537]: mask 0x40 set
In this example, we can see that process 5 has 0x20 as hexadecimal mask, i.e. 00100000 in binary: the 5th process will run only on core 5.
Process distribution
To change the default distribution of processes, you can use the option -E '-m' of ccc_mprun. With SLURM, you have two levels of process distribution: node and socket.
Node block distribution:
ccc_mprun -E '-m block' ./a.out
Node cyclic distribution:
ccc_mprun -E '-m cyclic' ./a.out
By default, the distribution over the sockets is block. In the following examples for socket distribution, the node distribution is block.
Socket block distribution:
ccc_mprun -E '-m block:block' ./a.out
Socket cyclic distribution:
ccc_mprun -E '-m block:cyclic' ./a.out
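The distribution options can be combined with the binding report shown earlier to check the resulting placement; a sketch, assuming both underlying srun options can be passed together in the same -E string:
ccc_mprun -E '-m block:cyclic --cpu_bind=verbose' ./a.out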
Curie hybrid node
On a Curie hybrid node, each GPU is connected to a socket (see the previous picture). It takes longer for a process to access a GPU if this process is not on the same socket as the GPU. By default, the distribution is block by core. Then the MPI rank 0 is located on the first socket and the MPI rank 1 is on the first socket too. The majority of GPU codes will assign GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1. In this case, the bandwidth between MPI rank 1 and GPU 1 is not optimal.
If your code does this, in order to obtain the best performance, you should:
use the block:cyclic distribution
if you intend to use only 2 MPI processes per node, reserve 4 cores per process with the directive #MSUB -c 4. The two processes will then be placed on two different sockets (see the sketch below).
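A minimal sketch of the second approach, with 2 MPI processes per hybrid node and 4 cores per process (request name and project ID are placeholders):
#!/bin/bash
#MSUB -r gpu_affinity      # Request name (placeholder)
#MSUB -n 2                 # Number of tasks to use
#MSUB -N 1                 # 1 node
#MSUB -c 4                 # Assign 4 cores per process: one process per socket
#MSUB -q hybrid
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -A paxxxx            # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out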
Process binding
By default, processes are bound to a core. For multi-threaded jobs, processes create threads: these threads will be bound to the assigned core. To allow these threads to use other cores, SLURM provides the option -c to assign several cores to a process.
#!/bin/bash
#MSUB -r MyJob_Para        # Request name
#MSUB -n 8                 # Number of tasks to use
#MSUB -c 4                 # Assign 4 cores per process
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -o example_%I.o      # Standard output. %I is the job id
#MSUB -A paxxxx            # Project ID
export OMP_NUM_THREADS=4
ccc_mprun ./a.out
In this example, our hybrid OpenMP/MPI code runs on 8 MPI processes and each process will use 4 OpenMP threads. We give here an example of the output with the verbose option for binding:
$ ccc_mprun ./a.out
cpu_bind=MASK - curie1139, task  5  5 [18761]: mask 0x40404040 set
cpu_bind=MASK - curie1139, task  0  0 [18756]: mask 0x1010101 set
cpu_bind=MASK - curie1139, task  1  1 [18757]: mask 0x10101010 set
cpu_bind=MASK - curie1139, task  6  6 [18762]: mask 0x8080808 set
cpu_bind=MASK - curie1139, task  4  4 [18760]: mask 0x4040404 set
cpu_bind=MASK - curie1139, task  3  3 [18759]: mask 0x20202020 set
cpu_bind=MASK - curie1139, task  2  2 [18758]: mask 0x2020202 set
cpu_bind=MASK - curie1139, task  7  7 [18763]: mask 0x80808080 set
We can see here that the MPI rank 0 process is launched over the cores 0, 8, 16 and 24 of the node. These cores are all located on the node's first socket.
Remark: with the -c option, SLURM tries to group the cores as much as possible to get the best performance. In the previous example, all the cores of an MPI process are located on the same socket.
Another example:
$ ccc_mprun -n 1 -c 32 -E '--cpu_bind=verbose' ./a.out
cpu_bind=MASK - curie1017, task 0 0 [34710]: mask 0xffffffff set
We can see that the process is not bound to a single core and can run over all the cores of the node.
BullxMPI
BullxMPI has its own process management policy. To use it, you first have to disable SLURM's process management policy by adding the directive #MSUB -E '--cpu_bind=none'. You can then use the BullxMPI launcher mpirun:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun -np 32 ./a.out
Note: in this example, the BullxMPI process management policy can only act on the 32 cores allocated by SLURM.
The default BullxMPI process management policy is:
the processes are not bound
the processes can run on all cores
the default distribution is block by core and by node
The option --report-bindings gives you a report about the binding of processes before the run:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --report-bindings --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
And here is the output:
+ mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],3] to socket 1 cpus 22222222
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],4] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],5] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],6] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],7] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],0] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],1] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],2] to socket 1 cpus 22222222
In the following paragraphs, we present the different possibilities of process distribution and binding. These options can be mixed (when possible).
Remark: the following examples use a whole Curie fat node. We reserve 32 cores with #MSUB -n 32 and #MSUB -x to have all the cores and to be able to do what we want with them. These are only examples for simple cases. In other cases, there may be conflicts with SLURM.
Process distribution
Block distribution by core:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bycore -np 32 ./a.out
Cyclic distribution by socket:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bysocket -np 32 ./a.out
Cyclic distribution by node:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -N 16
#MSUB -x                     # Require exclusive nodes
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bynode -np 32 ./a.out
Process binding
No binding:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bind-to-none -np 32 ./a.out
Core binding:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bind-to-core -np 32 ./a.out
Socket binding (the process and its threads can run on all cores of a socket):
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bind-to-socket -np 32 ./a.out
You can specify the number of cores to assign to an MPI process:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
Here we assign 4 cores per MPI process.
Manual process management
BullxMPI gives you the possibility to manually assign your processes through a hostfile and a rankfile. An example:
#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
hostname > hostfile.txt
echo "rank 0=${HOSTNAME} slot=0,1,2,3"     > rankfile.txt
echo "rank 1=${HOSTNAME} slot=8,10,12,14"  >> rankfile.txt
echo "rank 2=${HOSTNAME} slot=16,17,22,23" >> rankfile.txt
echo "rank 3=${HOSTNAME} slot=19,20,21,31" >> rankfile.txt
mpirun --hostfile hostfile.txt --rankfile rankfile.txt -np 4 ./a.out
In this example, there are several steps:
You have to create a hostfile (here hostfile.txt) where you put the hostnames of all the nodes your run will use.
You have to create a rankfile (here rankfile.txt) where you assign to each MPI rank the cores on which it can run. In our example, the process of rank 0 will have the cores 0, 1, 2 and 3 as affinity, etc. Be careful, the numbering of the cores is different from the hwloc output: on a Curie fat node, the first eight cores are on socket 0, etc.
You can then launch mpirun, specifying the hostfile and the rankfile.
Using GPU
Two sequential GPU runs on a single hybrid node
To launch two separate sequential GPU runs on a single hybrid node, you have to set the environment variable CUDA_VISIBLE_DEVICES, which selects the GPUs each process can see. First, create a script to launch the binaries:
$ cat launch_exe.sh
#!/bin/bash
set -x
export CUDA_VISIBLE_DEVICES=${SLURM_PROCID}   # the first process will see only the first GPU and the second process will see only the second GPU
if [ $SLURM_PROCID -eq 0 ]
then
  ./bin_1 > job_${SLURM_PROCID}.out
fi
if [ $SLURM_PROCID -eq 1 ]
then
  ./bin_2 > job_${SLURM_PROCID}.out
fi
/!\ To work correctly, the two binaries have to be sequential (not using MPI).
Then run your script, making sure to submit two MPI processes with 4 cores per process:
$ cat multi_jobs_gpu.sh
#!/bin/bash
#MSUB -r jobs_gpu
#MSUB -n 2                          # 2 tasks
#MSUB -N 1                          # 1 node
#MSUB -c 4                          # each task takes 4 cores
#MSUB -q hybrid
#MSUB -T 1800
#MSUB -o multi_jobs_gpu_%I.out
#MSUB -e multi_jobs_gpu_%I.out
set -x
cd $BRIDGE_MSUB_PWD
export OMP_NUM_THREADS=4
ccc_mprun -E '--wait=0' -n 2 -c 4 ./launch_exe.sh
# -E '--wait=0' tells SLURM not to kill the job if one of the two processes terminates before the other
So your first process will be located on the first CPU socket and the second process will be on the second CPU socket (each socket is linked with a GPU).
$ ccc_msub multi_jobs_gpu.sh
Profiling
PAPI
PAPI is an API which allows you to retrieve hardware counters from the CPU. Here is an example in Fortran which gets the number of floating point operations of a matrix DAXPY:
program main
  implicit none
  include 'f90papi.h'
  !
  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 10
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
  ! PAPI variables
  integer, parameter :: max_event = 1
  integer, dimension(max_event) :: event
  integer :: num_events, retval
  integer(kind=8), dimension(max_event) :: values
  ! Init PAPI
  call PAPIf_num_counters( num_events )
  print *, 'Number of hardware counters supported: ', num_events
  call PAPIf_query_event(PAPI_FP_INS, retval)
  if (retval .NE. PAPI_OK) then
    event(1) = PAPI_TOT_INS
  else
    ! Total floating point operations
    event(1) = PAPI_FP_INS
  end if
  ! Init matrices
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do
  ! Set up counters
  num_events = 1
  call PAPIf_start_counters( event, num_events, retval)
  ! Clear the counter values
  call PAPIf_read_counters(values, num_events, retval)
  ! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do
  ! Stop the counters and put the results in the array values
  call PAPIf_stop_counters(values, num_events, retval)
  ! Print results
  if (event(1) .EQ. PAPI_TOT_INS) then
    print *, 'TOT Instructions: ', values(1)
  else
    print *, 'FP Instructions: ', values(1)
  end if
end program main
To compile, you have to load the PAPI module:
bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
Number of hardware counters supported:            7
FP Instructions:       10046163
To get the list of available hardware counters, you can use the papi_avail command.
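For instance, after loading the papi module:
bash-4.00 $ module load papi/4.1.3
bash-4.00 $ papi_avail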
This library can also retrieve the MFLOPS of a given region of your code:
program main
  implicit none
  include 'f90papi.h'
  !
  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 100
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
  ! PAPI variables
  integer :: retval
  real(kind=4) :: proc_time, mflops, real_time
  integer(kind=8) :: flpins
  ! Init PAPI
  retval = PAPI_VER_CURRENT
  call PAPIf_library_init(retval)
  if ( retval.NE.PAPI_VER_CURRENT) then
    print*, 'PAPI_library_init', retval
  end if
  call PAPIf_query_event(PAPI_FP_INS, retval)
  ! Init matrices
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do
  ! Set up counter
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )
  ! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do
  ! Collect the data into the variables passed in
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )
  ! Print results
  print *, 'Real_time: ', real_time
  print *, 'Proc_time: ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS: ', mflops
  !
end program main
and the output:
bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi_flops.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
Real_time:     6.1250001E-02
Proc_time:     5.1447589E-02
Total flpins:      100056592
MFLOPS:     1944.826
If you want more details, you can contact us or visit the PAPI website.
VampirTrace/Vampir
VampirTrace is a library which lets you profile your parallel code by taking traces during the execution of the program. We present here an introduction to Vampir/VampirTrace.
Basics
First, you must compile your code with the VampirTrace compiler wrappers. In order to use VampirTrace, you need to load the vampirtrace module:
bash-4.00 $ module load vampirtrace
bash-4.00 $ vtcc -c prog.c
bash-4.00 $ vtcc -o prog.exe prog.o
Available compilers are:
vtcc : C compiler
vtc++, vtCC and vtcxx : C++ compilers
vtf77 and vtf90 : Fortran compilers
To compile a MPI code, you should type:
bash-4.00 $ vtcc -vt:cc mpicc -g -c prog.c
bash-4.00 $ vtcc -vt:cc mpicc -g -o prog.exe prog.o
For other languages you have:
vtcc -vt:cc mpicc : MPI C compiler
vtc++ -vt:cxx mpic++, vtCC -vt:cxx mpiCC and vtcxx -vt:cxx mpicxx : MPI C++ compilers
vtf77 -vt:f77 mpif77 and vtf90 -vt:f90 mpif90 : MPI Fortran compilers
By default, the VampirTrace wrappers use the Intel compilers. To switch to another compiler, you can use the same method as for MPI:
bash-4.00 $ vtcc -vt:cc gcc -O2 -c prog.c
bash-4.00 $ vtcc -vt:cc gcc -O2 -o prog.exe prog.o
To profile an OpenMP or a hybrid OpenMP/MPI application, you should add the corresponding OpenMP option of the compiler:
bash-4.00 $ vtcc -openmp -O2 -c prog.c
bash-4.00 $ vtcc -openmp -O2 -o prog.exe prog.o
Then you can submit your job. Here is an example of a submission script:
#!/bin/bash
#MSUB -r MyJob_Para        # Request name
#MSUB -n 32                # Number of tasks to use
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -o example_%I.o      # Standard output. %I is the job id
#MSUB -e example_%I.e      # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./prog.exe
At the end of the execution, the program generates several profiling files:
bash-4.00 $ ls
a.out a.out.0.def.z a.out.1.events.z ... a.out.otf
To visualize those files, you must load the vampir module:
bash-4.00 $ module load vampir
bash-4.00 $ vampir a.out.otf
Vampir window
If you need more information, you can contact us.
Tips
VampirTrace allocates a buffer to store its profiling information. If the buffer is full, VampirTrace flushes it to disk. By default, the size of this buffer is 32MB per process and the maximum number of flushes is one. You can increase (or reduce) the size of the buffer; your code will then also use more memory. To change the size, you have to set an environment variable:
export VT_BUFFER_SIZE=64M
ccc_mprun ./prog.exe
In this example, the buffer is set to 64 MB. We can also increase the maximum number of flushes:
export VT_MAX_FLUSHES=10
ccc_mprun ./prog.exe
If the value of VT_MAX_FLUSHES is 0, the number of flushes is unlimited.
By default, VampirTrace first stores the profiling information in a local directory (/tmp) of the process. These files can be very large and fill up that directory. You should therefore redirect this local directory to another location:
export VT_PFORM_LDIR=$SCRATCHDIR
There are more VampirTrace variables which can be used. See the User Manual for more details.
Vampirserver
Traces generated by VampirTrace can be very large, and Vampir can be very slow when visualizing them. Vampir provides Vampirserver: a parallel program which uses CPU computing power to accelerate the Vampir visualization. First, you have to submit a job which launches Vampirserver on Curie nodes:
$ cat vampirserver.sh
#!/bin/bash
#MSUB -r vampirserver          # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o vampirserver_%I.o     # Standard output. %I is the job id
#MSUB -e vampirserver_%I.e     # Error output. %I is the job id
ccc_mprun vngd
$ module load vampir
$ ccc_msub vampirserver.sh
When the job is running, you will obtain this output:
$ ccc_mpp
USER  ACCOUNT  BATCHID  NCPU  QUEUE  PRIORITY  STATE  RLIM   RUN/START  SUSP  OLD   NAME          NODES
toto  genXXX   234481   32    large  210332    RUN    30.0m  1.3m             1.3m  vampirserver  curie1352
$ ccc_mpeek 234481
Found license file: /usr/local/vampir-7.3/bin/lic.dat
Running 31 analysis processes... (abort with Ctrl-C or vngd-shutdown)
Server listens on: curie1352:30000
In our example, the Vampirserver master node is curie1352 and the port to connect to is 30000. You can then launch Vampir on a front node. Instead of clicking on Open, click on Remote Open:
Connecting to Vampirserver
Fill in the server and the port. You will be connected to Vampirserver. You can then open an OTF file and visualize it.
Notes:
You can ask for as many processors as you want: the analysis will be faster if your profiling files are big. But be careful, it consumes your computing time.
Don't forget to delete the Vampirserver job after your analysis.
CUDA profiling
VampirTrace can collect profiling data from CUDA programs. As previously, you have to replace the compilers by the VampirTrace wrappers; the NVCC compiler should be replaced by vtnvcc. Then, when you run your program, you have to set an environment variable:
export VT_CUDARTTRACE=yes
ccc_mprun ./prog.exe
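For completeness, a sketch of the corresponding compilation step, mirroring the wrappers shown above (prog.cu is a hypothetical source file):
bash-4.00 $ module load vampirtrace
bash-4.00 $ vtnvcc -c prog.cu
bash-4.00 $ vtnvcc -o prog.exe prog.o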
Scalasca
Scalasca is a set of software tools which lets you profile your parallel code by taking traces during the execution of the program. This software is a kind of parallel gprof with more information. We present here an introduction to Scalasca.
Standard utilization
First, you must compile your code by adding the scalasca tool in front of your compiler call. In order to use Scalasca, you need to load the scalasca module:
bash-4.00 $ module load scalasca
bash-4.00 $ scalasca -instrument mpicc -c prog.c
bash-4.00 $ scalasca -instrument mpicc -o prog.exe prog.o
or for Fortran:
bash-4.00 $ module load scalasca
bash-4.00 $ scalasca -instrument mpif90 -c prog.f90
bash-4.00 $ scalasca -instrument mpif90 -o prog.exe prog.o
You can compile OpenMP programs:
bash-4.00 $ scalasca -instrument ifort -openmp -c prog.f90
bash-4.00 $ scalasca -instrument ifort -openmp -o prog.exe prog.o
You can also profile hybrid programs:
bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -c prog.f90
bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -o prog.exe prog.o
Then you can submit your job. Here is an example of a submission script:
#!/bin/bash
#MSUB -r MyJob_Para        # Request name
#MSUB -n 32                # Number of tasks to use
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -o example_%I.o      # Standard output. %I is the job id
#MSUB -e example_%I.e      # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
export SCAN_MPI_LAUNCHER=ccc_mprun
scalasca -analyze ccc_mprun ./prog.exe
At the end of the execution, the program generates a directory which contains the profiling files:
bash-4.00 $ ls epik_*
...
To visualize those files, you can type:
bash-4.00 $ scalasca -examine epik_*
Scalasca
If you need more information, you can contact us.
Scalasca + Vampir
Scalasca can generate an OTF trace file in order to visualize it with Vampir. To activate traces, you can add the -t option to scalasca when you launch the run. Here is the previous script, modified:
#!/bin/bash
#MSUB -r MyJob_Para        # Request name
#MSUB -n 32                # Number of tasks to use
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -o example_%I.o      # Standard output. %I is the job id
#MSUB -e example_%I.e      # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze -t mpirun ./prog.exe
At the end of the execution, the program generates a directory which contains the profiling files:
bash-4.00 $ ls epik_*
...
You can visualize those files as previously. To generate the OTF trace files, you can type:
bash-4.00 $ ls epik_*
bash-4.00 $ elg2otf epik_*
It will generate an OTF file under the epik_* directory. To visualize it, you can load Vampir:
bash-4.00 $ module load vampir
bash-4.00 $ vampir epik_*/a.otf
Scalasca + PAPI
Scalasca can retrieve hardware counters with PAPI. For example, if you want to retrieve the number of floating point operations:
#!/bin/bash
#MSUB -r MyJob_Para        # Request name
#MSUB -n 32                # Number of tasks to use
#MSUB -T 1800              # Elapsed time limit in seconds
#MSUB -o example_%I.o      # Standard output. %I is the job id
#MSUB -e example_%I.e      # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
export EPK_METRICS=PAPI_FP_OPS
scalasca -analyze mpirun ./prog.exe
Then the number of floating point operations will appear in the profile when you visualize it. You can retrieve at most 3 hardware counters at the same time on Curie. The syntax is then:
export EPK_METRICS="PAPI_FP_OPS:PAPI_TOT_CYC"
Paraver
Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI, OpenMP, MPI+OpenMP, hardware counter profiles, operating system activity and many other things you may think of!
In order to use the Paraver tools, you need to load the paraver module:
bash-4.00 $ module load paraver
bash-4.00 $ module show paraver
-------------------------------------------------------------------
/usr/local/ccc_users_env/modules/development/paraver/4.1.1:
module-whatis    Paraver
conflict         paraver
prepend-path     PATH /usr/local/paraver-4.1.1/bin
prepend-path     PATH /usr/local/extrae-2.1.1/bin
prepend-path     LD_LIBRARY_PATH /usr/local/paraver-4.1.1/lib
prepend-path     LD_LIBRARY_PATH /usr/local/extrae-2.1.1/lib
module           load papi
setenv           PARAVER_HOME /usr/local/paraver-4.1.1
setenv           EXTRAE_HOME /usr/local/extrae-2.1.1
setenv           EXTRAE_LIB_DIR /usr/local/extrae-2.1.1/lib
setenv           MPI_TRACE_LIBS /usr/local/extrae-2.1.1/lib/libmpitrace.so
-------------------------------------------------------------------
Trace generation
The simplest way to activate MPI instrumentation of your code is to dynamically load the library before execution. This can be done by adding the following line to your submission script:
export LD_PRELOAD=$LD_PRELOAD:$MPI_TRACE_LIBS
The instrumentation process is managed by Extrae and also needs a configuration file in XML format. You will have to add the next line to your submission script:
export EXTRAE_CONFIG_FILE=./extrae_config_file.xml
All the details about how to write a config file are available in Extrae's manual, which you can find at $EXTRAE_HOME/doc/user-guide.pdf. You will also find many example scripts in the $EXTRAE_HOME/examples/LINUX file tree.
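A minimal sketch of a submission script putting these two lines together might look as follows (binary name and number of tasks are placeholders):
#!/bin/bash
#MSUB -r trace_paraver     # Request name (placeholder)
#MSUB -n 32                # Number of tasks to use
#MSUB -T 1800              # Elapsed time limit in seconds
set -x
cd ${BRIDGE_MSUB_PWD}
module load paraver
export LD_PRELOAD=$LD_PRELOAD:$MPI_TRACE_LIBS
export EXTRAE_CONFIG_FILE=./extrae_config_file.xml
ccc_mprun ./prog.exe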
You can also add manual instrumentation in your code to add specific user events. This is mandatory if you want to see your own functions in the Paraver timelines.
If the trace generation succeeds during the computation, you will find a directory set-0 containing some .mpit files in your working directory. You will also find a TRACE.mpits file which lists all these files.
Converting traces to Paraver format
Extrae provides a tool named mpi2prv to convert the mpit files into a .prv file which can be read by Paraver. Since this can be a long operation, we recommend you to use the parallel version of this tool, mpimpi2prv. You will need fewer processes than previously used to compute. An example script is given below:
bash-4.00 $ cat rebuild.sh
#MSUB -r merge
#MSUB -n 8
#MSUB -T 1800
set -x
cd $BRIDGE_MSUB_PWD
ccc_mprun mpimpi2prv -syn -e path_to_your_binary -f TRACE.mpits -o file_to_be_analysed.prv
Launching Paraver
You now just have to launch "paraver file_to_be_analysed.prv". As Paraver may require a lot of memory and CPU, it may be better to launch it through a submission script (do not forget then to activate the -X option of ccc_msub).
For analyzing your data you will need some configuration files, available in Paraver's browser under the $PARAVER_HOME/cfgs directory.
Paraver window