Scone user manual
March 2012
Contents
1 What is Scone
  1.1 Software available
  1.2 Things to note
2 Logging in
  2.1 Logging in to Scone from the University Network
    2.1.1 Linux Machines
    2.1.2 Windows machines
  2.2 Logging in to the Scone nodes
3 Getting your files on Scone
  3.1 Via scp/sftp
  3.2 Via samba
4 NAG Libraries
  4.1 ifort
  4.2 gfortran
5 Using the condor queue
  5.1 Submitting a job
    5.1.1 Writing the condor file
    5.1.2 Submitting and managing the condor jobs
1 What is Scone
Scone is a cluster of Linux servers designed to fulfill the High Performance
Computing needs of the Department of Mathematics. It consists of 30 64-bit
machines, all of which run Linux 2.6.28.
There are eight machines for general use on the scone system. There are
twelve machines for use by the statistics group, of which seven run under a
condor job submission scheme. There are also nine machines for the applied
group, which do not use job submission software and of which seven are for
the use of the fluids group only. Finally there is one machine for the use of
the pure group, which again does not use job submission software. A summary
of the situation is shown in the table below.
Machine     Group       Condor   CPU                    RAM
node1       General     No       8 x 2.6 GHz Opteron    32GB
node2       General     No       8 x 2.6 GHz Opteron    32GB
node3       General     No       8 x 2.6 GHz Opteron    32GB
node4       General     No       8 x 2.6 GHz Opteron    32GB
node5       General     No       8 x 2.6 GHz Opteron    64GB
node6       General     No       12 x 2.6 GHz Xeon      48GB
node7       General     No       12 x 2.6 GHz Xeon      48GB
node8       General     No       12 x 2.6 GHz Xeon      48GB
zeppo       Statistics  Yes      4 x 2.2 GHz Opteron    8GB
chico       Statistics  Yes      4 x 2.2 GHz Opteron    8GB
harpo       Statistics  Yes      4 x 2.2 GHz Opteron    8GB
groucho     Statistics  Yes      4 x 2.2 GHz Opteron    8GB
barker      Statistics  Yes      4 x 2.2 GHz Opteron    8GB
morecambe   Statistics  Yes      2 x 2.6 GHz Opteron    8GB
wise        Statistics  Yes      2 x 2.6 GHz Opteron    8GB
jake        Statistics  Yes      8 x 2.3 GHz Opteron    16GB
elwood      Statistics  Yes      8 x 2.3 GHz Opteron    16GB
suilven     Statistics  Yes      12 x 3 GHz Xeon        48GB
quinag      Statistics  Yes      12 x 3 GHz Xeon        48GB
canisp      Statistics  Yes      12 x 3 GHz Xeon        48GB
kelvin      Fluids      No       4 x 2.6 GHz Opteron    8GB
reynolds    Fluids      No       4 x 2.6 GHz Opteron    8GB
riemann     Applied     No       4 x 2.6 GHz Opteron    16GB
darcy       Fluids      No       4 x 3 GHz Opteron      12GB
rayleigh    Fluids      No       4 x 3 GHz Opteron      12GB
hardy       Applied     No       4 x 2.6 GHz Opteron    16GB
bernoulli   Fluids      No       4 x 3 GHz Opteron      16GB
taylor      Fluids      No       8 x 2.6 GHz Opteron    32GB
stokes      Fluids      No       12 x 2.6 GHz Xeon      48GB
heilbronn   Pure        No       4 x 2.6 GHz Opteron    10GB
1.1 Software available
The software below is available on Scone. Requests for additional software can
be made via [email protected].
• GNU C, C++ and Fortran compilers. These may be invoked by the commands
  gcc, g++ and gfortran. For more details type man gcc, man g++ or man
  gfortran.
• Matlab, currently R2010a
• Maple 12
• The R statistical package
• Python, numpy and scipy
• Gnuplot
• gsview
• Java (sun-jdk-1.6.0.15)
• Mathematica 7.0 (only on node1)
• NAG libraries
• ifort compiler
• g95
• ghc
1.2 Things to note
• Do not run any jobs on the head machine (Scone). You may compile and
  run very small test programs; anything else will be killed without notice.
• Any user may submit a job through condor. However, members of the
  statistics group have a vastly increased priority.
• Only members of the appropriate group may log on to their machines
  directly.
2 Logging in
Access to Scone is via ssh. Authentication uses the UOB username and password
provided by the University. You must first log in to the head machine
scone.maths.bris.ac.uk before accessing other machines. Remember to enable
X-forwarding on your ssh client if you want to use the graphical user
interface (GUI) provided by programs such as Matlab.
2.1 Logging in to Scone from the University Network
2.1.1 Linux Machines
From Linux machines, assuming you are logged in to your machine using your
UOB account, you can simply run:
$ ssh scone
If you are logged in using a local account, try:
$ ssh username@scone
The first time you log in to scone you will see the following message:
The authenticity of host 'scone (137.222.80.37)' can't be established.
RSA key fingerprint is de:8b:77:8d:d7:af:07:d2:8f:6e:64:4c:6f:ec:2b:cd.
Are you sure you want to continue connecting (yes/no)?
Please verify that the fingerprint matches the one above and proceed by
answering 'yes'.
2.1.2 Windows machines
From Windows, open your SSH client...
2.2 Logging in to the Scone nodes
Once you are logged in to Scone you will see the Linux command prompt. You
can now log in to any of the nodes available to your group via ssh. To avoid
entering your password every time, you can create an ssh key pair. To do so,
execute the following command:
user@scone ~ $ ssh-keygen
When prompted for the file location to save the key, just hit enter. You will
also be prompted for a passphrase; you can leave this empty for convenience,
but keep in mind that this means anyone with access to your private key can
log in as you.
Then copy your public key into the authorized_keys file. If you do not
already have such a file in your .ssh directory, the easiest way is to just
copy it with:
user@scone ~ $ cp .ssh/id_rsa.pub .ssh/authorized_keys
After this you should be able to log in to the node machines using the key
pair, entering the passphrase (if you set one) instead of your password.
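For example, assuming your account has access to the general nodes, you
should now be able to jump straight from the head machine onto one of them
(the short host name is assumed to resolve from scone; if it does not, try
the fully qualified name, e.g. node1.private2.maths.bris.ac.uk):
user@scone ~ $ ssh node1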
3 Getting your files on Scone
Every user is assigned a home directory on scone; the default location is
/home/local/UOB/user-name. This space is intended to hold user files relevant
for HPC purposes.
3.1 Via scp/sftp
From Linux you can copy files to scone using the commands scp or sftp. Please
see the manual pages for these commands for details on their usage.
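For illustration, the following commands (with a hypothetical file name) copy
a local file into your Scone home directory and fetch it back again:
$ scp results.dat user-name@scone.maths.bris.ac.uk:
$ scp user-name@scone.maths.bris.ac.uk:results.dat .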
Windows clients can make use of the file transfer utility included with the
SSH client.
3.2 Via samba
Home directories on Scone are shared on the network via samba. Linux clients
in the maths department are configured to automatically mount Scone in the
user's home directory; Scone files should be available in the scone directory
within the user's home. Alternatively, the share can be mounted manually
using a command like (the mount point here is an example):
$ mount -t cifs -o user=user-name //scone.maths.bris.ac.uk/homes /mnt/scone
On Windows machines, the location \\scone.maths.bris.ac.uk\homes should take
the user to his or her home directory.
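On Windows the share can also be mapped to a drive letter from a command
prompt; a sketch (the drive letter is arbitrary):
C:\> net use S: \\scone.maths.bris.ac.uk\homes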
4 NAG Libraries
4.1 ifort
NAG libraries for the ifort compiler can be found in the directory
`/usr/local/nag/l6i22dcl/'. In order to run programs compiled with the NAG
libraries a valid license file is required. This file is specified using the
environment variable `NAG_KUSARI_FILE'. By default this is set to:
`NAG_KUSARI_FILE=/usr/local/nag/l6i22dcl/license/licenseifort.dat'.
Please note that if you want to use a different version of the NAG libraries,
a license will be required and `NAG_KUSARI_FILE' re-declared.
The next example shows how to compile and run the test program `c06bafe.f',
available at `http://www.nag.co.uk/numeric/FL/nagdoc_22/examples/source/c06bafe.f':
$ ifort c06bafe.f /usr/local/nag/l6i22dcl/lib/libnag_nag.a
$ ./a.out
C06BAF Example Program Results
                             Estimated    Actual
 I    SEQN     RESULT        abs error    error
 1    1.0000   1.0000        -             0.18E+00
 2    0.7500   0.7500        -            -0.72E-01
 3    0.8611   0.8269        -             0.45E-02
 4    0.7986   0.8211        0.26E+00     -0.14E-02
 5    0.8386   0.8226        0.78E-01      0.12E-03
 6    0.8108   0.8224        0.60E-02     -0.33E-04
 7    0.8312   0.8225        0.15E-02      0.35E-05
 8    0.8156   0.8225        0.16E-03     -0.85E-06
 9    0.8280   0.8225        0.37E-04      0.10E-06
10    0.8180   0.8225        0.45E-05     -0.23E-07
NOTE: A symbolic link has been created at `/usr/local/lib/libnag-ifort.a'
pointing to `/usr/local/nag/l6i22dcl/lib/libnag_nag.a' for easier usage. The
compilation line above is equivalent to: `ifort c06bafe.f -lnag-ifort'.
Multicore programs with ifort
It is possible to compile a program to make use of multiple cores with the
NAG libraries. To do so it is necessary to link against the MKL version of
the libraries, located in `/usr/local/nag/l6i22dcl/mkl_em64t/'. The variables
below are an example of how the libraries can be used in a Makefile:
COMPILER_F77 = ifort
COMPILE_FLAGS = -O3 -i_dynamic -mcmodel=medium
LIBRARY_FLAGS = /usr/local/nag/l6i22dcl/lib/libnag_mkl.a -fltconsistency \
    -L/usr/local/nag/l6i22dcl/mkl_em64t -lmkl_intel_lp64 \
    -lmkl_intel_thread -lmkl_core -lmkl_lapack -liomp5 -lpthread
To use multithreading it is also necessary to set the environment variable
OMP_NUM_THREADS to the number of threads to be used, for instance:
$ export OMP_NUM_THREADS=8
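As a sketch, a Makefile rule using the variables above might look like the
following (the program name myprog is hypothetical; the recipe line must be
indented with a tab):
myprog: myprog.f
	$(COMPILER_F77) $(COMPILE_FLAGS) -o myprog myprog.f $(LIBRARY_FLAGS)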
4.2 gfortran
NAG libraries for use with gfortran are installed in the directory
`/usr/local/nag/l6a22d/'. The symbolic link `/usr/local/lib/libnag-gfortran.a',
pointing to `/usr/local/nag/l6a22d/lib/libnag_nag.a', has been created for
simplicity.
Note that to use these libraries, the variable `NAG_KUSARI_FILE' has to be
exported to point to the appropriate license file, located at
`/usr/local/nag/l6a22d/license/licensegfortran.dat'. To do so, run the
command below on the machine before executing your program:
$ export NAG_KUSARI_FILE=/usr/local/nag/l6a22d/license/licensegfortran.dat
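By analogy with the ifort example in section 4.1, and using the symbolic link
mentioned above, the test program can then be compiled and run along these
lines (a sketch; output not shown here):
$ gfortran c06bafe.f -lnag-gfortran
$ ./a.out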
5 Using the condor queue
Remember that if you are not in the statistics group, you currently have very
low priority on the condor queue.
5.1 Submitting a job
When submitting jobs via condor it is not necessary to actually log in to any
of the compute servers. There are two stages to submitting a job:
1. Write a file that describes the job to be submitted.
2. Submit the job via condor_submit.
5.1.1 Writing the condor file
Please note there is a restriction on condor jobs. Chico, harpo, groucho and
barker are "short job" machines (with a two-week computation limit), whereas
morecambe, wise and zeppo are "long job" machines without this hard limit.
If you think your job will last for more than two weeks, you must specify in
your condor job submission file that it should run on a long-job machine,
using:
Requirements = \
    Machine == "morecambe.private2.maths.bris.ac.uk" \
    || Machine == "wise.private2.maths.bris.ac.uk" \
    || Machine == "zeppo.private2.maths.bris.ac.uk"
IF YOU DO NOT SPECIFY THIS, AND YOUR JOB LASTS FOR MORE
THAN TWO WEEKS, IT WILL BE KILLED.
Here is an example of a condor file. This runs a command from the current
directory:
####################
##
## Test Condor command file
##
####################
executable = ustone6
Universe = vanilla
error = ustone6.err
output = ustone6.out
log = ustone6.log
Queue
This file tells condor that the executable ustone6 is to be run. Standard
output from this executable goes into the file ustone6.out and standard error
into the file ustone6.err. The file ustone6.log receives any messages from
the condor system (job status, any errors, and so on).
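For a long-running job, the Requirements line from the start of this section
would simply be added to the same file; a sketch combining the two examples
above:
executable = ustone6
Universe = vanilla
error = ustone6.err
output = ustone6.out
log = ustone6.log
Requirements = Machine == "morecambe.private2.maths.bris.ac.uk" \
    || Machine == "wise.private2.maths.bris.ac.uk" \
    || Machine == "zeppo.private2.maths.bris.ac.uk"
Queue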
5.1.2 Submitting and managing the condor jobs
Once your condor file is ready, run the condor_submit command:
user@scone:~> condor_submit ustone6.cmd
Submitting job(s). Logging submit event(s).
1 job(s) submitted to cluster 28.
user@scone:~>
The condor_status command lists the status of the condor cluster, as below.
user@scone:~> condor_status
Name                                     OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
[email protected]    LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:17
[email protected]    LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:14
[email protected]    LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:11
[email protected]    LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:08
[email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:17
[email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:14
[email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:28:04
[email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:28:01
The fields have the following meanings.
• [Name] - The name of the processor/machine combination. For example,
  [email protected] is the first processor on the machine
  groucho.
• [OpSys] - The operating system.
• [Arch] - The CPU architecture. Currently all nodes are x86_64 (AMD Opteron
  and Intel Xeon). In the future this may change as more machines are added;
  in a multi-architecture array, jobs may be submitted requesting a specific
  architecture.
• [State] - The current state of the machine from the viewpoint of condor
  scheduling. This may be one of the following:
    Owner - The machine is being used by its owner (for example, a member of
    the appropriate research group) and/or is not available to run Condor
    jobs. When the machine first starts up, it begins in this state.
    Matched - The machine is available to run jobs and has been matched to a
    specific job, but Condor has not yet claimed it. In this state the
    machine is unavailable for further matches.
    Claimed - The machine has been claimed by condor. No further jobs will
    be allocated by condor to this machine until the current job has ended.
    Preempting - The machine was claimed, but is now preempting that claim.
    This is most likely because someone has logged on to the machine and is
    running jobs directly.
• [Activity] - What the machine is actually doing. The details depend upon
  the condor State, but in general they can be summarised as below.
    Idle - The machine is not doing anything that was initiated by condor.
    Busy - The machine is running a job that was initiated by condor.
    Suspended - The current job has been suspended. This is most likely
    because of a user logging on to the machine and running jobs directly.
    Killing - The job is being killed.
• [LoadAv] - The load average on the machine.
• [Mem] - The memory per CPU on the machine.
• [ActvtyTime] - Node activity time.
So in the above example we can see that the whole cluster is quiet except for
two processors on the machine groucho, which are running jobs. The state of
the condor queue can also be examined with the command condor_q (local to the
current machine) or condor_q -global (across all machines), as below.
user@scone:~/benchmarks> condor_q
- Submitter: scone.private2.maths.bris.ac.uk : <172.16.80.65:33772>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
31.0 marfc 1/13 10:04 0+00:00:20 R 0 0.0 ustone6
32.0 marfc 1/13 10:04 0+00:00:01 R 0 0.0 ustone6
2 jobs; 0 idle, 2 running, 0 held
This shows that there are two jobs on scone, with job numbers 31 and 32,
owned by user marfc, running at the moment.
To delete a job, use the condor_rm command on the machine from which
you submitted the job. The full sequence for submitting, listing and removing
a job is shown below.
user@scone:~/benchmarks> condor_submit ustone6.cmd
Submitting job(s). Logging submit event(s).
1 job(s) submitted to cluster 33.
user@scone:~/benchmarks> condor_q
- Submitter: scone.private2.maths.bris.ac.uk : <172.16.80.65:33772>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
33.0 marfc 1/13 10:07 0+00:00:03 R 0 0.0 ustone6
1 jobs; 0 idle, 1 running, 0 held
user@scone:~/benchmarks> condor_rm 33.0
Job 33.0 marked for removal