Scone user manual
March 2012

Contents

1 What is Scone
  1.1 Software available
  1.2 Things to note
2 Logging in
  2.1 Logging in to Scone from the University Network
    2.1.1 Linux Machines
    2.1.2 Windows machines
  2.2 Logging in to the Scone nodes
3 Getting your files on scone
  3.1 Via scp/sftp
  3.2 Via samba
4 NAG Libraries
  4.1 ifort
  4.2 gfortran
5 Using the condor queue
  5.1 Submitting a job
    5.1.1 Writing the condor file
    5.1.2 Submitting and managing the condor jobs

1 What is Scone

Scone is a cluster of Linux servers designed to fulfil the High Performance Computing needs of the department of Mathematics. It consists of 30 64-bit machines, all of which run Linux 2.6.28. There are eight machines for general use on the scone system. There are twelve machines for use by the statistics group, of which seven run under a condor job submission scheme. There are also nine machines for the applied group, which do not use job submission software and of which seven are for the use of the fluids group only. Finally there is one machine for the use of the pure group, which again does not use job submission software. A summary of the situation is shown below.
Machine     Group       Condor  CPU                   RAM
node1       General     No      8 x 2.6 GHz Opteron   32GB
node2       General     No      8 x 2.6 GHz Opteron   32GB
node3       General     No      8 x 2.6 GHz Opteron   32GB
node4       General     No      8 x 2.6 GHz Opteron   32GB
node5       General     No      8 x 2.6 GHz Opteron   64GB
node6       General     No      12 x 2.6 GHz Xeon     48GB
node7       General     No      12 x 2.6 GHz Xeon     48GB
node8       General     No      12 x 2.6 GHz Xeon     48GB
zeppo       Statistics  Yes     4 x 2.2 GHz Opteron   8GB
chico       Statistics  Yes     4 x 2.2 GHz Opteron   8GB
harpo       Statistics  Yes     4 x 2.2 GHz Opteron   8GB
groucho     Statistics  Yes     4 x 2.2 GHz Opteron   8GB
barker      Statistics  Yes     4 x 2.2 GHz Opteron   8GB
morecambe   Statistics  Yes     2 x 2.6 GHz Opteron   8GB
wise        Statistics  Yes     2 x 2.6 GHz Opteron   8GB
jake        Statistics  Yes     8 x 2.3 GHz Opteron   16GB
elwood      Statistics  Yes     8 x 2.3 GHz Opteron   16GB
suilven     Statistics  Yes     12 x 3 GHz Xeon       48GB
quinag      Statistics  Yes     12 x 3 GHz Xeon       48GB
canisp      Statistics  Yes     12 x 3 GHz Xeon       48GB
kelvin      Fluids      No      4 x 2.6 GHz Opteron   8GB
reynolds    Fluids      No      4 x 2.6 GHz Opteron   8GB
riemann     Applied     No      4 x 2.6 GHz Opteron   16GB
darcy       Fluids      No      4 x 3 GHz Opteron     12GB
rayleigh    Fluids      No      4 x 3 GHz Opteron     12GB
hardy       Applied     No      4 x 2.6 GHz Opteron   16GB
bernoulli   Fluids      No      4 x 3 GHz Opteron     16GB
taylor      Fluids      No      8 x 2.6 GHz Opteron   32GB
stokes      Fluids      No      12 x 2.6 GHz Xeon     48GB
heilbronn   Pure        No      4 x 2.6 GHz Opteron   10GB

1.1 Software available

The software below is available on Scone. Requests can be made via [email protected].

• Gnu C, C++ and gfortran compilers. These may be invoked by the commands gcc, g++ and gfortran. For more details type man gcc, man g++ or man gfortran.
• Matlab, currently R2010a.
• Maple 12
• The R statistical package
• Python, numpy and scipy
• Gnuplot
• gsview
• Java - sun-jdk-1.6.0.15
• Mathematica 7.0 (only on node1)
• Nag libraries
• ifort compiler
• g95
• ghc

1.2 Things to note

• Do not run any jobs on the head machine (Scone). You may compile and run very small test programs. Anything else will be killed without notice.
• Any user may submit a job through condor. However, members of the statistics group have a vastly increased priority.
• Only members of the appropriate group may log on to their machines directly.

2 Logging in

Access to Scone is via ssh. Authentication is based on the UOB username and password provided by the University. You must first log in to the head machine scone.maths.bris.ac.uk before accessing other machines. Remember to enable X-forwarding on your ssh client if you want to use the graphical user interface (GUI) provided by programs such as matlab.

2.1 Logging in to Scone from the University Network

2.1.1 Linux Machines

From Linux machines, assuming you logged in to your machine using your UOB username, you can simply do the following:

  $ ssh scone

If you are logged in using a local account try:

  $ ssh username@scone

The first time you log in to scone you will get the following message:

  The authenticity of host 'scone (137.222.80.37)' can't be established.
  RSA key fingerprint is de:8b:77:8d:d7:af:07:d2:8f:6e:64:4c:6f:ec:2b:cd.
  Are you sure you want to continue connecting (yes/no)?

Please verify the fingerprint matches the one above and proceed by answering 'yes'.

2.1.2 Windows machines

From Windows open your ssh client...

2.2 Logging in to the Scone nodes

Once you are logged in to Scone you will see the Linux command prompt. You can now log in to any of the nodes available to your group via ssh. To avoid entering your password every time, you can create an ssh keypair. To do so, execute:

  user@scone ~ $ ssh-keygen

When prompted for the file location to save the key, just hit enter. You will also be prompted for a passphrase. You can leave this empty for convenience, but keep in mind that this means anyone with access to your private key can log in as you.
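For convenience, connection options such as X-forwarding can also be kept in a client-side ~/.ssh/config file rather than typed on every command line. A minimal sketch (the Host alias and the placeholder user name are illustrative, not part of the Scone setup):

```
Host scone
    HostName scone.maths.bris.ac.uk
    User your-uob-username
    ForwardX11 yes
```

With this in place, `ssh scone` from your own machine connects to the head node with X-forwarding enabled.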
Then copy your public key into the authorized_keys file. If you don't have such a file in your .ssh directory, the easiest way is to just copy it with:

  mahfrv@scone ~ $ cp .ssh/id_rsa.pub .ssh/authorized_keys

After this you should be able to log in to the node machines using the passphrase provided.

3 Getting your files on scone

Every user is assigned a home directory on scone; the default location is /home/local/UOB/user-name. This space is intended to hold user files relevant for HPC purposes.

3.1 Via scp/sftp

From Linux you can copy files to scone using the commands scp or sftp. Please see the manual pages for these commands for details on their usage. Windows clients can make use of the file transfer utility included with the ssh client.

3.2 Via samba

Home directories on Scone are shared on the network via samba. Linux clients in the maths department are configured to automatically mount Scone in the user's home directory; Scone files should be available in the scone directory within the user's home. Alternatively the share can be mounted using a command like:

  $ mount -t cifs -o user=user-name //scone.maths.bris.ac.uk/homes

On Windows machines the location:

  \\scone.maths.bris.ac.uk\homes

should direct the user to his home directory.

4 NAG Libraries

4.1 ifort

NAG libraries for the ifort compiler can be found in the directory /usr/local/nag/l6i22dcl/. In order to run programs compiled with the NAG libraries a valid license file is required. This file is specified using the variable NAG_KUSARI_FILE. By default this is set to:

  NAG_KUSARI_FILE=/usr/local/nag/l6i22dcl/license/licenseifort.dat

Please note that if you want to use a different version of the NAG libraries, a license will be required and NAG_KUSARI_FILE re-declared.
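Re-declaring the variable is simply a matter of exporting it before running your program; for instance (the alternate license path below is purely illustrative):

```shell
# Point NAG at a different license file, then run the program
# (the path here is a placeholder, not a real Scone location).
export NAG_KUSARI_FILE=/path/to/other/license.dat
./a.out
```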
The next example shows how to compile and run the test program c06bafe.f, available at http://www.nag.co.uk/numeric/FL/nagdoc_22/examples/source/c06bafe.f:

  $ ifort c06bafe.f /usr/local/nag/l6i22dcl/lib/libnag_nag.a
  $ ./a.out
   C06BAF Example Program Results

                                 Estimated    Actual
    I    SEQN      RESULT        abs error    error
    1   1.0000     1.0000            -        0.18E+00
    2   0.7500     0.7500            -       -0.72E-01
    3   0.8611     0.8269            -        0.45E-02
    4   0.7986     0.8211        0.26E+00    -0.14E-02
    5   0.8386     0.8226        0.78E-01     0.12E-03
    6   0.8108     0.8224        0.60E-02    -0.33E-04
    7   0.8312     0.8225        0.15E-02     0.35E-05
    8   0.8156     0.8225        0.16E-03    -0.85E-06
    9   0.8280     0.8225        0.37E-04     0.10E-06
   10   0.8180     0.8225        0.45E-05    -0.23E-07

NOTE: A symbolic link has been created in /usr/local/lib/libnag-ifort.a pointing to /usr/local/nag/l6i22dcl/lib/libnag_nag.a for easier usage. The compilation line above is equivalent to: ifort c06bafe.f -lnag-ifort.

Multicore programs with ifort

It is possible to compile a program to make use of multiple cores with the NAG libraries. To do so it is necessary to link against the MKL version of the libraries, located in /usr/local/nag/l6i22dcl/mkl_em64t/. The variables below are an example of how the libraries can be used in a Makefile:

  COMPILER_F77 = ifort
  COMPILE_FLAGS = -O3 -i_dynamic -mcmodel=medium
  LIBRARY_FLAGS = /usr/local/nag/l6i22dcl/lib/libnag_mkl.a -tconsistency \
      -L/usr/local/nag/l6i22dcl/mkl_em64t -lmkl_intel_lp64 \
      -lmkl_intel_thread -lmkl_core -lmkl_lapack -liomp5 -lpthread

To use multithreading it is also necessary to set the environment variable OMP_NUM_THREADS to the number of threads to be used, for instance:

  export OMP_NUM_THREADS=8

4.2 gfortran

NAG libraries for use with gfortran are installed in the directory /usr/local/nag/l6a22d/. The symbolic link /usr/local/lib/libnag-gfortran.a, pointing to /usr/local/nag/l6a22d/lib/libnag_nag.a, has been created for simplification.
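Using that symbolic link, compiling the same NAG example program with gfortran can be sketched as follows (this assumes c06bafe.f has already been downloaded to the current directory):

```shell
# Compile the NAG example against the gfortran build of the library
# via the /usr/local/lib/libnag-gfortran.a symlink, then run it.
gfortran c06bafe.f -lnag-gfortran
./a.out
```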
Note that to use these libraries, the variable NAG_KUSARI_FILE has to be exported to point to the appropriate license file, located in /usr/local/nag/l6a22d/license/licensegfortran.dat. To do so, run the command below on the machine before executing your program:

  $ export NAG_KUSARI_FILE=/usr/local/nag/l6a22d/license/licensegfortran.dat

5 Using the condor queue

Remember that if you are not in the statistics group you currently have very low priority on the condor queue.

5.1 Submitting a job

When submitting jobs via condor it is not necessary to actually log in to any of the compute servers. There are two stages to submitting a job:

1. Write a file that describes the job to be submitted.
2. Submit the job via condor_submit.

5.1.1 Writing the condor file

Please note there is a restriction on condor jobs. Chico, harpo, groucho and barker are "short job" machines (with a two week computation limit), whereas morecambe, wise and zeppo are "long job" machines, without this hard limit. If you think your job will last for more than two weeks, you must specify that it run on a long job machine in your condor job submission file using:

  Requirements = \
      Machine == "morecambe.private2.maths.bris.ac.uk" \
      || Machine == "wise.private2.maths.bris.ac.uk" \
      || Machine == "zeppo.private2.maths.bris.ac.uk"

IF YOU DO NOT SPECIFY THIS, AND YOUR JOB LASTS FOR MORE THAN TWO WEEKS, IT WILL BE KILLED.

Here is an example of a condor file. This runs a command from the current directory:

  ####################
  ##
  ## Test Condor command file
  ##
  ####################

  executable = ustone6
  Universe = vanilla
  error = ustone6.err
  output = ustone6.out
  log = ustone6.log
  Queue

This file tells condor that the executable ustone6 is to be run. Standard output from this executable goes into the file ustone6.out and standard error into the file ustone6.err. The file ustone6.log contains any messages from the condor system (job status, any errors and so on).
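Putting the two pieces together, a submission file for a job expected to run longer than two weeks might look like the sketch below, which simply combines the example file with the long-job Requirements expression (ustone6 stands in for your own executable):

```
executable = ustone6
Universe = vanilla
error = ustone6.err
output = ustone6.out
log = ustone6.log
Requirements = \
    Machine == "morecambe.private2.maths.bris.ac.uk" \
    || Machine == "wise.private2.maths.bris.ac.uk" \
    || Machine == "zeppo.private2.maths.bris.ac.uk"
Queue
```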
5.1.2 Submitting and managing the condor jobs

Once your condor file is ready, run the condor_submit command.

  user@scone:~> condor_submit ustone6.cmd
  Submitting job(s).
  Logging submit event(s).
  1 job(s) submitted to cluster 28.
  user@scone:~>

The condor_status command lists the status of the condor cluster, as below.

  marfc@scone:~> condor_status
  Name                 OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:17
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:14
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:11
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:08
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:17
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:45:14
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:28:04
  [email protected]  LINUX  x86_64  Unclaimed  Idle      0.000   2031  0+00:28:01

The fields have the following meanings.

• Name - Lists the name of the processor/machine combination; [email protected] is the first processor on the machine groucho.
• OpSys - The operating system.
• Arch - The CPU architecture. Currently all nodes run AMD Opterons (x86_64). In the future this may change as more machines are added. In a multi-architecture array, jobs may be submitted requesting a specific architecture.
• State - Lists the current state of the machine from the viewpoint of condor scheduling. This may be one of the following:
  Owner - The machine is being used by its owner (for example a member of the appropriate research group), and/or is not available to run Condor jobs. When the machine first starts up, it begins in this state.
  Matched - The machine is available to run jobs, and it has been matched to a specific job. Condor has not yet claimed this machine. In this state, the machine is unavailable for further matches.
  Claimed - The machine has been claimed by condor.
No further jobs will be allocated by condor to this machine until the current job has ended.
  Preempting - The machine was claimed, but condor is now preempting that claim. This is most likely because someone has logged on to the machine and is running jobs directly.
• Activity - Lists what the machine is actually doing. The details depend upon the condor State, but in general they can be summarised as below.
  Idle - The machine is not doing anything that was initiated by condor.
  Busy - The machine is running a job that was initiated by condor.
  Suspended - The current job has been suspended. This is most likely because of a user logging on to the machine and running jobs directly.
  Killing - The job is being killed.
• LoadAv - Lists the load average on the machine.
• Mem - Lists the memory per CPU on the machine.
• ActvtyTime - Node activity time.

So in the above example we can see that the whole cluster is quiet except for two processors on the machine groucho which are running jobs. The state of the condor queue can also be examined by the command condor_q (local to the current machine) or condor_q -global (across all machines), as below.

  user@scone:~/benchmarks> condor_q
  -- Submitter: scone.private2.maths.bris.ac.uk : <172.16.80.65:33772>
   ID      OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD
   31.0    marfc   1/13 10:04   0+00:00:20  R  0   0.0  ustone6
   32.0    marfc   1/13 10:04   0+00:00:01  R  0   0.0  ustone6

  2 jobs; 0 idle, 2 running, 0 held

This shows that there are two jobs on scone, numbered 31 and 32, owned by user marfc and running at the moment. To delete a job, use the condor_rm command on the machine from which you submitted the job. The full sequence for submitting, listing and removing a job is shown below.

  user@scone:~/benchmarks> condor_submit ustone6.cmd
  Submitting job(s).
  Logging submit event(s).
  1 job(s) submitted to cluster 33.
  user@scone:~/benchmarks> condor_q
  -- Submitter: scone.private2.maths.bris.ac.uk : <172.16.80.65:33772>
   ID      OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD
   33.0    marfc   1/13 10:07   0+00:00:03  R  0   0.0  ustone6

  1 jobs; 0 idle, 1 running, 0 held

  user@scone:~/benchmarks> condor_rm 33.0
  Job 33.0 marked for removal
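The submit-and-monitor cycle above can also be scripted. The sketch below uses condor_wait, which blocks until the job recorded in a log file has finished; the file names are those from the example and should be adapted to your own job:

```shell
#!/bin/sh
# Submit the job described in ustone6.cmd ...
condor_submit ustone6.cmd
# ... then block until its termination event appears in the log file.
condor_wait ustone6.log
# Finally, confirm the queue no longer lists the job.
condor_q
```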