Ray: Below are my thoughts on running tgcm23 on the IBM. If you get a chance, please read this over before we meet on Monday. I am making a 5-day test run on blackforest from your account now. Tomorrow we can look at the output and go over some details about blackforest. Hopefully you can make some runs while I'm at AGU -- I will have my laptop in SF, so I will be available via email. --Ben

The transfer of output listings from blackforest to hao following batch jobs is unreliable, so I will describe a procedure you can use for submitting tgcm23 jobs to blackforest, monitoring the job, and getting the output back.

You can submit IBM jobs from ouray using the "submit" command, just like Cray jobs, except you use a different run script. You can monitor the job with some aliases I have put in your .cshrc file that do rsh commands to blackforest, but the best way is to log in to blackforest during and after a run and monitor the job from there.

I have made a new directory on your account on ouray: ~roble/timegcm/tgcm23. Submit tgcm23 jobs to the IBM from there, e.g., "submit tgcm23_aix.job". (As I mentioned earlier, I put an alias in your .cshrc file so your submit command will execute /home/tgcm/bld/submit, which has the blackforest option.)

Once you have submitted a job to blackforest, job scripts and output files are stored in the ~/submit directory on blackforest (your home is /home/blackforest/roble):

  /home/blackforest/roble/submit

This directory contains the run scripts and output files from a run. When you submit from ouray, this directory is cleared out, and scripts for the new job are copied into and submitted from here. As the job runs, output files appear for the compilation, the execution, and finally the attempted rcp back to hao (which isn't working).

For example, during a run you can telnet to blackforest from ouray (the password is the same as on the Crays), then cd to submit and do an lf, e.g.:

  bf0915en% pwd
  /home/blackforest/roble/submit
  bf0915en% lf
  total 3992
  -rwxr--r--  1 roble  ncar  1645262 Dec 10 13:36 build_step.csh*
  -rwxr--r--  1 roble  ncar     2543 Dec 10 13:36 exec_step.csh*
  -rw-r--r--  1 roble  ncar     1522 Dec 10 13:36 loadlev.job
  -rwxr--r--  1 roble  ncar     2210 Dec 10 13:36 rcp_step.csh*
  -rw-r--r--  1 roble  ncar    81255 Dec 10 13:41 tgcm23_build.out
  -rw-r--r--  1 roble  ncar   300319 Dec 10 14:05 tgcm23_exec.out

  tgcm23_build.out: output listing from the compilation
  tgcm23_exec.out:  output listing from the execution

The tgcm23_exec.out file grows during the execution, and you can look at the end of the file to see how far along the job is, e.g.:

  bf0915en% tail tgcm23_exec.out
   6:Step 140 of 1800 mtime 80, 9,20, 0
   5:Step 140 of 1800 mtime 80, 9,20, 0
   2:Step 140 of 1800 mtime 80, 9,20, 0
  11:Step 140 of 1800 mtime 80, 9,20, 0
   4:Step 140 of 1800 mtime 80, 9,20, 0
   8:Step 140 of 1800 mtime 80, 9,20, 0
   9:Step 140 of 1800 mtime 80, 9,20, 0
   1:Step 140 of 1800 mtime 80, 9,20, 0
  10:Step 140 of 1800 mtime 80, 9,20, 0
   7:Step 140 of 1800 mtime 80, 9,20, 0
  bf0915en%

The number at the beginning of each line identifies which of the 12 processors (or tasks) wrote it. When the job completes successfully, the tgcm23_exec.out file is sorted and split into separate files for each task, e.g., tgcm23_task0.out, tgcm23_task1.out, etc. The one you probably want to save is the task0 (master) output.
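If a run dies before that sort-and-split step happens, you can pull a single task's lines out of tgcm23_exec.out yourself using the task-number prefixes shown above. This is just a sketch of one way to do it by hand with grep (the job script's own split step may work differently):

  bf0915en% grep '^ *0:' tgcm23_exec.out > tgcm23_task0.out

This collects the lines prefixed "0:" (the master task) into their own file.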
For example, the following command, executed from blackforest in the ~/submit directory, copies the master output to your ntwk directory on ouray:

  blackforest> rcp tgcm23_task0.out ouray.hao:ntwk/tgcm23.out

You can do all of these things with rsh and rcp commands to blackforest from ouray, but it's really easier to telnet to blackforest and do it from there.

During a run, you can use aliases from ouray to monitor a job. Here are the aliases I have put in your .cshrc file on ouray:

  #
  # Blackforest:
  alias bfstat   'date ; rsh blackforest.ucar.edu ps -f -u $user'
  alias bstat    'date ; rsh blackforest.ucar.edu llq'
  alias bfqstat  'date ; rsh blackforest.ucar.edu llq -u $user'
  alias bfqstatl 'date ; rsh blackforest.ucar.edu llq -l -x -u $user'

For example, executing "bfqstat" from ouray:

  (ouray) tgcm23 : bfqstat
  Sun Dec 10 14:17:39 MST 2000
  Id                       Owner      Submitted   ST PRI Class        Running On
  ------------------------ ---------- ----------- -- --- ------------ -----------
  bf0915en.152009.1        roble      12/10 13:36 R  50  com_pr       bf0811en
  bf0915en.152009.2        roble      12/10 13:36 NQ 50  interactive
  bf0915en.152009.0        roble      12/10 13:36 C  50  share
  2 job steps in queue, 0 waiting, 0 pending, 1 running, 1 held
  (ouray) tgcm23 :

This shows the 3 job steps: the compilation is in the "share" queue, the execution is in "com_pr", and the rcp that doesn't work is in "interactive". Among the status flags, "R" means running, "NQ" means not queued, and "C" means completed.

The blackforest batch system is called "loadleveler". To get a list of all current jobs on the machine, you can use the "bstat" alias from ouray, or the "llq" command on blackforest. The loadleveler will send you email when a job step completes, or if there was an error.

There is a 6-hour wallclock limit on blackforest. I have found that a 5-day job at step=240, saving daily primary histories and hourly secondary histories in the last day, takes about 5 wallclock hours.

There are two community queues: com_pr and com_reg. You can select either one in the tgcm23_aix.job run script, e.g.:

  # @ class = com_reg

Other important directories on blackforest are listed below (/ptmp is the parallel file system on the IBM, and is subject to the scrubber):

  /home/blackforest/roble/tgcm23  Contains source and object files from the most recent compilation.
  /ptmp/roble                     Directory from which the model is executed.
  /ptmp/roble/tgcm23              Directory for storing history and data files.

Since /ptmp is scrubbed, copy any history files you want to keep off blackforest when a run finishes (see the rcp sketch at the end of this note).

At this point, run only one job at a time on blackforest, especially of the same model version. You could probably make tgcm14 and tgcm23 runs at the same time, but we can arrange for simultaneous jobs later.
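Here is the rcp sketch for saving histories off /ptmp. It uses the same rcp form as the tgcm23_task0.out example above; the history file name is hypothetical, so substitute whatever names your run actually wrote to /ptmp/roble/tgcm23:

  blackforest> cd /ptmp/roble/tgcm23
  # "tgcm23.p001" below is a made-up history file name for illustration:
  blackforest> rcp tgcm23.p001 ouray.hao:ntwk/tgcm23.p001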