Peter:

As per your request, here are some 1st-cut timings of a 1-day run 
of tiegcm (480 3-minute steps) on ouray, ute, and blackforest. 
Obviously, ute wins the race, using OpenMP on DSM architecture.

MPI overhead is large for only 1-d decomposition: it's a master/slave
thing right now, distributed over geographic latitude. Each task 
calculates nlat/ntask latitudes, and all tasks must exchange 4 
boundary latitudes at every time step for 4th order differencing. 
In the shared memory machines, it's all either in global memory,
or is made available through direct memory transfers.
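
A rough sketch of the 1-d latitude decomposition described above, in
Python for illustration only (it is not the tgcm14 source). The function
name latitude_blocks is hypothetical, and the sketch assumes the 4 boundary
latitudes are 2 on each side of a task's block, and that ntask divides
nlat evenly, as it does for the nlat=36 runs with 6, 9, and 12 tasks.

```python
def latitude_blocks(nlat, ntask, halo=2):
    """Split latitudes 0..nlat-1 into ntask contiguous blocks.

    Returns one (first, last, south, north) tuple per task. first/last are
    the inclusive latitude indices the task owns; south/north list the
    neighbor-owned latitudes it would need at every time step.
    halo=2 per side is an assumption (4 boundary latitudes in total).
    """
    if nlat % ntask != 0:
        raise ValueError("this sketch assumes ntask divides nlat evenly")
    per_task = nlat // ntask
    blocks = []
    for task in range(ntask):
        first = task * per_task
        last = first + per_task - 1
        # Tasks at the ends of the latitude range have no neighbor on one side.
        south = [j for j in range(first - halo, first) if j >= 0]
        north = [j for j in range(last + 1, last + 1 + halo) if j < nlat]
        blocks.append((first, last, south, north))
    return blocks


if __name__ == "__main__":
    for ntask in (6, 9, 12):
        first, last, south, north = latitude_blocks(36, ntask)[1]
        print(ntask, "tasks:", last - first + 1, "latitudes per task;",
              "task 1 owns", first, "to", last, "and needs", south + north)
```

Under these assumptions, an interior task at ntask=12 owns only 3 latitudes
but needs 4 from its neighbors at every step, which fits the large MPI
overhead noted above.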

Also, the dynamo is still serial, so master must collect geographic 
latitudes from slaves before running dynamo (which is short in
cpu time), then distribute output afterwards at every step.

FYI, the source is in /home/tgcm/tgcm14, with main driver routine
being advnce.F.

All tasks read redundantly from same history file and other data
files at start-up, but only the master outputs histories -- however, 
only 1 history was written at the end in these tests, so i/o was 
not a big deal.

I'm not sure about dedicated vs non-dedicated processors on the
IBM, and I'm not sure I can distinguish between code efficiency 
and machine load effects. There were about 4-6 people using ute 
at the time of these runs, vs 20+ jobs on ouray.  Also, I'm not 
totally sure that timex and timef are reporting strictly the same
thing, but from a strictly practical standpoint, the clock on the 
wall does not lie.

It compiles on ES40, but am working out some overflows, so don't
have numbers for it yet. Maybe hybrid MPI+OpenMP will work on 
the Compaq machine.  In short, I'm still in love w/ SGI Origin.
Dataproc is very fast for serial post-proc as well. I will fill in 
6-proc runs for j90 and o2k, and make some 1-proc runs also. Am
going to Hammond's meeting in 40 mins.

--Ben


tgcm14: 1 day simulations at 3-min time-step (480 time iterations).
        (1-d decomposition over latitude with nlat=36)
  

Host	Model	OpSys	Parall	nproc/	Elapsed	User cpu
			method	ntask	(hrs)	(hrs)
================================================================
ouray	j90se	Unicos	!MIC$	6
				9	0.63	1.55
				12	0.56	1.6
----------------------------------------------------------------
ute	o2k	IRIX	OpenMP	6
				9	0.30	2.56
				12	0.24	2.7
----------------------------------------------------------------
black	SP	AIX	MPI	6	2.5
forest				9	1.7
				12	1.46
(tasks placed 2 PEs per node, i.e., 6 tasks on 3 nodes,
 9 tasks on 5 nodes (1 idle proc), and 12 tasks on 6 nodes)
----------------------------------------------------------------

Elapsed times are from timef on SP, timex on j90se and o2k.

Queues were reg on j90, ded_16 on o2k, and com_pr on the SP.
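
One way to read the table: the sketch below copies the elapsed and user cpu
hours from it and computes the measured speedup from 9 to 12 procs against
the ideal 12/9, plus user cpu per hour of wall clock where user cpu was
reported. The dictionary layout and function names are made up for this
illustration, and the ratios inherit the caveats above about machine load
and timex vs timef.

```python
# (elapsed hrs, user cpu hrs) copied from the table; None = not reported.
timings = {
    "ouray": {9: (0.63, 1.55), 12: (0.56, 1.6)},
    "ute": {9: (0.30, 2.56), 12: (0.24, 2.7)},
    "blackforest": {6: (2.5, None), 9: (1.7, None), 12: (1.46, None)},
}


def scaling(host, n_from, n_to):
    """Return (measured, ideal) speedup going from n_from to n_to procs."""
    t_from = timings[host][n_from][0]
    t_to = timings[host][n_to][0]
    return t_from / t_to, n_to / n_from


def cpu_per_wall(host, nproc):
    """User cpu hours per elapsed hour, or None if user cpu is missing."""
    elapsed, user = timings[host][nproc]
    return None if user is None else user / elapsed


if __name__ == "__main__":
    for host in timings:
        measured, ideal = scaling(host, 9, 12)
        print(f"{host}: 9->12 speedup {measured:.2f} (ideal {ideal:.2f})")
    for host in ("ouray", "ute"):
        for nproc in (9, 12):
            ratio = cpu_per_wall(host, nproc)
            print(f"{host} {nproc} procs: {ratio:.2f} cpu hrs per wall hr")
```

On these numbers ute accumulates about 11 cpu hours per wall hour at 12
procs, close to the processor count, while ouray stays under 3, which may
reflect the heavier load on ouray at the time rather than the code itself.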