Ray:							Wed 7/21

I have written a timesgcm script that runs TGCM15 with options to
run various timing and analysis tools. The script is ~roble/times/
timesgcm.ref. I set it up with lexical reads from timesgcm.ccm.sol
and made copies of the two mod decks (timesgcm.ref.mods is the same as
timesgcm.ccm.sol.mods and timesgcm.ref.mods1 is the same as
timesgcm.ccm.sol.mods1). I have run it for a single 6 min time step
(DISPOSE=0), and got the results for the 5 analysis tools. The
results are in /ouray/d/times/HS*.

I have put comments in the timesgcm.ref script showing how to use
the various analysis tools. There are 2 that do Fortran analysis
on the source code (ftref and flint), and 3 that do timing and
performance analysis during execution (perftrace, flowtrace, and
a profiler). I saved hard copies of the script, reports from the
single time step tgcm15 runs, and man pages from the tools.

We can go over these when you get back next week, but the most
obvious thing is that the subroutine QRJ used 25.3% of the total
cpu time!  At least part of the reason is the use of EXP, the
floating point exponential system routine. The profiler reported
that EXP used 19.5% of execution, when sampling memory "buckets".
The consultants suggested using a 1/2 precision EXP routine, but
I haven't tried it yet.
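
As a rough sketch of what a faster EXP could buy overall, Amdahl's law
can be applied to the 19.5% figure above. The 2x speedup for a reduced
precision EXP is purely an assumed number for illustration (nothing has
been measured), and the profiler's sampled share is treated here as a
share of total run time, which is also an assumption.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: whole-run speedup when `fraction` of the run time
    is made `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

exp_fraction = 0.195    # EXP share of execution reported by the profiler
assumed_factor = 2.0    # hypothetical EXP speedup, not a measured value

speedup_2x = overall_speedup(exp_fraction, assumed_factor)
# Limit if EXP cost nothing at all (factor -> infinity).
upper_bound = 1.0 / (1.0 - exp_fraction)

print(f"overall speedup with EXP {assumed_factor:.0f}x faster: {speedup_2x:.3f}")
print(f"upper bound if EXP were free: {upper_bound:.3f}")
```

Under those assumptions the whole run would speed up by about 1.11,
and by no more than about 1.24 even if EXP took no time at all.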

The performance monitor said that the program as a whole 
"appears to be an efficient vectorized code", and that overall
it achieved 161.4 MFLOPS during execution, which is pretty good,
I guess. Another suggestion made by the performance monitor
(and flowtrace) was to "in-line" the QVxxxx routines, i.e.,
put the code from the QVxxxx routines into the calling modules.
This would be more efficient because these routines are called
many times but have very short cpu times (QVSUB0 was called
56664 times and its average execution time was 1.92e-6 seconds).
I think Ciceley has already done this with DTUV, but that 
version is not in TGCM15 yet (maybe that's the one with the
strange bug).
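
For scale, the QVSUB0 figures above can be multiplied out to get the
total time spent in that one routine during the single time step. This
is only the arithmetic on the reported numbers. How much of that time
is call overhead that in-lining would actually remove is not known
from the report.

```python
calls = 56664            # QVSUB0 call count from the report
avg_seconds = 1.92e-6    # average execution time per call, in seconds

# Total time attributed to QVSUB0 over the single 6 min time step.
total_seconds = calls * avg_seconds

print(f"QVSUB0 total: {total_seconds:.4f} s over {calls} calls")
```

That comes to roughly 0.11 seconds for QVSUB0 alone in one time step.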

See you on Monday...

--Ben