John:

I enjoyed your cache optimization talk at NCAR last week. We also talked briefly about HAO's tgcm model codes when you were here last fall. I am attaching a few files from our current source code, which runs OpenMP on the SGI Origin and Cray J90, and has a "first-cut" MPI implementation for blackforest. I have also put a tar file of the entire source in /pub/foster at ftp://ftp.hao.ucar.edu:122 (file name is for_levesque.tar).

The MPI version on blackforest runs ~5-6x slower wall-clock than the OpenMP version on the Origin (12 PEs on the Origin, 12 PEs over 6 nodes on the SP). I am soliciting your comments on how to optimize the MPI, or the code in general, for the IBM SP.

This is a 4th-order finite-difference, first-principles GCM for the upper atmosphere (97 km lower boundary, to the top of the thermosphere). The current parallelization is a simple 1-d decomposition over latitude (currently nlat=36). Currently it runs either OpenMP *or* MPI, not both (it *should* be fairly easy to try a hybrid setup when the time comes). OpenMP and MPI are turned on/off with compiler switches, depending on the host. We may need to go to a 2-d decomposition to scale efficiently in MPI, and that will require a major rewrite.

Note that the dynamo (the dyn call after the main latitude loop) is still serial. This is because it loops over geomagnetic latitude, whereas the main loop is geographic. This means the master task must collect updated geographic data from the slaves, then run the dynamo (which uses relatively little cpu time), then distribute the dynamo output back to the slaves for the next time iteration. I suspect this "all to one" and "one to all" communication is dominating the MPI execution time.

I have recently rewritten several "top layers" of the model, e.g., initialization, user input, history i/o (netCDF), an mpi module, etc. This represents about 6000 of a total of ~32000 lines of code. The new code includes f90 modules. This single source compiles on Crays, SGI Origin, IBM SP, and Compaq EV40. The "core" model, inside the main parallel latitude loop (see call dynamics), is still legacy f77 code. At least 90% of the model's computing is done inside this main latitude loop. See advnce.F for the main driver logic, time-step loop, latitude loop, etc.

I'm leaving on spring break, returning 4/3, so take your time responding to this -- I won't expect to hear from you until sometime in April. I understand if you can't look at it at all, or not until later, but maybe it could be a diversion for you at some point. If you are interested in testing the entire code, I can provide tar files with source, makefile, input data, etc.

I really appreciate any comments you might have.

Thanks,
--Ben

-----------------------------------------------------------------------
Ben Foster                        High Altitude Observatory (HAO)
foster@ncar.ucar.edu              phone: 303-497-1595  fax: 303-497-1589
Nat. Center for Atmos. Res.       P.O. Box 3000  Boulder CO 80307  USA
-----------------------------------------------------------------------
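
P.S. In case it helps to see the communication pattern I'm describing without digging through the tar file, here is a minimal, stand-alone sketch of the per-timestep gather-to-master / serial dynamo / scatter-back step. The names and array shapes (nlon, nlev, f, fglb, dyn) are simplified stand-ins for illustration, not the actual tgcm variables or routines:

  ! Sketch of the per-timestep exchange around the serial dynamo.
  ! Names and shapes are illustrative only, not the real tgcm code.
  program gather_dynamo_scatter
    use mpi
    implicit none
    integer, parameter :: nlon=72, nlat=36, nlev=25
    integer :: myid, ntask, ier, nlats
    real, allocatable :: f(:,:,:)      ! this task's latitude slab (geographic)
    real, allocatable :: fglb(:,:,:)   ! full global field, master only

    call mpi_init(ier)
    call mpi_comm_rank(MPI_COMM_WORLD, myid, ier)
    call mpi_comm_size(MPI_COMM_WORLD, ntask, ier)

    ! 1-d decomposition over latitude (assumes nlat divisible by ntask)
    nlats = nlat / ntask
    allocate(f(nlon,nlev,nlats))
    f = real(myid)                     ! stand-in for this step's dynamics output

    if (myid == 0) then
      allocate(fglb(nlon,nlev,nlat))
    else
      allocate(fglb(1,1,1))            ! dummy; ignored on non-master tasks
    endif

    ! "all to one": master collects updated geographic data from all tasks
    call mpi_gather(f, nlon*nlev*nlats, MPI_REAL, &
                    fglb, nlon*nlev*nlats, MPI_REAL, 0, MPI_COMM_WORLD, ier)

    ! master runs the serial dynamo (which loops over geomagnetic latitude)
    if (myid == 0) then
      ! call dyn(fglb, ...)            ! placeholder for the real dynamo call
      fglb = fglb + 1.                 ! stand-in for dynamo output
    endif

    ! "one to all": distribute dynamo output back for the next time step
    call mpi_scatter(fglb, nlon*nlev*nlats, MPI_REAL, &
                     f, nlon*nlev*nlats, MPI_REAL, 0, MPI_COMM_WORLD, ier)

    call mpi_finalize(ier)
  end program gather_dynamo_scatter

The real code is of course more involved, but this is the shape of the communication I suspect is dominating the MPI runs.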