Rich:

Ok, I reorganized the loops in sub qrj to look something like the
following (recall qrj is called from the parallel latitude loop).

    ! Save current (k,i) fields from original (i,k) f-array:
    do i=1,nlon            ! (nlon==76)
      do k=1,npress        ! (npress=29)
        field1(k,i) = f(i,k1)
        field2(k,i) = f(i,k2)
        ...
      enddo
    enddo
    ...
    ! Init fields in (k,i) arrays:
    do i=1,nlon            ! (1,76)
      do k=1,npress
        field3(k,i) = 0.
        field4(k,i) = 0.
      enddo
    enddo
    ...
    ! Sum over wavelength:
    do i=1,nlon
      if (idn(i)==1) then  ! daytime
        do k=1,npress
          do n=l1,nlam     ! (16,59)
            field3(k,i) = field3(k,i)+blah(n)*blahh(k,i)
            ...
          enddo
        enddo
      endif
    enddo
    ...
    ! Do other work w/o lambda loop:
    do i=1,nlon
      if (idn(i)==1) then  ! daytime
        do k=1,npress
          array(k,i) = ...
        enddo
      endif
    enddo
    ...
    ! Return updated fields to f-array:
    do k=1,kmaxp1
      do i=1,len1
        f(i,k1) = field1(k,i)
        ...
      enddo
    enddo

Original and new code are attached.

After a series of runs, this code does not appear to speed up wall
clock time much, if at all, on the IBM or Origin. Sometimes it shows a
4-5% speedup, sometimes none at all. On the Cray J90, though, it
appears faster by about 8%. I have also run it at equinox vs solstice,
since the number of daytime longitude indices changes, but there is
not much difference in speed.

It also occurs to me that I may be introducing a load imbalance in
latitude, because the number of daytime indices varies (between 0 and
76) with latitude.

Did you say something about optimizing/unrolling the transfers to/from
the (i,k) <-> (k,i) arrays at the beginning and end?

Did you get a chance to look at the "benchmarks" I sent of tgcm runs
on the 4 platforms?

--Ben
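
For reference, a minimal sketch of what unrolling the (i,k) -> (k,i) transfer could look like. It is not taken from the attached code. The 2-d array f below is a hypothetical stand-in for one field of the real f-array, the unroll depth of 4 is an arbitrary assumption, and the program name is made up. It only checks that the unrolled nest moves the same data as the plain nest. Whether it runs faster on any of the platforms mentioned would have to be measured.

```fortran
program transfer_sketch
  implicit none
  integer, parameter :: nlon = 76, npress = 29  ! sizes quoted in the message
  integer, parameter :: nb = 4                  ! unroll depth (assumption)
  real :: f(nlon,npress)         ! hypothetical stand-in for one (i,k) field
  real :: straight(npress,nlon)  ! (k,i) copy made by the plain loop nest
  real :: unrolled(npress,nlon)  ! (k,i) copy made by the unrolled loop nest
  integer :: i, k, ilast

  ! Fill f with distinct values so a misplaced element would be detected.
  do k = 1, npress
    do i = 1, nlon
      f(i,k) = real(i) + 1000.0*real(k)
    enddo
  enddo

  ! Plain transfer, with k innermost as in the message.
  do i = 1, nlon
    do k = 1, npress
      straight(k,i) = f(i,k)
    enddo
  enddo

  ! Same transfer with the i loop unrolled by nb = 4, so each pass of the
  ! k loop reads four adjacent elements of f.
  ilast = nlon - mod(nlon, nb)
  do i = 1, ilast, nb
    do k = 1, npress
      unrolled(k,i  ) = f(i  ,k)
      unrolled(k,i+1) = f(i+1,k)
      unrolled(k,i+2) = f(i+2,k)
      unrolled(k,i+3) = f(i+3,k)
    enddo
  enddo

  ! Remainder loop for the case where nlon is not a multiple of nb.
  do i = ilast+1, nlon
    do k = 1, npress
      unrolled(k,i) = f(i,k)
    enddo
  enddo

  ! Both copies should equal the intrinsic transpose of f.
  if (all(unrolled == straight) .and. all(straight == transpose(f))) then
    print *, 'transfers agree'
  else
    print *, 'MISMATCH'
  endif
end program transfer_sketch
```

With arrays this small (76 x 29), the transfer may already fit in cache on some machines, so any gain from unrolling is an open question rather than an expectation.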