Rich:

Ok, I reorganized the loops in sub qrj to look something 
like the following (recall that qrj is called from the 
parallel latitude loop). 

! Save current (k,i) fields from original (i,k) f-array:
do i=1,nlon               ! (nlon==76)
  do k=1,npress           ! (npress==29)
    field1(k,i) = f(i,k1) 
    field2(k,i) = f(i,k2) 
    ...
  enddo
enddo
...
! Init fields in (k,i) arrays:
do i=1,nlon ! (1,76)
  do k=1,npress
    field3(k,i) = 0.
    field4(k,i) = 0.
  enddo
enddo
...
! Sum over wavelength:
do i=1,nlon
  if (idn(i)==1) then ! daytime
    do k=1,npress 
      do n=l1,nlam ! (16,59)
        field3(k,i) = field3(k,i)+blah(n)*blahh(k,i)
        ...
      enddo
    enddo
  endif
enddo
...
! Do other work w/o lambda loop:
do i=1,nlon
  if (idn(i)==1) then ! daytime
    do k=1,npress 
      array(k,i) = ...
    enddo
  endif
enddo
...
! Return updated fields to f-array:
do k=1,kmaxp1
  do i=1,len1
    f(i,k1) = field1(k,i)
    ...
  enddo
enddo

Original and new code are attached.

After a series of runs, the new code does not appear to reduce
wall-clock time much, if at all, on the IBM or the Origin. 
Sometimes it shows a 4-5% speedup, sometimes none at all. On 
the Cray J90, though, it appears to be about 8% faster. I have 
also run it at equinox vs. solstice, since the number of daytime 
longitude indices changes, but saw little difference in speed.

It also occurs to me that I may be introducing a load imbalance 
in latitude because the number of daytime indices varies
(between 0 and 76) with latitude. 
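
A rough way to put a number on that imbalance would be to tally the 
daytime indices per latitude. The following is only a sketch: nlat, 
the 2-d layout of idn, and the synthetic day/night mask are all 
assumptions made for illustration, not taken from the model.

```fortran
program daytime_imbalance
  implicit none
  integer, parameter :: nlon = 76   ! from the loops above
  integer, parameter :: nlat = 36   ! hypothetical latitude count
  integer :: idn(nlon, nlat)        ! hypothetical layout: 1 = daytime, 0 = night
  integer :: nday(nlat)             ! daytime longitudes per latitude
  integer :: i, j

  ! Synthetic day/night mask, for illustration only: latitude j gets
  ! 2*j daytime longitudes (capped at nlon).
  idn = 0
  do j = 1, nlat
    do i = 1, min(2*j, nlon)
      idn(i, j) = 1
    enddo
  enddo

  ! Tally daytime longitudes at each latitude.
  do j = 1, nlat
    nday(j) = count(idn(:, j) == 1)
  enddo

  ! A large max/mean ratio suggests the latitude tasks are unevenly loaded.
  print '(a,i4,a,i4,a,f8.2)', 'daytime count: min=', minval(nday), &
        ' max=', maxval(nday), ' mean=', real(sum(nday))/real(nlat)
end program daytime_imbalance
```

In the real code the tally would come from the model's own idn values 
at each latitude rather than a synthetic mask.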

Did you say something about optimizing/unrolling the 
(i,k) <-> (k,i) array transfers at the beginning and end?
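
To have something concrete to point at, here is a minimal sketch of 
one way those transfers might be blocked. It is not what is in the 
attached code: the block size nb, the array names, and the single 
(nlon,npress) slab standing in for one field of the f-array are all 
hypothetical, and whether tiling helps at these sizes would need to 
be measured on each machine.

```fortran
program blocked_transfer
  implicit none
  integer, parameter :: nlon = 76, npress = 29  ! sizes from the loops above
  integer, parameter :: nb = 8                  ! hypothetical block size
  real :: f(nlon, npress)      ! stand-in for one field's slab of the f-array
  real :: field(npress, nlon)  ! (k,i) work array
  real :: fback(nlon, npress)  ! result of the return transfer
  integer :: i, k, ii, kk

  ! Fill with recognizable values.
  do k = 1, npress
    do i = 1, nlon
      f(i, k) = real(1000*k + i)
    enddo
  enddo

  ! (i,k) -> (k,i), done in nb x nb tiles so that both arrays are
  ! touched within a small region at a time.
  do ii = 1, nlon, nb
    do kk = 1, npress, nb
      do i = ii, min(ii+nb-1, nlon)
        do k = kk, min(kk+nb-1, npress)
          field(k, i) = f(i, k)
        enddo
      enddo
    enddo
  enddo

  ! (k,i) -> (i,k) on the way back, same tiling.
  do kk = 1, npress, nb
    do ii = 1, nlon, nb
      do k = kk, min(kk+nb-1, npress)
        do i = ii, min(ii+nb-1, nlon)
          fback(i, k) = field(k, i)
        enddo
      enddo
    enddo
  enddo

  ! Self-check: the round trip must reproduce the original slab.
  if (any(fback /= f)) error stop 'round trip mismatch'
  print *, 'round trip ok'
end program blocked_transfer
```

The transpose intrinsic (field = transpose(f)) expresses the same 
copy in one statement; which form is faster depends on the compiler 
and the machine.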

Did you get a chance to look at the "benchmarks" I sent of 
tgcm runs on the 4 platforms?

--Ben