Rich:

I have something like the following:

!$OMP PARALLEL DO ...
do j=1,nlat ! nlat=36
  ...
  call qrj
  ...
enddo

subroutine qrj
lmax = 59
do lam=1,lmax ! wavelength
!
! Several of these 2d loops in the lambda loop:
  do k=1,npress ! npress=29
    do i=1,nlon ! nlon=76
!
! Blocks of (often reduction) statements like:
! (a and b are thread-private)
      b(i,k) = blah,blah
      a(i,k) = a(i,k)+s(lam)*b(i,k) ! sum over wavelength
    enddo ! longitude
  enddo ! pressure
  ...
enddo ! wavelength

The qrj routine is one of the highest CPU consumers
in the code. In an attempt to speed it up I realized
that most of this code does not need to be executed
at night, and whether it is night depends on j and i.
So I made an index array idn(nlon) (a private one for
each j), in which idn(i)==0 for nighttime and 1 for
daytime, then did the following in qrj:

do lam=16,lmax ! wavelength
  do k=1,npress
    do i=1,nlon
      if (idn(i)==1) then ! daytime
        ...
        a(i,k) = s(lam)*b(i,k) 
        ...
      endif
    enddo
  enddo
enddo

This did not speed it up much, if at all, even though
there were many (although < 1/2) nighttime iterations.
So I figured I could save overhead on at least the k-loop
by reversing the i and k loops:

do lam=16,lmax ! wavelength 
  do i=1,nlon ! nlon=76   
    if (idn(i)==1) then
      do k=1,npress ! npress=29   
        a(i,k) = s(lam)*b(i,k)
        ...
      enddo ! pressure
    endif
  enddo ! longitude   
enddo ! wavelength 

This appeared to be even slower, presumably because the
inner k-loop no longer accesses a and b with stride 1.
When I put the i-loop outside of the lambda loop, it was
*definitely* much slower. There appears to be a trade-off
between saving work on the diurnal conditional and keeping
stride-1 access on the arrays.

I've considered introducing local arrays c(k,i) and
d(k,i), but some (not all) of the a and b arrays are in
common, so for those I would have to copy c=transpose(a)
and d=transpose(b) at the beginning of the routine, and
a=transpose(c) and b=transpose(d) at the end.
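
To make the transposed-array idea concrete, here is a minimal
stand-alone sketch of what I have in mind. It is not the real
qrj: the values put into b, s and idn are made up, a and b are
ordinary local arrays standing in for the ones in common, and I
use the accumulating form of the update from the first loop above.

```fortran
! Sketch only: array contents and the daytime range are hypothetical.
program qrj_transpose_sketch
  implicit none
  integer, parameter :: nlon=76, npress=29, lmax=59
  real :: a(nlon,npress), b(nlon,npress) ! stand-ins for arrays in common
  real :: c(npress,nlon), d(npress,nlon) ! local transposed work arrays
  real :: s(lmax)
  integer :: idn(nlon)                   ! 0=nighttime, 1=daytime
  integer :: i, k, lam
!
! Made-up input data (assumption, just so the sketch runs):
  a = 0.
  s = 1.
  do k=1,npress
    do i=1,nlon
      b(i,k) = real(i+k)
    enddo
  enddo
  idn = 0
  idn(20:60) = 1                         ! hypothetical daytime longitudes
!
! Transpose once on entry so that k is the fast index:
  c = transpose(a)
  d = transpose(b)
!
  do lam=16,lmax ! wavelength
    do i=1,nlon ! longitude
      if (idn(i)==1) then ! daytime: one test per column
        do k=1,npress ! pressure, stride-1 on c and d
          c(k,i) = c(k,i)+s(lam)*d(k,i)
        enddo ! pressure
      endif
    enddo ! longitude
  enddo ! wavelength
!
! Transpose back on exit so the rest of the code sees a(i,k), b(i,k):
  a = transpose(c)
  b = transpose(d)
!
  print *, 'daytime point a(20,1)  =', a(20,1)
  print *, 'nighttime point a(1,1) =', a(1,1)
end program qrj_transpose_sketch
```

The conditional is tested once per longitude instead of once per
(i,k) pair, and the inner loop stays stride-1, but the cost is four
full-array transposes per call. Whether that pays off depends on how
much work sits inside the lambda loop.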

Do you have any suggestions?

--Ben

PS: Did you get my tgcm benchmarks message last week?