Rich:

I have something like the following:

    !$OMP PARALLEL DO ...
    do j=1,nlat            ! nlat=36
      ...
      call qrj
      ...
    enddo

    sub qrj
      lmax = 59
      do lam=1,lmax        ! wavelength
    !
    ! Several of these 2d loops in the wavelength (lam) loop:
        do k=1,npress      ! npress=29
          do i=1,nlon      ! nlon=76
    !
    ! Blocks of (often reduction) statements like:
    ! (a and b are thread-private)
            b(i,k) = blah,blah
            a(i,k) = a(i,k)+s(lam)*b(i,k) ! sum over wavelength
          enddo ! longitude
        enddo ! pressure
        ...
      enddo ! wavelength

The qrj routine is one of the highest CPU consumers in the code. In an
attempt to speed it up I realized that most of this code does not need
to be executed at night, which is dependent on j and i. So I made an
index array idn(nlon) (a private one for each j), in which idn(i)==0
for nighttime, or 1 for daytime, then did the following in qrj:

    do lam=16,lmax         ! wavelength
      do k=1,npress
        do i=1,nlon
          if (idn(i)==1) then ! daytime
            ...
            a(i,k) = s(lam)*b(i,k)
            ...
          endif
        enddo
      enddo
    enddo

This did not speed it up much, if at all, even though there were many
(although < 1/2) night-time iterations. So I tried to save overhead on
at least the k-loop by reversing the i and k loops:

    do lam=16,lmax         ! wavelength
      do i=1,nlon          ! nlon=76
        if (idn(i)==1) then
          do k=1,npress    ! npress=29
            a(i,k) = s(lam)*b(i,k)
            ...
          enddo ! pressure
        endif
      enddo ! longitude
    enddo ! wavelength

This appeared to be even slower, presumably because the inner k-loop
now accesses a and b with non-unit stride (stride nlon rather than
stride 1). When I put the i-loop outside of the wavelength loop, it
was *definitely* much slower.

There appears to be a trade-off between saving work on the diurnal
conditional vs stride-1 access on the arrays. I've considered
introducing local arrays c(k,i) and d(k,i), but some (not all) of the
a and b arrays are in common, so for those I would have to copy
c=trans(a) and d=trans(b) at the beginning of the routine, and
a=trans(c) and b=trans(d) at the end.

Do you have any suggestions?

--Ben

PS: did you get my tgcm benchmarks msg last week?
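For concreteness, here is a minimal sketch of the transposed-work-array idea mentioned above (c(k,i) and d(k,i)), so that the k-loop inside the daytime test runs at stride 1. It is only an illustration under assumptions, not the real qrj: the program name, the test values for s, b and idn, and the use of the accumulating form a = a + s(lam)*b are all made up for the example, and the standard TRANSPOSE intrinsic stands in for the trans() shorthand. In the real routine b depends on lam; here it is held fixed to keep the sketch short. The sketch only checks that both loop orders give the same answer; whether the two transposes pay for themselves would have to be timed.

```fortran
program qrj_transpose_sketch
  implicit none
  ! Dimensions taken from the message; everything else is illustrative.
  integer, parameter :: nlon = 76, npress = 29, lmax = 59
  real :: a(nlon,npress), b(nlon,npress)   ! original (i,k) layout
  real :: aref(nlon,npress)                ! result of the original loop order
  real :: c(npress,nlon), d(npress,nlon)   ! transposed (k,i) work copies
  real :: s(lmax)
  integer :: idn(nlon)                     ! 1 = daytime, 0 = nighttime
  integer :: i, k, lam

  ! Hypothetical test data (not from the model).
  do lam = 1, lmax
    s(lam) = 0.01*real(lam)
  end do
  do i = 1, nlon
    idn(i) = merge(1, 0, i > nlon/3)       ! pretend the first third is night
  end do
  do k = 1, npress
    do i = 1, nlon
      b(i,k) = real(i) + 0.5*real(k)
    end do
  end do
  a = 0.0
  aref = 0.0

  ! Reference: original loop order, conditional inside the stride-1 i-loop.
  do lam = 16, lmax
    do k = 1, npress
      do i = 1, nlon
        if (idn(i) == 1) aref(i,k) = aref(i,k) + s(lam)*b(i,k)
      end do
    end do
  end do

  ! Transposed version: copy in once, test idn(i) once per column,
  ! and run the inner k-loop at stride 1 on c and d.
  c = transpose(a)
  d = transpose(b)
  do lam = 16, lmax
    do i = 1, nlon
      if (idn(i) == 1) then
        do k = 1, npress
          c(k,i) = c(k,i) + s(lam)*d(k,i)
        end do
      end if
    end do
  end do
  a = transpose(c)                         ! copy back for arrays in common

  ! The two results should agree to within rounding.
  print *, 'max abs difference:', maxval(abs(a - aref))
end program qrj_transpose_sketch
```

The copy-in and copy-out cost is paid once per call to qrj, while the daytime test is hoisted out of all npress iterations for every wavelength, so the balance depends on how many 2d loops inside the wavelength loop can share the same c and d.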