> It is very difficult to see what would be best without seeing the whole routine,
> i.e. I have no idea what M and N and lmax are.

lmax=59, so lmax-15 = 44. M and N vary with the statement.

I will attach 2 files: qrj.orig and qrj.idn. The first is the "original" code; the second has some experiments in it, but basically has the i loop inside the lamda loop. The qrj.idn code is about 5% faster than the original, which is surprising since the stride is so bad.

The F(I,xxxx) array is in thread-private common. Pressure and M fields are wrapped together in the 2nd dimension. LEN3 and LEN2 are ~npress*nlon. The Sx(I,K) arrays are also in thread-private common, but are treated as local (except when contiguous memory is assumed) (kmax==npress, imax==nlon).

So I will declare Sxki(npress,nlon) local arrays, put the lamda loop on the inside and the lon loop on the outside, and transfer into the Sx(nlon,npress) arrays in between the triple loops when necessary.

I'll let you know what happens.

--Ben
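
For reference, here is a rough C sketch of the loop arrangement described above: a local work array with the pressure index fastest, the lon loop outermost, the lamda loop innermost, and a transfer into the Sx layout afterwards. It is only an illustration under assumptions, not the code in qrj.orig or qrj.idn. The sizes, the function names, and the term() placeholder are all hypothetical, and because C is row-major the index order is reversed relative to the Fortran declarations.

```c
/* Hypothetical sizes and names; none of them come from qrj.orig or qrj.idn. */
#define NLON   4   /* stands in for nlon (imax)            */
#define NPRESS 3   /* stands in for npress (kmax)          */
#define NLAM   5   /* stands in for the lamda loop extent  */

/* Placeholder for whatever one lamda iteration contributes at (i, k). */
double term(int i, int k, int l)
{
    return 100.0 * (i + 1) + 10.0 * (k + 1) + l;
}

/* lon loop outermost, lamda loop innermost.
 * C is row-major, so sxki[i][k] has the memory layout of the
 * Fortran array Sxki(npress,nlon): k is the fastest index. */
void accumulate_k_fastest(double sxki[NLON][NPRESS])
{
    for (int i = 0; i < NLON; i++) {
        for (int k = 0; k < NPRESS; k++) {
            double sum = 0.0;
            for (int l = 0; l < NLAM; l++) {
                sum += term(i, k, l);
            }
            sxki[i][k] = sum;
        }
    }
}

/* Transfer step done between the loop nests: copy into the layout of
 * the Fortran array Sx(nlon,npress), i.e. sx[k][i] with i fastest. */
void transfer_to_sx(double sxki[NLON][NPRESS], double sx[NPRESS][NLON])
{
    for (int k = 0; k < NPRESS; k++) {
        for (int i = 0; i < NLON; i++) {
            sx[k][i] = sxki[i][k];
        }
    }
}
```

Calling accumulate_k_fastest() and then transfer_to_sx() leaves sx holding the same values as sxki with the two indices swapped. Whether the extra copy pays for itself depends on how often the transfer is needed, which this sketch does not try to measure.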