> It is very difficult to see what would be best without seeing the whole routine,
>   i.e. I have no idea what M and N and lmax are.

lmax=59 so lmax-15 = 44. M and N vary with the statement.
I will attach two files, qrj.orig and qrj.idn. The
first is the "original" code; the second has some experiments
in it, but basically has the i loop inside the lamda loop.
The qrj.idn code is about 5% faster than the original,
which is surprising since the stride is so bad.

The F(I,xxxx) array is in thread-private common. Pressure
and M fields are wrapped together in the 2nd dimension.
LEN3 and LEN2 are ~npress*nlon. The Sx(I,K) arrays are
also in thread-private common, but are treated as local
(except where contiguous memory is assumed). Here
kmax=npress and imax=nlon.

So I will declare Sxki(npress,nlon) local arrays, put 
the lamda loop on the inside, lon loop on the outside,
and transfer into the Sx(nlon,npress) arrays in between
the triple loops when necessary. I'll let you know what
happens.
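
For concreteness, here is a minimal sketch of that plan. Only
Sx, Sxki, npress, nlon, lmax=59 and the loop order come from
the description above; the small sizes, the coef array, the
1..lmax loop range and the loop body are placeholders, not
the qrj code.

```fortran
program sxki_sketch
  implicit none
  ! Placeholder sizes; the real npress and nlon come from the model.
  integer, parameter :: npress = 4, nlon = 6, lmax = 59
  real :: sx(nlon, npress)     ! existing layout: lon index varies fastest
  real :: sxki(npress, nlon)   ! local work array: pressure index varies fastest
  real :: coef(lmax)           ! hypothetical stand-in for the real operands
  integer :: i, k, lam

  do lam = 1, lmax
    coef(lam) = 1.0 / real(lam)
  end do

  ! lon loop outside, lamda loop inside. Each pass over i touches
  ! only the contiguous column sxki(:, i).
  do i = 1, nlon
    do k = 1, npress
      sxki(k, i) = 0.0
      do lam = 1, lmax
        ! Placeholder arithmetic standing in for the real loop body.
        sxki(k, i) = sxki(k, i) + coef(lam) * real(i + k)
      end do
    end do
  end do

  ! Transfer into the Sx(nlon,npress) layout between the triple loops.
  do k = 1, npress
    do i = 1, nlon
      sx(i, k) = sxki(k, i)
    end do
  end do

  ! Sanity check that the transfer is a plain transpose.
  print *, all(sx == transpose(sxki))
end program sxki_sketch
```

The transfer loop writes Sx with unit stride and reads Sxki
with stride npress. Whether that copy costs less than the
strided access it replaces is exactly what the timing has to
show.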

--Ben