12/12/98                                        Performance Library 2.0

README: Sun Performance Library 2.0
__________________________________________________________________________

Sun Performance Library 2.0 is for SPARC platforms on Solaris(TM) 2.5.1,
2.6 and Solaris 7 operating environments and Intel x86 platforms on
Solaris(TM) 2.5.1, 2.6 and Solaris 7 operating environments.
__________________________________________________________________________

Contents:

  A. Features
  B. Compatibility
  C. Documentation


A. Features

 o The Sun Performance Library is a library of subroutines and functions
   that perform common operations in computational linear algebra and
   Fourier transforms. It is based on the standard libraries BLAS1,
   BLAS2, BLAS3, LINPACK, LAPACK 2.0, FFTPACK, and VFFTPACK, and adds
   capabilities that are not included in those libraries. Each
   subprogram in the Sun Performance Library performs the same operation
   and has the same interface as the standard version, but is generally
   much faster and sometimes more accurate.

 o This release of Sun Performance Library adds subroutines for the
   following:

     * convolution and correlation in one and two dimensions
     * complex vector FFTs
     * Fourier transforms in two and three dimensions
     * complex vector multiplication

 o This release supports the 64-bit Solaris 7 environment. To use the
   64-bit version of the Sun Performance Library, specify -xarch=v9a or
   -xarch=v9 as described below. Note that integer arguments are still
   32-bit, so arrays are still limited to 2 GB.

Features specific to Sun Performance Library on SPARC platforms

 o Parallel Model Choice

   You can choose a model of parallelism that is optimized for a
   dedicated machine or for a shared machine by selecting the
   appropriate library.

     Shared multiprocessor library:     libsunperf.a or libsunperf.so
     Dedicated multiprocessor library:  libsunperf_mt.a or libsunperf_mt.so

   The shared multiprocessor model of parallelism has the following
   features:

     * The parallelism model assumes a machine shared among many tasks.
     * Parallelization is implemented with threads library
       synchronization primitives.

   The dedicated multiprocessor model of parallelism has the following
   features:

     * The parallelism model assumes a machine dedicated to one task.
     * Parallelization is implemented with spin locks.

   On a dedicated system, the dedicated model can be somewhat faster
   than the shared model due to lower synchronization overhead. On a
   system running many different tasks, the shared model provides better
   cooperation in the use of available resources.

   Risk in Combining Shared Model with Compiler Parallelization Options

   +-------------------------------------------------------------------+
   + Do not mix the shared model with -parallel, -explicitpar, etc.    +
   + If you do mix them, then the behavior is unpredictable.           +
   +-------------------------------------------------------------------+

 o Number of Processors

   For both models of parallelism, you specify the number of processors
   with the PARALLEL environment variable.

 o Optimal Choice of the Sun Performance Library

   Specify the most appropriate -xarch= option at link time to select
   the version of the Sun Performance Library optimized for your
   specific SPARC chip family. For best performance, use the same
   -xarch= option at compile time, together with optimization. A very
   rough guideline follows.

     * Specify -xarch=v9 or -xarch=v9a on machines based on UltraSPARC
       or UltraSPARC II and running the 64-bit Solaris 7 operating
       system. Note that the resulting executable can be run only on
       systems that use UltraSPARC chips from Sun Microsystems and that
       run 64-bit Solaris 7.
     * Specify -xarch=v8plus or -xarch=v8plusa on machines based on
       UltraSPARC or UltraSPARC II; on these machines the command
       `uname -m` returns "sun4u". Note that the resulting executable
       can be run only on systems that use UltraSPARC chips from Sun
       Microsystems.
     * Specify -xarch=v8 on machines based on SuperSPARC, SuperSPARC II,
       or hyperSPARC; on these machines the command `uname -m` returns
       either "sun4m" or "sun4d".
     * Specify -xarch=v8a on machines based on microSPARC or
       microSPARC II; on these machines the command `uname -m` returns
       "sun4m".
     * Specify -xarch=v7 on machines based on earlier SPARC chips; on
       these machines `uname -m` returns either "sun4c" or, in one case,
       "sun4m" (i.e. CY7C602-based Sun4-630/670/690). However, the use
       of the Sun Performance Library is strongly discouraged on these
       legacy machines because they lack the integer multiply/divide
       instructions and the floating-point FsMULd instruction. The Sun
       Performance Library uses these instructions extensively; on these
       machines they are emulated by the kernel, so an application may
       run very slowly.

 o Usage

   Note: either -dalign or a trap 6 handler is required; see
   "Compatibility," below.

   Single-processor

     * Call one or more of the routines
     * Do not set PARALLEL to a number greater than 1
     * Link with -xlic_lib=sunperf specified at the end of the command
       line
     * Do not compile or link with -parallel, -explicitpar, or -autopar

     Example: Compile and link with libsunperf.so (default)

         cc -dalign -xarch=... any.c -xlic_lib=sunperf
       or
         f77 -dalign -xarch=... any.f -xlic_lib=sunperf
       or
         f90 -dalign -xarch=... any.f90 -xlic_lib=sunperf

     Example: Compile and link with libsunperf.a statically

         cc -dalign -xarch=... any.c \
            -Bstatic -xlic_lib=sunperf -Bdynamic
       or
         f77 -dalign -xarch=... any.f \
            -Bstatic -xlic_lib=sunperf -Bdynamic
       or
         f90 -dalign -xarch=... any.f90 \
            -Bstatic -xlic_lib=sunperf -Bdynamic

   Multiple-processor in shared mode

     * Call one or more of the routines
     * Set PARALLEL to a number greater than 1
     * Compile and link with -mt
     * Link with -xlic_lib=sunperf specified at the end of the command
       line
     * Do not compile or link with -parallel, -explicitpar, or -autopar

     Example: Compile and link with libsunperf.so (default)

         cc -dalign -xarch=... any.c -xlic_lib=sunperf -mt
       or
         f77 -dalign -xarch=... any.f -xlic_lib=sunperf -mt
       or
         f90 -dalign -xarch=... any.f90 -xlic_lib=sunperf -mt

     Example: Compile and link with libsunperf.a statically

         cc -dalign -xarch=... any.c \
            -Bstatic -xlic_lib=sunperf -Bdynamic -mt
       or
         f77 -dalign -xarch=... any.f \
            -Bstatic -xlic_lib=sunperf -Bdynamic -mt
       or
         f90 -dalign -xarch=... any.f90 \
            -Bstatic -xlic_lib=sunperf -Bdynamic -mt

   Multiple-processor in dedicated mode (with parallelization options)

     * Call one or more of the routines
     * Set PARALLEL to the number of available processors
     * Link with -xlic_lib=sunperf specified at the end of the command
       line
     * Compile and link with -parallel, -explicitpar, or -autopar

     Example: Compile and link with libsunperf_mt.so (default)

         cc -dalign -xarch=... -xparallel any.c -xlic_lib=sunperf
       or
         f77 -dalign -xarch=... -parallel any.f -xlic_lib=sunperf
       or
         f90 -dalign -xarch=... -parallel any.f90 -xlic_lib=sunperf

     Example: Compile and link with libsunperf_mt.a statically

         cc -dalign -xarch=... -xparallel any.c \
            -Bstatic -xlic_lib=sunperf -Bdynamic
       or
         f77 -dalign -xarch=... -parallel any.f \
            -Bstatic -xlic_lib=sunperf -Bdynamic
       or
         f90 -dalign -xarch=... -parallel any.f90 \
            -Bstatic -xlic_lib=sunperf -Bdynamic

Using Sun Performance Library on Intel x86

 o Single-processor

     * Call one or more of the routines
     * Do not set PARALLEL to a number greater than 1
     * Link with -xlic_lib=sunperf specified at the end of the command
       line

     Example: Compile and link with libsunperf.so (default)

         cc any.c -xlic_lib=sunperf
       or
         f77 any.f -xlic_lib=sunperf

     Example: Compile and link with libsunperf.a statically

         cc any.c -Bstatic -xlic_lib=sunperf -Bdynamic
       or
         f77 any.f -Bstatic -xlic_lib=sunperf -Bdynamic

 o Multiple-processor

     * Call one or more of the routines
     * Set PARALLEL to a number greater than 1
     * Compile and link with -mt
     * Link with -xlic_lib=sunperf specified at the end of the command
       line

     Example: Compile and link with libsunperf.so (default)

         cc any.c -xlic_lib=sunperf -mt
       or
         f77 any.f -xlic_lib=sunperf -mt

     Example: Compile and link with libsunperf.a statically

         cc any.c -Bstatic -xlic_lib=sunperf -Bdynamic -mt
       or
         f77 any.f -Bstatic -xlic_lib=sunperf -Bdynamic -mt


B. Compatibility

 o The Fortran functions and subroutines are used by calling them from
   within a program, usually, but not necessarily, a FORTRAN 77 or
   Fortran 90 program. For instance, the calling program can be C or
   C++. However, the calling program must use the FORTRAN 77 calling
   sequence.

     * Do not prototype the subroutines with Fortran 90's INTERFACE
       statement. The use of INTERFACE implies that the subroutines
       will use the Fortran 90 calling sequence, and this is not the
       case. Call the subroutines in the same way that you would call
       any FORTRAN 77 subroutine.
     * Arrays are stored columnwise.
     * All arguments are passed by reference.
     * The number of arguments to a routine is fixed.
     * Types of arguments must match even after C or C++ does type
       conversion. For example, care must be exercised when passing a
       single precision real value because a C or C++ compiler may
       automatically promote the argument to double precision.
     * Indices are based at one in keeping with standard Fortran
       practice.
     * Do not use -ext_names=plain to compile subprograms that call
       subprograms from Sun Performance Library.

 o The C interfaces are used by calling them from within a program,
   usually, but not necessarily, a C or C++ program. For instance, the
   calling program can be a Pascal or Ada program. However, the calling
   sequence must follow these rules:

     * Arrays are stored columnwise.
     * Indices are based at zero in keeping with standard C and C++
       practice. For example, the FORTRAN interface to IDAMAX, which C
       programs access as "idamax_", would return a 1 to indicate the
       first element in a vector. The C interface to idamax, which C
       programs access as "idamax", would return a 0 to indicate the
       first element of a vector. This convention is observed in
       function return values, permutation vectors, and anywhere else
       that vector or array indices are used.

 o Compile with -dalign or enable trap 6 on SPARC

   The routines in the Sun Performance Library on SPARC are compiled
   with -dalign. To be compatible with this library, you must do one of
   the following:

     * Compile all of your own routines with -dalign, or
     * Enable trap 6, which allows misaligned data.

   How to enable trap 6 on SPARC:

   (1) Place this assembly code in a file called trap6_handler.s:

               .global trap6_handler_
               .text
               .align 4
           trap6_handler_:
               retl
               ta 6

   (2) Assemble trap6_handler.s:

           fbe trap6_handler.s

   The first parallelizable subroutine invoked from Sun Performance
   Library will call a routine by the name of trap6_handler_. If you do
   not supply a trap6_handler_ then Sun Performance Library will call a
   default handler that does nothing. Because the default handler does
   nothing, if you do not supply a handler then any misaligned data will
   cause a trap that will be fatal if it is not handled.
   (3) Include trap6_handler.o on the command line:

           f77 any.f trap6_handler.o -xlic_lib=sunperf

 o Some of the routines in Sun Performance Library use malloc
   internally, so user codes that make calls to Sun Performance Library
   and to sbrk may not work correctly.

 o Perflib uses global integer registers %g2, %g3, and %g4 in 32-bit
   mode and %g2 and %g3 in 64-bit mode as scratch registers.

 o When Perflib starts threads in shared mode, it uses a stack size that
   it determines as follows:

   (1) Check the value of the STACKSIZE environment variable and
       interpret the units as kbytes (1024 bytes).
   (2) Compute the maximum stack size required by Perflib.
   (3) Use the larger of the values in (1) and (2) for the size of the
       stack in the created thread.

 o When Perflib starts threads in dedicated mode, the user must use the
   STACKSIZE environment variable to specify a stack size of at least
   4 MB:

           setenv STACKSIZE 4000


C. Documentation

 o Man Pages

   There is a man page for each function and subroutine in the library.

 o Reference Manual

   An AnswerBook manual is available with this release. Documentation
   can also be found on http://docs.sun.com.

 o Other books

   Your local computer bookstore may also have these relevant books:

     LAPACK Users' Guide, 2nd ed., by Anderson et al., SIAM, 1995
     LAPACK Users' Guide, by Anderson et al., SIAM, 1992
     LINPACK Users' Guide, by Dongarra et al., SIAM, 1979
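
 o Illustrative sketch: index conventions

   The difference between the Fortran and C interfaces described under
   "Compatibility" (arguments by reference with one-based results,
   versus scalars by value with zero-based results) may be easier to
   see in code. The sketch below does not call the library. The names
   demo_idamax_f77 and demo_idamax_c are hypothetical stand-ins written
   for this note, and their handling of invalid input is an assumption;
   they only mimic the calling sequence and index base that the text
   attributes to "idamax_" and "idamax".

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double demo_abs(double v)
{
    return v < 0.0 ? -v : v;
}

/* Hypothetical stand-in for the Fortran-style interface ("idamax_"):
 * every argument is passed by reference and the result is one-based. */
int demo_idamax_f77(const int *n, const double *x, const int *incx)
{
    int i, best = 0;

    assert(n != 0 && x != 0 && incx != 0);  /* by-reference arguments */
    if (*n < 1 || *incx < 1)
        return 0;            /* assumption in this sketch: 0 means "none" */

    /* Keep the first element that has the largest absolute value. */
    for (i = 1; i < *n; i++)
        if (demo_abs(x[i * *incx]) > demo_abs(x[best * *incx]))
            best = i;

    return best + 1;         /* Fortran convention: first element is 1 */
}

/* Hypothetical stand-in for the C-style interface ("idamax"):
 * scalars are passed by value and the result is zero-based.
 * In this sketch, invalid input yields -1. */
int demo_idamax_c(int n, const double *x, int incx)
{
    return demo_idamax_f77(&n, x, &incx) - 1;
}
```

   For the vector {1.0, -7.5, 3.0, 7.5} with a stride of 1, the
   by-reference version returns 2 and the by-value version returns 1;
   both identify the same element, -7.5.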