mp_in_doacross_loop(3C)

MP(3C)                                                                  MP(3C)



NAME
     mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum,
     mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
     mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop,
     mp_set_slave_stacksize - C multiprocessing utility functions

SYNOPSIS
     void mp_block()

     void mp_unblock()

     void mp_blocktime(iters)
     int iters

     void mp_setup()

     void mp_create(num)
     int num

     void mp_destroy()

     int mp_numthreads()

     void mp_set_numthreads(num)
     int num

     int mp_my_threadnum()

     int mp_is_master()

     void mp_setlock()

     void mp_unsetlock()

     void mp_barrier()

     int mp_in_doacross_loop()

     void mp_set_slave_stacksize(size)
     int size

     unsigned int mp_suggested_numthreads(num)
     unsigned int num


DESCRIPTION
     These routines give some measure of control over the parallelism used in
     C programs.  They should not be needed by most users, but will help to
     tune specific applications.

     mp_block puts all slave threads to sleep via blockproc(2).  This frees
     the processors for use by other jobs.  This is useful if it is known that
     the slaves will not be needed for some time, and the machine is being
     shared by several users.  Calls to mp_block may not be nested; a warning
     is issued if an attempt to do so is made.

     mp_unblock wakes up the slave threads that were previously blocked via
     mp_block.  It is an error to unblock threads that are not currently
     blocked; a warning is issued if an attempt is made to do so.

     It is not necessary to explicitly call mp_unblock.  When a parallel
     region is entered, a check is made, and if the slaves are currently
     blocked, a call is made to mp_unblock automatically.
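
     A minimal sketch of blocking the slaves around a long serial phase.
     The function name serial_sum is hypothetical. The extern declarations
     are assumed from the SYNOPSIS, and the no-op stand-ins in the #else
     branch are assumptions that only let the sketch build on systems
     without these routines.

```c
#include <stddef.h>

#ifdef __sgi
/* Assumed declarations, taken from the SYNOPSIS above. */
extern void mp_block(void);
extern void mp_unblock(void);
#else
/* Hypothetical no-op stand-ins so the sketch compiles elsewhere. */
static void mp_block(void) {}
static void mp_unblock(void) {}
#endif

/* Sum an array serially while the slave threads sleep. */
double serial_sum(const double *v, size_t n)
{
    double total = 0.0;
    size_t i;

    mp_block();                 /* slaves sleep, freeing their processors */
    for (i = 0; i < n; i++)
        total += v[i];
    mp_unblock();               /* optional: the next parallel region
                                   would unblock the slaves anyway */
    return total;
}
```

     Because calls to mp_block may not be nested, a routine like this
     should not be called while the slaves are already blocked.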

     mp_blocktime controls the amount of time a slave thread waits for work
     before giving up.  When enough time has elapsed, the slave thread blocks
     itself.  This automatic blocking is independent of the user level
     blocking provided by the mp_block/mp_unblock calls.  Slave threads that
     have blocked themselves will be automatically unblocked upon entering a
     parallel region.  The argument to mp_blocktime is the number of times to
     spin in the wait loop.  By default, it is set to 10,000,000.  This takes
     about 0.25 seconds on a 200MHz processor.  As a special case, an argument
     of 0 disables the automatic blocking, and the slaves will spin wait
     without limit.  The environment variable MP_BLOCKTIME may be set to an
     integer value.  It acts like an implicit call to mp_blocktime during
     program startup.
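
     The default figures above work out to roughly 5 clock cycles per spin
     (0.25 s x 200,000,000 cycles/s / 10,000,000 spins). The sketch below
     uses that ratio to estimate a spin count for a desired wait time. The
     function name blocktime_for and the assumption that the cost per spin
     stays constant across processors are illustrative only, not documented
     behavior.

```c
/* Rough estimate only: about 5 cycles per spin, derived from the
 * "10,000,000 spins is about 0.25 s at 200 MHz" figure in the text.
 * Treating this as constant is an assumption. */
#define CYCLES_PER_SPIN 5.0

/* Estimate the mp_blocktime argument for a wait of `seconds`
 * on a processor clocked at `clock_mhz` MHz. */
long blocktime_for(double seconds, double clock_mhz)
{
    double cycles = seconds * clock_mhz * 1.0e6;

    return (long)(cycles / CYCLES_PER_SPIN + 0.5);  /* round to nearest */
}
```

     The result could be passed to mp_blocktime or used as the value of
     MP_BLOCKTIME. A result of 0 should be avoided unless unlimited
     spinning is intended.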

     mp_destroy deletes the slave threads.  They are stopped by forcing them
     to call exit(2).  In general, doing this is discouraged.  mp_block can be
     used in most cases.

     mp_create creates and initializes threads.  It creates enough threads so
     that the total number is equal to the argument.  Since the calling thread
     already counts as one, mp_create will create one less than its argument
     in new slave threads.

     mp_setup also creates and initializes threads.  It takes no arguments.
     It simply calls mp_create using the current default number of threads.
     Normally the default number is equal to the number of cpus currently on
     the machine.  If the user has not called either of the thread creation
     routines already, then mp_setup is invoked automatically when the first
     parallel region is entered.  If the environment variable MP_SETUP is set,
     then mp_setup is called during initialization, before any user code is
     executed.

     mp_numthreads returns the number of threads that would participate in an
     immediately following parallel region.  If the threads have already been
     created, then it returns the current number of threads.  If the threads
     have not been created, then it returns the current default number of
     threads.  The count includes the master thread. Knowing this count can be
     useful in optimizing certain kinds of parallel loops by hand, but this
     function has the side-effect of freezing the number of threads to the
     returned value.  As a result, this routine should be used sparingly. To
     determine the number of threads without this side-effect, see the
     description of mp_suggested_numthreads below.

     mp_set_numthreads sets the current default number of threads to the
     specified value.  Note that this call does not directly create the
     threads; it only specifies the number that a subsequent mp_setup call
     should use.  If the environment variable MP_SET_NUMTHREADS is set, it
     acts like an implicit call to mp_set_numthreads during program startup.
     For convenience when operating among several machines with different
     numbers of cpus, MP_SET_NUMTHREADS may be set to an expression involving
     integer literals, the binary operators + and -, the binary functions min
     and max, and the special symbolic value ALL which stands for "the total
     number of available cpus on the current machine."  Thus, something simple
     like
                 setenv MP_SET_NUMTHREADS 7
     would set the number of threads to seven.  This may be a fine choice on
     an 8 cpu machine, but would be very bad on a 4 cpu machine.  Instead, use
     something like
                 setenv MP_SET_NUMTHREADS "max(1,all-1)"
     which sets the number of threads to be one less than the number of cpus
     on the current machine (but always at least one).  If your configuration
     includes some machines with large numbers of cpus, setting an upper bound
     is a good idea.  Something like:
                 setenv MP_SET_NUMTHREADS "min(all,4)"
     will request (no more than) 4 cpus.

     For compatibility with earlier releases, NUM_THREADS is supported as a
     synonym for MP_SET_NUMTHREADS.

     mp_my_threadnum returns an integer between 0 and n-1 where n is the value
     returned by mp_numthreads.  The master process is always thread 0.  This
     is occasionally useful for optimizing certain kinds of loops by hand.

     mp_is_master returns 1 if called by the master process, 0 otherwise.

     mp_setlock provides convenient (though limited) access to the locking
     routines.  The convenience is that no set up need be done; it may be
     called directly without any preliminaries.  The limitation is that there
     is only one lock.  It is analogous to the ussetlock(3P) routine, but it
     takes no arguments and does not return a value.  This is useful for
     serializing access to shared variables (e.g.  counters) in a parallel
     region.  Note that it will frequently be necessary to declare those
     variables as volatile to ensure that the optimizer does not assign them
     to a register.

     mp_unsetlock is the companion routine for mp_setlock.  It also takes no
     arguments and does not return a value.
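
     A minimal sketch of serializing updates to a shared counter with the
     single lock. The function name count_positive is hypothetical, the
     extern declarations are assumed from the SYNOPSIS, and the #else
     stand-ins exist only so the sketch builds elsewhere. Compilers that do
     not know the pragmas ignore them and run the loop serially with the
     same result.

```c
#ifdef __sgi
/* Assumed declarations, taken from the SYNOPSIS above. */
extern void mp_setlock(void);
extern void mp_unsetlock(void);
#else
/* Hypothetical no-op stand-ins for serial builds. */
static void mp_setlock(void) {}
static void mp_unsetlock(void) {}
#endif

/* Count the positive entries of v[0..n-1]. */
int count_positive(const int *v, int n)
{
    volatile int count = 0;   /* volatile keeps the shared counter
                                 out of a register */
    int i;

#pragma parallel shared(v, n, count) local(i)
    {
#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++) {
            if (v[i] > 0) {
                mp_setlock();     /* there is only one lock */
                count++;
                mp_unsetlock();
            }
        }
    }
    return count;
}
```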

     mp_barrier provides a simple interface to a single barrier(3P).  It may
     be used inside a parallel loop to force a barrier synchronization to
     occur among the parallel threads.  The routine takes no arguments,
     returns no value, and does not require any initialization.

     mp_in_doacross_loop answers the question "am I currently executing inside
     a parallel loop?"  This is needed in certain rare situations where you
     have an external routine that can be called both from inside a parallel
     loop and also from outside a parallel loop, and the routine must do
     different things depending on whether it is being called in parallel or
     not.
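
     One way such a routine might look, as a sketch: it takes the lock only
     when called from inside a parallel loop. The name add_sample is
     hypothetical, the extern declarations are assumed from the SYNOPSIS,
     and the #else stand-ins (which always report "not in a parallel loop")
     are assumptions for serial builds.

```c
#ifdef __sgi
/* Assumed declarations, taken from the SYNOPSIS above. */
extern int  mp_in_doacross_loop(void);
extern void mp_setlock(void);
extern void mp_unsetlock(void);
#else
/* Hypothetical serial stand-ins. */
static int  mp_in_doacross_loop(void) { return 0; }
static void mp_setlock(void) {}
static void mp_unsetlock(void) {}
#endif

/* Add x to a running total; safe to call from serial or parallel code. */
void add_sample(volatile double *total, double x)
{
    if (mp_in_doacross_loop()) {
        mp_setlock();           /* other threads may update *total */
        *total += x;
        mp_unsetlock();
    } else {
        *total += x;            /* serial caller: no locking needed */
    }
}
```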

     mp_set_slave_stacksize sets the stacksize (in bytes) to be used by the
     slave processes when they are created (via sprocsp(2)).  The default size
     is 16MB.  Note that slave processes only allocate their local data onto
     their stack; shared data (even if allocated on the master's stack) is not
     counted.

     mp_suggested_numthreads uses the supplied value as a hint about how many
     threads to use in subsequent parallel regions, and returns the previous
     value of the number of threads to be employed in parallel regions. It
     does not affect currently executing parallel regions, if any. The
     implementation may ignore this hint depending on factors such as overall
     system load.  This routine may also be called with the value 0, in which
     case it simply returns the number of threads to be employed in parallel
     regions without the side-effect present in mp_numthreads.
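
     A short sketch of querying and changing the thread-count hint. The
     helper names peek_numthreads and swap_thread_hint are hypothetical.
     The #else stand-in only mimics the behavior described above (return
     the previous value, treat 0 as a pure query) and is an assumption for
     builds without the library.

```c
#ifdef __sgi
/* Assumed declaration, taken from the SYNOPSIS above. */
extern unsigned int mp_suggested_numthreads(unsigned int num);
#else
/* Hypothetical stand-in that mimics the documented behavior. */
static unsigned int stub_hint = 1;
static unsigned int mp_suggested_numthreads(unsigned int num)
{
    unsigned int previous = stub_hint;

    if (num != 0)
        stub_hint = num;
    return previous;
}
#endif

/* Query the thread count without freezing it (argument 0). */
unsigned int peek_numthreads(void)
{
    return mp_suggested_numthreads(0);
}

/* Install a new hint and hand back the old one so it can be restored. */
unsigned int swap_thread_hint(unsigned int hint)
{
    return mp_suggested_numthreads(hint);
}
```

     Since the implementation may ignore the hint, code should not assume
     that a later query returns the value it just suggested.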

     Pragmas or directives

     The MIPSpro C (and C++) compiler allows you to apply the capabilities of
     a Silicon Graphics multiprocessor computer to the execution of a single
     job. By coding a few simple directives, the compiler splits the job into
     concurrently executing pieces, thereby decreasing the wall-clock run time
     of the job.

     Directives enable, disable, or modify a feature of the compiler.
     Essentially, directives are command line options specified within the
     input file instead of on the command line. Unlike command line options,
     directives have no default setting. To invoke a directive, you must
     either toggle it on or set a desired value for its level.  The following
     directives can be used in C (and C++) programs when compiled with the -mp
     option.


     #pragma parallel

         This pragma denotes the start of a parallel region. The syntax for
         this pragma has a number of modifiers, but to run a single loop in
         parallel, the only modifiers you usually use are shared and local.
         These options tell the multiprocessing compiler which variables to
         share between all threads of execution and which variables should be
         treated as local.

         In C, the code that comprises the parallel region is delimited by
         curly braces ({ }) and immediately follows the parallel pragma and
         its modifiers.

         The syntax for this pragma is:

         #pragma parallel shared (variables)
         #pragma local (variables) optional modifiers
         {code}

         The parallel pragma has four modifiers: shared, local, if, and
         numthreads.

         Their definitions are:

             shared ( variablenames )

             Tells the multiprocessing C compiler the names of all the
             variables that the threads must share.

             local ( variablenames )

             Tells the multiprocessing C compiler the names of all the
             variables that must be private to each thread. (When PCA sets up
             a parallel region, it does this for you.)

             if ( integervaluedexpr )

             Lets you set up a condition that is evaluated at run time to
             determine whether or not to run the statement(s) serially or in
             parallel. At compile time, it is not always possible to judge how
             much work a parallel region does (for example, loop indices are
             often calculated from data supplied at run time). Avoid running
             trivial amounts of code in parallel because you cannot make up
             the overhead associated with running code in parallel. PCA will
             also generate this condition as appropriate.  If the if condition
             is false (equal to zero), then the statement(s) runs serially.
             Otherwise, the statement(s) run in parallel.

             numthreads(expr)

             Tells the multiprocessing C compiler the number of available
             threads to use when running this region in parallel. (The default
             is all the available threads.)

             In general, you should never have more threads of execution than
             you have processors, and you should specify numthreads with the
             MP_SET_NUMTHREADS environment variable at run time. If you want
             to run a loop in parallel while you run some other code, you can
             use this option to tell the multiprocessing C compiler to use
             only some of the available threads.

             The expression expr should evaluate to a positive integer.

             For example, to start a parallel region in which to run the
             following code in parallel:

             for (idx=n; idx; idx--) {

                a[idx] = b[idx] + c[idx];

             }

             you must write:

             #pragma parallel shared( a, b, c ) shared(n) local( idx )

             or:

             #pragma parallel

             #pragma shared( a, b, c )

             #pragma shared(n)

             #pragma local(idx)

             before the statement or compound statement (code in curly braces,
             { }) that comprises the parallel region.

             Any code within a parallel region but not within any of the
             explicit parallel constructs ( pfor, independent, one processor,
             and critical ) is termed local code. Local code typically
             modifies only local data and is run by all threads.
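
             Putting the pieces together, a complete function might look
             like the sketch below. It uses a forward loop rather than the
             counting-down loop above, and relies on #pragma pfor, which is
             described in the next section. The function name vector_add
             and the threshold of 1000 in the if modifier are arbitrary
             illustrations. Compilers that do not know these pragmas
             ignore them and run the loop serially.

```c
/* a[i] = b[i] + c[i] for i in 0..n-1. */
void vector_add(double *a, const double *b, const double *c, int n)
{
    int idx;

    /* Run in parallel only when there is enough work; the
       threshold is an arbitrary value chosen for illustration. */
#pragma parallel shared(a, b, c, n) local(idx) if(n > 1000)
    {
#pragma pfor iterate(idx = 0; n; 1)
        for (idx = 0; idx < n; idx++)
            a[idx] = b[idx] + c[idx];
    }
}
```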


     #pragma pfor

         The pfor is contained within a parallel region.  Use #pragma pfor to
         run a for loop in parallel only if the loop meets all of these
         conditions:

             All the values of the index variable can be computed
             independently of the iterations.

             All iterations are independent of each other - that is, data used
             in one iteration does not depend on data created by another
             iteration. A quick test for independence: if the loop can be run
             backwards, then chances are good the iterations are independent.

             The loop control variable cannot be a field within a
             class/struct/union or an array element.

             The number of times the loop must be executed is determined once,
             upon entry to the loop, and is based on the loop initialization,
             loop test, and loop increment statements.

             If the number of times the loop is actually executed is different
             from what is computed above, the results are unpredictable. This
             can happen if the loop test and increment change during the
             execution of the loop, or if there is an early exit from within
             the for loop. An early exit or a change to the loop test and
             increment during execution may have serious performance
             implications.

             The test or the increment should not contain expressions with
             side effects.

             The chunksize, if specified, is computed before the loop is
             executed, and the behavior is unpredictable if its value changes
             within the loop.

             If you are writing a pfor loop for the multiprocessing C++
             compiler, the index variable i can be declared within the for
             statement via

             int i = 0;

             The draft for the C++ standard states that the scope of the index
             variable declared in a for statement extends to the end of the
             for statement, as in this example:

             #pragma pfor
             for (int i = 0, ...)

             The C++ compiler doesn't enforce this; in fact, with this
             compiler the scope extends to the end of the enclosing block. Use
             care when writing code so that the subsequent change in scope
             rules for i (in later compiler releases) does not affect the user
             code.

         If the code after a pfor is not dependent on the calculations made in
         the pfor loop, there is no reason to synchronize the threads of
         execution before they continue. So, if one thread from the pfor
         finishes early, it can go on to execute the serial code without
         waiting for the other threads to finish their part of the loop.

         The #pragma pfor directive takes several modifiers; the only one that
         is required is iterate. #pragma pfor tells the compiler that each
         iteration of the loop is unique.  It also partitions the iterations
         among the threads for execution.

         The syntax for #pragma pfor is:

         #pragma pfor iterate ( ) optionalmodifiers
         for ...
            { code ... }

         The pfor pragma has several modifiers. Their syntax is:

         iterate (index variable=expr1; expr2; expr3 )
         local(variable list)
         lastlocal (variable list)
         reduction (variable list)
         affinity (variable) = thread (expression)
         schedtype (type)
         chunksize (expr)

         Where:

             iterate (index variable=expr1; expr2; expr3 )

             Gives the multiprocessing C compiler the information it needs to
             identify the unique iterations of the loop and partition them to
             particular threads of execution.

                 index variable is the index variable of the for loop you want
                 to run in parallel.

                 expr1 is the starting value for the loop index.

                 expr2 is the number of iterations for the loop you want to
                 run in parallel.

                 expr3 is the increment of the for loop you want to run in
                 parallel.

             local (variable list)

             Specifies variables that are local to each process. If a variable
             is declared as local, each iteration of the loop is given its own
             uninitialized copy of the variable. You can declare a variable as
             local if its value does not depend on any other iteration of the
             loop and if its value is used only within a single iteration. In
             effect the local variable is just temporary; a new copy can be
             created in each loop iteration without changing the final answer.

             lastlocal (variable list)

             Specifies variables that are local to each process. Unlike with
             the local clause, the compiler saves only the value of the
             logically last iteration of the loop when it exits.

             reduction (variable list)

             Specifies variables involved in a reduction operation. In a
             reduction operation, the compiler keeps local copies of the
             variables and combines them when it exits the loop. An element of
             the reduction list must be an individual variable (also called a
             scalar variable) and cannot be an array or struct. However, it
             can be an individual element of an array. When the reduction
             modifier is used, it appears in the list with the correct
             subscripts.

             One element of an array can be used in a reduction operation,
             while other elements of the array are used in other ways. To
             allow for this, if an element of an array appears in the
             reduction list, the entire array can also appear in the shared
             list.

             The two types of reductions supported are sum(+) and product(*).

             The compiler confirms that the reduction expression is legal by
             making some simple checks. The compiler does not, however, check
             all statements in the loop for illegal reductions. You must
             ensure that the reduction variable is used correctly in a
             reduction operation.

             affinity (variable) = thread (expression)

             The effect of thread-affinity is to execute iteration "i" on the
             thread number given by the user-supplied expression (modulo the
             number of threads). Since the threads may need to evaluate this
             expression in each iteration of the loop, the variables used in
             the expression (other than the loop induction variable) must be
             declared shared and must not be modified during the execution of
             the loop. Violating these rules may lead to incorrect results.

             If the expression does not depend on the loop induction variable,
             then all iterations will execute on the same thread, and will not
             benefit from parallel execution.

             schedtype (type)

             Tells the multiprocessing C compiler how to share the loop
             iterations among the processors. The schedtype chosen depends on
             the type of system you are using and the number of programs
             executing.  You can use the following valid types to modify
             schedtype:

                 simple (the default)

                 Tells the run time scheduler to partition the iterations
                 evenly among all the available threads.

                 runtime

                 Tells the compiler that the real schedule type will be
                 specified at run time.

                 dynamic

                 Tells the run time scheduler to give each thread chunksize
                 iterations of the loop. chunksize should be smaller than
                 (number of total iterations)/(number of threads). The
                 advantage of dynamic over simple is that dynamic helps
                 distribute the work more evenly than simple.

                 Depending on the data, some iterations of a loop can take
                 longer to compute than others, so some threads may finish
                 long before the others.  In this situation, if the iterations
                 are distributed by simple, then a thread that finishes early
                 waits for the others. But if the iterations are distributed
                 by dynamic, the thread doesn't wait, but goes back to get
                 another chunksize iterations until the threads of execution
                 have run all the iterations of the loop.

                 interleave

                 Tells the run time scheduler to give each thread chunksize
                 iterations (described below) of the loop, which are then
                 assigned to the threads in an interleaved way.

                 gss (guided self-scheduling)

                 Tells the run time scheduler to give each processor a varied
                 number of iterations of the loop. This is like dynamic, but
                 instead of a fixed chunksize, the chunks begin with big
                 pieces and end with small pieces.

                 If I iterations remain and P threads are working on them, the
                 piece size is roughly:  I/(2P) + 1

                 Programs with triangular matrices should use gss.
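
                 The piece-size formula can be traced with a small
                 stand-alone sketch. It is an idealized model that keeps
                 P fixed and applies I/(2P) + 1 with integer division;
                 the real scheduler is only described as "roughly"
                 following it. The names gss_piece and gss_schedule are
                 hypothetical.

```c
/* Size of the next piece when `remaining` iterations are left
 * and `threads` threads are working on them. */
int gss_piece(int remaining, int threads)
{
    int piece = remaining / (2 * threads) + 1;

    /* Guard the remaining == 0 case so no piece exceeds what is left. */
    return piece > remaining ? remaining : piece;
}

/* Record successive piece sizes in pieces[] (at most max_pieces of
 * them) and return how many pieces were recorded. */
int gss_schedule(int total, int threads, int *pieces, int max_pieces)
{
    int count = 0;

    while (total > 0 && count < max_pieces) {
        int piece = gss_piece(total, threads);

        pieces[count++] = piece;
        total -= piece;
    }
    return count;
}
```

                 For 100 iterations and 4 threads this model starts with
                 pieces of 13, 11, and 10 iterations and finishes with
                 pieces of a single iteration.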

             chunksize (expr)

             Tells the multiprocessing C/C++ compiler how many iterations
             to define as a chunk when you use the dynamic or interleave
             modifier (described above).

             expr should be a positive integer, and should evaluate to
             approximately:

                  number of iterations / X

             where X is between twice and ten times the number of threads.
             Select twice the number of threads when iterations vary
             slightly. Reduce the chunk size (that is, increase X) as the
             variance among the iterations grows.  Performance gains may
             diminish after increasing X to ten times the number of threads.
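
             A sketch that combines several of the modifiers above: a sum
             reduction scheduled dynamically, with the chunk size computed
             before the loop from the formula. The names suggest_chunksize
             and array_sum are hypothetical, the factor of 4 is an
             arbitrary choice within the suggested range, and the exact
             placement of the reduction variable in the clause lists
             should be checked against the compiler documentation.

```c
/* chunk = iterations / X, where X = factor * threads and factor is
 * clamped to the suggested range of 2 to 10.  Never returns less than 1. */
int suggest_chunksize(int iterations, int threads, int factor)
{
    int chunk;

    if (threads < 1) threads = 1;
    if (factor < 2)  factor = 2;
    if (factor > 10) factor = 10;
    chunk = iterations / (factor * threads);
    return chunk > 0 ? chunk : 1;
}

/* Sum v[0..n-1], assuming `threads` threads will share the work. */
double array_sum(const double *v, int n, int threads)
{
    double total = 0.0;
    int chunk = suggest_chunksize(n, threads, 4);  /* fixed before the loop */
    int i;

#pragma parallel shared(v, n, chunk) local(i)
    {
#pragma pfor iterate(i = 0; n; 1) schedtype(dynamic) chunksize(chunk) reduction(total)
        for (i = 0; i < n; i++)
            total = total + v[i];
    }
    return total;
}
```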








     #pragma one processor

         A #pragma one processor directive causes the statement that follows
         it to be executed by exactly one thread.

         The syntax of this pragma is:

         #pragma one processor

         { code }


     #pragma critical

         Sometimes the bulk of the work done by a loop can be done in
         parallel, but the entire loop cannot run in parallel because of a
         single data-dependent statement. Often, you can move such a statement
         out of the parallel region.  When that is not possible, you can
         sometimes use a lock on the statement to preserve the integrity of
         the data.

         In the multiprocessing C/C++ compiler, use the critical pragma to put
         a lock on a critical statement (or compound statement using { }).
         When you put a lock on a statement, only one thread at a time can
         execute that statement.  If one thread is already working on a
         critical protected statement, any other thread that wants to execute
         that statement must wait until that thread has finished executing it.

         The syntax of the critical pragma is:

         #pragma critical (lock_variable)

         { code }

         The statement(s) after the critical pragma will be executed by all
         threads, one at a time. The lock variable lock_variable is an
         optional integer variable that must be initialized to zero. The
         parentheses are required. If you don't specify a lock variable, the
         compiler automatically supplies one.  Multiple critical constructs
         inside the same parallel region are considered to be independent of
         each other unless they use the same explicit lock variable.
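
         A minimal sketch of a critical section with an explicit lock
         variable. The function name max_value is hypothetical, and
         listing the lock variable as shared is an assumption. Guarding
         the whole comparison serializes most of this loop, so the example
         shows the syntax rather than a profitable use. It requires n >= 1.

```c
/* Return the largest element of v[0..n-1]. */
int max_value(const int *v, int n)
{
    int lock = 0;          /* lock variable must be initialized to zero */
    int best = v[0];
    int i;

#pragma parallel shared(v, n, best, lock) local(i)
    {
        /* n - 1 iterations, starting at index 1 */
#pragma pfor iterate(i = 1; n - 1; 1)
        for (i = 1; i < n; i++) {
#pragma critical (lock)
            {
                /* only one thread at a time tests and updates best */
                if (v[i] > best)
                    best = v[i];
            }
        }
    }
    return best;
}
```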


     #pragma independent

         Running a loop in parallel is a class of parallelism sometimes called
         fine-grained parallelism or homogeneous parallelism. It is called
         homogeneous because all the threads execute the same code on
         different data.  Another class of parallelism is called coarse-
         grained parallelism or heterogeneous parallelism. As the name
         suggests, the code in each thread of execution is different.

         Ensuring data independence for heterogeneous code executed in
         parallel is not always as easy as it is for homogeneous code executed
         in parallel.  (Ensuring data independence for homogeneous code is not
         a trivial task.)

         The independent pragma has no modifiers. Use this pragma to tell the
         multiprocessing C/C++ compiler to run code in parallel with the rest
         of the code in the parallel region.

         The syntax for #pragma independent is:

         #pragma independent

         { code }
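
         A sketch of two independent blocks doing different work on the
         same read-only data. The function name min_and_max is
         hypothetical. Declaring the loop counters inside each block to
         keep them private is an assumption about how block-scope
         variables are treated. It requires n >= 1.

```c
/* Store the smallest and largest elements of v[0..n-1] in *lo and *hi. */
void min_and_max(const int *v, int n, int *lo, int *hi)
{
    int min = v[0];
    int max = v[0];

#pragma parallel shared(v, n, min, max)
    {
#pragma independent
        {
            int i;              /* writes only min */
            for (i = 1; i < n; i++)
                if (v[i] < min)
                    min = v[i];
        }
#pragma independent
        {
            int j;              /* writes only max */
            for (j = 1; j < n; j++)
                if (v[j] > max)
                    max = v[j];
        }
    }
    *lo = min;
    *hi = max;
}
```

         The two blocks are data independent because each writes a
         different variable and both only read v.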


     Synchronization Directives

     To account for data dependencies, it is sometimes necessary for threads
     to wait for all other threads to complete executing an earlier section of
     code.  Two sets of directives implement this coordination: #pragma
     synchronize and #pragma enter/exit gate.


     #pragma synchronize

          A #pragma synchronize tells the multiprocessing C/C++ compiler that
          within a parallel region, no thread can execute the statements that
          follow this pragma until all threads have reached it. This
          directive is a classic barrier construct.

          The syntax for this pragma is:

          #pragma synchronize
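
          A sketch of a barrier between two dependent loops. Because
          threads do not wait at the end of a pfor, the second loop could
          otherwise read elements of a that another thread has not yet
          written. The function name squares_then_mirror_sum is
          hypothetical.

```c
/* Phase 1 fills a[]; phase 2 reads elements of a[] written by
 * other iterations, so all of phase 1 must finish first. */
void squares_then_mirror_sum(int *a, int *b, int n)
{
    int i;

#pragma parallel shared(a, b, n) local(i)
    {
#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++)
            a[i] = i * i;

#pragma synchronize

#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++)
            b[i] = a[i] + a[n - 1 - i];
    }
}
```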



     #pragma enter gate

          #pragma exit gate

          You can use two additional pragmas to coordinate the processing of
          code within a parallel region. These additional pragmas work as a
          matched set.  They are #pragma enter gate and #pragma exit gate.

          A gate is a special barrier. No thread can exit the gate until all
          threads have entered it. This construct gives you more flexibility
          when managing dependencies between the work-sharing constructs
          within a parallel region.

          The syntax of the enter gate pragma is:

          #pragma enter gate

          For example, construct D may be dependent on construct A, and
          construct F may be dependent on construct B. However, you do not
          want to stop at construct D because all the threads have not cleared
          B. By using enter/exit gate pairs, you can make subtle distinctions
          about which construct is dependent on which other construct.

          Put this pragma after the work-sharing construct that all threads
          must clear before the #pragma exit gate of the same name.

          The syntax of the exit gate pragma is:

          #pragma exit gate

          Put this pragma before the work-sharing construct that is dependent
          on the preceding #pragma enter gate. No thread enters this work-
          sharing construct until all threads have cleared the work-sharing
          construct controlled by the corresponding #pragma enter gate.
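
          A sketch of one enter/exit gate pair, following the placement
          rules above: construct D depends on construct A, while construct
          B in between is unrelated and does not hold threads back. The
          function name gated_pipeline is hypothetical, and the use of a
          single unnamed gate is an assumption based on the syntax shown.

```c
void gated_pipeline(int *a, int *b, int *d, int n)
{
    int i;

#pragma parallel shared(a, b, d, n) local(i)
    {
        /* Construct A: produces a[] */
#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++)
            a[i] = 2 * i;

#pragma enter gate

        /* Construct B: independent of A, threads proceed without waiting */
#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++)
            b[i] = i + 1;

#pragma exit gate

        /* Construct D: reads a[], so every thread must have cleared A */
#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++)
            d[i] = a[n - 1 - i];
    }
}
```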


     #pragma page_place

          The syntax of this pragma is:

          #pragma page_place (addr, size, threadnum)

          where addr is the starting address, size is the size in bytes, and
          threadnum is the thread.

          On a system with physically distributed shared memory (for example,
          Origin2000), you can explicitly place all data pages spanned by the
          virtual address range [addr, addr + size-1] in the physical memory
          of the processor corresponding to the specified thread.


SEE ALSO
     cc(1), f77(1), mp(3f), sync(3c), sync(3f), MIPSpro Power C Programmer's
     Guide, MIPSpro C Language Reference Manual, MIPSpro FORTRAN 77
     Programmer's Guide