The ThinkSystem SR650 is a mainstream 2U 2-socket server with industry-leading reliability, management, and security features, and is designed to handle a wide range of workloads.
New to the SR650 is support for up to 24 NVMe solid-state drives. With this support, the SR650 is an excellent choice for workloads that need large amounts of low-latency high-bandwidth storage, including virtualized clustered SAN solutions, software-defined storage, and applications leveraging NVMe over Fabrics (NVMeOF).
This article describes the three new configurations available for the SR650:
You can also learn about the offerings by watching the walk-through video below.
Changes in the April 16 update:
The Lenovo ThinkSystem SR650 is a mainstream 2U 2-socket server with industry-leading reliability, management, and security features, and is designed to handle a wide range of workloads.
New to the SR650 is support for up to 24 NVMe solid-state drives. With this support, the SR650 is an excellent choice for workloads that need large amounts of low-latency high-bandwidth storage, including virtualized clustered SAN solutions, software-defined storage, and applications leveraging NVMe over Fabrics (NVMeOF).
Figure 1. ThinkSystem SR650 with 24 NVMe drives
Three new configurations are now available:
NVMe (Non-Volatile Memory Express) is a technology that overcomes SAS/SATA SSD performance limitations by optimizing hardware and software to take full advantage of flash technology. Intel Xeon processors efficiently transfer data in fewer clock cycles with the NVMe optimized software stack compared to the legacy AHCI stack, thereby reducing latency and overhead. NVMe SSDs connect directly to the processor via the PCIe bus, further reducing latency. NVMe drives are characterized by very high bandwidth and very low latency.
These configurations are available configure-to-order (CTO) in the Lenovo Data Center Solution Configurator (DCSC), https://dcsc.lenovo.com. The following table lists the feature codes related to the NVMe drive subsystem. The configurator will derive any additional components that are needed.
Field upgrades: The 20x NVMe and 24x NVMe drive configurations are also available as field upgrades as described in the Field upgrades section.
Feature code | Description |
---|---|
PCIe Switch Adapters | |
B22D | ThinkSystem 810-4P NVMe Switch Adapter (PCIe x8 adapter with four x4 drive connectors) |
AUV2 | ThinkSystem 1610-4P NVMe Switch Adapter (PCIe x16 adapter with four x4 drive connectors) |
B4PA | ThinkSystem 1610-8P NVMe Switch Adapter (PCIe x16 adapter with four connectors to connect to eight drives) |
NVMe Backplane | |
B4PC | ThinkSystem SR650 2.5" NVMe 8-Bay Backplane |
Riser Cards | |
AUR3 | ThinkSystem SR550/SR590/SR650 x16/x8 PCIe FH Riser 1 Kit (x16+x8 PCIe Riser for Riser 1, for 16- and 20-drive configurations) |
B4PB | ThinkSystem SR650 x16/x8/x16 PCIe Riser1 (x16+x8+x16 PCIe Riser for Riser 1, for 24-drive configurations) |
AURC | ThinkSystem SR550/SR590/SR650 (x16/x8)/(x16/x16) PCIe FH Riser 2 Kit (x16+x16 PCIe Riser for Riser 2, for all three configurations) |
Note the following requirements for any of the three NVMe-rich configurations:
Although not required, it is expected that these configurations will be fully populated with NVMe drives. Maximum performance is achieved when all NVMe drive bays are filled with drives.
To verify support and ensure that the right power supply is chosen for optimal performance, validate your server configuration using the latest version of the Lenovo Capacity Planner:
http://datacentersupport.lenovo.com/us/en/solutions/lnvo-lcp
See the ThinkSystem SR650 product guide for the complete list of NVMe drives that are supported in the server: https://lenovopress.com/lp0644#drives-for-internal-storage
The NVMe drives listed in the following table are not supported in the three NVMe-rich configurations.
Part number | Feature code | Description |
---|---|---|
Unsupported NVMe drives | ||
7SD7A05770 | B11L | ThinkSystem U.2 Intel P4600 6.4TB Mainstream NVMe PCIe3.0 x4 Hot Swap SSD |
7N47A00984 | AUV0 | ThinkSystem U.2 PM963 1.92TB Entry NVMe PCIe 3.0 x4 Hot Swap SSD |
7N47A00985 | AUUU | ThinkSystem U.2 PM963 3.84TB Entry NVMe PCIe 3.0 x4 Hot Swap SSD |
7N47A00095 | AUUY | ThinkSystem U.2 PX04PMB 960GB Mainstream NVMe PCIe 3.0 x4 Hot Swap SSD |
7N47A00096 | AUMF | ThinkSystem U.2 PX04PMB 1.92TB Mainstream NVMe PCIe 3.0 x4 Hot Swap SSD |
7XB7A05923 | AWG6 | ThinkSystem U.2 PX04PMB 800GB Performance NVMe PCIe 3.0 x4 Hot Swap SSD |
7XB7A05922 | AWG7 | ThinkSystem U.2 PX04PMB 1.6TB Performance NVMe PCIe 3.0 x4 Hot Swap SSD |
The 16x NVMe drive configuration has the following features:
The 16x NVMe drive configuration has the following performance characteristics:
In the 16x NVMe drive configuration, the drive bays are configured as follows:
The PCIe slots in the server are configured as follows:
The front and rear views of the SR650 with 16x NVMe drives and 8x SAS/SATA drives are shown in the following figure.
Figure 2. SR650 front and rear views of the 16-NVMe drive configuration
The following figure shows a block diagram of how the PCIe lanes are routed from the processors to the NVMe drives.
Figure 3. SR650 block diagram of the 16-NVMe drive configuration
The details of the connections are listed in the following table.
Drive bay | Drive type | Drive lanes | Adapter | Slot | Host lanes | CPU |
---|---|---|---|---|---|---|
0 | NVMe | PCIe x4 | Onboard NVMe port | None | PCIe x8 | 2 |
1 | NVMe | PCIe x4 | 2 | |||
2 | NVMe | PCIe x4 | Onboard NVMe port | None | PCIe x8 | 2 |
3 | NVMe | PCIe x4 | 2 | |||
4 | NVMe | PCIe x4 | 1610-4P | Slot 6 (Riser 2) | PCIe x16 | 2 |
5 | NVMe | PCIe x4 | 2 | |||
6 | NVMe | PCIe x4 | 2 | |||
7 | NVMe | PCIe x4 | 2 | |||
8 | NVMe | PCIe x4 | 810-4P | Slot 4 (vertical) | PCIe x8 | 1 |
9 | NVMe | PCIe x4 | 1 | |||
10 | NVMe | PCIe x4 | 810-4P | Slot 7 (internal) | PCIe x8 | 1 |
11 | NVMe | PCIe x4 | 1 | |||
12 | NVMe | PCIe x4 | 1610-4P | Slot 1 (Riser 1) | PCIe x16 | 1 |
13 | NVMe | PCIe x4 | 1 | |||
14 | NVMe | PCIe x4 | 1 | |||
15 | NVMe | PCIe x4 | 1 | |||
16 | SAS or SATA | RAID 8i | Slot 3 (Riser 1) | PCIe x8 | 1 | |
17 | SAS or SATA | 1 | ||||
18 | SAS or SATA | 1 | ||||
19 | SAS or SATA | 1 | ||||
20 | SAS or SATA | 1 | ||||
21 | SAS or SATA | 1 | ||||
22 | SAS or SATA | 1 | ||||
23 | SAS or SATA | 1 |
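As a rough check of the lane arithmetic implied by this table: the 16 NVMe drives present 16 x 4 = 64 PCIe lanes, and the host connections listed (two onboard x8 ports, two x16 switch adapters, and two x8 switch adapters) also total 64 lanes, so every drive has a full x4 path to a processor with no oversubscription.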
The 20x NVMe drive configuration has the following features:
The 20x NVMe drive configuration has the following performance characteristics:
The PCIe slots in the server are configured as follows:
The front and rear views of the SR650 with 20x NVMe drives are shown in the following figure.
Figure 4. SR650 front and rear views of the 20-NVMe drive configuration
The following figure shows a block diagram of how the PCIe lanes are routed from the processors to the NVMe drives.
Figure 5. SR650 block diagram of the 20-NVMe drive configuration
The details of the connections are listed in the following table.
Drive bay | Drive type | Drive lanes | Adapter | Slot | Host lanes | CPU | |
---|---|---|---|---|---|---|---|
0 | NVMe | PCIe x4 | Onboard NVMe port | None | PCIe x8 | 2 | |
1 | NVMe | PCIe x4 | 2 | ||||
2 | NVMe | PCIe x4 | Onboard NVMe port | None | PCIe x8 | 2 | |
3 | NVMe | PCIe x4 | 2 | ||||
4 | NVMe | PCIe x4 | 1610-4P | Slot 6 (Riser 2) | PCIe x16 | 2 | |
5 | NVMe | PCIe x4 | 2 | ||||
6 | NVMe | PCIe x4 | 2 | ||||
7 | NVMe | PCIe x4 | 2 | ||||
8 | NVMe | PCIe x4 | 1610-4P | Slot 5 (Riser 2) | PCIe x16 | 2 | |
9 | NVMe | PCIe x4 | 2 | ||||
10 | NVMe | PCIe x4 | 2 | ||||
11 | NVMe | PCIe x4 | 2 | ||||
12 | NVMe | PCIe x4 | 810-4P | Slot 4 (vertical) | PCIe x8 | 1 | |
13 | NVMe | PCIe x4 | 1 | ||||
14 | NVMe | PCIe x4 | 810-4P | Slot 7 (internal) | PCIe x8 | 1 | |
15 | NVMe | PCIe x4 | 1 | ||||
16 | NVMe | PCIe x4 | 1610-4P | Slot 1 (Riser 1) | PCIe x16 | 1 | |
17 | NVMe | PCIe x4 | 1 | ||||
18 | NVMe | PCIe x4 | 1 | ||||
19 | NVMe | PCIe x4 | 1 | ||||
20 | Blank bay - no connection | ||||||
21 | Blank bay - no connection | ||||||
22 | Blank bay - no connection | ||||||
23 | Blank bay - no connection |
The 24x NVMe drive configuration has the following features:
The 24x NVMe drive configuration has the following performance characteristics:
The PCIe slots in the server are configured as follows:
The front and rear views of the SR650 with 24x NVMe drives are shown in the following figure.
Figure 6. SR650 front and rear views of the 24-NVMe drive configuration
The following figure shows a block diagram of how the PCIe lanes are routed from the processors to the NVMe drives.
Figure 7. SR650 block diagram of the 24-NVMe drive configuration
The details of the connections are listed in the following table.
Drive bay | Drive type | Drive lanes | Adapter | Slot | Host lanes | CPU | |
---|---|---|---|---|---|---|---|
0 | NVMe | PCIe x4 | 810-4P | Slot 6 (Riser 2) | PCIe x8 | 2 | |
1 | NVMe | PCIe x4 | |||||
2 | NVMe | PCIe x4 | 2 | ||||
3 | NVMe | PCIe x4 | |||||
4 | NVMe | PCIe x4 | 1610-8P | Slot 1 (Riser 1) | PCIe x16 (from onboard NVMe ports) | 2 | |
5 | NVMe | PCIe x4 | |||||
6 | NVMe | PCIe x4 | 2 | ||||
7 | NVMe | PCIe x4 | |||||
8 | NVMe | PCIe x4 | 2 | ||||
9 | NVMe | PCIe x4 | |||||
10 | NVMe | PCIe x4 | 2 | ||||
11 | NVMe | PCIe x4 | |||||
12 | NVMe | PCIe x4 | 810-4P | Slot 4 (vertical) | PCIe x8 | 1 | |
13 | NVMe | PCIe x4 | |||||
14 | NVMe | PCIe x4 | 1 | ||||
15 | NVMe | PCIe x4 | |||||
16 | NVMe | PCIe x4 | 810-4P | Slot 7 (internal) | PCIe x8 | 1 | |
17 | NVMe | PCIe x4 | |||||
18 | NVMe | PCIe x4 | 1 | ||||
19 | NVMe | PCIe x4 | |||||
20 | NVMe | PCIe x4 | 810-4P | Slot 2 (Riser 1) | PCIe x8 | 1 | |
21 | NVMe | PCIe x4 | |||||
22 | NVMe | PCIe x4 | 1 | ||||
23 | NVMe | PCIe x4 |
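As a rough check of the lane arithmetic implied by this table: the 24 NVMe drives present 24 x 4 = 96 PCIe lanes, while the five switch adapters connect to the processors through 8 + 16 + 8 + 8 + 8 = 48 host lanes, so the PCIe switches oversubscribe the host connections by approximately 2:1 when all drives are driven at full bandwidth simultaneously.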
The following two field upgrade option kits are available to upgrade existing SAS/SATA or AnyBay drive configurations based on the 24x 2.5" chassis (feature code AUVV) to either the 20-drive or 24-drive NVMe configurations.
Part number | Feature code | Description |
---|---|---|
4XH7A09819 | B64L | ThinkSystem SR650 U.2 20-Bays Upgrade Kit |
4XH7A08810 | B64K | ThinkSystem SR650 U.2 24-Bays Upgrade Kit |
These kits include drive backplanes and required NVMe cables, power cables, drive bay fillers, and NVMe switch adapters.
No 16-drive upgrade kit: There is no upgrade kit for the 16x NVMe drive configuration.
The ThinkSystem SR650 U.2 20-Bays Upgrade Kit includes the following components:
The ThinkSystem SR650 U.2 24-Bays Upgrade Kit includes the following components:
For more information, see these resources:
Product families related to this document are the following:
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
AnyBay®
ThinkSystem
The following terms are trademarks of other companies:
Intel® and Xeon® are trademarks of Intel Corporation or its subsidiaries.
Other company, product, or service names may be trademarks or service marks of others.
Single Process Multiple Data (SPMD) Model:
prun [ options ] <program> [ <args> ]
Multiple Instruction Multiple Data (MIMD) Model:
prun [ global_options ] [ local_options1 ]
<program1> [ <args1> ] : [ local_options2 ]
<program2> [ <args2> ] : ... :
[ local_optionsN ]
<programN> [ <argsN> ]
Note that in both models, invoking prun via an absolute path name is equivalent to specifying the --prefix option with a <dir> value equivalent to the directory where prun resides, minus its last subdirectory. For example:
% /usr/local/bin/prun ...
is equivalent to
% prun --prefix /usr/local
% prun [ -np X ] [ --hostfile <filename> ] <program>
This will run X copies of <program> in your current run-time environment (if running under a supported resource manager, PSRVR's prun will usually automatically use the corresponding resource manager process starter, as opposed to, for example, rsh or ssh, which require the use of a hostfile, or will default to running all X copies on the localhost), scheduling (by default) in a round-robin fashion by CPU slot. See the rest of this page for more details.
Please note that prun automatically binds processes. Three binding patterns are used in the absence of any further directives:
If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.
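For instance, a sketch launching a hypothetical threaded application completely unbound might look like:
% prun -np 4 --bind-to none ./my_threaded_app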
Use one of the following options to specify which hosts (nodes) within the psrvr to run on.
The following options specify the number of processes to launch. Note that none of the options imply a particular binding policy - e.g., requesting N processes for each socket does not imply that the processes will be bound to the socket.
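For instance, a sketch combining the ppr mapping syntax shown later on this page with an explicit binding request (both the host names and the per-socket count are illustrative):
% prun -H aa,bb --map-by ppr:2:socket --bind-to socket ./a.out
This requests two processes per processor socket on each host and, separately, asks that each process be bound to its socket.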
To map processes:
To order processes' ranks:
For process binding:
For rankfiles:
To manage standard I/O:
To manage files and runtime environment:
The parser for the -x option is not very sophisticated; it does not even understand quoted values. Users are advised to set variables in the environment, and then use -x to export (not define) them.
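For instance, a sketch exporting an existing variable rather than defining it on the prun command line (MY_ENV_VAR is a placeholder, and a Bourne-style shell is assumed):
% export MY_ENV_VAR=some_value
% prun -np 4 -x MY_ENV_VAR ./a.out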
Setting MCA parameters:
For debugging:
There are also other options:
The following options are useful for developers; they are not generally useful to most users:
There may be other options listed with prun --help.
If the application is multiple instruction multiple data (MIMD), comprising multiple programs, the set of programs and arguments can be specified in one of two ways: Extended Command Line Arguments, and Application Context.
An application context describes the MIMD program set including all arguments in a separate file. This file essentially contains multiple prun command lines, less the command name itself. The ability to specify different options for different instantiations of a program is another reason to use an application context.
Extended command line arguments allow for the description of the application layout on the command line using colons (:) to separate the specification of programs and arguments. Some options are globally set across all specified programs (e.g. --hostfile), while others are specific to a single program (e.g. -np).
For example,
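two different executables (the names prog_a and prog_b are placeholders) can each be launched with its own process count by separating them with a colon - an illustrative sketch:
% prun -np 2 ./prog_a : -np 4 ./prog_b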
Or, consider the hostfile
% cat myhostfile
aa slots=2
bb slots=2
cc slots=2
Here, we list both the host names (aa, bb, and cc) and how many 'slots' there are for each. Slots indicate how many processes can potentially execute on a node. For best performance, the number of slots may be chosen to be the number of cores on the node or the number of processor sockets. If the hostfile does not provide slots information, PSRVR will attempt to discover the number of cores (or hwthreads, if the use-hwthreads-as-cpus option is set) and set the number of slots to that value. This default behavior also occurs when specifying the -host option with a single hostname; an illustrative command is sketched below.
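For example, a sketch reusing host aa from the hostfile above would launch as many copies of a.out as there are detected slots on that node:
% prun -H aa ./a.out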
When running under resource managers (e.g., SLURM, Torque, etc.), PSRVR will obtain both the hostnames and the number of slots directly from the resource manager.
The number of processes launched can be specified as a multiple of the number of nodes or processor sockets available. For example,
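using the ppr:N:socket mapping syntax that appears later on this page (an illustrative sketch, with hosts aa and bb from the hostfiles on this page), two processes per processor socket could be launched on each listed host with:
% prun -H aa,bb --map-by ppr:2:socket ./a.out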
Another alternative is to specify the number of processes with the -np option. Consider now the hostfile
% cat myhostfile
aa slots=4
bb slots=4
cc slots=4
Now,
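a sketch launching six processes against the 12 slots listed above would fill slots in order, placing ranks 0-3 on aa and ranks 4-5 on bb while leaving cc unused:
% prun -hostfile myhostfile -np 6 ./a.out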
Consider the same hostfile as above, again with -np 6:
                      node aa       node bb       node cc
prun                  0 1 2 3       4 5
prun --map-by node    0 3           1 4           2 5
prun -nolocal                       0 1 2 3       4 5
The --map-by node option will load balance the processes across the available nodes, numbering each process in a round-robin fashion.
The -nolocal option prevents any processes from being mapped onto the local host (in this case node aa). While prun typically consumes few system resources, -nolocal can be helpful for launching very large jobs where prun may actually need to use noticeable amounts of memory and/or processing time.
Just as -np can specify fewer processes than there are slots, it can also oversubscribe the slots. For example, with the same hostfile:
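a sketch such as the following would request 14 processes even though the hostfile above provides only 12 slots, oversubscribing the nodes by two processes:
% prun -hostfile myhostfile -np 14 ./a.out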
One can also specify limits to oversubscription. For example, with the same hostfile:
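adding --nooversubscribe to the same 14-process sketch would be expected to fail, since only 12 slots are available and oversubscription is no longer permitted:
% prun -hostfile myhostfile -np 14 --nooversubscribe ./a.out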
Limits to oversubscription can also be specified in the hostfile itself:
% cat myhostfile
aa slots=4 max_slots=4
bb max_slots=4
cc slots=4
The max_slots field specifies such a limit. When it does, the slots value defaults to the limit. Now:
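with this hostfile, the 14-process sketch from above would again fill the 12 listed slots, and the two extra processes could only be placed on node cc, since aa and bb are capped by max_slots (an informal reading of the semantics described here):
% prun -hostfile myhostfile -np 14 ./a.out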
Using the --nooversubscribe option can be helpful since PSRVR currently does not get 'max_slots' values from the resource manager.
Of course, -np can also be used with the -H or -host option. For example,
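a sketch such as the following launches eight processes across just the two named hosts; once their detected slots are filled, the remaining processes oversubscribe those hosts:
% prun -H aa,bb -np 8 ./a.out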
And here is a MIMD example:
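A sketch using the standard hostname and uptime utilities:
% prun -H aa -np 1 hostname : -H bb,cc -np 2 uptime
This runs one copy of hostname on node aa and two copies of uptime across nodes bb and cc.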
The mapping step is used to assign a default location to each process based on the mapper being employed. Mapping by slot, node, and sequentially results in the assignment of the processes to the node level. In contrast, mapping by object allows the mapper to assign the process to an actual object on each node.
Note: the location assigned to the process is independent of where it will be bound - the assignment is used solely as input to the binding algorithm.
The mapping of processes to nodes can be defined not just with general policies but also, if necessary, using arbitrary mappings that cannot be described by a simple policy. One can use the 'sequential mapper,' which reads the hostfile line by line, assigning processes to nodes in whatever order the hostfile specifies. Use the -pmca rmaps seq option. For example, using the same hostfile as before:
prun -hostfile myhostfile -pmca rmaps seq ./a.out
will launch three processes, one on each of nodes aa, bb, and cc, respectively. The slot counts don't matter; one process is launched per line on whatever node is listed on the line.
Another way to specify arbitrary mappings is with a rankfile, which gives you detailed control over process binding as well. Rankfiles are discussed below.
The second phase focuses on the ranking of the processes within the job. PSRVR separates this from the mapping procedure to allow more flexibility in the relative placement of processes. This is best illustrated by considering the following two cases where we used the --map-by ppr:2:socket option:
                       node aa        node bb
rank-by core           0 1 ! 2 3      4 5 ! 6 7
rank-by socket         0 2 ! 1 3      4 6 ! 5 7
rank-by socket:span    0 4 ! 1 5      2 6 ! 3 7
Ranking by core and by slot provide the identical result - a simple progression of ranks across each node. Ranking by socket does a round-robin ranking within each node until all processes have been assigned a rank, and then progresses to the next node. Adding the span modifier to the ranking directive causes the ranking algorithm to treat the entire allocation as a single entity - thus, the MCW ranks are assigned across all sockets before circling back around to the beginning.
The binding phase actually binds each process to a given set of processors. This can improve performance if the operating system is placing processes suboptimally. For example, it might oversubscribe some multi-core processor sockets, leaving other sockets idle; this can lead processes to contend unnecessarily for common resources. Or, it might spread processes out too widely; this can be suboptimal if application performance is sensitive to interprocess communication costs. Binding can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.
The processors to be used for binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a --bind-to l3cache directive will cause the process to be bound to the processors that share a single L3 cache within that socket.
To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a job is mapped to the socket level, and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding algorithm will assign each process located on a socket to a unique core in a round-robin manner.
Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located. In this manner, users can exert detailed control over relative MCW rank location and binding.
Finally, --report-bindings can be used to report bindings.
As an example, consider a node with two processor sockets, each comprising four cores. We run prun with -np 4 --report-bindings and the following additional options:
% prun ... --map-by core --bind-to core
[...] ... binding child [...,0] to cpus 0001
[...] ... binding child [...,1] to cpus 0002
[...] ... binding child [...,2] to cpus 0004
[...] ... binding child [...,3] to cpus 0008
% prun ... --map-by socket --bind-to socket
[...] ... binding child [...,0] to socket 0 cpus 000f
[...] ... binding child [...,1] to socket 1 cpus 00f0
[...] ... binding child [...,2] to socket 0 cpus 000f
[...] ... binding child [...,3] to socket 1 cpus 00f0
% prun ... --map-by core:PE=2 --bind-to core
[...] ... binding child [...,0] to cpus 0003
[...] ... binding child [...,1] to cpus 000c
[...] ... binding child [...,2] to cpus 0030
[...] ... binding child [...,3] to cpus 00c0
% prun ... --bind-to none
Here, --report-bindings shows the binding of each process as a mask. In the first case, the processes bind to successive cores as indicated by the masks 0001, 0002, 0004, and 0008. In the second case, processes bind to all cores on successive sockets as indicated by the masks 000f and 00f0. The processes cycle through the processor sockets in a round-robin fashion as many times as are needed. In the third case, the masks show us that 2 cores have been bound per process. In the fourth case, binding is turned off and no bindings are reported.
PSRVR's support for process binding depends on the underlying operating system. Therefore, certain process binding options may not be available on every system.
Process binding can also be set with MCA parameters. Their usage is less convenient than that of prun options. On the other hand, MCA parameters can be set not only on the prun command line, but alternatively in a system or user mca-params.conf file or as environment variables, as described in the MCA section below. Some examples include:
prun option MCA parameter key value
--map-by core rmaps_base_mapping_policy core
--map-by socket rmaps_base_mapping_policy socket
--rank-by core rmaps_base_ranking_policy core
--bind-to core hwloc_base_binding_policy core
--bind-to socket hwloc_base_binding_policy socket
--bind-to none hwloc_base_binding_policy none
rank <N>=<hostname> slot=<slot list>
For example:
$ cat myrankfile
rank 0=aa slot=1:0-2
rank 1=bb slot=0:0,1
rank 2=cc slot=1-2
$ prun -H aa,bb,cc,dd -rf myrankfile ./a.out
Means that
Rank 0 runs on node aa, bound to logical socket 1, cores0-2.
Rank 1 runs on node bb, bound to logical socket 0, cores 0 and 1.
Rank 2 runs on node cc, bound to logical cores 1 and 2.
Rankfiles can alternatively be used to specify physical processor locations. In this case, the syntax is somewhat different. Sockets are no longer recognized, and the slot number given must be the number of the physical PU, as most OSes do not assign a unique physical identifier to each core in the node. Thus, a proper physical rankfile looks something like the following:
$ cat myphysicalrankfile
rank 0=aa slot=1
rank 1=bb slot=8
rank 2=cc slot=6
This means that
Rank 0 will run on node aa, bound to the core that contains physical PU 1
Rank 1 will run on node bb, bound to the core that contains physical PU 8
Rank 2 will run on node cc, bound to the core that contains physical PU 6
Rankfiles are treated as logical by default, and the MCA parameter rmaps_rank_file_physical must be set to 1 to indicate that the rankfile is to be considered as physical.
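For example, a sketch combining this parameter with the -pmca and -rf options shown elsewhere on this page:
% prun -pmca rmaps_rank_file_physical 1 -rf myphysicalrankfile ./a.out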
The hostnames listed above are 'absolute,' meaning that actual resolvable hostnames are specified. However, hostnames can also be specified as 'relative,' meaning that they are specified in relation to an externally-specified list of hostnames (e.g., by prun's --host argument, a hostfile, or a job scheduler).
The 'relative' specification is of the form '+n<X>', where X is an integer specifying the Xth hostname in the set of all available hostnames, indexed from 0. For example:
$ cat myrankfile
rank 0=+n0 slot=1:0-2
rank 1=+n1 slot=0:0,1
rank 2=+n2 slot=1-2
$ prun -H aa,bb,cc,dd -rf myrankfile ./a.out
All socket/core slot locations are specified as logical indexes. You can use tools such as HWLOC's 'lstopo' to find the logical indexes of sockets and cores.
If a relative directory is specified, it must be relative to the initial working directory determined by the specific starter used. For example, when using the rsh or ssh starters, the initial directory is $HOME by default. Other starters may set the initial directory to the current working directory from the invocation of prun.
If the -wdir option appears both in a context file and on the command line, the context file directory will override the command line value.
If the -wdir option is specified, prun will attempt to change to the specified directory on all of the remote nodes. If this fails, prun will abort.
If the -wdir option is not specified, prun will send the directory name where prun was invoked to each of the remote nodes. The remote nodes will try to change to that directory. If they are unable (e.g., if the directory does not exist on that node), then prun will use the default directory determined by the starter.
All directory changing occurs before the user's program is invoked.
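For example, a sketch starting every process in the same scratch directory (the path is a placeholder):
% prun -wdir /tmp/run1 -np 4 ./a.out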
PSRVR directs UNIX standard output and error from remote nodes to the node that invoked prun and prints it on the standard output/error of prun. Local processes inherit the standard output/error of prun and transfer to it directly.
Thus it is possible to redirect standard I/O for applications by using the typical shell redirection procedure on prun.
% prun -np 2 my_app < my_input > my_output
Note that in this example only the rank 0 process will receive the stream from my_input on stdin. The stdin on all the other nodes will be tied to /dev/null. However, the stdout from all nodes will be collected into the my_output file.
SIGUSR1 and SIGUSR2 signals received by prun are propagated to all processes in the job.
A SIGTSTP signal to prun will cause a SIGSTOP signal to be sent to all of the programs started by prun, and likewise a SIGCONT signal to prun will cause a SIGCONT to be sent.
Other signals are not currently propagated by prun.
See the 'Remote Execution' section for more details.
However, it is not always desirable or possible to edit shell startup files to set PATH and/or LD_LIBRARY_PATH. The --prefix option is provided for some simple configurations where this is not possible.
The --prefix option takes a single argument: the base directory on the remote node where PSRVR is installed. PSRVR will use this directory to set the remote PATH and LD_LIBRARY_PATH before executing any user applications. This allows running jobs without having pre-configured the PATH and LD_LIBRARY_PATH on the remote nodes.
PSRVR adds the basename of the current node's 'bindir' (the directory where PSRVR's executables are installed) to the prefix and uses that to set the PATH on the remote node. Similarly, PSRVR adds the basename of the current node's 'libdir' (the directory where PSRVR's libraries are installed) to the prefix and uses that to set the LD_LIBRARY_PATH on the remote node. For example:
If the following command line is used:
% prun --prefix /remote/node/directory
PSRVR will add '/remote/node/directory/bin' to the PATH and '/remote/node/directory/lib64' to the LD_LIBRARY_PATH on the remote node before attempting to execute anything.
The --prefix option is not sufficient if the installation paths on the remote node are different from those on the local node (e.g., if '/lib' is used on the local node, but '/lib64' is used on the remote node), or if the installation paths are something other than a subdirectory under a common prefix.
Note that executing prun via an absolute pathname is equivalent to specifying --prefix without the last subdirectory in the absolute pathname to prun. For example:
% /usr/local/bin/prun ...
is equivalent to
% prun --prefix /usr/local
The -pmca switch takes two arguments: <key> and <value>. The <key> argument generally specifies which MCA module will receive the value. For example, the <key> 'btl' is used to select which BTL is to be used for transporting messages. The <value> argument is the value that is passed. For example:
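the following sketch passes the value 'tcp,self' (an illustrative value) to the 'btl' key for a single-process run:
% prun -pmca btl tcp,self -np 1 ./a.out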
The -pmca switch can be used multiple times to specify different <key> and/or <value> arguments. If the same <key> is specified more than once, the <value>s are concatenated with a comma (',') separating them.
Note that the -pmca switch is simply a shortcut for setting environment variables. The same effect may be accomplished by setting corresponding environment variables before running prun. The form of the environment variables that PSRVR sets is:
PMIX_MCA_<key>=<value>
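For example, the -pmca sketch above could equivalently be expressed by exporting the corresponding variable before launching (assuming a Bourne-style shell):
% export PMIX_MCA_btl=tcp,self
% prun -np 1 ./a.out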
Thus, the -pmca switch overrides any previously set environment variables. The -pmca settings similarly override MCA parameters set in the $OPAL_PREFIX/etc/psrvr-mca-params.conf or $HOME/.psrvr/mca-params.conf file.
Unknown <key> arguments are still set as environment variables -- they are not checked (by prun) for correctness. Illegal or incorrect <value> arguments may or may not be reported -- it depends on the specific MCA module.
To find the available component types under the MCA architecture, or to find the available parameters for a specific component, use the pinfo command. See the pinfo(1) man page for detailed information on the command.
A valid line in the file may contain zero or many '-x', '-pmca', or '--pmca' arguments. The following patterns are supported: -pmca var val, -pmca var 'val', -x var=val, -x var. If any argument is duplicated in the file, the last value read will be used.
MCA parameters and environment variables specified on the command line have higher precedence than variables specified in the file.
Reflecting this advice, prun will refuse to run as root by default. To override this default, you can add the --allow-run-as-root option to the prun command line.
By default, PSRVR records and notes that processes exited with non-zero termination status. This is generally not considered an 'abnormal termination' - i.e., PSRVR will not abort a job if one or more processes return a non-zero status. Instead, the default behavior simply reports the number of processes terminating with non-zero status upon completion of the job.
However, in some cases it can be desirable to have the job abort when any process terminates with non-zero status. For example, a non-PMIx job might detect a bad result from a calculation and want to abort, but doesn't want to generate a core file. Or a PMIx job might continue past a call to PMIx_Finalize, but indicate that all processes should abort due to some post-PMIx result.
It is not anticipated that this situation will occur frequently. However, in the interest of serving the broader community, PSRVR now has a means for allowing users to direct that jobs be aborted upon any process exiting with non-zero status. Setting the MCA parameter 'orte_abort_on_non_zero_status' to 1 will cause PSRVR to abort all processes once any process exits with non-zero status.
Terminations caused in this manner will be reported on the console as an 'abnormal termination', with the first process to so exit identified along with its exit status.
If the --timeout command line option is used and the timeout expires before the job completes (thereby forcing prun to kill the job), prun will return an exit status equivalent to the value of ETIMEDOUT (which is typically 110 on Linux and OS X systems).
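For example, a sketch that checks the exit status from the shell (assuming --timeout takes a limit in seconds):
% prun --timeout 60 -np 2 ./a.out
% echo $?
If the 60-second limit expired and prun killed the job, the echoed status would typically be 110 on Linux.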