Underutilized: E131 - Too Many Unused Nodes Message

This error occurs when only 1 of the nodes assigned to a job is being used and all the others are idle. In this example, 31 nodes were found to be inactive. The load value represents the number of user processes running on a node, and ideally there should be 1 user process per core. This job asked for 32 nodes with 20 cores per node, so the load should have been about 20 on each node, or about 640 in total. Instead, all the processes were started on one node, smic252 (see line 9), which carried a load of 639.92 by itself. This caused resource starvation on that node: the job ran for almost 4 hours of wall-clock time, but no CPU time was consumed (see line 14). Such a situation could destabilize the system, so the job was terminated (see line 5).

Information collected by the PBS job manager is shown in lines 13 to 32. User account and allocation information is shown in lines 33 to 42.

 1)  E131 - Too many unused nodes
 2)  Job 74236 has 31 unused nodes.
 3)  Please correct this problem.
 4)
 5)  Job deleted
 6)
 7)  PBS job: 74236, nodes: 32
 8)  Hostname Days Load CPU U#(User:Process:VirtualMemory:Memory:Hours)
 9)  smic252     34    639.92   0       0
10)  smic253     34    0.16       0       0
11)  smic254     34    0.11       0       0
12)  . . . 29 similar lines removed . . .
13)  PBS_job=74236 user=flast allocation=hpc_alloc03
14)  queue=checkpt total_load=641.57 cpu_hours=0.00 wall_hours=3.90
15)  unused_nodes=31 total_nodes=32 ppn=20 avg_load=20.04
16)  avg_cpu=0% avg_mem=0mb avg_vmem=0mb top_proc=none:0.0hr:0%
17)  node_processes=0
18)
19)  Node statistics::
20)  Number of nodes: 32
21)  Number of cores: 640
22)  Total physical memory per node: 64364mb
23)  Average memory usage per node: 0mb, 0%
24)  Average memory usage per core: 0mb
25)  Average virtual memory usage per node: 0mb
26)  Average virtual memory usage per core: 0mb
27)  Average CPU percent per node: 0%
28)  Average CPU percent per core: 0%
29)  Average load per node: 0.02
30)  Reverified average load per node: 19.89
31)  Effective maximum load on a node: 635.08
32)
33)  Name:  First Last
34)  Mail:  flast@somewhere.lsu.edu
35)  Affil: First Last
36)  Category:
37)  Name:  First Last
38)  Mail:  flast@somewhere.lsu.edu
39)  Affil: First Last
40)  Category: validation:current:02/22/2011
41)  Allocations:
42)  hpc_alloc03,flast,1578202.88,default
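
The arithmetic behind this message can be reproduced with a short script. The following is a minimal sketch in Python, not the tool that generates the message. The key=value fields and node loads are copied from the sample above, while the helper names (parse_summary, idle_nodes) and the idle threshold of 1.0 are assumptions made for illustration only.

```python
# Sketch only: field names and values come from the sample message above.
# The helper names and the idle threshold are assumptions for illustration.

def parse_summary(lines):
    """Collect key=value tokens from the summary lines into a dict of strings."""
    fields = {}
    for line in lines:
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                fields[key] = value
    return fields


def idle_nodes(node_loads, threshold=1.0):
    """Return hostnames whose load is below the (assumed) idle threshold."""
    return [host for host, load in node_loads.items() if load < threshold]


# Lines 14 and 15 of the sample message.
summary = parse_summary([
    "queue=checkpt total_load=641.57 cpu_hours=0.00 wall_hours=3.90",
    "unused_nodes=31 total_nodes=32 ppn=20 avg_load=20.04",
])

total_nodes = int(summary["total_nodes"])
ppn = int(summary["ppn"])                  # cores (processes) per node
expected_node_load = ppn                   # one user process per core
expected_total_load = total_nodes * ppn    # 32 nodes * 20 cores

# Only the three node lines shown in the sample; the other 29 were elided there.
node_loads = {"smic252": 639.92, "smic253": 0.16, "smic254": 0.11}

overloaded = [host for host, load in node_loads.items()
              if load > expected_node_load]

print("expected total load:", expected_total_load)
print("idle nodes:", idle_nodes(node_loads))
print("overloaded nodes:", overloaded)
```

For a healthy job, every node would report a load close to the ppn value, so both the idle list and the overloaded list would be empty.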

Users may direct questions to sys-help@loni.org.