Underutilized Error Message Terms

Average CPU percent per core
- Average CPU usage per core. This is the percentage of time a processing core spends on user processing rather than servicing I/O or system needs. 100% means the core is fully committed to user processing. A value near 0% indicates a serious problem, such as I/O saturation caused by virtual memory disk thrashing.
Average CPU percent per node
- Average CPU percent per core times number of cores. This can give an idea of how node resources are being used overall.
Average memory usage per core
- The average amount of memory used by a core for its processing. It should normally be less than the total memory available per node, divided by the number of cores. This leaves some memory free for system needs. Some jobs may choose to use fewer cores to allow higher memory usage by active cores.
Average memory usage per node
- Average memory per core times number of cores.
Average virtual memory usage per core
- The amount of virtual (disk-based) memory used per core.
Average virtual memory usage per node
- Average virtual memory per core times number of cores.
avg_cpu
- The same as "Average CPU percent per node" above.
avg_load
- The load summed from all assigned nodes divided by the number of nodes.
avg_mem
- Total memory usage from all assigned nodes divided by the number of nodes.
avg_vmem
- Total virtual memory usage from all assigned nodes divided by the number of nodes.
cpu_hours
- Sum of CPU time consumed by user processes from all available cores.
gb
- Data size in billions of bytes (10^9 bytes).
load
- Number of running user processes.
mb
- Data size in millions of bytes (10^6 bytes).
M
- Data size in megabytes (2^20 bytes, or 1,048,576 bytes).
ppn
- Number of processors (cores) per node.
reverified_avg_load
- A second sampling of avg_load (the load summed from all assigned nodes divided by the number of nodes), taken to double-check the first measurement.
top_proc
- The user process identified as the top CPU time consumer. The information is displayed in a ":" delimited format (but without spaces): "user name : command name : node (host) name : memory used : virtual memory used : wall clock hours : cpu%"
toppm
- The user process identified as the top memory consumer. The information is displayed in a ":" delimited format (but without spaces): "user name : command name : node (host) name : memory used : virtual memory used"
total_load
- Sum of the user processes running on all nodes assigned to the job.
total_nodes
- Number of nodes assigned to the job.
wall_hours
- Elapsed job time as measured by a wall clock.
unused_nodes
- Number of assigned nodes measured as being idle.
user
- User login name.
virtual memory
- When system random access memory (RAM) is exhausted, the system uses space on disk to store data. Reading from and writing to disk is vastly slower than accessing system memory, and can bring processing to a halt. Heavy use of virtual memory leads to a situation called disk thrashing, where all the time is spent in disk I/O and nothing is available for CPU processing. On some systems, the OOM (out-of-memory) killer may start killing processes to free memory and inadvertently stop critical processes, crashing the node.
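
As an illustration only, the arithmetic behind several of the terms above, and the ":" delimited layout of top_proc, can be sketched in Python. This is a minimal sketch and not the tool that produces the message. The function names, the TopProc type, and the sample values are all hypothetical, and the memory and time fields are kept as text because the glossary does not state their units.

```python
from typing import NamedTuple, Sequence

# Unit sizes as defined in the glossary.
GB = 10**9   # "gb": billions of bytes
MB = 10**6   # "mb": millions of bytes
M = 2**20    # "M": 1,048,576 bytes


def per_node(per_core_value: float, ppn: int) -> float:
    """Per-node figure = per-core average times cores per node (ppn)."""
    return per_core_value * ppn


def average_over_nodes(per_node_values: Sequence[float]) -> float:
    """avg_load, avg_mem, avg_vmem: sum over assigned nodes / number of nodes."""
    return sum(per_node_values) / len(per_node_values)


class TopProc(NamedTuple):
    """Hypothetical container for the seven top_proc fields, all kept as text."""
    user: str
    command: str
    node: str
    memory: str
    virtual_memory: str
    wall_hours: str
    cpu_percent: str


def parse_top_proc(field: str) -> TopProc:
    """Split a top_proc value on ':'; assumes no field itself contains ':'."""
    parts = field.split(":")
    if len(parts) != 7:
        raise ValueError(f"expected 7 ':'-separated fields, got {len(parts)}")
    return TopProc(*parts)


if __name__ == "__main__":
    # Made-up numbers: 4 cores per node, each averaging 50% CPU.
    print(per_node(50.0, 4))
    # Made-up loads on 4 assigned nodes, one of them idle.
    print(average_over_nodes([4, 4, 0, 4]))
    # Made-up top_proc value; real values may format the fields differently.
    print(parse_top_proc("jdoe:a.out:node042:512M:1024M:1.5:98.7"))
```

In this sketch, an avg_load well below ppn, or a node whose load is 0, corresponds to the kind of idle resources that unused_nodes reports.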

Users may direct questions to sys-help@loni.org.