Troubleshooting a Slow Linux System

From Leo's Notes
Last edited on 30 December 2021, at 02:08.

An odd issue with a slow Linux server was brought to my attention. The server is a Dell PowerEdge 850 and is pretty beasty, with 80 logical CPUs (4 sockets, 10 cores per socket, 2 threads per core) and 512GB of memory. All the usual causes were ruled out:

  • Not running low on memory or swapping
  • Not low on CPU cycles (it's mostly idle)
  • Disks are not dying
  • Network mounts aren't a problem
  • No crazy interrupts or context switching occurring
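One way to put numbers on the last item is to sample the kernel's counters directly. The following is a minimal sketch, assuming the standard /proc/stat layout with its ctxt (context switches) and intr (interrupts) lines. The function names stat_counter and stat_rate are made up for illustration and are not part of any tool.

```shell
# Print one counter from a /proc/stat-style file.
# $1 = counter name (ctxt or intr), $2 = file (defaults to /proc/stat).
# For the intr line, the second field is the total across all interrupts.
stat_counter() {
    awk -v key="$1" '$1 == key { print $2; exit }' "${2:-/proc/stat}"
}

# Print the average per-second rate of a counter over an interval.
# $1 = counter name, $2 = interval in whole seconds (default 1),
# $3 = file (defaults to /proc/stat).
stat_rate() {
    interval=${2:-1}
    before=$(stat_counter "$1" "$3")
    sleep "$interval"
    after=$(stat_counter "$1" "$3")
    echo $(( (after - before) / interval ))
}
```

Running stat_rate ctxt 5 and stat_rate intr 5 gives rates that can be compared against a healthy machine of the same class. What counts as "crazy" depends on the workload, so no threshold is hard-coded here.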

The symptoms, however, were obvious. The shell was slow. Running top and holding down the space bar to refresh it as fast as possible produced only about two refreshes per second, where a normal system would probably manage ten or more. Running ps -eaf took seconds. Every syscall seemed to be affected, which led me to suspect the kernel or perhaps a loaded module. Yet nothing appeared to be wrong with the system.

It turns out the CPU was heavily underclocked. In my first round of troubleshooting I didn't catch that the CPU frequency was sitting at around 200MHz, far below the 1200MHz minimum that lscpu reports:

[root@ebg boot]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-4640 v2 @ 2.20GHz
Stepping:              4
CPU MHz:               203.295
CPU max MHz:           2700.0000
CPU min MHz:           1200.0000
BogoMIPS:              4399.91
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
NUMA node1 CPU(s):     1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77
NUMA node2 CPU(s):     2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d
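The telltale sign in the output above is that "CPU MHz" is lower than "CPU min MHz". A minimal sketch of automating that comparison follows, assuming the field labels shown above (other lscpu versions may label or omit these fields). The function name check_lscpu_freq is hypothetical.

```shell
# Compare the "CPU MHz" field of lscpu output against "CPU min MHz".
# Reads lscpu-style text on stdin and prints a one-line verdict.
# Exit status: 0 = ok, 1 = below the minimum, 2 = fields not found.
check_lscpu_freq() {
    awk -F: '
        /^CPU MHz:/     { cur = $2 + 0 }
        /^CPU min MHz:/ { min = $2 + 0 }
        END {
            if (cur == "" || min == "") { print "unknown"; exit 2 }
            if (cur < min) {
                printf "underclocked: %.0f MHz < %.0f MHz minimum\n", cur, min
                exit 1
            }
            printf "ok: %.0f MHz\n", cur
        }'
}
```

It would be used as lscpu | check_lscpu_freq. Note that lscpu reports a single frequency figure here, so this is only a quick first check and says nothing about individual cores.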

The governor was set to powersave, but even then the frequency shouldn't dip below the 1200MHz hardware minimum. Setting the governor to performance didn't make any difference.
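For reference, the governor can be switched with cpupower frequency-set -g performance, or by writing to each CPU's scaling_governor file in sysfs. The following is a minimal sketch of the sysfs route, not a record of the exact commands used on this server. The function names set_governor and show_governors are hypothetical, and the optional root argument exists only so the functions can be tried against a scratch directory.

```shell
# Write a governor name to every CPU's scaling_governor file.
# $1 = governor (e.g. performance)
# $2 = sysfs CPU root (defaults to /sys/devices/system/cpu)
# Writing to the real sysfs files requires root.
set_governor() {
    governor=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue      # glob matched nothing: no cpufreq files
        echo "$governor" > "$f" || return 1
    done
}

# Print each distinct governor in use, prefixed by the number of CPUs using it.
# $1 = sysfs CPU root (defaults to /sys/devices/system/cpu)
show_governors() {
    root=${1:-/sys/devices/system/cpu}
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}
```

Running set_governor performance followed by show_governors confirms whether the change took effect on all CPUs. With the intel_pstate driver, only performance and powersave are offered, as the output below shows.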

[root@ebg boot]# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.20 GHz - 2.70 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1.20 GHz and 2.70 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 193 MHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
    2500 MHz max turbo 4 active cores
    2500 MHz max turbo 3 active cores
    2600 MHz max turbo 2 active cores
    2700 MHz max turbo 1 active cores

A reboot didn't make any difference. In fact, the system was so slow it took well over 20 minutes to do a reboot.

As a temporary fix, I eventually shut the server down, unplugged power from both PSUs, and then powered it back on. This seems to have brought the server back up at the proper frequency.

[root@ebg ~]# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.20 GHz - 2.70 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1.20 GHz and 2.70 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 1.20 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
    2500 MHz max turbo 4 active cores
    2500 MHz max turbo 3 active cores
    2600 MHz max turbo 2 active cores
    2700 MHz max turbo 1 active cores
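The cpupower output above only analyzes CPU 0. To confirm that none of the 80 CPUs is still stuck below the hardware minimum, the per-CPU cpufreq files in sysfs can be compared directly. This is a minimal sketch, assuming scaling_cur_freq and cpuinfo_min_freq are present (both are reported in kHz). The function name find_underclocked is hypothetical.

```shell
# List CPUs whose current frequency is below their hardware minimum.
# $1 = sysfs CPU root (defaults to /sys/devices/system/cpu)
# Prints one line per offending CPU; returns 1 if any were found, else 0.
find_underclocked() {
    root=${1:-/sys/devices/system/cpu}
    bad=0
    for d in "$root"/cpu[0-9]*/cpufreq; do
        # Skip CPUs without readable cpufreq files (or an unmatched glob).
        [ -r "$d/scaling_cur_freq" ] && [ -r "$d/cpuinfo_min_freq" ] || continue
        cur=$(cat "$d/scaling_cur_freq")     # kHz
        min=$(cat "$d/cpuinfo_min_freq")     # kHz
        if [ "$cur" -lt "$min" ]; then
            cpu=${d%/cpufreq}
            echo "${cpu##*/}: $((cur / 1000)) MHz, minimum $((min / 1000)) MHz"
            bad=1
        fi
    done
    return $bad
}
```

On a healthy system find_underclocked prints nothing and returns 0, which makes it easy to drop into a monitoring check so that a recurrence is noticed early.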

This might be an issue on other Dells as well. Some users have complained about laptops stuck at a fixed, slow clock speed: https://www.dell.com/community/Laptops-General-Read-Only/Dell-XPS-9550-CPU-Multiplier-Stuck-at-8/m-p/4740656#M882076.

Addendum and Possible Fix

The issue came back and the CPU cores were once again clocked at around 200MHz. A reboot did not help.

This R850 was running a very old BIOS version (2.0.20) dating from 2014. Updating it to 2.7.0 appears to have fixed the issue: the problem was gone immediately after the BIOS update and the subsequent reboot, with no power cycle performed. This could be because the update resets the CPUs, which in turn resets their frequency.
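To check whether other servers are still on a BIOS older than the one that helped here, the installed version can be read with dmidecode -s bios-version and compared against 2.7.0. This is a minimal sketch, assuming GNU sort with the -V (version sort) option is available. The helper name version_lt is hypothetical, and 2.7.0 is only meaningful for this server model.

```shell
# Succeeds (returns 0) if version $1 is strictly older than version $2.
# Relies on the -V (version sort) option of GNU sort.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example use on a real system (dmidecode needs root):
#   current=$(dmidecode -s bios-version)
#   version_lt "$current" 2.7.0 && echo "BIOS $current is older than 2.7.0"
```

A plain string comparison would get versions such as 2.10.0 and 2.7.0 the wrong way round, which is why version sort is used.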