Monday, August 24, 2020

Palo Alto firewall - Troubleshooting High MP CPU



  • How to Interpret: show system resources

The command show system resources gives a snapshot of Management Plane (MP) resource utilization including memory and CPU. This is similar to the ‘top’ command in Linux.

show system resources provides information about the memory used and available and if the MP is using swap. If the swap usage remains consistently high, it implies that processes are either failing to release memory or they justifiably require more memory to operate.

To verify MP is swapping for a extended period of time, refer to process kswapd highlighted in the sample output below. If this process is running for hours - this could indicate that you have consistent swapping going on and may need to take steps to reduce your memory footprint.

The values of interest are the Swap, Mem, CPU wait time (%wa), Virtual memory usage (VIRT) and CPU usage (%CPU). These values are helpful when determining high MP CPU and/or slow MP response. If the CPU wait time is high, it indicates the MP is waiting for a process to release the CPU.

To view real-time memory and CPU usage, run the command: show system resources follow

Most of the Palo Alto Platforms have multiple core CPUs. Some platforms have dedicated processors for MP and DP, while some use Single Processor for both MP and DP.

For example,

 Platform MP Processors MP Cores
 PA-200 Single Shared Dual Core Processor 1
 PA-220 Single Shared Quad Core Processor 2
 PA-800 Single Shared 8 Cores Processor 3
 PA-3000 Dedicated Dual Core Processor 2
 PA-3200 Dedicated Quad Core Processor 4
 PA-5000 Dedicated Quad Core Processor 4
 PA-5200 Dedicated Multi Core Processor 16-24

This processor can be multiple core processor (some cores can also be used for DP depending on platform), the output lists usage for all cores combined.

You can press "1" after running "show system resources follow" to toggle view to show separate states:

Sample output of the command is provided below:

show system resources
admin@PA-850 (active)> show system resources

top - 15:15:55 up 15 days,  6:53,  1 user,  load average: 7.50, 6.82, 6.57
Tasks: 177 total,   8 running, 168 sleeping,   0 stopped,   1 zombie
%Cpu(s): 71.1 us,  2.7 sy,  0.3 ni, 25.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  4119516 total,   515792 free,  1325328 used,  2278396 buff/cache
KiB Swap:  6104084 total,  4836884 free,  1267200 used.  1112294 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4532       20   0   99752   4036   3308 R  94.4  0.1  22001:32 pan_task
 4533       20   0   74712   3808   3252 R  94.4  0.1  22001:56 pan_task
 4537       20   0   74732   3872   3272 R  94.4  0.1  22001:01 pan_task
 4538       20   0   74712   3788   3188 R  94.4  0.1  22002:18 pan_task
 4534       20   0   74712   3724   3132 R  88.9  0.1  22002:01 pan_task
22131       20   0    4012   1620   1244 R  66.7  0.0   0:37.80 genindex.sh
30475       20   0   16912   7020   1904 R   5.6  0.2   0:00.04 top
    1       20   0    2320    252    252 S   0.0  0.0   0:10.94 init
    2       20   0       0      0      0 S   0.0  0.0   0:03.84 kthreadd
    3       20   0       0      0      0 S   0.0  0.0   1:13.88 ksoftirqd/0
    5        0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
...snipped...
   46        0 -20       0      0      0 S   0.0  0.0   0:00.00 khelper
   47        0 -20       0      0      0 S   0.0  0.0   0:00.00 netns
  235        0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback
  238        0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset
  239        0 -20       0      0      0 S   0.0  0.0   0:00.00 kblockd
  245        0 -20       0      0      0 S   0.0  0.0   0:00.00 ata_sff
  255       20   0       0      0      0 S   0.0  0.0   0:00.00 khubd
  262        0 -20       0      0      0 S   0.0  0.0   0:00.00 edac-poller
  280        0 -20       0      0      0 S   0.0  0.0   0:00.00 rpciod
  327       20   0       0      0      0 S   0.0  0.0  40:24.01 kswapd0
  328       20   0       0      0      0 S   0.0  0.0   0:00.00 fsnotify_m+
  329        1 -19       0      0      0 S   0.0  0.0   0:00.00 nfsiod
  335        0 -20       0      0      0 S   0.0  0.0   0:00.00 kthrotld
  904       20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_0
  907       20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_1
  921       20   0       0      0      0 S   0.0  0.0   0:00.00 spi32766
  968       20   0       0      0      0 S   0.0  0.0   0:43.90 kworker/5:1
  ...snipped...
 5047 nobody    20   0  303308    812    812 S   0.0  0.0   2:55.11 nginx
...snipped...
30466 admin     20   0    2728    704    580 S   0.0  0.0   0:00.01 less
30471       20   0   17248   7100   2204 S   0.0  0.2   0:00.02 sh
30476       20   0   16724   6204   1560 S   0.0  0.2   0:00.02 sed


You can press "1" after running "show system resources follow" to toggle view to show separate states:
Example from a PA-850 firewall:

show system resources
admin@PA-850 (active)> show system resources
top - 14:21:05 up 11 days, 5:58, 1 user, load average: 6.45, 6.44, 6.45
Tasks: 173 total, 8 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu0 : 35.0 us, 7.3 sy, 0.0 ni, 57.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 20.5 us, 3.0 sy, 0.0 ni, 76.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 28.1 us, 7.3 sy, 0.0 ni, 64.4 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 4119516 total, 456576 free, 1381004 used, 2281936 buff/cache
KiB Swap: 6104084 total, 5046640 free, 1057444 used. 1434722 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4532 root 20 0 99752 5796 4784 R 100.0 0.1 16189:00 pan_task
4533 root 20 0 74712 5540 4920 R 100.0 0.1 16189:29 pan_task
4534 root 20 0 74712 5460 4860 R 100.0 0.1 16189:29 pan_task
4537 root 20 0 74732 5360 4756 R 100.0 0.1 16188:40 pan_task
4538 root 20 0 74712 5564 4932 R 100.0 0.1 16189:40 pan_task
21848 root 20 0 4012 1664 1256 R 96.0 0.0 5:27.98 genindex.sh
3330 root 20 0 48976 4520 2560 S 1.0 0.1 69:47.02 ehmon
4488 root 20 0 639500 139056 117412 S 0.3 3.4 48:58.79 useridd
4490 root 20 0 364060 5096 3720 S 0.3 0.1 98:37.45 distributord
4523 root 20 0 58676 2892 2576 S 0.3 0.1 21:11.89 tund
4761 root 20 0 131032 14696 3064 S 0.3 0.4 75:23.93 identityclient
4783 nobody 20 0 288456 8620 3936 S 0.3 0.2 4:48.71 appweb3
4785 root 20 0 1656580 195980 7320 S 0.3 4.8 32:45.77 logrcvr
25172 nobody 20 0 2186392 159940 7900 S 0.3 3.9 2:41.97 httpd
1 root 20 0 2320 508 504 S 0.0 0.0 0:08.10 init


To Toggle IRIX Mode off press Shift+I:

- You can see above this box has 8 CPUs, 5 of which (1-5) are for DP. The DP process is pan_task which is supposed to be running at all times hence it has no idle CPU cycles.
- In above you will notice individually each pan_task (process for DP) are showing 100% CPU. This is design behavior of TOP Command in IRIX Mode where It is possible for the % CPU column to display values that total greater than 100%.
- Solaris mode divides the % CPU for each process by the number of CPUs in the system so that the total will be 100%.


top - 15:37:56 up 15 days, 7:15, 1 user, load average: 6.59, 6.65, 6.66
Tasks: 176 total, 7 running, 167 sleeping, 0 stopped, 2 zombie
%Cpu0 : 28.0 us, 6.4 sy, 0.5 ni, 64.2 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
%Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 17.9 us, 6.0 sy, 0.0 ni, 75.7 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 43.6 us, 7.3 sy, 0.0 ni, 48.6 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 4119516 total, 292220 free, 1224748 used, 2602548 buff/cache
KiB Swap: 6104084 total, 4752288 free, 1351796 used. 848512 avail Mem

Iris mode On
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4532 root 20 0 99752 3508 2832 R 100.0 0.1 22023:31 pan_task
4534 root 20 0 74712 3460 2868 R 100.0 0.1 22024:01 pan_task
4537 root 20 0 74732 3560 2956 R 100.0 0.1 22023:01 pan_task
4538 root 20 0 74712 3372 2776 R 100.0 0.1 22024:18 pan_task
4533 root 20 0 74712 3320 2776 R 99.5 0.1 22023:55 pan_task
23368 root 20 0 4012 1620 1212 R 95.4 0.0 7:05.25 genindex.sh
3104 root 0 -20 206236 19628 2372 S 5.5 0.5 98:36.40 masterd_apps
4488 root 20 0 639500 131072 114584 S 4.6 3.2 68:00.68 useridd
3124 root 15 -5 133252 8904 1240 S 1.8 0.2 202:33.41 sysd
3307 root 30 10 202592 7432 1928 S 1.4 0.2 85:45.42 python
41 root 20 0 0 0 0 S 0.5 0.0 17:19.33 rcuc/7
3686 admin 20 0 4208 1616 1128 R 0.5 0.0 0:00.23 top
4761 root 20 0 131032 13484 2124 S 0.5 0.3 103:03.36 identityclient
4804 root 20 0 1204484 3332 1440 S 0.5 0.1 18:27.52 authd
7675 root 20 0 0 0 0 S 0.5 0.0 0:04.91 kworker/0:2


Additionally note that average CPU usage for this box will also consider DP Cores so your monitoring tools will show higher MP Usage.

Example from a PA-5060 device:
show system resources follow

 top - 21:23:26 up 4 days,  1:57,  1 user,  load average: 0.00, 0.00, 0.00
Tasks: 131 total,   1 running, 130 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4053832k total,  2953796k used,  1100036k free,   163596k buffers
Swap:  2007992k total,     1080k used,  2006912k free,  1298156k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2402 root 0 -20 52960 16m 4440 S 0.3 0.4 5:31.90 masterd_apps
2421 root 15 -5 21680 4408 2508 S 0.3 0.1 18:59.06 sysd
3106 nobody 20 0 156m 47m 10m S 0.3 1.2 21:01.78 appweb3
1 root 20 0 2084 696 576 S 0.0 0.0 0:01.88 init


In this platform all cores are dedicated for MP, so you do not see any pan_task. Rest of the logic remains same.


  • MP-Log

To check the management plane's resource usage for historic logs, use the following command (this information is logged every 15 minutes).

less mp-log mp-monitor.log
admin@firewall1(active)> less mp-log mp-monitor.log
2022-03-16 06:34:33.901 +0000  --- processes
Total num processes: 39
Name                   PID      CPU%  FDs Open   Virt Mem     Res+Swap     State
pppoe                  6655     0     7          446568       27148        S
ha_agent               6648     0     11         239500       29040        S
satd                   6649     0     12         562168       39420        S
l2ctrl                 6646     0     8          519204       32788        S
dp-console             4234     0     5          21992        4596         S
varrcvr                6645     0     34         3011300      145616       S


  • Genindex.sh causing high MP CPU

show system resources follow
top - 14:17:29 up 11 days, 5:55, 1 user, load average: 6.84, 6.46, 6.45
Tasks: 176 total, 8 running, 168 sleeping, 0 stopped, 0 zombie
%Cpu(s): 73.0 us, 2.6 sy, 0.1 ni, 24.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 4119516 total, 319332 free, 1415160 used, 2385024 buff/cache
KiB Swap: 6104084 total, 5050440 free, 1053644 used. 1295050 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4532 root 20 0 99752 5800 4788 R 100.0 0.1 16185:23 pan_task
4533 root 20 0 74712 5544 4924 R 100.0 0.1 16185:53 pan_task
4534 root 20 0 74712 5460 4860 R 100.0 0.1 16185:52 pan_task
4538 root 20 0 74712 5568 4936 R 100.0 0.1 16186:04 pan_task
4537 root 20 0 74732 5364 4760 R 99.7 0.1 16185:04 pan_task
21848 root 20 0 4012 1664 1256 R 97.0 0.0 2:03.25 genindex.sh
3124 root 15 -5 133232 12636 2568 S 1.0 0.3 140:10.34 sysd
3104 root 0 -20 206236 22044 4280 S 0.7 0.5 70:13.62 masterd_apps
3307 root 30 10 202592 10124 4512 S 0.7 0.2 62:48.48 python
4785 root 20 0 1656580 195988 7104 S 0.3 4.8 32:45.55 logrcvr

Symptom
Genindex.sh process utilizing a large amount of CPU.

Environment
All firewall platforms can be affected by this post PANOS - +8.1

Cause
Genindex.sh runs after a PAN-OS upgrade to re-index the logs with any new fields that have been added.  It runs every 15 minutes and is known to cause spikes in the CPU on the management plane.  After it completes indexing all the logs, it will then stop running.

This should not cause any issues or be detrimental to the firewall and will only last until it has indexed all of your logs.  However, depending on the amount of logs you have based off your platform and what you have log retention set to, it can take up to weeks.

One way to reduce the amount of time genindex.sh has to run is to make sure you are making use of log retention as the last logs there are to index, the less time it will take for the process to finish.

Resolution
There is no resolution, this is working as expected.  Once it has finished indexing all the logs it will stop running and you should see your CPU utilization drop.  Also it's worth keeping in mind that this process only uses one core so your devices with multiple cores will not be affected, however it will show up in the logs and GUI that the management CPU is high.

On many occasions, this could be resolved by reducing the amount of logging.


  • Varrcvr process is causing high MP CPU

▶ Symptom
The MP CPU is consistently hitting 100% on this firewall. A major change was migration to Panorama. There were no reports, and we do not generate lots of logs.
When you executed the command "show system resources follow" multiple times, and found that the varrcvr process was always showing as 66 plus in the output.


show system resources follow


admin@pa-220> show system resources follow
top - 17:15:18 up 63 days, 18:28,  1 user,  load average: 3.77, 3.93, 4.10
Tasks: 139 total,   6 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu0  : 90.5 us,  8.5 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 38.4 us, 14.1 sy, 47.2 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  4119652 total,   254280 free,  2471820 used,  1393552 buff/cache
KiB Swap:     1972 total,     1972 free,        0 used.  1165668 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3219 root      20   0   71944  32292   4972 R  99.7  0.8  91773:26 pan_task
 3220 root      20   0   46976   7160   4724 R  99.7  0.2  91776:16 pan_task
 4478 root      20   0 1549004  25420   8980 S  66.6  0.6  28695:37 varrcvr
27267 root      20   0    3876   1536   1260 R   8.2  0.0   0:05.09 genindex.sh
27252 root      20   0   17312   7524   2280 S   4.9  0.2   0:00.76 logpurger.sh
 2915 root       0 -20  199232  32176   4548 S   2.0  0.8 432:48.61 masterd_apps
27212 admin     20   0    4184   1576   1128 R   0.7  0.0   0:00.28 top
28316 root      20   0   15444   3092   1152 R   0.7  0.1   0:00.02 cat
   10 root      20   0       0      0      0 R   0.3  0.0  57:09.10 rcuc/0
   16 root      20   0       0      0      0 S   0.3  0.0  24:36.92 rcuc/2
 1504 nobody    20   0 2316468 161600  11072 S   0.3  3.9   6:58.72 httpd
...snipped...


▶ Resolution
Restart the varrcvr (vardata-receiver) process with the following CLI command.

debug software restart process vardata-receiver
admin@pa-220> debug software restart process vardata-receiver
Process varrcvr was restarted by user admin

vardata-receiver is Vardata Receiver server process
Varrcvr: Recording URL filtering log and packet capture sent by dataplane. Involved with WildFire logs. 


  • PA-500 High Management CPU and poor performance with high logging

▶ Symptom
PA-500 runs very slow and it was possible to notice that Management CPU is between 98-100% almost all the time.

MP CPU Spikes up very frequently and "logrcvr", "mgmtsrvr" and "genindex.sh" appears to be the offending processes:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
3265 20 0 1095m 282m 4780 S 35.4 14.2 93995:05 logrcvr 
14961 20 0 5260 1560 1232 R 16.9 0.1 0:51.34 genindex.sh 

3265 20 0 1095m 205m 3264 S 76.7 10.4 94000:37 logrcvr 
3217 20 0 1198m 315m 7280 S 14.7 15.9 38133:24 mgmtsrvr


High IOWait process was seen as well, however disk space utilization was fine:

PU USAGE DISTRIBUTION Avg Max Max Date
---------------------------------------------
User 54% 54.3% 01-08-06:15
System 13% 13.9% 01-05-12:11
Nice 2% 2.6% 01-09-02:36
Idle 25% 26.6% 01-04-11:29
IOWait 1% 1.1% 01-04-11:29


Due to poor management plane performance commits and other tasks may be affected and get stuck, and using commit force should help in order to go through.

▶ Environment
Palo Alto PA-500 Firewall.
PAN-OS 8.0 and 8.1,
High rate of logging configured.

▶ Cause
When average IOWait is greater than 1%, it is likely causing processes to run slower than normal and it is important to reduce the amount of logging.

▶ Resolution
1. Set the Max days for Logs, In this case the max days was configured to value '90' using this article, which automatically purged all the logs older than 90 days.
2. In this case device was also upgraded to latest stable release. These both actions improved the performance remarkably.



* Reference URLs:

No comments: