Description
If you use ARCCONF to monitor your Adaptec RAID controller, you may run into a bug where the Adaptec CIM provider does not fully clean up its temporary files and fills up the root RAM disk of the ESXi server.
Affected System
- Controller: Adaptec 51245
- Controller firmware: Build 18948 (latest firmware for Adaptec 51245 as of 02-Oct-2013)
- VMware ESXi: 5.x
- Driver version: aacraid-esxi5.0-1.1.7.29100 (latest driver for Adaptec 51245 as of 02-Oct-2013)
- CIM provider version: v7.31.18856 (latest CIM provider for Adaptec 51245 as of 02-Oct-2013)
- ARCCONF Client version: Version 7.31 (B18856)
This is a confirmed system configuration. One may experience this bug with other versions of the software or hardware. The primary suspect for the bug is the CIM provider.
Symptoms
You are using ARCCONF to monitor an Adaptec controller on your ESXi server and you start to receive one or more of the following errors in ESXi:
- The VMRC console has disconnected…attempting to reconnect
- unable to connect to the MKS: a general error occurred: internal error
- ESXi logs contain "RAM disk is full" errors.
- The vdf -h command over SSH shows the root RAM disk as 99%-100% used:
Ramdisk      Size   Used   Available   Use%   Mounted on
root          32M    32M          0M   100%   --
etc           28M   280K         27M     0%   --
tmp          192M   112K        191M     0%   --
hostdstats   249M     4M        244M     2%   --
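The last symptom can also be checked from a script. The following is a minimal sketch, assuming the Ramdisk table of vdf -h has the column layout shown above; the function name full_ramdisks and the default threshold of 90% are illustrative choices, not part of ESXi.

```shell
# Print every ramdisk whose Use% is at or above a threshold (default 90).
# Reads `vdf -h` output on stdin. Assumes the "Ramdisk" header row is
# followed by rows of the form: name size used available use% mountpoint.
full_ramdisks() {
    awk -v limit="${1:-90}" '
        $1 == "Ramdisk" { in_table = 1; next }   # skip everything up to the header
        in_table && NF >= 5 {
            use = $5
            sub(/%/, "", use)                    # "100%" -> "100"
            if (use + 0 >= limit) print $1, $5
        }
    '
}

# Usage on the ESXi host:
#   vdf -h | full_ramdisks 90
```

With the table above as input and a threshold of 90, only the root ramdisk is reported.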
Cause
When ARCCONF GETCONFIG is queried, a log file, /var/log/arcconf.log, is created on the ESXi server. This log file is only ever appended to and is never cleaned by the driver.
The default size of the root RAM disk is 32 MB. How quickly it fills up depends on the monitoring interval and the actual configuration of the controller. In our previous configuration, it took 60 days to fill up the disk. As our monitoring became more complex and the intervals shorter, it took 7 days. Keep in mind that the log is deleted when the server restarts, so depending on circumstances you may never notice the bug.
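To get a rough idea of how long a given host has, the arithmetic can be sketched as below. This assumes the log grows at a steady rate and that the free space you pass in is all available to the log, which is optimistic because the root RAM disk holds other files too. The function name days_until_full is hypothetical, and both inputs are values you would measure yourself.

```shell
# Estimate whole days until a steadily growing log fills the free space.
# $1 = free space in KB, $2 = observed log growth in KB per day.
days_until_full() {
    free_kb=$1
    growth_kb_per_day=$2
    if [ "$growth_kb_per_day" -le 0 ]; then
        echo "growth must be a positive number of KB per day" >&2
        return 1
    fi
    # Integer division: the result is rounded down to whole days.
    echo $(( free_kb / growth_kb_per_day ))
}

# Example with the figures from this article (32 MB = 32768 KB):
#   days_until_full 32768 546     # about 546 KB/day, the 60-day case
#   days_until_full 32768 4681    # about 4681 KB/day, the 7-day case
```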
Official Fix
There is no known official fix as of 02-Oct-2013.
Workaround
The workaround is to clean arcconf.log manually or with a cron job. We use a cron job that empties arcconf.log every two minutes.
*/2 * * * * /bin/echo > /var/log/arcconf.log
To make the cron job persistent across reboots, add the following lines to /etc/rc.local.d/local.sh:
/bin/kill $(cat /var/run/crond.pid)
/bin/echo "*/2 * * * * /bin/echo > /var/log/arcconf.log" >> /var/spool/cron/crontabs/root
/usr/lib/vmware/busybox/bin/busybox crond
The first line kills crond, the second appends our echo command to root's crontab, and the third restarts crond.
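Because the crontab entry is added with >>, running local.sh more than once without a reboot would add the same line again. A minimal sketch of a guard against that is shown here; the helper name add_cron_line is hypothetical, and it assumes the grep on your ESXi build supports the -q, -F, -x and -e options.

```shell
# Append a line to a crontab file only if that exact line is not there yet.
# $1 = crontab file, $2 = complete cron line.
add_cron_line() {
    crontab_file=$1
    cron_line=$2
    # -F treats the line as a fixed string (it contains '*'),
    # -x requires a whole-line match, -q suppresses output.
    if ! grep -qFx -e "$cron_line" "$crontab_file" 2>/dev/null; then
        printf '%s\n' "$cron_line" >> "$crontab_file"
    fi
}

# Possible use in local.sh, between killing and restarting crond:
#   add_cron_line /var/spool/cron/crontabs/root "*/2 * * * * /bin/echo > /var/log/arcconf.log"
```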
—————————————————————–
UPD 18-10-2013: fixed typo in the crond schedule.
UPD 7-07-2014: fixed another typo in the crond schedule description.
/tmp # echo 1 >> /var/log/arcconf.log
/tmp # cat /var/log/arcconf.log
1
/tmp # ls -l
-rw------- 1 root root 0 Feb 13 10:06 31NuQ4
-rw------- 1 root root 0 Feb 11 19:58 3eTUJa
-rw------- 1 root root 0 Feb 10 13:28 64UiRB
-rw------- 1 root root 0 Feb 11 20:05 GDJqXp
-rw------- 1 root root 0 Feb 10 12:44 Rff5X9
-rw------- 1 root root 0 Feb 11 19:49 TZowEp
-rw------- 1 root root 0 Feb 12 12:48 UolH23
-rw-r--r-- 1 root root 201322481 Feb 14 12:03 arcconf.log
The size of arcconf.log stays large.
Hi,
You should use only one ">", not ">>". The first overwrites the file; the second appends to it.
Cheers!
UPD: Also, I just noticed that you echo to arcconf.log located in /var/log, but you run ls in /tmp.
If it’s not a symlink, then you should
Thanks, but I was using this instruction: http://sysadmin.te.ua/linux/aacraid-monitoring.html . There is much more info there about Adaptec AAC-Raid monitoring. Check it out.
aprogrammer,
Yes, this article is not about monitoring but about a bug that may cause ESXi to fail while performing a monitoring task.
Thank you for the link, it's quite interesting! However, we have found that using ARCCONF GETCONFIG alone is not sufficient for adequate monitoring. It does not cover all failure modes of Adaptec's controller, nor does it provide a way to predict a failure.
We are planning to write an article about our monitoring procedure. Make sure to check back later!
Regards!
*/2 * * * * it’s not every half hour, it’s every two minutes …
Every half an hour – */30 * * * *
You are absolutely correct.
Thank you!