vCSA Root Disk Space Issue Caused by dnsmasq

Introduction

I was having problems with my vCSA (6.0 Update 3d), such as not being able to log in to the vSphere Web Client. I went to the Appliance Management Interface and the Health status was showing as critical. The cause was a couple of very large dnsmasq.log files. It took a couple of blog posts and a KB article to fix, so I thought I would write up a consolidated blog post of the fix.

Investigation

First of all, I connected to the shell of the vCSA and ran df -h to show the space used on each partition in human-readable format:

vcsa-01:~ # df -h
Filesystem                            Size  Used Avail Use% Mounted on
/dev/sda3                              11G   11G     0 100% /
udev                                  7.9G  164K  7.9G   1% /dev
tmpfs                                 7.9G   40K  7.9G   1% /dev/shm
/dev/sda1                             128M   38M   84M  31% /boot
/dev/mapper/core_vg-core               50G   18G   30G  38% /storage/core
/dev/mapper/log_vg-log                 20G   18G  1.5G  92% /storage/log
/dev/mapper/db_vg-db                  9.9G  812M  8.6G   9% /storage/db
/dev/mapper/dblog_vg-dblog            5.0G  603M  4.1G  13% /storage/dblog
/dev/mapper/seat_vg-seat               32G  2.4G   28G   8% /storage/seat
/dev/mapper/netdump_vg-netdump        3.0G   18M  2.8G   1% /storage/netdump
/dev/mapper/autodeploy_vg-autodeploy   12G  154M   12G   2% /storage/autodeploy
/dev/mapper/invsvc_vg-invsvc          9.9G  272M  9.1G   3% /storage/invsvc

You can see the root partition, /dev/sda3, is out of space, so I needed to find out what was consuming it. Browsing through the filesystem, I found this in /var/log:

vcsa-01:/var/log # ls -lh
total 4.6G
-rw-r----- 1 dnsmasq dnsmasq  17M Feb  4 08:32 dnsmasq.log
-rw-r----- 1 dnsmasq root    844M Jan 28 07:45 dnsmasq.log-20180128
-rw-r----- 1 dnsmasq dnsmasq 3.7G Feb  2 16:19 dnsmasq.log-20180204

Over 4.5GB of old log files!
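If nothing obvious jumps out while browsing, a quicker way to track down the culprits is to let du rank the contents of the suspect area. This is just a sketch of the general technique (the 10-line cut-off is my own arbitrary choice, and I've scoped it to /var/log to match this case):

```shell
# Rank the largest files and directories under /var/log (sketch).
# -a includes files as well as directories, -x stays on one filesystem,
# -k reports sizes in KB so sort -rn puts the biggest consumers first.
du -axk /var/log 2>/dev/null | sort -rn | head -n 10
```

Pointing it at / instead would rank everything on the root partition, at the cost of a longer run.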

Immediate fix

I had two choices for an immediate fix: delete the log files or expand the disk. Deleting the log files was the less risky option, but as this vCSA is going away soon due to a v6.5 deployment, I decided to do both and learn something new. I would always advise taking a snapshot of the appliance before making any changes, but you cannot expand a virtual disk while a snapshot is in place, so expand the disk first and then snapshot.

There are eleven disks on a v6.0 vCSA; William Lam has a blog post giving the details. The disk we need to expand is VMDK1, which is 12GB by default. Use the vSphere Client to expand the disk to, say, 15GB. After expanding a disk you can normally use the appliance's LVM autogrow to extend the filesystem, but VMDK1 is not LVM-managed: it is split into three plain partitions, and only /dev/sda3 is full. You can see this by using the lsblk command:

vcsa-01:~ # lsblk
NAME                              MAJ:MIN RM   SIZE RO MOUNTPOINT
sda                                 8:0    0    12G  0 
├─sda1                              8:1    0   132M  0 /boot
├─sda2                              8:2    0     1G  0 [SWAP]
└─sda3                              8:3    0  10.9G  0 /
sdb                                 8:16   0   1.3G  0 
sdc                                 8:32   0    30G  0 
└─swap_vg-swap1 (dm-8)            253:8    0    30G  0 [SWAP]
sdd                                 8:48   0    50G  0 
└─core_vg-core (dm-7)             253:7    0    50G  0 /storage/core
sde                                 8:64   0    20G  0 
└─log_vg-log (dm-6)               253:6    0    20G  0 /storage/log
sdf                                 8:80   0    10G  0 
└─db_vg-db (dm-5)                 253:5    0    10G  0 /storage/db
sdg                                 8:96   0     5G  0 
└─dblog_vg-dblog (dm-4)           253:4    0     5G  0 /storage/dblog
sdh                                 8:112  0    32G  0 
└─seat_vg-seat (dm-3)             253:3    0    32G  0 /storage/seat
sdi                                 8:128  0     3G  0 
└─netdump_vg-netdump (dm-2)       253:2    0     3G  0 /storage/netdump
sdj                                 8:144  0    12G  0 
└─autodeploy_vg-autodeploy (dm-1) 253:1    0    12G  0 /storage/autodeploy
sdk                                 8:160  0    10G  0 
└─invsvc_vg-invsvc (dm-0)         253:0    0    10G  0 /storage/invsvc
sr0                                11:0    1  1024M  0 
fd0                                 2:0    1     4K  0 

sda is VMDK1 on the vCSA and it is sda3 that is full.

I found an excellent blog post from Mike Preston that explained how to expand the disk.

These next steps can be dangerous! PROCEED WITH CAUTION!

I needed to use fdisk to rewrite the partition table. This is done by typing the commands:

fdisk /dev/sda
d # d to delete a partition
3 # select partition 3 sda3

n # new
p # new partition
3 # choose partition 3

<Enter> # select default value at First sector prompt
<Enter> # select default value at Last sector prompt

a # make partition bootable
3 # select partition 3 to be bootable
w # write the changes and exit

The output looks like this:

vcsa-01:~ # fdisk /dev/sda

Command (m for help): d
Partition number (1-4): 3

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4, default 3): 3
First sector (2377728-31457279, default 2377728):
Using default value 2377728
Last sector, +sectors or +size{K,M,G} (2377728-31457279, default 31457279):
Using default value 31457279

Command (m for help): a
Partition number (1-4): 3

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

I naturally ran df -h to check the space, but nothing had changed. Note the warning at the end of the output: the kernel will not use the new partition table until the next reboot (or until partprobe is run).

After the reboot I ran df -h again, but the filesystem was still the same size. Following on to step 3 of Mike's post, I saw I needed to extend the filesystem itself (so far I had only altered the partition table). I used the command resize2fs /dev/sda3:

vcsa-01:~ # resize2fs /dev/sda3
resize2fs 1.41.9 (22-Aug-2009)
Filesystem at /dev/sda3 is mounted on /; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/sda3 to 3634944 (4k) blocks.
The filesystem on /dev/sda3 is now 3634944 blocks long.

and checking using df -h:

vcsa-01:~ # df -h
Filesystem                            Size  Used Avail Use% Mounted on
/dev/sda3                              14G   11G  3.0G  79% /

Success! Thanks Mike for the excellent post.

I may have fixed the immediate problem, but it was likely to happen again.

The other, easier and less risky way to fix the issue was to delete the dnsmasq.log files:

vcsa-01:/var/log # rm dnsmasq.log-20180204
vcsa-01:/var/log # rm dnsmasq.log-20180128
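One caution with rm: if the dnsmasq daemon still has a log file open, deleting it only unlinks the name, and the space is not freed until the process closes the file. Truncating the file in place avoids that. A small sketch, using a stand-in temp file rather than the real /var/log path:

```shell
# Truncate a log in place instead of removing it; any process holding the
# file open keeps a valid handle, and the space is freed immediately.
LOG=$(mktemp)                  # stand-in for /var/log/dnsmasq.log
echo "lots of old log data" > "$LOG"
: > "$LOG"                     # ':' is a no-op; the redirection empties the file
```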

Prevention

The next day I started looking into why the archived dnsmasq.log files were so large. dnsmasq is a lightweight DHCP and caching DNS server, so I didn't know why it was logging so much. It just so happened that while checking Twitter I noticed a conversation between Justin Bias and Adam Eckerle describing exactly the issue I was having. What a coincidence to see it the very day I was researching the problem. Adam really is a KB search wizard, as I hadn't found the referenced KB article either.

The KB article is "root partition on the vCenter Server Appliance is full due to dnsmasq.log files" (52258). It is a known issue with no permanent fix, so the article details a workaround, which is in two parts.

The first part is to modify the log rotation options for dnsmasq. Use vi to edit /etc/logrotate.d/dnsmasq and change it to match the following (I have put a comment on each line you need to add or change):

/var/log/vmware/dnsmasq.log {
    nodateext                # added to file
    daily                    # changed from weekly
    missingok
    notifempty
    compress                 # changed from delaycompress
    maxsize 5M               # added to file
    rotate 5                 # added to file
    sharedscripts
    postrotate
        [ ! -f /var/run/dnsmasq.pid ] || kill -USR2 `cat /var/run/dnsmasq.pid`
    endscript
    create 0640 dnsmasq dnsmasq
}
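After editing the file, it is worth checking that logrotate parses the new options before the next scheduled run. logrotate's -d flag runs in debug mode and makes no changes. A sketch against a throwaway copy of the config (guarded in case logrotate is not on the PATH; the stand-in config below omits the postrotate script for brevity):

```shell
# Dry-run a logrotate config to catch syntax errors without rotating anything.
CONF=$(mktemp)                 # stand-in for /etc/logrotate.d/dnsmasq
cat > "$CONF" <<'EOF'
/var/log/vmware/dnsmasq.log {
    nodateext
    daily
    missingok
    notifempty
    compress
    maxsize 5M
    rotate 5
}
EOF
# -d prints what would be done and exits non-zero on a bad config
if command -v logrotate >/dev/null; then logrotate -d "$CONF"; fi
```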

The second part is to move the dnsmasq.log files to the correct disk. Use vi to edit /etc/dnsmasq.conf and change the log-facility line to match:

log-facility=/var/log/vmware/dnsmasq.log

Finally restart the dnsmasq service:

vcsa-01:~ # service dnsmasq restart
Shutting down name service masq caching server
Starting name service masq caching server

Since then the dnsmasq.log files have been under control.

Conclusion

The real purpose of this post was to highlight the KB article for controlling the dnsmasq log files. If it wasn't for Adam Eckerle's tweet I am not sure I would have found the article, and I consider my Google searching skills pretty good. Thanks also to Mike Preston for the new knowledge on expanding the /dev/sda3 partition.

This is the third time I have had disk space issues with a vCSA v6.0, so please be mindful of the partitions filling up. Monitor them closely, in particular the /storage/log and /storage/db disks.
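A simple check can be scripted around df and dropped into cron or a monitoring system so a filling partition is flagged before it hits 100%. This is a sketch, not an official tool, and the 80% threshold is my own arbitrary choice:

```shell
# Print any filesystem above a usage threshold (sketch).
# df -P gives one line per filesystem; column 5 is Use%, column 6 the mount point.
THRESHOLD=80
df -hP | awk -v t="$THRESHOLD" '
    NR > 1 { sub(/%/, "", $5); if ($5 + 0 > t) print $6 " is at " $5 "%" }'
```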
