We all love our VPSes to be in tip-top shape, always ready to respond to requests. However, there are times when you run into trouble: sluggish responses, timeouts and so on.
It is good to always have an “In Case of Emergency” plan, such as re-routing traffic to a backup server or sending notifications to customers. Once the immediate fire is put out, or at least brought under control, you need to begin investigating what went wrong. We will explore some basic tools that help you identify problems with your VPS.
The uptime command provides a quick glance at some useful information. It gives you the time the system has been up, along with the number of users logged in and the system load averages for 1, 5 and 15 minute (rolling) windows.
[missionctrl@orbit ~]$ uptime
19:14:14 up 71 days, 2:14, 1 user, load average: 0.08, 0.06, 0.05
This shows that the VPS has been up for 71 days, 2 hours and 14 minutes. There is currently one user logged on to the system. Understanding the load average is not straightforward. On Linux it counts the processes that are running or waiting to run (including those blocked on disk I/O), so on a single-CPU system a load average of 1.00 means, roughly, that the CPU is fully utilized. A value greater than 1 is okay as long as you have multiple CPUs. (To get the count of CPUs, use the command grep 'model name' /proc/cpuinfo | wc -l). So a load average of 2.00 on a 4-CPU system means that the overall CPU utilization is at roughly 50%.
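The arithmetic above can be sketched as a small shell function. This is only a minimal sketch: load_per_cpu is a name made up for illustration, and the live check assumes a Linux system where /proc/loadavg and nproc are available.

```shell
# load_per_cpu LOAD CPUS: print LOAD divided by CPUS as a rounded percentage.
# (load_per_cpu is a hypothetical helper, not a standard command.)
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f%%\n", load / cpus * 100 }'
}

# Worked example from the text: a load average of 2.00 on 4 CPUs.
load_per_cpu 2.00 4

# Live values (Linux only): the first field of /proc/loadavg is the
# 1-minute load average, and nproc prints the number of available CPUs.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg
    load_per_cpu "$one_min" "$(nproc)"
fi
```

A result that stays near or above 100% for the 5 and 15 minute averages is a hint that the VPS is saturated rather than just seeing a short spike.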
Here is a result from another VPS
cmd@user:~$ uptime
14:29:34 up 7 days, 15:41, 1 user, load average: 8.10, 8.02, 8.01
cmd@user:~$ grep 'model name' /proc/cpuinfo | wc -l
8
For an 8-CPU VPS, a load average of about 8 means that every CPU is being fully utilized, and it is time to review the running processes and begin some investigation. This brings us to the next command
Top is an interactive application that displays running processes and CPU utilization details. Running top gives you a screen similar to this
top - 12:08:58 up 12 days, 12:25, 1 user, load average: 0.52, 0.38, 0.27
Tasks: 24 total, 1 running, 23 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.0 us, 9.3 sy, 0.0 ni, 86.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 524288 total, 63300 used, 460988 free, 0 buffers
KiB Swap: 524288 total, 29468 used, 494820 free. 16240 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 28120 784 576 S 0.0 0.1 0:13.38 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd/6726
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper/6726
66 root 20 0 28812 3300 3192 S 0.0 0.6 1:32.32 systemd-journal
190 root 20 0 186900 592 316 S 0.0 0.1 0:20.31 rsyslogd
215 root 20 0 12604 8 4 S 0.0 0.0 0:00.00 agetty
216 root 20 0 12604 8 4 S 0.0 0.0 0:00.00 agetty
2134 root 20 0 38744 8 4 S 0.0 0.0 0:00.00 systemd-udevd
2209 systemd+ 20 0 25692 8 4 S 0.0 0.0 0:00.00 systemd-resolve
2848 root 20 0 40584 1268 956 S 0.0 0.2 0:00.00 cron
2849 hetrixt+ 20 0 4280 652 548 S 0.0 0.1 0:00.00 sh
2850 hetrixt+ 20 0 11608 1420 1160 S 0.0 0.3 0:00.07 bash
3376 www-data 20 0 61232 960 420 S 0.0 0.2 0:57.03 lighttpd
3545 root 20 0 55132 376 260 S 0.0 0.1 0:18.31 sshd
3740 root 20 0 82676 3864 3000 S 0.0 0.7 0:00.03 sshd
4029 cmd 20 0 82676 1944 1056 S 0.0 0.4 0:00.00 sshd
4030 cmd 20 0 20208 2040 1544 S 0.0 0.4 0:00.00 bash
4131 cmd 20 0 21924 1528 1100 R 0.0 0.3 0:00.00 top
4414 hetrixt+ 20 0 11608 640 376 S 0.0 0.1 0:00.00 bash
4415 hetrixt+ 20 0 8444 804 672 S 0.0 0.2 0:00.00 vmstat
4416 hetrixt+ 20 0 4212 584 484 S 0.0 0.1 0:00.00 tail
7578 root 20 0 4196 60 36 S 0.0 0.0 0:10.64 runsvdir
8378 Debian-+ 20 0 53248 84 32 S 0.0 0.0 0:00.24 exim4
22127 root 20 0 25848 184 116 S 0.0 0.0 0:06.81 cron
Use the arrow keys to navigate the list. Press q to exit. In many ways, top tops all other commands we explore here, as it provides information on running processes and memory consumption as well as uptime information.
The first line is similar to the output from uptime. We see there are 24 tasks in total, only one of which is currently running, and none of them is consuming much CPU. The two lines just above the process list give the available memory and swap details.
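top can also be run non-interactively with -b (batch mode) and -n 1 (a single iteration), which is handy for saving a snapshot during an incident. The sketch below pulls the idle percentage out of the %Cpu(s) summary line. cpu_idle is a hypothetical helper name, and the parsing assumes the summary layout shown in the sample screen, which differs between top versions.

```shell
# cpu_idle: read top's batch output on stdin and print the idle CPU percentage.
# Assumes a summary line of the form "%Cpu(s): ... 86.7 id, ..." as in the sample screen.
cpu_idle() {
    awk -F'[, ]+' '/^%Cpu/ { for (i = 2; i <= NF; i++) if ($i == "id") print $(i - 1) }'
}

# Summary line taken from the sample screen.
echo '%Cpu(s): 4.0 us, 9.3 sy, 0.0 ni, 86.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st' | cpu_idle

# On a live system (uncomment to try):
# top -b -n 1 | cpu_idle
```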
Homework assignment: install htop and try it out.
You will also want to check memory usage in terms of available space, swap allocation etc. Use free -m to give you this information.
[cmd@user ~]$ free -m
total used free shared buff/cache available
Mem: 1024 479 457 180 86 454
Swap: 0 0 0
The -m flag reports the values in megabytes. You could change it to -h, which translates to “human readable” form: all values are converted and suffixed with G/M/K to represent gigabytes/megabytes/kilobytes respectively.
The -s flag, followed by a number of seconds, refreshes the values at that interval; for example, free -m -s 5 updates every 5 seconds. This will show if the memory consumption is increasing over time.
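If you would rather have a single number to watch, the free output can be reduced to the share of memory still available. This is a rough sketch: mem_available_pct is a hypothetical helper name, and it assumes the newer free layout in which the Mem: row has an "available" value as its last column, as in the sample above.

```shell
# mem_available_pct: read `free -m` output on stdin and print available memory
# as a whole-number percentage of total memory.
# Assumes "total" is field 2 and "available" is field 7 of the Mem: row.
mem_available_pct() {
    awk '/^Mem:/ { printf "%.0f\n", $7 / $2 * 100 }'
}

# Mem: row taken from the sample output (454 MB available out of 1024 MB).
printf 'Mem: 1024 479 457 180 86 454\n' | mem_available_pct

# On a live system (uncomment to try):
# free -m | mem_available_pct
```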
df stands for “disk free” and is used to check disk space utilization. When invoked without arguments, it displays the allocation and utilization of all mounted filesystems on your node.
Filesystem 1K-blocks Used Available Use% Mounted on
udev 16407644 0 16407644 0% /dev
tmpfs 3287496 1128 3286368 1% /run
/dev/sda4 954185180 3247948 902397484 1% /
tmpfs 16437460 0 16437460 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 16437460 0 16437460 0% /sys/fs/cgroup
/dev/loop0 88704 88704 0 100% /snap/core/4486
/dev/sda2 1998672 147424 1730008 8% /boot
/dev/loop1 89088 89088 0 100% /snap/core/4830
tmpfs 3287492 0 3287492 0% /run/user/1000
/dev/loop2 89088 89088 0 100% /snap/core/4917
If the large numbers look hard to read, you could use the -h flag to display data in human readable form.
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 1.2M 3.2G 1% /run
/dev/sda4 910G 3.1G 861G 1% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/loop0 87M 87M 0 100% /snap/core/4486
/dev/sda2 2.0G 144M 1.7G 8% /boot
/dev/loop1 87M 87M 0 100% /snap/core/4830
tmpfs 3.2G 0 3.2G 0% /run/user/1000
/dev/loop2 87M 87M 0 100% /snap/core/4917
If you want to display used and available inodes, pass the -i flag to give a result like this
Filesystem Inodes IUsed IFree IUse% Mounted on
udev 4101911 472 4101439 1% /dev
tmpfs 4109365 687 4108678 1% /run
/dev/sda4 60661760 145756 60516004 1% /
tmpfs 4109365 1 4109364 1% /dev/shm
tmpfs 4109365 4 4109361 1% /run/lock
tmpfs 4109365 18 4109347 1% /sys/fs/cgroup
/dev/loop0 12819 12819 0 100% /snap/core/4486
/dev/sda2 131072 313 130759 1% /boot
/dev/loop1 12841 12841 0 100% /snap/core/4830
tmpfs 4109365 10 4109355 1% /run/user/1000
/dev/loop2 12842 12842 0 100% /snap/core/4917
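When the list of filesystems is long, it helps to print only the ones that are filling up. The following is a minimal sketch with a hypothetical helper name (full_filesystems); it skips /dev/loop entries because read-only snap images typically report 100% usage, as seen in the listings above, and it assumes mount points without spaces.

```shell
# full_filesystems THRESHOLD: read df output on stdin and print the mount point
# and Use% of every filesystem at or above THRESHOLD percent.
# /dev/loop* devices are skipped (snap images typically show 100%).
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 && $1 !~ /^\/dev\/loop/ {
            use = $5
            sub(/%/, "", use)            # strip the percent sign before comparing
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# A few rows taken from the sample output, checked against a 5% threshold.
printf '%s\n' \
    'Filesystem 1K-blocks Used Available Use% Mounted on' \
    '/dev/sda4 954185180 3247948 902397484 1% /' \
    '/dev/loop0 88704 88704 0 100% /snap/core/4486' \
    '/dev/sda2 1998672 147424 1730008 8% /boot' | full_filesystems 5

# On a live system (uncomment to try); -P keeps each filesystem on one line:
# df -P | full_filesystems 80
```

The same function works on df -i output, since the IUse% value sits in the same column.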
Along similar lines, the du command is worth exploring. du, or “disk usage”, is used to identify which folders and/or files are consuming the most space. It differs from df in that you can drill down to a particular folder and check its usage. It is invoked as
du /path/to/directory
A sample output is shown below:
[cmd@user ~]$ sudo du /var/log
4 /var/log/ntpstats
60 /var/log/apt
98316 /var/log/journal/81ab9be955ae4eb489a0d397a990251d
98320 /var/log/journal
4 /var/log/lxd
4 /var/log/landscape
644 /var/log/installer
28 /var/log/unattended-upgrades
4 /var/log/dist-upgrade
119320 /var/log
Common flags include -h to print data in human readable form. The above output now looks like
4.0K /var/log/ntpstats
60K /var/log/apt
97M /var/log/journal/81ab9be955ae4eb489a0d397a990251d
97M /var/log/journal
4.0K /var/log/lxd
4.0K /var/log/landscape
644K /var/log/installer
28K /var/log/unattended-upgrades
4.0K /var/log/dist-upgrade
117M /var/log
-a also prints the sizes of individual files, not just directories
-c prints a grand total line at the end of the output
-s prints only a summary total for each path given, without the details
Flags can be combined to provide the desired level of information
[cmd@user ~]$ du /var/log -s -h
117M /var/log
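To find out which directories are eating the space, du is often piped through sort. A small sketch, assuming a hypothetical helper named largest_dirs; it reports sizes in kilobytes (-k) so that a plain numeric sort works.

```shell
# largest_dirs DIR [N]: print the N largest directories under DIR (default 5),
# biggest first, with sizes in kilobytes. Unreadable directories are skipped quietly.
largest_dirs() {
    du -k "$1" 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Example (uncomment to try; sudo may be needed to read everything under /var/log):
# largest_dirs /var/log 5
```

The first line of the output is normally the directory itself, since its total includes everything below it.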
You get a call that the application running on your server is not reachable. You have checked all the messages and done your ps search. There are no error messages and you can see the process running. As you wonder what could have gone wrong, you decide to check the port which the application is listening on. It turns out someone had changed the configuration file and the port number has changed. This is where netstat helps: it lists the network connections and listening ports on your server.
Some key flags used to control netstat output are (they can be combined)
Flag | Description |
-l | List only ports that are in Listen state |
-t | List only ports that use TCP |
-u | List only ports that use UDP |
-n | List information, but do not perform any lookups on port, host or user names |
-p | List information along with Process ID & Program Name |
Combining the flags, we can get information that helps in debugging. For example, to identify if lighttpd is listening for TCP requests on port 80, you could enter
netstat -ltpn | grep "lighttpd"
netstat -ltpn | grep ":80"
Either of the above commands should show you the entry (if everything is working)
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 3376/lighttpd
tcp6 0 0 :::80 :::* LISTEN 3376/lighttpd
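One caveat with grep ":80" is that it also matches ports such as 8080. The sketch below checks for an exact port instead. is_listening is a hypothetical helper name, and it assumes the usual netstat column order (local address in the fourth column, state in the sixth).

```shell
# is_listening PORT: read `netstat -ltn` output on stdin and succeed (exit 0)
# only if a TCP socket is in LISTEN state on exactly PORT.
is_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }'
}

# A socket on 8080 must not be reported as port 80.
if printf 'tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 3376/lighttpd\n' | is_listening 80; then
    echo "port 80: listening"
else
    echo "port 80: not listening"
fi

# On a live system (uncomment to try):
# netstat -ltn | is_listening 80 && echo "port 80: listening"
```

Note that the -p flag usually needs root privileges to show the owning process for sockets that belong to other users.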
The last reboot command shows the times when the system was rebooted. Here is a sample output
[cmd@user ~]$ last reboot
reboot system boot 2.6.32-042stab12 Sat Apr 28 02:51 - 19:44 (68+16:53)
reboot system boot 2.6.32-042stab12 Wed Jan 3 20:39 - 19:44 (182+23:04)
reboot system boot 2.6.32-042stab12 Wed Jan 3 20:39 - 20:39 (00:00)
This command is particularly useful to identify if there have been any shutdowns resulting in downtime on your server. In most cases you should be able to account for every boot event. Note that it could be a provider-initiated reboot (e.g. a kernel patch that was required as a result of the Meltdown & Spectre bugs), though you would normally have received advance notification of this.
Systems that run systemd (Ubuntu 16.04+, CentOS 7+, Fedora, Debian 8+) also run a daemon called journald, which keeps logs of boot messages, kernel messages and messages from various services. The journalctl tool can be used to query and display journald's logs.
Just issuing the journalctl command displays all the logs from the beginning. Let us make it easier to filter out messages. To filter messages based on a service, use the -u flag.
journalctl -u mariadb.service
This lists all messages logged by mariadb.service. Over a period of time that could be pages of information, so let us filter it further to messages since the last boot
journalctl -b -u mariadb.service
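The -b flag limits the output to messages since the last boot. In case you restarted the machine due to a problem and want to see messages from the boot before the most recent one, pass -1 as the argument to -b, like so (earlier boots are only available if journald is configured to keep its logs across reboots)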
journalctl -b -1 -u mariadb.service
If you know the time frame around when the error occurred, you can add the --since flag
journalctl --since today # Lists journal entries from today
journalctl --since "2018-07-05 13:20:00" # Lists entries from 13:20 on 5th July
You can add the -u flag to limit to services you want
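Putting these flags together, a time window for a single service can be wrapped in a small function. This is only a sketch: journal_window is a hypothetical helper name, --until sets the end of the window in the same way --since sets the start, and --no-pager prints the result straight to the terminal so that it can be piped or redirected.

```shell
# journal_window UNIT SINCE UNTIL: show the messages logged by one service
# between two timestamps, without starting the pager.
journal_window() {
    journalctl --no-pager -u "$1" --since "$2" --until "$3"
}

# Example (uncomment to try on a systemd machine):
# journal_window mariadb.service "2018-07-05 13:20:00" "2018-07-05 13:40:00"
```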
Most applications keep log entries based on settings such as ERROR/DEBUG level. These logs are usually in /var/log/{application-name}, unless the application uses a different setting. Please consult the application’s documentation for exact locations and how to interpret the log messages.
Hopefully, this article gives you a basic toolkit to look under the hood and see which bolts need to be tightened. For advanced users or for critical applications, we recommend using monitoring tools (self-hosted or 3rd party), though that is an article for a later date.