Improving Performance
Many Unix-like operating systems and distributions are tuned for desktop or light use out of the box and not for a production database. This guide describes recommended system performance tunings for operators of new and existing Riak clusters. The tunings present in this guide should be considered as a starting point. It is important to make note of what changes are made and when in order to measure the impact of those changes.
For performance and tuning recommendations specific to running Riak clusters on the Amazon Web Services EC2 environment, see AWS Performance Tuning.
Unless otherwise specified, the tunings recommended below are for Linux distributions. Users implementing Riak on BSD and Solaris distributions can use these tuning recommendations to make analogous changes in those operating systems.
Storage and File System Tuning
Virtual Memory
Due to the heavily I/O-focused profile of Riak, swap usage can result in
the entire server becoming unresponsive. We recommend setting
vm.swappiness
to 0 in /etc/sysctl.conf
to prevent swapping as much
as possible:
vm.swappiness = 0
Ideally, you should disable swap to ensure that Riak’s process pages are
not swapped. Disabling swap will allow Riak to crash in situations where
it runs out of memory. This will leave a crash dump file, named
erl_crash.dump
, in the /var/log/riak
directory which can be used to
determine the cause of the memory usage.
Transparent Huge Pages (THP)
Owing to the way that THP handles memory usage, disproportionately large amounts of memory can become held up in any large database application. We recommend disabling THP at boot time. Unfortunately this operation is rather OS specific. As many of our customers are running Red Hat 6, we have included instructions on how to do so underneath. If you are using a different operating system, please refer to documentation for your OS.
In Red Hat 6, you can disable THP by editing grub.conf
and adding the following line:
transparent_hugepage=never
For the change to become effective, a server reboot is required.
Some Kernel tuning tools such as ktune specify that THP should be enabled. This can cause THP to seem to be enabled even though transparent_hugepage=never
has already been added to grub.conf
and the system rebooted. Should this occur, please refer to the documentation for the Kernel tuning tool you are using as to how to disable THP.
Mounts
Riak makes heavy use of disk I/O for its storage operations. It is
important that you mount volumes that Riak will be using for data
storage with the noatime
flag, meaning that filesystem
inodes on the volume will not be
touched when read. This flag can be set temporarily using the following
command:
mount -o remount,noatime <riak_data_volume>
Replace <riak_data_volume>
in the above example with your actual Riak
data volume. The noatime
can be set in /etc/fstab
to mount
permanently.
Schedulers
I/O or disk scheduling is a blanket term used to describe the method by which an operating system chooses how to order input and output operations to and from storage.
The default I/O scheduler (elevator) on Linux is completely fair queuing
or cfq
, which is designed for desktop use. While a good
general-purpose scheduler, is not designed to provide the kind of
throughput expected in production database deployments.
Scheduler recommendations:
- The
noop
scheduler when deploying on iSCSI over HBAs, or any hardware-based RAID. - The
deadline
scheduler when using SSD-based storage.
To check the scheduler in use for block device sda
, for example, use
the following command:
cat /sys/block/sda/queue/scheduler
To set the scheduler to deadline
, use the following command:
echo deadline > /sys/block/sda/queue/scheduler
The default I/O scheduler queue size is 128. The scheduler queue sorts writes in an attempt to optimize for sequential I/O and reduce seek time. Changing the depth of the scheduler queue to 1024 can increase the proportion of sequential I/O that disks perform and improve overall throughput.
To check the scheduler depth for block device sda
, use the following
command:
cat /sys/block/sda/queue/nr_requests
To increase the scheduler depth to 1024, use the following command:
echo 1024 > /sys/block/sda/queue/nr_requests
Filesystem
Advanced journaling filesystems like ZFS and XFS are recommended on some operating systems for greater reliability and recoverability.
At this time, Riak can recommend using ZFS on Solaris, SmartOS, and OmniOS. ZFS may work well with Riak on direct Solaris clones like IllumOS, but we cannot yet recommend this. ZFS on Linux is still too early in its project lifetime to be recommendable for production use due to concerns that have been raised about excessive memory use. ZFS on FreeBSD is more mature than ZFS on Linux, but Riak has not yet performed sufficient performance and reliability testing to recommend using ZFS and Riak on FreeBSD.
In the meantime, the ext3 and ext4 filesystems are sufficient on operating systems on which ZFS or XFS are not available or recommended.
The ext4 file system defaults include two options that increase
integrity but slow performance. Because Riak’s integrity is based on
multiple nodes holding the same data, these two options can be changed
to boost I/O performance. We recommend setting barrier=0
and
data=writeback
when using the ext4 filesystem.
Similarly, the XFS file system defaults can be optimized to improve
performance. We recommend setting nobarrier
, logbufs=8
,
logbsize=256k
, and allocsize=2M
when using the XFS filesystem.
As with the noatime
setting, these settings should be added to
/etc/fstab
so that they are persisted across server restarts.
Kernel and Network Tuning
The following settings are minimally sufficient to improve many aspects
of Riak usage on Linux, and should be added or updated in
/etc/sysctl.conf
:
net.ipv4.tcp_max_syn_backlog = 40000
net.core.somaxconn = 40000
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_moderate_rcvbuf = 1
In general, these recommended values should be compared with the system defaults and only changed if benchmarks or other performance metrics indicate that networking is the bottleneck.
The following settings are optional, but may improve performance on a 10Gb network:
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_mem = 134217728 134217728 134217728
net.ipv4.tcp_rmem = 4096 277750 134217728
net.ipv4.tcp_wmem = 4096 277750 134217728
net.core.netdev_max_backlog = 300000
Certain network interfaces ship with on-board features that have been
shown to hinder Riak network performance. These features can be disabled
via ethtool
.
For an Intel chipset NIC using the
ixgbe
driver running as eth0
, for example, run the following command:
ethtool -K eth0 lro off
For a Broadcom chipset NIC using the bnx
or bnx2
driver, run:
ethtool -K eth0 tso off
ethtool
settings can be persisted across reboots by adding the above
command to the /etc/rc.local
script.
Tuning these values will be required if they are changed, as they affect all network operations.
Optional I/O Settings
If your cluster is experiencing excessive I/O blocking, the following settings may help prevent disks from being overwhelmed during periods of high write activity at the expense of peak performance for spiky workloads:
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 209715200
vm.dirty_ratio = 40
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 200
These settings have been tested and benchmarked by Riak in nodes with 16 GB of RAM.
Open Files Limit
Riak and supporting tools can consume a large number of open file handles during normal operation. For stability, increasing the number of open files limit is necessary. See Open Files Limit for more details.