Production Checklist
Deploying Riak KV to a realtime production environment from a development or testing environment can be a complex process. While the specifics of that process will always depend on your environment and practices, there are some basics for you to consider and a few questions that you will want to ask while making this transition.
We’ve compiled these considerations and questions into separate categories for you to look over.
System
- Are all systems in your cluster as close to identical as possible in terms of both hardware and software?
- Have you set appropriate open files limits on all of your systems?
- Have you applied the Riak KV performance improvement recommendations?
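The open files question above can be spot-checked from a script. This is a minimal sketch, not an official tool: the 65536 minimum is an assumption (substitute the value your own sizing calls for), and `nofile_ok` is a hypothetical helper name. It reports the limit of the process that runs it, so it only tells you about Riak if you run it as the same user and from the same kind of session that starts Riak.

```python
import resource

# Assumed minimum for illustration; use the value your own sizing calls for.
RECOMMENDED_NOFILE = 65536

def nofile_ok(soft_limit, minimum=RECOMMENDED_NOFILE):
    """Return True if the soft open-files limit meets the minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum

if __name__ == "__main__":
    # Limits of the current process, not of an already running Riak node.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    status = "ok" if nofile_ok(soft) else "too low"
    print(f"open files: soft={soft} hard={hard} ({status})")
```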
Network
- Are all systems using the same NTP servers to synchronize clocks?
- Are you sure that your NTP clients’ configuration is monotonic (i.e. that your clocks will not roll back)?
- Is DNS correctly configured for all systems’ production deployments?
- Are connections correctly routed between all Riak nodes?
- Are connections correctly set up in your load balancer?
- Are your firewalls correctly configured?
- Check that network latency and throughput are as expected for all of the following (we suggest using iperf to verify):
  - between nodes in the cluster
  - between the load balancer and all nodes in the cluster
  - between application servers and the load balancer
- Do all Riak nodes appear in the load balancer’s rotation?
- Is the load balancer configured to balance connections with round-robin or a similarly even distribution scheme?
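The routing and firewall questions above can be partly answered with a plain TCP connection test run from each machine that needs to reach the cluster. The sketch below is illustrative only: ports 8087 and 8098 are assumed to be the protocol buffers and HTTP listeners (use the ports from your own configuration), and `port_open` and `unreachable` are hypothetical helper names. It does not measure latency or throughput; iperf remains the tool for that.

```python
import socket

# Assumed listener ports (protocol buffers and HTTP); replace with the
# ports set in your own configuration.
ASSUMED_PORTS = (8087, 8098)

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def unreachable(hosts, ports=ASSUMED_PORTS, timeout=2.0):
    """Return the (host, port) pairs that refused the connection or timed out."""
    return [(h, p) for h in hosts for p in ports if not port_open(h, p, timeout)]
```

Calling `unreachable(["10.0.0.1", "10.0.0.2"])` with your own node addresses (the ones shown are placeholders) returns an empty list when every listener accepts connections.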
Riak KV
- Check configuration files:
  - Does each machine have the correct name and IP settings in `riak.conf` (or in `app.config` if you’re using the older configuration files)?
  - Are all configurable settings identical across the cluster?
  - Have all of the settings in your configuration file(s) that were changed for debugging purposes been reverted back to production settings?
  - If you’re using multiple data backends, are all of your bucket types configured to use the correct backend?
  - If you are using Riak Security, have you checked off all items in the security checklist and turned on security?
  - If you’re using multiple data backends, do all machines’ config files agree on their configuration?
  - Do all nodes agree on the value of the `allow_mult` setting?
  - Do you have a sibling resolution strategy in place if `allow_mult` is set to `true`?
  - Have you carefully weighed the consistency trade-offs that must be made if `allow_mult` is set to `false`?
  - Are all of your apps’ replication properties configured correctly and uniformly across the cluster?
  - If you are using Riak Search, is it enabled on all nodes? If you are not, has it been disabled on all nodes?
  - If you are using strong consistency for some or all of your data:
    - Does your cluster consist of at least three nodes? If it does not, you will not be able to use this feature, and you are advised against enabling it.
    - If your cluster does consist of at least three nodes, has the strong consistency subsystem been enabled on all nodes?
    - Is the `target_n_val` that is set on each node higher than any `n_val` that you intend to use for strongly consistent bucket types (or any bucket types for that matter)? The default is 4, which will likely need to be raised if you are using strong consistency.
  - Have all bucket types that you intend to use been created and successfully activated?
  - If you are using `riak_control`, is it enabled on the node(s) from which you intend to use it?
- Check data mount points:
  - Is `/var/lib/riak` mounted?
  - Can you grow that disk later when it starts filling up?
  - Do all nodes have their own storage systems (i.e. no SANs), or do you have a plan in place for switching to that configuration later?
- Are all Riak KV nodes up?
  - Run `riak ping` on all nodes. You should get `pong` as a response.
  - Run `riak-admin wait-for-service riak_kv <node_name>@<IP>` on each node. You should get `riak_kv is up` as a response. The `<node_name>@<IP>` string should come from your configuration file(s).
- Do all nodes agree on the ring state?
  - Run `riak-admin ringready`. You should get `TRUE ALL nodes agree on the ring [list_of_nodes]`.
  - Run `riak-admin member-status`. All nodes should be valid (i.e. listed as `Valid: 1`), and all nodes should appear in the list.
  - Run `riak-admin ring-status`. The ring should be ready (`Ring Ready: true`), there should be no unreachable nodes (`All nodes are up and reachable`), and there should be no pending changes to the ring (`No pending changes`).
  - Run `riak-admin transfers`. There should be no active transfers (`No transfers active`).
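For the configuration file questions in this section, one way to confirm that settings are identical across the cluster is to copy each node's `riak.conf` to one place and compare the parsed settings. The following is a rough sketch under stated assumptions: it handles only simple `key = value` lines with `#` comments, the keys in `PER_NODE_KEYS` are assumed to be the ones that legitimately differ per node, and `parse_conf` and `config_drift` are hypothetical helper names.

```python
def parse_conf(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

# Settings expected to differ on every node (assumed names; adjust as needed).
PER_NODE_KEYS = {"nodename", "listener.http.internal", "listener.protobuf.internal"}

def config_drift(configs, ignore=PER_NODE_KEYS):
    """Given {node: settings}, return {key: {node: value}} for keys that differ."""
    keys = set().union(*(c.keys() for c in configs.values())) - set(ignore)
    drift = {}
    for key in sorted(keys):
        values = {node: conf.get(key) for node, conf in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift
```

An empty result means the compared files agree on everything outside the ignored keys. A setting missing from one file shows up with the value `None` for that node.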
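The node and ring checks in this section lend themselves to a small script that runs each command and looks for the expected text. This is a sketch rather than a supported tool: the expected fragments are copied from the checklist items above and may differ in spacing or wording between versions, `health_problems` is a hypothetical helper, and `riak-admin member-status` is left out because judging it means comparing the listed nodes against the nodes you expect.

```python
import subprocess

# Expected output fragments, copied from the checklist items above.
EXPECTED = {
    ("riak", "ping"): ["pong"],
    ("riak-admin", "ringready"): ["TRUE ALL nodes agree on the ring"],
    ("riak-admin", "ring-status"): [
        "Ring Ready: true",
        "All nodes are up and reachable",
        "No pending changes",
    ],
    ("riak-admin", "transfers"): ["No transfers active"],
}

def run(cmd):
    """Run a command and return its stdout and stderr as one string."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    return result.stdout + result.stderr

def health_problems(node_name, runner=run, expected=EXPECTED):
    """Return a list of failed expectations; an empty list means all passed."""
    checks = dict(expected)
    checks[("riak-admin", "wait-for-service", "riak_kv", node_name)] = ["riak_kv is up"]
    problems = []
    for cmd, fragments in checks.items():
        output = runner(cmd)
        for fragment in fragments:
            if fragment not in output:
                problems.append(f"{' '.join(cmd)}: missing {fragment!r}")
    return problems
```

Run on each node, `health_problems("<node_name>@<IP>")` with that node's own name returns one line per expectation that was not met. The `runner` parameter exists so the logic can be exercised without a running node.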
Operations
- Does your monitoring system ensure that NTP is running?
- Are you collecting time series data on the whole cluster?
  - System metrics
    - CPU load
    - Memory used
    - Network throughput
    - Disk space used/available
    - Disk input/output operations per second (IOPS)
  - Riak metrics (from the `/stats` HTTP endpoint or using `riak-admin`)
    - Latencies: `GET` and `PUT` (mean/median/95th/99th/100th)
    - Vnode stats: `GET`s, `PUT`s, `GET` totals, `PUT` totals
    - Node stats: `GET`s, `PUT`s, `GET` totals, `PUT` totals
    - Finite state machine (FSM) stats:
      - `GET`/`PUT` FSM `objsize` (99th and 100th percentile)
      - `GET`/`PUT` FSM `times` (mean/median/95th/99th/100th)
    - Protocol buffer connection stats
      - `pbc_connects`
      - `pbc_active`
      - `pbc_connects_total`
- Are the following being graphed (at least the key metrics)?
  - Basic system status
  - Median and 95th and 99th percentile latencies (as these tend to be leading indicators of trouble)
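As one way to feed the Riak metrics above into a time series system, the sketch below polls the `/stats` HTTP endpoint on a node and keeps a handful of values. Treat the details as assumptions: port 8098 is assumed to be the HTTP listener, the `node_get_fsm_time_*` and `node_put_fsm_time_*` names should be confirmed against the `/stats` output of your version, and `fetch_stats` and `pick_stats` are hypothetical helper names. Only the three `pbc_*` names come from the list above.

```python
import json
from urllib.request import urlopen

# Assumed stat names; confirm them against the /stats output of your version.
KEY_STATS = (
    "node_get_fsm_time_median", "node_get_fsm_time_95", "node_get_fsm_time_99",
    "node_put_fsm_time_median", "node_put_fsm_time_95", "node_put_fsm_time_99",
    "pbc_connects", "pbc_active", "pbc_connects_total",
)

def fetch_stats(host, port=8098, timeout=5.0):
    """Fetch the /stats JSON document from one node (port is an assumption)."""
    with urlopen(f"http://{host}:{port}/stats", timeout=timeout) as response:
        return json.load(response)

def pick_stats(stats, names=KEY_STATS):
    """Return only the named stats, skipping any the node does not report."""
    return {name: stats[name] for name in names if name in stats}
```

A collector would call `pick_stats(fetch_stats(host))` for every node on a fixed interval and forward the result to whatever graphing system you already use.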
Application and Load
- Have you benchmarked your cluster with simulated load to confirm that your configuration will meet your performance needs?
- Are the client libraries in use in your application up to date?
- Do the client libraries that you’re using support the version of Riak KV that you’re deploying?
Confirming Configuration with Riaknostic
Recent versions of Riak KV ship with Riaknostic, a diagnostic utility that can be invoked by running `riak-admin diag <check>`, where `check` is one of the following:
- `disk`
- `dumps`
- `memory_use`
- `nodes_connected`
- `ring_membership`
- `ring_preflists`
- `ring_size`
- `search`
- `sysctl`
Running `riak-admin diag` with no additional arguments will run all checks and report the findings. This is a good way of verifying that you’ve gotten at least some of the configurations mentioned above correct, that all nodes in your cluster are up, and that nothing is grossly misconfigured. Any warnings produced by `riak-admin diag` should be addressed before going to production.
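To keep the output of each check separate, the checks listed above can be run one at a time and grouped by name. This is a minimal sketch: it makes no assumption about the format of Riaknostic's messages and simply collects every non-empty output line for a person to review, and `diag_findings` is a hypothetical helper name.

```python
import subprocess

# Check names as listed above.
DIAG_CHECKS = (
    "disk", "dumps", "memory_use", "nodes_connected", "ring_membership",
    "ring_preflists", "ring_size", "search", "sysctl",
)

def run_diag(check):
    """Run one Riaknostic check and return its stdout and stderr as text."""
    result = subprocess.run(
        ["riak-admin", "diag", check], capture_output=True, text=True)
    return result.stdout + result.stderr

def diag_findings(runner=run_diag, checks=DIAG_CHECKS):
    """Return {check: [output lines]} for every check that printed anything."""
    findings = {}
    for check in checks:
        lines = [line for line in runner(check).splitlines() if line.strip()]
        if lines:
            findings[check] = lines
    return findings
```

The `runner` argument lets the grouping logic be tested without a Riak installation. On a real node the default runs `riak-admin diag <check>` for each name.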
Troubleshooting and Support
- Does your team, including development and operations, know how to open support requests with Riak?
- Is your team familiar with Riak Support’s Service-Level Agreement (SLA) levels?
  - Normal and Low are for issues not immediately impacting production systems
  - High is for problems that impact production or soon-to-be-production systems, but where stability is not currently compromised
  - Urgent is for problems causing production outages or for those issues that are likely to turn into production outages very soon. On-call engineers respond to urgent requests within 30 minutes, 24/7.
- Does your team know how to gather `riak-debug` results from the whole cluster when opening tickets? If not, that process goes something like this:
  - SSH into each machine, run `riak-debug`, and grab the resultant `.tar.gz` file
  - Attach all debug tarballs from the whole cluster each time you open a new High- or Urgent-priority ticket
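The two `riak-debug` steps above can be scripted so that collecting tarballs from every node takes one command. The sketch below only builds the `ssh` and `scp` argument lists. It assumes that `riak-debug` writes its tarball into the current working directory, that key-based SSH access to each node is already set up, and that the scratch directory `/tmp/riak-debug-out` is acceptable on your hosts; `debug_commands` is a hypothetical helper name.

```python
def debug_commands(host, remote_dir="/tmp/riak-debug-out", local_dir="."):
    """Return the ssh and scp argument lists for one node's debug tarball."""
    # Assumes riak-debug writes its .tar.gz into the current directory.
    collect = ["ssh", host, f"mkdir -p {remote_dir} && cd {remote_dir} && riak-debug"]
    # The glob avoids guessing the tarball's file name.
    fetch = ["scp", f"{host}:{remote_dir}/*.tar.gz", local_dir]
    return [collect, fetch]
```

Passing each returned list to `subprocess.run(cmd, check=True)` for every host in the cluster leaves the tarballs in `local_dir`, ready to attach to the ticket.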
The Final Step: Taking it to Production
Once you’ve been running in production for a month or so, look back at the metrics gathered above. Based on the numbers you’re seeing so far, configure alerting thresholds on your latencies, disk consumption, and memory. These are the places most likely to give you advance warning of trouble.
When you go to increase capacity down the line, having historic metrics will give you very clear indicators of having resolved scaling problems, as well as metrics for understanding what to upgrade and when.
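One simple way to turn a month of collected metrics into alerting thresholds is to take a high percentile of the observed values and add headroom. The sketch below illustrates that idea only: the 99th percentile and the 1.5 multiplier are arbitrary starting points rather than recommendations from this checklist, and `percentile` and `alert_threshold` are hypothetical helper names.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of samples, for an integer p from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def alert_threshold(samples, p=99, headroom=1.5):
    """Observed percentile times a headroom factor (both values illustrative)."""
    return percentile(samples, p) * headroom
```

Applied to a month of 99th percentile `GET` latencies, for example, this yields a threshold that normal traffic rarely crosses but a sustained regression will. Revisit the multiplier once you see how often the alert fires.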