Statistics & Monitoring Reference
  
  
Riak provides data related to current operating status, which includes
statistics in the form of counters and histograms. These statistics
are made available through the HTTP API via the /stats endpoint, or through the riak-admin interface, in particular the stat and status commands.
This page presents the most commonly monitored and gathered statistics, as well as numerous solutions for monitoring and gathering statistics that our customers and community report using successfully in Riak cluster environments. You can learn more about the specific Riak statistics provided in the Inspecting a Node and HTTP Status documentation.
System Metrics To Graph
Graphing general system metrics of Riak nodes will help with diagnostics and early warnings of potential problems, as well as help guide provisioning and scaling decisions.
- CPU (user/system/wait/idle)
- Processor Load
- Available Memory
- Available disk space
- Used file descriptors
- Swap Usage
- IOWait
- Read operations
- Write operations
- Network throughput
- Network errors
We also recommend tracking your system’s virtual and writebacks. Things like massive flushes of dirty pages or steadily climbing writeback volumes can indicate poor virtual memory tuning. More information can be found here and in our documentation on system tuning.
Riak Metrics to Graph
Riak metrics fall into several general categories:
- Throughput metrics
- Latency metrics
- Erlang resource usage metrics
- General Riak load/health metrics
If graphing all of the available Riak metrics is not practical, you should pick a minimum relevant subset from these categories. Some of the most helpful metrics are discussed below.
Throughput Metrics
Graphing the throughput stats relevant to your use case is often helpful for capacity planning and usage trend analysis. In addition, it helps you establish an expected baseline – that way, you can investigate unexpected spikes or dips in the throughput. The following stats are recorded for operations that happened during the last minute.
| Metric | Relevance | Operations (for the last minute) | 
|---|---|---|
| node_gets | K/V | Reads coordinated by this node | 
| node_puts | K/V | Writes coordinated by this node | 
| vnode_counter_update | Data Types | Update Counters operations coordinated by local vnodes | 
| vnode_set_update | Data Types | Update Sets operations coordinated by local vnodes | 
| vnode_map_update | Data Types | Update Maps operations coordinated by local vnodes | 
| search_query_throughput_one | Search | Search queries on the node | 
| search_index_throughtput_one | Search | Documents indexed by Search | 
| consistent_gets | Strong Consistency | Consistent reads on this node | 
| consistent_puts | Strong Consistency | Consistent writes on this node | 
| vnode_index_reads | Secondary Indexes | Number of local replicas participating in secondary index reads | 
Note that there are no separate stats for updates to Flags or
Registers, as these are included in vnode_map_update.
Latency Metrics
As with the throughput metrics, keeping an eye on average (and max) latency times will help detect usage patterns, and provide advanced warnings for potential problems.
FSM Time Stats represent the amount of time in microseconds required to traverse the GET or PUT Finite State Machine code, offering a picture of general node health. From your application’s perspective, FSM Time effectively represents experienced latency. Mean, Median, and 95th-, 99th-, and 100th-percentile (Max) counters are displayed. These are one-minute stats.
| Metric | Also | Relevance | Latency (in microseconds) | 
|---|---|---|---|
| node_get_fsm_time_mean | _median,_95,_99,_100 | K/V | Time between reception of client read request and subsequent response to client | 
| node_put_fsm_time_mean | _median,_95,_99,_100 | K/V | Time between reception of client write request and subsequent response to client | 
| object_counter_merge_time_mean | _median,_95,_99,_100 | Data Types | Time it takes to perform an Update Counter operation | 
| object_set_merge_time_mean | _median,_95,_99,_100 | Data Types | Time it takes to perform an Update Set operation | 
| object_map_merge_time_mean | _median,_95,_99,_100 | Data Types | Time it takes to perform an Update Map operation | 
| search_query_latency_median | _min,_95,_99,_999,_max | Search | Search query latency | 
| search_index_latency_median | _min,_95,_99,_999,_max | Search | Time it takes Search to index a new document | 
| consistent_get_time_mean | _median,_95,_99,_100 | Strong Consistency | Strongly consistent read latency | 
| consistent_put_time_mean | _median,_95,_99,_100 | Strong Consistency | Strongly consistent write latency | 
Erlang Resource Usage Metrics
These are system metrics from the perspective of the Erlang VM, measuring resources allocated and used by Erlang.
| Metric | Notes | 
|---|---|
| sys_process_count | Number of processes currently running in the Erlang VM | 
| memory_processes | Total amount of memory allocated for Erlang processes (in bytes) | 
| memory_processes_used | Total amount of memory used by Erlang processes (in bytes) | 
General Riak Load/Health Metrics
These various stats give a picture of the general level of activity or load on the Riak node at any given moment.
| Metric | Also | Notes | 
|---|---|---|
| node_get_fsm_siblings_mean | _median,_95,_99,_100 | Number of siblings encountered during all GET operations by this node within the last minute. Watch for abnormally high sibling counts, especially max ones. | 
| node_get_fsm_objsize_mean | _median,_95,_99,_100 | Object size encountered by this node within the last minute. Abnormally large objects (especially paired with high sibling counts) can indicate sibling explosion. | 
| riak_search_vnodeq_mean | _median,_95,_99,_100 | Number of unprocessed messages in the vnode message queues of the Riak Search subsystem on this node in the last minute. The queues give you an idea of how backed up Solr is getting. | 
| search_index_fail_one | Number of “Failed to index document” errors Search encountered for the last minute | |
| pbc_active | Number of currently active protocol buffer connections | |
| pbc_connects | Number of new protocol buffer connections established during the last minute | |
| read_repairs | Number of read repair operations this node has coordinated in the last minute (determine baseline, watch for abnormal spikes) | |
| list_fsm_active | Number of List Keys FSMs currently active (should be 0) | |
| node_get_fsm_rejected | Number of GET FSMs actively being rejected by Sidejob’s overload protection | |
| node_put_fsm_rejected | Number of PUT FSMs actively being rejected by Sidejob’s overload protection | 
Command-line Interface
The riak-admin tool provides two
interfaces for retrieving statistics and other information: status
and stat.
status
Running the riak-admin status command will return all of the
currently available information from a running node.
riak-admin status
This will return a list of over 300 key/value pairs, like this:
1-minute stats for 'dev1@127.0.0.1'
-------------------------------------------
connected_nodes : ['dev2@127.0.0.1','dev3@127.0.0.1']
consistent_get_objsize_100 : 0
consistent_get_objsize_195 : 0
... etc ...
A comprehensive list of available stats can be found in the Inspecting a Node document.
stat
The riak-admin stat command is related to the riak-admin status
command but provides a more fine-grained interface for interacting with
stats and information. Full documentation of this command can be found
in the Inspecting a Node document.
Statistics and Monitoring Tools
There are many open source, self-hosted, and service-based solutions for aggregating and analyzing statistics and log data for the purposes of monitoring, alerting, and trend analysis on a Riak cluster. Some solutions provide Riak-specific modules or plugins as noted.
The following are solutions which customers and community members have reported success with when used for monitoring the operational status of their Riak clusters. Community and open source projects are presented along with commercial and hosted services.
Many of the below tools were either created by third-parties or Riak engineers for general usage, and have been passed to the community for further updates. As such, many of the below only aggregate the statistics and messages that were output by Riak 1.4.x.
Like all code under Riak Labs, the below tools are “best effort” and have no dedicated Riak support. We both appreciate and need your contribution to keep these tools stable and up to date. Please open up a GitHub issue on the repository if you’d like to be a maintainer.
Look for banners calling out the tools we’ve verified that support the latest Riak 2.x statistics!
Self-Hosted Monitoring Tools
Riaknostic
Riaknostic is a growing suite of diagnostic checks that can be run against your Riak node to discover common problems and recommend how to resolve them. These checks are derived from the experience of the Riak Client Services Team as well as numerous public discussions on the mailing list, IRC room, and other online media.
Riaknostic integrates into the riak-admin command via a diag
subcommand, and is a great first step in the process of diagnosing and
troubleshooting issues on Riak nodes.
Riak Control
Riak Control is Riak’s REST-driven user-interface for managing Riak clusters. It is designed to give you quick insight into the health of your cluster and allow for easy management of nodes.
While Riak Control does not currently offer specific monitoring and statistics aggregation or analysis functionality, it does offer features which provide immediate insight into overall cluster health, node status, and handoff operations.
collectd
collectd gathers statistics about the system it is running on and stores them. The statistics are then typically graphed to find current performance bottlenecks, predict system load, and analyze trends.
Ganglia
Ganglia is a monitoring system specifically designed for large, high-performance groups of computers, such as clusters and grids. Customers and community members using Riak have reported success in using Ganglia to monitor Riak clusters.
A Riak Ganglia module for collecting statistics from
the Riak HTTP /stats endpoint is also available.
Nagios
Tested and Verified Support for Riak 2.x.
Nagios is a monitoring and alerting solution that can provide information on the status of Riak cluster nodes, in addition to various types of alerting when particular events occur. Nagios also offers logging and reporting of events and can be used for identifying trends and capacity planning.
A collection of reusable Riak-specific scripts are available to the community for use with Nagios.
OpenTSDB
OpenTSDB is a distributed, scalable Time Series Database (TSDB) used to store, index, and serve metrics from various sources. It can collect data at a large scale and graph these metrics on the fly.
A Riak collector for OpenTSDB is available as part of the tcollector framework.
Riemann
Riemann uses a powerful stream processing language to aggregate events from client agents running on Riak nodes, and can help track trends or report on events as they occur. Statistics can be gathered from your nodes and forwarded to a solution such as Graphite for producing related graphs.
A Riemann Tools project consisting of small programs for sending data to Riemann provides a module specifically designed to read Riak statistics.
Zabbix
Tested and Verified Support for Riak 2.x Stats.
Zabbix is an open-source performance monitoring, alerting, and graphing solution that can provide information on the state of Riak cluster nodes.
A Zabbix plugin for Riak is available to get you started monitoring Riak using Zabbix.
Hosted Service Monitoring Tools
The following are some commercial tools which Riak customers have reported successfully using for statistics gathering and monitoring within their Riak clusters.
Circonus
Circonus provides organization-wide monitoring, trend analysis, alerting, notifications, and dashboards. It can been used to provide trend analysis and help with troubleshooting and capacity planning in a Riak cluster environment.
New Relic
Tested and Verified Support for Riak 2.x Stats.
New Relic is a data analytics and visualization platform that can provide information on the current and past states of Riak nodes and visualizations of machine generated data such as log files.
A Riak New Relic Agent for collecting statistics from the Riak HTTP /stats endpoint is also available.
Splunk
Splunk is available as downloadable software or
as a service, and provides tools for visualization of machine generated
data such as log files. It can be connected to Riak’s HTTP statistics
/stats endpoint.
Splunk can be used to aggregate all Riak cluster node operational log files, including operating system and Riak-specific logs and Riak statistics data. These data are then available for real time graphing, search, and other visualization ideal for troubleshooting complex issues and spotting trends.
Summary
Riak exposes numerous forms of vital statistic information which can be aggregated, monitored, analyzed, graphed, and reported on in a variety of ways using numerous open source and commercial solutions.
If you use a solution not listed here with Riak and would like to include it (or would otherwise like to update the information on this page), feel free to fork the docs, add it in the appropriate section, and send a pull request to the Riak Docs.
