Building OpenTSDB for 10,000 hosts

At Datto, we needed a highly scalable metric storage system (or “time series database”) for tracking system- and application-level stats throughout our environment. We had a nice little Hadoop cluster I'd set up that could handle some IOPS. So we decided to test OpenTSDB and Druid. Druid was certainly appealing:

  • It promised OLAP capabilities
  • Ambari had a built-in install
  • There was a Grafana plugin

The problem with Druid was simple: we couldn't successfully query data once we got it in. After weeks of effort, neither the Grafana plugin nor the CLI tools could produce readable data.

Given our problems with Druid (and the fact that OpenTSDB did not exhibit these problems), choosing our TSDB was easy.

So...

OpenTSDB.

OpenTSDB is a time-series database (TSDB) that uses Apache HBase for storage. Apache HBase is a distributed NoSQL datastore built on Hadoop and based on Google’s Bigtable paper.

OpenTSDB is actually one of the older time-series databases out there, which means a lot of outdated documentation is still floating around. You'll find recommendations on how to use and tune OpenTSDB that target versions from 6+ years ago, along with companion suggestions on how to tune HBase from the same era. Given their age (and the significant changes HBase saw with the 2.0 release), a number of these recommendations aren't quite accurate any more. So I'm going to document the setup we landed on, which has given us fairly solid performance for OpenTSDB.

OpenTSDB config:

tsd.network.async_io = true
tsd.network.reuse_address = true
tsd.network.tcp_no_delay = true
tsd.http.query.allow_delete = true
tsd.http.request.cors_domains = *
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 20097152
tsd.query.filter.expansion_limit = 16184
tsd.query.skip_unresolved_tagvs = true
tsd.core.auto_create_metrics = true
tsd.core.meta.enable_realtime_ts = false
tsd.core.meta.enable_realtime_uid = false
tsd.core.meta.enable_tsuid_incrementing = false
tsd.core.meta.enable_tsuid_tracking = false
tsd.core.tree.enable_processing = false
tsd.core.uid.random_metrics = false
tsd.storage.enable_appends = false
tsd.storage.enable_compaction = false
tsd.storage.fix_duplicates = true
tsd.storage.flush_interval = 10000
tsd.storage.hbase.prefetch_meta = true
tsd.storage.hbase.scanner.maxNumRows = 512
tsd.storage.max_tags = 12
tsd.storage.repair_appends = true
tsd.storage.salt.width = 1
tsd.storage.uid.width.metric = 4
tsd.storage.uid.width.tagk = 4
tsd.storage.uid.width.tagv = 4
tsd.storage.use_otsdb_timestamp = false
tsd.uid.lru.enable = true

Excluding a few basic configs ('cause nobody needs to see that I picked the default port, or what my ZK cluster looks like), we've had good luck with these settings on OpenTSDB 2.4. Some important notes:

  1. Network settings are all defaults, but please use tsd.network.async_io = true, tsd.network.reuse_address = true, and tsd.network.tcp_no_delay = true. Your servers will thank you.
  2. Disable tree processing, meta table creation (tsd.core.meta.*), and compaction in the main TSD. All of these can be handled by separate, out-of-band processes (there's a sketch of the out-of-band meta sync after this list), which lets the primary TSD just run ingest/queries. Much faster.
  3. Ensure you enable salting (tsd.storage.salt.width). It really improves distribution of metrics in the data table. But once you've set a salt width, do not change it; if you do, you'll end up with inaccessible data.
  4. Increasing the max requested number of rows from HBase (tsd.storage.hbase.scanner.maxNumRows) means larger chunks come back from HBase at a time. This primarily helps with larger queries.
  5. Appends cause a noticeable CPU spike. It's probably best to avoid them (even if they do improve storage usage). Supposedly, OpenTSDB 3.0 will have a new append strategy that avoids this CPU overhead. Something to revisit then...
  6. Date-tiered compaction in HBase sounds nice, but is broken in many versions of HBase (including 2.0.2, which is what Hortonworks currently ships). Don't enable it or your data table will get sad very quickly. (Essentially, when regions split, the cleanup fails rather catastrophically. The problem has since been patched, although I can't find the relevant HBase JIRA ticket; 2.0.2 definitely exhibits it.)
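
To expand on note 2: with realtime meta generation disabled, the meta tables can still be populated out-of-band using OpenTSDB's own CLI. This is a minimal sketch of one way to do it, not necessarily the only one (we actually use the tsuid-ratelimiter plugin mentioned further down); point it at your own config/ZK setup as needed.

# Scan the data table and create any missing UIDMeta/TSMeta entries.
# Run this periodically (cron, systemd timer, whatever) instead of
# paying the tsd.core.meta.enable_realtime_* cost on the ingest path.
tsdb uid metasync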

This guide has some solid recommendations (including the aforementioned date-tiered compaction), but a few thoughts with HBase 2.0 in mind:

  1. HBase cache settings are no longer only cluster-wide; many of them can be set on individual tables. I'll cover that more below.
  2. The memstore suggestions are a little...aggressive. I'll cover those more below.

HBase config

Largely leaving the default configs in place (if you're using Ambari/Cloudera Manager) is safe, but here are some suggestions (a config sketch follows the list):

  1. Use a smaller-than-default memstore flush size (Ambari's default of 128 MB is a bit high). We're using 64 MB, and the smaller writes do help. The OpenTSDB devs recommend 16 MB, which Ambari won't allow you to set.
  2. Ensure hbase.bucketcache.size is at least 4 GB (a value of 4096). This allows for better caching of data as we write.
  3. The data table should have caching enabled on write: alter 'tsdb', {NAME=>'t', CACHE_BLOOMS_ON_WRITE=>true, CACHE_INDEX_ON_WRITE=>true, CACHE_BLOCKS_ON_WRITE=>true} (or something along those lines, but that's a good start).
  4. Absolutely use SNAPPY/LZO compression (if you can enable it); it gives a nice reduction in storage with basically no extra cost at compression/decompression.
  5. If you can enable short-circuit reads in HBase, it's worth doing. It should slightly reduce network traffic and, consequently, IO in the cluster.
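
To make those concrete, here's a rough sketch of the relevant settings, shown as flat key = value pairs for brevity rather than hbase-site.xml/hdfs-site.xml markup. The property names are standard HBase/HDFS keys; the values are just the ones discussed above, and the ioengine choice and socket path are assumptions to adjust for your own cluster.

# hbase-site.xml (set via Ambari/Cloudera Manager)
hbase.hregion.memstore.flush.size = 67108864   # 64 MB instead of the 128 MB default
hbase.bucketcache.size = 4096                  # bucket cache capacity in MB
hbase.bucketcache.ioengine = offheap           # assumes an off-heap bucket cache

# hdfs-site.xml / hbase-site.xml for short-circuit reads
dfs.client.read.shortcircuit = true
dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket   # example path

# hbase shell: compress the data table's column family
alter 'tsdb', {NAME=>'t', COMPRESSION=>'SNAPPY'}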

MOST IMPORTANT

Pre-splitting the data table

All of these tweaks help, but the most important thing you can do to support a properly functioning OpenTSDB cluster is to pre-split your HBase regions. The OpenTSDB docs recommend this, but don't go into many details on how to get a successful pre-split. What I found most effective: set up the table (with salting enabled) and start ingesting some data. Then, as the regions begin to grow:

  1. Execute a full table split (echo "split 'tsdb'" | sudo -u hbase hbase shell)
  2. Wait for some more data to populate and for the split regions to properly close.
  3. Repeat (see the shell sketch below).
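
Here's roughly what that loop looks like from the shell; 'tsdb' is the default data table name, and deciding when the regions have grown enough to split again is still a judgment call:

# Split every region of the data table at its midpoint.
echo "split 'tsdb'" | sudo -u hbase hbase shell

# Watch the table: wait for more data to land and for the parent
# regions from the split to be cleaned up before splitting again.
echo "list_regions 'tsdb'" | sudo -u hbase hbase shell
echo "compaction_state 'tsdb'" | sudo -u hbase hbase shell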

If you are confident you can do a proper HBase pre-split (calculating region boundaries, etc.) with OpenTSDB's hashed row keys, more power to you. Given how we intend this database to grow, I didn't know where the proper split points would be, so using at least some actual data made my splits fairly accurate. We aren't seeing much hot-spotting. Queries are significantly faster. General cluster load is slightly lower. In fact, a pre-split table also has better success compacting regions, which brings us to...

HBase compaction

Compaction in HBase is actually a series of processes that help:

  1. clean up closed regions
  2. reduce the number of files that make up a region
  3. help finish region splits
  4. ensure data locality

Because of the significant changes in HBase 2.0, ensuring region compaction can complete is vital. With hbck (HBase's version of fsck) somewhat neutered by the upgrade (and documentation around the new hbck 2.0 being rather hand-wavy at present), the compaction queue is your best friend in ensuring the health of your tables.

If your compaction queue is backing up, you're going to have a bad time.

The strangest part is that when there are a significant number of regions marked "other" in HBase's Master UI, interacting with a table slows down significantly: alter statements time out, loading the table detail page in the UI takes significantly longer, and general table performance drops. The only way I've found in HBase 2.0 to truly remove "other" regions (which appear to be regions marked CLOSED that haven't actually been closed yet) is to let the compaction queue run. Even disabling and re-enabling the table will not remove these regions. It can take a noticeable amount of time to clean up regions like this: even nominally healthy clusters (nicely pre-split, compaction queues staying relatively low, etc.) can take 30+ minutes to properly remove a single CLOSED region.

The problem here is that there isn't much we can do to help the compaction queues along. We can run the catalog janitor (which cleans up the leftover parents of split regions) and the cleaner chore (which cleans up old WALs and archived store files) on our cluster, and that can address some issues. But generally, we just have to be patient. Master failover (as in previous versions) can also help, but there is no obvious "silver bullet" for compaction problems.
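
For reference, these are the shell invocations in question; they only trigger the chores, they don't make the queue drain any faster:

# Ask the Master to run the catalog janitor now (cleans up leftover
# parents of split regions).
echo 'catalogjanitor_run' | sudo -u hbase hbase shell

# Ask the Master to run the cleaner chore now (cleans up archived
# HFiles and old WALs).
echo 'cleaner_chore_run' | sudo -u hbase hbase shell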

It is possible this is an artifact of a bad config on my part that I haven't discovered. As it stands, ensuring compaction success seems to largely rely on heavy planning for how your table will be used. Pre-splitting is the best tool to ensure this.

Additional Work We’ve Done

  1. We've installed tsuid-ratelimiter into our TSDs, which gave us a functioning meta table. You can read more about tsuid-ratelimiter in this article. Because of how this plugin works, you should try to pre-split your meta table as well. It doesn't need nearly as many splits as the raw data table, but pre-splitting definitely helps it handle the initial write influx.
  2. We've enabled rollup tables to help offset larger queries (a sample rollup config follows this list). To handle the amount of ingest we're receiving (~900,000 points per second), we had to write our own rollup tool in Rust, which we're currently working on open-sourcing.
    1. Rollups are critical as your OpenTSDB installation grows. Having separate tables at reasonable intervals allows Graphite-like TTLs (different tables can have different TTLs) and the ability to query coarser-grained data for longer time ranges, which really helps tools like Grafana.
  3. We've had to patch some functionality into the Grafana OpenTSDB plugin. We haven't had time to "upgrade the backend" as the Grafana project requested before these changes can be properly merged.
  4. We're also quite excited for OpenTSDB 3.0, which promises a caching layer (huge win for frequent queries), Elasticsearch-backed meta information (and generally improved metadata creation/management), and overall code/performance improvements.
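
For reference, enabling rollups in OpenTSDB 2.4 looks roughly like the sketch below. The table names, intervals, and file path are illustrative rather than our production layout, so double-check the rollup section of the 2.4 docs before copying anything.

# opentsdb.conf
tsd.rollups.enable = true
tsd.rollups.config = /etc/opentsdb/rollup_config.json

# /etc/opentsdb/rollup_config.json (illustrative intervals and tables)
{
  "aggregationIds": { "sum": 0, "count": 1, "min": 2, "max": 3 },
  "intervals": [
    { "table": "tsdb", "preAggregationTable": "tsdb-agg",
      "interval": "1m", "rowSpan": "1h", "defaultInterval": true },
    { "table": "tsdb-rollup-1h", "preAggregationTable": "tsdb-agg-rollup-1h",
      "interval": "1h", "rowSpan": "1d" }
  ]
}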

In Toto

There has been a lot of work put into our OpenTSDB installation at Datto. We are currently storing ~74 million metric/tag combinations totaling ~61 TB of data, with an average 95th-percentile retrieval time of 620 ms. With all this data flying around, ensuring stability and performance has been an interesting challenge. This is obviously an ongoing process, but the work outlined above has helped us run a fairly stable system with a fairly consistent increase in ingest/query rate.

About the Author

Systems Administrator and monitoring enthusiast
