Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

My day job relies heavily on Amazon Redshift. Most of our fact tables are sorted by date and almost always a locale.

I wrote a Python script to delete the stale records based on that time key. We have a separate vacuum service that cleans up the mess continuously.

Is this considered a time series database? Or is there something a dedicated time series database does differently?



Ajay from TimescaleDB here.

In general, a time-series database makes architectural decisions and focuses on capabilities for time-series data that lead to orders of magnitude higher performance and a better developer experience.

For example, TimescaleDB:

- Auto-partitions data into 100s-10,000s of chunks, resulting in faster inserts and queries, yet provides the Hypertable abstraction that allows you to treat the data as if it lives in a single table (eg full SQL, secondary indexes, etc)

- Provides Continuous Aggregates, which automatically calculate the results of a query in the background and materialize the results [1]

- Supports native compression, using a variety of best-in-class algorithms based on datatype (eg delta-of-delta encoding, Simple-8b RLE, XOR-based compression (aka "Gorilla" compression), dictionary compression [2]

- Adds other capabilities like interpolation, LOCF, data retention policies, etc, that are necessary for time-series data

- Scales up on a single node, or out across multiple nodes

There's more, but I'll stop because I don't want this to sound like marketing.

You are welcome to implement this yourself (and our docs explain how if you'd like), but most find it easier to use the database we already provide for free[3]. :-)

[1] https://docs.timescale.com/latest/using-timescaledb/continuo...

[2] https://docs.timescale.com/latest/using-timescaledb/compress...

[3] https://blog.timescale.com/blog/multi-node-petabyte-scale-ti...


Thanks for the detailed explanation, that helps a lot.

It sounds like a time-series database does in fact include some elements of ETL at the database layer. I can see how that would be helpful.


Glad to hear. I'm also available via email ajay (at) timescale.com if you have additional questions

We also have a pretty active community on Slack (4,000+ members): https://slack.timescale.com/


I tend to think of them as databases where rows are an aggregate count of events over a period of time.

An overly simplistic example being an API that captures upvotes. Instead of storing the individual upvotes, you incoming requests in a queue and only write out a count of upvotes over, say 1 minute. That way, if you want to get a count over a larger period of time, you're setting a ceiling on the number of records involved in your operation. If you have 1 minute resolution and you're looking over a year of data, you're adding at most, 6024366 records for each post. This is really handy for analytics data where you don't necessarily care about the individual records.

The project I was specifically referencing was one that captured data streams from sensors like accelerometers and thermometers, which is inherently time-series data, because it's literally just (timestamp, sensor-reading). But to make use of that, you need a library of tools that understand that the underlying data is a time-series to do things like smoothing data, or highlight key events. For example, a torque spike had a "signature" which involved looking at the difference over time periods at a particular resolution. But a temperature spike would look different. Etc.


I see, thanks for the explanation. Is aggregation in general a required feature of a time-series database? Or specifically the ability to calculate aggregates continuously over changing (incoming?) data?

Do time series databases store the raw underlying data for the aggregates or just enough to calculate the next window?


Time-series databases themselves are usually key/value or relational.

The "time-series" features come from the focus around time as a primary component of the data (or some kind of increasing numeric value). Features like automatic partitioning and sharding, better compression, aggregating into time buckets, extracting and managing date values, smoothing missing values in queries, etc.


Redshift does a lot of that. Is there something unique to a "time series" database that Redshift does not do? Is it the inclusion of some ETL capabilities? I'm not sure what makes a database truly time-series. It sounds like there is some compromise made that requires a time dimension. I'm not sure what that compromise would be beyond the general case of a columnar database.


Nothing unique. Any database can store "time-series" data. If you look at InfluxDB (key/value) or TimescaleDB (relational), they offer things like high precision time values, higher compression algorithms that work better with timestamps, utilities to make it easier to query for time (buckets, timezones, gap-filling), automatic rollups and aggregations, data deletions based on time, etc.

You can do all that yourself on Redshift but they just offer more built-in functionality for it.


Exactly what mani says. For many people, Redshift, or just plain Postgres, is good enough. But if you find yourself wresting with poor performance, slow queries, high costs, or just an annoying developer experience, I'd encourage you to look at time-series databases (eg TimescaleDB, where I work).


You could have made a timeseries db. There's nothing magic about it. It's just a collection of engineering tradeoffs where you throw away relational features and indexes on various fields, instead focusing on performance for the huge number of events coming into an append only system (you can update but it's super rare).


To be clear, TimescaleDB does NOT throw away and indexing capabilities or ACID features. Its only limitation is that UNIQUE constraints need to include the time column. Also, its query optimizations are far from trivial to implement.


If you build it on a modern relational data warehouse then you don't have to throw away any of that. That's more of how specific databases (like InfluxDB) choose to handle it but not necessarily a component of time-series itself.


Yes absolutely. However if you are aggressive in getting performance then you're likely to do things like batching the data and storing compressed blobs in the db and the rest of the row is essentially just metadata to the blob. Once you do this, relations on the data are basically thrown away because you can no longer do any joins on the blob data.

You can choose to not do this but then get ready to eat mediocre performance.


Sure but that's because a "time-series" database is trying to get functionality that isn't already part of the underlying data store, for example if you're using RocksDB underneath and want to avoid slow scans.

You don't need to do that with a relational columnstore like Redshift, BigQuery, MemSQL, Vertica, Kdb+, or others. They're designed for massive aggregations with rows stored in highly compressed column segments that support updates and joins. And they're much faster than any custom compression scheme on top of another slower data store.

There's also the in-between options like OpenTSDB and Pinalytics that use Hbase/Bigtable. That's a sorted key/value datastore but it still applies compression on the row segments so you can leverage that in scans without a custom compression scheme on top.


>And they'll be much faster than any custom compression scheme on top of another data store.

As usual, it depends on the situation and what's being measured. For example, if you are able to batch a lot, then write performance of a blob will be line rate. If you need an ACK on each event then that's not possible and indeed a column store will be better.

> OpenTSDB

AFAICT, OpenTSDB compacts columns which is similar to what I've described.

https://github.com/OpenTSDB/opentsdb/blob/14ab3ef8a865816cf9...

https://github.com/OpenTSDB/opentsdb/blob/14ab3ef8a865816cf9...

> Pinalytics

This is proprietary so there's no visibility into what they do here.

FWIW, KairosDB is a timeseries db build on Cassandra and uses a row with 1,814,400,000 to side step the data management overhead via bucketing. Not a binary blob but certainly not the straight forward data layout that someone asking about 'why can't I just store data in a normal database' might expect.

https://kairosdb.github.io/docs/build/html/CassandraSchema.h...


All relational columnstores use batch inserts and updates. In fact that's the recommended way to write columnstore data and SQL already supports various batch insert methods. The modern databases also use a row-store representation as a buffer for any single-row mutations before merging the columnstore segments. [1]

OpenTSDB compacts columns inside a column-family. This is because Hbase/cassandra/dynamo key/value stores support arbitrary columns for each row and store a verbose identifier for the column/cell/version matrix. It's a custom data format to save disk space but the underlying compaction still refers to HBase LSM-tree compaction. [2] The rows are still stored individually and compression is enabled on HDFS.

But yes, it's all highly dependent on the situation and it's the same fundamental mechanics of batching and compressing data. My point is that it's better to just use a database that is built with this architecture and can provide greater performance, usability and full querying flexibility instead of bolting it on top of a less functional data store.

1. https://docs.memsql.com/v7.1/concepts/columnstore/

2. http://opentsdb.net/docs/build/html/user_guide/definitions.h...


Almost all of our load operations are a merge. Our ETL layer takes care of that. The vacuum then deals with re-sorting rows and freeing blocks.

I have never worked with something that bills itself as a "time-series" database. Does that involve pushing elements of the ETL layer down into the database? Or is it just optimizing for access by a time dimension?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: