
The naive relational time series table looks like:

propertyid, timestamp, value

Then add a covering index so your reads only ever hit the index. This works completely fine for low billions of rows. After that I suggest using ClickHouse. It is less general, but at large enough scale you need to make some tradeoffs. Completely fine to start with a relational DB in many cases though.
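A minimal sketch of that setup in SQLite (table and column names are my own invention): the index keys on (property_id, ts) and includes value, so a time-range read for one series is satisfied entirely from the index without touching the base table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE readings (
        property_id TEXT NOT NULL,
        ts          INTEGER NOT NULL,  -- unix epoch seconds
        value       REAL NOT NULL
    )
""")
# Covering index: the key is (property_id, ts) and `value` rides along,
# so range scans never need the base table at all.
con.execute("""
    CREATE INDEX readings_cover
    ON readings (property_id, ts, value)
""")

con.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("LLY.NYSE", 1726377148, 924.38),
    ("SHOP.NYSE", 1726377216, 72.45),
    ("LLY.NYSE", 1726377284, 924.39),
])

# Range scan over one series; EXPLAIN QUERY PLAN on this statement
# reports "USING COVERING INDEX readings_cover".
rows = con.execute("""
    SELECT ts, value FROM readings
    WHERE property_id = ? AND ts BETWEEN ? AND ?
    ORDER BY ts
""", ("LLY.NYSE", 1726377000, 1726378000)).fetchall()
```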



But then I have to use standard SQL instead of a vendor's half-assed, poorly documented query language.


Thanks for this. Can we keep going?

The article mentioned stock prices, so let's use your schema:

LLY.NYSE, 1726377148, 924.38

SHOP.NYSE, 1726377216, 72.45

SHOP.NYSE, 1726377245, 72.41

LLY.NYSE, 1726377284, 924.39

LLY.NYSE, 1726377310, 924.36

Okay, so you're appending values that capture a moment in time for a given index, with the understanding that you're almost never going to revise a tuple. So far as we're concerned, time only flows in one direction. Periodicity is determined by the client application but could be throttled due to a bottleneck, resulting in fewer writes.

I can imagine how storing in blocks based on time ranges, plus the fact that timestamps only ever increase, makes this very easy to index.

With even this simple example, it's clear what this type of datastore could be useful for, even though I'm not convinced you couldn't do pretty exciting things in Redis or sqlite3.

That said, I'm still significantly confused why the hell you can only store numeric values. It just seems like a very arbitrary constraint.


If you use a bulk insert pattern you might be able to get ~50K inserts per second. If you have to insert one record at a time, it will likely drop down to ~3K or so.

The key is the covering index: you don't want to hit the actual table at all. You should re-organize the index periodically as well in order to ensure optimal query times.
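A sketch of the bulk insert pattern in SQLite (the exact throughput numbers above depend heavily on hardware and configuration): wrapping one executemany in a single transaction means one commit for the whole batch instead of one per row, which is where most of the per-insert cost goes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (property_id TEXT, ts INTEGER, value REAL)")

# Hypothetical batch of ticks for one symbol.
rows = [("LLY.NYSE", 1726377148 + i, 924.38 + i * 0.001)
        for i in range(10_000)]

# Bulk pattern: one transaction around one executemany() gives a single
# commit for the entire batch, rather than a round-trip per record.
with con:
    con.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
```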

This is suboptimal in many ways, as the records are tiny yet there is per-row overhead. However, it works fine to low billions of rows in general. I think it would work fine in SQLite as well. Redis is a very different animal, however, so a different strategy would be needed.

Migrating the same schema to ClickHouse when you need it (I have no affiliation with them) will give you a 50X smaller data footprint, roughly 10X higher ingest rates, and about 2X faster query times. However, if you don't really need that, don't bother adding another component to your infrastructure (imo). Also, ClickHouse is much more of a purpose-built Ferrari, i.e. you can't expect it to do all kinds of "off schema" queries (e.g. give me all the stock values on Jan 10 @ 10:32:02.232am). In a relational DB you can add indexes and get results reasonably quickly. In ClickHouse, schema == query pattern, essentially. Naturally, you should know what you are doing with either model, but don't treat ClickHouse as a "magic data box" that can handle anything you throw at it (beyond a certain scale, at least).


The constraint is not inherent to TSDBs generally. InfluxDB supports string values, for example, as mentioned in the article. There are also log databases, like Loki, that are designed like a TSDB, except they only store strings.

My guess is that constraining to numbers greatly simplifies the implementation, especially so for the query language and aggregation functions.


There are some compression algorithms (e.g. Gorilla) which do a good job at lossless compression. However, the key is not to try to store arbitrary data types in the same table (don't use "OO" ideas here).
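For a feel of why these algorithms compress time series so well, here's a minimal sketch of the delta-of-delta idea Gorilla uses for timestamps (illustrative only; the real algorithm bit-packs these values and also XOR-encodes the floats):

```python
def delta_of_delta(timestamps):
    # Store the first timestamp, then for each subsequent sample store the
    # change in the delta. Regularly spaced samples collapse to runs of
    # zeros, which a bit-level encoding can store in almost no space.
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta - prev_delta)
        prev, prev_delta = ts, delta
    return out

# Samples arriving every 60s: everything after the first delta is zero.
encoded = delta_of_delta([1726377148, 1726377208, 1726377268, 1726377328])
```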


Non-TSDB databases can do an adequate job — up to a certain point. However, getting good performance comes down to query access paths.

For enormous amounts of data, you want your data to be sequential (a few batch reads are generally faster than lots of random ones) and sorted. Databases like Postgres don't store rows in index order, and their indexes are geared towards small fetches as opposed to big ranges of sequential data. (Postgres does have BRIN indexes, but they're not that great for this.) Other databases do support clustered tables, but they're still page-based underneath and still suboptimal for other reasons. It's a good learning exercise to compare traditional databases with something like ClickHouse, which was designed from the ground up to store data in this optimized way. (It's a general-purpose OLAP database, so columns can be any kind of data, not just numbers, and data doesn't need to be time-based; so it's useful for much more than "pure" TSDBs.)

As for numbers: For metric data you're storing numeric measurements. The whole point is applying aggregations to the data, like calculating an average or a percentile over a certain time period. Storing numbers also has the advantage of allowing downsampling. For many purposes you don't need 1s granularity for all your data. So as you "roll" data over into old time periods, you can downsample by storing the data points aggregated as mean/min/max/etc. New data can be exact down to the second, while old data can be exact down to the hour, for example.
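Downsampling in a relational store can be as simple as a GROUP BY over time buckets. A sketch using SQLite with made-up data: integer division of the epoch timestamp by 3600 assigns each sample to its hour, and only the aggregates need to be kept for the rolled-over period.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (property_id TEXT, ts INTEGER, value REAL)")

# Two hours of 60-second samples for one series (hypothetical values).
base = 1726376400  # aligned to an hour boundary
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("LLY.NYSE", base + i * 60, float(i)) for i in range(120)],
)

# Roll fine-grained samples into hourly buckets: ts / 3600 (integer
# division) picks the hour; we keep mean/min/max plus a sample count.
rollup = con.execute("""
    SELECT property_id,
           (ts / 3600) * 3600 AS bucket,
           AVG(value), MIN(value), MAX(value), COUNT(*)
    FROM readings
    GROUP BY property_id, bucket
    ORDER BY bucket
""").fetchall()
```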


You absolutely need a covering index in a relational DB. This way the data is read entirely from the index. The table just essentially goes along for the ride. The extra storage is a little distasteful from a conceptual / academic perspective but it works ok below a certain scale (low B rows). Beyond that use ClickHouse (but understand how it works - don't treat it as a "magic data box").


Thank you, this was a very illuminating answer for me - especially when you mentioned downsampling.


> That said, I'm still significantly confused why the hell you can only store numeric values. It just seems like a very arbitrary constraint.

"In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order."

A numeric datapoint is by far the largest use case, so time series databases optimize for that. Most allow you to have several labels, of course; e.g. you could store country border crossings as

{from: US, to: Canada} 1726377148, 2123

{from: US, to: Mexico} 1726377148, 34567


I don't believe there is any restriction in InfluxDB, but I can't remember for sure. Of course, in a relational DB you can absolutely store strings, but create a separate table for them (same in ClickHouse). If you have a somewhat restricted set of strings, ClickHouse will laughably compress it down to almost nothing using its low-cardinality approach.

However, strings aren't as common, as usually you are storing a numeric value of some sort.



