Edit: While Horizon assigns versions internally, it looks like they are not currently used for catching concurrent client-side modifications by different users. (I previously wrote that they would cause the "losing" write to fail if it was based on an outdated version of the data, but that doesn't seem to be the case yet.)
We're constantly improving performance, and a lot has happened within the past year. At this point I think RethinkDB is as good a database for analytics as many other general-purpose databases when it comes to features and performance.
From what I can tell, there are still two main limitations that apply in some, but not all scenarios:
* Grouping big results without an associated aggregation requires the full result to fit into RAM. I believe this was the limitation that you ran into a year ago, which led to RAM exhaustion. This limitation is still there (https://github.com/rethinkdb/rethinkdb/issues/2719 in our issue tracker). However, we're shipping a new command `fold` with the upcoming 2.3 release of RethinkDB, which can be used in the vast majority of cases to perform streaming grouped operations in conjunction with a matching index (see the sketch after this list). See https://github.com/rethinkdb/rethinkdb/issues/3736 for details.
* Scanning data sets that don't fit into memory on rotational disks is still inefficient. Most SQL databases deploy sophisticated optimizations to structure their disk layout in order to minimize the effects of high seek times. RethinkDB's disk layout is built with a stronger focus on SSDs, so this limitation doesn't apply if the data is stored on SSDs.
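To illustrate the `fold` workaround from the first point, here's a minimal sketch using the Python driver, assuming a hypothetical `events` table with a secondary index on `user_id`. It streams the table in index order and emits one count per group, so the full grouping never has to fit into RAM (the exact command semantics are described in the issue linked above):

```python
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Stream the (hypothetical) "events" table in index order, so all rows of
# one group arrive consecutively and only one running counter is in memory.
cursor = r.table("events").order_by(index="user_id").fold(
    # Base accumulator: no group seen yet.
    {"user_id": None, "count": 0},
    # Combining function: extend the current group, or start a new one.
    lambda acc, row: r.branch(
        acc["user_id"].eq(row["user_id"]),
        {"user_id": acc["user_id"], "count": acc["count"] + 1},
        {"user_id": row["user_id"], "count": 1},
    ),
    # Emit the finished group whenever a new group starts.
    emit=lambda old, row, new: r.branch(
        old["user_id"].eq(row["user_id"]) | old["user_id"].eq(None),
        [],
        [{"user_id": old["user_id"], "count": old["count"]}],
    ),
    # Emit the last group (unless the table was empty).
    final_emit=lambda acc: r.branch(
        acc["user_id"].eq(None),
        [],
        [{"user_id": acc["user_id"], "count": acc["count"]}],
    ),
).run(conn)

for group in cursor:
    print(group)
```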
Out of curiosity, why would you prefer this sort of implementation to something more in line with MogileFS? [e.g. Metadata storage with the actual file stored independently on the local file system of multiple physical nodes]
> That's not what http://rethinkdb.com/docs/quickstart/ says. It shows a connection that exposes context-bound Builder methods with trigger (insert, changes, run, etc) methods.
I think the confusion stems from the fact that the Quickstart guide assumes that you're running queries in the Data Explorer, a web frontend for prototyping queries. In the Data Explorer, clicking the "Run" button is what triggers the execution of the AST.
If you wrote something like `r.table("tv_shows").insert(...)` in your application code, it wouldn't do anything except return an AST object. You can store that object, or call the `run(conn)` method on it to send it over a RethinkDB connection and execute it.
Note that the `r` object in these queries has no state. You can think of it as a namespace that serves as a starting point for building queries.
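For example (Python driver; the connection details and the inserted document are made up):

```python
import rethinkdb as r

# Building the query only constructs an AST object; nothing runs yet.
query = r.table("tv_shows").insert({"name": "Star Trek TNG", "episodes": 178})

# The AST is an ordinary value: store it, pass it around, compose it further.
conn = r.connect("localhost", 28015)

# Only run() serializes the AST and sends it over the connection.
result = query.run(conn)
print(result["inserted"])  # number of documents inserted
```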
We've since simplified the build process, and it can now build most dependencies automatically.
The only exception right now is the web UI assets, which still need to be downloaded separately or copied over from Linux (building these on Windows will come later).
We don't link or compile in any Cygwin code; we call the Windows APIs directly. The build system uses some Cygwin tools, though.
Incidentally, we considered using Cygwin to achieve Windows compatibility at some point, but found that it didn't implement some of the lower-level APIs that RethinkDB uses on Linux.
A new epoch timestamp is generated when you either first create a new table, or when you use the "emergency repair" operation to manually recover from a dead Raft cluster (usually one that has lost a majority of servers).
The wall-clock time comes from the server that processes that query.
Whenever the epoch timestamp changes, replicas will get a fresh set of Raft member IDs, and it's expected that they start with an empty Raft log.
Where exactly the epoch timestamps come from is not really relevant to this bug. With the bug fixed, any given node will only accept multi_table_manager_t actions that have a strictly larger epoch timestamp than its current one. That is enough to guarantee that it never goes back to a previous configuration, and never rejoins a Raft cluster with the old member ID but a wiped Raft log.
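In other words (a hypothetical Python paraphrase of the check; the real implementation is C++):

```python
def should_accept(current_epoch_timestamp, action_epoch_timestamp):
    # Strictly larger, not larger-or-equal: an action carrying our own
    # epoch (or an older one) is ignored, so a node can never slide back
    # to a previous configuration.
    return action_epoch_timestamp > current_epoch_timestamp
```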
I guess I'm just confused about how this clock becomes a trusted source of truth for forward progress? Is there a way of asserting that the clock makes forward progress that I don't understand?
EDIT: Or is it that it's not required to show forward progress? Still reading the rethinkdb source & docs, thanks for the information so far.
During normal operation, RethinkDB uses the standard Raft protocol for managing configuration operations. The Raft protocol only uses the clock for heartbeats and election timeouts. So we're not using the clock as a trusted source of truth.
However, a Raft cluster will get stuck if half or more of the members are permanently lost. RethinkDB offers a manual recovery mechanism called "emergency repair" in this case. When the administrator executes an emergency repair, RethinkDB discards the old Raft cluster and starts a completely new Raft cluster, with an empty log and so on. However, some servers might not find out about the emergency repair operation immediately. So we would end up with some servers that were in the new Raft cluster and some that were still in the old Raft cluster. We want the ones in the old Raft cluster to discard their old Raft metadata and join the new Raft cluster.
The process of having those servers join the new Raft cluster is managed using the epoch_t struct. An epoch_t is a unique identifier for a Raft cluster. It contains a wall-clock timestamp and a random UUID. When the user runs an emergency repair, the wall-clock time is initialized to max(current_time(), prev_epoch.timestamp+1). When two servers detect that they are in different Raft clusters, the one that has the lower epoch timestamp discards its data and joins the other one's Raft cluster. The UUID is used for tiebreaking in the unlikely event that the timestamps are the same.
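A rough Python analogue of that logic, based only on the description above (the real epoch_t is a C++ struct; the names here are illustrative):

```python
import time
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class Epoch:
    timestamp: int   # wall-clock seconds, forced forward across repairs
    id: uuid.UUID    # random tiebreaker

def epoch_for_emergency_repair(prev: Epoch) -> Epoch:
    # max(current_time(), prev_epoch.timestamp + 1) guarantees the new
    # epoch sorts after the old one even if the wall clock went backwards.
    return Epoch(max(int(time.time()), prev.timestamp + 1), uuid.uuid4())

def supersedes(a: Epoch, b: Epoch) -> bool:
    # The server whose epoch is lower discards its Raft state and joins
    # the other server's cluster; the UUID breaks timestamp ties.
    return (a.timestamp, a.id.int) > (b.timestamp, b.id.int)
```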
So the clock isn't being used as a trusted source of truth; it's being used as part of a quick-and-dirty emergency repair mechanism for the case where the Raft cluster is permanently broken. The emergency repair mechanism isn't guaranteed to preserve any consistency invariants (as the documentation clearly warns).
Ahh ok! An interesting approach. For some reason I'm uncomfortable with it, but I need to spend more time reasoning about which failure modes RethinkDB users care about.
I'm still not sure I fully understand the cases where this could occur, as I'm used to a very different approach to consistent operations, one that relies heavily on physical infrastructure.