Looking at the docs, it doesn't appear this is meant to stream to the client, but to the server. From there you would still need to manage queues and sockets, etc. I've had to write several exchanges that do something along these lines (stream order data to a client), and I've always accomplished it by having the code that writes to the db also push a client update to a queue.
I guess my question is why would this be a preferred solution? It seems to run afoul of the 'one job' design goal. What am I missing?
Hi Chris, great question. There are a couple of advantages actually.
- You don't have to write logic to figure out which clients need to be updated with some data. You just run the queries you need to generate the data for the client, and then RethinkDB sends you an update only if the result of that specific query changes.
- Having the thing that modifies the data send the updates to the connected clients becomes increasingly difficult if you have multiple application servers. Then you would need to set up some separate message passing / broadcasting infrastructure if you also want to update clients connected to other servers. RethinkDB takes care of "passing the news around", even in a distributed environment.
- RethinkDB supports changefeeds not just on the raw data, but also on transformations of the data. Not all transformations are supported yet (for example, map-reduce queries are not), but our goal is definitely to support changefeeds on pretty much any query. Just knowing that the underlying data has changed isn't enough. In a traditional setting, you would still need to either recompute the whole query, or implement your own logic for incrementally updating the results of every type of query you want to run. RethinkDB updates query results incrementally and efficiently.
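To make the incremental-update point in the last bullet concrete, here's a minimal sketch (in plain Python, with a simulated feed rather than a live connection) of how a client can keep a cached query result current from RethinkDB-style change documents, which carry an `old_val` and a `new_val`. The `apply_change` helper and the sample documents are hypothetical illustrations, not part of any driver API:

```python
# Hypothetical sketch: applying RethinkDB-style change documents
# ({"old_val": ..., "new_val": ...}) to a locally cached query result,
# so the client never has to re-run the whole query.

def apply_change(cached, change):
    """Update a dict of documents (keyed by id) from one change event."""
    old, new = change.get("old_val"), change.get("new_val")
    if old is not None:
        cached.pop(old["id"], None)   # document left the result set
    if new is not None:
        cached[new["id"]] = new       # document entered or changed
    return cached

# Simulated feed: an insert, an update, then a delete of the same document.
cached = {}
feed = [
    {"old_val": None, "new_val": {"id": 1, "price": 10}},
    {"old_val": {"id": 1, "price": 10}, "new_val": {"id": 1, "price": 12}},
    {"old_val": {"id": 1, "price": 12}, "new_val": None},
]
for change in feed:
    apply_change(cached, change)

print(cached)  # → {} (the insert and update were undone by the delete)
```

The point of the `old_val`/`new_val` pair is that each event is self-describing: the client can maintain the result set with O(1) work per change instead of recomputing the query.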
As I mentioned the code that writes the event to the db notifies the client about it, so it knows about the client. This isn't really an issue.
In the case of an exchange I never split a single order book across multiple servers, but I can imagine a lot of applications where this could be an issue. How do you handle data consistency across nodes? Ultimately you have to solve the same issue...
Again, this isn't a big limitation for my use case. That said your answer has certainly given me a greater understanding of other circumstances where it would be very useful. Thank you.
If you shard a RethinkDB table to split it across multiple servers, and then create a changefeed on the table, the database will automatically send changes from both servers. Basically, server management/sharding in RethinkDB is visible to ops people, but is completely abstracted from the application developer. All writes are immediately consistent.
Rethink doesn't provide ACID guarantees, though. If you want to make a change to multiple documents in a table and have ACID guarantees, I'd stick with traditional RDBMSes.
How are all writes immediately consistent? What if two clients make simultaneous writes to each shard? Or more extreme, what if there is a net split and writes continue on both shards?
I suppose I could just ask if you have an architecture document floating around :)
That's a great question. We published a blog post earlier this week explaining why building realtime features into the database is so exciting: http://rethinkdb.com/blog/realtime-web/
It would be trivial to pipe the output of the stream to the `client` yourself. The benefit being that RethinkDB doesn't have to auth the clients.
This is a preferred solution because the only efficient alternative is to inspect every insert or update, manually determining which clients care about those changes.
And what happens to the fact that the data has changed if the client becomes disconnected for whatever reason? Is the fact that the data has changed committed with the data atomically? Is the query persistent across connections? Or does it end in tears?
Changefeeds are currently bound to a database connection, so they will get terminated if for example the application server goes down or there's a networking issue.
We feel like for many queries (especially the ones you find in web applications) that's not a big deal, since you can efficiently re-run them after reconnecting. In other cases it definitely matters, and we are going to add what you describe in a future release. You can follow the progress (or chime in if you like) at https://github.com/rethinkdb/rethinkdb/issues/3471 .
So obviously that would be application specific. In the case of an exchange when a client connects they get a snapshot of the whole market and then begin receiving streaming updates that took place after that. I generally prefer ascending event ids over timestamps but either approach is workable. The logic about which id to start with can be managed by the server in a socket implementation or on the client when polling. So, disconnects/reconnects for that type of application are handled fairly elegantly.