> you caan get most of the useful git-like time-travel semantics (modulo schema migrations) out of timeseries data with a separate transaction history table
Well, not really. Dolt isn't just time travel. If all you want is time travel (or append-only data), you can do that with Postgres or MySQL pretty easily and get great performance. What Dolt brings to the table is actual Git semantics, a real commit graph you can branch, merge, fork, clone. You can inspect the historical values of each cell and examine the author and commit message for each change. Two people can clone a database, both insert / update / delete rows, and then merge each other's changes. If they touched the same rows, they'll have a merge conflict they have to resolve. If you find somebody's data they've published in the Dolt format has an error, you can fork it, correct the data, and submit them a PR to merge back, just like you do with code on Github. It's a database built for collaboration from the ground up. Instead of authorizing each person you want to make edits specific grants on particular tables / subsets of the data, you just let them clone it, then submit a PR that you can examine and approve. Or if they disagree with your policies, they can fork the data and still merge in updates from your master source as necessary. Git demonstrated what a powerful model the commit graph is for code. Dolt brings it to data, with SQL built in.
To answer your question about indexes across schema migrations, indexes are versioned with the schema of the table. This means they can be efficiently synced during a pull operation, but it means that previous versions of the data don't have indexes on them. We're considering adding the ability to add an index to the data from the beginning of the commit history as a rebase operation, but haven't implemented that yet.
Heyo, thank you for responding! I totally agree that the commit, PR, merge semantics are powerful and valuable, and you can’t get that easily with existing databases.
I guess in my imaginary perfect world, you don’t need to use commit, PR, merge semantics to make normal edits. You can use existing bitemporal/transaction history ideas in online updates, appends, and deletes, and then you have a higher-level abstraction for offline(-only?) branching and merging.
I guess what I’m saying is that I don’t totally buy the idea that you need git “all the way down”, especially if it gets in the way of performance. But maybe I’m just used to “the old way”, and I’ll cringe reading this in 20 years. :)
Major kudos for this project, it’s super impressive. I’m excited to see how it grows!
I have a hunch that some form of multi-temporalism, or perhaps more fancily some kind multi-lattice-ism, as the secret model underlying all of these.
But putting that probably-wrong-because-I'm-a-dilettante wankery to one side: a serious advantage of bitemporalism is that most layfolk understand how clocks and calendars work. This is less true of git commits.
I think I agree with you. Git uses logical ordering rather than temporal ordering, and their concept of a primary key is rather nebulous.
We've been able to train up "regular folk" on bitemporalism at our company. The distinction between valid and transaction history takes a little while, but it sticks. Git has two or three such conceptual sticking points (staging vs committing, branching & merging, remote vs local).
FWIW we are absolutely working on modes of using the product that "just work" like a normal database without any explicit commits / merges. In that case commits would happen automatically (maybe one for each write transaction?) and you'd get all the history "for free." We aren't sure how useful that is, though. Definitely looking for customers interested in this use case to guide us on their needs.
I work at an insurance company, where it's extremely important that we have a way of persisting (1) what happened when [transaction history], and (2) how our opinions of what happened when changed over time [valid history]. Basically, you can think of (2) as a materialized view of (1). But it all lives in one database.
There is much business value to be extracted out of this idea, because most companies (1) do not have the ability to cheaply (i.e. ad-hoc) introspect this way (2) would gain a lot of value from this ability. My hunch is that most companies out there are interested in answering that set of questions (referred to as bitemporal data, in the literature).
This is the space that TimescaleDB seems to be competing in, although we don't use them (we use our own extension of Postgres, that's only a few hundred lines of PL-pgSQL, plus some helper libraries in Go/Python).
In my perspective, I think git-style semantics would be very powerful as a layer above an existing bitemporal system. We've implemented similar systems, where records get a semver-style version in their bitemporal key, so we can see how different producer logic versions minted different records over time. It's be really cool to have branching versions, and to be able to rebase them -- this is something our system doesn't currently support.
Anyways, hope our data point is useful for you all. Happy to share more if you have questions about anything.
Thanks for this - really interesting (Luke from Terminus here). We have engaged plenty with some reinsurance folk around some query use cases (recursive stuff to better understand risk), but that was before we did a lot of the git-like stuff. Can see how this might be powerful on a historical view in insurance.
Have you heard of patch-theory-based VCS systems (Pijul is unstable, I heard Darcs is slower)? Instead of a chain of independent states (commits), each state is a set of patches (some of which depend on others). Pijul stores an internal structure that's more complex than "a series of chars/lines".
This eliminates complex rebasing topologies, and Git not recognizing that old and rebased commits are the same. But I'm not sure if it works well in practice. And it doesn't extend to SQL data sets yet.
I don't know if you're interested in this or not, but I just wanted to mention it.
Depends what you mean by "work" and "large," I guess. We have repos in the 250 GB range that definitely push the limits of functionality, but everything still works (pull, push, merge, clone, etc., and queries return the right results, if slowly).
Well, not really. Dolt isn't just time travel. If all you want is time travel (or append-only data), you can do that with Postgres or MySQL pretty easily and get great performance. What Dolt brings to the table is actual Git semantics, a real commit graph you can branch, merge, fork, clone. You can inspect the historical values of each cell and examine the author and commit message for each change. Two people can clone a database, both insert / update / delete rows, and then merge each other's changes. If they touched the same rows, they'll have a merge conflict they have to resolve. If you find somebody's data they've published in the Dolt format has an error, you can fork it, correct the data, and submit them a PR to merge back, just like you do with code on Github. It's a database built for collaboration from the ground up. Instead of authorizing each person you want to make edits specific grants on particular tables / subsets of the data, you just let them clone it, then submit a PR that you can examine and approve. Or if they disagree with your policies, they can fork the data and still merge in updates from your master source as necessary. Git demonstrated what a powerful model the commit graph is for code. Dolt brings it to data, with SQL built in.
To answer your question about indexes across schema migrations, indexes are versioned with the schema of the table. This means they can be efficiently synced during a pull operation, but it means that previous versions of the data don't have indexes on them. We're considering adding the ability to add an index to the data from the beginning of the commit history as a rebase operation, but haven't implemented that yet.