Sensible SQLite Defaults (briandouglas.ie)
223 points by thunderbong on Dec 6, 2024 | 67 comments


This will probably get buried, but I'm wondering: Has any scripting language attempted to use SQLite as its memory model? You get GC, memory safety, atomicity guarantees, core dump functionality, and rich data structures for free. If configured to run in-memory, I reckon it wouldn't impose enough overhead to be problematic.

It sounds crazy, but most scripting languages add this stuff anyway.

With rqlite this can be extended to distributed systems programming as well.


Not crazy at all; just uncommon. It would be an implementation of the basic recommendation from the excellent paper, "Out of the Tar Pit".

https://curtclifton.net/papers/MoseleyMarks06a.pdf


Thank you for the link! Reading now.


That's interesting. There would be overhead compared to directly managing memory. But it would be wild to be able to declare an array of structs that you could directly query.

    var people person[5]
    people[5] // { name: "Human Person", age: 39 }
    people.query("SELECT name FROM people WHERE age = 39") // "Human Person"
If everything in memory was a table structure, you could do joins across variables too.
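A rough sketch of what that could look like, using Python's stdlib sqlite3 with an in-memory database standing in for the runtime's heap (the table name and columns are invented for the example):

```
import sqlite3

# In-memory SQLite database standing in for the runtime's heap.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE people (idx INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
mem.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Grace", 45), (3, "Alan", 41),
     (4, "Edsger", 52), (5, "Human Person", 39)],
)

# people[5]: ordinary indexed access...
print(mem.execute("SELECT name, age FROM people WHERE idx = 5").fetchone())

# people.query(...): ...but the same "variable" is directly queryable, and
# joins across variables (tables) come for free.
print(mem.execute("SELECT name FROM people WHERE age = 39").fetchone())
```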

Would it have to be a scripted language? Seems like you could do it as a compiled language.

I love SQL, so if I had the option to use it within an application I'd be hard pressed not to. But from what I've learned from the popularity of ORMs and NoSQL, most don't share that feeling.


You've just described LINQ, which has an SQL-ish DSL as part of C#:

```
List<int> numbers = [5, 4, 1, 3, 9, 8, 6, 7, 2, 0];

// The query variables can also be implicitly typed by using var.

// Query #1.
IEnumerable<int> filteringQuery =
    from num in numbers
    where num is < 3 or > 7
    select num;

// Query #2.
IEnumerable<int> orderingQuery =
    from num in numbers
    where num is < 3 or > 7
    orderby num ascending
    select num;

// Query #3.
string[] groupingQuery = ["carrots", "cabbage", "broccoli", "beans", "barley"];

IEnumerable<IGrouping<char, string>> queryFoodGroups =
    from item in groupingQuery
    group item by item[0];
```

https://learn.microsoft.com/en-us/dotnet/csharp/linq/get-sta...


I'm not prisenco, but does LINQ have its memory model in SQL?

What tickles me about the idea I posted above is that you could take a snapshot of your entire program state (at sequence points) and it's just an SQLite db file.


LINQ uses a bunch of iterator functions that accept lambdas. C# allows them to capture the lambdas as expression trees, so the expression can be converted into SQL or whatever.

It also has an SQL-like syntax you can use that is compiled into those iterators.

The interface is generic, with built-in support for IEnumerable types and SQL data sources.


> that you could directly query

To take it even further, the language grammar could be a superset of the SQL grammar.

> Would it have to be a scripted language?

Not at all. It's just the first thing which came to mind. I'm tickled by the idea of having the entire program state (stack frames and all!) in SQLite so that you could ship a snapshot of your program around.

> There would be overhead compared to directly managing memory.

Of course, but if the critical queries were pre-compiled via sqlite3_prepare, I doubt the overhead would exceed the memory-management overhead of a GC'd language runtime by enough to be a show-stopper.
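As a rough illustration of the pre-compiled-query idea: Python's stdlib sqlite3 keeps an internal cache of compiled statements (the cached_statements parameter), so reusing the same SQL string amortizes the sqlite3_prepare cost much like holding onto a prepared-statement handle in C:

```
import sqlite3

# cached_statements is sqlite3's built-in prepared-statement cache
# (128 is the default size).
con = sqlite3.connect(":memory:", cached_statements=128)
con.execute("CREATE TABLE vars (name TEXT PRIMARY KEY, value)")
con.execute("INSERT INTO vars VALUES ('answer', 42)")

LOOKUP = "SELECT value FROM vars WHERE name = ?"  # compiled once, reused

for _ in range(100_000):
    # Each call reuses the cached prepared statement; only parameter
    # binding and stepping happen per iteration.
    con.execute(LOOKUP, ("answer",)).fetchone()
```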


Wow, the idea of shipping a snapshot is also crazy good; bookmarking your comment as well as the parent comment.

Snapshotting is otherwise a rather involved thing that has to be managed by CRIU or QEMU.

This could even be faster than shipping a VM.

My goal with snapshots could be literally anything (though I find the idea of extracting snapshots of extremely large files a very lucrative one).


While not serialized to a database, memory-dump/snapshot-based distribution was done with Smalltalk and various Lisps. Basically dump the GC heap, and the executable just reloads it right into RAM. Wasn't the best, from what I've read over time.

Having an SQL-based callstack sounds painfully slow.


> Wasn't the best from what I've read over time.

Because?


I don't have references, just bits of notions from reading various accounts of things online over the years. Most of the software that was distributed as an image would also run as an image, and would stop by saving that image down to disk. This could lead to persistent, difficult-to-debug errors (imagine a random RAM bit flip getting persisted into the image) that couldn't be recreated because they only existed on that one client's machine. Each customer was effectively modifying their version as they ran, saved, and reloaded the software, and as the seller was contracted to make local modifications for them, the install base would diverge over time. There was no central or single source to rebuild the program from. Turning it off and on again just reloaded the saved bugs.


> just bits of notion from reading various accounts of things online over the years

And that casts shade.

> Most of the software … would … stop by saving that image down to the disk.

Not so. Fat-client apps for corporate DBs.

> There was no … source to rebuild the program from.

Not so.

I'll leave you to read a 1984 manual, page 459:

https://rmod-files.lille.inria.fr/FreeBooks/TheInteractivePr...

> … just reloaded the saved bugs.

Let's pretend that someone had made "bad" changes and saved them in their current image file, and they had not archived their other changes.

They would be able to recover all their "good" changes without the "bad" changes because their changes were automatically recorded in a transaction log file.

They could replay the "good" changes from the changes log into an original image file, and not replay the "bad" changes.


> And that casts shade

I was just being clear that my knowledge was secondhand.

I hadn't heard about a Smalltalk transaction log.

That's a great link. I'll grab a local copy and read through it later. I love tech history.

Thanks.


FWIW, Stef's Free Online Smalltalk Books:

http://stephane.ducasse.free.fr/FreeBooks.html


> Having an SQL-based callstack sounds painfully slow.

It could be, but this aspect of the idea could also be user-togglable at runtime.


Thank you!

Considering this further... I think Lua might be a good match for this idea. Everything it can do can be pretty easily mapped to SQLite queries and its type system is pretty close to the limited set of types offered by SQLite.


Yup, I was actually looking at the tglite bytecode engine. I am really interested in this project.

My thought was about Arturo (https://arturo-lang.io/) though, since it has Grafito (https://github.com/arturo-lang/grafito), and it just felt like it would integrate better with SQL, though Lua is also fine.

We could also integrate this directly with something like Lua running on SQLite, compiled by the Cosmopolitan project, to ship a cross-platform single-executable SQLite.

The possibilities are endless.

If you are genuinely interested in this project, I am willing to help in my free time. But I am a novice: I don't have experience in low-level languages like C/Rust, and I much prefer Golang.


I have some work on this idea: https://tablam.org

BTW something closer was done with FoxPro and similar languages...


This is such a crazy good take. I was thinking of creating a Golang-based module/code loader that works directly from a SQLite file, plus a tool/CLI to help with it. What I intend to do: let's say my CLI's name is goter (go + motor, IDK).

goter init (creates a SQLite file and a main.go etc., but all the main.go etc. are just symlinks to something inside the SQLite file; I am not sure, it's just a really, really hypothetical idea)

goter add dep <some dependency>

(which it also then loads into that SQLite file)

goter run main.go (which can run it)

goter share (which can output a SQLite file)

goter load .db file (which could even have whole Golang versions inside of it)

goter build docker (creates a Docker image, if possible)

goter collab ipc (creates an IPC gateway between two servers, using something like piping-web if behind NAT, or directly otherwise) (to make multi-machine deployments really easy)

And integration with popular tools like go-blueprint, or even the ability to create custom templates like go-blueprint does, if possible, since clack has been ported to Go by some guy.

This doesn't have to be Golang-specific; it can be language-agnostic.

My thought was to create a dead-simple cloud editor / code runner (where the code runs inside bubblewrap), where the whole codebase of a single user is just a SQLite file: insanely simple to transport, insanely efficient.

I am not sure if this makes sense, but you have guided me in a direction that feels just as enlightening as the day I discovered the Cosmopolitan project / redbean on Hacker News (offtopic) (cross-platform executables, say what!).

THANKS A LOT, I AM BOOKMARKING THIS. This is so good an idea that I might've lost a bit of my mind at 1 AM comprehending how good it might be.


I also think about this sometimes. Maybe not as the only memory model, but a simple scripting language with tight SQLite integration would be pretty cool.

P.S. Actually, I think I saw something similar somewhere in the past; if it comes to mind I'll send it.


Yeah dude, send it. I am thinking of writing a scripting language some years from now, and I have absolutely loved this idea.


I'm not aziis98, but Smalltalk had something similar which was significantly greater in scope.

IIRC, your application shipped with the entire development environment.


Unless you removed it.


I had a situation with HashBackup (I'm the author) where I needed a map-like data structure that could get rather large (hundreds of MB to a few GB) with big backups. In Python, storing this data in a dict was not really feasible because every unique integer takes 24 bytes, and since the data structure stores things like blockid and file offsets, there are a lot of unique integers.

Instead, I created an in-memory SQLite database for the data structure and it was something like 1/3 the size of the Python dict. It was around 10-15% slower, but for the memory savings it was an okay trade-off.
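A minimal sketch of that kind of swap, assuming a made-up blockid-to-offset mapping (not HashBackup's actual schema):

```
import sqlite3

# In-memory table of plain INTEGER columns; SQLite stores these far more
# compactly than a dict of Python int objects (~24 bytes each).
idx = sqlite3.connect(":memory:")
idx.execute("CREATE TABLE blockmap (blockid INTEGER PRIMARY KEY, offset INTEGER)")

def put(blockid: int, offset: int) -> None:
    idx.execute("INSERT OR REPLACE INTO blockmap VALUES (?, ?)", (blockid, offset))

def get(blockid: int):
    row = idx.execute(
        "SELECT offset FROM blockmap WHERE blockid = ?", (blockid,)
    ).fetchone()
    return row[0] if row else None

put(123456, 789012)
assert get(123456) == 789012
```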


Have you taken a peek at sqlsync? It's a cool take where you write your entire application logic as transactions and queries over a SQLite database, but you do it in WASM so that the same transaction logic can run everywhere. The end result is nifty. That's at least my recollection of the thing.

https://sqlsync.dev/


I think the core idea is fascinating. This model of making the data of the application transparently persistent was a fad in the 1980s-1990s when object databases like Objectivity and GemStone were all the rage. Those faded from the mainstream for various reasons, though they're still around and still heavily used in industries such as finance.

Object databases like Objectivity, POET, Matisse, and others provided tools that took your source code and augmented your classes with low-level code that automatically loaded objects from the database. So you wrote your classes and ran the tools, and you suddenly got magical persistent objects. (A lot like CORBA and COM, but usually without an IDL.) These databases initially only did C++, but later they supported Java, which had the big benefit that the tool could take the compiled byte code, disassemble it, inject the persistence glue, and then assemble it again. (GemStone was Smalltalk, which already has the idea of persistence built in.)

Someone eventually applied this to Python which, being a dynamic language, had the benefit of letting everything be done entirely at runtime. The most successful project was Zope, the CMS, which came with the ZODB. With ZODB, you could treat entire graphs of Python values as transparently persisted to a database file: You just did "db.users[1].name = 'bob'" and that was it.
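From memory, ZODB usage looks roughly like this (file name and details are illustrative; see the ZODB docs for the real API):

```
import transaction
from persistent.mapping import PersistentMapping
from ZODB import DB
from ZODB.FileStorage import FileStorage

db = DB(FileStorage("Data.fs"))
root = db.open().root()

# Assign a persistent object graph under the root...
root["users"] = PersistentMapping({1: PersistentMapping({"name": "alice"})})
transaction.commit()

# ...then just mutate it and commit; persistence is transparent.
root["users"][1]["name"] = "bob"
transaction.commit()
```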

Of course, this results in the access path problem described in the tar pit paper. Without a way to slice and dice data from any angle, you're stuck with the ad-hoc graph structure you've created through values and their relationships, as opposed to the relational model. I'm not sure this is entirely an unsolved problem, though lack of queryability was the Achilles' heel of all these object databases.

Another persistent language is M [1], also known as MUMPS, a quirky language and toolchain that came out of NIH and has been popular in the medical industry since the 1970s, and also in banking (Fidelity National).

M is tightly coupled with a kind of hierarchical, array-oriented database; any variable can be declared as persistent, meaning that it will be read from and written to the data store, and persisted across runs. Since the underlying representation is sparse arrays, you can have lots of huge, nested data structures that are super efficient to read and write. I'm not sure there is anything else like it.

In fact, InterSystems, the main commercial vendor of M software, markets their implementation as an object database (it also does SQL, on top of M).

[1] https://en.wikipedia.org/wiki/MUMPS


So I just found brainfuck written in SQL, and it's crazy good.

I was hoping I could find something like a Lisp interpreter in brainfuck, and the AI hallucinated a hell of a lot.

"I implemented a brainfuck interpreter in pure SQL using CTEs", submitted 6 years ago by chunkyks to r/SQL:

I did this in SQLite, but it'll probably work in other databases.

Why? No good reason. Anyways, this program is a bit traditional:

    $ sqlite3 < bf.sql
    Hello World!

    WITH RECURSIVE
    program AS (
      SELECT '++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.' AS p,
             '' AS input,
             3 AS width
    ),
    jumpdepth AS (
      SELECT 0 AS idx, 0 AS jumpdepth, '' AS jumplist, NULL AS jumpback,
             NULL AS direction, p || '0' AS p, width
      FROM program
      UNION ALL
      SELECT idx+1,
             CASE SUBSTR(p, idx+1, 1) WHEN '[' THEN jumpdepth+1 WHEN ']' THEN jumpdepth-1 ELSE jumpdepth END,
             CASE SUBSTR(p, idx+1, 1)
               WHEN '[' THEN SUBSTR('0000000' || (idx+1), -width) || jumplist
               WHEN ']' THEN SUBSTR(jumplist, width+1)
               ELSE jumplist END,
             CASE SUBSTR(p, idx+1, 1) WHEN ']' THEN CAST(SUBSTR(jumplist, 1, width) AS INTEGER) ELSE NULL END,
             CASE SUBSTR(p, idx+1, 1) WHEN '[' THEN 'L' WHEN ']' THEN 'R' ELSE NULL END,
             p, width
      FROM jumpdepth
      WHERE LENGTH(p) >= idx
    ),
    jumptable(a, b, dir) AS (
      SELECT idx, jumpback, 'L' FROM jumpdepth WHERE jumpback IS NOT NULL
      UNION ALL
      SELECT jumpback, idx+1, 'R' FROM jumpdepth WHERE jumpback IS NOT NULL
    ),
    bf(ep, p, width, defaulttapeentry, ip, dp, instruction, output, input, tape) AS (
      SELECT 0, p, width, SUBSTR('0000000', -width), 1, 1, '', '', input, SUBSTR('000000', -width)
      FROM program
      UNION ALL
      SELECT ep+1, p, width, defaulttapeentry,
             CASE WHEN jumptable.b IS NOT NULL
                       AND ((dir='R' AND CAST(SUBSTR(tape, width*(dp-1)+1, width) AS INTEGER)=0)
                         OR (dir='L' AND CAST(SUBSTR(tape, width*(dp-1)+1, width) AS INTEGER)!=0))
                  THEN jumptable.b ELSE ip+1 END,
             CASE SUBSTR(p, ip, 1) WHEN '>' THEN dp+1 WHEN '<' THEN MAX(dp-1, 1) ELSE dp END,
             SUBSTR(p, ip, 1),
             CASE WHEN SUBSTR(p, ip, 1)='.' THEN (output || CHAR(SUBSTR(tape, (dp-1)*width+1, width))) ELSE output END,
             CASE WHEN SUBSTR(p, ip, 1)=',' THEN SUBSTR(input, 2) ELSE input END,
             CASE SUBSTR(p, ip, 1)
               WHEN '<' THEN CASE WHEN dp=1 THEN defaulttapeentry || tape ELSE tape END
               WHEN '>' THEN CASE WHEN dp*width=LENGTH(tape) THEN tape || defaulttapeentry ELSE tape END
               WHEN '+' THEN SUBSTR(tape, 1, width*(dp-1)) || SUBSTR('0000000' || (CAST(SUBSTR(tape, width*(dp-1)+1, width) AS INTEGER)+1), -width) || SUBSTR(tape, width*dp+1)
               WHEN '-' THEN SUBSTR(tape, 1, width*(dp-1)) || SUBSTR('0000000' || (CAST(SUBSTR(tape, width*(dp-1)+1, width) AS INTEGER)-1), -width) || SUBSTR(tape, width*dp+1)
               WHEN ',' THEN SUBSTR(tape, 1, width*(dp-1)) || SUBSTR('0000000' || (UNICODE(SUBSTR(input, 1, 1))), -width) || SUBSTR(tape, width*(dp+1))
               ELSE tape END
      FROM bf LEFT JOIN jumptable ON jumptable.a = ip
      WHERE LENGTH(p) >= ip
    )
    SELECT output FROM bf ORDER BY ep DESC LIMIT 1;

https://www.reddit.com/r/brainfuck/comments/83cw7l/i_impleme...


This prints "Hello World!", and the SQL has loops etc. as well.


FWIW, auto_vacuum set to INCREMENTAL doesn't do any vacuuming automatically; it just stores the information needed to vacuum the DB.

From https://www.sqlite.org/pragma.html#pragma_auto_vacuum:

> When the value of auto-vacuum is 2 or "incremental" then the additional information needed to do auto-vacuuming is stored in the database file but auto-vacuuming does not occur automatically at each commit as it does with auto_vacuum=full. In incremental mode, the separate incremental_vacuum pragma must be invoked to cause the auto-vacuum to occur.
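In other words, with incremental mode you have to trigger the vacuuming yourself. A sketch (database name and page count are arbitrary):

```
import sqlite3

con = sqlite3.connect("app.db")
# auto_vacuum only takes effect on a brand-new database, or after a full
# VACUUM rewrites the file.
con.execute("PRAGMA auto_vacuum = INCREMENTAL")
con.execute("VACUUM")

# ...later, from a maintenance task: reclaim up to 200 free pages.
con.execute("PRAGMA incremental_vacuum(200)")
```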


I think it would be really useful to have a “Why Not?” section for each option as well.

Some of these feel like tradeoffs and setting them blindly without understanding the downsides seems incorrect.


https://www.sqlite.org/pragma.html does a pretty good job on the tradeoffs. Something like WAL you'd want to look a bit deeper on... but need not go too far afield: https://www.sqlite.org/wal.html


The note on several of them being defaults for a web app kind of sets expectations for me. These are defaults for when you're writing a web app and want it to be a bit more like MySQL or Postgres. They aren't defaults for using SQLite as an in-memory cache in a complex piece of analysis.

SQLite is an amazingly widely used piece of software. It's impossible for one set of defaults to be perfect for all use cases.


I agree. While it's probably not possible to settle on defaults that work for each and every scenario, my personal preference is that factory defaults should primarily optimise for safety (both operational safety and safety in terms of usage).

For example, OP suggests setting the `synchronous` pragma to `NORMAL`. This can be a performance gain, but it also comes at the cost of slightly decreased durability. So for that setting, I’d feel that `FULL` (the default) makes more sense as factory default for a database.
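For reference, a sketch of how those two choices look when configuring a connection (assuming WAL journaling, as the article also suggests):

```
import sqlite3

con = sqlite3.connect("app.db")
con.execute("PRAGMA journal_mode = WAL")
con.execute("PRAGMA synchronous = FULL")      # the default: durability first
# con.execute("PRAGMA synchronous = NORMAL")  # faster: a power failure may
# lose the most recent commits, though the database itself stays consistent
# in WAL mode
```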


I agree that they are tradeoffs, and I find it strange to use a term like "sensible" to describe them. In my mind, it implies that if you're not using them you're not being sensible. And in my experience, people who can't explain that these are tradeoffs often use words like that[0]. Of course, English is not my first language, so that may be why I think that.

These may be useful settings for a certain kind of application under a certain workload. As usual, monitor your application and decide what is suitable for your situation. It is limiting to think in a binary way of something being sensible or not.

[0] Like "Best Practices". Any others you can think of?


That's the essence of defaults: they're just better defaults, not an excuse to stop caring about what better settings to use once you have runtime performance metrics to guide whether they need changing and which direction to change them in.


Right, but they are not the official defaults. If you are changing the official defaults, then you might as well do it in a more principled way.

If you are going to need to optimise with performance metrics anyway, then why not stick to just the official defaults (unless the official defaults are non-functional, is that the case?)

I think anyone who is reading a blog post on “better defaults” is front loading some of the optimisation, so you could let them make a principled choice straight away for marginal extra cost.


There is some principle here: these are more sensible defaults (but still only defaults) for a very different purpose when compared to the defaults that make sense for "generic SQLite use" that has to keep those set to something that works across all versions. Having different, domain-specific defaults based on an assumption of "you're setting up a new project using the current version of SQLite" makes a whole lot of sense.


A lot of these aren't defaults because of backwards compatibility. IMO there is no reason not to use WAL mode, but it's not the default because it came later.


There is a list of 6 disadvantages of WAL mode on the SQLite site:

https://www.sqlite.org/wal.html


As long as all the processes are on the same machine, WAL mode is a good idea.

It's not good in the case where multiple machines are sharing the same database, say, if you had a shared settings file which allowed multiple VMs to be configured in one place.

Obviously the same-machine situation is the most common. But you asked for a reason why WAL is not appropriate.


> If you are going to need to optimise with performance metrics anyway, then why not stick to just the official defaults (unless the official defaults are non-functional, is that the case?)

Well, why don't you do the research and tell us?

My experience from 10-15 years of dealing with the official defaults of many programs is that they do work, but in 90% of cases they are overly conservative.


I do not agree that WAL mode should be enabled by default. It can break things if used incorrectly.

https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

“To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]. Thus, WAL mode does not work on a network filesystem.”

“It is not possible to change the page size after entering WAL mode.”

“In addition, WAL mode comes with the added complexity of checkpoint operations and additional files to store the WAL and the WAL index.”

https://www.sqlite.org/lang_attach.html

SQLite does not guarantee ACID consistency with ATTACH DATABASE in WAL mode: “Transactions involving multiple attached databases are atomic, assuming that the main database is not ":memory:" and the journal_mode is not WAL. If the main database is ":memory:" or if the journal_mode is WAL, then transactions continue to be atomic within each individual database file. But if the host computer crashes in the middle of a COMMIT where two or more database files are updated, some of those files might get the changes where others might not.”


WAL isn't the footgun, SQLite over network attached storage is. Additional files are not a problem. If you need cross-DB file ACID you can just turn off WAL mode in that instance.


>> WAL isn't the footgun, SQLite over network attached storage is

I agree that using SQLite over a network is not a good idea. But people do do that, and it works tolerably well. On most networks it'll work just fine, especially if writes are sparse.

The point of defaults is that they should cover the "widest range". By all means read the docs, and choose to use Wal. But that's an action you take and understand.

The point of articles like this though is to help alert people to the settings that exist. You should always read the docs and apply them to your context before using any setting.


It’s unreliable on any network for good reason: SQLite relies on the kernel for state information on the file system. Two machines sharing one file over a network have two unsynchronized versions of that state. One machine’s write does not update the other’s in-kernel image.


These all sound like load-specific / ancillary concerns.

I don't think most people are in one of those cases, and most people will greatly benefit from WAL. WAL should absolutely be the default unless you have a good reason to not use it.


You may prefer non-ACID characteristics in your databases in general.

I do not. There may be specific uses, but standards should be followed when possible.


"When possible" in this case being simply don't enable the WAL if your use requires transactions across multiple attached sqlite DBs.


ACID guarantees work just fine. They don't work perfectly in scenarios where SQLite should not be used anyway: multiple machines accessing the same DB file over network, and having multiple attached databases.

None of these use cases are common.


> You may prefer non-ACID characteristics in your databases in general.

No. Quite the opposite. However, the multi-database case strikes me as a very limited edge case; I would be surprised if a double-digit percentage of SQLite users even knew it's an option, let alone used it. The concern is further lessened because, AFAIK, things like FKs do not work cross-DB.


Unfortunately, there doesn't seem to be a PRAGMA option for strict tables. The docs at the following link say that 'STRICT' must be appended to the end of the CREATE TABLE statement to disable the flexible typing 'feature'.

https://www.sqlite.org/stricttables.html
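A quick example of the per-table keyword (requires an SQLite library of 3.37 or newer; the table is illustrative):

```
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    ) STRICT
""")

# With STRICT, a type mismatch is now an error instead of being silently
# coerced or stored as-is:
try:
    con.execute("INSERT INTO users (id, name) VALUES (1, 42)")
except sqlite3.IntegrityError as e:
    print(e)  # cannot store INT value in TEXT column users.name
```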


I don't think there can be, given SQLite's backwards-compatibility model. The DB has to assume it may be accessed by multiple versions of the library, or by clients using different pragma settings. So one client disabling strict (or using a version predating that feature) and inserting a bunch of violating records would break the DB file when used by another client with strict enabled.

With it being part of table creation, which IIRC is read on each connection to create the schema, it's part of the DB file, not the client configuration. If you open a DB using STRICT with a version that doesn't support it, it will fail but not cause problems for other users.


I wish there was a branch of SQLite that cleaned up everything, set new sane behavior (and broke backward compatibility since it’s needed to achieve that).

Eg strict mode by default, WAL2, BEGIN CONCURRENT, etc.

Maybe HC-Tree will become that?

https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html


It is time for version 4 of SQLite.


Yep, that all makes perfect sense!


> PRAGMA busy_timeout = 5000;

I would go higher than this by default. Probably the best default is something like "amount of time a human would be willing to wait for something they want before giving up" which, if you have to pick one value, is hopefully longer than 5 sec. Perhaps 30000.

Now, there are also cases where fail-fast is best. However, 5000 isn't good for those either. And fail-fast assumes a wider system that handles failures in quite a nice way, which doesn't happen without special effort, so it doesn't make sense to target with a default value.
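For what it's worth, a sketch of the more generous setting from Python:

```
import sqlite3

# connect()'s timeout argument is the same knob at the Python level; it
# defaults to 5.0 seconds, matching the article's 5000 ms.
con = sqlite3.connect("app.db", timeout=30.0)
con.execute("PRAGMA busy_timeout = 30000")  # or set it in SQL (milliseconds)
```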


A very similar list (more aimed at high-performance) here:

https://highperformancesqlite.com/articles/sqlite-recommende...


> Why?: Prevents "database is locked" errors by giving SQLite 5 seconds to wait for a locked resource before returning an error, useful for handling multiple concurrent accesses.

Alongside that, one of the most important things to do is to have strong read/write segregation: separate read-only and read/write pools, the latter having a single connection which is immediately set to BEGIN IMMEDIATE when it's borrowed out.
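A minimal sketch of that segregation in Python (pool size and helper are illustrative; a real version would also need a lock or queue around the single writer):

```
import queue
import sqlite3

# Read-only pool: the URI mode=ro makes these connections physically
# unable to write.
read_pool: queue.Queue = queue.Queue()
for _ in range(4):
    read_pool.put(
        sqlite3.connect("file:app.db?mode=ro", uri=True, check_same_thread=False)
    )

# "Pool" of one writer. isolation_level=None means autocommit, so we can
# issue BEGIN IMMEDIATE ourselves and take the write lock up front.
writer = sqlite3.connect("app.db", isolation_level=None, check_same_thread=False)
writer.execute("PRAGMA busy_timeout = 5000")

def write(sql: str, params=()) -> None:
    writer.execute("BEGIN IMMEDIATE")  # fail/wait here, not at COMMIT
    try:
        writer.execute(sql, params)
        writer.execute("COMMIT")
    except Exception:
        writer.execute("ROLLBACK")
        raise
```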


What do you mean by "connection" in the context of SQLite?


How is the meaning not clear? If you're trying for a "there are no connections in SQLite" type of gotcha, that's not even true. SQLite itself uses the term internally, and it shows up in the docs.

https://www.sqlite.org/c3ref/open.html https://www.sqlite.org/c3ref/sqlite3.html



In my tests, with all these settings, if you hammer SQLite with reads/writes, it won't be able to do any WAL checkpoints (they need no active transactions), and the WAL file can quickly grow from, say, 8 MB to tens or hundreds of GB, until your machine runs out of space (the docs mention this). So my default is to also keep track of the size of the WAL file, and if it grows beyond some threshold, temporarily suspend all reads/writes and checkpoint manually.


How do you do this, with custom code in a maintenance thread?


Yes, a separate thread.
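A sketch of such a maintenance thread (file name, threshold, and polling interval are made up; the "suspend all reads/writes" coordination is application-specific and elided here):

```
import os
import sqlite3
import threading

DB_PATH = "app.db"            # hypothetical database file
WAL_LIMIT = 64 * 1024 * 1024  # arbitrary 64 MB threshold

def wal_watchdog(stop: threading.Event) -> None:
    con = sqlite3.connect(DB_PATH, check_same_thread=False)
    while not stop.wait(5.0):  # poll every 5 seconds
        try:
            if os.path.getsize(DB_PATH + "-wal") > WAL_LIMIT:
                # TRUNCATE checkpoints the entire WAL and resets it to zero
                # bytes; it can block or fail while transactions are active,
                # which is why the parent comment suspends reads/writes first.
                con.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except FileNotFoundError:
            pass  # no WAL file yet

stop = threading.Event()
threading.Thread(target=wal_watchdog, args=(stop,), daemon=True).start()
```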


Why is cache_size negative?


According to the docs, a negative value sets the cache size in kibibytes (units of 1024 bytes) rather than pages. [1]

[1] https://www.sqlite.org/pragma.html#pragma_cache_size
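For example (both lines ask for roughly a 64 MB cache, assuming 4 KiB pages for the positive variant):

```
import sqlite3

con = sqlite3.connect("app.db")
con.execute("PRAGMA cache_size = -64000")  # negative: cache sized in KiB
con.execute("PRAGMA cache_size = 16000")   # positive: cache sized in pages
```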



