Your excellent story compelled me to share another:
We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.
Our test suites drop and recreate our tables, or truncate them, as part of the test run.
One day, a lead developer ran “make test” after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of docker. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.
Ours are not named with a common identifier, and that convention also takes constant effort to maintain while refactoring; there's still scope for a mistake.
*Ideally*, devs should not have prod access at all, or their credentials should be limited ones without permissions for destructive actions like DROP/TRUNCATE, etc.
But in reality, there's always that one helpful dba/dev who shares admin credentials for a quick prod fix with someone and then those credentials end up in a wiki somewhere as part of an SOP.
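For the limited-credentials part, a minimal sketch of what that could look like in Postgres (the role, database, and connection details here are all made up): provision a read-only role that can SELECT but has no DDL or write permissions, and hand that out for ad-hoc prod poking.

import psycopg2

# Connect as an admin; every name below is a hypothetical placeholder.
conn = psycopg2.connect("postgresql://admin@prod-bastion:5432/app")
conn.autocommit = True  # apply each statement immediately
with conn.cursor() as cur:
    cur.execute("CREATE ROLE support_ro LOGIN PASSWORD 'change-me'")
    cur.execute("GRANT CONNECT ON DATABASE app TO support_ro")
    cur.execute("GRANT USAGE ON SCHEMA public TO support_ro")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO support_ro")
conn.close()

With credentials like that, a stray DROP or TRUNCATE fails on permissions instead of destroying data.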
I've added a similar safety to every project. It's not perfect, but this last line of defense has saved team members from themselves more than once.
For Django projects, add the below to manage.py:
TEST_PROTECTED_ENVIRONMENTS = {"production"}  # assumption: whichever environment names must never run tests

env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
    raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")
I think runtime checks like this using environment variables are great. What has burned me in the past, though, is debugging problems later and not knowing what the values were at runtime when the logs were produced. So when the list of test-protected environments needed to be updated, I might have a hard time backtracking to it.
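One thing that would have helped: have the guard log what it saw, so the runtime state ends up in the logs. A rough sketch, as a variation on the manage.py snippet above (the logger name and the protected list are assumptions):

import logging
import os
import sys

logger = logging.getLogger("manage")

TEST_PROTECTED_ENVIRONMENTS = {"production"}  # hypothetical list

env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
# Record the inputs to the check so a later reader can see exactly what it evaluated.
logger.info("test guard: ENVIRONMENT=%s protected=%s argv=%s",
            env_name, sorted(TEST_PROTECTED_ENVIRONMENTS), sys.argv[1:])
if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
    raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")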
And when your last line of defense fires... you don't just breathe a sigh of relief that the system is robust. You also must dig into how to catch it sooner in your earlier lines of defense.
For instance, test code shouldn't have access to production DB passwords. Maybe that means a slightly less convenient login for the dev to get to production, but it's worth it.
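As a sketch of what that could look like in a Django settings module (everything here is a placeholder, not anyone's real config): when the test runner is invoked, ignore the exported connection variables entirely and hard-code a throwaway local database, so the test path simply never sees prod credentials.

import sys

if "test" in sys.argv:
    # Tests always get a local, disposable database; prod credentials are never read here.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "app_test",
            "HOST": "127.0.0.1",
            "PORT": "5432",
            "USER": "app_test",
            "PASSWORD": "app_test",
        }
    }
# otherwise fall through to the normal (env-driven) configuration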
Just yesterday, I did a C# Regex.Match with a super simple regex, ^\d+, and it seemed not to work. I asked ChatGPT and it noted that I had a subtle mistake: the parameters were the other way around... :facepalm:
We had this - 10 years ago. In our case there was a QA environment which was supposed to be used by pushing code up with production configs, then an automated process copied the code to where it actually ran _doing substitutions on the configs to prevent it connecting to the production databases_. However this process was annoyingly slow, and developers had ssh access. So someone (not me) ssh'd in, and sped up their test by connecting the deploy location for their app to git and doing a git pull.
Of course this bypassed the rewrite process, and there was inadequate separation between QA and prod, so now they were connected to the live DB; and then they ran `rake test`...(cue millions of voices suddenly crying out in terror and then being suddenly silenced). The DB was big enough that this process actually took 30 minutes or so and some data was saved by pulling the plug about half-way through.
And _of course_ for maximum blast radius this was one of the apps that was still talking to the old 'monolith' db instead of a split-out microservice, and _of course_ this happened when we'd been complaining to ops that their backups hadn't run for over a week and _of course_ the binlogs we could use to replay the db on top of a backup only went back a week.
I think it was 4 days before the company came back online; we were big enough that this made the news. It was a _herculean_ effort to recover this; some data was restored by going through audit logs, some by restoring wiped blocks on HDs, and so on.
Our test harness takes an optional template as input and immediately copies it.
It’s useful to distribute the tests anyway, especially non-transactional ones.
If database initialisation is costly, that's useful even if the tests run on an empty database, as copying a database from a template is much faster than creating it DDL by DDL, at least for Postgres.
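A rough sketch of that template trick with psycopg2 (the database names and connection string are made up):

import uuid
import psycopg2

conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
test_db = f"test_{uuid.uuid4().hex[:8]}"
with conn.cursor() as cur:
    # Clone an already-initialised template; it must have no other active connections.
    cur.execute(f'CREATE DATABASE "{test_db}" TEMPLATE app_template')
conn.close()

Each test run gets its own copy, and cleanup afterwards is just a DROP DATABASE.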