Your excellent story compelled me to share another:
We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.
Our test suites drop and recreate our tables, or truncate them, as part of the test run.
One day, a lead developer ran “make test” after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of docker. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.
Ours are not named with a common identifier, and that convention also takes constant effort to maintain while refactoring; there's still scope for a mistake.
*Ideally*, devs should not have prod access at all, or their credentials should be limited ones without permissions for destructive actions like DROP/TRUNCATE, etc.
But in reality, there's always that one helpful dba/dev who shares admin credentials for a quick prod fix with someone and then those credentials end up in a wiki somewhere as part of an SOP.
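For the limited-credentials part, a minimal sketch of what that could look like in Postgres (the role, database, and connection details here are all made up): provision a read-only role that can SELECT but has no DDL or write permissions, and hand that out for ad-hoc prod poking.

import psycopg2

# Connect as an admin; every name below is a hypothetical placeholder.
conn = psycopg2.connect("postgresql://admin@prod-bastion:5432/app")
conn.autocommit = True  # apply each statement immediately
with conn.cursor() as cur:
    cur.execute("CREATE ROLE support_ro LOGIN PASSWORD 'change-me'")
    cur.execute("GRANT CONNECT ON DATABASE app TO support_ro")
    cur.execute("GRANT USAGE ON SCHEMA public TO support_ro")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO support_ro")
conn.close()

With credentials like that, a stray DROP or TRUNCATE fails on permissions instead of destroying data.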
I've added a similar safety to every project. It's not perfect, but this last line of defense has saved team members from themselves more than once.
For Django projects, add the below to manage.py:
TEST_PROTECTED_ENVIRONMENTS = {"production"}  # assumption: whichever environment names must never run tests

env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
    raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")
I think runtime checks like this using environment variables are great. What has burned me in the past, though, is debugging problems later and not knowing what the values were at runtime when the logs were produced. So when the list of test-protected environments needed to be updated, I might have a hard time backtracking to it.
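One thing that would have helped: have the guard log what it saw, so the runtime state ends up in the logs. A rough sketch, as a variation on the manage.py snippet above (the logger name and the protected list are assumptions):

import logging
import os
import sys

logger = logging.getLogger("manage")

TEST_PROTECTED_ENVIRONMENTS = {"production"}  # hypothetical list

env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
# Record the inputs to the check so a later reader can see exactly what it evaluated.
logger.info("test guard: ENVIRONMENT=%s protected=%s argv=%s",
            env_name, sorted(TEST_PROTECTED_ENVIRONMENTS), sys.argv[1:])
if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
    raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")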
And when your last line of defense fires... you don't just breathe a sigh of relief that the system is robust. You also must dig into how to catch it sooner in your earlier lines of defense.
For instance, test code shouldn't have access to production DB passwords. Maybe that means a slightly less convenient login for the dev to get to production, but it's worth it.
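As a sketch of what that could look like in a Django settings module (everything here is a placeholder, not anyone's real config): when the test runner is invoked, ignore the exported connection variables entirely and hard-code a throwaway local database, so the test path simply never sees prod credentials.

import sys

if "test" in sys.argv:
    # Tests always get a local, disposable database; prod credentials are never read here.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "app_test",
            "HOST": "127.0.0.1",
            "PORT": "5432",
            "USER": "app_test",
            "PASSWORD": "app_test",
        }
    }
# otherwise fall through to the normal (env-driven) configuration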
Just yesterday, I did a C# Regex.Match with a super simple regex, ^\d+, and it seemed not to work. I asked ChatGPT and it noted that I had a subtle mistake: the parameters were the other way around... :facepalm:
We had this - 10 years ago. In our case there was a QA environment which was supposed to be used by pushing code up with production configs, then an automated process copied the code to where it actually ran _doing substitutions on the configs to prevent it connecting to the production databases_. However this process was annoyingly slow, and developers had ssh access. So someone (not me) ssh'd in, and sped up their test by connecting the deploy location for their app to git and doing a git pull.
Of course this bypassed the rewrite process, and there was inadequate separation between QA and prod, so now they were connected to the live DB; and then they ran `rake test`...(cue millions of voices suddenly crying out in terror and then being suddenly silenced). The DB was big enough that this process actually took 30 minutes or so and some data was saved by pulling the plug about half-way through.
And _of course_ for maximum blast radius this was one of the apps that was still talking to the old 'monolith' db instead of a split-out microservice, and _of course_ this happened when we'd been complaining to ops that their backups hadn't run for over a week and _of course_ the binlogs we could use to replay the db on top of a backup only went back a week.
I think it was 4 days before the company came back online; we were big enough that this made the news. It was a _herculean_ effort to recover this; some data was restored by going through audit logs, some by restoring wiped blocks on HDs, and so on.
Our test harness takes an optional template as input and immediately copies it.
It’s useful to distribute the tests anyway, especially non-transactional ones.
If database initialisation is costly, that's useful even if the tests run on an empty database, as copying a database from a template is much faster than creating it DDL by DDL, at least for Postgres.
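A rough sketch of that template trick with psycopg2 (the database names and connection string are made up):

import uuid
import psycopg2

conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
test_db = f"test_{uuid.uuid4().hex[:8]}"
with conn.cursor() as cur:
    # Clone an already-initialised template; it must have no other active connections.
    cur.execute(f'CREATE DATABASE "{test_db}" TEMPLATE app_template')
conn.close()

Each test run gets its own copy, and cleanup afterwards is just a DROP DATABASE.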