Database Migrations
Migrations on CourtListener are hard because of four things:
- Our database is huge. Databases of our size are just more complicated than those Django is designed to work with. Django does an admirable job, but it's not optimized for big data, and when you're migrating hundreds of gigs of data it requires special approaches.
- Our replicated database offering means that Django's built-in migration system falls a bit short. It's not designed to work in a world with multiple databases that are streaming data to each other — few if any automated systems are.
- Our bulk data scripts use PostgreSQL `COPY TO` commands to export the database and need the entire list of columns for many tables.
- We use Django Simple History, which uses database triggers to keep history tables of our database.
Naturally, these four things interact with each other:
- Big tables make migrations slower.
- Slower migrations mean more opportunities for differences between replicated tables.
- Triggers generate history records, but you don't want those triggers to generate records on replication subscribers, which get the records via replication already.
And so forth. The main way we deal with these issues is by making SQL files for each of our migrations, and then carefully analyzing the SQL that the Django migration would normally run. In some cases, it's fine. In other cases, it's disastrous.
The tips and notes below keep us more or less in a happy place.
Steps to Making New Migrations
1. Make the necessary changes to the Django models to which the migration applies.
2. Create the migration files:
   1. Create a Django migration file:
      - `docker exec -it cl-django python manage.py makemigrations <app_name>`
      - This creates a `.py` migration file under the "migrations" folder for the specific app.
      - Keeping the `migration_id` at the beginning of the filename, rename the rest of the filename to describe what this migration does.
      - Add a `_noop.py` suffix if the migration doesn't do anything (like, it just tweaks the `choices` parameter).
   2. Generate raw SQL for the migration:
      - `docker exec -it cl-django python manage.py sqlmigrate <app_name> <migration_id>`
   3. Make a SQL file for our replica:
      - In the same "migrations" folder, make a new `.sql` file with the same filename as the `.py` migration file generated above.
      - Copy and paste the generated raw SQL (from step 2) into the new `.sql` file. The raw SQL begins with `BEGIN;` and ends with `COMMIT;`.
      - Remove any SQL commands that include "triggers"; see below for why.
   4. Make a SQL file for our customers:
      - In the same "migrations" folder, make a copy of the `.sql` file made in step 3, and rename it to add a `_customers` suffix.
      - Remove any SQL commands that include "event" or "triggers"; see below for why.
   5. Tweak the raw SQL files as needed to avoid the issues outlined below, if any.

   This table outlines the differences between the migrations:

   | Targeted Servers | Extension | Has Event Tables | Has Triggers | Why? |
   |------------------|-----------|------------------|--------------|------|
   | Our production server | `.py` | Yes | Yes | Prod uses triggers to generate data it puts in the event tables. |
   | Our replica | `.sql` | Yes | No | Event data is streamed from prod to this replica, so tables are needed, but triggers would create duplicate data if they existed here too. |
   | Customer replicas | `_customers.sql` | No | No | We don't want the events at all. |
3. Update the bulk data script.
   - Look for the script in `scripts/make_bulk_data.sh`, and update it to add/remove whatever columns you need to.
   - Add a line to the bulk data documentation noting the change, with a release note containing the date and explaining the change(s).
4. Apply the migration locally and test it as needed.
   - Apply the migration: `docker exec -it cl-django python manage.py migrate`
   - Verify the migration was successful through the admin panel.
5. (Optional) Occasionally, you may find that you need to undo and redo the last local migration.
   - Undo the last local migration by reapplying the migration before it: `docker exec -it cl-django python manage.py migrate <app_name> <last_migration_id>`
   - Make your changes.
   - Check the output of `docker exec -it cl-django python manage.py makemigrations --dry-run` to confirm it's consistent with your expectations.
   - Redo step 2 above to make the migration files.
Migrating Event Tables and Triggers
- Clients don't have event tables or triggers.
- `cl-replica` doesn't have triggers.
- `prod` has event tables and triggers.
Known Problems
Migrations that do literally nothing
This isn't the worst crime, but sometimes Django can be pretty dumb. For
example, if you convert a text field to be `blank=True`, that'll create a
migration that sets `DEFAULT ''`, followed immediately by `DROP DEFAULT`. That
doesn't do anything except confuse things, so the first rule of good migrations
is: "Migrations should do something."
Adding and removing indexes on a column doesn't use CONCURRENTLY by default
By default, Django creates and removes indexes without using the CONCURRENTLY
statement. This locks the table for the duration of the index creation. This is
devastating.
It takes more work, but as of Django 3.0 this can be avoided by tweaking the
Python migration files to use `AddIndexConcurrently` and
`RemoveIndexConcurrently`. Search the code for these to find
examples.
Note that CONCURRENTLY can't be used in a transaction block.
The old process for this is described in this excellent blog post. You can see an example of the way we used to do this in CourtListener too.
Making data changes in same transaction as schema changes
You cannot make data changes in the same transaction as schema changes. Doing so can raise an error like:
ERROR: cannot ALTER TABLE "search_opinion" because it has pending trigger events
This post has some more information about this: https://stackoverflow.com/a/12838113/64911
Notes on schema changes
Adding a column
If you add a column to the publisher that doesn't already exist on the subscriber you'll get a message at the subscriber that says something like:
ERROR: logical replication target relation "public.t" is missing some replicated columns
Or:
2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column "recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical replication worker for subscription 18390 (PID 13373) exited with exit code 1
(See https://github.com/freelawproject/courtlistener/issues/919)
Both of these messages sort of make sense. In each you're trying to move data (or a null value) to the subscriber and the subscriber doesn't know what to do with it. The fix, of course, is to have that column set up at the subscriber first, as hinted in the documentation:
> In many cases, intermittent errors can be avoided by applying additive schema changes to the subscriber first.
Removing a column
If you remove a column at the subscriber first, the subscriber will receive tuples with fields it doesn't know how to handle. If you remove it at the publisher first, you'll have columns on the subscriber that simply stop being populated. The general rule is to drop a column at the publisher first, then at the subscriber once things have flushed. See:
https://github.com/freelawproject/courtlistener/issues/1164
Misc Additional Reading
It's also worth reviewing these references, which point to problems that can occur on high-volume PostgreSQL instances like ours:
https://www.pgrs.net/2014/05/14/safe-operations-for-high-volume-postgresql/
https://github.com/ankane/strong_migrations
https://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
Obsoleted info
New columns with DEFAULT values in PostgreSQL < v11
The biggest issue we run into is that creating new columns with DEFAULT values
can be fine in regular environments, but causes a crisis in huge tables like the
ones we have. Django does this kind of migration by default when you create a
new text column with `blank=True`. That's very bad, and until we upgrade to
PostgreSQL 11 we will have to contend with this issue.
Here is some background reading on why this is a problem: