Rapid Release and Continuous Delivery
Rapid Release is the name of a project to make releases more frequent and deploy bugfixes more rapidly. This is a project that will require company-wide changes over many years, but here I’ll focus on some upcoming changes to continuous delivery (CD) which will get us just a bit closer to making releases “rapid”.
The Changes
Operations is in the process of changing how CD works, with the goal of making it possible to deploy production systems automatically. In the short term, I hope this will allow production systems to be deployed fully automatically on a fixed schedule, within the same minor version. Long term, I hope this will also be usable for deployments of a new major/minor version, though that will require changes to how we advertise, sell, test and develop our product.
The changes described here are planned to be rolled out for test deployments shortly. Some features required for this to go into production are still missing. However, I’d like to see how this works in practice first. Some details about the missing bits and pieces can be found below.
There is one visible change some of you may already have noticed: a new auto deployment mode appeared in TeamCity:
The features described here are all hidden behind this option.
Most notably, the following features have been added or reworked:
- A check for pending, customer-specific changes has been added.
- A maintenance page is shown during deployment.
- The installation is stopped for the full duration of a deployment.
- DB dumps have been reimplemented.
- DB restore (recovery) on failure has been added.
Recovery
To prevent changes to the database after the deployment has started, the installation is stopped and a maintenance page is shown during deployment. If we want deployments to be automated, we’ll have to run them outside regular working hours, but we don’t want to spend our nights fixing Nice when a deployment fails. So, we stop everything, create a backup, and if something goes wrong, simply restore the backup and investigate the issue the next day.
Worst case, recovery fails and someone has to intervene and recover the system manually. A manual recovery, however, is far less of an unknown than a deployment that can fail for a great number of reasons and, over time, as we fix initial bugs, manual recovery will hopefully become a rare event.
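Put into pseudo-shell, the sequence described above looks roughly like this. This is a sketch only: the real steps are TeamCity build steps, and enable_maintenance_page, deploy_new_version and the variables are placeholders, not existing commands:
# Illustration of the order of operations only - not the actual TeamCity build steps.
oc scale --replicas 0 dc/nice                  # stop Nice so no further DB changes can happen
enable_maintenance_page                        # placeholder: show the maintenance page (counterpart of 'tocco-mntnc stop')
tco db-backup dump "${db_server}/${db_name}"   # create a backup; prints the matching restore command
if ! deploy_new_version; then                  # placeholder: the actual deployment steps
    ${restore_command}                         # placeholder: restore the backup created above
fi
oc scale --replicas 1 dc/nice                  # start Nice again (the old version after a restore)
tocco-mntnc stop                               # remove the maintenance page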
Backups / DB Dumps
DB dumps have been reimplemented to overcome some shortcomings of the old implementation [1]. For the user, the way a DB needs to be restored has changed: it’s now automated.
If a deployment fails, a restore happens automatically and the TeamCity build step DB / recovery will print the details:
Original, replaced DB:
tco db-connect db1.stage.tocco.cust.vshn.net/nice_toccotest_renamed_31249ba7
Restored DB:
tco db-connect db1.stage.tocco.cust.vshn.net/nice_toccotest
In exceptional cases, the dump may need to be restored manually. In that case execute the command printed as part of the TeamCity build step dump DB:
Restore command:
tco db-backup restore db1.stage.tocco.cust.vshn.net db2.stage.tocco.cust.vshn.net:file:manual/nice_toccotest-2025-03-11T12:24:20+01:00.dump
This will create a new DB and print its name. Rename the DB as needed. See also DB Backup Bonus Features below.
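For the rename itself, plain SQL does the job. A minimal sketch, assuming the restored name from the example above; the host and the postgres superuser are placeholders, and neither database may have active connections while being renamed:
# connect to the maintenance database, not to one of the databases being renamed
psql -h db1.stage.tocco.cust.vshn.net -U postgres -d postgres \
    -c 'ALTER DATABASE nice_toccotest RENAME TO nice_toccotest_replaced;' \
    -c 'ALTER DATABASE nice_toccotest_restored_6e5fcc61 RENAME TO nice_toccotest;'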
Maintenance Page
Because the installation remains stopped for a prolonged time, a maintenance page is shown until the deployment has completed:
This page will go away once the deployment has completed, whether it succeeded or a recovery was needed. The one exception is when recovery fails; in that case, the page will remain until it is disabled manually.
You can start Nice again by scaling it:
oc project nice-${installation}
oc scale --replicas 1 dc/nice
Then disable the maintenance page manually:
oc project nice-${installation}
tocco-mntnc stop
See also the documentation for tocco-mntnc. The above command only works out of the box on Ubuntu/Debian; everyone else, please refer to the documentation.
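To double-check that the page is really gone afterwards, a plain HTTP request is enough. A small sketch; the URL is a placeholder and the exact status code served by the maintenance page is an assumption:
# prints only the HTTP status code of the installation's start page
curl -s -o /dev/null -w '%{http_code}\n' "https://${installation}.example.org/"
# 200 means Nice answers again; a 5xx code suggests the maintenance page (or a broken Nice) is still being served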
Check for Pending Customer-Specific Changes
When the auto deployment mode is enabled, deployments are skipped automatically if there are any customer-specific changes pending deployment. This is to avoid accidentally deploying Change Requests or similar changes.
If there are any pending changes, the list of offending changes is printed and all further steps are skipped. This only applies to production systems; test systems are deployed unconditionally.
This feature uses tco pending-deployment in the background to find pending changes.
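Conceptually, the check boils down to something like the sketch below. It is only meant to illustrate the idea; the exact arguments and output format of tco pending-deployment, as well as the environment handling, are assumptions and not the real build step:
# illustration only - not the actual TeamCity build step
pending=$(tco pending-deployment "$installation")   # exact arguments are an assumption
if [ -n "$pending" ] && [ "$environment" = "production" ]; then
    echo "Pending customer-specific changes, skipping deployment:"
    echo "$pending"
    exit 0                                          # skip all further build steps
fi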
Status Quo
Various pieces are still missing, some of which will be required before automated production deployments become a possibility:
- Alerting: The maintenance page has been adjusted to ensure monitoring alerts while the maintenance page is active. We’ll have to set a downtime when a deployment is started and ensure an alert is sent out if deployment and recovery fail and manual intervention is needed. Without this, a maintenance page can remain unnoticed indefinitely.
- Options: No options are available yet. You can’t deploy without a maintenance page, or allow a deployment even if there are pending changes. We’ll have to figure out what’s needed over time and add the required options.
- Missing safeguards to ensure sufficient disk space: There are safeguards in place to ensure the DB server cannot run out of disk space during a restore and render the whole server unusable. But we’ll probably need some additional checks, or simply more spare space, to ensure a deployment never starts if we can’t guarantee sufficient space for a potential recovery (see the sketch after this list).
- Creaminess: Also, whoever wants to receive ice cream privileges in the future needs to complain to Marcel about the color of the curtain in the office.
- Deployment / Rollback of Kubernetes Configuration: When a recovery happens, we restore the DB and the old version of Nice is started again. However, the Kubernetes configuration is not rolled back. If you edit the DeploymentConfig, it’ll try to roll out the new version again and fail. This will require some major changes to how we do deployments. We’ll also switch from DeploymentConfig to Deployment in the process, which will get rid of all the deprecation warnings you see when using oc.
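For the disk-space safeguard mentioned above, one possible shape of such a pre-deployment check is sketched below. The path, the connection details and the factor of two are all assumptions; it merely compares the free space on the backup volume with the size of the database to be dumped:
# illustration only - path, connection and threshold are assumptions
db=nice_toccotest
free_kib=$(df --output=avail /var/lib/postgresql-backup | tail -n 1 | tr -d ' ')
db_kib=$(psql -h db1.stage.tocco.cust.vshn.net -U postgres -d postgres -At \
    -c "SELECT pg_database_size('${db}') / 1024")
if [ "$free_kib" -lt $((db_kib * 2)) ]; then
    echo "Not enough free space for a dump and a potential recovery, refusing to deploy."
    exit 1
fi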
DB Backup Bonus Features
The dump functionality used by CD has been integrated into tco and can be used independently of CD.
Create a dump / backup:
$ tco db-backup dump db2.prod/nice_stn
tco db-backup restore db1.prod db2.prod.tocco.cust.vshn.net:file:manual/nice_stn-2025-03-11T14:50:31+01:00.dump
Then use the printed command to restore it again:
$ tco db-backup restore db1.stage.tocco.cust.vshn.net db2.stage.tocco.cust.vshn.net:file:manual/nice_toccotest-2025-03-11T12:24:20+01:00.dump
[db1.stage.tocco.cust.vshn.net INFO tocco_backup_transfer::db] Dump restored as "nice_toccotest_restored_6e5fcc61".
tco db-connect db1.stage.tocco.cust.vshn.net/nice_toccotest_restored_6e5fcc61
This is basically what CD does except that it doesn’t replace the target database.
You can also list backups and, as a bonus, the list will include the last daily backup [2]:
$ tco db-backup list db1.stage
...
tco db-backup restore db1.stage db2.stage.tocco.cust.vshn.net:file:nice_tgvtestold_history.dump
db_name: nice_tgvtestold_history
time: 2025-03-10 23:13:55 ( 0 d 21 h 19 min ago)
tco db-backup restore db1.stage db2.stage.tocco.cust.vshn.net:file:nice_toccotest.dump
db_name: nice_toccotest
time: 2025-03-10 23:13:59 ( 0 d 21 h 19 min ago)
tco db-backup restore db1.stage db2.stage.tocco.cust.vshn.net:file:manual/nice_toccotest-2025-03-11T12:24:20+01:00.dump
db_name: nice_toccotest
time: 2025-03-11 12:24:20 ( 0 d 8 h 8 min ago)
tco db-backup restore db1.stage db2.stage.tocco.cust.vshn.net:file:nice_toccotest_history.dump
db_name: nice_toccotest_history
time: 2025-03-10 23:14:30 ( 0 d 21 h 18 min ago)
...
Simply execute one of the restore commands printed as part of the list of available backups.
Some limitations apply:
- All db-backup commands currently require that you specify a DB server and/or DB name. You cannot currently specify installation names, as is possible in other places. I’ll likely add this functionality after the current design has proven itself.
- It’s not possible to restore across clusters. You may have noticed that many commands contain two host names: one where the restore is done (e.g. db1.stage (master)) and another one where the backup is located (e.g. db2.stage (slave)). To keep downtime (= the time the maintenance page is active) to a minimum in case of a failure, direct server-to-server communication is used to access backups. This has only been set up between a master and its corresponding slave, so copying between clusters (e.g. to copy from prod to test) is not supported.
- It’s not currently possible to restore on localhost. I’ll consider adding this feature if enough people request it.
Footnotes
[1] Compared to the old implementation: a) backups now always happen on the slave, where there is enough disk space; b) no more failures when replication is too far behind; c) backups are guaranteed to include all changes up to the point when a backup is started; d) requests are passed from master to slave automatically (the master delegates work as needed).
[2] DB backups, created as part of our daily backups, are created on disk first before being archived. These on-disk backups can be accessed and restored via tco db-backup {list,restore} too. These are the same backups that can be found on the slaves at /var/lib/postgresql-backup/. However, use tco db-backup list against the master, which will still include backups from the slave.