Any ideas on how to run maintenance on a site that is always under use?

cheese5505 asked:

I help out with a large gaming site in Australia. We run competitions from 7am local time to 1am the next day, every day of the week. We haven’t skipped a day since the site was released. Naturally, this makes maintenance extremely hard to run, and we find that our staging server is up to 50 commits ahead of our production branch. Usually, the main dev has to wake up extremely early to merge branches and make sure everything is working properly.

We have been trying to make our staging site as similar as we can to the production site, but we can only make it so similar.

Our site is built on Laravel, with a Node.js server for realtime features. We deploy with Laravel Forge.

Does anyone have any suggestions on how we could push updates more frequently? We are open to anything.

Thanks

My answer:


There are a lot of things you could be doing to improve your deployment process. A few of them are:

  • Ensure your code is well tested.

    Ideally you should have 100% unit test coverage, as well as integration testing for every conceivable scenario.

    If you haven’t got this, you should probably drop everything and get this taken care of.

    Having a complete test suite will allow you to…

  • Run continuous integration.

    Whenever someone commits a change, CI can then automatically run the test suite on it. If the test suite passes, it can then deploy immediately (or schedule a deployment). For changes that don’t require any significant change to your databases, this alone will save you a lot of time and headache.

    In case of a problem, CI can also give you a one-click rollback.

    CI is much less useful if your test suite isn’t complete and correct, as the entire premise rests on being able to validate your code in an automated way.

  • Make atomic updates.

    Ideally you should not just be copying new files over the old ones on the production server. Instead, use a tool such as Capistrano, which copies each release into its own directory and then uses a symbolic link to point to the active deployment. Rolling back is nearly instantaneous, as it simply involves changing the symlink to point to the previous deployment. (Though this doesn’t necessarily cover your database migrations.)

    Also look into whether containers, using a tool such as Docker, can help you.

  • Make smaller, more frequent changes.

    Whether you have tests, CI, or nothing, this alone can help you significantly. Every change should have its own Git branch, and a deployment should contain as few changes as possible. Because the changes are smaller, there is less that can go wrong during a deployment.

    On that note, make changes more isolated whenever possible. If you’ve made a change to the Omaha game, and it doesn’t affect Texas Hold’em, 5-card stud or anything else, then that is the only game that needs to be suspended for maintenance.

  • Analyze anything long-running.

    You mentioned some parts of your deployments take a long time. This is probably database schema changes. It’s well worth having a DBA look at your database, along with each schema change, to see what could perform better.

    Have a subject matter expert look at any other part of a deployment which takes up large blocks of time.

  • Work odd hours.

    You may already be doing this, but it bears mentioning. Developers (and sysadmins!) should not be expected to work “9 to 5” anymore, especially for a 24×7 operation. If someone is expected to spend the overnight hours babysitting a deployment, fixing any problems, and then keep a daytime schedule, your expectations are unrealistic, and you are setting that person up for burnout.
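
To make the symlink switch from "Make atomic updates" more concrete, here is a minimal Python sketch of the idea. It is not how Capistrano itself is implemented, and the function name activate_release, the releases directory layout and the current link name are all assumptions made up for this illustration. It also assumes a POSIX system, where renaming one symlink over another happens in a single step.

```python
import os
from pathlib import Path


def activate_release(releases_dir: Path, release_name: str, current_link: Path) -> Path:
    """Point current_link at releases_dir/release_name in one atomic step.

    Hypothetical helper for illustration only; the names and layout are
    assumptions, not part of any real deployment tool.
    """
    target = releases_dir / release_name
    if not target.is_dir():
        raise FileNotFoundError(f"release not found: {target}")

    # Create the new symlink under a temporary name next to the live one.
    tmp_link = current_link.with_name(current_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(target, tmp_link)

    # Rename the temporary link over the live one. On POSIX this replaces
    # the old symlink itself, so readers see either the old release or the
    # new one, never a half-copied mix.
    os.replace(tmp_link, current_link)
    return target
```

With this layout, the web server's document root would point at the current link. A rollback is just another call with the previous release name, for example activate_release(Path("releases"), "r1", Path("current")). As noted above, this does nothing for database migrations, which still need their own plan.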


View the full question and answer on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.