
How many systems make up a typical software application these days? I suspect it’s more than a few. With the advent of the internet, then web services and now the cloud, few applications are self-contained any more. A recent analysis of our own cloud service made me realise that software engineering has become more like systems engineering, or as the title of this blog states, a hybrid that is software systems engineering. An excellent book I read this summer, Release It, reinforces this view.

What is systems engineering?

Cockpit of Jaguar GR.3A

My first proper job after university was working in a fairly special avionic systems team. They were formed to rapidly upgrade the aging Jaguar attack aircraft’s avionics, for service in the Bosnia conflict. Our leader was inspiring and won an MBE for his work. I’ll never forget meeting him and how he introduced systems engineering as being the discipline that “makes the whole greater than the sum of its parts”.

Modern aircraft are a classic example of this kind of complex system, being made up of component parts such as engines, flight computers, GPS and radar. They also demonstrate the importance of resilience and graceful degradation: if one part fails, the whole does not fail, although the overall value it provides may be reduced.

Fast forward nearly twenty years and as I look at even the most basic software application I’ve written recently, the systems engineering definition and principles echo round my head again.

Applying systems engineering to software

To illustrate this, I’ll use one of our simplest internal web applications: our release note generator. I blogged about this app previously, but here’s a quick recap of what it does. Given a release identifier (a Subversion revision, in our case), the application summarises how this release differs from what’s currently in production. The summary is a list of each related work-item (story or defect), including the work-item’s id, title, state and type.

Release Note data and dependencies

Even though it’s a very simple web app, as the picture shows, it still depends on three other software systems:

  • Production version provider

  • Source control system

  • Work-item repository (an external, cloud service)

This application is thus a software system, with component parts that can, and probably will at some point, fail. Our challenge as software systems engineers is to minimise the impact of any part failing.

Failure modes and how to handle them (gracefully)

The three systems that the application depends on are similar enough that they have the same general failure modes:

  • Timeout. The system responds, but exceeds the allowed time limit

  • Network or server error - aka “computer says no”. The system either doesn’t respond at all or responds with an error, for example, HTTP Status 500 - Server Error

  • Unexpected response. The system responds but in a manner that the application hasn’t catered for. For example, it expected a numerical response, such as 1.23, but instead received some text, such as 1.23A

(Note that these failure modes are very general - Release It details a much greater number of specific failure modes, but they are beyond the scope of this post.)
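To make the three general failure modes concrete, here is a minimal Python sketch of how an application might classify whatever goes wrong when it calls a dependency. The names `FailureMode`, `classify_failure` and `parse_numeric_response` are hypothetical, and the mapping from exception types to failure modes is an assumption about a client built on Python's standard library (Python 3.10 or later), not a description of our actual implementation.

```python
import enum
import socket
import urllib.error


class FailureMode(enum.Enum):
    """The three general failure modes listed above."""
    TIMEOUT = "timeout"
    NETWORK_OR_SERVER_ERROR = "network or server error"
    UNEXPECTED_RESPONSE = "unexpected response"


def parse_numeric_response(body: str) -> float:
    """Parse a response that should be numeric, such as '1.23'.

    Raises ValueError for text the application has not catered for,
    such as '1.23A'.
    """
    return float(body.strip())


def classify_failure(exc: Exception) -> FailureMode:
    """Map an exception raised while calling a dependency to a failure mode."""
    timeout_types = (TimeoutError, socket.timeout)
    # Timeouts must be checked first, because they are also OSErrors.
    if isinstance(exc, timeout_types):
        return FailureMode.TIMEOUT
    # urllib can wrap a timeout inside a URLError.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, timeout_types):
        return FailureMode.TIMEOUT
    # Connection failures and HTTP errors such as 500 are all OSErrors.
    if isinstance(exc, OSError):
        return FailureMode.NETWORK_OR_SERVER_ERROR
    # Parsing failures, such as float('1.23A'), surface as ValueError.
    if isinstance(exc, ValueError):
        return FailureMode.UNEXPECTED_RESPONSE
    # Anything else is a bug in our own code, so let it propagate.
    raise exc
```

With a classifier along these lines, the code that calls each dependency can decide on a mitigation per failure mode, rather than leaving the framework's default error page to do the talking.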

If we fail to consider how to handle these general failure modes, then the application will handle them for us, with its default mechanism. At worst, this could be showing the user the ugly Yellow Screen of Death (YSOD); at best it will be a generic error message that leaves the user confused, frustrated and helpless.

When we consider these modes, it becomes apparent that although the application’s dependencies all have the same failure modes, their impact and mitigation depend very much on which system fails, as the table below shows. Even for an airplane, the failure of some systems (such as all its engines) will cause the airplane to eventually drop out of the sky!

Mitigation of failure modes per system

| Failure mode | Production version provider | Source control system | Work-item repository |
|---|---|---|---|
| Timeout | Fallback to a fixed delta, e.g. the given, new revision less 500. | Critical failure - no value can be delivered. Point the user at the specific problem. Retry with longer timeouts. | Work-items can still be grouped by Id (this comes from the source control system), but use placeholder text for other work-item data, to indicate it’s not available. Retry with longer timeouts. |
| Network/Server Error | As above | Critical failure - no value can be delivered. Point the user at the specific problem. | Work-items can still be grouped by Id (this comes from the source control system), but use placeholder text for other work-item data, to indicate it’s not available. |
| Unexpected response | As above | As above | As above |
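The mitigations above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not our production code: the function names, the `WorkItem` shape, the placeholder wording and the timeout schedule of 5, 10 and 20 seconds are all hypothetical. Only the fixed delta of 500 comes from the table. The sketch also assumes Python 3.10 or later, where socket timeouts are raised as `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

FALLBACK_DELTA = 500             # from the table: "the given, new revision less 500"
PLACEHOLDER = "(not available)"  # assumed wording for missing work-item data
# Timeouts and network errors are OSErrors; unexpected payloads raise ValueError.
DEPENDENCY_ERRORS = (OSError, ValueError)


class CriticalFailure(Exception):
    """No value can be delivered; the message names the specific problem."""


@dataclass
class WorkItem:
    id: str
    title: str = PLACEHOLDER
    state: str = PLACEHOLDER
    type: str = PLACEHOLDER


def retry_with_longer_timeouts(
    call: Callable[[float], T],
    timeouts: Sequence[float] = (5.0, 10.0, 20.0),  # assumed schedule, in seconds
) -> T:
    """Call `call(timeout)` with each timeout in turn until one succeeds."""
    for timeout in timeouts[:-1]:
        try:
            return call(timeout)
        except TimeoutError:
            continue
    # Last attempt: let any exception propagate to the caller.
    return call(timeouts[-1])


def production_revision(new_revision: int, fetch: Callable[[], int]) -> int:
    """Production version provider: fall back to a fixed delta on any failure."""
    try:
        return fetch()
    except DEPENDENCY_ERRORS:
        return max(new_revision - FALLBACK_DELTA, 0)


def changed_work_item_ids(fetch_ids: Callable[[float], List[str]]) -> List[str]:
    """Source control system: retry on timeout, otherwise fail critically."""
    try:
        return retry_with_longer_timeouts(fetch_ids)
    except DEPENDENCY_ERRORS as exc:
        raise CriticalFailure(f"Source control system failed: {exc!r}") from exc


def work_item_details(item_id: str, fetch: Callable[[str], WorkItem]) -> WorkItem:
    """Work-item repository: keep the id, use placeholders for everything else."""
    try:
        return fetch(item_id)
    except DEPENDENCY_ERRORS:
        return WorkItem(id=item_id)
```

The design point is that each dependency gets its own policy: the production version provider degrades silently, the work-item repository degrades visibly, and only the source control system is allowed to stop the show.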

The timeout failure mode for the work-item repository warrants further discussion. A release note typically contains the details of ten or more work-items. To retrieve these details, a separate HTTP request is sent per work-item. If each request either times out (after, for example, 20 seconds) or is delayed (by, for example, just under 20 seconds), then a poorly engineered implementation that serialises these requests would keep the user waiting for a considerable time (for example, 10 × 20 = 200 seconds) before showing any response at all! On the other hand, a well-engineered solution that performs these requests in parallel could both optimise performance and handle this failure mode more gracefully.
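As a rough illustration of the parallel approach, here is a minimal Python sketch using a thread pool. The names `fetch_work_items` and `fetch_one`, the dictionary shape and the placeholder wording are hypothetical, and the sketch assumes that `fetch_one` enforces its own per-request timeout (for example, via the timeout argument of whichever HTTP client is in use).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence

PLACEHOLDER = "(not available)"  # assumed wording for missing work-item data


def fetch_work_items(
    item_ids: Sequence[Any],
    fetch_one: Callable[[Any], Dict[str, Any]],
    max_workers: int = 10,
) -> List[Dict[str, Any]]:
    """Fetch every work-item in parallel, degrading per item on failure.

    `fetch_one` is expected to raise an OSError (which includes
    TimeoutError) or a ValueError when the repository misbehaves.
    """
    def safe_fetch(item_id: Any) -> Dict[str, Any]:
        try:
            return fetch_one(item_id)
        except (OSError, ValueError):
            # Keep the id (it came from source control); placeholder the rest.
            return {"id": item_id, "title": PLACEHOLDER}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as item_ids.
        return list(pool.map(safe_fetch, item_ids))


if __name__ == "__main__":
    def slow_fetch(item_id: int) -> Dict[str, Any]:
        time.sleep(0.2)  # stand-in for a slow HTTP request
        return {"id": item_id, "title": f"Work-item {item_id}"}

    start = time.monotonic()
    items = fetch_work_items(range(10), slow_fetch)
    print(f"Fetched {len(items)} work-items in {time.monotonic() - start:.2f}s")
```

With enough workers to cover every request, the worst-case wait is roughly one timeout rather than the sum of all of them, and a single slow work-item costs only its own row of the release note.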

Conclusion: know your application’s systems and plan for their failure

I’m hoping that this post has given you two things to consider and act on:

  • Most software applications today are complex systems. As software systems engineers, we should know and care about the systems our application depends on

  • Complex systems have component parts that can and will fail. The world will be a nicer place if we, as software systems engineers, plan for this, and build applications that gracefully degrade instead of metaphorically falling out of the sky

If you’d like to learn more about this topic, then I highly recommend you read Release It.
