Monitoring is such a broad topic that it can be intimidating to start. Rather than just talk about what to monitor, we're going to cover common anti-patterns of monitoring.
Monitoring A Falsehood
Inferring the health of an application is always complicated - and when we design for graceful failure, it can be difficult to determine the real health of our application.
Pattern: Real App Status Page
Rather than relying on something simple like a string match on the homepage, you can use Health Endpoint Monitoring. This pattern runs an individual health check for each component within an application. By reporting on the 'perceived' health of databases and other dependent services like authentication, you gain far more insight into real application health. This is especially useful when applications are spread across Cloud and on-premises, as these couplings can become brittle - for instance during an undersea cable cut.
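As a rough illustration of the pattern, the sketch below aggregates per-component checks into a single status code and JSON body that a health endpoint could return. The function `health_report`, the component names, and the response shape are all assumptions made for this example, not part of any particular framework.

```python
import json
from typing import Callable, Dict, Tuple


def health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every component check and summarise the results.

    `checks` maps a component name to a function returning True when that
    component looks healthy. Names and checks are hypothetical placeholders.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that raises is reported as an error, not allowed to crash the endpoint.
            results[name] = f"error: {exc}"

    healthy = all(state == "ok" for state in results.values())
    status_code = 200 if healthy else 503
    body = json.dumps(
        {"status": "healthy" if healthy else "degraded", "components": results}
    )
    return status_code, body


if __name__ == "__main__":
    # Stand-in checks; real ones would query the database, auth service, etc.
    code, body = health_report({"database": lambda: True, "auth": lambda: False})
    print(code, body)
```

Returning a non-200 status when any dependency is failing lets an external monitor alert on the response code alone, while the per-component detail in the body tells an operator where to look first.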
Ephemeral Logging
It's common to log to an ephemeral local disk for many reasons - it's fast, it's cheap, and it's easy. However, if a node is lost, its logs are lost with it, and discovering the root cause can quickly become impossible.
Pattern: Externalise Logs
By externalising important logging messages to another server, you retain the evidence needed to find the root cause of issues even after a node is gone, and can use it to improve future platform health.
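One way to sketch this in Python is with the standard library's `logging.handlers.SysLogHandler`, which forwards records to a syslog server over UDP by default. The function name `build_logger`, the port, and the choice to forward only WARNING and above are assumptions for illustration; a real platform might ship logs through an agent or a managed logging service instead.

```python
import logging
import logging.handlers


def build_logger(name: str, remote_host: str, remote_port: int = 514) -> logging.Logger:
    """Return a logger that writes locally and forwards important messages remotely.

    `remote_host` and `remote_port` identify a syslog server (hypothetical here).
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    # Local output: fast and cheap, but lost if the node disappears.
    local = logging.StreamHandler()
    local.setFormatter(formatter)
    logger.addHandler(local)

    # Remote output: only WARNING and above leave the node, so they survive its loss.
    remote = logging.handlers.SysLogHandler(address=(remote_host, remote_port))
    remote.setLevel(logging.WARNING)
    remote.setFormatter(formatter)
    logger.addHandler(remote)

    return logger
```

A call such as `build_logger("checkout", "logs.example.internal")` (a made-up hostname) would then send `logger.warning(...)` and `logger.error(...)` messages to both destinations, while `logger.info(...)` stays local.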
Alert Fatigue
In almost every platform there are disabled alerts and flapping services. While it can be tempting to leave these as-is, as 'known faults', this inherently trains operators to ignore what can be critical flags. Common examples include warnings about SSL certs approaching expiry; expired certs have impacted everything from websites to card payment machines.
Pattern: Snooze vs Disable
Instead of disabling the alert, use the time-based snooze that most monitoring software supports, which forces an operator to review the alert again and take restorative action.
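The difference between snoozing and disabling can be sketched with a small data structure: a snooze stores an expiry time, so the alert starts notifying again on its own. The `Alert` class and its method names below are hypothetical and do not mirror any specific monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    name: str
    # None means the alert is active; otherwise it is silent until this time.
    snoozed_until: Optional[datetime] = None

    def snooze(self, now: datetime, duration: timedelta) -> None:
        """Silence the alert for a fixed period rather than indefinitely."""
        self.snoozed_until = now + duration

    def should_notify(self, now: datetime) -> bool:
        """Notify unless a snooze is still in effect."""
        return self.snoozed_until is None or now >= self.snoozed_until


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 9, 0)
    alert = Alert("tls-cert-expiry")
    alert.snooze(now, timedelta(days=7))
    print(alert.should_notify(now + timedelta(days=1)))  # still snoozed
    print(alert.should_notify(now + timedelta(days=7)))  # snooze has expired
```

Because the snooze always carries an end time, there is no state in which the alert stays silent forever, which is the property that brings it back in front of an operator.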
These are just three of the many anti-patterns that can catch you out. Through proper practices, you can make your journey to Cloud Operationally Excellent.