A Monitoring, Alerting, and Notification Blueprint for SaaS Applications
Monitoring, Alerting, and Notification play a key role in the work life of any engineer who operates a SaaS application, a service that makes up a SaaS application, or the SaaS infrastructure itself. In this article I’d like to share a blueprint informed by the work that we have done for the OpenGov Cloud Platform, in the hope that you find it useful as you develop or enhance the monitoring, alerting, and notification suite for your own SaaS platform. If you are not already aware of it, do take a look at Google’s SRE Book, which was influential in our thinking on this.
At the risk of stating the obvious, the decisions that you make in selecting the tools or services for building the suite are your own. It is not the purpose of this article to advocate the use of a tool or a service or a vendor over another. Also, you need not build your strategy along each and every vector that I share below. Start according to the needs of your business and slowly but surely keep adding to it as those needs mature.
Monitoring is the process of collecting, processing, and displaying (a) real-time quantitative data about a system, such as request rates and query processing times, (b) event notifications, such as for failover in a redundant pair or an application or system restart, or (c) security events, like those received from intrusion detection systems or the discovery of a vulnerable component. Monitoring enables a system to tell us when it’s broken or, in a perfect world, tell us what’s about to break.
Alerting is the process of informing human developers/operators through systems such as email, chat, ticketing, or paging. If the failure isn’t self-correcting, those humans should ideally correct the problem before customers/users are impacted. It is worth pointing out that alerting does not necessarily imply “waking humans up”. Certain alerts, such as those for self-correcting events or “warnings”, may be used simply to raise awareness, so that the underlying conditions do not go unnoticed and become pathological for the system. As the system’s resilience improves through the use of automated healing and resolution (a common example is automated horizontal scaling), the human impact of self-correcting events becomes less consequential. You may want to consider suppressing alerts that are not immediately actionable in order to keep the signal-to-noise ratio high, enhance developer experience by reducing distractions, and minimize anti-pattern behavior (the tendency to ignore what may be perceived as unactionable noise).
Notification is targeted at your human stakeholders: your customers/end-users and your business partners in customer support, customer success, the executive team, and even sales engineering (their demos might be impacted!). It informs them about critical issues that might be actively affecting them or, in some cases, might have affected them. The Notification framework that I present below can also be used to notify your stakeholders about ongoing or scheduled restorative and maintenance events.
The infographic below wraps up the concepts discussed thus far.
A thought that might cross your mind from the infographic above is whether there really needs to be a human element between Alerting and Notification. Generally speaking, stakeholder notification can be automated, but it is important to keep in mind that our stakeholders are, after all, humans and would value actionable notifications that are meaningful (e.g., impact assessment, workaround, mitigation plans) more than cursory automated notifications that are simply informational. Actionable notifications are not that easy to automate. Further, there is a case to be made that in difficult times your stakeholders would want to see that humans, not automated systems, are involved and in control. That being said, it might be worth automating cursory informational notifications as a placeholder, which also makes for a better developer/operator experience, but do a fast follow with the human touch.
At OpenGov we think of Monitoring using three methods: Black Box, Gray Box, and White Box.
Black Box Monitoring is built using system primitives that are fairly basic and do not require exposure to system internals in any way. A typical example of such a primitive is a health-check probe, e.g., those used by network load balancers or Kubernetes’ liveness and readiness probes. Another example is a service health-check API.
- Technically this may be used at any level (application, services that constitute the application, or even processes that constitute the service), but first, apply this at the application level.
- When implemented properly (e.g., covering the full path of core business logic) at the application level, black box API testing can also serve as a good indicator of your application’s uptime (or, what I like to call AoS, Availability of Service), which you will find useful for SLA reporting as well. Just remember to temporarily disable the monitors during planned maintenance windows that could impact the uptime of your service.
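To make the AoS idea concrete, here is a minimal sketch of how uptime could be derived from black box probe results. Note that it takes a slightly different route from disabling monitors: it assumes the probes keep running and simply excludes samples that fall inside a planned maintenance window. The `ProbeResult` and `MaintenanceWindow` record shapes are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProbeResult:
    """One black box health-check observation (hypothetical record shape)."""
    at: datetime
    healthy: bool


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned maintenance interval, start inclusive and end exclusive."""
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def availability(probes, windows):
    """Return the fraction of healthy probes, ignoring planned maintenance.

    Returns None when no probe falls outside a maintenance window.
    """
    counted = [p for p in probes if not any(w.covers(p.at) for w in windows)]
    if not counted:
        return None
    return sum(1 for p in counted if p.healthy) / len(counted)


# Ten one-minute probes with failures at minutes 3 and 7;
# minute 7 falls inside a planned maintenance window and is not counted.
t0 = datetime(2024, 1, 1, 0, 0)
probes = [ProbeResult(t0 + timedelta(minutes=i), healthy=i not in (3, 7)) for i in range(10)]
windows = [MaintenanceWindow(t0 + timedelta(minutes=7), t0 + timedelta(minutes=8))]
aos = availability(probes, windows)  # 8 healthy out of 9 counted probes
```

Whether you exclude maintenance samples after the fact or pause the monitors up front is a judgment call; the former keeps the raw probe history intact.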
Gray Box Monitoring goes a layer deeper by mimicking or replaying actions of a typical user or another system. A typical example of such a primitive is automated browser tests. Almost certainly you will apply this at the application level.
- Gray box tests can be made fairly rich and can serve as an excellent (albeit rather heavyweight) alternative to full-path black box API tests, especially because they can be used to measure QoS (Quality of Service) just as well as AoS.
- Such tests can cover many user operations—both read-only and read-write, thereby covering a broad spectrum of business logic for complex applications.
- Read-only operations may be performed on actual customer instances or entities, which is helpful if the test needs to cover something that is inherently customer-specific. (OpenGov has had to use it for our “Public Transparency” application in one case where the dashboard was customized with a specific image that is associated with the tenure of a governing official.)
- A QoS scenario for read-only operations is where the customer data is “complex” such that it may lead to an unacceptable amount of time in the loading of certain elements on a page.
- You would almost never want to perform a read-write operation on actual customer instances—even with the optimistic expectation of a successful attempt to roll back the change. Use a demo instance or entity for such operations. (In a multi-tenant SaaS application the demo instance would be served identically to a customer instance, so you would get the signal that you are looking for.) There is an inherent risk in any read-write operation, but think “rinse and repeat”—it is a good practice to roll back your change with a complementary operation so your instance/entity and its data are ready for the next iteration of this test or another unrelated test.
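The “rinse and repeat” pattern for read-write checks can be sketched roughly as follows. This is only an illustration: `DemoInstanceClient` is an in-memory stand-in for whatever API or browser driver your demo instance exposes, and the report name and the two-second budget are assumed values.

```python
import time


class DemoInstanceClient:
    """In-memory stand-in for a demo tenant's API (hypothetical)."""

    def __init__(self):
        self.reports = {}

    def create_report(self, name):
        self.reports[name] = {"name": name}

    def get_report(self, name):
        return self.reports.get(name)

    def delete_report(self, name):
        # Idempotent, so cleanup is safe even if creation never happened.
        self.reports.pop(name, None)


def run_read_write_check(client, max_seconds=2.0):
    """Create a report, read it back, and always undo the change afterwards."""
    name = "synthetic-monitor-report"
    started = time.monotonic()
    try:
        client.create_report(name)
        available = client.get_report(name) is not None
    except Exception:
        available = False
    finally:
        # Complementary operation: leave the demo instance ready for the next run.
        client.delete_report(name)
    elapsed = time.monotonic() - started
    return {
        "available": available,                   # AoS signal
        "within_budget": elapsed <= max_seconds,  # QoS signal
        "seconds": elapsed,
    }


client = DemoInstanceClient()
result = run_read_write_check(client)
```

Putting the cleanup in `finally` means a failed check still leaves the demo instance in a known state for the next iteration.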
White Box Monitoring is based on metrics and state variables that are exposed by the internals of the system. Technically this may be used at any level, but you will find this best applied at the level of underlying processes.
- The key vectors for white box monitoring that you would want to plan out are: logging, operational statistics/metrics (e.g., request rate, CPU/RAM utilization), performance statistics/metrics (think APM), automatic error/exception reporting, and security events (discovered vulnerabilities or low level events from intrusion detection systems).
- This is arguably the broadest and the most impactful of all methods, and, as with most things white box, will take the most time to get right. You want to plan this one out well as it will be a big investment of your resources and time.
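As a small illustration of the operational-metrics vector, the sketch below wraps a function so that it records request counts, error counts, and latencies. It uses plain in-process dictionaries; the metric names, the `instrumented` decorator, and the `load_dashboard` function are assumptions made for this example, and a real deployment would export such values to a metrics or APM backend instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Process-local metric stores (illustrative only).
REQUEST_COUNT = defaultdict(int)
ERROR_COUNT = defaultdict(int)
LATENCY_SECONDS = defaultdict(list)


def instrumented(operation):
    """Decorator that records count, errors, and latency for one operation."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            REQUEST_COUNT[operation] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                ERROR_COUNT[operation] += 1
                raise
            finally:
                # Latency is recorded for successes and failures alike.
                LATENCY_SECONDS[operation].append(time.perf_counter() - started)
        return wrapper
    return decorator


@instrumented("load_dashboard")
def load_dashboard(dashboard_id):
    """Hypothetical request handler used only to exercise the decorator."""
    if dashboard_id < 0:
        raise ValueError("unknown dashboard")
    return {"id": dashboard_id}


load_dashboard(1)
try:
    load_dashboard(-1)
except ValueError:
    pass
```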
Next, let’s talk a bit about some of the tools and services that can be used to realize the Monitoring methods discussed above. Again, this isn’t an endorsement of any of these tools or services—assess their capabilities and ROI for your own needs. And in case you are out in the market to buy the tech instead of building it, it is worth keeping in mind that there is quite a bit of consolidation going on in commercial vendors in this space, with vendors that started with metrics management branching out to log and performance management and everything in between.
- In the rather broad category of White Box Monitoring,
The infographic below incorporates these recent concepts.
You will find that a typical Alerting framework is not as varied and expansive as a Monitoring framework. It is equally important nonetheless, especially from a developer experience, system reliability, and alert federation perspective. The main concern vectors for you would be to handle both system-generated alerts and human-generated alerts, and in the former case, handle self-correcting events somewhat differently from events that require human intervention to correct.
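One possible way to treat self-correcting events differently is to give the system a grace period to recover before a human is paged. The sketch below is an assumption about how that could look, not a description of our tooling: the `classify` function and the three-sample grace period are illustrative.

```python
def classify(unhealthy_samples, grace_samples=3):
    """Classify consecutive health samples (True means unhealthy).

    Returns "ok" when nothing failed, "awareness" when every unhealthy streak
    ended before the grace period (the event corrected itself), and "page"
    when a streak lasted at least `grace_samples` samples.
    """
    longest = current = 0
    for unhealthy in unhealthy_samples:
        current = current + 1 if unhealthy else 0
        longest = max(longest, current)
    if longest == 0:
        return "ok"
    return "page" if longest >= grace_samples else "awareness"


blip = classify([False, True, True, False])    # recovered within the grace period
outage = classify([True, True, True, True])    # persisted, needs a human
```

The grace period trades detection speed for fewer pages, so it is worth tuning per service.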
At OpenGov we primarily rely on two means for Alerting: a Paging platform, for alerts that require human intervention to correct, and a Real-time Awareness platform, which we use for all alerts and in some rather unique ways that lend themselves well to our organization and processes. We use Atlassian Opsgenie for Paging. Other alternatives include PagerDuty and VictorOps. We use Slack for Real-time Awareness because of its rich API-level integration with the rest of our SRE suite.
We use a “federated model” of Alerting at OpenGov, which has helped us scale our teams while maintaining a high degree of responsiveness to our customers. This model is well suited to our “autonomous team” structure in which multiple teams run what they build using the DevOps model. There is no one team (e.g., an “ops team”) that is the first line of defense on alerts. Our alerts are demultiplexed at the source (e.g., at the Monitoring system itself) such that each alert reaches the team that is most capable of handling it. Typically, the team that built a product (an application or a service) will be alerted first. Each such team has a weekly on-call rotation among its engineers. To support the federation we have created an OpsGenie Team and an independent Slack alert channel per team. We have well-known and easy-to-use mechanisms through which the first-responder team can engage other supporting teams, such as the “infrastructure team” or the “platform team”, depending on what they find during initial triage.
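The demultiplexing idea can be sketched as a simple routing function: every alert goes to the owning team’s alert channel for awareness, and only alerts that need human intervention also page that team’s on-call engineer. The service names, team names, channel naming scheme, and fallback team below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    needs_human: bool


# Hypothetical ownership map: each service belongs to exactly one team.
SERVICE_OWNER = {"reporting-api": "reporting", "ingest-worker": "platform"}
FALLBACK_TEAM = "infrastructure"  # assumed default when ownership is unknown


def route(alert):
    """Return (kind, target) destinations for an alert, owning team first."""
    team = SERVICE_OWNER.get(alert.service, FALLBACK_TEAM)
    destinations = [("chat", f"#{team}-alerts")]              # awareness, all alerts
    if alert.needs_human:
        destinations.append(("page", f"{team}-on-call"))      # paging, actionable only
    return destinations


paged = route(Alert("reporting-api", "error rate above threshold", needs_human=True))
aware = route(Alert("unknown-service", "node restarted", needs_human=False))
```

Keeping the ownership map next to the monitoring configuration is one way to make sure a new service cannot ship without an owner for its alerts.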
We use Paging in three situations:
- System-generated alerts that require human intervention to correct—e.g., from our Monitoring systems.
- Human-generated alerts—e.g., when one team needs urgent assistance from another team.
- Customer-assist alerts, e.g., when a customer reports a “blocker issue”, our customer support ticketing system, Zendesk, triggers a page through its integration with OpsGenie to the engineering team that is responsible for that application. A similar use case for us is “internal customer assist”. Our operations team uses Jira Service Desk to handle tickets from engineering and other departments. We have that integrated with OpsGenie to prevent breaches of our “internal SLAs”.
The infographic below incorporates these recent concepts.
The Role of Slack
Slack can be used in some fairly impactful ways to raise awareness and collaborate on incidents. For us, it serves as a hub of our monitoring, alerting, and notification workflows. While I would not go as far as calling it a “must-have”, it is a key component thereof. I wanted to leave this thought with those of you who might be considering Slack as a peer-to-peer communication tool. With the integrations that a lot of tools have built around Slack’s flexible APIs, it can actually do a lot more for you. I wish I knew or had more experience with another similar service, but unfortunately nothing looks like a viable alternative as of right now—Microsoft Teams is more geared towards communications and integrations with their suite; Atlassian Hipchat is migrating customers to Slack. Here are some examples of the ways in which we use Slack:
- We have integrated Slack with OpsGenie such that someone can simply type “/genie alert <message> for <other-team>” in any Slack channel to trigger the escalation for the “other team”.
- For awareness and non-urgent collaboration we use custom Slack handles, such as @engops-on-call, which are used to engage the on-call person for a team, in this case EngOps.
- Our “external” customer support ticketing system has specific alert channels to raise awareness on blocker and critical defects reported by customers.
- Every team has its own “support channel” (that receives a ticketing system feed) and “alerts channel” (that receives a monitoring system feed).
- Notifications from our “status pages” (see next section) are mirrored into specific Slack channels.
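If a tool in your suite lacks a ready-made Slack integration, one common fallback is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is an assumption about how such a feed could be wired up, not a description of our integrations; the message format is invented and the webhook URL is a placeholder.

```python
import json
import urllib.request


def build_slack_message(team, severity, summary):
    """Build the JSON body for a Slack incoming webhook (plain "text" field)."""
    return {"text": f"[{severity.upper()}] {team}: {summary}"}


def post_to_slack(webhook_url, message, timeout=5):
    """POST the message to an incoming webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


message = build_slack_message("engops", "blocker", "Dashboard API returning errors")
# Not executed here; supply your own webhook URL:
# post_to_slack("https://hooks.slack.com/services/...", message)
```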
You can use a Notification framework to notify your human stakeholders (your customers/users and your business partners) about critical issues that might be actively affecting them or, in some cases, might have affected them, as well as about ongoing or scheduled restorative and maintenance events. This is best and most comprehensively done using “status page” services. We use Atlassian Statuspage; alternatives include Status.io and StatusCast. You can also use the “announcements” functionality in your customer support ticketing system, in-app notifications, or even platforms such as WalkMe.
In and of itself, this framework is fairly simple. You create an incident or scheduled maintenance, provide specifics, such as the nature of impact, impacted applications, timing of impact, etc. and let the system take care of informing “subscribers” using email, SMS, or RSS. That being said, to keep the signal-to-noise ratio close to optimal at an organizational level, you may want to adopt an approach of provisioning multiple distinct status pages, depending on the complexity of your application and stakeholder ecosystem, the maturity of your processes, and your business needs:
- Customer/User facing “external” status page, which is publicly visible. This is best managed by your customer support team, which typically specializes in tailoring messaging to that audience—e.g., what to share, how to position it, nomenclature, etc.
- Business partner facing “internal” status page, which is visible only within your company. This may be used by the engineering/operations team to share updates with your business partners about the applications and systems that are used by your customers—think production environments. The internal page might allow additional latitude in content/terminology, which may not be directly meaningful to your end users.
- Department facing “private” status page, which is visible only within your department (e.g., R&D). This may be used within the engineering team to share updates about the applications and systems that do not directly impact your customers—think pre-production environments (e.g., testing or staging or CI/CD).
|Status Page|Environment|Producers of Information|Consumers of Information|
|---|---|---|---|
|External|Production|Customer support|Customers, End-users, Subscribers|
|Internal|Production|Engineering/Operations|Business partners, Internal stakeholders|
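The split between status pages can also be captured as a small lookup, which is handy if incident tooling needs to decide where an update belongs. This is a sketch under the assumptions of this section; the dictionary keys, the wording of the `audience` values, and the `pages_for` helper are illustrative.

```python
# One entry per status page described above.
STATUS_PAGES = {
    "external": {
        "environment": "production",
        "producer": "customer support",
        "audience": "customers and end users",
    },
    "internal": {
        "environment": "production",
        "producer": "engineering/operations",
        "audience": "business partners",
    },
    "private": {
        "environment": "pre-production",
        "producer": "engineering",
        "audience": "own department",
    },
}


def pages_for(environment):
    """Return the status pages that cover the given environment."""
    return [name for name, page in STATUS_PAGES.items() if page["environment"] == environment]


production_pages = pages_for("production")
preprod_pages = pages_for("pre-production")
```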
The infographic below incorporates these recent concepts.
I am purposefully not discussing the specifics of the customer support ticketing framework in this article, but generally speaking, you want to think about your external (i.e., customer/end-user facing) system and your internal system. You could use the same service (e.g., Zendesk) for both, but, like us, you may find reasons (e.g., cost, customizations, and integrations) for not doing so.
A Case Study: Support & Escalation Communication
To conclude this article, I want to share a case study that showcases how the above methods can be brought together to effectively overcome typical organizational challenges around communication in “difficult situations”. This case study highlights the flow of communication in The OpenGov Cloud™ Platform support and escalation workflow. We intended to eliminate or minimize the “top down communication” anti-pattern, where an escalation from a customer might catch our executive team unaware and, in turn, lead to us in R&D getting issue notifications from our execs as opposed to our support systems and teams.
The swimlanes in the infographic below show the various stakeholders in the communication workflow—Customers, OpenGov Customer Support, OpenGov R&D, OpenGov Field (Customer Success, Professional Services, Sales), and the OpenGov Executive team. The description that follows provides a narrative for key aspects and highlights the responsibilities of the key stakeholders (match the numbers in the infographic to the numbered item in the narrative that follows).
The workflow is purposefully designed to err on the side of over-communication in case of high-severity incidents impacting the OpenGov customer base. It works well for an organization of our size, but you may want to tune it according to your needs. Another thing worth noting at the outset is that as of now OpenGov does not have an “external status page”. Instead we use other notification approaches (such as Zendesk announcements, in-app notifications, and WalkMe) to communicate with our end users.
1. An issue may be identified in one of the following ways:
   - By a customer: typically a software issue.
   - By the OpenGov R&D team: could be a software issue identified by the OpenGov engineering (Dev/QA) team or an infrastructure issue identified by OpenGov R&D monitoring systems.
   - By the OpenGov Field team: typically a software issue.
2. Issues reported by customers or the OpenGov Field team (external issues) lead to the creation of a Zendesk ticket, which is attended to by the OpenGov Customer Support team.
   - Funneling issues through the OpenGov Customer Support team is really important for the success of the overall support & escalation process. They are in tune with ongoing incidents and can prevent superfluous escalations. They write knowledge-base (KB) articles that can help you navigate around product issues.
3. Issues reported by the OpenGov R&D team (internal issues) and external issues (from step 2) lead to the creation of a Jira ticket, which is attended to by the OpenGov R&D team, either application engineering or sustaining (L3/Tier-3) support engineering.
   - Generally speaking, infrastructure issues do not result in the creation of a Jira ticket; they directly lead to a Statuspage notification.
4. Jira issues that are classified with severity Blocker are granted a very high level of attention and are put on a specialized continuous communication track (as described below) until resolved.
   - The infographic does not show this for brevity, but any Blocker issue will be immediately surfaced on a dedicated Slack channel called #alert-blocker-defect.
5. OpenGov R&D team will create an incident on our internal status page and will keep providing updates to it continuously.
   - While this might sound like an out-of-place comment, it is worth noting that in addition to incidents, the internal status page is used by OpenGov R&D team to inform stakeholders about scheduled maintenance windows, such as those related to feature releases, hotfixes, and infrastructure updates, which may impact typical field activities like demos.
   - All internal status page updates are mirrored to a dedicated Slack channel called #og-internal-status.
   - OpenGov Customer Support team shall receive the notifications from the internal status page and Slack. (A one-time subscription will be required.)
   - OpenGov Field team shall also receive the notifications from the internal status page. (A one-time subscription will be required.)
6. OpenGov Customer Support team will create a notification on OpenGov Resource Center (powered by Zendesk).
   - Any and all customers that “follow” the Zendesk notifications shall get such announcements.
   - OpenGov Field team shall also receive the notifications from Zendesk. (A one-time subscription will be required.)
7. OpenGov Customer Support team will create an OpenGov internal email notification (must not be system generated, but rather from a human) notifying the following key internal stakeholders about the incident:
   - OpenGov R&D Team
   - OpenGov Field Team
   - OpenGov Executive Team
8. OpenGov Field team, depending on the severity of the issue or the situation of their customer (e.g., a “red” account), may choose to send an external email advisory to one or more customers.
9. Same as (8), but at the discretion of the OpenGov Executive team (e.g., for customers under the umbrella of the executive sponsor program).
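The Blocker communication track described above can be summarized as a short checklist generator. This is only a sketch of my reading of the workflow, in which the status page, Zendesk, and email steps apply to Blocker issues; the `communication_plan` function and the wording of each action are illustrative, and the discretionary customer advisories from the Field and Executive teams are left out because they are judgment calls.

```python
def communication_plan(severity):
    """Return the ordered communication actions for a Jira issue."""
    actions = ["create or update Jira ticket"]
    if severity.lower() != "blocker":
        return actions
    # Blocker issues go on the continuous communication track.
    actions += [
        "post to Slack #alert-blocker-defect",
        "open incident on internal status page (R&D)",
        "mirror status updates to Slack #og-internal-status",
        "publish Zendesk announcement (Customer Support)",
        "send human-written internal email to R&D, Field, and Executive teams",
    ]
    return actions


blocker_plan = communication_plan("Blocker")
minor_plan = communication_plan("Minor")
```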