There are many approaches and methodologies for monitoring systems. The three fundamental types are technical monitoring, functional monitoring, and business process monitoring.
In this article, I will focus on high-level recommendations for functional and technical solution monitoring and explain how to set up a clear strategy for the processes around it. I will leave out business process monitoring, as it is specific to each business and is dealt with at the management level.
So why is monitoring important? Here are just a few of the most important considerations:
- System Observability - increases the visibility and accessibility of solution data streams.
- Business Continuity - ensures problems and anomalies are spotted early.
- Team Awareness - ensures the team is aware of what is happening at all times and can deal with problems independently.
- Customer Experience - your customers will generally enjoy a higher level of quality, and assuming they are happy with the solution, this is likely to increase customer loyalty.
Observability and monitoring are related, but they have distinct functions. Monitoring involves gathering and displaying data so that it can be used for further analysis, whereas observability refers to the accessibility of that data.
Functional monitoring looks only at the functional aspect of the solution, evaluating a use case or a group of use cases on a system. It identifies performance and availability problems at the functional level and ensures they are visible and recorded.
Functional monitoring is usually performed automatically by executing scripted operations on a system. Robot-based monitoring is well suited to ensuring quality of service and a good user experience.
When your solution is actively used by customers, functional monitoring is essential to ensure quality of service. Essentially, this is about testing all core user journeys and system workflows repeatedly, and then monitoring the results for any anomalies.
By testing the workflows, we continuously get information about the availability of the system. Depending on the application, these tests are run in production on a specific schedule (e.g., hourly or daily).
To illustrate functional monitoring, I will use a generic user journey of a customer placing an order.
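As a minimal sketch of what a scripted check might look like, the snippet below runs a list of steps in order, times each one, and stops at the first failure. The step functions (`open_home_page`, `search_catalogue`) are hypothetical stand-ins; in a real setup they would drive a browser or call your application's API, and a scheduler of your choice would invoke `run_check` at the chosen interval.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_check(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run each step in order, timing it; stop at the first step that raises."""
    results: List[StepResult] = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
        except Exception as exc:  # any exception marks the step as failed
            elapsed = time.perf_counter() - started
            results.append(StepResult(name, False, elapsed, str(exc)))
            break
        results.append(StepResult(name, True, time.perf_counter() - started))
    return results


def check_passed(results: List[StepResult], expected_steps: int) -> bool:
    """The check passes only if every expected step ran and succeeded."""
    return len(results) == expected_steps and all(r.ok for r in results)


# Hypothetical stand-ins for real scripted operations (assumptions for illustration).
def open_home_page() -> None:
    pass  # a real step would load the page and verify its content


def search_catalogue() -> None:
    raise RuntimeError("search did not respond")  # simulated failure


EXAMPLE_STEPS = [
    ("open_home_page", open_home_page),
    ("search_catalogue", search_catalogue),
    ("view_item", open_home_page),
]
```

Recording a result per step, rather than a single pass/fail flag, makes it easier to see where in a workflow the problem occurred and how long each step took.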
- Customer chooses a product
- Selects quantity and shipment options
- Selects payment options
- Completes payment
- Receives order confirmation on screen and via email
This user journey is quite common, but it is far from easy to test functionally, since it integrates with payment and possibly shipment flows provided by external services. In addition to this, users may come from various promotions and may bypass some of the steps, which further complicates the testing process.
To address this, the best strategy is to apply multi-aspect functional testing:
- Monitor the user journey using a robotic implementation, e.g. a dedicated test which executes the steps and reports on the results.
- Monitor real user activity, e.g. the number of orders started vs completed, or the number of users dropping out at a specific step. Ensure your decisions are based on data generated by multiple cases, since low amounts of traffic may not reveal potential problems.
- Monitor external services - collect metrics on the availability and performance of the external services required in this process, e.g. continuously check the availability of shipment or payment options provided by third parties, and report on any anomalies.
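The real user activity aspect can be sketched as a simple funnel calculation. The example below is an illustration under stated assumptions: the step names, the `min_sample` cut-off of 100 users, and the 10-point tolerance are all hypothetical values, and the counts would come from your own analytics data. Transitions with too little traffic return `None` rather than a misleading rate.

```python
from typing import Dict, List, Optional

# Hypothetical step names mirroring the order journey described above.
FUNNEL_STEPS = [
    "choose_product",
    "select_shipment",
    "select_payment",
    "complete_payment",
    "confirmation",
]


def step_conversion(
    counts: Dict[str, int], steps: List[str], min_sample: int = 100
) -> Dict[str, Optional[float]]:
    """Conversion rate from each step to the next.

    Returns None for a transition when fewer than `min_sample` users
    entered the step, because the rate would not be reliable.
    """
    rates: Dict[str, Optional[float]] = {}
    for current, nxt in zip(steps, steps[1:]):
        entered = counts.get(current, 0)
        key = f"{current}->{nxt}"
        if entered < min_sample:
            rates[key] = None
        else:
            rates[key] = counts.get(nxt, 0) / entered
    return rates


def anomalies(
    rates: Dict[str, Optional[float]],
    baseline: Dict[str, float],
    tolerance: float = 0.10,
) -> List[str]:
    """Transitions whose conversion fell more than `tolerance` below baseline."""
    return [
        key
        for key, rate in rates.items()
        if rate is not None and key in baseline and rate < baseline[key] - tolerance
    ]
```

Comparing against a baseline, rather than a fixed target, lets the same check work for steps that naturally convert at very different rates.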
The purpose of technical monitoring is to determine how well the software components underlying the system perform in real time. It focuses on the specific technical functions of each component in isolation and may not report on the functionality of the system as a whole. It reports on issues and allows operators to decide how to fix them.
It is important to stress that technical monitoring may not identify all problems within the system, as some issues do not show up in technical monitoring at all.
Below, I will focus on several best practices for ensuring that your systems are effectively monitored:
Context and Critical Components Mapping
The first step is to understand the context of your system. This is where you will need to conceptualise your system and get to know all of its components. The most important thing here is to determine and document which of the areas identified in your initial evaluation of your environment are the most business critical. Here are some steps I would recommend:
- Create a list of all critical components, including any standalone components, APIs, databases and custom applications.
- Look outside of your application's environment. Does your application depend on any third parties? If so, consider how critical their role is and document this.
- Create a list of all components which "support" your solution, such as hardware servers, auto-scaling, backups and disaster recovery implementations. This is usually attributed to infrastructure monitoring, and most of the well-known cloud providers already include tools to cover basic monitoring.
- Consider the value of monitoring your development and testing/UAT environments, including continuous delivery processes. This is usually done by the development team anyway, but depending on your setup it is good practice to ensure things are running smoothly and to take action on potential problems early.
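One way to document the outcome of these steps is a small component inventory that records criticality and whether each component is already monitored. This is only a sketch: the component names, owners, `kind` labels and the three-level criticality scale are hypothetical and should be replaced with whatever fits your environment.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Criticality(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Component:
    name: str
    kind: str  # e.g. "api", "database", "third_party", "infrastructure", "delivery"
    criticality: Criticality
    owner: str
    monitored: bool = False


def monitoring_gaps(
    inventory: List[Component], threshold: Criticality = Criticality.HIGH
) -> List[Component]:
    """Unmonitored components at or above `threshold`, most critical first."""
    gaps = [c for c in inventory if c.criticality >= threshold and not c.monitored]
    return sorted(gaps, key=lambda c: (-c.criticality, c.name))


# Hypothetical inventory for illustration only.
INVENTORY = [
    Component("checkout-api", "api", Criticality.HIGH, "web-team", monitored=True),
    Component("orders-db", "database", Criticality.HIGH, "platform-team"),
    Component("payment-provider", "third_party", Criticality.HIGH, "web-team"),
    Component("nightly-backup", "infrastructure", Criticality.MEDIUM, "platform-team"),
    Component("uat-pipeline", "delivery", Criticality.LOW, "dev-team"),
]
```

Calling `monitoring_gaps(INVENTORY)` lists the business critical components that still lack monitoring, which gives a natural starting point for the next step.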
Select and decide on a Technical Monitoring Solution
There are many tools available on the market. The most important thing is to choose a monitoring platform that can cover and monitor most of your business critical components in a single place. Adding further tools can significantly increase complexity, time to resolution, and the amount of effort required to perform proactive performance assessment and improvement activities.
Analyse and decide which metrics and what kind of data are important to you and your application. The monitoring tools you choose depend on the type of system you run, and may look very different for a small e-commerce environment than for a highly distributed containerized Java application.
Make sure the potential monitoring platform candidates can deliver the necessary metrics. Some tools might be able to gather the metrics you need right out of the box, while others might need significant adjustment or even code changes in your application.
At a bare minimum your metrics should report on component availability, performance and critical errors in the application. This can be relatively simple to achieve as most of the available tools are able to hook into your components and start aggregating your errors quickly, as well as monitor available endpoints.
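To make the bare minimum concrete, the sketch below summarises a batch of check results into the three figures mentioned above: availability, performance (here a 95th percentile latency, computed with the nearest-rank method) and a count of critical errors. The `Sample` and `Summary` types are hypothetical; most monitoring tools compute equivalent figures for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    ok: bool                      # did the component respond successfully?
    latency_ms: float             # how long the response took
    critical_error: bool = False  # was a critical application error logged?


@dataclass
class Summary:
    availability: float    # share of successful samples, 0.0 to 1.0
    p95_latency_ms: float  # nearest-rank 95th percentile latency
    critical_errors: int


def summarise(samples: List[Sample]) -> Summary:
    """Reduce raw samples to availability, p95 latency and critical error count."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    latencies = sorted(s.latency_ms for s in samples)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return Summary(
        availability=sum(1 for s in samples if s.ok) / n,
        p95_latency_ms=latencies[rank - 1],
        critical_errors=sum(1 for s in samples if s.critical_error),
    )
```

A percentile is used instead of an average because a handful of very slow responses can be hidden by a healthy-looking mean.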
Monitoring is essentially the collection of data, whereas alerting is the proactive notification layer on top of it, delivered via email, SMS, a ticketing system and so on. Alerting is useful, but you need to avoid "alert fatigue": when the system sends out too many alerts that require no action, monitoring teams may miss the important ones. Alert only on critical situations that DO require action. With alerting, less is more.
When setting up alerting, consider the business context of your application rather than getting too technical. In an e-commerce application, for example, alerting on and monitoring the technical metrics of a purchase workflow might prove much more valuable than collecting CPU data. Consider what is important for your end users and focus your alerting to enhance that end user experience.
It's important to keep in mind that tuning alert volume to reach an appropriate level is often an ongoing process. When starting out, refer back to the business purpose of your application. In an e-commerce application, for example, monitoring the response time of business critical transactions such as 'add to cart' and 'checkout' is clearly important, while monitoring CPU usage on a given application server is probably unnecessary. Think more about user experience and focus on what is business critical.
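One common way to keep alert volume down is to notify only when a business critical metric stays above its threshold for several consecutive checks, and to notify once per incident rather than on every breach. The sketch below illustrates that idea; the `AlertRule` class, the 2000 ms threshold and the requirement of three consecutive breaches are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    name: str
    threshold_ms: float
    consecutive: int = 3  # breaches in a row required before notifying
    _breaches: int = field(default=0, init=False, repr=False)
    _firing: bool = field(default=False, init=False, repr=False)

    def observe(self, value_ms: float) -> bool:
        """Record one measurement; return True only when a notification is due."""
        if value_ms > self.threshold_ms:
            self._breaches += 1
        else:
            # Recovery resets the rule so a later incident can notify again.
            self._breaches = 0
            self._firing = False
        if self._breaches >= self.consecutive and not self._firing:
            self._firing = True
            return True
        return False


# Hypothetical rule on the 'checkout' transaction response time.
checkout_rule = AlertRule("checkout response time", threshold_ms=2000, consecutive=3)
```

With this rule, a single slow checkout does not page anyone, while a sustained slowdown produces exactly one notification until the metric recovers.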
It is critically important to develop an effective monitoring strategy in order to have a truly performant and reliable application.
The best strategy is to combine functional and technical monitoring to obtain a complete view of the system. This gives you control and awareness of impact, and supports an adequate level of quality of service and confidence in operations.