There are times when you have to be honest with the people who rely on you, and say you’re sorry to have let them down. This is how I’ve felt more than once during the last couple of weeks, as the alerting system of Spatineo Monitor has been sending a lot of uncalled-for alerts about “insufficient data” to many of our customers.
The core issue, writing the monitoring data to our database fast enough, has now been fixed, and we have not sent any of these alerts since 4th April 2018. As these cases go, we could list a number of reasons why this happened, but that does not change the fact that we should have done better. The entire Spatineo team, and particularly I as the Head of Customer Experience, are truly sorry for the inconvenience caused by these inbox-filling alerts that had nothing to do with how your services were doing, and we will contact the customers badly affected by this with an offer of compensation. Please contact our sales or support if you have any questions this blog post does not adequately answer.
So you sent out thousands of superfluous alert messages – what happened?
Ironically, our alerting system was functioning exactly as it is supposed to the whole time. In addition to the usual alert conditions in service availability, such as an average response time that is too slow or error results returned too often, we also create alerts when we can no longer reliably say whether the service is down or not based on the particular way of measuring it. In technical terms, one possible state for an alert is “insufficient data”. This normally happens when we stop measuring the service in the way configured for the alert indicator, because we have detected that the particular combination of request parameters for measuring the service is no longer valid according to the updated service capabilities.
Let’s say a service health measurement is configured to use a particular WMS layer, and this layer disappears from the capabilities document of the measured service. We will automatically stop measuring the service using this now-invalid meter, and if there are alert indicators bound to this meter, they will change into the “insufficient data” status after a while due to missing measurement data. We do still try to keep measuring the service, but since the layer is now different, the same alerting thresholds may not be appropriate anymore, so the Monitor user should manually configure the alert limits for the new combination of layer, coordinate system, image size and format.
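To make the three possible outcomes concrete, here is a minimal Python sketch of how an alert indicator could be evaluated. This is an illustration only, not the actual Spatineo Monitor code: the names (Measurement, evaluate_indicator), the 30-minute window and all threshold values are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Measurement:
    """One monitoring result for a meter (hypothetical structure)."""
    timestamp: datetime
    response_time_s: float
    is_error: bool


def evaluate_indicator(
    measurements: List[Measurement],
    now: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed evaluation window
    min_samples: int = 3,                       # assumed minimum sample count
    max_avg_response_s: float = 5.0,            # assumed threshold
    max_error_ratio: float = 0.5,               # assumed threshold
) -> str:
    """Return "OK", "ALERT" or "INSUFFICIENT_DATA" for one alert indicator."""
    recent = [m for m in measurements if now - window <= m.timestamp <= now]

    # Too few samples in the window: we cannot say anything reliable
    # about the service, so neither OK nor ALERT would be honest.
    if len(recent) < min_samples:
        return "INSUFFICIENT_DATA"

    avg_response = sum(m.response_time_s for m in recent) / len(recent)
    error_ratio = sum(1 for m in recent if m.is_error) / len(recent)

    if avg_response > max_avg_response_s or error_ratio > max_error_ratio:
        return "ALERT"
    return "OK"
```

The point of the sketch is that “insufficient data” is decided purely by how many measurements are available, regardless of why they are missing. A meter that was stopped because its layer disappeared and a meter whose results simply never arrived look exactly the same to this check.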
Monitoring tens of thousands of web services every 5 minutes is hard work, occasionally too hard for our servers to keep up for months in a row. To make sure our monitoring servers are producing a continuous flow of measurement data, and are not stuck, we continuously monitor all of them and reboot them if necessary. What happened several times between 23rd March and 4th April was that we could not write the monitoring results into our monitoring database as fast as the data was collected. This caused our internal “watch dog” process to think that many of our monitoring agent servers were not functioning, and to react by rebooting those servers automatically. Usually this does not cause any trouble, as rebooting a single server occasionally has little effect on the whole process. When this happens to many servers repeatedly, however, we sometimes end up with small gaps in the 5-minute-interval monitoring time series. These gaps were then correctly noticed by the alerting system as missing data for the particular alert indicator, and thus the “insufficient data” alert was sent out. As soon as the data stream was restored, the alert condition turned back to OK again.
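For the technically curious, a gap in a 5-minute time series can be found with a check along the following lines. Again, this is a simplified Python sketch under our own assumptions rather than the production code: the function name find_gaps and the tolerance factor of 1.5 are invented for this example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def find_gaps(
    timestamps: List[datetime],
    interval: timedelta = timedelta(minutes=5),  # expected measurement interval
    tolerance: float = 1.5,                      # assumed slack before calling it a gap
) -> List[Tuple[datetime, datetime]]:
    """Return (last_seen, next_seen) pairs where measurements are missing."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each measurement with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > interval * tolerance:
            gaps.append((earlier, later))
    return gaps
```

If the measurements at 00:10 and 00:15 never reach the database, the function reports a single gap from 00:05 to 00:20. A check like this cannot tell whether the service was never measured or the results were just not stored in time, which is why delayed database writes and rebooted agents ended up looking like missing data.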
While logical in hindsight, it took us quite a while to figure out that the slow data writing operations to the database were the core reason for all the trouble we, and unfortunately also our customers, were witnessing. When that became clear, the fix was simple: we added more processing power to the database servers and the issue disappeared. Sigh.
Lessons learned the hard way
To prevent things like this from happening in the future, we have now added a few more critical performance metrics for our monitoring system, which we are using to track the system health and to react quickly if things start looking bad. So now we are quite a bit better prepared for helping you keep your services running with high quality.
Once again, on behalf of the whole technical team of Spatineo, my sincerest apologies for the inconvenience caused. I would like to thank all of our customers for their continued support, and assure you that we are continuously working to improve the reliability and usefulness of our tools. We could not do this without your help and feedback.
Head of Customer Experience and Interoperability