Friday, December 9, 2011

Monitoring - Our Basic Principles

So, we decided to create our own monitoring tool, both to consolidate the monitoring of our various systems in one convenient location and to let our team be proactive rather than reactive. I am no expert in this field: my current gig is the first time I have been responsible for maintaining large-scale distributed apps (the ERP packages I worked with before are much simpler).

We had some very simple design goals. We did not want to spend time creating an application from scratch, or to get into very complex monitoring scenarios that would leave us managing an in-house monster. Our aim was to monitor for the 20% of failures that cause 80% of the headaches. With that in mind, I took some time to talk to our OAS app specialist, who outlined the symptoms of the most common failures and how clients see these failures reflected in specific reports. On his suggestion, we decided that the simplest policy would be to monitor the specific reports that reveal whether particular components have failed.

So which reports would we look at, and what would they tell us?*

  1. Site - Day of Month Delivery Information --> Are base transformers running?
  2. Site - Hour of Day Delivery Information --> Are the base transformers running in real-time?
  3. Site - Browser Delivery Information --> Because the information summary tables depend on Nightly execution (at least one phase of it), looking at this report for the current day hints at whether Nightly ran. It could have run only partially, but we are only doing a general health check.
  4. Inventory - Site Detail Forecast --> Like the previous report, this also gives us clues about Nightly's execution.
  5. Reach & Frequency - Site Reach by Date --> Are UV transformers running?
  6. Last RLC --> A failure in RLC can have many causes, so it is simply a good place to monitor.
*Bear in mind that we are aware that most of this monitoring can be done with tools such as Nagios. However, we do not have access to our client's monitoring tools, and we have many environments to monitor.
The monitoring we do lets more "junior" staff get a sense of whether all is fine; if there are problems, we then escalate. This saves us valuable time! In our tool, we also decided to collect information from other sources (non-API).
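
To make the report-to-component idea concrete, here is a minimal sketch in Python of how the list above could be written down as data and checked. It is not our actual tool. The names ReportCheck, CHECKS and run_checks are made up for illustration, and so is the rule that "no rows for the current period" means trouble; how a report is actually fetched is left to a caller-supplied function. Last RLC is left out because it is read differently from the reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ReportCheck:
    report: str     # report name, as in the list above
    component: str  # component we suspect if the report has no current data


# Hypothetical table mirroring the list in this post.
CHECKS = [
    ReportCheck("Site - Day of Month Delivery Information", "base transformers"),
    ReportCheck("Site - Hour of Day Delivery Information", "base transformers (real-time)"),
    ReportCheck("Site - Browser Delivery Information", "Nightly"),
    ReportCheck("Inventory - Site Detail Forecast", "Nightly"),
    ReportCheck("Reach & Frequency - Site Reach by Date", "UV transformers"),
]


def run_checks(fetch_row_count: Callable[[str], int]) -> Dict[str, List[str]]:
    """Return {component: [reports with no rows for the current period]}.

    fetch_row_count is an assumption: any function that, given a report
    name, returns how many rows that report has for the current period.
    """
    suspects: Dict[str, List[str]] = {}
    for check in CHECKS:
        if fetch_row_count(check.report) == 0:
            suspects.setdefault(check.component, []).append(check.report)
    return suspects


if __name__ == "__main__":
    # Stand-in fetcher: pretend only the inventory forecast came back empty.
    fake = lambda name: 0 if name.startswith("Inventory") else 5
    for component, reports in run_checks(fake).items():
        print(f"Escalate: {component} looks unhealthy ({', '.join(reports)})")
```

Keeping the checks as a plain table means a junior team member only has to read the output line and escalate, and adding a new report is a one-line change.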
