Friday, December 9, 2011

Monitoring - Our Basic Principles

So, we decided to create our own monitoring tool, to consolidate our monitoring of various systems in one convenient location and to allow our team to be proactive rather than reactive. I am no expert in this field, and my current gig is the first time I have been responsible for maintaining large-scale distributed apps (the ERP packages I worked with before are much simpler).

We had some very simple design goals. What we did not want was to spend time creating an application from scratch, or to get into very complex monitoring scenarios that would leave us managing an in-house monster. Our aim was to monitor for the 20% most common failures that were causing 80% of the headaches. With this in mind, I took some time to talk to our OAS app specialist, who outlined the symptoms of the most common failures and how clients see these failures reflected in specific reports. On his suggestion, we decided that the simplest policy would be to monitor the specific reports that reveal whether specific components have failed.

So which reports would we look at, and what would they tell us?*

  1. Site - Day of Month Delivery Information --> Are base transformers running?
  2. Site - Hour of Day Delivery Information --> Are the base transformers running in real-time?
  3. Site - Browser Delivery Information --> Because the information summary tables depend on Nightly execution (at least one phase of it), looking at this report for the current day hints at whether Nightly executed or not. It could have executed only partially, but we are only doing a general health check.
  4. Inventory - Site Detail Forecast --> Like the previous report, this also gives us clues about Nightly's execution
  5. Reach & Frequency - Site Reach by Date --> Are UV transformers running?
  6. Last RLC --> Failure in RLC can have many causes, so it is just a good place to monitor.
*Bear in mind that we are aware that most of this monitoring can be done with tools such as Nagios. However, we do not have access to our clients' monitoring tools, and we have many environments to monitor.
This monitoring gives us a sense of whether all is fine using more "junior" staff; if there are problems, we then escalate. This saves us valuable time! In our tool, we also decided to collect information from other sources (non-API).
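
To make the idea concrete, here is a minimal sketch in Python of how the report list above could be turned into a health check. It is not our actual tool: the `REPORT_CHECKS` table only restates the list, and `find_suspect_components` and the `report_is_current` callback are hypothetical names. How you decide that a report is current (for example, whether it has rows for today) depends on how you fetch it, so that part is left as a function you supply.

```python
from typing import Callable, Dict, List

# Report name -> the component whose health it hints at (restating the list above).
REPORT_CHECKS: Dict[str, str] = {
    "Site - Day of Month Delivery Information": "base transformers",
    "Site - Hour of Day Delivery Information": "base transformers (real-time)",
    "Site - Browser Delivery Information": "Nightly",
    "Inventory - Site Detail Forecast": "Nightly",
    "Reach & Frequency - Site Reach by Date": "UV transformers",
    "Last RLC": "RLC",
}


def find_suspect_components(report_is_current: Callable[[str], bool]) -> List[str]:
    """Return components whose reports look stale, in list order, without duplicates.

    report_is_current is a hypothetical callback: given a report name, it returns
    True when that report shows up-to-date data.
    """
    suspects: List[str] = []
    for report, component in REPORT_CHECKS.items():
        if not report_is_current(report) and component not in suspects:
            suspects.append(component)
    return suspects


if __name__ == "__main__":
    # Stand-in data: pretend only the browser report is missing today's numbers.
    stale = {"Site - Browser Delivery Information"}
    print(find_suspect_components(lambda name: name not in stale))
```

An empty result means the general health check passed. Anything else is a short list that junior staff can hand over when they escalate.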

Thursday, December 8, 2011

Monitoring tool using the API

Looooong time no post! I have been extremely busy with projects and issues here at work. However, I am extremely excited about a project I developed with help from a work buddy.

A little background: we have about 15 clients that have their own box installation of OAS (with over 200 servers). We have an extremely compact team relative to the size of our clients, and yet we are responsible for keeping all these servers functioning properly. OAS does offer a rich set of monitoring options that send e-mails with status updates on key components, but in our experience, receiving around 100-150 e-mails a day caused overload and made this type of monitoring not very effective.

We decided to create a very simple monitoring system that looks at OAS from the client perspective and makes our frontend team, which is much larger in number, responsible for proactive support.

In my next few posts I will outline our concept as well as upload the tool and code.