Infrastructure Monitoring – Part 1: Introduction and Requirements

Deploying an effective infrastructure monitoring system is no small task. This series should give you a leg up on developing a strategy for implementing a monitoring solution for your environment.

Part 1 – Introduction and Requirements
Part 2 – Industry Leaders and Selection
Part 3 – Effective Monitoring
Part 4 – Implementation and Discovery
Part 5 – Dashboards, Reports, and Access
Part 6 – Continuous Improvement

The benefits of a monitoring system should be clear to anyone who has seen or used monitoring in a past environment. Here are just a few obvious ones:

Early detection of issues leads to a quicker time to resolution and therefore less downtime
Early warning of “trouble” can allow for a resolution before an outage occurs, or at least give you more flexibility to schedule an outage
Increased control and visibility into the environment
Historical trending data allows for more accurate forecasting

What is possible?

There is a very wide variety of monitoring solutions available, covering a broad range of monitoring capabilities. If you’re just starting to explore the monitoring industry and aren’t sure what types of capabilities exist, here’s a good list to get you thinking about what you might want for your environment:

Checking metrics every X minutes/seconds:
- Disk/CPU/Memory Usage (do my resources look good?)
  - Do I have at least 10% of disk space free? How about at least 2GB?
  - Has my CPU been >90% for the past 10 minutes?
- Network Availability (can users get to it?)
- Application Status (is the application running?)
  - Is a “licenseserver.exe” process running?
  - Is the “httpd” service started?
  - How many resources is a specific process using over time?
- Application Functionality (is the system doing what it’s supposed to?)
  - Does the index.html page contain the text “Welcome!”?
  - Is the server listening on TCP port 12345?
- Extendable monitoring with custom scripting
  - Internally developed system with very specific monitoring needs? Write a script to tell the monitoring solution if the health is good.
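As a rough illustration of the kinds of checks listed above, here is a minimal Python sketch of what a custom check script might look like. The function names, the decision to require both the 10% and the 2GB thresholds at once, and the 3-second connection timeout are assumptions made for this example, not features of any particular monitoring product.

```python
import shutil
import socket

GIB = 1024 ** 3


def disk_has_headroom(total_bytes, free_bytes,
                      min_free_pct=10.0, min_free_bytes=2 * GIB):
    """Return True when free space meets both the percentage and absolute floors.

    Requiring both thresholds is an assumption of this sketch; a real check
    might use either one on its own.
    """
    if total_bytes <= 0:
        return False
    free_pct = 100.0 * free_bytes / total_bytes
    return free_pct >= min_free_pct and free_bytes >= min_free_bytes


def check_disk(path="/"):
    """Apply the disk thresholds to the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return disk_has_headroom(usage.total, usage.free)


def cpu_sustained_high(samples_pct, threshold_pct=90.0):
    """Return True only if every sample in the window is above the threshold.

    `samples_pct` is assumed to hold the CPU readings collected over the
    window of interest, for example one reading per minute for 10 minutes.
    """
    return len(samples_pct) > 0 and all(s > threshold_pct for s in samples_pct)


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduler would call something like `check_disk("/")` or `tcp_port_open("localhost", 12345)` every X minutes and hand the result to the alerting side of the monitoring solution.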
Alerting:
- Notify an on-call administrator so they can fix the problem (text in the middle of the night).
- Escalate to a second administrator after 30 minutes if the first alert wasn’t acknowledged and the problem still exists.
- Let your helpdesk know there is an outage or open a ticket to a 3rd party provider.
- Send another alert when the problem is resolved.
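The escalation rule above is simple enough to express as a single condition. The sketch below is only an illustration; the function and parameter names are hypothetical, and real products usually expose this as configuration rather than code.

```python
def should_escalate(minutes_open, acknowledged, resolved, escalate_after_min=30):
    """Return True when the alert should go to the second administrator.

    Escalation happens only if the alert has been open long enough,
    nobody has acknowledged it, and the problem still exists.
    """
    return (not acknowledged) and (not resolved) and minutes_open >= escalate_after_min
```

For example, `should_escalate(45, acknowledged=False, resolved=False)` is true, while the same alert acknowledged at any point would stay with the first administrator.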
Trending/Historical:
- How much has this volume grown in the past 12 months?
- A 1TB disk just hit 90% usage, how much should I add to give it another 6 months of growth?
- Based on the past 2 years, how much storage should I buy when I replace our SAN?
- The CPU on this server has been consistently >80%, should I increase the resources?
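To make the disk growth question concrete, here is a small sketch that fits a straight line to monthly usage samples and estimates how much capacity to add. It assumes growth is roughly linear, treats 1TB as 1000GB, and aims to stay at or below 90% usage after the forecast period; the sample figures and function names are invented for illustration.

```python
def monthly_growth(samples_gb):
    """Least-squares slope (GB per month) through evenly spaced monthly samples."""
    n = len(samples_gb)
    if n < 2:
        raise ValueError("need at least two samples to estimate growth")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_gb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_gb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def capacity_to_add(samples_gb, capacity_gb, months_ahead=6, max_util=0.90):
    """GB to add so projected usage stays at or below max_util of total capacity."""
    projected = samples_gb[-1] + monthly_growth(samples_gb) * months_ahead
    needed = projected / max_util
    return max(0.0, needed - capacity_gb)


# Hypothetical history: a 1000 GB disk that grew 60 GB per month and just hit 90%.
history_gb = [600, 660, 720, 780, 840, 900]
extra_gb = capacity_to_add(history_gb, capacity_gb=1000)
```

With this made-up history the projection is 1260GB used in six months, so roughly 400GB would need to be added to stay under 90% usage. Real data is rarely this tidy, which is one reason to keep as much history as the monitoring solution allows.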
Troubleshooting:
- The application stopped working and CPU is stuck at 99%. Could this full disk have something to do with it?
- Application ABC is down, we coincidentally received an alert from server XYZ. Is it related? Was ABC dependent on XYZ?
- Application ABC went down at 2AM. Memory usage for the process started growing slowly at 3PM the previous day. Why?
Overview:
- Overall uptime metrics
- Outstanding alerts/issues
- Dashboard capabilities for quick overview of the environment
- Mapping of applications for quick health check of an application

Defining Scope

Evaluating and choosing a monitoring system requires an understanding of the “balancing act” between financial investment, employee time, and solution functionality. Defining a scope will allow you to narrow down your requirements and manage expectations for what the solution will do. For infrastructure monitoring, your scope will mostly revolve around what you want to monitor and where. These should be “broad strokes” which will be broken down into individual requirements later.

Consider whether some of the ideas below apply to your project:

Monitoring…
- Operating System Metrics
- Network Infrastructure
- Virtualization Infrastructure
- Storage Infrastructure
- Application Functionality
Implementing…
- Alerting (including identifying stakeholders for the monitored systems)
- Global views (for a distributed deployment)
- Dashboards
Documentation
Training

Defining Requirements

To effectively select a product that meets your requirements (but remains within the project scope), you first need to define them in writing. Your goal should be a list of attributes of the system. Assign each requirement to one of two categories: Required or Optional.

If your list of requirements is large and complex, chances are no single solution will meet all of them. You may want to keep the number of required capabilities low, and rate the rest on a scale of 1-10 for importance. Below is a good starting point, but be sure to add or remove requirements to match your organization and systems.

The selected infrastructure monitoring solution shall…

be able to monitor the following:
- Windows Server 2008 R2 / 2012 R2
  - Windows Service Status (Running/Stopped) – Required
  - Process Status (Running/Not Found) – Required
  - CPU / RAM / Disk Utilization – Required
- Red Hat Enterprise Linux 6 / 7
  - Linux Service Status (Running/Stopped) – Required
  - Process Status (Running/Not Found) – Required
  - CPU / RAM / Disk Utilization – Required
- Specific vendor appliances and applications
  - NetApp/EMC Storage Array – Required
  - VMware ESX Host CPU / RAM Utilization – Required
  - APC Uninterruptible Power Supply – Optional
be able to monitor all systems without an agent needing to be installed (WMI/RPC/SSH/SNMP). – Required
be able to send alerts on the metrics listed in this document. – Required
be able to display historical information in a visual way. – Required
be able to schedule scans and identify new/unmonitored systems. – Optional
be able to monitor 300 servers running 1200 services. – Required
be able to be extended with custom scripting. – Optional
cost less than $X for the initial deployment. – Required
cost less than $X per year. – Required
be supported by the vendor in the form of support tickets and available training. – Required
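One way to turn a list like the one above into a comparison is to treat Required items as pass/fail and sum the 1-10 importance ratings of the Optional items a product supports. The sketch below assumes that approach; the weights and the product capabilities in the example are hypothetical placeholders, not evaluations of real products.

```python
def score_solution(required, optional_weights, supported):
    """Score one candidate product against the requirements.

    required: set of requirement names that must all be met
    optional_weights: dict of optional requirement name -> importance (1-10)
    supported: set of requirement names the candidate satisfies

    Returns (score, missing_required). The score is None when any required
    item is missing, because such a candidate is disqualified outright.
    """
    missing = sorted(required - supported)
    if missing:
        return None, missing
    score = sum(w for name, w in optional_weights.items() if name in supported)
    return score, []


required = {"agentless monitoring", "alerting", "historical graphs"}
optional_weights = {"discovery scans": 7, "custom scripting": 5, "APC UPS": 3}

# Hypothetical candidates and the requirements each one meets.
product_a = {"agentless monitoring", "alerting", "historical graphs", "custom scripting"}
product_b = {"alerting", "historical graphs", "discovery scans", "APC UPS"}

result_a = score_solution(required, optional_weights, product_a)
result_b = score_solution(required, optional_weights, product_b)
```

Here the first candidate scores 5, while the second is disqualified for lacking agentless monitoring even though it covers more optional items. That is the reason to keep the Required list short.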

Conclusion

In this part we accomplished the following:

Discussed some benefits of infrastructure monitoring
Listed common capabilities of products in the industry
Scoped the project appropriately
Defined our initial draft of requirements for the solution

In Part 2 we’ll cover the following:

An overview of the major monitoring solutions (strengths/weaknesses)
The iterative process of selecting the right fit for your requirements
The benefit of doing a “Proof of Concept”

About the Author: Caesar Kabalan
