Deploying an effective infrastructure monitoring system is no small task. This series should give you a leg up on developing a strategy for implementing a monitoring solution for your environment.
- Part 1 – Introduction and Requirements
- Part 2 – Industry Leaders and Selection
- Part 3 – Effective Monitoring
- Part 4 – Implementation and Discovery
- Part 5 – Dashboards, Reports, and Access
- Part 6 – Continuous Improvement
The benefits of a monitoring system should be clear to anyone who has seen or used monitoring in a past environment. Here are just a few obvious ones:
- Early detection of issues leads to a quicker time to resolution and therefore less downtime
- Early warning of “trouble” can allow for a resolution before an outage occurs, or at least allow you more flexibility to schedule an outage.
- Increased control and visibility into the environment
- Historical trending data allows for more accurate forecasting
What is possible?
There are is a VERY wide variety of monitoring solutions available with limitless monitoring capabilities. If you’re just starting to explore the monitoring industry but not sure what types of capabilities exist, here’s a good list to get you thinking about what you might want for your environment:
- Checking metrics every X minutes/seconds:
- Disk/CPU/Memory Usage (do my resources look good?)
- Do I have at least 10% of disk space free? How about at least 2GB?
- Has my CPU been >90% for the past 10 minutes?
- Network Availability (can users get to it?)
- Application Status (is the application running?)
- Is a “licenseserver.exe” process running?
- Is the “httpd” service started?
- How many resources is a specific process using over time?
- Application Functionality (is the system doing what it’s supposed to?)
- Does the index.html page contain the text “Welcome!”?
- Is the server listening on TCP port 12345?
- Extendable monitoring with custom scripting
- Internally developed system with very specific monitoring needs? Write a script to tell the monitoring solution if the health is good.
- Disk/CPU/Memory Usage (do my resources look good?)
- Alerting:
- Notify an on-call administrator so they can fix the problem (text in the middle of the night).
- Escalate to a second administrator after 30min if the first alert wasn’t acknowledged and the problem still exists.
- Let your helpdesk know there is an outage or open a ticket to a 3rd party provider.
- Send another alert when the problem is resolved.
- Trending/Historical:
- How much has this volume grown in the past 12 months?
- A 1TB disk just hit 90% usage, how much should I add to give it another 6 months of growth?
- Based off of the past 2 years, how much storage should I buy when I replace our SAN?
- The CPU on this server has been consistently >80%, should I increase the resources?
- Troubleshooting:
- The application stopped working and CPU is stuck at 99%. Could this full disk have something to do with it?
- Application ABC is down, we coincidentally received an alert from server XYZ. Is it related? Was ABC dependent on XYZ?
- Application ABC went down at at 2AM. Memory usage for the process started growing slowly at 3PM the previous day. Why?
- Overview:
- Overall uptime metrics
- Outstanding alerts/issues
- Dashboard capabilities for quick overview of the environment
- Mapping of applications for quick health check of an application
Defining Scope
Evaluating and choosing a monitoring system requires an understanding of the “balancing act” between and financial investment, employee time, solution functionality. Defining a scope will allow you to narrow down your requirements and manage expectations for what the solution will do. For infrastructure monitoring, your scope will mostly revolve around what you want to monitor and where. These should be “broad strokes” which will be broken down into individual requirements later.
Consider whether some of the ideas below apply to your project:
- Monitoring…
- Operating System Metrics
- Network Infrastructure
- Virtualization Infrastructure
- Storage Infrastructure
- Application Functionality
- Implementing…
- Alerting (including identifying stakeholders for the monitored systems)
- Global views (for a distributed deployment)
- Dashboards
- Documentation
- Training
Defining Requirements
In order to effectively select a product that meets your requirements (but remains within the project scope), you first need to define them, in writing. Your goal should be a list of attributes of the system. You should assign each requirement into one of two categories: Required or Optional.
If your list of requirements is large and complex, chances are no single solution will meet all of them. You may want to keep the number of required capabilities low, and rate the rest on a scale of 1-10 for importance. Below is a good starting point, but be sure and add/remove requirements specific to your organization and systems.
The selected infrastructure monitoring solution shall…
- be able to monitor the following:
- Windows Server 2008 R2 / 2012 R2
- Windows Service Status (Running/Stopped) – Required
- Process Status (Running/Not Found) – Required
- CPU / RAM / Disk Utilization – Required
- Red Hat Enterprise Linux 6 / 7
- Linux Service Status (Running/Stopped) – Required
- Process Status (Running/Not Found) – Required
- CPU / RAM / Disk Utilization – Required
- Specific vendor appliances and applications
- NetApp/EMC Storage Array – Required
- VMware ESX Host CPU / RAM Utilization – Required
- APC Uninterruptible Power Supply – Optional
- Windows Server 2008 R2 / 2012 R2
- be able to monitor all systems without an agent needing to be installed (WMI/RPC/SSH/SNMP). – Required
- be able to send alerts on the metrics listed in this document. – Required
- be able to view historical information in visual way. – Required
- be able to schedule scans and identify new/unmonitored systems. – Optional
- be able to monitor 300 servers running 1200 services. -Required
- be able to be extended with custom scripting. -Optional
- cost less than $X for the initial deployment. -Required
- cost less than $X per year. -Required
- be supported by the vendor in the form of support tickets and available training. -Required
Conclusion
In this part we accomplished the following:
- Discussed some benefits of infrastructure monitoring
- Listed common capabilities of products in the industry
- Scoped the project appropriately
- Defined our initial draft of requirements for the solution
In Part 2 we’ll cover the following:
- An overview of the major monitoring solutions (strengths/weaknesses)
- The iterative process of selecting the right fit for your requirements
- The benefit of doing a “Proof of Concept”
Thank you so much for this, this has been a big help towards helping me setup a monitoring platform for my organization!