Infrastructure Monitoring – Part 1: Introduction and Requirements

Deploying an effective infrastructure monitoring system is no small task. This series should give you a leg up on developing a strategy for implementing a monitoring solution for your environment.

The benefits of a monitoring system should be clear to anyone who has seen or used monitoring in a past environment. Here are just a few obvious ones:

  • Early detection of issues leads to quicker resolution and therefore less downtime.
  • Early warning of “trouble” can allow for a fix before an outage occurs, or at least give you more flexibility to schedule one.
  • Increased control of and visibility into the environment.
  • Historical trending data allows for more accurate forecasting.

What is possible?

There is a very wide variety of monitoring solutions available, with nearly limitless monitoring capabilities. If you’re just starting to explore the monitoring industry and aren’t sure what types of capabilities exist, here’s a good list to get you thinking about what you might want for your environment:

  • Checking metrics every X minutes/seconds:
    • Disk/CPU/Memory Usage (do my resources look good?)
      • Do I have at least 10% of disk space free? How about at least 2GB?
      • Has my CPU been >90% for the past 10 minutes?
    • Network Availability (can users get to it?)
    • Application Status (is the application running?)
      • Is a “licenseserver.exe” process running?
      • Is the “httpd” service started?
      • How many resources is a specific process using over time?
    • Application Functionality (is the system doing what it’s supposed to?)
      • Does the index.html page contain the text “Welcome!”?
      • Is the server listening on TCP port 12345?
    • Extendable monitoring with custom scripting
      • Internally developed system with very specific monitoring needs? Write a script that tells the monitoring solution whether the health is good (see the sketch after this list).
  • Alerting:
    • Notify an on-call administrator so they can fix the problem (text in the middle of the night).
    • Escalate to a second administrator after 30min if the first alert wasn’t acknowledged and the problem still exists.
    • Let your helpdesk know there is an outage or open a ticket to a 3rd party provider.
    • Send another alert when the problem is resolved.
  • Trending/Historical:
    • How much has this volume grown in the past 12 months?
    • A 1TB disk just hit 90% usage, how much should I add to give it another 6 months of growth?
    • Based on the past 2 years, how much storage should I buy when I replace our SAN?
    • The CPU on this server has been consistently >80%, should I increase the resources?
  • Troubleshooting:
    • The application stopped working and CPU is stuck at 99%. Could this full disk have something to do with it?
    • Application ABC is down, we coincidentally received an alert from server XYZ. Is it related? Was ABC dependent on XYZ?
    • Application ABC went down at 2AM. Memory usage for the process started growing slowly at 3PM the previous day. Why?
  • Overview:
    • Overall uptime metrics
    • Outstanding alerts/issues
    • Dashboard capabilities for quick overview of the environment
    • Mapping of applications for quick health check of an application
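
To make the custom-scripting idea concrete, here is a minimal sketch of a health-check script in the widely used Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL), which most monitoring products can consume in one form or another. The process name “myappd” and the /data mount point are placeholders, not references to any particular product:

#!/bin/sh
# Minimal custom health check in the Nagios plugin convention:
#   exit 0 = OK, 1 = WARNING, 2 = CRITICAL
# "myappd" and "/data" are placeholders for your own application.

# Is the application process running?
if ! pgrep -x myappd >/dev/null 2>&1; then
    echo "CRITICAL: myappd process not found"
    exit 2
fi

# Does the data volume have at least 10% free space?
used=$(df -P /data | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$used" -ge 90 ]; then
    echo "WARNING: /data is ${used}% used"
    exit 1
fi

echo "OK: myappd running, /data ${used}% used"
exit 0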

Defining Scope

Evaluating and choosing a monitoring system requires an understanding of the “balancing act” between financial investment, employee time, and solution functionality. Defining a scope will allow you to narrow down your requirements and manage expectations for what the solution will do. For infrastructure monitoring, your scope will mostly revolve around what you want to monitor and where. These should be “broad strokes” which will be broken down into individual requirements later.

Consider whether some of the ideas below apply to your project:

  • Monitoring…
    • Operating System Metrics
    • Network Infrastructure
    • Virtualization Infrastructure
    • Storage Infrastructure
    • Application Functionality
  • Implementing…
    • Alerting (including identifying stakeholders for the monitored systems)
    • Global views (for a distributed deployment)
    • Dashboards
  • Documentation
  • Training

Defining Requirements

In order to effectively select a product that meets your requirements (but remains within the project scope), you first need to define them, in writing. Your goal should be a list of attributes of the system, with each requirement assigned to one of two categories: Required or Optional.

If your list of requirements is large and complex, chances are no single solution will meet all of them. You may want to keep the number of required capabilities low and rate the rest on a scale of 1–10 for importance. Below is a good starting point, but be sure to add and remove requirements specific to your organization and systems.

The selected infrastructure monitoring solution shall…

  • be able to monitor the following:
    • Windows Server 2008 R2 / 2012 R2
      • Windows Service Status (Running/Stopped) – Required
      • Process Status (Running/Not Found) – Required
      • CPU / RAM / Disk Utilization – Required
    • Red Hat Enterprise Linux 6 / 7
      • Linux Service Status (Running/Stopped) – Required
      • Process Status (Running/Not Found) – Required
      • CPU / RAM / Disk Utilization – Required
    • Specific vendor appliances and applications
      • NetApp/EMC Storage Array – Required
      • VMware ESX Host CPU / RAM Utilization – Required
      • APC Uninterruptible Power Supply – Optional
  • be able to monitor all systems without an agent needing to be installed (WMI/RPC/SSH/SNMP; see the quick check after this list). – Required
  • be able to send alerts on the metrics listed in this document. – Required
  • be able to display historical information in a visual way. – Required
  • be able to schedule scans and identify new/unmonitored systems. – Optional
  • be able to monitor 300 servers running 1200 services. – Required
  • be able to be extended with custom scripting. – Optional
  • cost less than $X for the initial deployment. – Required
  • cost less than $X per year. – Required
  • be supported by the vendor in the form of support tickets and available training. – Required
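
As a practical aside on the agentless requirement: during evaluation it pays to confirm early that your targets actually respond over the protocols listed. Here’s a quick example using net-snmp’s snmpwalk (the hostname and the “public” community string are placeholders for your environment):

# Walk per-CPU load (HOST-RESOURCES-MIB::hrProcessorLoad) over SNMP v2c.
# The hostname and "public" community string are placeholders.
snmpwalk -v2c -c public server01.example.com 1.3.6.1.2.1.25.3.3.1.2

If this returns a load value per processor, the host is reachable agentlessly over SNMP; similar spot checks work for WMI, SSH, and RPC.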

Conclusion

In this part we accomplished the following:

  • Discussed some benefits of infrastructure monitoring
  • Listed common capabilities of products in the industry
  • Scoped the project appropriately
  • Defined our initial draft of requirements for the solution

In Part 2 we’ll cover the following:

  • An overview of the major monitoring solutions (strengths/weaknesses)
  • The iterative process of selecting the right fit for your requirements
  • The benefit of doing a “Proof of Concept”
vSphere Flash Read Cache – Part 2: Implementation

Introduction

In the last part of this series we walked through determining if a workload is appropriate for vSphere Flash Read Cache. We then looked at the different aspects of sizing the cache appropriately for a workload. We determined the correct size via table hit statistics queried from the Oracle Database. We determined the correct block size by using the vscsiStats ESX command to gather statistics on the size of I/O commands on a specific VMDK. In our example the database is 1.1TB, and we require a 150GB cache with a 512KB block size.

In this part we will cover the following:

  • Presenting a number of SSDs to the hypervisor in a RAID0 configuration
  • Configuring the ESX host to treat the newly found drive as an SSD
  • Making the SSD available as a Virtual Flash Resource
  • Configuring our Data Warehouse’s data drive with the settings we reasoned out in Part 1

Provisioning the SSDs

For our solution we are using two Dell PowerEdge M710HD blade servers with two Dell 200GB SAS SSDs each.

After inserting them into the blades we needed to create a new RAID set combining the drives into a RAID0 stripe. For vFRC, RAID0 is a very attractive option because fault tolerance is not necessary (if the cache disappears, all I/O simply goes straight to the back-end storage device) and it gives increased performance over a single larger SSD.

The steps for creating a RAID set vary by controller manufacturer, and the individual steps are outside the scope of this article. If you have a single SSD, depending on the disk controller you may be able to pass it straight through to the hypervisor. If you have more than one SSD, or a disk controller that doesn’t support passthrough, you will need to create a RAID0 stripe of all disks.

An example setup for a Dell PERC controller with two 200GB disks is below:

[Screenshot: PERC virtual disk configuration on the PowerEdge M710HD, via the CMC console]

Reboot the host into ESX (ensure your boot order is still properly defined).

Mark Disk as SSD

Now you would expect to be able to simply add this RAID0 SSD set as a Virtual Flash Resource. Wrong! (If you have a single passed-through SSD you can skip this step.) Almost any RAID set built on top of SSDs will be presented to the host as a non-SSD, therefore making it “non-flash” and impossible to add as a Virtual Flash Resource.

You can verify this by selecting a host in the vSphere Web Client, going to the Manage tab, then to Settings, and then to Virtual Flash Resource Management and choosing “Add Capacity…” at the top.

[Screenshot: vSphere Web Client – Add Virtual Flash Resource Capacity dialog showing no eligible devices]

In order for a device to be displayed in this list it MUST be recognized as an SSD by the ESX host. To verify the host views the drive as a non-SSD in the vSphere Web Client, with a host selected, go to Manage, then Storage, then Storage Devices.

Below you can see the drive marked as a non-SSD:

[Screenshot: vSphere Web Client – Storage Devices list with the new RAID0 device showing Drive Type “Non-SSD”]

Unfortunately there is no easy right-click solution for marking the device as an SSD; instead we have to set the SSD flags via the ESX command line. Open up a command line to your ESX host and list the NMP devices with the following command:
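
esxcli storage nmp device list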

[... SNIP ...]
naa.600508e0000000007cf406db80531904
   Device Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000007cf406db80531904)
   Storage Array Type: VMW_SATP_LOCAL
   Storage Array Type Device Config: SATP VMW_SATP_LOCAL does not support device configuration.
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C1:T0:L0;current=vmhba2:C1:T0:L0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C1:T0:L0
   Is Local SAS Device: true
   Is Boot USB Device: false
[... SNIP ...]

Find your disk in the long list that is displayed. I suggest copying and pasting the output for that specific disk into a notepad so you don’t confuse it with a LUN. That would be bad.

You can display extended details about the disk as follows:
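
esxcli storage core device list --device=naa.600508e0000000007cf406db80531904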

naa.600508e0000000007cf406db80531904
   Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000007cf406db80531904)
   Has Settable Display Name: true
   Size: 380416
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.600508e0000000007cf406db80531904
   Vendor: Dell
   Model: Virtual Disk
   Revision: 1028
   SCSI Level: 6
   Is Pseudo: false
   Status: degraded
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: unknown
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.0200000000600508e0000000007cf406db80531904566972747561
   Is Local SAS Device: false
   Is Boot USB Device: false
   No of outstanding IOs with competing worlds: 32

To mark the disk as an SSD, run the following command, taking care to replace the items in brackets:

esxcli storage nmp satp rule add --satp=[Storage Array Type] --device=[SCSI NAA] --option="enable_ssd enable_local"

You can find the Storage Array Type from the first “device list” command. In my example the command would be:

esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=naa.600508e0000000007cf406db80531904 --option="enable_ssd enable_local"

You will need to reboot the ESX host for the disk to be recognized as an SSD. After the host reboots you can double check the drive type again:

[Screenshot: vSphere Web Client – Storage Devices list with the device now showing Drive Type “SSD”]
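
If you prefer to verify from the command line instead, the same device-details command from earlier will show the updated flag (a quick check, assuming grep is available in your ESX shell, which it normally is):

esxcli storage core device list --device=naa.600508e0000000007cf406db80531904 | grep -i "Is SSD"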

Provision the Flash Read Cache

After your disk is marked as an SSD it should be available in the “Add Capacity…” window under Virtual Flash Resource Management under Settings.

[Screenshot: vSphere Web Client – Add Virtual Flash Resource Capacity dialog now listing the SSD device]

Check the boxes and click “OK”. The new space should then show up as additional capacity under Virtual Flash Resource Management:

[Screenshot: vSphere Web Client – Virtual Flash Resource Management showing the added capacity]

Now that we have successfully provisioned the disks as a Flash Resource on a single host, it is time to repeat the steps above for every other host that the guest may reside on.

NOW IS THE TIME TO PROVISION YOUR OTHER HOSTS USING THE STEPS ON THIS PAGE.

Once all the hosts have been provisioned you can enable vFRC on a VMDK. Browse to a VM and choose Edit Settings (the VM can be running). Expand the VMDK you analyzed in Part 1 and click Advanced in the Virtual Flash Read Cache row. Enter the values determined in Part 1. My example is below:

[Screenshot: vSphere Web Client – Edit Settings dialog with the 150GB / 512KB Virtual Flash Read Cache on the data VMDK]

OK your way out of the settings and verify the vFRC is now displayed under VM Hardware:

[Screenshot: vSphere Web Client – VM Hardware panel showing the assigned Flash Read Cache]

Once you can see it assigned to the VM you’re done… for now. The next phase is monitoring and tweaking.

Conclusion

In this part we’ve accomplished the following:

  • Summarized our findings from Part 1, where we determined a 150GB vFRC cache size with a 512KB block size.
  • Installed the SSDs and provisioned them to the ESX host via the disk/RAID controller.
  • Changed disk flags via the ESX command line so the new RAID set is seen as an SSD.
  • Provisioned a new Virtual Flash Resource from the newly recognized SSD.
  • Assigned vFRC resources to a VM using the initial settings found in Part 1.

In Part 3 we’ll cover the following:

  • Monitor the performance changes on the Data Warehouse Database.
  • Tweak the vFRC settings as needed while providing reasoning behind the change.
vSphere Flash Read Cache – Part 1: Intro and Sizing

Introduction

This series will consist of several parts written over several weeks and will encompass the planning, sizing, implementation, and monitoring of vSphere Flash Read Cache to improve Data Warehouse performance using locally attached SSDs.

While going through the new features we gained by moving from vSphere 5.0 to vSphere 5.5, we decided to see whether vSphere Flash Read Cache could improve our Data Warehouse jobs’ run times.

As part of this discovery process we needed to determine whether vSphere Flash Read Cache (hereafter vFRC) would be a good fit for our Data Warehouse.

Facts about the target system:

  • Red Hat Enterprise Linux 5.8
  • Oracle Enterprise Edition 11gR2 (11.2)
  • Approximately 1.1TB of data
  • Not using ASM

Some quick analysis on the database revealed that 90.2% of database disk actions were physical reads, making this a very promising candidate for a read-caching solution. Because of budgetary concerns we decided to start small with some local SAS SSDs; more expensive options, such as PCIe flash cards, were incompatible with our blade servers anyway.

After reading some best-practice articles, it became clear that a lot of effort would have to go into determining the proper sizing of the cache allocated to the VM.

Cache Sizing

From the Oracle side we pulled some access data that allowed us to see table names, table sizes, and the number of reads from each table over a certain period of time.

Anonymized data is below:

[Table: anonymized per-table access data – table name, table size, and read count over the sampling period]

…and so on for another 15,000 lines.

We found that in our Data Warehouse, 50% of all reads were satisfied by the same ~740MB of data. As you expand that percentage, the amount of data required grows very quickly: 80% of reads were satisfied by the same 110GB of data. This meant our DW, while 1.1TB in size, had a small amount of frequently accessed data and a large amount of seldom-accessed data.

Analyzing this data, we determined a good starting point to be 150GB of cache at around the 97.5% mark; that is, the data satisfying 97.5% of reads would fit within 150GB. The last 2.5% of reads touched the remaining ~950GB, most likely historical tables that are rarely accessed.
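
If you want to reproduce this kind of analysis yourself, below is a minimal sketch, assuming you’ve exported the per-table statistics to a hypothetical CSV file (table_hits.csv with columns table,size_mb,reads and no header row — the file name and layout are illustrative, not from any particular tool):

# Print cumulative data size vs. cumulative read coverage, hottest tables first.
# Find the row where coverage crosses your target (e.g. 97.5%) to size the cache.
total=$(awk -F, '{t += $3} END {print t}' table_hits.csv)
sort -t, -k3,3 -rn table_hits.csv | awk -F, -v total="$total" '
  { cum_reads += $3; cum_gb += $2 / 1024
    printf "%-30s %8.1f GB %7.2f%%\n", $1, cum_gb, 100 * cum_reads / total }'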

Cache Block Size Sizing

At this point we have a good starting point for the vFRC size; now we need data to choose the proper block size.

First, an excerpt from the vFRC Performance Study by VMware:

[…] Cache block size impacts vFRC performance. The best way to choose the best cache block size is to match it according to the I/O size of the workload. VscsiStats may be used to find the I/O size in real-time when running the workload. […]

vscsiStats is a command accessible from the ESX command line. From the command help:

VscsiStats — This tool controls vscsi data collection for virtual machine disk I/O workload characterization. Users can start and stop online data collection for particular VMs as well as print out online histogram data. Command traces can also be collected and printed.

In order to determine the best block size we will use vscsiStats to determine the spread of different I/O request sizes.

Start by listing available VMs and their disks with:
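
vscsiStats -l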

Virtual Machine worldGroupID: 1203178, Virtual Machine Display Name: USATEST01, Virtual Machine Config File: /vmfs/volumes/523358d4-466be286-1837-842b2b0ca19e/USATEST01/USATEST01.vmx, {
 Virtual SCSI Disk handleID: 8192 (scsi0:0)
}

Once you find the correct disk to monitor, begin collecting statistics using the following format:

vscsiStats -s -w [worldGroupID] -i [handleID]

Example:
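
vscsiStats -s -w 1203178 -i 8192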

vscsiStats: Starting Vscsi stats collection for worldGroup 1203178, handleID 8192 (scsi0:0)
Success.

As the statistics are collected you can query the ioLength histogram using the following command:

vscsiStats -p ioLength -w [worldGroupID] -i [handleID]

In my example the output is as follows. This example VM is mostly idle.
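
vscsiStats -p ioLength -w 1203178 -i 8192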

[... SNIP ...]
Histogram: IO lengths of Read commands for virtual machine worldGroupID : 1203178, virtual disk handleID : 8192 (scsi0:0) {
 min : 512
 max : 1052672
 mean : 52740
 count : 33823
   {
      2627               (<=                512)
      562                (<=               1024)
      960                (<=               2048)
      758                (<=               4095)
      7513               (<=               4096)
      950                (<=               8191)
      1098               (<=               8192)
      1707               (<=              16383)
      3633               (<=              16384)
      10227              (<=              32768)
      275                (<=              49152)
      1210               (<=              65535)
      375                (<=              65536)
      95                 (<=              81920)
      130                (<=             131072)
      485                (<=             262144)
      293                (<=             524288)
      925                (>              524288)
   }
}
[... SNIP ...]

While this command reports read, write, and combined SCSI commands, we only care about the first: since vFRC caches only reads, what matters is how big the reads being satisfied by the SCSI disk are. For the disk above, the most popular read block size was 32K, at 10,227 of 33,823 reads (~30%). On our Data Warehouse we ended up with the following distribution:

[Table: read I/O length histogram for the Data Warehouse VMDK – 16K blocks ≈ 31% of reads, 512K blocks ≈ 57%]

Since the reads were split between 16K blocks at 31% and 512K blocks at 57%, we opted for the 512K block size.

In addition, the total count of read commands over this period was 389,544, while write commands totaled 32,843. Running the math (389,544 of 422,387 total commands, or about 92% reads) shows we’re pretty close to the 90% mark we saw from the database.
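
One housekeeping note: statistics collection adds a small amount of overhead, so once you have the data you need, stop the collection for the VM (shown here with the same IDs as in the earlier examples):

vscsiStats -x -w 1203178 -i 8192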

Conclusion

In this part we’ve accomplished the following:

  • Identified a need for faster performance on our Data Warehouse.
  • Used Oracle’s reporting tools to determine that we need approximately 150GB of cache in order to satisfy the large majority of the reads of the system.
  • Used VMware’s vscsiStats command to determine that the most-used read block sizes were 16K and 512K.
  • Combined these conclusions to arrive at a good initial configuration: a 150GB cache with a 512K block size.

In Part 2 we’ll cover the following:

  • Install two Enterprise SSDs in each of two ESX hosts
  • Configure each Dell ESX host’s SAS controller to treat the SSDs as a single RAID0 set
  • Configure the ESX host to treat the newly found drive as an SSD
  • Make the SSD available as a Virtual Flash Resource
  • Configure our Data Warehouse’s data drive to have the configuration we reasoned out in Part 1.