SpectralCoding – My Technical Projects and Adventures
https://www.spectralcoding.com

SysAdmin Blog and RSS Feed List
Published Tue, 20 Sep 2016 – https://www.spectralcoding.com/2016/09/sysadmin-blog-and-rss-feed-list/

This comes up once in a while, so I've decided to make a post, update it, and whenever asked, link back to this. I'll also add the link to any referenced Reddit threads at the bottom of this post.

If anyone has items to add, feel free to reply and I'll add them to the user-submitted section.

Last Updated: September 20th, 2016

RSS Reader

Since Google Reader shut down I found and immediately fell in love with NewsBlur. I love it enough that I pay the $7/yr to support the developer, and I get some premium features in return. It's the best reader I've found that emulates the Google Reader style, and it adds so many features I didn't know I wanted. I support NewsBlur all the way, but I've also heard Feedly is good.

OPML/XML Feed Export

I’ve exported my NewsBlur feed here in XML/OPML format: spectralcoding_newsblur.xml

You can use this file as an import for many different feed readers.

Formatted List

Below is the human readable version of the above export. Feel free to pick and choose to add to your list. Order is alphabetical, not based on popularity/quality or anything.

System Administration

Linux

Windows

Virtualization

Database

Web Technologies

Security

Programming

Personal

User Submitted

Linux

Personal

Referenced Reddit Threads

Infrastructure Monitoring – Part 3: Effective Monitoring
Published Sun, 10 Apr 2016 – https://www.spectralcoding.com/2016/04/infrastructure-monitoring-part-3-effective-monitoring/

Part three of this series focuses on how to drive the most value out of your monitoring environment. It should give you a head start during implementation so you can make the correct configuration choices the first time.

How do you effectively monitor an environment? This question by itself doesn't really make sense; the point of monitoring is to act on the information it gives you. The question should really be "How can I leverage my monitoring solution to make my environment better?".

Environment Improvement via Monitoring

In essence it usually comes down to the following:

  • Quicker Time-To-Resolution – How quickly can you resolve a problem after it forms?
  • Interruption Prevention – How can I prevent interruptions of service, or schedule them for a less disruptive time?
  • Dashboards – How can I and others quickly assess the health of the environment? (Part 5)
  • Reporting – How can I gather information across my environment to better make future decisions? (Part 5)

Most monitoring solutions collect more data than you will ever access. This is not wasteful; it helps ensure that you have all the information at your fingertips when you need to answer a question. In my experience people can sometimes fall into the trap of "We have all this amazing data, let's alert on all of it." That is a bad idea and will erode the perceived value of the monitoring system. I suggest following the concept of gather lots, act on little.

Gather Lots, Act on Little

This concept is very simple: Your monitoring solution should ONLY reach out to you if there is an issue that you can take an action on.

Every time you create an alert, set a threshold, or make any change that could generate more alerts, ask yourself: “If I get this alert, what action am I going to take?” That doesn’t mean right this second, but each notification you get should translate into an item on your “To Do” list.

Consider the following notifications that you could configure via the various products listed in Part 2:

  • System goes offline.
    • Are you REALLY going to investigate why this system went offline? Probably, keep it around.
  • System disk usage reaches 90%.
    • Are you REALLY going to investigate that disk and see what is consuming space? Probably, keep it around.
  • System memory usage reaches 90%.
    • Are you REALLY going to log in and see what processes are consuming the memory? Probably not, consider being very picky about when to enable this.
  • System CPU usage reaches 90%.
    • Are you REALLY going to log in and see what processes are consuming the CPU resources? Probably not, consider being very picky about when to enable this.
  • Server Network Traffic reaches 90% of the link speed.
    • Do you have any way to determine what caused this? Is this even worth investigating, since it was likely an infrequent burst?

Monitoring systems gather a lot, and that's great when you need a lot of information. The downside of alerting on too much is that the monitoring tool will earn a reputation for being spammy. If you're not going to take action on CPU spikes, those alerts are just going to get deleted automatically. Your users will start associating emails from the tool with "this doesn't apply to me" and the alerts will begin to be ignored.

Quicker Time-To-Resolution (TTR)

If your monitoring solution only sends alerts that have actionable outcomes, this immediately removes the guesswork of "Do I really need to fix this?" Chances are, yes, you need to fix it. It's not quite that simple, but there is functionality inside the monitoring system you can leverage to quickly resolve issues:

  • Send alerts directly to the people who can fix the problem. Maybe this is a ticket queue or a list of emails.
  • For important servers, consider after-hours alerting where low-criticality issues go to email, and high-criticality issues go directly to a text message.
  • If your monitoring solution supports escalations, consider implementing them to keep awareness of the issue fresh in everyone’s mind.
  • Use the historical trending in your tool to quickly determine if this is an ongoing growth issue, or a one-off that will resolve itself (a rough sketch of this kind of trend check follows this list).
    • For example, Drive C hit 95%. It's been steadily growing over the past year; it's probably Windows Update creep that we can't resolve, so let's add 10GB of disk space.
    • For example, Drive D hit 95%. It’s had slow growth over the past year, except last week it started growing much faster. Consider talking to the application owner to see what may have caused the acceleration in growth. Maybe he forgot to turn off verbose logging and you don’t have to expand the disk. Maybe he enabled new functionality and the disk will need to be grown frequently.
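To make that trend check concrete, here is a rough Python sketch (entirely hypothetical; the data format, the one-week window, and the 4x acceleration factor are made-up assumptions, not values from any particular tool):

# Hypothetical sketch: flag a volume whose recent growth is much faster than its
# long-term trend. "samples" is a sorted list of (day_index, percent_used) pairs
# exported from your monitoring tool's history.

def growth_per_day(samples):
    """Average change in percent-used per day between the first and last sample."""
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    return (u1 - u0) / (d1 - d0)

def recently_accelerated(samples, recent_days=7, factor=4.0):
    """True if growth over the last week is much faster than the long-term rate."""
    cutoff = samples[-1][0] - recent_days
    recent = [s for s in samples if s[0] >= cutoff]
    long_term = growth_per_day(samples)
    return long_term > 0 and growth_per_day(recent) > factor * long_term

# Example: steady ~0.05%/day growth for a year, then a sudden jump in the last week.
history = [(d, 40 + 0.05 * d) for d in range(358)]
history += [(d, 58 + 1.5 * (d - 358)) for d in range(358, 366)]
print(recently_accelerated(history))  # True -> go talk to the application owner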

Interruption Prevention / Early Warning

Once the environment is fairly stable, after you've fixed all the initial problems the newly implemented monitoring solution uncovered, you can begin to be proactive. Many types of utilization alerts function as early warning indicators.

Generally a 90% threshold on a disk doesn't indicate a problem NOW, but it means there will likely be a problem once the remaining 10% is consumed. You can take that information and work with the application owner to expand the disk live, expand the disk tonight after hours, take an extended outage this weekend, etc.

Without any indicator, the disk would usually fill up, the application's functionality would deteriorate in some way, and a user would end up contacting someone who can help fix it. You would then have the exact same conversation with the application owner, except with fewer options for resolution since the system is down. This process takes time and costs the business money.

Being proactive with the environment quite literally gives man-hours back to your organization. Your 10-minute analysis and response of resizing a disk before it becomes an issue saves hours of frustration, troubleshooting, and meetings on the back end.

Conclusion

At this point you should have a good idea what you’ll want to alert on, and how you can effectively convert those alerts into value for your organization.

In this part we accomplished the following:

  • Discussed the core concepts of how monitoring makes the environment better
  • Explored the concept of "Gather Lots, Act on Little"
  • Understood how effective monitoring directly leads to less downtime and a quicker Time-To-Resolution
  • Learned how to leverage the monitoring system to solve issues before they’re problems.

In Part 4 we’ll cover the following:

  • Implementation methodology
  • Basic setup requirements for most monitoring systems.
  • Discovery of systems and services.
Infrastructure Monitoring – Part 2: Industry Leaders and Selection
Published Wed, 23 Mar 2016 – https://www.spectralcoding.com/2016/03/infrastructure-monitoring-part-2-industry-leaders-and-selection/

Part two of this series provides an overview of the major vendors in the infrastructure monitoring space, helps narrow down your options, and discusses the benefits of doing a Proof of Concept.

I have not personally used every vendor on this list in a production environment. I have heard industry feedback and consensus on all of them, I work with some of them on a daily basis, and some I have simply downloaded and tested. There are too many options in the industry to cover them all, so this list focuses on internal, self-hosted solutions.

The Vendor List (In Approximate Order of Cost)

Nagios Core

Nagios Core is the free, open-source "core" monitoring engine behind many of the inexpensive/free monitoring options. The free product includes the core monitoring platform and basic event/alert functionality. Most configuration and interaction with the system is done via text configuration files, with most of the data viewable via a web interface. This option, while free, will definitely require a LOT of hands-on time during initial setup. Most "check" functionality is provided by community-maintained scripts/plugins that you will have to integrate into the check engine. If your monitoring requirements are not complex, are fairly rigid, and new system additions are infrequent, you may be able to get away with Nagios Core. You will need a moderate amount of Linux experience to be successful with this solution.

 

Icinga

Icinga, like several other entries, took the OSS foundation of Nagios Core and overlaid more fully featured monitoring functionality. They have since rewritten the core in their 2.x release, effectively leaving behind their direct connection to Nagios. Those familiar with Nagios products will definitely see terminology carryovers such as host groups, flap detection, and host/service checks. The Web 2.0 interface is an obvious improvement over Nagios Core, as is its scalability, since it supports a distributed monitoring architecture out of the box. Icinga is free, but funds its development via paid support contracts.

 

Zabbix

Zabbix has been around since its initial 1.0 release in 2004, is Open Source, and monetizes itself via the traditional "free but pay for support" method. Zabbix touts its scalability and flexible permission options along with wide agentless operating system monitoring compatibility. Unlike some of the Nagios-based products, Zabbix has many feature-rich checks built in, including user emulation (such as response times during a log in and log out cycle).

 

Nagios XI

Nagios XI is the paid and heavily extended version of Nagios Core maintained by the Nagios Core developers. Nagios XI takes you out of the realm of text configuration and into the web interface, where you do most of your day-to-day work. You'll still need a decent amount of Linux experience for some of the less common tasks (such as installing a new plugin for new monitoring functionality, or troubleshooting). That said, the web interface is a huge improvement over Nagios Core, especially for graphic visualizations. Many of the same monitoring capabilities are provided via the same community-driven Nagios Exchange. Nagios XI functions as a great introduction to "Enterprise Monitoring" with features including bulk import, auto discovery, configuration wizards, authentication integration, and reporting.

 

PRTG

Paessler Router Traffic Grapher (or PRTG) positions itself as a competitor to the Nagios XI, Icinga, and Zabbix solutions. Their transparent pricing allows for a free, full-featured deployment of up to 100 sensors (note: a sensor is a specific metric on a specific host). One unique feature of PRTG is that they support both a web interface and a native Windows "thick client" for managing/viewing your sensors. They also support PRTG server clustering for failover or for redundant monitoring of the same service from multiple locations.

 

Opsview

Opsview is another monitoring solution whose roots lie with Nagios Core back in 2003. Opsview is probably the most feature-rich of the Nagios-based monitoring solutions. They have fair penetration into the enterprise space as well as flexible licensing options for <25 Hosts (Free!), <300 Hosts, and Unlimited Hosts. Opsview has placed itself in an excellent sweet spot: small businesses that have outgrown heavy hands-on monitoring (Nagios, custom scripts) but do not have the need or funds to implement any of the "big iron" monitoring suites.

 

SolarWinds

SolarWinds Orion brings together a suite of monitoring products, the most relevant of which are Network Performance Monitor (NPM) and Server & Application Monitor (SAM). Their other product offerings dive deeper into various areas of IT Monitoring including Virtualization, Storage, Application/Database Performance, Patch Management, etc. Most of their tools have varying levels of integration with the central web interface, with NPM and SAM being heavily intertwined. This tool has an intuitive web interface, though certain aspects can be challenging to set up without experience. The biggest complaint I see around SolarWinds is that they're acquiring more functionality than they can integrate into their core interface. Some of their add-ons "exchange" data with the central interface, but also have their own more feature-rich interface. This is a tool you'll definitely want time to experiment with before you import every server for monitoring. It is by far the most feature-rich of the options on this page and is therefore likely the most expensive.

Honorable Mentions

  • NetCrunch – NetCrunch combines the monitoring of network infrastructure devices like: switches, routers and printers with the monitoring of servers, applications and virtualization hosts.
  • ScienceLogic – Complete Hybrid IT Monitoring – Complete monitoring for power, network, storage, servers, applications, and the public cloud.
  • NewRelic SERVERS – Server monitoring from the app perspective – See how apps perform in the context of your server health.
  • ManageEngine – Application Performance Monitoring across physical, virtual and cloud environments.
  • Observium – A low-maintenance auto-discovering network monitoring platform supporting a wide range of device types, platforms and operating systems.
  • SevOne – The patented SevOne Cluster architecture leverages distributed computing to scale infinitely and collect millions of metrics, flows, and logs while providing real-time reporting down to the second.
  • CA Unified Infrastructure Management – A single, scalable platform for monitoring servers, applications, networks, databases, storage devices and even the customer experience.
  • op5 Monitor – Monitor every server, from the cloud to the basement. If you are in need of control, we have the solution.
  • Zenoss – An award-winning open source IT monitoring product that offers visibility over the entire IT stack, from network devices to applications.
  • Check_MK – Comprehensive Open-Source-Solution for IT-Monitoring developed around the proven Nagios-core.
  • Cacti – A complete network graphing solution designed to harness the power of RRDTool’s data storage and graphing functionality.

Narrowing Down Contenders

After reading through the above list and doing additional market research of your own, a few products should jump out as worth pursuing. If nothing on this page seems like what you're looking for, you may not be looking for infrastructure monitoring software, but rather something more specific to your needs, such as application monitoring or database performance monitoring. Browse the vendors' websites, focusing on the screenshots, and evaluate which products meet your requirements from Part 1.

The process should be very iterative between your requirements and what you find that fits your budget. You may find your requirements call for a solution that is priced out of your budget; go back to your requirements and tweak them to find what meets your needs. Maybe you can go without monitoring system X, or maybe you can spend a little more time during setup to reduce the up-front investment.

At this point, depending on the scale of your project, you may choose to dive right in and explore each product live in your environment, or you may choose to contact the vendors to further narrow down your options. Before making a purchase I would highly recommend getting some hands-on time with the tool if possible.

Proof of Concept

You will not have the time to do a PoC (Proof of Concept) on every vendor you find, but if you narrow the field to 1-3 solutions (preferably 2) you can spend significant time getting to know each one. Your PoC should end up with a full installation of the tool in your environment, either in trial mode or with a temporary license. The key to the PoC process is NOT to monitor your entire environment, but to pick a few complex systems to fully monitor so you get a feel for the complexity. The challenge with these tools isn't to "check if a server is up or down"; it's to configure the tool in a way that gives YOU the most relevant data.

Here are a few tests I would suggest to compare the systems:

  • How hard was the installation process? Were there a lot of issues that may deter you from the solution?
  • How hard/easy is it to add a new system?
  • Is the interface fairly responsive and error-free?
  • If you have a set of servers which work together (such as an app and database pair) try to monitor them both, their OS metrics, and the services that run on them. Don’t go overboard but monitor items which may be “actionable”.
  • How does the dashboarding work? Can you quickly create a simple dashboard to show the status of your environment?
  • How are permissions handled? Do you have very granular permissions you can configure for team members or is it a simple Read-Only or Administrator?
  • How are alerts configured? Is it easy to change who gets alerted for a server, or when?

Selection

Once you've had hands-on time with the different solutions, hopefully you will have a good idea of which product is the best fit for your needs. If you're still having trouble deciding between two close solutions, consider putting your requirements into a weighted list. Assign an importance/weight to each requirement and then rate each solution from 0 to 10 on that requirement. Multiply the weight by the rating and add up the numbers for each solution. The product with the highest total is the best fit based on your requirements and their perceived importance. Don't forget to factor in other deciding factors including cost, time to deploy, and ongoing maintenance. See this article for more information on creating a weighted matrix: Toolbox for IT: Constructing a Weighted Matrix
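To make that scoring concrete, here is a minimal Python sketch of the weighted matrix (the requirements, weights, and ratings below are made-up placeholders, not recommendations for any specific product):

# Hypothetical weighted decision matrix: weight each requirement, rate each
# candidate 0-10, multiply, and sum. The highest total wins.
requirements = {                    # requirement: weight (importance)
    "Agentless monitoring": 5,
    "Dashboards": 3,
    "Granular permissions": 2,
    "Annual cost": 4,
}
ratings = {                         # candidate: rating per requirement (0-10)
    "Product A": {"Agentless monitoring": 9, "Dashboards": 6, "Granular permissions": 4, "Annual cost": 7},
    "Product B": {"Agentless monitoring": 7, "Dashboards": 9, "Granular permissions": 8, "Annual cost": 5},
}
for product, scores in ratings.items():
    total = sum(weight * scores[req] for req, weight in requirements.items())
    print(product, total)           # Product A: 99, Product B: 98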

Depending on the size of the vendor's organization and their pricing flexibility, you may be able to leverage large discounts off of list price. Without going into excessive detail, during the purchasing process you should work with someone from your organization's procurement team to ensure you get the best possible price for the product and fully understand the licensing model. If you don't have access to a procurement resource, then treat this purchase like buying a new car. Each vendor wants your business, so don't be afraid to hint at the fact that you're considering competing products. It's usually worth at least testing the waters to see if they can drop their final price or throw in additional functionality to "sweeten the deal".

Conclusion

At this point you should have a product selected, licenses in hand, and be preparing to deploy to your environment.

In this part we accomplished the following:

  • Took a brief tour of the major monitoring solutions (strengths/weaknesses)
  • Explored the iterative process of selecting the right fit for your requirements
  • Discussed the benefit of doing a “Proof of Concept”
  • Selected the correct solution for our environment

In Part 3 we’ll cover the following:

  • What makes a monitoring solution effective? How can it be the MOST effective?
  • What types of things we want to monitor and why
  • What do we want to be alerted on?
Infrastructure Monitoring – Part 1: Introduction and Requirements
Published Wed, 17 Feb 2016 – https://www.spectralcoding.com/2016/02/infrastructure-monitoring-part-1-introduction-and-requirements/

Deploying an effective infrastructure monitoring system is no small task. This series should give you a leg up on developing a strategy for implementing a monitoring solution for your environment.

The benefits of a monitoring system should be clear to anyone who has seen or used monitoring in a past environment. Here are just a few obvious ones:

  • Early detection of issues leads to a quicker time to resolution and therefore less downtime
  • Early warning of “trouble” can allow for a resolution before an outage occurs, or at least allow you more flexibility to schedule an outage.
  • Increased control and visibility into the environment
  • Historical trending data allows for more accurate forecasting

What is possible?

There is a VERY wide variety of monitoring solutions available, with nearly limitless monitoring capabilities. If you're just starting to explore the monitoring industry and aren't sure what types of capabilities exist, here's a good list to get you thinking about what you might want for your environment:

  • Checking metrics every X minutes/seconds:
    • Disk/CPU/Memory Usage (do my resources look good?)
      • Do I have at least 10% of disk space free? How about at least 2GB?
      • Has my CPU been >90% for the past 10 minutes?
    • Network Availability (can users get to it?)
    • Application Status (is the application running?)
      • Is a “licenseserver.exe” process running?
      • Is the “httpd” service started?
      • How many resources is a specific process using over time?
    • Application Functionality (is the system doing what it’s supposed to?)
      • Does the index.html page contain the text “Welcome!”?
      • Is the server listening on TCP port 12345?
    • Extendable monitoring with custom scripting
      • Internally developed system with very specific monitoring needs? Write a script to tell the monitoring solution if the health is good (a short sketch of such a check follows this list).
  • Alerting:
    • Notify an on-call administrator so they can fix the problem (text in the middle of the night).
    • Escalate to a second administrator after 30min if the first alert wasn’t acknowledged and the problem still exists.
    • Let your helpdesk know there is an outage or open a ticket to a 3rd party provider.
    • Send another alert when the problem is resolved.
  • Trending/Historical:
    • How much has this volume grown in the past 12 months?
    • A 1TB disk just hit 90% usage, how much should I add to give it another 6 months of growth?
    • Based off of the past 2 years, how much storage should I buy when I replace our SAN?
    • The CPU on this server has been consistently >80%, should I increase the resources?
  • Troubleshooting:
    • The application stopped working and CPU is stuck at 99%. Could this full disk have something to do with it?
    • Application ABC is down, we coincidentally received an alert from server XYZ. Is it related? Was ABC dependent on XYZ?
    • Application ABC went down at 2AM. Memory usage for the process started growing slowly at 3PM the previous day. Why?
  • Overview:
    • Overall uptime metrics
    • Outstanding alerts/issues
    • Dashboard capabilities for quick overview of the environment
    • Mapping of applications for quick health check of an application
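To make the custom-scripting item above concrete, here is a minimal sketch of what such a health check could look like in Python, following the exit-code convention most Nagios-derived tools expect (0=OK, 1=WARNING, 2=CRITICAL). The metric, thresholds, and function body are placeholders for whatever your internal system exposes:

#!/usr/bin/env python
# Hypothetical custom check for an in-house application's work queue.
import sys

WARN, CRIT = 100, 500               # made-up thresholds

def get_queue_depth():
    # A real check would query your application here (API call, log parse, etc.).
    return 42

depth = get_queue_depth()
if depth >= CRIT:
    print("CRITICAL - queue depth is %d" % depth)
    sys.exit(2)
if depth >= WARN:
    print("WARNING - queue depth is %d" % depth)
    sys.exit(1)
print("OK - queue depth is %d" % depth)
sys.exit(0)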

Defining Scope

Evaluating and choosing a monitoring system requires an understanding of the "balancing act" between financial investment, employee time, and solution functionality. Defining a scope will allow you to narrow down your requirements and manage expectations for what the solution will do. For infrastructure monitoring, your scope will mostly revolve around what you want to monitor and where. These should be "broad strokes" which will be broken down into individual requirements later.

Consider whether some of the ideas below apply to your project:

  • Monitoring…
    • Operating System Metrics
    • Network Infrastructure
    • Virtualization Infrastructure
    • Storage Infrastructure
    • Application Functionality
  • Implementing…
    • Alerting (including identifying stakeholders for the monitored systems)
    • Global views (for a distributed deployment)
    • Dashboards
  • Documentation
  • Training

Defining Requirements

In order to effectively select a product that meets your requirements (but remains within the project scope), you first need to define them, in writing. Your goal should be a list of attributes of the system. You should assign each requirement to one of two categories: Required or Optional.

If your list of requirements is large and complex, chances are no single solution will meet all of them. You may want to keep the number of required capabilities low, and rate the rest on a scale of 1-10 for importance. Below is a good starting point, but be sure to add/remove requirements specific to your organization and systems.

The selected infrastructure monitoring solution shall…

  • be able to monitor the following:
    • Windows Server 2008 R2 / 2012 R2
      • Windows Service Status (Running/Stopped) – Required
      • Process Status (Running/Not Found) – Required
      • CPU / RAM / Disk Utilization – Required
    • Red Hat Enterprise Linux 6 / 7
      • Linux Service Status (Running/Stopped) – Required
      • Process Status (Running/Not Found) – Required
      • CPU / RAM / Disk Utilization – Required
    • Specific vendor appliances and applications
      • NetApp/EMC Storage Array – Required
      • VMware ESX Host CPU / RAM Utilization – Required
      • APC Uninterruptible Power Supply – Optional
  • be able to monitor all systems without an agent needing to be installed (WMI/RPC/SSH/SNMP). – Required
  • be able to send alerts on the metrics listed in this document. – Required
  • be able to view historical information in a visual way. – Required
  • be able to schedule scans and identify new/unmonitored systems. – Optional
  • be able to monitor 300 servers running 1200 services. – Required
  • be able to be extended with custom scripting. – Optional
  • cost less than $X for the initial deployment. – Required
  • cost less than $X per year. – Required
  • be supported by the vendor in the form of support tickets and available training. – Required

Conclusion

In this part we accomplished the following:

  • Discussed some benefits of infrastructure monitoring
  • Listed common capabilities of products in the industry
  • Scoped the project appropriately
  • Defined our initial draft of requirements for the solution

In Part 2 we’ll cover the following:

  • An overview of the major monitoring solutions (strengths/weaknesses)
  • The iterative process of selecting the right fit for your requirements
  • The benefit of doing a “Proof of Concept”
vSphere Flash Read Cache – Part 2: Implementation
Published Wed, 09 Jul 2014 – https://www.spectralcoding.com/2014/07/vsphere-flash-read-cache-part-2-implementation/

Introduction

In the last part of this series we walked through determining if a workload is appropriate for vSphere Flash Read Cache. We then looked at the different aspects of sizing the cache appropriately for a workload. We determined the correct size via table hit statistics queried from the Oracle Database. We determined the correct block size by using the vscsiStats ESX command to gather statistics on the size of I/O commands on a specific VMDK. In our example the database is 1.1TB, and we require a 150GB cache with a 512KB block size.

In this part we will cover the following:

  • Present a number of SSDs to the hypervisor in a RAID0 configuration
  • Configure the ESX host to treat the newly found drive as an SSD
  • Make the SSD available as a Virtual Flash Resource
  • Configure our Data Warehouse’s data drive to have the configuration we reasoned out in Part 1.

Provisioning the SSDs

For our solution we are using two Dell PowerEdge M710HD blade servers with two Dell 200GB SAS SSDs each.

After inserting them into the blades, we needed to create a new RAID set combining the drives into a RAID0 stripe. In the case of vFRC, RAID0 is a very attractive option because fault tolerance is not necessary (if the cache suddenly disappears, all I/O simply goes straight to the back-end storage device) and it gives increased performance over a single larger SSD.

The steps for creating a RAID set vary by controller manufacturer, and the individual steps are outside the scope of this article. If you have a single SSD, depending on the disk controller you may be able to pass the SSD straight through to the hypervisor. If you have more than one SSD, or a disk controller that doesn't support passthrough, you will need to create a RAID0 stripe of all disks.

An example setup for a Dell PERC controller with two 200GB disks is below:

[Screenshot: example RAID0 virtual disk setup on a Dell PERC controller with two 200GB disks (PowerEdge M710HD CMC)]

Reboot the host into ESX (ensure your boot order is still properly defined).

Mark Disk as SSD

Now you would expect to simply add this RAID0 SSD set as a Virtual Flash Resource. Wrong! (If you have a single SSD with passthrough, you can skip this step.) Almost any RAID set built from SSDs by a RAID controller will display as a non-SSD, therefore making it "non-flash" and unable to be added as a Virtual Flash Resource.

You can verify this by selecting a host in the vSphere Web Client, going to the Manage tab, then to Settings, and then to Virtual Flash Resource Management and choosing “Add Capacity…” at the top.

[Screenshot: vSphere Web Client – Add Virtual Flash Resource Capacity dialog]

In order for a device to be displayed in this list it MUST be recognized as an SSD by the ESX host. To verify the host views the drive as a non-SSD in the vSphere Web Client, with a host selected, go to Manage, then Storage, then Storage Devices.

Below you can see the drive marked as a non-SSD:

[Screenshot: vSphere Web Client – Storage Devices list showing the drive type as non-SSD]

Unfortunately there is no easy right-click solution for marking the device as an SSD. The fix requires setting the SSD flag via the ESX command line. Open up a command line to your ESX host and run the following command:
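On an ESXi 5.x host the listing below is most likely produced by the NMP device list command:

esxcli storage nmp device list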

[... SNIP ...]
naa.600508e0000000007cf406db80531904
   Device Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000007cf406db80531904)
   Storage Array Type: VMW_SATP_LOCAL
   Storage Array Type Device Config: SATP VMW_SATP_LOCAL does not support device configuration.
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C1:T0:L0;current=vmhba2:C1:T0:L0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C1:T0:L0
   Is Local SAS Device: true
   Is Boot USB Device: false
[... SNIP ...]

Find your disk in the long list that will be displayed. I suggest copying and pasting the output for that specific disk into a notepad so you don't get it confused with a LUN. That would be bad.

You can display extended details about the disk as follows:
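On ESXi 5.x these extended details most likely come from the core device list, filtered to the disk in question:

esxcli storage core device list -d naa.600508e0000000007cf406db80531904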

naa.600508e0000000007cf406db80531904
   Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000007cf406db80531904)
   Has Settable Display Name: true
   Size: 380416
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.600508e0000000007cf406db80531904
   Vendor: Dell
   Model: Virtual Disk
   Revision: 1028
   SCSI Level: 6
   Is Pseudo: false
   Status: degraded
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: unknown
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.0200000000600508e0000000007cf406db80531904566972747561
   Is Local SAS Device: false
   Is Boot USB Device: false
   No of outstanding IOs with competing worlds: 32

To mark the disk as an SSD, run the following command, taking care to replace the items in brackets.

esxcli storage nmp satp rule add --satp=[Storage Array Type] --device=[SCSI NAA] --option="enable_ssd enable_local"

You can find the Storage Array Type from the first “device list” command. In my example the command would be:

esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=naa.600508e0000000007cf406db80531904 --option="enable_ssd enable_local"

You will need to reboot the ESX host for the disk to be recognized as an SSD. After the host reboots you can double check the drive type again:

[Screenshot: vSphere Web Client – Storage Devices list now showing the drive type as SSD]

Provision the Flash Read Cache

After your disk is marked as an SSD it should be available in the “Add Capacity…” window under Virtual Flash Resource Management under Settings.

[Screenshot: vSphere Web Client – Add Virtual Flash Resource Capacity dialog with the new SSD listed]

Check the boxes and click “OK”. The new space should then show up as additional capacity under Virtual Flash Resource Management:

[Screenshot: vSphere Web Client – Virtual Flash Resource Management showing the added capacity]

Now that we have successfully provisioned the disks as a Flash Resource on a single host, it is time to repeat the steps on this page for every other host the guest may reside on.

NOW IS THE TIME TO PROVISION YOUR OTHER HOSTS USING THE STEPS ON THIS PAGE.

Once all the hosts have been provisioned you can enable vFRC on a VMDK. Browse to a VM and Edit Settings (the VM can be running). Expand the VMDK you analyzed in Part 1 and click Advanced in the Virtual Flash Read Cache row. Enter the information determined in Part 1. My example is below:

[Screenshot: VM Edit Settings – Virtual Flash Read Cache configuration for the VMDK]

OK your way out of the settings and verify the vFRC is now displayed under VM Hardware:

[Screenshot: VM Summary – VM Hardware pane showing the assigned Virtual Flash Read Cache]

Once you can see it assigned to the VM you’re done… for now. The next phase is monitoring and tweaking.

Conclusion

In this part we’ve accomplished the following:

  • Summarized our findings from Part 1, where we determined a 150GB vFRC cache size with a 512KB block size.
  • Installed and provisioned the SSDs to the ESX host via the disk/RAID controller.
  • Changed disk flags via the ESX command line so the new RAID set is seen as an SSD.
  • Provisioned a new Virtual Flash Resource from the newly recognized SSD.
  • Assigned vFRC resources to a VM using the initial settings found in Part 1.

Additional Reading:

In Part 3 we’ll cover the following:

  • Monitor the performance changes on the Data Warehouse Database.
  • Tweak the vFRC settings as needed while providing reasoning behind the change.
vSphere Flash Read Cache – Part 1: Intro and Sizing
Published Wed, 09 Jul 2014 – https://www.spectralcoding.com/2014/07/vsphere-flash-read-cache-part-1-intro-and-sizing/

Introduction

This series will consist of several parts written over several weeks and will encompass the planning, sizing, implementation, and monitoring of vSphere Flash Read Cache to improve Data Warehouse performance using locally attached SSDs.

While going through the new features we gained by moving from vSphere 5.0 to vSphere 5.5 we decided to see how vSphere Flash Read Cache could help our Data Warehouse jobs’ run times.

As part of this discovery process we needed to take a look and see if vSphere Flash Read Cache (hereafter as vFRC) would be a good candidate for our Data Warehouse.

Facts about the target system:

  • RedHat Enterprise Linux 5.8
  • Oracle Enterprise Edition 11gR2 (11.2)
  • Approximately 1.1TB of data
  • Not using ASM

Some quick analysis on the database revealed that 90.2% of database disk actions were physical reads, making this a very promising candidate for a read caching solution. Because of budgetary concerns we decided to start small with some local SAS SSDs. Other, more expensive options included PCIe flash cards, which were incompatible with our blade servers anyway.

After reading some best-practice articles (below), it was clear that a lot of effort would have to go into determining the proper sizing of the cache allocated to the VM.

Cache Sizing

From the Oracle side we pulled some access data that allowed us to see table names, table sizes, and the number of reads from each table over a certain period of time.

Anonymized data is below:

[Table: anonymized Data Warehouse table-level read frequency data]

…and so on for another 15,000 lines.

We found that in our Data Warehouse, 50% of all reads were satisfied by the same ~740MB of data. As you expand that percentage, the amount of data required grows very quickly: 80% of reads were satisfied by the same 110GB of data. This meant our DW, while 1.1TB in size, had a small amount of frequently accessed data and a large amount of seldom-accessed data.

Analyzing this data, we determined a good starting point to be 150GB, at around the 97.5% mark. This meant we could store the data serving 97.5% of the most popular reads in 150GB. The last 2.5% of reads hit the remaining ~950GB, most likely historical tables that are rarely accessed.

Cache Block Size Sizing

At this point we have a good starting point for a vFRC size, but now we need data on the proper block size to choose.

First, an excerpt from the vFRC Performance Study by VMware:

[…] Cache block size impacts vFRC performance. The best way to choose the best cache block size is to match it according to the I/O size of the workload. VscsiStats may be used to find the I/O size in real-time when running the workload. […]

vscsiStats is a command accessible from the ESX command line. From the command help:

VscsiStats — This tool controls vscsi data collection for virtual machine disk I/O workload characterization. Users can start and stop online data collection for particular VMs as well as print out online histogram data. Command traces can also be collected and printed.

In order to determine the best block size we will use vscsiStats to determine the spread of different I/O request sizes.

Start by listing available VMs and their disks with:
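The listing is produced with the vscsiStats list option, most likely:

vscsiStats -l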

Virtual Machine worldGroupID: 1203178, Virtual Machine Display Name: USATEST01, Virtual Machine Config File: /vmfs/volumes/523358d4-466be286-1837-842b2b0ca19e/USATEST01/USATEST01.vmx, {
 Virtual SCSI Disk handleID: 8192 (scsi0:0)
}

Once you find the correct disk to monitor begin collecting statistics using the following format:

vscsiStats -s -w [worldGroupID] -i [handleID]

Example:
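Plugging in the worldGroupID and handleID from the listing above, the example command is presumably:

vscsiStats -s -w 1203178 -i 8192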

vscsiStats: Starting Vscsi stats collection for worldGroup 1203178, handleID 8192 (scsi0:0)
Success.

As the statistics are collected you can query the ioLength histogram using the following command:

vscsiStats -p ioLength -w [worldGroupID] -i [handleID]

In my example the output is as follows. This example VM is mostly idle.

[... SNIP ...]
Histogram: IO lengths of Read commands for virtual machine worldGroupID : 1203178, virtual disk handleID : 8192 (scsi0:0) {
 min : 512
 max : 1052672
 mean : 52740
 count : 33823
   {
      2627               (<=                512)
      562                (<=               1024)
      960                (<=               2048)
      758                (<=               4095)
      7513               (<=               4096)
      950                (<=               8191)
      1098               (<=               8192)
      1707               (<=              16383)
      3633               (<=              16384)
      10227              (<=              32768)
      275                (<=              49152)
      1210               (<=              65535)
      375                (<=              65536)
      95                 (<=              81920)
      130                (<=             131072)
      485                (<=             262144)
      293                (<=             524288)
      925                (>              524288)
   }
}
[... SNIP ...]

While this command gives output for read, write, and combined SCSI commands, we don't really care about the last two. Since vFRC only caches reads, we only care about how big the reads being satisfied by the SCSI disk are. In the case of the disk above, the most popular READ block size was 32K, at 10227 of 33823 reads (~30%). On our Data Warehouse we ended up with the following table:

Since the reads were split between 16K blocks at 31% and 512K blocks at 57%, we opted for a 512K block size.

In addition, the total count of read commands was 389,544 over this period, while write commands totaled 32,843. Running the math shows we're pretty close to the 90% read mark we saw from the database.
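For reference, that works out to 389,544 / (389,544 + 32,843) ≈ 92% of SCSI commands being reads during the capture window, reasonably close to the 90.2% physical-read figure pulled from the database statistics.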

Conclusion

In this part we’ve accomplished the following:

  • Identified a need for faster performance on our Data Warehouse.
  • Used Oracle’s reporting tools to determine that we need approximately 150GB of cache in order to satisfy the large majority of the reads of the system.
  • Used VMware's vscsiStats command to determine that the most used block sizes were 16K and 512K.
  • Combined these conclusions to determine a good initial configuration of a 150GB cache with a 512K block size.

Additional Reading:

In Part 2 we’ll cover the following:

  • Install two Enterprise SSDs in each of two ESX hosts
  • Configure each Dell ESX host’s SAS controller to treat the SSDs as a single RAID0 set
  • Configure the ESX host to treat the newly found drive as an SSD
  • Make the SSD available as a Virtual Flash Resource
  • Configure our Data Warehouse’s data drive to have the configuration we reasoned out in Part 1.
Remote Task and Service Auditing
Published Fri, 15 Nov 2013 – https://www.spectralcoding.com/2013/11/remote-task-and-service-auditing/

At work today I undertook the task of auditing a domain administrator account. Because we don't want to have many system administrators' profiles on all of our servers, we have a specific domain administrator account which is used for routine maintenance and for anything that needs to be done locally on the server. Over time the account has been used for non-login purposes as well. For instance, if we needed some scheduled task to run as an administrator, we just set this account as the RunAsUser. We also did the same for some specific services that we needed to run.

The problem arises when we need to change the password for this admin account. For instance, if one of our system administrators were to leave, for security reasons we must change the password for this account. If we were to simply change the password, the account itself would be updated, but none of the saved tasks/services would have their stored credentials updated. When they tried to run, they'd attempt to log in using the old password, be denied, and fail to run.

So I had been given the task of auditing approximately 125 physical servers and over 150 virtual machines running a variety of operating systems. A few weeks earlier I had updated a document describing all of our servers and virtual machines, their specs, operating systems, and purpose. I was able to use this list of ~275 hostnames to filter down to just the Windows machines, which left ~200 hosts.

So now I had 200 hosts that I was going to check by hand? No. After some quick googling I found the `schtasks` command, which allows you to manage all scheduled tasks on your local machine or a remote machine. I was able to successfully query several remote machines with my domain administrator login and view detailed task information with the following command:

schtasks /QUERY /S [hostname] /V /FO CSV

/QUERY - Specifies that this command is querying existing tasks as opposed to creating new or deleting old tasks.
/S [hostname] - Specifies the host to query; if omitted it queries the local machine. See also the /U and /P (username and password) switches.
/V - Verbose output, with a bunch more information including which user the task runs as.
/FO CSV - Format output as CSV (for my purposes this was for easy parsing and eventual Excel import).

I wrote a console application to read a list of hosts from a file and run this query on each. The output was then split apart and parsed, and stored for later.

The next task was querying services. This was a bit more of a challenge. There was no nifty command line tool to do this remotely. I ended up using some WMI queries to connect to each machine and query the list of services. The responses were stored for later.
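The console application itself isn't reproduced here, but a rough, hypothetical Python equivalent of the overall approach looks like the sketch below (the hostname file, audited account name, and output file are placeholders; the service query assumes the third-party "wmi" package on a Windows machine):

# Hypothetical sketch: find scheduled tasks and services running as the audited account.
import csv
import io
import subprocess
import wmi                                    # third-party package (pip install wmi)

AUDITED_ACCOUNT = "DOMAIN\\svc_admin"         # placeholder for the domain admin account
findings = []

for host in open("hosts.txt").read().split():
    # Scheduled tasks: parse the verbose CSV output of schtasks /QUERY.
    out = subprocess.run(["schtasks", "/QUERY", "/S", host, "/V", "/FO", "CSV"],
                         capture_output=True, text=True).stdout
    for task in csv.DictReader(io.StringIO(out)):
        if task.get("Run As User", "").lower() == AUDITED_ACCOUNT.lower():
            findings.append((host, "task", task["TaskName"]))

    # Services: ask WMI which account each service logs on as.
    for svc in wmi.WMI(computer=host).Win32_Service():
        if (svc.StartName or "").lower() == AUDITED_ACCOUNT.lower():
            findings.append((host, "service", svc.Name))

# Dump everything to CSV for review in Excel.
with open("audit.csv", "w", newline="") as f:
    csv.writer(f).writerows(findings)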

So at this point in my application's run I had a list of all services and scheduled tasks on about 200 Windows hosts. I then checked the appropriate fields for the domain account we wanted to audit and output the results as a .csv file. I imported the file into Excel and had a nice, neat, presentable list of the relevant tasks/services, their purpose, use, etc.

All in all, what would have taken several boring hours by hand (likely 3+) I was able to automate in about two hours. The audit itself now runs against all of these machines in about 5 minutes (mostly due to the occasional timeout). The ~3% of machines that errored out we were able to audit by hand.

I guess the moral of this story is: automate if possible! If it takes you 5 hours to do a task by hand, and 5 hours to automate the task, automate it! You never know when you might need to run the audit again. That being said, don't automate every little thing. Don't spend an hour making an application/script for something you could do by hand in 5 minutes. Then again, if it's something you need to do on a regular basis, automate!

Computer Imaging: A Short Analysis
Published Tue, 12 Nov 2013 – https://www.spectralcoding.com/2013/11/computer-imaging-a-short-analysis/

This was written by me back in the summer of 2011 when I worked for a university lifecycling the computers in a few buildings. When I wrote this I was much less "well-rounded" in the technology and business world, so take this for what it is: a short post about imaging from someone who was a novice at the time. In this example we use Ghost; a lot of places use MDT or CloneZilla, so swap the terms. Here we go…

[Image: FCB Computer Lifecycle]

Well, today we started rolling out Ghost images on the 300 computers we've purchased for the College of Business. We spent the first few days unboxing and changing BIOS settings. When we finished yesterday we had a scene that looked something like this picture (no, that's not me).

As a test run today we decided to Ghost 30 machines to see how well GhostCast scales. For those that aren't familiar with Ghost, it is a hard disk imaging tool used for backup and mass deployment. In layman's terms, you can configure one computer, make a snapshot of its hard drive, then push that image out to however many computers you wish. More information on Ghost can be found here. In the past, images would be pushed out on a computer-by-computer basis or by using unicast, which can be very slow and time consuming. The cool thing about GhostCast is that it takes advantage of multicast, which means the server can send its packets once and the switches will duplicate them down to each computer. This means that instead of the server having to send 20GB to each computer individually, it can send it to everyone at the same time. This is analogous to whispering the same number in everyone's ear versus shouting it once to the whole room. Everyone ends up with the same information (whether a number or a Ghost image) faster.
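To put rough numbers on it using the figures from this rollout: unicasting a 20GB image to 30 machines means pushing roughly 20GB × 30 = 600GB through the server and across the network, while multicasting sends that same 20GB stream essentially once and lets the switches replicate it to every client.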

[Image: FCB Computer Imaging Network]

One of our System Admins had built several working images (one for lab computers, one for faculty/staff computers) for deployment throughout the building. The original plan was to install all of the machines in their permanent locations throughout the summer and then Ghost them. Since our team got done unboxing ahead of schedule, we decided to set up a ghosting station where we could Ghost computers in bulk without being on the campus network.

Because we had three guys to do the ghosting, we decided to set up two stations. Each station had one monitor, one keyboard, and fifteen computers. We stacked these fifteen computers into two groups, connected power, and connected each to a standalone switch. Because we only had 24-port switches at our disposal, we decided to just bridge the two and add a consumer-grade Linksys router for DHCP. The final networking setup looked like the picture to the right.

We had our GhostCast Server booted, so now it was time to get the soon-to-be-ghosted clients on board. We didn't feel there was a need to set up 30 monitors and keyboards, so we kept it simple. We had three stacks of five computers at each station, and a USB flash drive preconfigured with DOS network drivers and the Ghost client. One person would be behind the computers and would plug the monitor, keyboard, and USB flash drive into each computer. One of the front guys would boot it up and get it set up with the GhostCast server. Then we'd move on to the next computer. It took about 10 minutes to get all 30 computers connected to the GhostCast server, and the end result looked something like this:

[Image: FCB Computer Imaging Overview]

Once we had them all set, we just left the last computer's keyboard and monitor plugged in so we could monitor the progress. We hit Send on the GhostCast console and we were off. Sending the 18GB image to all 30 computers one at a time would have taken hours, if not days; from the time we hit Send until all the computers could be powered off was just shy of 30 minutes. This student lab image will be deployed to almost 200 computers. The time savings from using Ghost are just enormous.

Think of how much time it would take you to reinstall an OS, configure it, install your required programs, and make performance optimizations. Let's say you're FAST and it takes you TWO hours per computer. 200 computers would take 400 man-hours to set up. Say your image-making guy took his time and spent 6 hours making a really polished image. This image could be pushed out to all the computers in batches of 30. I say 30 only because of space limitations; there is no realistic limit to how many computers you can Ghost at the same time.

Those 200 computers can be configured in about 10.4 hours (2.6% of the original time):

  • 6 Hours making a polished Ghost image
  • 1.1 Hours hooking up the computers and connecting them to GhostCast
  • 3.3 Hours copying the image, during which you can be reading a magazine, or even planning how you're going to spend your newly gained ~390 hours.

Like I said, we only chose 30 because it was a number we'd been working with. If you had enough switches, power cords, Ethernet cables, and space, you could Ghost all 200 computers at the same time.

Those 200 computers can be configured in about 7.6 hours (1.9% of the original time):

  • 6 Hours making a polished Ghost image
  • 1.1 Hours hooking up the computers and connecting them to GhostCast
  • 0.5 Hours copying the image, during which you can be reading a magazine, or even planning how you're going to spend your newly gained ~392 hours.

The bottom line is this: use Ghost wherever possible. It can save you headaches and it can save you time.

The Job Search: Part 2: Attracting (Pre-Interview)
Published Sat, 28 Sep 2013 – https://www.spectralcoding.com/2013/09/the-job-search-part-2-attracting-pre-interview/

So you've identified a job you want. Maybe from a friend, or LinkedIn, or just searching around online. How do you proceed? Send them your resume, right? WRONG.

Well, you're kind of right, but that's not the first thing you do. Let's first discuss what a resume isn't:

What A Resume Isn’t

  • A Life History (that’s an Autobiography)
  • A Work History (that's a CV – Curriculum Vitae)
  • An opportunity to lie
  • A lengthy, generic document that you copy and send to everyone

What A Resume Is

  • A sales pitch
  • A way to show your strengths
  • A short, customized document you send to one employer

Let's go a little more in depth. A common misconception is that you should list every job you've ever worked; that's not the case. Why not? Isn't more information better? Let's draw an analogy.

Selling Yourself Effectively

You're essentially selling your skills for a price (hourly rate or salary), just like a store is selling a TV. What if there was no marketing around the TV? It's just sitting there in the box, with the 100-page instruction manual sitting on top. The instruction manual has all the information that any marketing materials would, and then some. More information is better, right? This is essentially what you do when you send an employer a five-page resume with everything from the car-washing job you had when you were 15 to your last job as a CIO.

Let's take the analogy a little further. What does a store do? They might have the instruction manual around somewhere, but they put up a short list of eye-popping features and have the TV displaying some hi-def video. That's what your resume should do for you.

What a Job Recruiter Does

Think about the organization’s role in the process. They need to quickly and efficiently determine the best candidate and hire them. This means quickly sorting through resumes and filtering out the ones that aren’t viable candidates. They sort them into the “Interview Pile” and the trashcan.

A recruiter for a company has a stack of resumes on his desk when his friend walks in. They talk for a few minutes and the friend starts to leave. As he's leaving he sees the recruiter take half of the stack of resumes and throw them straight into the garbage can. The friend exclaims, "Why are you throwing away half of those resumes?! You haven't even read them yet!" The recruiter simply responds, "I don't want any of the unlucky ones working here."

Now granted, the above is a joke, but a lot of the time employers look for ANYTHING to exclude you. You have to be perfect AND set yourself apart. Let's talk about how resume paper came into existence. At some point someone needed an edge over the 20 other applicants for a job. They decided that instead of printing their resume on standard paper, they would print it on something a bit fancier, like card stock. This difference caught the recruiter's eye and they got put in the "Interview Pile". Eventually enough people started setting themselves apart this way that resume paper became a requirement. What was a bonus maybe fifty years ago is now the standard, and if you were to hand in a resume on non-resume paper today it would likely get thrown away.

Your goal is to catch their eye and make it easy to establish that you're a viable candidate. If a recruiter gets frustrated or overloaded with content (like a five-page resume), they're not going to read your entire resume. They're going to skim it and make a quick determination.

Below are some suggestions. I put "suggested" in the headings because no single set of rules will solve every situation. That being said, each of these suggestions should be considered and weighed for every resume you send. Each is followed by a short explanation of why or why not.

Suggested Do’s

  • Strongly consider a single one-sided page – You don't want to overload a recruiter; they won't take the time to read two pages of detail
  • Unique layout (no Word templates) – An unoriginal resume will blend in with the crowd
  • Customize for each job – Include only relevant experience if you have too much for a single page. Again, you want the recruiter to quickly see that you're competent; they don't care that you used to wash cars
  • Set yourself apart in a professional way – Catches the recruiter's eye; maybe something techy like a QR code for your website
  • Make your previous jobs sound impressive, but don’t lie – See below

Suggested Don’ts

  • Spelling mistakes – They signal no attention to detail and a lack of thoroughness
  • Lying – Don't lie; it won't help you in the long run and could even hurt you if you want to use this job as a future reference
  • Pictures/Portraits – Does it really matter? Unless you're looking to get the job based on looks, it will only hurt you. You could look like someone they don't like, maybe an ex
  • Big blocks of text – Hard to read, and they can come across as dense or boring
  • Objective section – Why? The objective is to get a job. You know that. They know that. It's a wasted half-inch on your page. You'll need all the space you can get
  • Margins – Don't change the margins much; leave them at the default 1″. If you need room for that one extra word, maybe drop them to 0.95″ or so

Lying vs Embellishment and Tweaking

Lying is bad. It will catch up to you somehow. Embellishment, on the other hand, is definitely OK, as long as everything you're saying is truthful. Stretching the truth a small amount can be acceptable. Here are a few examples of the proper way to embellish:

  • “Counted money in the cash register every night” vs “Maintained the financial integrity of the cash drawer”
  • “Was a server for a cafe” vs “Provided an enjoyable eating experience for patrons by suggesting drink and food items, and responding to requests in a timely manner”
  • “Phone tech support for an ISP” vs “Provided tier 3 telephone support for AT&T DSL’s western region”
  • “Managed Active Directory” vs “Responded to and fulfilled requests for user modifications in Active Directory”

In each of those you're saying the same thing, maybe providing a little more information (truthfully), but it sounds a lot more impressive and important. Fancy it up, to a point. There's no need to spray out dozens of complicated-sounding acronyms. If you're in the IT industry, anyone who looks at your resume probably knows what "IP" is, so you can put "IP" instead of "Internet Protocol". On the other hand, someone may not know what MCSM means, so it would be a good idea to spell out "Microsoft Certified Solutions Master".

Example Resume

Let's take a look at the example resume below (click for larger):

[Image: cfk_resume_sample (example resume)]

What do you first notice about the resume? It has four major areas: Contact Information, Education, Professional Experience, and Skills. There is a lot of information on the page, but it is well organized into sections, subsections, and bullets. It does not feel daunting to read; you can easily identify the four entries under Professional Experience. Obviously I have edited the document to remove organization names and locations, but you can still see some of the "embellishing" – making things sound official and maybe a little bit fancy.

Note that I created this resume from scratch using tables in Word. I didn't stick with one font size, but used different sizes for the titles, subtitles, and content. I also used bolding and italics to emphasize certain areas. I'm particularly fond of the "small caps" feature of Microsoft Word; you can see this in the words "Education", "Professional Experience", and "Skills". Even though they are the same size as "Redacted University", the fact that they are small caps, bolded, and italicized makes them stand out as "official titles".

There are no spelling mistakes, no lies, and no large blocks of text, and the content was customized for the specific job I applied to. The layout took me maybe a half hour to develop and another hour or so to write the content. This resume has also been developed over many revisions and tweaks, each time getting a little bit better.

One thing that I might consider adding is a QR code or something similar at the top, to the right of the phone/email/website line. It will catch the eye, and if someone decides to pull out a barcode scanner app on their phone, that sets you apart even further.
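Generating a QR code like that takes only a couple of lines. Below is a minimal sketch using Python and the third-party qrcode library (installed with something like "pip install qrcode[pil]"); the URL and output filename are placeholders of my own choosing, so point them at your actual site:

    # Minimal sketch: generate a QR code image to drop into a resume header.
    # Assumes the third-party "qrcode" package (with Pillow) is installed;
    # the URL and output filename below are placeholders.
    import qrcode

    # Encode the link you want recruiters to land on (personal site, portfolio, etc.).
    img = qrcode.make("https://www.example.com/portfolio")

    # Save as an image file you can insert into the Word document.
    img.save("resume_qr.png")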

The Cover Letter

[Image: cfk_coverletter_sample (example cover letter)]

So now that you have your resume set and looking good, it is time to write your cover letter. A cover letter is NOT optional anymore. Its purpose is to put a more personal feel into your application and connect you, the company, and your resume together into one thought. A good approach is to hit these topics, in this general order:

First Paragraph: Introduce yourself. State the job you're applying for and how you found out about it (newspaper, Craigslist, a current employee, etc.).

Second/Third Paragraphs: Talk briefly about your personal and professional history. Two paragraphs max; be sure to hit the points that would make you a good candidate (natural skill, attention to detail, etc.). This is the place to briefly state beneficial personality traits that don't belong on your resume. Essentially: this is who I am, and these are the skills and personality traits that would make me a good fit for your organization.

Fourth Paragraph: A thank-you message and some contact information, like a phone number.

The example shown above was from the end of my internship, when I was attempting to get a full-time position. A cover letter should be HEAVILY customized for each organization. While writing your cover letter it is a very good idea to do some research on the organization so you can connect yourself to its philosophies and unique characteristics. For example, if you know they support their employees' volunteer work, you might throw in a sentence about volunteer work you do.

I hope this has provided a good guide for job seekers to get their name out there and noticed. In the next installment (Part 3: Interviewing) we will cover all aspects of the job interview. Highlights will include dress, attitude, illegal questions, and proper responses to the hard questions. Thanks for reading, and feel free to leave questions in the comments; I will be more than happy to answer them!

]]>
https://www.spectralcoding.com/2013/09/the-job-search-part-2-attracting-pre-interview/feed/ 0
The Job Search: Part 1: Introduction https://www.spectralcoding.com/2013/08/the-job-search-part-1-introduction/ https://www.spectralcoding.com/2013/08/the-job-search-part-1-introduction/#respond Wed, 14 Aug 2013 16:44:40 +0000 https://www.spectralcoding.com/?p=93 While at NAU I took a very special class called BizBlock. Without going into much detail, the class focused on integrating different aspects of business: Marketing, Communication, and Management. A major part of the class was teaching you how to communicate effectively in a business setting, as well as how to market yourself as an individual. Obviously, a lot of time was spent focusing on the major aspects of securing a job. In the future I may do a segment on finding a job, but these initial three posts will assume you have identified a job you want and now need to make first contact.

The Interview Process

Above is the basic process we're going to look at. A lot of people are under the impression that a resume's purpose is to get you a job. WRONG! It is meant to be something that catches their eye and gets you into the interview pile. From there, you've already caught their eye; your role now is to make sure you're the most memorable and competent person they're interviewing. The interview is really where an organization is going to decide if they want to hire you. Once you're in an interview, your resume means very little. After you leave the interview, you've made your impression.

At this point, one of three things happens:

  1. They’re Interested – You Get The Job
  2. They’re Interested – You Get Another Interview
  3. They’re Not Interested – You Don’t Get The Job

You want to make sure you land in #1 or #2. You can maximize your chances by setting yourself apart with a thank-you letter and by following up. It is possible you're not their first choice and they offer the job to someone else. They won't tell you that you didn't get the job until their top candidate has accepted; this is usually why it takes so long to hear back if you don't get the job. If you happen to fall into the #3 category, no worries. Learn from your mistakes, pick yourself up, and try again.

]]>
https://www.spectralcoding.com/2013/08/the-job-search-part-1-introduction/feed/ 0