16/09/2014

By Leon Adato, Head Geek, SolarWinds


There isn't a company in business today that hasn't suffered through an unplanned network outage – either internal outages that impact employees' ability to access vital systems such as business intelligence, customer relationship management, order fulfilment, voice services, and email; or outages that affect customer-facing applications including online ordering, customer support, and content delivery.

Even the most well-known brands across the globe, for whom money is ostensibly no object – including Amazon and eBay – have experienced interruptions to their services due to network outages.

Meanwhile, those same applications and services are becoming competitive differentiators for businesses – differentiators that companies can ill afford to be without. In fact, the average cost per minute of unplanned data centre downtime is now almost £5,000, up a staggering 41 percent from just over £3,000 per minute in 2010, according to the Ponemon Institute.

On top of all this, there is a new wrinkle to the availability issue: the harsh reality that “slow” is the new “down”. A recent SolarWinds survey found that 94% of business end users believe their application performance and availability directly affects their ability to do their job, with 44% saying that it is absolutely critical. These eye-opening statistics highlight that it's no longer enough for systems and services to be available; response speed is now just as important.

These statistics – both the cost of downtime and the re-casting of slow response as being just as bad as no response at all – make a compelling business case for the importance of knowing what is happening on your network, in real time, all the time. Meaning: comprehensive network and systems monitoring, management, and automated response.

Just to be clear: Network Management and Monitoring Systems (NMS) are no longer a nice-to-have cost centre in your IT department. They are an essential cost avoidance tool.

Obviously, cost is avoided when downtime and outages are reduced. Lost-opportunity costs are also avoided when IT staff are free to focus on strategic projects that improve performance and reliability rather than fighting fires. Finally, costs are avoided when monitoring not only alerts staff to the separate symptoms of a problem but also correlates those metrics to quickly uncover its root cause. This allows the business to implement the correct solution on the first try (and as soon as possible), avoiding costly delays and misspent funds caused by guesswork rather than data-based decisions.

A sophisticated NMS will collect data at all levels of the infrastructure – from basic hardware health on each component system; to the availability of applications and services on clustered resources; to the actual experience of each end-user currently on the system.

Taking this a step deeper, here are some capabilities any organisation should look for as they evaluate NMS tools:

Comprehensive Component Monitoring

While all NMS solutions should aspire to provide monitoring of business processes and the status of inter-related systems, this can't be done at the expense of ignoring the foundation. Network monitoring has to go beyond “ping”. It must also include the state of WAN interfaces, bandwidth information, dropped and errored packets, as well as the health of the network hardware itself, such as CPU and memory utilisation on the devices.
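
To make “beyond ping” concrete, here is a rough sketch in Python of the kind of poll an NMS performs continuously. It fetches interface error counters over SNMP using the standard net-snmp command-line tools; the device name, community string and interface index are placeholders, not recommendations.

    import subprocess

    # Placeholder values – substitute your own device, community and interface.
    HOST = "edge-router.example.com"
    COMMUNITY = "public"
    IF_INDEX = 1

    def snmp_get(oid):
        """Fetch a single numeric value via the net-snmp 'snmpget' utility."""
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid],
            capture_output=True, text=True, check=True,
        )
        return int(result.stdout.strip())

    in_errors = snmp_get(f"IF-MIB::ifInErrors.{IF_INDEX}")
    out_errors = snmp_get(f"IF-MIB::ifOutErrors.{IF_INDEX}")
    print(f"ifInErrors={in_errors}  ifOutErrors={out_errors}")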

On the server side, you need to track CPU, disk performance, system load, and memory; then, at the application layer, database connections, running processes and threads, service status, queries per second, and more.
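
As a concrete illustration of that server-side list, the following Python sketch takes a single snapshot of the basic host metrics using the third-party psutil library; a real NMS polls these continuously and stores the history. The process name matched at the end (“postgres”) is just an example.

    import psutil  # third-party library: pip install psutil

    # One snapshot of the host metrics an NMS would poll on a schedule.
    cpu_pct = psutil.cpu_percent(interval=1)   # CPU utilisation over one second
    load_1m = psutil.getloadavg()[0]           # one-minute load average
    mem = psutil.virtual_memory()              # RAM usage
    disk = psutil.disk_usage("/")              # root filesystem usage

    print(f"CPU: {cpu_pct:.0f}%  Load(1m): {load_1m:.2f}")
    print(f"Memory: {mem.percent:.0f}% used  Disk /: {disk.percent:.0f}% used")

    # Per-process detail, e.g. threads belonging to the database server.
    for proc in psutil.process_iter(["name", "num_threads"]):
        if proc.info["name"] and "postgres" in proc.info["name"]:
            print(f"{proc.info['name']}: {proc.info['num_threads']} threads")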

But it doesn't stop there. A solid NMS solution must also provide insight into the virtualisation and storage components, such as hypervisors, the physical resources presented to virtual machines, SAN fabric, and disk arrays.

Real-Time Visibility

Business-critical applications such as CRM, Citrix, ERP, etc., need continuous monitoring at all layers of the “stack” – from the network to storage to virtualisation to virtual machines to the applications running on them. Critical applications are used by hundreds of users across the organisation, and processes such as adding, modifying or deleting data, updates, and backups are running at all times. To ensure uptime, you need to make sure that the server is never overloaded; any lack of resources may cause a bottleneck that makes the application appear to be running “slow”. Hence, holistic visibility into your critical application infrastructure, updated in real time, is necessary.

Proactive Reporting, Intelligent Alerting

How can you reduce downtime? Start by not waiting for end users to pinpoint the problem. A solid NMS will analyse applications and create a baseline of their normal behaviour. These baselines are easily converted to reports which will help you spot problem areas, be it a disk that is throwing errors, a flaky network connection, or an application that is running “hotter” than expected in terms of CPU or memory. Solid reporting (and the commitment to turn those reports into actionable improvements) can help you avoid downtime before it occurs.

But not all problems are predictable, and some escalate faster than a daily (or even hourly) report would capture. This is where intelligent alerting comes into play. Using that same baseline data, a solid NMS will suggest thresholds that alert you before a change in behaviour turns into a problem. The key here is to take a data-based approach to alerting. Don't turn on every alert and plan to shut off whatever turns out to be noise – the result is that everything will appear to be noise. Instead, target areas that have been repeat offenders in the past, and look to shorten the MTTR (mean time to repair) by alerting on the condition as soon as it can be detected.
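
To illustrate the baseline idea, here is a small Python sketch that raises an alert only when a metric drifts well outside its own recent history. The three-standard-deviation rule and the window sizes are common starting points chosen for the example, not a prescription.

    import statistics
    from collections import deque

    class BaselineAlert:
        """Alert when a metric deviates sharply from its own recent behaviour."""

        def __init__(self, window=288, sigmas=3.0):
            self.history = deque(maxlen=window)  # e.g. 288 five-minute samples = 24h
            self.sigmas = sigmas

        def check(self, value):
            alert = False
            if len(self.history) >= 30:          # wait for some history first
                mean = statistics.mean(self.history)
                stdev = statistics.pstdev(self.history)
                if stdev > 0 and abs(value - mean) > self.sigmas * stdev:
                    alert = True
            self.history.append(value)
            return alert

    # Feed in CPU readings: only the abnormal spike at the end raises an alert.
    monitor = BaselineAlert()
    for reading in [35, 38, 36, 40, 37] * 10 + [92]:
        if monitor.check(reading):
            print(f"ALERT: CPU at {reading}% is outside its normal baseline")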

Apply Automation

One of the most-overlooked capabilities in a strong NMS product is the ability to automatically respond to triggering events. A disk is full? Why not attempt to clear the temporary folder before calling out the technician at 2am? At worst, the clearing attempt won't work and the alert will trigger on the next cycle. But in many cases – from restarting an application service that has crashed to re-balancing the load on a cluster of servers – having the NMS do the work means lightning-fast response to errors, which once again reduces or even eliminates downtime.
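
Here is a minimal sketch of that “try the easy fix first” pattern in Python, assuming the trigger is a disk-nearly-full alert and that clearing an application scratch directory is a safe first step in your environment. The path and threshold are purely illustrative.

    import shutil
    from pathlib import Path

    TEMP_DIR = Path("/var/tmp/app-cache")  # hypothetical application scratch area
    THRESHOLD_PCT = 90                     # escalate to a human above this level

    def disk_used_pct(path="/"):
        usage = shutil.disk_usage(path)
        return usage.used / usage.total * 100

    def clear_temp(directory):
        """First-line remediation: delete files from the temporary directory."""
        for item in directory.glob("*"):
            if item.is_file():
                item.unlink()

    if disk_used_pct() > THRESHOLD_PCT:
        clear_temp(TEMP_DIR)
        if disk_used_pct() > THRESHOLD_PCT:
            print("Cleanup was not enough – escalating to the on-call technician")
        else:
            print("Disk pressure relieved automatically; no 2am call-out needed")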

Learn from Outages

Outages are bound to happen in spite of your best efforts. Use each critical outage as an opportunity. A good NMS will collect a wide array of metrics, but not all of those metrics will have a one-to-one correlation to an alert. After an outage, determine whether you had the right data and simply failed to turn it into an actionable alert, or whether the key indicators were not being collected – in which case it's an opportunity to add one more monitor to the line-up.

Understand Protocols

A good NMS will have multiple methods of collecting data from the environment. Hardware information may be collected using SNMP. That will show you (for example) that a WAN interface is passing 10 gigabits of data per second. But that won't tell you where the traffic is going. For that, the NetFlow protocol can be used. It will show how much of that traffic is database requests from the online ordering web server, how much is system backups, and how much is Joe in accounting streaming “The Hour”.
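
The arithmetic behind that “10 gigabits per second” figure is simple: two readings of the interface's 64-bit octet counter, taken one polling cycle apart, converted into a rate. The Python sketch below uses made-up counter values to show the calculation; note that it tells you how much traffic there is, but nothing about who or what is generating it – that is where NetFlow comes in.

    # Two successive SNMP readings of IF-MIB::ifHCInOctets (a 64-bit counter),
    # taken one polling cycle apart. The values are invented for illustration.
    POLL_INTERVAL_SECONDS = 300            # a typical five-minute polling cycle
    first_reading = 9_120_334_117_888
    second_reading = 9_495_334_117_888

    delta_octets = second_reading - first_reading
    bits_per_second = delta_octets * 8 / POLL_INTERVAL_SECONDS

    print(f"Average inbound rate over the interval: {bits_per_second / 1e9:.2f} Gbps")
    # SNMP counters give you the volume; per-conversation detail (which server,
    # which user, which application) comes from NetFlow records instead.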

Meanwhile, trigger-based protocols such as traps and syslog send out data only when something notable occurs. While this won't give you the continuous historical record that polling provides for forensics, it can provide insight into events that are not otherwise detectable.
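
As a toy example of how event-driven data arrives, the Python sketch below listens for syslog messages over UDP and prints each one as it turns up; nothing happens until a device has something to say. Port 5514 is used here only because the standard syslog port (514) normally requires elevated privileges.

    import socketserver

    class SyslogHandler(socketserver.BaseRequestHandler):
        """Print each UDP syslog message as it arrives."""
        def handle(self):
            data = self.request[0].decode("utf-8", errors="replace").strip()
            print(f"{self.client_address[0]}: {data}")

    if __name__ == "__main__":
        with socketserver.UDPServer(("0.0.0.0", 5514), SyslogHandler) as server:
            server.serve_forever()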

Voice traffic has its own protocol – IPSLA – which provides a wealth of information, from jitter to the actual quality of a call from different points on the network.

Finally, some NMS solutions will offer real-time analysis of the packets on the network, calculating the time it takes for a user to get information back from an internal system like ERP, or an external one like Salesforce.com. Techniques like this let you quickly answer the question “is the problem (slow response) in the application, or in the network?” and begin resolving the issue that much faster.
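
A first-order way to answer that question yourself is to time a request from the user's side of the network and compare it with the network round-trip time alone. The Python sketch below uses only the standard library; the URL is a placeholder for whatever internal or external service you care about.

    import time
    import urllib.request

    URL = "https://erp.internal.example.com/health"  # placeholder endpoint

    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=10) as response:
        response.read()
    elapsed_ms = (time.perf_counter() - start) * 1000

    print(f"End-to-end response time: {elapsed_ms:.0f} ms")
    # If ping (or IPSLA) shows network latency of a few milliseconds but this
    # figure is measured in seconds, the bottleneck is probably the application.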

In Summary

Reducing downtime and improving application responsiveness becomes far more achievable when you take the factors above into consideration. Not only do you need to monitor your important assets and critical metrics, you also need to understand the difference between normal and problematic behaviour. Both become considerably easier when you have the right tools for server and application monitoring in your network.