Infrastructure

Monitoring changes in your infrastructure

With New Relic Infrastructure, you can easily identify performance problems correlated with changes in your infrastructure's state.

Whether you're manually managing configuration changes or defining infrastructure as code with configuration automation tools, monitoring those changes is essential. No matter how stable or performant your production system is, you'll always be making changes that could have unforeseen impacts. A hotfix for a platform-level security bug or a simple upgrade to any of the hundreds of packages supporting your applications could send your performance reeling.

In this tutorial you will learn how to:

  • Correlate an Event with a performance problem

  • Scenario: Package change performance problem

Correlate an Event with a performance problem

As you learned in Using Filter Sets and Groups in New Relic Infrastructure, your inventory, events, and other data are all organized first by whatever filters are applied in the host filter sidebar. Before you do anything, be sure you're looking at the correct filter set!

The first step in correlation is preparation. Make sure your alerts cover your infrastructure so you're informed when a problem occurs! This saves you valuable time (and brainpower) by notifying you of an issue quickly and providing essential guidance in the UI.

On the Compute, Network, Storage, and Process tabs, you'll see the Events Timeline (pictured below) at the top of each interface. Hovering over a filled blue square displays a tooltip showing the number of events tracked during that time period and their respective sources. Clicking one of the groupings takes you to the Events tab, which lists all events occurring at or before that grouping in descending chronological order.


Pictured: The Events Timeline shows events (blue squares), warning and critical alert condition openings (right arrows), and closures (left arrows). Warnings are yellow and critical alerts are represented as red.
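To make the grouping behavior described above concrete, here is a minimal Python sketch of how events might be bucketed into timeline groupings and then listed newest first. It does not use any New Relic API. The event records, the five-minute bucket width, and the function names (bucket_start, group_events, events_up_to) are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event records: (timestamp, source). Invented for illustration.
EVENTS = [
    (datetime(2017, 1, 1, 12, 0, 5), "sessions"),
    (datetime(2017, 1, 1, 12, 0, 40), "packages"),
    (datetime(2017, 1, 1, 12, 7, 10), "packages"),
    (datetime(2017, 1, 1, 12, 12, 0), "config"),
]

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, width):
    """Round a timestamp down to the start of its timeline bucket."""
    return EPOCH + ((ts - EPOCH) // width) * width


def group_events(events, width=timedelta(minutes=5)):
    """Map each bucket start to a count of events per source (the tooltip data)."""
    groups = {}
    for ts, source in events:
        groups.setdefault(bucket_start(ts, width), Counter())[source] += 1
    return groups


def events_up_to(events, bucket, width=timedelta(minutes=5)):
    """Return events occurring at or before the given grouping, newest first."""
    cutoff = bucket + width
    return sorted((e for e in events if e[0] < cutoff), reverse=True)


if __name__ == "__main__":
    for start, counts in sorted(group_events(EVENTS).items()):
        print(start, dict(counts))
```

Calling events_up_to with a bucket start mimics clicking a blue square: everything in that grouping and earlier is returned, with the most recent event at the top.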

The Events tab is a live feed in Infrastructure that displays user sessions, package changes, configuration changes, and host status, among other events. Just like the Inventory page, you can use search and filter functions to find particular events, and you can further filter event sources by selecting the filter icon next to the search bar. Selecting event source checkboxes here changes which items appear in the adjacent event list.

Pictured: The Events UI lists events coming from a variety of sources chronologically, corresponding to the selected period on the Events timeline.
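The combination of source checkboxes and search can be thought of as two filters applied together. The following Python sketch illustrates that idea under stated assumptions: the Event fields, the source names, and the filter_events function are hypothetical and are not part of any New Relic interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical fields chosen for illustration only.
    source: str
    host: str
    summary: str


def filter_events(events, sources=None, search=""):
    """Keep events whose source is selected and whose summary matches the search.

    An empty or missing `sources` set means no checkbox filter is applied.
    The search is a case-insensitive substring match on the summary.
    """
    needle = search.lower()
    return [
        e for e in events
        if (not sources or e.source in sources) and needle in e.summary.lower()
    ]


if __name__ == "__main__":
    feed = [
        Event("sessions", "host-a", "User connected"),
        Event("packages", "host-a", "Package account-service installed"),
        Event("config", "host-b", "Configuration file changed"),
    ]
    for event in filter_events(feed, sources={"packages"}, search="account"):
        print(event)
```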

Scenario: Package change performance problem

This short video walks you through the following scenario steps:

According to a 2014 Gartner study, about 40% of outages at large companies were caused by configuration changes alone. Let's walk through a simple scenario involving a package change that results in a performance problem.

Step 1: Make sure you're alerting in Infrastructure
A comprehensive alert suite is essential for quickly correlating performance issues with changes. Infrastructure shows the critical and warning violations you've set up right in the Events timeline. Being able to see the context in which these events occur makes troubleshooting easier. In our example, we've received a critical violation indicating unusually high CPU usage.

Step 2: Filter down to impacted hosts
Because you've set alert conditions, you can filter down to affected hosts and look for configuration changes, package additions or modifications, or other changes that might have impacted performance. Hosts affected by the violation can be identified directly on the Events page or by clicking on the event and following the related link to Alerts. In our example, one host is affected: use1v-docker-large-customer-2.

Step 3: Select the event grouping coincident with the alert violation
This will help you identify events in your timeline that might be directly attributable to your performance problem.

Step 4: Work through the timeline of events
Once you've filtered narrowly enough and identified the relevant time period, work through the event list and look for events that could plausibly have caused your incident. In our example, three related events, followed by the violations themselves, show the chain of events that resulted in the critical violation:

  1. User 'lcirne' begins a session on host use1v-docker-large-customer-2
  2. User 'lcirne' installs a package, account-service, on host use1v-docker-large-customer-2
  3. User 'lcirne' modifies the account-service package on host use1v-docker-large-customer-2
  4. Four seconds later, a warning violation event occurs, and two minutes later, a critical violation event related to CPU consumption occurs

These events in succession lead us to the conclusion that the modification of this package led to the alert.
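The reasoning in Steps 2 through 4 can be sketched in Python: restrict the feed to the affected host, keep only events in a window before the violation, and read them oldest to newest. This is only an illustration of the approach, not how Infrastructure works internally. The Event type, the candidate_causes function, the 30-minute lookback, the second host name, and every timestamp below are assumptions made up for the example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical record shape for illustration.
    time: datetime
    host: str
    summary: str


def candidate_causes(events, host, violation_time, lookback=timedelta(minutes=30)):
    """Return events on `host` within `lookback` before the violation, oldest first."""
    start = violation_time - lookback
    hits = [
        e for e in events
        if e.host == host and start <= e.time <= violation_time
    ]
    return sorted(hits, key=lambda e: e.time)


if __name__ == "__main__":
    host = "use1v-docker-large-customer-2"
    # Timestamps are invented; only their ordering mirrors the scenario.
    feed = [
        Event(datetime(2017, 1, 1, 9, 0, 0), host, "unrelated earlier change"),
        Event(datetime(2017, 1, 1, 12, 0, 0), host, "lcirne begins a session"),
        Event(datetime(2017, 1, 1, 12, 1, 0), host, "lcirne installs account-service"),
        Event(datetime(2017, 1, 1, 12, 2, 0), host, "lcirne modifies account-service"),
        Event(datetime(2017, 1, 1, 12, 1, 30), "some-other-host", "package upgraded"),
    ]
    warning_time = datetime(2017, 1, 1, 12, 2, 4)
    for event in candidate_causes(feed, host, warning_time):
        print(event.time, event.summary)
```

A short lookback keeps the candidate list small. Widening it trades precision for the chance of catching a slow-acting change, and events that merely precede a violation remain candidates to investigate rather than proven causes.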

Now you can identify performance problems correlated with changes in your infrastructure!