
Lab 6 - Troubleshooting Pods Lab

This lab teaches you how to troubleshoot pods that exist in your Kubernetes environment.

Getting Started

In this lab we will use Sumo Logic to monitor Kubernetes pods, identify and investigate a pod in trouble, and determine the cause so we can respond effectively. You will investigate by left-clicking a pod in the honeycomb view, reviewing surrounding messages, and tracking user activity with our Audit App.

Opening Explore

Explore is an out-of-the-box Sumo Logic view that you can use to navigate a visual representation of your Kubernetes stack.

  1.  To open Explore, click + New on the top menu bar.
  2. From the drop-down menu, select Explore.

Screen Shot 2019-09-03 at 8.32.58 AM.png

The Explore navigation panel appears on the left with a collapsed view of your Kubernetes stack.

Screen Shot 2019-09-03 at 8.28.19 AM.png

  3. Now you can explore your Kubernetes environment.

Analyzing your Kubernetes environment 

Explore provides a framework in which you can view the contents of your Kubernetes clusters and easily navigate your system hierarchy. The navigation panel on the left shows a list of all your clusters with the namespaces, containers, and pods nested underneath each cluster.

  1. At the top of the navigation panel, we want the Namespace view. If it is not already selected, click below Explore By to expand the menu and select the Kubernetes Namespace View. The contents of your selection appear below.

Screen Shot 2019-09-04 at 12.03.08 PM.png

  1. Select Dashboards and choose Kubernetes - Cluster Overview from the pulldown menu.
    Screen Shot 2019-09-04 at 12.08.02 PM.png
  2. The Kubernetes - Cluster Overview dashboard panels show the status of our clusters. For this lab, click the prod-loggen namespace in the Pods Running panel. From this view we can see two pods in the prod-loggen namespace that are not functioning (marked in red).
    Screen Shot 2019-09-04 at 12.12.15 PM.png
    1. By placing our mouse over the pods, we can see information about the services that have failed; note that they are not holding steady at a value of 1 on the graph.
      Screen Shot 2019-09-04 at 12.32.48 PM.png
      1. To drill down further, hover over pagerduty and left-click. Under Related Explore, click the pod pagerduty-xxxxxxxx-xxxxxx. Since pods are ephemeral, the characters after carbonblack and pagerduty will constantly change.
        Screen Shot 2019-09-03 at 8.26.14 AM.png
  3. The data for your selection is displayed in the panels of the dashboard on the right.
    Screen Shot 2019-09-03 at 8.21.25 AM.png
  4. Scroll to the bottom of the page until you reach the Log Stream. In the Logs panel, we see the logs that have been generated by PagerDuty. We are now going to dig into the logs. Click the menu icon (three dots) at the top right and select Open in Search.
    Screen Shot 2019-09-04 at 12.46.56 PM.png
  5. Now we can see the logs associated with PagerDuty in the log search interface. By default, we are in the Aggregates view. Expand the time range to Last 15 Minutes, then click the Messages tab so we can see the actual log messages.
    Screen Shot 2019-09-04 at 12.56.01 PM.png
  6. Looking at the Field Browser on the left, we see a metadata field called log. The log field already contains the parsed message from each ingested log, so we can hide the raw Message column to isolate things further. Uncheck Message under Display Fields. This removes Message from the results and lets us focus on just the parsed log field.
    Screen Shot 2019-09-04 at 1.01.17 PM.png
  7. Viewing the log messages, we want to focus on two of them. The screenshot shows messages indicating that something has gone wrong. The messages also contain the HTTP status code 401, which relates to authentication.
  8. Since it appears there is a problem with authentication, we can scroll down in the messages to see which access_id is being used for authentication. Look through your log messages to find the same two messages; note that they may not appear on the first page. To isolate things further, we will take the access_id and see if any other log messages are associated with it.
    Screen Shot 2019-09-04 at 1.13.21 PM.png
  9. Copy the value of the access_id (in this example, suRhn0DW7l4DZU) by highlighting it and right-clicking to select Copy Selected Text.
    Screen Shot 2020-03-31 at 2.45.56 PM.png
  10. Let's open a New Log Search for the keyword suRhn0DW7l4DZU and set the time range to the last 65 minutes, -65m. As a reminder, to start a new log search, click + New on the top menu and select Log Search from the drop-down menu.

    Paste the access_id keyword suRhn0DW7l4DZU in the query window and click Start. Notice that logs are being retrieved from two Source Categories. Let's go check them out. On the left, under Hidden Fields, click the Source Category metadata field.
    Screen Shot 2019-09-04 at 1.27.02 PM.png
  11. We notice that Labs/Sumo_Logic has only a couple of messages. Labs/Sumo_Logic points to the data coming from our Audit App. We know that our Audit App tracks user activity, so we are getting a little suspicious. This is also interesting because it has just a couple of messages, and we know that root-cause items occur less frequently: the needle in the haystack. Highlight and select the Labs/Sumo_Logic Source Category.
    Screen Shot 2019-09-04 at 1.32.06 PM.png
  12. As we look at the log messages, we see that the access key suRhn0DW7l4DZU was disabled and deleted. However, we are only viewing the logs that contain the access_id, which limits our search too much.
  13. To get a better understanding of what may have occurred, we can look at the messages that happened around the access key being disabled and deleted. In one of the log messages, click the Category Labs/Sumo_Logic dropdown and select Surrounding Messages, then +/- 1 Minute, to see all the other log messages that occurred before and after the message you chose.
    Screen Shot 2019-09-04 at 1.35.23 PM.png
  14. We are now left with five messages, which tell the story of what happened. We can read these lines from the bottom to the top to uncover the sequence of events.
    Screen Shot 2019-09-04 at 1.40.09 PM.png
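The interactive pivot in steps 10 and 11 can also be typed directly into the query window as a single search. A minimal sketch, using the example access ID from this lab (yours will differ) and scoping to the Audit App's Source Category:

```
_sourceCategory=Labs/Sumo_Logic "suRhn0DW7l4DZU"
```

Running the keyword alone, without the _sourceCategory scope, reproduces the broader search from step 10 across all Source Categories.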

Sequence of Events:

Line 5: User kenneth logs out.

Line 4: User shady+soc logs in.

Line 3: User shady+soc deletes user kenneth.

Line 2: Deleting user kenneth disables his access keys.

Line 1: Deleting user kenneth also deletes his access keys.


Shady+soc deleted the Kenneth user, which also disabled and deleted his access key. Kenneth's access key was being used to authenticate with PagerDuty, but since it was deleted, the service had no way to authenticate and failed, which we saw earlier in the Pod view.

After further follow-up, it was discovered that Kenneth had left the company and was off-boarded by shady+soc. This problem could have been avoided if shady+soc had either contacted Kenneth's manager before deleting his user to verify whether Kenneth had any active keys associated with his account, or scanned for active keys associated with the user account.
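As a preventive follow-up, the failure signature found in this lab (PagerDuty log messages carrying HTTP status 401) could be watched for with a scheduled search. A sketch only; the exact keywords are an assumption about your log format:

```
"pagerduty" "401" | count by _sourceCategory
```

A search like this could be saved and scheduled to alert when the count exceeds zero, surfacing credential failures before they show up as unhealthy pods in Explore.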

Quiz (True or False?)

  1. To isolate the pod's name, in the dashboard we hover over the pod's honeycomb and click the left mouse button.
  2. Surrounding Messages has +/- 60 Minutes as an option.
  3. The Audit App is helpful if you want to track user activity.