
Lab 5 - Troubleshooting Pods Lab

This lab teaches you how to troubleshoot pods that exist in your Kubernetes environment.

Getting Started

In this lab, we will use Sumo Logic to monitor Kubernetes pods, dig into an issue that has been detected, and figure out the cause.

Opening Explore

Explore is an out-of-the-box Sumo Logic view that you can use to navigate a visual representation of your Kubernetes stack.

  1.  To open Explore, click + New on the top menu bar.
  2. From the drop-down menu, select Explore.


The Explore navigation panel appears on the left with a collapsed view of your Kubernetes stack.


  3. Now you can explore your Kubernetes environment.

Analyzing your Kubernetes environment 

Explore provides a framework in which you can view the contents of your Kubernetes clusters and easily navigate your system hierarchy. The navigation panel on the left shows a list of all your clusters with the namespaces, containers, and pods nested underneath each cluster.
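Behind this hierarchy, Sumo Logic enriches each Kubernetes log with metadata about the cluster, namespace, pod, and container it came from, and Explore is simply a visual way of navigating those tags. As a rough sketch only (the exact field names depend on how collection is configured, so treat cluster, namespace, pod, and container here as assumptions), a log search scoped to the namespace we will investigate in this lab might look like this:

    // Hypothetical sketch: scope a log search using Kubernetes metadata fields,
    // then count the matching logs per pod and container.
    // Field names reflect a typical Sumo Logic Kubernetes collection setup.
    cluster=prod01.travellogic.info namespace=prod-loggen
    | count by pod, container

Explore saves us from writing queries like this by hand, but keeping the underlying metadata in mind helps when we switch to the log search interface later in the lab.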

  1. At the top of the navigation panel, we want to explore by the Service View. If it is not already selected, click below Explore By to expand the menu and select Kubernetes Service View. The contents of your selection appear below.


  2. Select Dashboards and choose Kubernetes - Cluster Overview from the pull-down menu.
  3. The Kubernetes - Cluster Overview dashboard gives us panels showing the status of each cluster. For this lab, we want to click the prod-loggen namespace in the Pods Running panel. From this view we can see two pods in prod-loggen that are not functioning (marked in red).
  4. By placing our mouse over the pods, we can see information about the services that have failed. Here we want to note the pod name. For this lab we will be focusing on the PagerDuty pod, so note the pod name including the last 5 characters that follow the hyphen (in this example we are looking at pagerduty-84d685f79f-4wjln). Since pods are ephemeral, the characters after pagerduty will constantly change.
  5. To drill down further, go back to the Explore By menu at the top left of the screen and select Kubernetes Namespace View.
  6. Click the arrow to the left of prod01.travellogic.info to view its contents.
  7. Select prod-loggen and then the pagerduty-*-* entry that matches the pod name you noted in Step 4 to drill down into the cluster and view its pods and containers.
  8. The data for your selection is displayed in the panels of the dashboard on the right.
  9. Scroll to the bottom of the page until you reach the Log Stream. In the Logs panel, we see the logs that have been generated by PagerDuty. We are now going to dig into them: click the menu icon (three dots) at the top right of the panel and select Open in Search.
  10. We can now see the logs associated with PagerDuty in the log search interface. By default, we are in the Aggregates view. Click the Messages tab so we can see the actual log messages.
  11. Now that we can see the log messages, we can narrow the view further by unchecking the Message display field. This lets us focus on the metadata fields that have already been parsed out of each message. In other words, because the log message has been parsed into multiple fields, we can leave out the raw message field and get a view that makes it easier to see what is happening.
  12. Viewing the log messages, we want to focus on two of them. The first message shows a java.io.IOException, indicating that something has gone wrong. The message also contains the HTTP status code 401, which relates to authentication.
  13. Since it appears there is a problem with authentication, we can use the second message to see which access_id is being used for authentication. Look through your log messages to find the same two messages; note that they may not appear on the first page. To narrow things down further, we will take the access_id and see whether any other log messages are associated with it. Copy the value of the access_id (in this example, suRhn0DW7l4DZU) by highlighting it, right-clicking, and selecting Copy Selected Text.
  14. Let's open a New Log Search for the keyword suRhn0DW7l4DZU and set the time range to the Last 60 Minutes. As a reminder, to start a new log search, click + New on the top menu and select Log Search from the drop-down menu.

    Paste the access_id keyword suRhn0DW7l4DZU into the query window and click Start. Notice that logs are being retrieved from two Source Categories. Let's check them out. On the left, under the Hidden fields, click the metadata field Source Category. (A query sketch that reproduces these searches appears after the event timeline below.)
  15. Notice that Labs/Sumo_Logic has only a couple of messages. This is interesting because root-cause items tend to occur less frequently, the needle in the haystack. Highlight and select the Labs/Sumo_Logic Source Category.
  16. It looks like the access key suRhn0DW7l4DZU was disabled and deleted. However, we are only viewing the logs that contain the access_id, which limits our search too much.
  17. To get a better understanding of what may have occurred, we can look at the messages that happened around the time the access key was disabled and deleted. In one of the log messages, click the Category Labs/Sumo_Logic dropdown, then select Surrounding Messages and +/- 1 Minute to see all the other log messages that occurred before and after the message you chose.
  18. We are now left with 5 messages, which tell the story of what happened. We can read these lines from the bottom to the top to uncover the sequence of events.

Line 5: User kenneth logs out.

Line 4: User shady+soc logs in.

Line 3: User shady+soc deletes user kenneth.

Line 2: Deleting user kenneth disables his access keys.

Line 1: Deleting user kenneth also deletes his access keys.
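Before moving on, note that the two searches we ran in steps 14 and 15 can also be written directly as log search queries. This is only a sketch of what we did through the UI; the access_id value and the Source Category shown here come from this lab run, and yours may differ:

    // Step 14 (sketch): broad keyword search for every message that mentions
    // the copied access_id, run over the Last 60 Minutes.
    "suRhn0DW7l4DZU"

    // Step 15 (sketch): narrow to the quieter Source Category where the
    // root-cause messages turned out to live.
    _sourceCategory=Labs/Sumo_Logic "suRhn0DW7l4DZU"

Running the broad keyword search first and then narrowing by Source Category mirrors the needle-in-the-haystack approach: start wide, then focus on the least noisy source.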

Shady+soc deleted the Kenneth user, which also disabled and deleted his access keys. Kenneth's access key was being used to authenticate with PagerDuty, but once it was deleted, the service had no way to authenticate and failed, which is what we saw earlier in the Pod view.

After further follow-up, it was discovered that Kenneth had left the company and was off-boarded by shady+soc. This problem could have been avoided if shady+soc had either contacted Kenneth's manager before deleting his user to verify whether Kenneth had any active keys associated with his account, or scanned for active keys associated with the user account.
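Sticking with the log search approach used in this lab, that kind of check could be as simple as searching for recent access key activity tied to the user before deleting the account. The sketch below is only an illustration; the Source Category and search terms are assumptions about how these audit-style events are collected in this environment:

    // Hypothetical pre-off-boarding check: look for recent access key activity
    // associated with the user kenneth. Set the search time range in the UI
    // (for example, the last 30 days). The Source Category is an assumption.
    _sourceCategory=Labs/Sumo_Logic "kenneth" "access key"
    | timeslice 1d
    | count by _timeslice

If the search returns recent activity, the keys are still in use and should be handed off before the user is removed.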

Quiz (True/False)

  1. To isolate the pod name, in the dashboard we hover over the pod honeycomb and note the beginning of the pod name and the last 5 characters after the hyphen, for example pagerduty-*-4wjln?
  2. Surrounding Messages has +/- 60 Minutes as an option?