In this lab we will use Sumo Logic to monitor Kubernetes pods, dig into an issue that has been detected, and figure out its cause.
Explore is an out-of-the-box Sumo Logic view that you can use to navigate a visual representation of your Kubernetes stack.
- To open Explore, click + New on the top menu bar.
- From the drop-down menu, select Explore.
The Explore navigation panel appears on the left with a collapsed view of your Kubernetes stack.
- Now you can explore your Kubernetes environment.
Analyzing your Kubernetes environment
Explore provides a framework in which you can view the contents of your Kubernetes clusters and easily navigate your system hierarchy. The navigation panel on the left shows a list of all your clusters with the namespaces, containers, and pods nested underneath each cluster.
- At the top of the navigation panel, we want Explore By set to Kubernetes Service View. If it is not already selected, click below Explore By to expand the menu and select Kubernetes Service View. The contents of your selection appear below.
- Select Dashboards and choose Kubernetes - Cluster Overview from the pull-down menu.
- The panels of the Kubernetes - Cluster Overview dashboard show the status of the clusters. For this lab, click the prod-loggen namespace in the Pods Running panel. In this view we can see 2 pods in prod-loggen that are not functioning (marked in red).
- By placing our mouse over the pods, we can see information about the services that have failed. Here we want to note the pod name. For this lab we will focus on the PagerDuty pod. Note the full pod name, including the last 5 characters that follow the final hyphen (in this example we are looking at pagerduty-84d685f79f-4wjln). Since pods are ephemeral, the characters after pagerduty will change over time, so yours will differ from this example.
- To drill down further, go back to the Explore By menu at the top left of the screen and select Kubernetes Namespace View.
- Click the arrow to the left of prod01.travellogic.info to view its contents.
- Select prod-loggen and then pagerduty-*-* (the entry that matches the pod name you noted earlier) to drill down into the cluster and view the pods and containers.
- The data for your selection is displayed in the panels of the dashboard on the right.
- Scroll to the bottom of this page until you get to the Log Stream. In the Logs panel, we see the logs that have been generated by PagerDuty. We are now going to dig into the logs. Click on the menu icon (3 dots) on the top right and select Open in Search.
- Now we can see the logs associated with PagerDuty in the log search interface. By default, we are in the Aggregates view. Click on the Messages tab so we can see the actual log messages.
- Now that we can see the log messages, we can narrow the view by unchecking the Message display field. This lets us focus on the metadata fields that have already been parsed out of each message. In other words, because the log message has been parsed into multiple fields, we can hide the raw message field and get a view that makes it easier to see what is happening.
- Viewing the log messages, we want to focus on two of them. In the screenshot, the first message shows a java.io.IOException, indicating that something has gone wrong. The message also contains the HTTP status code 401 (Unauthorized), which indicates an authentication failure.
- Since there appears to be a problem with authentication, we can use the second message to see which access_id is being used to authenticate. Look through your log messages to find the same 2 messages. Note that they may not appear on the first page. To narrow this down further, we will take the access_id and see if there are any other log messages associated with it. Copy the final value of the access_id (in this example, suRhn0DW7l4DZU) by highlighting it, right-clicking, and selecting Copy Selected Text.
- Let's open a new Log Search for the keyword suRhn0DW7l4DZU and set the time range to Last 60 Minutes. As a reminder, to start a new log search, click + New on the top menu and select Log Search from the drop-down menu.
- Paste the access_id keyword suRhn0DW7l4DZU in the query window and click Start. Notice that logs are being retrieved from 2 Source Categories. Let's go check them out. On the left, under Hidden Fields, click the Source Category metadata field.
- Notice that Labs/Sumo_Logic has only a couple of messages. This is interesting because root-cause events tend to occur less frequently: the needle in the haystack. Highlight and select the Labs/Sumo_Logic Source Category.
- It looks like the access key suRhn0DW7l4DZU was disabled and deleted. However, we are viewing only the logs that contain the access_id, which is too narrow a search to show what led up to this.
- To get a better understanding of what may have occurred, we can look at the messages logged around the time the access key was disabled and deleted. In one of the log messages, click the Category Labs/Sumo_Logic dropdown, then select Surrounding Messages and +/- 1 Minute to see all the other log messages that occurred before and after the message you chose.
- We are now left with 5 messages that tell the story of what happened. We can read these lines from the bottom to the top to uncover the sequence of events.
Sequence of Events:
Line 5: User kenneth logs out.
Line 4: User shady+soc logs in.
Line 3: User shady+soc deletes user kenneth.
Line 2: Deleting user kenneth disables his access keys.
Line 1: Deleting user kenneth also deletes his access keys.
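The Surrounding Messages step and the bottom-to-top reading order can be sketched in a few lines of Python. This is a minimal illustration, not how Sumo Logic implements the feature: the `LogMessage` type, the `surrounding_messages` helper, and every timestamp below are hypothetical, and the message texts only paraphrase the five events listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogMessage:
    timestamp: datetime
    text: str


def surrounding_messages(messages, anchor, window=timedelta(minutes=1)):
    """Return the messages within +/- window of the anchor, oldest first."""
    nearby = [m for m in messages if abs(m.timestamp - anchor.timestamp) <= window]
    return sorted(nearby, key=lambda m: m.timestamp)


# Hypothetical timestamps, listed newest first the way search results appear.
base = datetime(2024, 1, 1, 12, 0, 0)
messages = [
    LogMessage(base + timedelta(seconds=40), "access keys of kenneth deleted"),
    LogMessage(base + timedelta(seconds=30), "access keys of kenneth disabled"),
    LogMessage(base + timedelta(seconds=20), "shady+soc deletes user kenneth"),
    LogMessage(base + timedelta(seconds=10), "shady+soc logs in"),
    LogMessage(base, "kenneth logs out"),
    LogMessage(base - timedelta(minutes=30), "unrelated earlier event"),
]

# Anchor on the key-deletion message and rebuild the timeline around it.
anchor = messages[0]
timeline = surrounding_messages(messages, anchor)
for m in timeline:
    print(m.timestamp.time(), m.text)
```

Sorting oldest first is the programmatic equivalent of reading the search results from the bottom up; the event 30 minutes earlier falls outside the one-minute window and is dropped.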
User shady+soc deleted the user kenneth, which also disabled and deleted his access key. Kenneth's access key was being used to authenticate with PagerDuty, so once it was deleted the service could no longer authenticate and failed, which is what we saw earlier in the pod view.
After further follow-up, it was discovered that Kenneth had left the company and was offboarded by shady+soc. This problem could have been avoided if shady+soc had either contacted Kenneth's manager before deleting his user account to check whether Kenneth had any active keys associated with it, or scanned for active keys associated with the user account.
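The second safeguard, scanning for active keys before deleting a user, might look like the following sketch. It is an illustration only: the `AccessKey` record, the `active_keys_for` and `safe_to_delete` helpers, and the sample inventory are hypothetical and do not represent a Sumo Logic API. In practice the key list would come from whatever inventory of access keys your organization maintains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessKey:
    access_id: str
    owner: str
    active: bool


def active_keys_for(user, keys):
    """Return the active access keys owned by the given user."""
    return [k for k in keys if k.owner == user and k.active]


def safe_to_delete(user, keys):
    """A user is safe to delete only when no active key depends on the account."""
    return not active_keys_for(user, keys)


# Hypothetical inventory; only the first access_id appears in this lab.
keys = [
    AccessKey("suRhn0DW7l4DZU", owner="kenneth", active=True),
    AccessKey("exampleKey0001", owner="kenneth", active=False),
    AccessKey("exampleKey0002", owner="another-user", active=True),
]

blocking = active_keys_for("kenneth", keys)
if blocking:
    print("Do not delete yet; active keys:", [k.access_id for k in blocking])
```

A check like this would have flagged the key that PagerDuty depended on, giving the administrator a chance to rotate it to a service account before removing the user.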
- To isolate the pod's name, in the dashboard we hover over the pod honeycomb and note the beginning of the pod name and the last 5 characters after the final hyphen, for example pagerduty-*-4wjln?
- Surrounding Messages has +/- 60 Minutes as an option?