What is observability?
Observability is the extent to which the internal states of a system can be determined from its external outputs.
For our purposes, observability is the ability to observe an application from the outside and understand what is happening inside the application and its services. Observability helps ensure that the application runs reliably: that it is available, performant, and secure.
Modern applications are increasingly complex because they rely on distributed technologies, cloud infrastructure, and container and orchestration tools. The connections between microservices, orchestrators, and the underlying cloud resources are also growing more complex. As a result, unforeseen events (unknown unknowns, in terms of risk) are more common and bring unfamiliar behaviors and failure modes. These events can disrupt your overall incident remediation workflow, which can be broken down into three steps:
Monitor critical indicators of reliability, such as errors or latency. Unknown-unknown errors sometimes do not directly affect the metrics you are tracking, which makes the underlying issues harder to detect.
Diagnose or isolate the services or resources that might be the immediate cause of reliability issues. Unknown unknowns can affect systems in obscure ways. For example, the culprit service's metrics might look normal while a downstream service that consumes it shows abnormal metrics, which could lead an SRE down the wrong path. No tribal knowledge exists to guide the SRE in the right direction.
Troubleshoot and uncover the root cause(s) to guide recovery and ensure ongoing application reliability. As in the diagnosis step, unknown unknowns can make the root cause difficult to find.
Monitoring, diagnosing, and troubleshooting such issues is harder because there are no existing runbooks that can help resolve issues quickly. This problem is compounded by the fact that modern applications also emit astonishing amounts of machine data across the stack.
All this complexity, combined with the data deluge and unknown behaviors, can make it impossible to recover systems quickly if you have no way to make sense of all the information. This is why organizations need an observability solution.
Sumo Logic provides an AWS CloudFormation template that automates the setup and installation of the AWS Observability Solution for a given account and region. This allows you to configure the monitoring of your AWS infrastructure for optimum results.
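Because the template is deployed once per account and region, rolling it out to several regions is easy to script. The following is a rough sketch using the AWS SDK for Python (boto3) that builds one CreateStack request per region. The template URL, the `AccountAlias` parameter name, the stack naming scheme, and the set of required capabilities are all assumptions for illustration; check the template's own parameter list before using anything like this.

```python
from typing import Any, Dict, List

# Hypothetical placeholders: substitute the real template URL and the
# parameter names documented for the AWS Observability Solution template.
TEMPLATE_URL = "https://example.com/path/to/aws-observability-template.yaml"
ACCOUNT_ALIAS_PARAM = "AccountAlias"  # assumed parameter name


def build_stack_request(account_alias: str, region: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudFormation's CreateStack call."""
    return {
        # One stack per account and region; this naming scheme is illustrative.
        "StackName": f"sumo-aws-observability-{account_alias}-{region}",
        "TemplateURL": TEMPLATE_URL,
        "Parameters": [
            {"ParameterKey": ACCOUNT_ALIAS_PARAM, "ParameterValue": account_alias},
        ],
        # Assumed: the template creates IAM resources and nested stacks.
        "Capabilities": [
            "CAPABILITY_IAM",
            "CAPABILITY_NAMED_IAM",
            "CAPABILITY_AUTO_EXPAND",
        ],
    }


def deploy(account_alias: str, regions: List[str]) -> List[str]:
    """Create one stack per region and return the resulting stack IDs."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    stack_ids = []
    for region in regions:
        client = boto3.client("cloudformation", region_name=region)
        response = client.create_stack(**build_stack_request(account_alias, region))
        stack_ids.append(response["StackId"])
    return stack_ids
```

`build_stack_request` can be inspected without AWS credentials; `deploy` needs boto3 installed and credentials that are allowed to create CloudFormation stacks in each target region.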
Data Collection for the AWS Observability Solution
Sumo Logic collects logs, metrics, and events including AWS EC2 Host Metrics, CloudWatch logs and metrics, and CloudTrail logs. The collected data streams are enriched with the following metadata:
- Account. This is an alias for your AWS account—for example, production, development, or stage—that you supply when you install the solution.
- Namespace. This is the name of the AWS service, for example, aws/apigateway, aws/applicationelb, aws/dynamodb, aws/lambda, aws/rds, and so on. It is added automatically by either the Host Metrics Source or the AWS Metadata (Tag) Source that the template installs.
- Region. This is the AWS region, for example, us-east-1, us-west-2, and so on.
- Entity. This represents either the AWS resource name or ID, depending on the AWS service being monitored.
This new metadata can also be used in ad-hoc logs and metrics searches.
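Conceptually, enrichment attaches the same four fields to every log line and data point, and an ad hoc search then narrows results by matching on those fields. The sketch below models that idea in plain Python; it is not how Sumo Logic stores or queries data (real searches use Sumo Logic's own query language), and the lowercase field names and sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """A single log line or metric data point plus its metadata fields."""
    payload: str
    metadata: Dict[str, str] = field(default_factory=dict)


def enrich(record: Record, account: str, namespace: str, region: str, entity: str) -> Record:
    """Attach the account, namespace, region, and entity fields to a record."""
    record.metadata.update(
        {"account": account, "namespace": namespace, "region": region, "entity": entity}
    )
    return record


def search(records: List[Record], **selectors: str) -> List[Record]:
    """Return records whose metadata matches every key=value selector."""
    return [
        r
        for r in records
        if all(r.metadata.get(key) == value for key, value in selectors.items())
    ]
```

For example, `search(records, account="production", namespace="aws/lambda")` keeps only Lambda data from the account aliased `production`, which mirrors how the same metadata lets you scope a search to one account, service, or region.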
Understanding the Observability Solution
The Observability Solution offers a unified platform for logs, metrics, traces, and metadata at the following layers:
The solution understands how the different datasets and services are connected, and stitches those relationships into an entity workflow that makes it more intuitive for users to get a holistic view of their service. The workflow also enables easier and faster monitoring, diagnosing, and troubleshooting.
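To illustrate why stitching these relationships together helps, recall the diagnosis case where a downstream consumer shows abnormal metrics while the upstream culprit looks normal. A minimal sketch, assuming a hypothetical service topology rather than anything the Entity Explorer actually exposes, walks from the symptomatic service through everything it depends on (nearest dependencies first) to produce a list of candidate culprits.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical topology: each service maps to the services it consumes
# (its upstream dependencies). Names are illustrative only.
DEPENDS_ON: Dict[str, List[str]] = {
    "web-frontend": ["orders-api"],
    "orders-api": ["orders-table", "payments-fn"],
    "payments-fn": [],
    "orders-table": [],
}


def candidate_culprits(symptomatic: str, graph: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk from a service with abnormal metrics to every service
    it depends on, directly or transitively, ordered nearest first."""
    seen: Set[str] = {symptomatic}
    order: List[str] = []
    queue = deque([symptomatic])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:  # guards against cycles and repeated visits
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order
```

Starting from `web-frontend`, the walk yields `orders-api` before `orders-table` and `payments-fn`, so an SRE would check the nearest dependency first even if its own metrics look fine.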
The solution also offers features and capabilities that support each step of the troubleshooting process.
Monitor your systems effectively with new and improved alerting and dashboarding capabilities. The Observability Solution includes rich pre-built content that you can leverage to quickly start monitoring specific services.
Diagnose issues quickly using features like the Entity Explorer, trace analytics, and the Metrics Explorer.
Troubleshoot issues and find root causes through Behavior insights, Root Cause Explorer, and log search.
Resources created or modified
This section lists the resources that the CloudFormation template creates in AWS and Sumo Logic.
Resources created in AWS
The AWS CloudFormation template creates or modifies the following resources in your AWS account if you are not already collecting data from those AWS services. If you are, the template integrates with your existing collector sources instead.
|AWS Data Source||AWS Resources Created||Applicable AWS Observability Dashboards|
|AWS CloudTrail Logs||S3 Bucket||AWS API Gateway ULM, AWS Lambda ULM, Amazon DynamoDB ULM, Amazon RDS ULM|
|Amazon CloudWatch Metrics||AWS Lambda||AWS API Gateway ULM, AWS Lambda ULM, Amazon DynamoDB ULM|
|Amazon Application Load Balancer logs||S3 Bucket||AWS Application Load Balancer ULM|
|AWS Lambda CloudWatch logs||AWS Lambda||AWS Lambda ULM|
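The create-or-integrate behavior can be modeled as a lookup over the table rows. This is a simplified sketch, not the template's actual logic: it treats each data source as either already collected or not, and it knows only the resources listed in the table.

```python
from typing import Dict, List, Set

# Rows of the AWS resources table: data source -> AWS resources created.
RESOURCES_CREATED: Dict[str, List[str]] = {
    "AWS CloudTrail Logs": ["S3 Bucket"],
    "Amazon CloudWatch Metrics": ["AWS Lambda"],
    "Amazon Application Load Balancer logs": ["S3 Bucket"],
    "AWS Lambda CloudWatch logs": ["AWS Lambda"],
}


def plan(existing_sources: Set[str]) -> Dict[str, List[str]]:
    """Return the resources to create for each data source, skipping sources
    that are already collected (those are integrated with instead)."""
    return {
        source: resources
        for source, resources in RESOURCES_CREATED.items()
        if source not in existing_sources
    }
```

With `plan({"AWS CloudTrail Logs"})`, the CloudTrail row is skipped because an existing source is reused, while the other three data sources still get their resources.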
If you use an existing bucket to collect AWS Application ELB logs, the template updates that bucket's Amazon S3 bucket policy to include the policy below if it does not already exist:
Resources created in Sumo Logic
Running the AWS CloudFormation template creates the following resources in Sumo Logic.
|App folder||Sumo Logic AWS Observability Apps-<Date of installation>|
|Field Extraction Rule||AwsObservabilityFieldExtractionRule|
|Explorer View||AWS Observability|
|CloudWatch logs (HTTP) source||<AccountAlias>-cloudwatch-logs-<AWS::Region>|
|CloudWatch Metrics source||<AccountAlias>-cloudwatch-metrics-<AWS::Region>-ApplicationELB|
|Amazon S3 ALB log source||<AccountAlias>-aws-observability-alb-logs-s3-<AWS::Region>|
|S3 Bucket Name||aws-observability-logs-<StackID>|
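The Sumo Logic names in the table are templates whose placeholders are filled in at install time. The following sketch expands them for a given account alias, region, and stack identifier, which can help when you need to find the created sources afterwards. The helper names are hypothetical, and `stack_id` is passed through unchanged because the exact form of `<StackID>` that the template substitutes is not specified here.

```python
from typing import Dict

# Name templates from the Sumo Logic resources table. <AccountAlias>,
# <AWS::Region>, and <StackID> are placeholders filled in at install time.
NAME_TEMPLATES: Dict[str, str] = {
    "CloudWatch logs (HTTP) source": "<AccountAlias>-cloudwatch-logs-<AWS::Region>",
    "CloudWatch Metrics source": "<AccountAlias>-cloudwatch-metrics-<AWS::Region>-ApplicationELB",
    "Amazon S3 ALB log source": "<AccountAlias>-aws-observability-alb-logs-s3-<AWS::Region>",
    "S3 Bucket Name": "aws-observability-logs-<StackID>",
}


def expand_names(account_alias: str, region: str, stack_id: str) -> Dict[str, str]:
    """Substitute the install-time values into every name template."""
    values = {
        "<AccountAlias>": account_alias,
        "<AWS::Region>": region,
        "<StackID>": stack_id,
    }
    expanded: Dict[str, str] = {}
    for resource, template in NAME_TEMPLATES.items():
        name = template
        for placeholder, value in values.items():
            name = name.replace(placeholder, value)
        expanded[resource] = name
    return expanded
```

For an account aliased `production` in `us-east-1`, the CloudWatch logs source would be named `production-cloudwatch-logs-us-east-1`.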