General concepts

Introduction

co.brick observe continuously monitors remote systems. During the runtime of an application, various attributes of the system change:

  • traffic fluctuates,
  • workload varies,
  • the system is scaled up and down,
  • new components are deployed,
  • configuration changes.

Each of these situations may be normal and expected, or it may signal a problem. For example:

  • scaling up due to increasing traffic is probably desired,
  • increasing memory consumption with no additional workload may signal a problem.

Automatic detection and proper classification of all possible scenarios are complex and may require knowledge of the system's internal implementation and an understanding of the specifics of the industry in which the application operates (e.g. e-commerce has different characteristics from online video streaming). co.brick observe can automate this classification and report abnormal behavior.

co.brick observe defines three types of Issues: Incidents, Alerts, and Anomalies. Each has different characteristics and purposes.

Among other attributes, all issues have the following key features:

| Attribute | Purpose | Description |
| --- | --- | --- |
| Severity | Defines how big the damage done by the issue is. | For Reliability issues, high severity may mean that a service is unresponsive. For Security issues, it may indicate a high score of a recognized vulnerability. |
| Priority | Defines how urgent it is to resolve the issue. | Issues of high severity may get low priority when special circumstances allow waiting with the fix. For example, a highly vulnerable method of a library that is never called poses low risk. |
| Category | Classifies the issue into a single, leading group. | Issues may be related to runtime problems (reliability), security threats, invalid configuration, and other, less frequent groups. Some issues may belong to more than one group; in this case, the field defines the single category that best denotes the area of the issue. |
| Status | Current status of the issue. | Allows distinguishing ongoing issues from resolved ones. |
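To make these attributes more concrete, the sketch below models an Issue as a plain data shape. It is an illustration only: the field names, enum values, and the use of TypeScript are assumptions and do not reflect co.brick observe's actual data model or API.

```typescript
// Hypothetical sketch of an Issue's key attributes, for illustration only.
// Names and values are assumptions, not co.brick observe's actual API.

type IssueType = "Incident" | "Alert" | "Anomaly";

type Severity = "low" | "medium" | "high" | "critical";
type Priority = "low" | "medium" | "high";
type Category = "reliability" | "security" | "configuration" | "other";
type Status = "ongoing" | "resolved";

interface Issue {
  type: IssueType;    // Incident, Alert, or Anomaly (see the sections below)
  severity: Severity; // how big the damage done by the issue is
  priority: Priority; // how urgent it is to resolve the issue
  category: Category; // the single group that best denotes the area of the issue
  status: Status;     // distinguishes ongoing issues from resolved ones
}

// Example: a severe but low-priority security issue
// (a highly vulnerable library method that is never called).
const example: Issue = {
  type: "Incident",
  severity: "high",
  priority: "low",
  category: "security",
  status: "ongoing",
};
```

Note how severity and priority are kept independent: the example above is severe on its own but not urgent to resolve.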

Incident

Incidents are actual, detected problems that may lead to a significant loss of functionality, or severe problems that need attention from the DevOps team. For example:

  • a service becomes unresponsive and external users cannot effectively use it,
  • a new vulnerability was discovered in a library used by the system,
  • sensitive data is leaking into logs.

Anomaly

Anomalies are reported whenever the behavior of the system deviates from its standard characteristics. Usually, this happens when metrics reach non-standard values (peaks, flat periods, significant deviations from the average).

Anomalies are usually large in number, and analyzing them individually would be overkill. They become interesting in the context of an actual issue that DevOps needs to resolve. In addition, multi-dimensional analysis that combines anomalies with other signals (e.g. events such as deployments or configuration changes) may lead to reporting Incidents related to one or many anomalies.
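To illustrate what "significant deviation from the average" can mean in practice, here is a minimal, generic sketch of a rolling-window deviation check. It is not co.brick observe's detection logic; the function name, window representation, and threshold are assumptions for illustration only.

```typescript
// Generic illustration: flag a metric value that deviates strongly from the
// mean of a recent window, measured in standard deviations. This is one common
// technique, not co.brick observe's actual anomaly-detection algorithm.

function isAnomalous(window: number[], latest: number, threshold = 3): boolean {
  const mean = window.reduce((sum, v) => sum + v, 0) / window.length;
  const variance =
    window.reduce((sum, v) => sum + (v - mean) ** 2, 0) / window.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) {
    // A perfectly flat window: any change at all is unusual.
    return latest !== mean;
  }
  // Flag values that are more than `threshold` standard deviations away.
  return Math.abs(latest - mean) / stdDev > threshold;
}

// Example: memory usage holds steady around 512 MB, then jumps to 900 MB.
console.log(isAnomalous([510, 512, 514, 511, 513], 900)); // true
```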

Alert

Alerts are raised whenever specific conditions are met. The conditions are defined in Alert Definitions and should reflect specific aspects of the system that are known to its operators. DevOps may be investigating a known problem and want to be alerted when it reappears, or SLAs/SLOs may be violated and immediate action is needed. Usually, alerts are important because they reflect the actual needs of the DevOps team or are related to preconfigured best practices. Alerts are not directly related to incidents; in some cases, an incident may be auto-created along with an alert.
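As an illustration of how a condition-based alert could be expressed and evaluated, here is a hedged sketch. The shape below (field names, comparators, the `createIncident` flag) is a hypothetical example and not co.brick observe's actual Alert Definition format.

```typescript
// Hypothetical sketch of a condition-based alert: a named threshold condition
// over a metric that raises an Alert when violated. Structure and names are
// illustrative assumptions, not co.brick observe's Alert Definition format.

interface AlertDefinition {
  name: string;
  metric: string;                      // which metric the condition applies to
  comparator: ">" | "<" | ">=" | "<=";
  threshold: number;                   // e.g. an SLO boundary
  createIncident: boolean;             // some alerts may auto-create an incident
}

function evaluate(def: AlertDefinition, value: number): boolean {
  if (def.comparator === ">") return value > def.threshold;
  if (def.comparator === "<") return value < def.threshold;
  if (def.comparator === ">=") return value >= def.threshold;
  return value <= def.threshold; // "<="
}

// Example: alert when the p99 request latency exceeds a 500 ms SLO.
const latencySlo: AlertDefinition = {
  name: "p99 latency above SLO",
  metric: "http.request.latency.p99",
  comparator: ">",
  threshold: 500,
  createIncident: true,
};

console.log(evaluate(latencySlo, 640)); // true -> an Alert would be raised
```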