General concepts
Introduction
co.brick observe continuously monitors remote systems. During the runtime of an application, various attributes of the system change:
- traffic,
- workload,
- the system is scaled up and down,
- new components are deployed,
- configuration is changed.
Each of these situations may be normal and expected, or may signal a problem. For example:
- scaling up due to increasing traffic is probably desired,
- increasing memory consumption with no additional workload may signal a problem.
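The contrast between these two examples can be sketched as a toy heuristic. This is only an illustration and not how co.brick observe classifies behavior: the function names, the labels, and the `tolerance` threshold are all hypothetical, and a real classifier would need far more context.

```python
def relative_change(before: float, after: float) -> float:
    """Return the fractional change from `before` to `after`."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


def classify_memory_growth(
    workload_before: float,
    workload_after: float,
    memory_before: float,
    memory_after: float,
    tolerance: float = 0.2,  # hypothetical threshold, chosen only for this sketch
) -> str:
    """Label memory growth by comparing it with workload growth.

    Returns "normal" when memory barely changed, "expected" when the
    workload grew roughly as much as memory did, and "suspicious" when
    memory grew without a matching increase in workload.
    """
    workload_change = relative_change(workload_before, workload_after)
    memory_change = relative_change(memory_before, memory_after)

    if memory_change <= tolerance:
        return "normal"
    if workload_change >= memory_change - tolerance:
        return "expected"
    return "suspicious"
```

With these assumptions, doubling both workload and memory yields `"expected"`, while doubling memory under a constant workload yields `"suspicious"`.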
Automatically detecting and correctly classifying every possible scenario is complex: it may require knowledge of the system's internal implementation and an understanding of the industry in which the application operates (e.g. e-commerce has different characteristics from online video streaming). co.brick observe is capable of automating this classification and can report abnormal behavior.
co.brick observe defines three types of Issues: Incidents, Alerts, and Anomalies, each with different characteristics and purposes.
Among other attributes, all issues have the following key features:
Attribute | Purpose | Description |
---|---|---|
Severity | Defines how much damage the issue causes. | For Reliability issues, high severity may mean that a service is unresponsive. For Security issues, it may indicate a high score for a recognized vulnerability. |
Priority | Defines how urgent it is to resolve the issue. | Issues of high severity may get low priority when special circumstances make it acceptable to delay the fix. For example, a highly vulnerable library method that is never called poses low risk. |
Category | Classifies the issue into a single, leading group. | Issues may be related to runtime problems (reliability), security threats, invalid configuration, and other, less frequent groups. Some issues may belong to more than one group; in this case, the field defines the single category that best describes the area of the issue. |
Status | Current status of the issue. | Allows distinguishing ongoing issues from resolved ones. |
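
The four attributes above can be pictured as a small data model. The sketch below is an assumption made for illustration, not the co.brick observe API: the class names, the three-step `Level` scale, and the `triage_order` helper are all hypothetical. It shows why severity and priority are kept separate: two issues of equal severity can be handled in a different order.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Hypothetical three-step scale used for both severity and priority."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Category(Enum):
    """The single leading group of an issue (names assumed for this sketch)."""
    RELIABILITY = "reliability"
    SECURITY = "security"
    CONFIGURATION = "configuration"


class Status(Enum):
    ONGOING = "ongoing"
    RESOLVED = "resolved"


@dataclass
class Issue:
    title: str
    severity: Level      # how much damage the issue causes
    priority: Level      # how urgent it is to resolve
    category: Category   # exactly one category, even if several apply
    status: Status = Status.ONGOING


def triage_order(issues):
    """Return ongoing issues, most urgent first; severity breaks priority ties."""
    ongoing = [issue for issue in issues if issue.status is Status.ONGOING]
    return sorted(
        ongoing,
        key=lambda issue: (issue.priority.value, issue.severity.value),
        reverse=True,
    )


# Both issues are highly severe, but the vulnerable method is never called,
# so it receives a low priority and is handled after the outage.
issues = [
    Issue("vulnerable library method, never called", Level.HIGH, Level.LOW, Category.SECURITY),
    Issue("service unresponsive", Level.HIGH, Level.HIGH, Category.RELIABILITY),
]
for issue in triage_order(issues):
    print(issue.priority.name, issue.title)
```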
Incident
Incidents are actual, detected problems in the system that may lead to a significant loss of functionality or that are severe enough to need attention from the DevOps team. For example:
- a service becomes unresponsive and external users cannot effectively use it,
- a new vulnerability was discovered in a library used in the system,
- sensitive data is leaking into logs.