Everything fails, all the time – Werner Vogels, Amazon
As the Amazon CTO put it, even systems designed with the utmost care and skill may fail. It is therefore important to detect failures through automation and reduce the burden on DevOps and SRE teams. Driven by a high-velocity development lifecycle, developers rely on extensive prebuilt libraries and products to get to market as fast as they can. The onus is on the SREs to keep the service alive and to keep the MTTR (mean time to recover) to a minimum. This becomes a problem when the system is a black box for the SREs and they have to layer observability on top of it. Without knowing the internals, and without complete control over the logging information and metrics, they may sometimes run blindfolded until they learn more about the system and its new issues, improve their playbooks, or build a solution to prevent failures from recurring. That is the human way of solving problems: learning from mistakes.
It is quite a common scenario in a large distributed system: when there is an incident, teams spend a lot of time capturing the right logs, parsing them, and trying to correlate them to find the root cause. Some teams do better; they automate log collection, aggregate the logs into a common platform, and then do the hard work of searching through the ocean of log data with tools like Elastic or Splunk. That works fine when you understand the log structure and all the components and know what to look for. But as mentioned above, it is really hard to keep the data structure consistent over a long time across all components. Most current log monitoring and collection tools only collect logs to a central place, parse the unstructured data, let you search or filter, and show visualizations or trends. What if the system generates a new type of log data or a pattern you have not automated or planned for in advance? It becomes a problem.
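To illustrate why rule-based parsing keeps breaking, here is a minimal Python sketch (the log formats are made up): a hand-written regex handles the format you planned for, but returns nothing useful the moment a component starts emitting a new shape of log line.

```python
import re

# A hand-written pattern for one known log format (hypothetical example).
# It captures timestamp, level, component, and message from lines like:
#   2021-03-04T10:15:32Z ERROR payment-svc connection refused to db:5432
KNOWN_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>\S+)\s+(?P<msg>.*)$"
)

def parse(line: str):
    """Return structured fields, or None if the line doesn't match."""
    m = KNOWN_PATTERN.match(line)
    return m.groupdict() if m else None

# Works for the format we planned for...
print(parse("2021-03-04T10:15:32Z ERROR payment-svc connection refused to db:5432"))

# ...but silently fails when a dependency starts emitting JSON logs instead.
print(parse('{"time":"2021-03-04T10:15:33Z","severity":"error","msg":"oom killed"}'))
```

Every new log shape means another pattern to write, test, and maintain, which is exactly the treadmill described above.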
That is the point at which you really need autonomous machine learning to scale.
Automation is the key to detecting such incidents and anomalies in the system, and to proactively preventing as many as possible, reducing the chances of failure and improving recovery time. – Google SRE Handbook
Typically, when an incident occurs, support engineers manually peek into the ocean of logs and metrics to find interesting errors and warnings, and then start correlating the various observations to arrive at a root cause. This is a painfully slow process in which a lot of time is wasted. This is where Zebrium's machine learning capabilities help, by automatically correlating issues observed in the logs and metrics of various components to predict the root cause.
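The correlation itself is conceptually simple but tedious to do by hand. The toy Python sketch below (not Zebrium's algorithm, just an illustration with made-up events) groups error events from different components that occur close together in time, which is roughly what an engineer does mentally during triage.

```python
from datetime import datetime, timedelta

# Toy events (component, timestamp, message) pulled from different services.
events = [
    ("db",       datetime(2021, 3, 4, 10, 15, 30), "too many connections"),
    ("payments", datetime(2021, 3, 4, 10, 15, 32), "connection refused to db:5432"),
    ("frontend", datetime(2021, 3, 4, 10, 15, 35), "HTTP 500 on /checkout"),
    ("frontend", datetime(2021, 3, 4, 11, 2, 10),  "slow response on /search"),
]

def correlate(events, window=timedelta(seconds=60)):
    """Group events that occur within `window` of each other: a crude
    stand-in for the cross-component correlation an engineer does by hand."""
    events = sorted(events, key=lambda e: e[1])
    groups, current = [], [events[0]]
    for ev in events[1:]:
        if ev[1] - current[-1][1] <= window:
            current.append(ev)
        else:
            groups.append(current)
            current = [ev]
    groups.append(current)
    return groups

for group in correlate(events):
    print([f"{c}: {m}" for c, _, m in group])
```

Doing this across thousands of event types and dozens of services is where a human runs out of time and an unsupervised model does not.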
The Zebrium autonomous log and metrics monitoring platform uses machine learning to catch software incidents and show IT and cybersecurity teams the root cause. It is designed to be used with any application and is known for its ease of use and quick set-up; customers say the system often delivers initial results within hours of being installed. Unlike traditional monitoring and log management tools that require complex configuration and tuning to detect incidents, Zebrium's unsupervised machine learning approach requires no manual configuration or human training.
[Zebrium Dashboard]
Zebrium aggregates logs and metrics and makes them searchable with filters, easy navigation, and drill-down. It also lets you build alert rules, though most of the time you won't have to. It uses unsupervised machine learning to autonomously learn the implicit structure of the log messages, and then cleanly organizes the content of each event type into tables with typed columns, which is perfect for fast and rich queries, reliable alerts, and high-quality pattern learning and anomaly detection. Most importantly, it uses machine learning to automatically catch problems and show you the root cause without you having to manually build any rules.
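To get a feel for what "learning the implicit structure" means, here is a toy Python sketch (purely illustrative, not Zebrium's implementation): the variable parts of each message are replaced with typed placeholders, so lines of the same event type collapse into one template and the extracted values become its columns.

```python
import re
from collections import defaultdict

LINES = [
    "user 1042 logged in from 10.0.0.7",
    "user 7 logged in from 10.0.3.9",
    "disk /dev/sda1 at 91% capacity",
    "disk /dev/sdb1 at 87% capacity",
]

def template(line: str):
    """Replace IPs, paths, and numbers with typed placeholders; the remaining
    skeleton identifies the event type, the captured values become columns."""
    values = []
    def sub(kind):
        def repl(m):
            values.append((kind, m.group(0)))
            return f"<{kind}>"
        return repl
    line = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", sub("ip"), line)
    line = re.sub(r"/[\w/]+", sub("path"), line)
    line = re.sub(r"\b\d+\b", sub("num"), line)
    return line, values

tables = defaultdict(list)
for raw in LINES:
    event_type, columns = template(raw)
    tables[event_type].append(columns)

for event_type, rows in tables.items():
    print(event_type, rows)
```

The four raw lines collapse into two event types, each with a small table of typed values, which is the kind of structure that makes queries, alerts, and anomaly detection tractable.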
You can learn more about how Zebrium works here.
Zebrium provides various types of log collectors that can pull logs from Kubernetes, Docker, Linux, ECS, Syslog, AWS CloudWatch, and any type of application.
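For the "any type of application" case, shipping logs conceptually means forwarding each event to a collector endpoint over HTTP. The Python sketch below shows the general idea; the endpoint URL, token header, and payload fields are placeholders rather than Zebrium's actual ingest API, so consult the Zebrium documentation for a real deployment.

```python
import json
import logging
import urllib.request

# NOTE: the URL and auth header below are hypothetical placeholders; use the
# endpoint and token from your collector's documentation.
COLLECTOR_URL = "https://collector.example.com/ingest"
API_TOKEN = "YOUR_TOKEN"

class HttpForwardingHandler(logging.Handler):
    """Minimal handler that forwards each log record to an HTTP collector."""
    def emit(self, record):
        payload = json.dumps({
            "severity": record.levelname,
            "logger": record.name,
            "message": self.format(record),
        }).encode("utf-8")
        req = urllib.request.Request(
            COLLECTOR_URL,
            data=payload,
            headers={"Content-Type": "application/json",
                     "Authorization": f"Token {API_TOKEN}"},
        )
        try:
            urllib.request.urlopen(req, timeout=5)
        except OSError:
            self.handleError(record)

log = logging.getLogger("payments")
log.setLevel(logging.INFO)
log.addHandler(HttpForwardingHandler())
log.info("service started")
```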
Zebrium also provides good integration with existing Elastic Stack (ELK Stack) clusters; you can even view the Zebrium incident dashboard inside Kibana.
Zebrium’s Autonomous Incident & Root Cause Detection works in two modes:
A special class of integrations relates to this second mode, including OpsGenie, PagerDuty, VictorOps, and Slack. Furthermore, the Zebrium integration can be extended to any custom application using webhooks.
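On the receiving side, a custom webhook integration is just an HTTP endpoint that accepts the incident payload. The following minimal Python sketch listens for such notifications; the payload field names used here are assumptions rather than Zebrium's exact schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class IncidentWebhook(BaseHTTPRequestHandler):
    """Accept incident notifications POSTed by the monitoring platform."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        incident = json.loads(self.rfile.read(length) or b"{}")
        # Field names below are illustrative; check the webhook docs for the
        # exact schema your integration sends.
        print("incident:", incident.get("title"), incident.get("url"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), IncidentWebhook).serve_forever()
```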
Machine intelligence is the key to automating and scaling in a large enterprise environment; it can reduce operational cost by reducing the load on DevOps/SREs and by lowering MTTR, which can radically transform the business. With the unsupervised learning algorithm used by Zebrium, it becomes easier to find correlations between incidents and failures in the log data and metrics without requiring human effort. Zebrium provides simplified onboarding, which requires no configuration changes in the application and no human training, and an easy-to-navigate UI. It is an appealing next-generation choice in the space of autonomous log and metric management platforms.
Please try their free version to experiment with the autonomous machine learning algorithm on your own log data, and let us know your thoughts on autonomous log monitoring.
Looking for help with observability stack implementation and consulting? Do check out how we're helping startups and enterprises as an observability consulting services provider.