How Database Observability Increases Operational Reliability

Original source post by Jonathan Kehayias

Early on in my career as a DBA, I began to realize I wasn’t a Database Administrator as much as the Default Blame Acceptor for nearly any application problem. Let’s face it—the first thing blamed when there’s an issue is typically going to be the database, and it’s a “guilty until proven otherwise” situation. More specifically, when something is “slow” or the performance isn’t what’s expected, the database is often one of the first suspects.

Years ago, virtualization and storage drew the same blame just as frequently, though changes over time have reduced the finger-pointing at those areas. What's truly interesting is that even though data sizes have grown exponentially over time, performance expectations have only gotten tighter, with shorter acceptable response times.

If you were to ask your business how fast a given process should run, what would the answer be? What about how much of the hardware they should be utilizing? Have you ever considered how the answers to these questions have changed over time, or, even more importantly, how they might change in the future?

I personally like to think of my job as future-proofing systems against unexpected problems, but this isn't always possible, since circumstances and situations nobody would ever have imagined can always pop up. Sometimes the evaluation of cost vs. benefit vs. risk also determines the risk is incredibly low compared with the cost of eliminating it and the potential benefit doing so might provide. But how exactly do you quantify the various portions of this equation?

Here’s what the Army Reserve taught me about cost vs. benefit vs. risk

To put it simply, to calculate costs compared to the benefit and risk, you must have a full view of what’s happening.

In addition to being an IT consultant, I've also been a member of the Army Reserve for more than half of my life. Almost every aspect of military operations, both in peacetime and wartime, is governed by the same basic evaluations I mentioned above. Risk assessments are a part of every decision made, every class taught, and practically every event the military conducts (even morning physical training). Everything has a risk assessment and an after-action review, applying past lessons learned to minimize the risk of loss, degraded performance, or failure of the mission or task at hand.

How do the risks get identified, though? It's based on the mission/task at hand. It changes with the environment. What was potentially a low risk in garrison could be a medium or high risk in the field. Sometimes, it's based entirely on a review of others' failures and not on direct experience. In a nutshell, it's complex and requires full observability built over time and across multiple areas of review, much like the world of IT.

There’s an old proverb that says, “You are only as strong as your weakest link.” In combat, the most important asset isn’t the infantryman on the front lines or the artillery providing indirect fire—it’s the supply lines necessary to replenish those other units when their initial supplies carried into the field become exhausted.

This has been proven time and time again. One could argue that without access to the data stored in a database, the best-written application in the world can't function. Imagine all the code, compute power in the cloud, or microservices architectures in the middle tier unable to accomplish their mission because they aren't being supplied with the data necessary to perform their operations.

It isn’t effective in combat to simply think supplies will be there when you need them; there’s a huge infrastructure built around projecting the requirements, determining how to get them where they should be efficiently, and monitoring those activities to ensure the expected outcome occurs.

Sometimes, it only takes a minor disruption in the flow to create catastrophic results. Something as simple as a flat tire on a vehicle could result in hours of delays. Often, there’s a plan designed around such situations with the expectation of minimizing the impact of any single event or failure, but this isn’t always the case. For example, if there’s a convoy carrying materials to the front lines and a vehicle breaks down, does the entire convoy stop while it recovers, or does the vehicle sit to the side while everything else continues past it?

Challenges to building an operational plan without observability

High availability and disaster recovery are essential parts of any operational plan for database observability. Redundancy is only as good as the ability to predict the likely points of failure. Redundancy goes beyond just having more than one of whatever is mission essential—it also includes strategic placement, so the loss of all mission-essential assets doesn’t happen through a single event. Often, this involves the ability to detect something has failed to operate as desired and switch to the alternative/backup strategy previously built into the plan.

Within IT, this is often based on service availability or nonavailability. But I can't even begin to count the number of times a database server has been up and able to respond to connection requests and simple queries, yet unable to process the application workload because of a condition severely impacting performance, and nobody knew what was going on. Worse, when someone looked at the system, there was no concept of what was normal or abnormal under the current conditions.
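One way to close this gap is a health probe that exercises a workload-representative query and compares its latency against a known baseline, rather than merely testing connectivity. Below is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for a real server; the `orders` table, the baseline figure, and the threshold multiplier are all illustrative assumptions, not anything from the original post:

```python
import sqlite3
import time

# Illustrative baseline latency (seconds) for the probe query under normal
# conditions, and how many times slower we tolerate before calling it degraded.
BASELINE_SECONDS = 0.05
DEGRADED_MULTIPLIER = 5

def probe(conn):
    """Run a workload-representative query and classify the result.

    Returns "healthy", "degraded", or "down" instead of the simple
    up/down answer a bare connection test gives you.
    """
    try:
        start = time.monotonic()
        conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        elapsed = time.monotonic() - start
    except sqlite3.Error:
        return "down"
    if elapsed > BASELINE_SECONDS * DEGRADED_MULTIPLIER:
        return "degraded"
    return "healthy"

# Demo: an in-memory database standing in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
print(probe(conn))
```

In practice, the probe query and thresholds would come from your own observed normal conditions, which is exactly why knowing what "normal" looks like matters.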

This sort of thing is incredibly haphazard by design, but as one of my favorite developer sayings went, “Faster is funnier.” The faster things go, the funnier the solutions implemented are likely to become. I’ve worked in systems where it was a spaghetti-tangled nightmare of Band-Aid on top of Band-Aid, sometimes littered with “TO DO” comments all over the code and personal anecdotes and rhetoric on almost every other line. I’m certain I’m not the only person to have reviewed code with comments referencing a ticketing system that no longer exists!

Why reporting, monitoring, and automation are critical to database observability

The ability to obtain situation reports (sitreps) in real time during operations is essential for the flexibility to react to changing conditions as they develop. In the modern, cloud-based operating environment, the ability to monitor every aspect of a solution is critical. In certain environments, the ability to predictively scale up and down resources effectively based on usage trends is incredibly important for minimizing costs.

Though the big platforms have recently implemented automated solutions for this, there are limitations on how much data is retained for automating such predictions. Additionally, though it's nice this feature exists, there's still a case for maintaining operational performance metrics separately in a database for multiple purposes. Cost projections of year-over-year growth can only be accomplished when you can see the trends in changes over time. What if the company were to acquire another company in a merger? How would this affect the resource requirements? Unfortunately, for lack of supporting data to make projections, people often just make an educated guess at what might happen, even if the data gap exists on only one side of the merger.
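To make the growth-projection point concrete, here is a hedged sketch of the kind of trend arithmetic that retained metrics make possible. The monthly figures are invented for illustration, and a real projection would also account for seasonality and business events such as the merger scenario above:

```python
# Hypothetical monthly storage usage in GB, retained in our own metrics
# store well beyond the platform's built-in retention window.
monthly_usage = [400, 410, 425, 430, 450, 460, 480, 490, 505, 520, 540, 560]

def projected_usage(usage, months_ahead=12):
    """Project future usage from average month-over-month growth.

    A deliberately simple model: take the mean growth ratio between
    consecutive months, then compound it forward.
    """
    ratios = [later / earlier for earlier, later in zip(usage, usage[1:])]
    avg_ratio = sum(ratios) / len(ratios)
    return usage[-1] * avg_ratio ** months_ahead

print(round(projected_usage(monthly_usage)))  # roughly 800 GB a year out
```

None of this is possible if the raw history was discarded after a 60- or 90-day retention window, which is the whole argument for keeping your own copy.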

Have you ever taken the time to consider the effectiveness of the situation reporting capabilities of your existing operational environment? During an emergency or in a time-constrained situation, it's critical to communicate effectively, relay accurate data, and deliver essential information.

Would your current solution be able to accomplish this? It's not just about having the data available—it's equally important for you to be able to analyze the data rapidly and effectively so you can take timely, appropriate actions. Good operational data collection and analysis is commonly a project entirely independent of existing business projects.

Importance of AIOps for creating actionable data insights

At SQLskills, we utilize several cloud-based solutions for streaming our course content, and as a part of maintaining our service levels for this solution, we collect detailed monitoring data on usage, cache hits and misses, and errors across multiple service regions. Initially, we utilized the built-in monitoring and usage reports available on the platform, but we quickly realized the 60-day data retention period wasn't going to be sufficient for longer-term trending, so we implemented our own solution where we control the retention policy simply by logging metrics to flat text files. Having the raw data in text files is great for detailed troubleshooting after a problem occurs, but it doesn't automatically inform us when problems are likely.

For the data to become operational, we must have automated processes to load, aggregate, and review the data against previous trends. This is where artificial intelligence (AI) and machine learning become important tools in data analysis. Our reporting ability for our content streaming goes as far back as two months after the initial launch on our website, when data logging was implemented. This became a separate project of its own due to the need to troubleshoot problems as we grew. Today, most issues can be resolved quickly based on the information available.
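A minimal version of the "review against previous trends" step can be sketched with a simple statistical baseline. This is only an illustrative stand-in for the richer ML-driven analysis described above, and the cache-miss numbers are invented:

```python
import statistics

def is_anomalous(history, latest, threshold=3.0):
    """Flag the latest value if it deviates more than `threshold`
    standard deviations from the historical mean.

    A crude stand-in for trend review; production systems would use
    seasonal baselines or learned models rather than a flat window.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# Hypothetical daily cache-miss rates loaded from our flat-file logs.
cache_miss_rates = [0.02, 0.03, 0.025, 0.028, 0.022, 0.027, 0.024]
print(is_anomalous(cache_miss_rates, 0.26))   # → True: ten-fold spike
print(is_anomalous(cache_miss_rates, 0.026))  # → False: within normal range
```

The value of even this crude check is that it runs unattended: the data informs you, instead of waiting for someone to go read the logs.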

In my experience, it’s relatively easy to configure data collection, but this is unfortunately about as far as most businesses get due to the cost and time involved in building out a solution around the data to make it operational. From the outset, the number of sources, different structures and formats, and sheer volume of information seems daunting to tackle. Often, the different logging formats create a Tower of Babel scenario where each component has its own distinct language, and nothing says the same thing the same way.
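The practical answer to the Tower of Babel problem is a normalization layer that maps every source format into one common schema before aggregation. The sketch below assumes two invented formats, a space-delimited CDN access line and a JSON application event, purely for illustration; every field name here is hypothetical:

```python
import json

def normalize_cdn_line(line):
    """Parse a hypothetical space-delimited CDN log line, e.g.
    '2024-05-01T12:00:00Z GET /video.mp4 200 HIT'."""
    ts, method, path, status, cache = line.split()
    return {"timestamp": ts, "source": "cdn", "path": path,
            "status": int(status), "cache_hit": cache == "HIT"}

def normalize_app_event(raw_json):
    """Parse a hypothetical JSON application event into the same schema."""
    event = json.loads(raw_json)
    return {"timestamp": event["time"], "source": "app",
            "path": event["route"], "status": event["httpStatus"],
            "cache_hit": None}

# Once everything speaks one schema, cross-component analysis is trivial.
records = [
    normalize_cdn_line("2024-05-01T12:00:00Z GET /video.mp4 200 HIT"),
    normalize_app_event(
        '{"time": "2024-05-01T12:00:01Z", "httpStatus": 500, "route": "/checkout"}'
    ),
]
errors = [r for r in records if r["status"] >= 500]
print(len(errors))  # → 1
```

Each new component then costs one small adapter function rather than a rewrite of every report, which is what keeps the volume and variety from becoming unmanageable.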

How SolarWinds can help drive database observability

The SolarWinds® Platform is created with OpenTelemetry (OTEL) support to collect data from thousands of devices, including applications written in any of the various .NET languages, Java, PHP, Node.js, or Ruby and running on Linux, Windows, or Azure. The same holds true for data from MongoDB, MySQL, PostgreSQL, and SQL Server database instances; infrastructure hosts in the cloud; on-premises environments; and Kubernetes containers. With OTEL, you can easily integrate and centralize these data points in your observability solution.

With SolarWinds observability solutions, each asset in your IT infrastructure is designed to become a monitorable entity within a single view. Built with fully customizable dashboards for real-time situational reporting, these solutions can deliver unified application and infrastructure insights for deeper visibility across your entire technology stack.

The post How Database Observability Increases Operational Reliability appeared first on Orange Matter.
