Episode 48 — Deploy with Configuration Management Using Drift Detection and Remediation

In this episode, we’re going to connect deployment to something that matters even more than the first successful rollout: keeping systems correct after they’ve been deployed. Configuration management is the practice of defining what a system should look like and then ensuring, continuously or on a regular schedule, that it stays that way. The reason this matters in automation-heavy operations is simple: environments drift. Drift happens when reality slowly moves away from what you intended, sometimes because of manual fixes during an incident, sometimes because of updates that change defaults, and sometimes because of quiet environmental differences that accumulate over time. Drift detection is the ability to notice those differences reliably, and remediation is the ability to correct them safely. When you combine the two, you get an operational outcome that beginners often don’t realize is possible: re-running your deployment logic can act like a maintenance and repair tool rather than a risky event. The core idea is that your deployment isn’t just a moment; it’s a relationship between desired state and actual state that must be managed over time.
To make this approachable, think of configuration management as a contract you write between the system and your intent. The contract includes things like which software components should be present, which services should be running, which configuration values should be set, and which access controls should be in place. The important difference between configuration management and a one-time setup script is that configuration management expects reruns and expects imperfection. Instead of assuming the machine is new and clean, it assumes the machine might already exist, might have differences, and might have been touched by other processes. That assumption changes how you design deployments because you stop thinking in terms of “do these steps once” and start thinking in terms of “ensure these conditions are true.” In operations, this is what creates calm: if something drifts, you don’t need a custom emergency script, because you can reapply the known-good contract. Drift detection is the system comparing the current state to that contract, and remediation is applying the smallest set of changes needed to bring reality back into alignment.
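If it helps to see the “ensure these conditions are true” idea in code, here is a minimal Python sketch. It assumes the machine's configuration can be modeled as a plain dictionary, and the function name `ensure_setting` and the `max_connections` key are hypothetical, chosen only for illustration. The point is that the function checks reality first and changes something only when it has to, so a rerun is harmless.

```python
def ensure_setting(config: dict, key: str, value: str) -> bool:
    """Make sure config[key] equals value. Return True only if a change was made."""
    if config.get(key) == value:
        return False  # already correct, so do nothing
    config[key] = value
    return True


# Hypothetical actual state of a machine that already exists and has been touched.
actual = {"max_connections": "50"}

first_run = ensure_setting(actual, "max_connections", "100")   # corrects the value
second_run = ensure_setting(actual, "max_connections", "100")  # finds nothing to do

print(first_run, second_run, actual)
```

A one-time setup script would simply write the value every time. The “ensure” version also reports whether anything changed, which is the raw material for drift detection.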
Drift is not always malicious or careless, and that’s worth emphasizing because beginners sometimes imagine drift as someone breaking rules. Drift can be caused by routine maintenance, like a patch that adjusts a configuration file, a package update that changes a dependency version, or a reboot that changes service start behavior. Drift can also be caused by environment variability, like differences between operating system images, inconsistent defaults, or timing issues in provisioning. It can even come from well-intentioned manual actions, like an administrator changing a setting to restore service quickly during an outage. The operational problem is not that drift exists; the problem is that drift accumulates and becomes invisible until it causes a failure. Configuration management makes drift visible and manageable by turning it into a measurable difference between desired and actual. Once drift is measurable, you can design safe remediation instead of guessing what changed.
A systematic drift detection mindset starts with deciding what you consider to be the desired state, because you cannot detect drift without a clear target. Desired state is more than a list of settings; it includes the boundaries of what matters and what doesn’t. For example, you might care deeply that a security feature is enabled, but you might not care about the order of lines in a configuration file as long as the effective values are correct. If your drift detection treats harmless differences as critical, you’ll create noise, and noisy drift detection leads to alert fatigue and unnecessary remediation. On the other hand, if you ignore important differences, you’ll miss real drift and lose the main benefit of configuration management. The operational sweet spot is to define desired state at a level that supports stability and security while avoiding needless churn. When you define this well, drift detection becomes a reliable signal rather than a constant nuisance.
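As a rough illustration of comparing effective values rather than raw text, the Python sketch below parses a simple `key = value` file into a dictionary before comparing. The file format, the function names `parse_effective` and `has_drift`, and the settings `tls_enabled` and `log_level` are all assumptions made up for this example, not a real tool's format.

```python
def parse_effective(text: str) -> dict:
    """Reduce a key=value file to its effective settings, ignoring comments,
    blank lines, spacing, and line order."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def has_drift(desired_text: str, actual_text: str) -> bool:
    """Drift means the effective values differ, not that the bytes differ."""
    return parse_effective(desired_text) != parse_effective(actual_text)


desired_text = "tls_enabled = true\nlog_level = info\n"
reordered_text = "# edited by hand\nlog_level=info\ntls_enabled=true\n"
disabled_text = "log_level=info\ntls_enabled=false\n"

print(has_drift(desired_text, reordered_text))  # harmless reordering
print(has_drift(desired_text, disabled_text))   # security feature turned off
```

A byte-for-byte comparison would flag both files. Comparing effective values flags only the one where the security feature was actually disabled, which is the difference between a useful signal and noise.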
Once desired state is clear, the next key is understanding how drift is observed. Drift detection can be done by querying the system for its current configuration, by inspecting files and settings, by checking service status, or by comparing recorded state to discovered state. The important operational principle is that detection should be based on evidence, not on assumptions about what past runs did. This is why idempotency and state awareness matter so much in deployment automation, because your system should be able to look at reality and decide what action is needed. A strong drift detection approach tells you not only that something differs, but how it differs and how confident you should be in that observation. For beginners, a helpful mental note is that good detection is specific, meaning it points to a concrete mismatch rather than saying “something is wrong.” Specific detection makes remediation safer because you can target the fix to the actual gap, not to a broad area where you hope the fix might land.
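To make “specific detection” concrete, here is a small Python sketch that returns one record per mismatch instead of a single yes or no. It assumes desired and actual state are both available as dictionaries. The `Mismatch` class, the `detect_drift` function, and the setting names are hypothetical. It checks only keys that appear in the desired state, on the assumption that anything else is outside the boundary of what matters.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Mismatch:
    key: str
    expected: str
    actual: Optional[str]  # None means the setting is missing entirely


def detect_drift(desired: Dict[str, str], actual: Dict[str, str]) -> List[Mismatch]:
    """Compare observed state to desired state and name each concrete gap."""
    mismatches = []
    for key in sorted(desired):
        observed = actual.get(key)
        if observed != desired[key]:
            mismatches.append(Mismatch(key, desired[key], observed))
    return mismatches


desired = {"sshd.PermitRootLogin": "no", "ntp.enabled": "true"}
actual = {"sshd.PermitRootLogin": "yes", "ntp.enabled": "true", "motd": "hello"}

report = detect_drift(desired, actual)
for item in report:
    print(f"{item.key}: expected {item.expected!r}, found {item.actual!r}")
```

Because each record names the setting, the expected value, and the observed value, a later remediation step can target exactly that gap.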
Remediation is where operational risk enters, because changing systems can cause outages if done carelessly. Safe remediation starts with minimal change, meaning you apply the smallest adjustment that restores the desired state rather than applying a full reset that disrupts everything. For example, if one service setting is wrong, remediation should correct that setting without restarting unrelated services or rewriting unrelated files. This is also why configuration management tools often emphasize convergence, because the system aims to make only the changes required to close the drift gap. Beginners sometimes assume remediation means reinstalling or rebuilding, but in mature operations, remediation is often a precise correction. Precision reduces downtime and makes post-incident recovery easier because you can fix what drifted without introducing new variables. The operational outcome you want is that remediation is boring, repeatable, and low impact, not dramatic and disruptive.
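Here is one way minimal-change remediation might look in Python. This is a sketch under simplifying assumptions: settings live in a dictionary, and a hypothetical `owner` mapping says which service has to restart when a given setting changes. Real configuration management tools have their own mechanisms for this, so treat the names here as illustrative only.

```python
from typing import Dict, Set


def remediate(desired: Dict[str, str], actual: Dict[str, str],
              owner: Dict[str, str]) -> Set[str]:
    """Fix only the settings that differ and return only the services
    whose settings were actually changed."""
    restarts = set()
    for key, value in desired.items():
        if actual.get(key) != value:
            actual[key] = value       # the precise correction
            restarts.add(owner[key])  # restart just the affected service
    return restarts


desired = {"web.port": "443", "db.pool_size": "20"}
actual = {"web.port": "8080", "db.pool_size": "20"}
owner = {"web.port": "web", "db.pool_size": "db"}  # hypothetical mapping

first = remediate(desired, actual, owner)   # only the web setting drifted
second = remediate(desired, actual, owner)  # already converged

print(first, second)
```

In this example only the web service is marked for restart, and the database is left alone. The second run changes nothing, which is what convergence looks like in practice.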
Another critical part of safe remediation is understanding when not to remediate automatically. Drift detection might reveal differences that are intentional, such as a temporary emergency change approved by the team, a phased rollout where different groups are at different versions, or an experimental environment that is not meant to match production. If remediation blindly enforces a single desired state everywhere, it can undo intentional changes and create new incidents. That’s why good configuration management includes the concept of scope, meaning which systems are governed by which desired state definitions, and exceptions, meaning situations where drift is allowed. The operational goal is not to eliminate variation; it’s to control it and to make it visible. Controlled exceptions are safer than hidden exceptions because they’re documented in the same place as the desired state. When exceptions are managed explicitly, remediation can remain safe because it knows when to act and when to hold back.
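Scope and exceptions can be expressed as a simple guard that runs before any automatic fix. The Python sketch below assumes scope is a set of host names and that each exception is recorded with an expiry date. The host names, the `rate_limit` setting, and the `should_remediate` function are all hypothetical.

```python
from datetime import date
from typing import Dict, Set, Tuple

# Hypothetical scope: hosts governed by the production definition.
PRODUCTION_SCOPE: Set[str] = {"web-01", "web-02"}

# Hypothetical documented exceptions: (host, setting) -> last day drift is allowed.
EXCEPTIONS: Dict[Tuple[str, str], date] = {
    ("web-02", "rate_limit"): date(2024, 6, 30),
}


def should_remediate(host: str, setting: str, today: date,
                     scope: Set[str] = PRODUCTION_SCOPE,
                     exceptions: Dict[Tuple[str, str], date] = EXCEPTIONS) -> bool:
    """Decide whether automatic remediation may touch this setting on this host."""
    if host not in scope:
        return False  # not governed by this desired state at all
    expires = exceptions.get((host, setting))
    if expires is not None and today <= expires:
        return False  # an approved, still-active exception
    return True


print(should_remediate("web-01", "rate_limit", date(2024, 6, 15)))
print(should_remediate("web-02", "rate_limit", date(2024, 6, 15)))
print(should_remediate("web-02", "rate_limit", date(2024, 7, 1)))
print(should_remediate("lab-01", "rate_limit", date(2024, 6, 15)))
```

Giving each exception an expiry date is a design choice in this sketch, not something every tool does. It keeps a temporary emergency change from quietly becoming permanent.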
There’s also an important relationship between drift detection and deployment frequency. If you deploy frequently, you are constantly changing desired state, and drift detection must keep up without treating normal change as drift. In that world, drift detection is often used to confirm that the environment reached the new desired state after a deployment and stays there between deployments. If you deploy less frequently, drift can accumulate for longer, and detection becomes more like a health check that ensures the environment hasn’t quietly degraded since the last change. Either way, drift detection gives you a feedback loop, which is the real operational value. It tells you whether your deployment was actually effective, whether systems stayed aligned, and whether there are areas where manual changes or environmental factors are undermining consistency. Beginners sometimes focus on getting a deployment to succeed once, but operations cares about repeat success and continued correctness. Drift detection is how you verify that correctness without relying on hope.
A practical way to think about deploying with configuration management is to view deployment as two phases: reaching the desired state and staying in the desired state. The first phase is about applying changes, creating or updating resources, and ensuring the environment matches the definition. The second phase is about continuously confirming that match and correcting deviations when appropriate. Drift detection lives mostly in the second phase, but it also supports the first phase because it can validate that a deployment actually took effect. Remediation bridges both phases because it can fix incomplete deployments and later fix drift caused by outside changes. This is why configuration management is so important for repairability, because it gives you a mechanism to restore a known-good baseline. In a crisis, you want a trusted way to return to normal, and configuration management is that trusted way when it’s designed well. The operational outcome is fewer bespoke fixes and more consistent recovery.
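The two phases can be sketched with a single function that is used for both. In this hypothetical Python example, `converge` stands in for whatever your configuration management tool does when it applies a definition, and the dictionaries stand in for real system state. The same call performs the initial deployment, confirms that nothing has changed, and later repairs an outside change.

```python
from typing import Dict, List


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Bring actual in line with desired and return the keys that were changed."""
    changed = [key for key in sorted(desired) if actual.get(key) != desired[key]]
    for key in changed:
        actual[key] = desired[key]
    return changed


desired = {"service": "running", "version": "2.1"}
actual: Dict[str, str] = {}

# Phase one: reach the desired state.
deployed = converge(desired, actual)

# Phase two: stay there. A clean check changes nothing.
steady = converge(desired, actual)

# Simulate an outside change, then let the same mechanism repair it.
actual["service"] = "stopped"
repaired = converge(desired, actual)

print(deployed, steady, repaired)
```

An empty result on the steady-state run doubles as validation that the deployment took effect, and the repair run touches only the setting that drifted.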
One of the most common misconceptions is that configuration management is only for servers, but the underlying concept applies to any managed component that has a desired configuration. The moment you have multiple systems that should behave similarly, you need a way to define and enforce that similarity. Drift detection is the difference between believing your environment is consistent and knowing it is consistent. Remediation is the difference between noticing inconsistency and being able to correct it predictably. Beginners sometimes see manual configuration as faster, but manual configuration doesn’t scale and it doesn’t leave behind a reliable record of what “correct” means. When you manage configuration as code, you create a shared definition that can be reviewed, improved, and reused. That makes deployments more reliable and makes operational practices more teachable because the desired state becomes explicit and readable.
To bring this home, the big idea is that configuration management turns deployments into ongoing governance of system state, not one-time events. Drift detection gives you visibility into the gap between what you intended and what exists, which helps you catch issues early. Remediation gives you a controlled way to close that gap, ideally with minimal change and clear rules about when enforcement should occur. When you use both together, reruns stop being scary because they’re designed to converge, and partial failures become repairable because the same mechanism that deploys can also heal. You also improve team workflows because everyone is working from the same definition of desired state rather than from personal memory or undocumented habits. In operations, boring is good, and configuration management with drift detection and remediation is one of the best ways to make system correctness boring in the best possible sense.
