Unintended Consequences: Why you can't trust your security software

CrowdStrike is only beginning

Thousands of companies across the globe recently suffered outages when an update from endpoint security vendor CrowdStrike caused Windows computers to crash.  The impact was immediate and lasting; Delta Air Lines needed nearly a week to recover.  Some are now reassessing whether they want to run that particular product.  That's the wrong lesson.  Any third-party software that runs with high levels of permission (including security software) creates risk.  The question to ask is "what's your strategy to ensure resiliency, given that you invariably need to run someone else's software?"

Let's start with what this wasn't.  Despite what CrowdStrike said, the root cause wasn't a quality assurance (QA) error.  Their testing process missed something material, but QA is a control; it likely didn't create the bug.  Understanding the root cause matters because QA will always have some level of "escapes".  Failing to understand and fix the error at every stage, especially the true root cause, increases the risk of another failure.

Unintended defects are just one risk of third-party software.  Sometimes defects are intentional.  Rogue employees add backdoors that enable secret access.  Open-source and licensed libraries included in a release can carry viruses and other embedded malware without the knowledge of the developers using them.  Even developers' tools can be hacked to secretly insert malware.  Collectively, these "software supply chain attacks" result in vendors delivering compromised code, which you then install.  Bad actors don't need to hack you if they can hack your suppliers.  If you install code you didn't write, even code from a reputable vendor, you are assuming some level of risk.
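One basic control against tampered deliveries can be sketched as follows, under the assumption that the vendor publishes checksums for its releases (the function name is illustrative): verify the artifact's digest before installing anything.

```python
import hashlib

def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Compare a downloaded installer's SHA-256 digest against the
    vendor-published value before allowing installation.  A mismatch
    means the artifact was corrupted or tampered with in transit."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large installers don't exhaust memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

Checksums only prove the file you received is the file the vendor shipped; they do nothing if the vendor's own build pipeline was compromised, which is why the behavioral controls below still matter.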

So what can be done?

First, as software buyers, we should insist that critical vendors adequately describe what their code is expected to do.  After all, malware is software that intentionally behaves badly, so knowing expected behavior is key to preventing unexpected behavior.  We've built examples of software-scanning tools that can identify high-risk behaviors before the code is even installed, and tools that detect unexpected behaviors at runtime.  Together, these dramatically reduce software supply chain risk from attacks like "software injection" and "namespace confusion".
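As a toy illustration of the pre-install scanning idea (production scanners model far more, including imports, network and filesystem behavior, and obfuscation patterns), a scanner can flag untrusted package source that calls high-risk builtins:

```python
import ast

# Builtins that let code execute arbitrary strings at runtime --
# legitimate in some packages, but worth flagging for review.
HIGH_RISK_CALLS = {"eval", "exec", "compile", "__import__"}

def scan_source(source: str) -> list[str]:
    """Statically walk a package's Python source and report any
    direct calls to high-risk builtins, with line numbers."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in HIGH_RISK_CALLS:
                findings.append(f"line {node.lineno}: call to {node.func.id}")
    return findings
```

Because this runs against source before installation, a flagged finding can block the install entirely rather than trying to contain bad behavior after the fact.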

What about bugs with no malicious intent, like the one we saw with CrowdStrike?  All it takes is one bad pointer in an application running with high privileges to crash a machine.

Again, it comes down to your ability to detect badly behaving code.  Anyone who allows all of their systems to install updates simultaneously is taking a risk.  With enough releases of enough products, a failure like this was inevitable.  If your vendors don't do staged rollouts of updates, called "canaries", you should.  Why let all of your most critical systems be the canaries?  Either enforce staged rollouts or delay installs of new releases by a day.  (Obviously, this advice doesn't apply to patches that address urgent and critical fixes, such as mitigations for "zero-day" attacks.)
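A staged rollout can be as simple as deterministically binning hosts into waves and gating installs on a soak period.  The sketch below is illustrative only; the wave count and soak time are assumptions, and a real pipeline would also gate each later wave on health checks from the earlier ones rather than on elapsed time alone.

```python
import hashlib

def rollout_wave(hostname: str, waves: int = 4) -> int:
    """Deterministically assign a host to a rollout wave by hashing
    its name, so the same machines are always the canaries."""
    return hashlib.sha256(hostname.encode()).digest()[0] % waves

def should_install(hostname: str, hours_since_release: float,
                   soak_hours: int = 24, waves: int = 4) -> bool:
    """Gate the update: wave 0 installs immediately; each later wave
    waits an additional soak period while earlier waves bake."""
    return hours_since_release >= rollout_wave(hostname, waves) * soak_hours
```

Hashing the hostname (rather than picking hosts at random each release) keeps wave membership stable, so you can deliberately keep your most critical systems out of wave 0.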

What if the bugs aren't obvious right after installation?  Staged rollouts won't help if machines only crash when some external event happens, like the calendar rolling over to a new month.  This is where agility and a rapid-response capability become critical.  Resiliency is achieved through redundancy, so you need the ability to swap out errant code and replace it with something else.  Quickly.  Modern software stacks and tools can deploy thousands of virtualized servers and containerized apps at the push of a button.  If an application like CrowdStrike's causes crashes, redeploy without it, or replace it with another endpoint protection tool in the interim.  A defense-in-depth strategy means you should be able to run with a single control offline for a while.  An interim replacement might not have all of the same capabilities, but it's better than being down.  Not being able to rapidly redeploy in an emergency is a capability gap that must be addressed.
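Once fleet state is known, the redeploy step itself is mostly bookkeeping.  A minimal sketch, assuming a host-to-version inventory (all names here are hypothetical):

```python
def rollback_plan(fleet: dict[str, str], bad_version: str,
                  fallback: str) -> dict[str, str]:
    """Return the redeploy targets: every host running the bad
    release is mapped to a fallback image, e.g. the previous
    known-good build or one with the failing agent removed."""
    return {host: fallback for host, version in fleet.items()
            if version == bad_version}
```

The hard part isn't the plan, it's having rehearsed the mechanics: known-good images kept warm, automation that can push them fleet-wide, and confidence that the fallback configuration actually boots.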

None of these measures guarantees smooth sailing, but taken together they dramatically reduce the impact of malicious software, bugs, and long periods of downtime.  It's up to the technology groups that deploy vendor code to manage the risks their vendors introduce.