Navigating the Challenges of Policy as Code in Azure: Part 2

Previously, I wrote about some challenges around policy as code. You can find the previous blog here. There I discussed some problems I see with policy as code in general. Now I’m going to delve a little deeper into the problem and look more closely at what Microsoft gives us and what we might have to develop ourselves.

Challenges

One of the bigger problems with policy as code is that you will end up with many non-compliant policies. You find them by going to the portal and looking at the compliance state of each resource. For a developer, this workflow doesn’t work. In my experience, there is no clarity in an organization about who is responsible for whether resources are compliant or not. Of course, it is possible to agree on this, but that does not mean that we have the necessary tools. Microsoft gives us a lot we can use to deploy policies, but not a lot of guidance on how to work with them once they are deployed.

Microsoft has one solution that might solve some of our challenges. One of the biggest challenges is getting an overview and being notified when something is non-compliant. It has long been possible to send logs based on policy changes to a Log Analytics workspace, and Microsoft has finally come up with an automated solution. You can find it here. It takes changes in policy state and sends them as logs to a Log Analytics workspace, where you can do what you want with them. But what does it actually solve?

You get logs on changes, but for me, having worked with this for a long time, that is just a starting point. The logs can reveal one thing: when a new resource appears and is non-compliant, or when a change to an existing resource makes it non-compliant. Then we get a log entry. What now? We need an alert that triggers on it, but who should receive it? Operations personnel, who then notify the person who set up the resource? Who fixes the problem afterwards? How long do they have to fix it? How do you categorize the severity of a non-compliant policy? Many questions, a lot of work and processes to be created, and no good solutions as of now. Okay, enough problematization. Let’s look a little closer at what functionality we need, what Microsoft gives us and what we can do about the problem.

Thoughts

Let’s start with what Microsoft gives us – logging of changes in policy state:

Automation of DINE and Modify
Let’s start with the two policy effects that we can automate: DINE (DeployIfNotExists) and Modify. There is no reason for these to stay non-compliant, since they are created to just get things done. Still, I often find that DINE and Modify do not work on the first try, and a remediation task has to be started manually to make the resources compliant.
Possible solution:

Implement logging of policies
An alert rule triggers when a resource becomes non-compliant with a DINE or Modify policy
The action group triggers an Azure Function that automatically starts a remediation task and sends a log back to Log Analytics stating whether the remediation task was successful. The Azure Function can try the remediation task x number of times before reporting the result to the Log Analytics workspace.
Another alert rule triggers on the log from the Azure Function if the remediation does not work. It triggers an action group that sends an email to someone who has to go in and fix the problem manually.
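
The retry step in the list above can be sketched roughly as follows. This is a minimal sketch and not a finished Azure Function: `start_remediation` is a hypothetical callable that you would have to implement yourself against the Azure API of your choice, and the field names in the log record are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RemediationResult:
    resource_id: str
    succeeded: bool
    attempts: int


def remediate_with_retries(
    resource_id: str,
    start_remediation: Callable[[str], bool],  # hypothetical: returns True on success
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> RemediationResult:
    """Try the remediation up to max_attempts times and report the outcome."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            if start_remediation(resource_id):
                return RemediationResult(resource_id, True, attempt)
        except Exception:
            # An exception counts as a failed attempt; fall through and retry.
            pass
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    return RemediationResult(resource_id, False, max_attempts)


def to_log_record(result: RemediationResult) -> dict:
    """Shape the outcome as a record to send back to the workspace.

    The field names are assumptions, not an existing schema.
    """
    return {
        "ResourceId": result.resource_id,
        "RemediationSucceeded": result.succeeded,
        "Attempts": result.attempts,
    }
```

The second alert rule would then only need to look for records where `RemediationSucceeded` is false, so a person is involved only when the automation has given up.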

Policy is non-compliant and nobody fixes it
One of the biggest problems with only logging changes is that we do not necessarily get information that a policy remains non-compliant. If no one addresses the problem, the logs will eventually disappear from the Log Analytics workspace, and in the worst case we use the logs as the truth and the starting point is wrong.
Possible solution:

You need to have complete control over all policies when rolling out the solution. That means that everything must be compliant from the beginning. Then you have a starting point where you can rely on logs that trigger on changes in the policy state.
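
To illustrate why the starting point matters, here is a small sketch that rebuilds the current set of non-compliant resources from a baseline plus the change logs. The event fields (`timestamp`, `resource_id`, `policy`, `state`) are assumptions for the example and not the actual log schema.

```python
from typing import Iterable, Set, Tuple

Key = Tuple[str, str]  # (resource_id, policy), both hypothetical identifiers


def current_non_compliant(baseline: Set[Key], events: Iterable[dict]) -> Set[Key]:
    """Replay change events on top of a known baseline.

    Timestamps are assumed to be ISO 8601 strings in the same time zone,
    so sorting them as text puts them in chronological order.
    """
    state = set(baseline)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["resource_id"], event["policy"])
        if event["state"] == "NonCompliant":
            state.add(key)
        else:
            # Any other state is treated as "no longer non-compliant".
            state.discard(key)
    return state
```

If the baseline is empty because everything was compliant at rollout, the change logs alone give the correct picture. If a resource was already non-compliant at rollout and never changes, it produces no event, so it only shows up if it was recorded in the baseline.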

Who is responsible for fixing non-compliant policy?
Problem number one: This is perhaps the most important and difficult question for me, and the answer will be different for every company, depending on its organization. So there will not be one solution that solves all the problems here. But I will try to share my thoughts and see how they align with Microsoft’s Cloud Adoption Framework (CAF).

The desire here is to automate this process as much as possible, to streamline the work and maintain compliance. The following process is the most intuitive:

Alert is created
A platform team picks up the alert
They find the person that is responsible for that resource
They alert the person

For me, this is a lot of overhead and not very optimized for a larger organization with a lot of developers.

The goal is to alert the person who created a resource that became non-compliant. It is not necessary to rely on the platform team for this. They should be aware of the state of the platform, but micromanaging and forwarding alerts is not very effective at a large scale.

We have the logs for the non-compliant resources. So it is possible to create a script (runbook or function app) that can connect the non-compliant resource to the person who owns it. However, there is no easy way to alert them. One solution could be:

Alert is triggered
Script gets the person that is responsible
Create an action group based on the alert and put the person as receiver
Trigger an alarm that ensures the action group sends email to the user.
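
The step where the script finds the responsible person could look something like this. It is only a sketch under one big assumption: that every resource carries a tag naming its owner. The tag name `owner`, the alert fields and the fallback address are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

OWNER_TAG = "owner"  # hypothetical tag name; use whatever your tagging standard defines


def resolve_owner(tags: Optional[Dict[str, str]], fallback: str) -> str:
    """Return the owner from the resource tags, or the fallback if none is set."""
    for key, value in (tags or {}).items():
        # Match the tag name case-insensitively and ignore empty values.
        if key.lower() == OWNER_TAG and value.strip():
            return value.strip()
    return fallback


def group_by_owner(alerts: Iterable[dict], fallback: str) -> Dict[str, List[str]]:
    """Collect non-compliant resource IDs per recipient."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        owner = resolve_owner(alert.get("tags"), fallback)
        grouped[owner].append(alert["resource_id"])
    return dict(grouped)
```

Grouping per recipient means a person with five non-compliant resources gets one notification instead of five. Resources without an owner fall back to a shared address, for example the platform team, so nothing is silently dropped.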

This doesn’t sound like a great solution either. You end up with a lot of action groups. You can of course delete them, but then you lose that record for later. The problem here is that there is no simple way to send email from Azure with dynamic recipients based on logs.

Problem number two: What do you do if the person who is alerted does nothing with the alarm? How do you trigger a new alarm when there are no changes in the logs? Here I have no other solution than to have several alarms with different queries that look at when the first alarm went off and check whether the problem has been fixed, in order to trigger a new alarm. But here I would also like an escalation: if the problem is not fixed, the alarm can be sent to another person who has more responsibility.
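
As a rough sketch of the escalation idea, the decision of who should get the next alarm can be reduced to a small function of how long the problem has been open. The thresholds and recipient roles below are made-up examples, not a recommendation.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical escalation ladder: (age of the first alert, who gets the alarm).
ESCALATION_STEPS = [
    (timedelta(days=0), "resource owner"),
    (timedelta(days=3), "team lead"),
    (timedelta(days=7), "platform team"),
]


def escalation_target(
    first_alert: datetime, now: datetime, still_non_compliant: bool
) -> Optional[str]:
    """Return who should be alerted now, or None if the problem has been fixed."""
    if not still_non_compliant:
        return None
    age = now - first_alert
    target = None
    for threshold, recipient in ESCALATION_STEPS:
        # Keep the last step whose threshold has been reached.
        if age >= threshold:
            target = recipient
    return target
```

A scheduled job could run this for every open alert, which avoids maintaining one alert rule and query per escalation level. It does require storing when the first alarm went off somewhere that outlives the logs.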

Based on these problems, I feel that we need either a huge amount of tooling to keep the platform compliant, or a huge platform team that takes on the job of alerting everyone. I just don’t see the latter working at a larger scale. The conclusion is that we need additional tooling to handle non-compliant policies at scale.

If this series is of interest, I will dig deeper and create small, separate solutions for the different problems I feel exist.

Summary

What we get from Microsoft has the potential to be used as a foundation for further work, but it is far from fully developed from an operational perspective. So what can we do? We can share our work, solutions and experience with each other. I will continue my deep dive into this problem and create small parts that I feel each take one step in the right direction.

Håvard Løkensgard
