Rotating secrets with serverless

Andrew Stump
12 min read · Jan 17, 2024

Rotating secrets is a critical element of your security posture. When done manually, it is often neglected because the process becomes more tedious and complex as the company and its secrets grow. Breaches that rely on a compromised secret are difficult to detect, as the attacker has gained access through legitimate means, and they can persist for months unless the secret is rotated. To perform manual rotations, developers need to keep track of when secrets are due for rotation, execute the rotation process, and update the application accordingly. Often, teams plan rotation events every few months, working through a list of secrets and rotating each one by hand. This is time-consuming, tedious, and prone to human error, whether because a secret is rotated incorrectly or was never added to the list in the first place. In this article, we will explore how to use a serverless approach to automate the secret rotation process, so that you never have to endure one of these arduous events again.

For this article I will be using the example of rotating the keys for an AWS IAM service account and updating them in GitLab. Please understand this is very much a template. Individual parts can be swapped out depending on your current stack and where your secrets are stored. In your situation you may be looking to rotate an API key or a password, and update it in a database or a VM. Whatever the situation, the parts may vary, but the concepts from a serverless point of view are the same. That said, this article goes into the technical detail of how the system was implemented. If you are looking for a more broad-strokes overview, please check out my other article here.

Overview

The fundamental concept of this process is based around 3 core parts:

  1. A function to perform the rotation
  2. A trigger as part of the service that needs the rotation done
  3. Storage for the secret
General method

In our example the function has an extra step required by the way our service is structured: the trigger for each service passes the name of the IAM user whose keys are being rotated, and the secrets are stored in the GitLab project variables. The platform in this diagram simply represents whatever you are trying to access with the secret; in our case it is AWS. I am using this example because it is one I have worked with first hand. There are certainly better ways to manage service account permissions and keys and inject them into your application, but this approach provides a simple and understandable method for what we are looking to achieve, and it gives us the opportunity to show how the core method can be modified (by adding deployment) to provide a more holistic approach to secret rotation.

Our method

As we can see from our example, although more steps are involved, the fundamentals are the same. Now that we have an idea of what we need to build, we can get started!

How to rotate your secret

The first thing we need to define is: how do I rotate my desired secret? In our example we are using IAM user security credentials. If we were doing this rotation manually, we would find the user in the console, create new credentials for them, then use these new credentials to update the relevant variables in our GitLab project. Once updated, we would redeploy our service with the new values, and finally, once deployed, we would remove the old key from the user so that it is no longer in circulation.

Function

To translate this into our serverless function, we need to perform this process in code. I have written a simple main.py Python script that leverages boto3 for the AWS IAM elements and python-gitlab for the GitLab elements:

import boto3
import gitlab
from time import sleep


def lambda_handler(event, context):
    iam = boto3.client('iam')
    # Capture the existing key metadata before creating the new key
    # (assumes the user currently has exactly one active key)
    old_key = iam.list_access_keys(UserName=event['userName'])['AccessKeyMetadata']
    print(f"Creating new key for {event['userName']}")
    new_key = iam.create_access_key(
        UserName=event['userName']
    )

    # Fetch the GitLab token from Secrets Manager and update the project's CI/CD variables
    token = boto3.client('secretsmanager').get_secret_value(SecretId=event['secretARN'])['SecretString']
    gl = gitlab.Gitlab(private_token=token)
    project = gl.projects.get(event['gitProjectID'])
    print(f"Updating access key in project {project.name}")
    project.variables.update("AWS_ACCESS_KEY_ID", {"value": new_key['AccessKey']['AccessKeyId']})
    project.variables.update("AWS_SECRET_ACCESS_KEY", {"value": new_key['AccessKey']['SecretAccessKey']})

    # Redeploy the service and wait for the pipeline to finish before touching the old key
    pipeline = project.pipelines.create({'ref': 'main'})
    print("Waiting for deployment...")
    while project.pipelines.get(pipeline.id).status in ["created", "pending", "running"]:
        sleep(5)
    if project.pipelines.get(pipeline.id).status == "success":
        print("Pipeline successful")
    else:
        print(f"Pipeline {project.pipelines.get(pipeline.id).status}")
        return 400

    # Deactivate and delete the old key now that the new one is live
    print(f"Deleting old key for {event['userName']}")
    iam.update_access_key(
        UserName=event['userName'],
        AccessKeyId=old_key[0]['AccessKeyId'],
        Status="Inactive"
    )
    iam.delete_access_key(
        UserName=event['userName'],
        AccessKeyId=old_key[0]['AccessKeyId']
    )
    print(f"Access key rotated for {event['userName']}")
    return 200

Here we can see the logical function steps we defined earlier in action. First we gather the old key for later use, and then create a new key:

iam = boto3.client('iam')
old_key = iam.list_access_keys(UserName=event['userName'])['AccessKeyMetadata']
print(f"Creating new key for {event['userName']}")
new_key = iam.create_access_key(
    UserName=event['userName']
)

Now that we have our new key, we can update the GitLab variables for the service project. To do this we will need a GitLab access token. To comply with security best practices this token should be kept as a secret (yes, I am aware of the irony of using a secret within a secret rotation function). Strictly speaking, each service should use its own project access token (which is what we will use for this example), however we then run into the issue of having another secret that has to be rotated periodically for each service. To mitigate this, we can sacrifice some security and instead use a group access token that covers all the services we want to rotate secrets for. This is still a secret that needs rotating, but by having one for all projects, there is only one secret that needs to be manually rotated… Or you can challenge yourself to create a GitLab access token rotator serverless function to manage all of them automatically!
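
If you do take up that challenge, a rough sketch of such a token rotator is below. This is an illustration rather than part of the deployed example: it assumes GitLab 16.0 or later (which exposes a rotate endpoint for access tokens), and the tokenID event field is hypothetical.

import boto3
import gitlab


def lambda_handler(event, context):
    # Illustrative sketch: rotates the GitLab access token itself.
    # Assumes GitLab 16.0+, which provides a rotate endpoint for access tokens.
    secrets = boto3.client('secretsmanager')
    token = secrets.get_secret_value(SecretId=event['secretARN'])['SecretString']

    gl = gitlab.Gitlab(private_token=token)
    # 'tokenID' is a hypothetical event field identifying the token to rotate
    new = gl.http_post(
        f"/projects/{event['gitProjectID']}/access_tokens/{event['tokenID']}/rotate"
    )
    # Store the replacement token so the key rotator picks it up on its next run
    secrets.put_secret_value(SecretId=event['secretARN'], SecretString=new['token'])
    return 200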

token = boto3.client('secretsmanager').get_secret_value(SecretId=event['secretARN'])['SecretString']
gl = gitlab.Gitlab(private_token=token)
project = gl.projects.get(event['gitProjectID'])
print(f"Updating access key in project {project.name}")
project.variables.update("AWS_ACCESS_KEY_ID", {"value": new_key['AccessKey']['AccessKeyId']})
project.variables.update("AWS_SECRET_ACCESS_KEY", {"value": new_key['AccessKey']['SecretAccessKey']})

Here we have gathered the secret ARN from the event being passed in, as it will be different for each service, and created a GitLab client, which is then used to update the CI/CD variables for the key in the given git project. Once again, the project ID is passed in as part of the event, as it is unique to each service. The key variables use the standard names for AWS credentials.
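
For illustration, the event consumed by the handler might look like the following. The values here are placeholders, in line with the Terraform we will write later:

# Example event shape for lambda_handler (values are placeholders)
event = {
    "userName": "service-user",  # IAM user whose keys are rotated
    "secretARN": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:service-gitlab-token",
    "gitProjectID": 1234567,  # GitLab project whose CI/CD variables are updated
}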

Now that the variables have been updated, we can update the service itself by triggering a pipeline to redeploy it with the new values.

pipeline = project.pipelines.create({'ref': 'main'})
print("Waiting for deployment...")
while project.pipelines.get(pipeline.id).status in ["created", "pending", "running"]:
    sleep(5)
if project.pipelines.get(pipeline.id).status == "success":
    print("Pipeline successful")
else:
    print(f"Pipeline {project.pipelines.get(pipeline.id).status}")
    return 400

In our example we are simply using the main branch, but you can adjust this to suit your needs. Once the pipeline is triggered we need to wait and make sure it succeeds before proceeding. This avoids deleting keys when the pipeline fails, which would leave a running service without any valid keys. To do this we simply run a loop that checks the status of the pipeline: if the pipeline succeeds we continue with the rotation, and if it fails we abort. Of course, for this to work you will need a .gitlab-ci.yml file in your service's GitLab project. For this example my .gitlab-ci.yml file just prints a couple of lines, as what happens there is not within the scope of this article; we just want to make sure the service is being deployed.
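
One thing to keep in mind is that the loop above will happily poll until the Lambda itself times out (we give it the maximum of 900 seconds later on). If you would rather the abort path fire before that, you can bound the wait explicitly. Here is a minimal sketch, with a hypothetical helper name and timeout value:

from time import sleep, time


def wait_for_pipeline(project, pipeline_id, timeout=840, interval=5):
    """Poll a pipeline until it leaves a pending state, or the timeout is hit."""
    deadline = time() + timeout
    while time() < deadline:
        status = project.pipelines.get(pipeline_id).status
        if status not in ["created", "pending", "running"]:
            return status
        sleep(interval)
    return "timed out"

The handler would then treat anything other than "success" as a failed deployment and leave the old key in place.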

Lastly, we need to delete our old key:

print(f"Deleting old key for {event['userName']}")
iam.update_access_key(
UserName=event['userName'],
AccessKeyId=old_key[0]['AccessKeyId'],
Status="Inactive"
)
iam.delete_access_key(
UserName=event['userName'],
AccessKeyId=old_key[0]['AccessKeyId']
)
print(f"Access key rotated for {event['userName']}")
return 200

This snippet simply deactivates and then deletes the old key, before returning the success response.

With that, our function is complete and we can move on to deploying it. Along with the function infrastructure, you will also need the service-side infrastructure to trigger the rotation.

Infrastructure

Here I will cover the required infrastructure and how to deploy it via Terraform. This can also be done via the console, but Terraform is very much the recommended method.

Infrastructure diagram

As you can see from the infrastructure diagram, it is split into 2 sections:

  1. Rotator
  2. Service

The rotator will remain constant; the service infrastructure, however, should be added alongside any existing infrastructure for that service, and repeated for each service that requires it.

Rotator

terraform {
  backend "remote" {
    organization = "my_org"
    workspaces {
      name = "secret-rotator"
    }
  }
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

locals {
  project_name = "secretrotator"
  region       = "eu-central-1"
  account      = "123456789012"
}

provider "aws" {
  region = local.region
}

resource "null_resource" "install_layer_dependencies" {
  provisioner "local-exec" {
    command = "pip install -r layer/requirements.txt -t layer/python/lib/python3.9/site-packages"
  }
  triggers = {
    trigger = timestamp()
  }
}

data "archive_file" "layer_zip" {
  type        = "zip"
  source_dir  = "layer"
  output_path = "layer.zip"
  depends_on = [
    null_resource.install_layer_dependencies
  ]
}

resource "aws_lambda_layer_version" "lambda_layer" {
  filename         = "layer.zip"
  source_code_hash = data.archive_file.layer_zip.output_base64sha256
  layer_name       = local.project_name

  compatible_runtimes = ["python3.9"]
  depends_on = [
    data.archive_file.layer_zip
  ]
}

resource "aws_lambda_permission" "allow_events_bridge_to_run_lambda" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = module.lambda_function.lambda_function_name
  principal     = "events.amazonaws.com"
}

module "lambda_function" {
  source = "terraform-aws-modules/lambda/aws"

  function_name = local.project_name
  handler       = "main.lambda_handler"
  runtime       = "python3.9"

  timeout     = 900
  create_role = false
  lambda_role = aws_iam_role.role.arn

  source_path = "function"

  store_on_s3 = false

  layers = [
    aws_lambda_layer_version.lambda_layer.arn
  ]
  depends_on = [
    aws_lambda_layer_version.lambda_layer
  ]
}

resource "aws_iam_role" "role" {
  name = local.project_name

  assume_role_policy = jsonencode({
    Version = "2008-10-17"
    Statement = [
      {
        Principal = {
          Service = "lambda.amazonaws.com"
        }
        Action = "sts:AssumeRole"
        Effect = "Allow"
      }
    ]
  })
}

resource "aws_iam_role_policy" "role_policy" {
  name = local.project_name
  role = aws_iam_role.role.name

  policy = jsonencode({
    Statement = [
      {
        "Effect"   = "Allow",
        "Action"   = "logs:CreateLogGroup",
        "Resource" = "arn:aws:logs:*:${local.account}:*"
      },
      {
        "Effect" = "Allow",
        "Action" = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ],
        "Resource" = [
          "arn:aws:logs:*:${local.account}:log-group:/aws/lambda/${local.project_name}:*"
        ]
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "iam:*AccessKey*",
        ],
        "Resource" : "arn:aws:iam::${local.account}:user/service/*"
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "secretsmanager:GetSecretValue",
          "secretsmanager:DescribeSecret",
          "secretsmanager:ListSecrets"
        ],
        "Resource" : "arn:aws:secretsmanager:${local.region}:${local.account}:secret:*gitlab-token*"
      }
    ]
  })
}

Hopefully this is clear to follow if you already have some Terraform knowledge, but I will highlight a few things:

  1. Terraform Cloud is being used as a remote backend. Other backend options work just as well, but it is worth pointing out to avoid confusion.
  2. Lambda layers are being used to package the dependencies for the function. For a more in-depth view of what is going on here, please check out my blog post here. The only dependencies involved are python-gitlab and boto3.
  3. The Lambda has been configured so that it can be triggered by CloudWatch Events. This is what allows us to trigger it from our service infrastructure.
  4. I have used a public Lambda Terraform module to deploy the Lambda, as it simplifies things slightly, but this can easily be replaced by the out-of-the-box resources.
  5. Finally, a quick overview of the permissions:
  • CloudWatch permissions for logging
  • Full permissions on the access keys of IAM users under the /service/ path
  • Permission to get gitlab-token secrets

This covers our main.tf, but to reference our Lambda in other Terraform workspaces we also need an output.tf to pass on the function ARN.

output "function_arn" {
value = module.lambda_function.lambda_function_arn
}

Since we will be referring to this state from different workspaces, we need to allow it to be shared in the Terraform Cloud workspace settings. Under General > Remote state sharing, either add the specific workspaces or share with all:

If you are using a different backend, more details on the terraform_remote_state data source can be found here.

With that, everything should be in place to deploy our rotator infrastructure. All that's left to do is terraform apply.

Service

The focus of the service infrastructure is simply to connect the service to the rotator so that it can be triggered with the correct event.

terraform {
  backend "remote" {
    organization = "my_org"
    workspaces {
      name = "service"
    }
  }
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

locals {
  project_name      = "service"
  region            = "eu-central-1"
  gitlab_project_id = 1234567
}

provider "aws" {
  region = local.region
}

module "service_account" {
  source = "./modules/service_user"

  project_name      = local.project_name
  gitlab_project_id = local.gitlab_project_id
  gitlab_token      = var.gitlab_token
}

# ... Other resources

This is our main.tf file in the service project. I have only included the snippet that is relevant to us; the rest of the configuration would contain whatever other resources the service needs. There are only a few things to note here. Firstly, the gitlab_project_id is unique to each service, so it should be included at this level. I have hardcoded it, but it could also be passed in as a variable. Secondly, the GitLab token should also be passed in at this level. Unlike the project ID it is sensitive, so to handle it securely it should be passed in as a variable. Since I am using Terraform Cloud, this can be done in the workspace under Variables.

Finally, I have made the decision to modularize the service_account resources. These resources will be repeated across every service that requires rotation, so turning them into a module makes the setup much more scalable. For the scope of this article it is only a local module, but it should be published to a private registry that all services have access to. Note that the module also needs a variables.tf declaring project_name, gitlab_project_id, and gitlab_token, which is omitted here. Here is the content of the module:

data "terraform_remote_state" "rotator" {
backend = "remote"

config = {
organization = "my_org"
workspaces = {
name = "secret-rotator"
}
}
}

resource "aws_iam_user" "user" {
name = "${var.project_name}-user"
path = "/service/"
}

resource "aws_secretsmanager_secret" "token" {
name = "${var.project_name}-gitlab-token"
}

resource "aws_secretsmanager_secret_version" "token" {
secret_id = aws_secretsmanager_secret.token.id
secret_string = var.gitlab_token
}

resource "aws_cloudwatch_event_rule" "schedule" {
name = "${var.project_name}-rotatekeys"
schedule_expression = "rate(90 days)"
}

resource "aws_cloudwatch_event_target" "schedule_lambda" {
rule = aws_cloudwatch_event_rule.schedule.name
arn = data.terraform_remote_state.rotator.outputs.function_arn
input = jsonencode({
secretARN = aws_secretsmanager_secret.token.arn
gitProjectID = var.gitlab_project_id
userName = aws_iam_user.user.name
})
}

Here I will highlight some things:

  1. The rotator workspace is referenced using the terraform_remote_state data source, which is then used in the aws_cloudwatch_event_target.
  2. To maintain security best practices for the rotator, the /service/ path has been added to the IAM user so the rotator's permissions can be restricted to users that follow this naming convention.
  3. The GitLab token we passed in from the level above is used to create a secret version. This means any change to the token is applied automatically as part of the apply, rather than having to update it manually in the console.
  4. The event rule triggers every 90 days, which complies with AWS recommendations.
  5. All required info is passed to the rotator as part of the event: the secret ARN, the git project ID, and the name of the user whose keys are being rotated.

Once we have all of this in place, we can terraform apply our service infrastructure and test to ensure it works!

Testing

Rather than wait 90 days for our CloudWatch event to trigger naturally, a much less time-consuming way to test the system is to trigger the Lambda manually. To ensure we are triggering it with exactly what we have deployed, use the console to copy the payload from our CloudWatch event rule, then use the Lambda's test functionality to invoke it with that payload.
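
You can do the same from a script. Here is a minimal sketch using boto3, assuming the function name and the placeholder values from the Terraform above:

import boto3
import json

# Placeholder payload mirroring the CloudWatch event target input
event = {
    "secretARN": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:service-gitlab-token",
    "gitProjectID": 1234567,
    "userName": "service-user",
}

client = boto3.client("lambda", region_name="eu-central-1")
response = client.invoke(
    FunctionName="secretrotator",
    Payload=json.dumps(event).encode(),
)
print(response["Payload"].read().decode())

Note that the synchronous invoke waits for the function to finish, so expect it to take as long as the pipeline does.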

If the trigger is successful you should be able to see the script in effect:

  • New key created for the user
  • GitLab variables updated
  • Pipeline triggered via the token in your GitLab project
  • Function waits for the pipeline to complete
  • Old key deleted
Pipeline triggered

You can follow along in the logs to keep updated too:

If this trigger works, then we know it should also work when the CloudWatch rule sends the event on schedule. If you want to double-check, you can adjust the rate to something more manageable (e.g. 5 minutes) and then put it back once testing is complete.

Summary

This was quite a lengthy and technical article, but I hope it gave you some practical insight into how we can use serverless to handle a necessary task that would be mundane and error-prone without automation. Having the function trigger a GitLab pipeline was an added complexity that would not typically be part of this process, but I included it to show how you can add and change aspects of the process to fit your needs.


Andrew Stump

DevOps Engineer currently based in Amsterdam, keeping track of a few things I’ve learned along the way.