Post

Amazon EKS with Open WebUI and AWS Bedrock managed by OpenTofu

Deploy Open WebUI on Amazon EKS with AWS Bedrock as the LLM backend, provisioned with OpenTofu

Amazon EKS with Open WebUI and AWS Bedrock managed by OpenTofu

I will outline the steps for setting up an Amazon EKS environment that hosts Open WebUI backed by AWS Bedrock as the LLM provider. All infrastructure - from the VPC up to the Helm releases - is provisioned by OpenTofu using widely adopted community modules (terraform-aws-modules) and the official hashicorp/helm provider for chart installations.

The setup should align with the following criteria:

Diagram:

flowchart TD
  User(["fa:fa-user User / Browser"])
  Google["fa:fa-key Google OIDC"]

  User -- HTTPS --> R53
  R53 --> NLB
  NLB --> EG
  EG -. OIDC flow .-> Google
  EG -- Authenticated --> OW
  OW -- OpenAI API --> LL
  LL -- SigV4 --> BR
  BR --> GR

  subgraph AWS["fa:fa-cloud AWS us-east-1"]
    R53["fa:fa-globe Route 53"]
    NLB["fa:fa-network-wired NLB"]
    BR["fa:fa-brain Amazon Bedrock"]
    GR["fa:fa-lock Bedrock Guardrail"]
    subgraph EKS["fa:fa-dharmachakra Amazon EKS"]
      EG["fa:fa-shield-alt Envoy Gateway"]
      OW["fa:fa-comments Open WebUI"]
      LL["fa:fa-robot LiteLLM"]
    end
  end

Build Amazon EKS

Requirements

You will need to configure the AWS CLI and set up other necessary secrets and variables:

1
2
3
4
# AWS Credentials
export AWS_ACCESS_KEY_ID="xxxxxxxxxxxxxxxxxx"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export AWS_SESSION_TOKEN="xxxxxxxx"

If you plan to follow this document and its tasks, you will need to set up a few environment variables, such as:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# Google OIDC credentials used by Envoy Gateway for authentication
export TF_VAR_google_client_id="${GOOGLE_CLIENT_ID}"
export TF_VAR_google_client_secret="${GOOGLE_CLIENT_SECRET}"

# AWS Region
export AWS_REGION="${AWS_REGION:-us-east-1}"
# Hostname / FQDN definitions
export CLUSTER_FQDN="${CLUSTER_FQDN:-k01.k8s.mylabs.dev}"
# Base Domain: k8s.mylabs.dev
export BASE_DOMAIN="${CLUSTER_FQDN#*.}"
# Cluster Name: k01
export CLUSTER_NAME="${CLUSTER_FQDN%%.*}"
# OpenTofu variables
export TF_VAR_cluster_fqdn="${CLUSTER_FQDN}"
export MY_EMAIL="${MY_EMAIL:-petr.ruzicka@gmail.com}"
export TF_VAR_tags="{\"Owner\":\"${MY_EMAIL}\",\"Environment\":\"dev\",\"Base-Domain\":\"${BASE_DOMAIN}\",\"Managed-by\":\"opentofu\"}"
# Derived shell variables
export TMP_DIR="${TMP_DIR:-${PWD}/tmp}"
mkdir -pv "${TMP_DIR}/${CLUSTER_FQDN}"

Install the required tools:

Configure AWS Route 53 Domain delegation

The DNS delegation tasks should be executed as a one-time operation.

flowchart LR
  CF["fa:fa-cloud Cloudflare\nmylabs.dev"]
  R53B["fa:fa-globe Route 53\nk8s.mylabs.dev"]
  R53C["fa:fa-globe Route 53\nk01.k8s.mylabs.dev"]
  ED["fa:fa-sync ExternalDNS"]
  NLB["fa:fa-network-wired NLB"]

  CF -- "NS delegation" --> R53B
  R53B -- "NS delegation" --> R53C
  ED -- "manages records" --> R53C
  R53C -- "*.k01.k8s.mylabs.dev" --> NLB

Create a Route 53 DNS zone for the EKS clusters and delegate it from Cloudflare using a standalone OpenTofu configuration with the AWS and Cloudflare providers:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
export CLOUDFLARE_API_TOKEN="your-api-token-here"
CLOUDFLARE_TF_DIR="${TMP_DIR}/${BASE_DOMAIN}"
mkdir -p "${CLOUDFLARE_TF_DIR}"

tee "${CLOUDFLARE_TF_DIR}/main.tf" << \EOF
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      # renovate: datasource=terraform-provider depName=hashicorp/aws
      version = "6.49.0"
    }
    cloudflare = {
      source  = "cloudflare/cloudflare"
      # renovate: datasource=terraform-provider depName=cloudflare/cloudflare
      version = "5.19.1"
    }
  }
}

provider "aws" {
  default_tags {
    tags = var.tags
  }
}

# Cloudflare provider reads CLOUDFLARE_API_TOKEN from the environment
provider "cloudflare" {}

variable "tags" {
  description = "Tags applied to all AWS resources"
  type        = map(string)
}

locals {
  parent_domain = join(".", slice(split(".", var.tags["Base-Domain"]), 1, length(split(".", var.tags["Base-Domain"]))))
}

# Create Route 53 hosted zone for the base domain
resource "aws_route53_zone" "base" {
  name    = var.tags["Base-Domain"]
  comment = "Created by ${var.tags["Owner"]} [${var.tags["Environment"]}, ${var.tags["Managed-by"]}]"
}

# Look up the zone ID for the parent domain in Cloudflare
data "cloudflare_zone" "zone" {
  filter = { name = local.parent_domain }
}

# Create NS records in Cloudflare pointing the subdomain to Route 53 nameservers
resource "cloudflare_dns_record" "ns" {
  # Hardcoded because aws_route53_zone.base.name_servers is unknown at plan time
  count   = 4
  zone_id = data.cloudflare_zone.zone.zone_id
  name    = var.tags["Base-Domain"]
  type    = "NS"
  content = aws_route53_zone.base.name_servers[count.index]
  ttl     = 3600
}

# Create the EC2 Spot service-linked role if it does not yet exist
resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}
EOF

tofu -chdir="${CLOUDFLARE_TF_DIR}" init
tofu -chdir="${CLOUDFLARE_TF_DIR}" apply

The OpenTofu configuration above creates a Route 53 hosted zone for k8s.mylabs.dev and adds four NS records in the Cloudflare mylabs.dev zone that delegate DNS queries for the subdomain to the Route 53 nameservers:

CloudFlare mylabs.dev zone CloudFlare mylabs.dev zone - NS records delegating k8s.mylabs.dev to Route 53

Route53 k8s.mylabs.dev zone Route53 k8s.mylabs.dev zone - hosted zone with NS and SOA records

Create S3 bucket for Amazon EKS backups and Tofu state

Create an S3 bucket to store Amazon EKS backups and OpenTofu remote state using CloudFormation. The bucket uses KMS encryption, lifecycle policies, and blocks all public access:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
if ! aws s3api head-bucket --bucket "${CLUSTER_FQDN}" 2> /dev/null; then
  tee "${TMP_DIR}/${CLUSTER_FQDN}/s3.yaml" << \EOF
AWSTemplateFormatVersion: "2010-09-09"
Description: S3 bucket for Amazon EKS backups and OpenTofu state files
Parameters:
  Name:
    Description: Name of the S3 bucket
    Type: String
Resources:
  ClusterS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref Name
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      LifecycleConfiguration:
        Rules:
          - Id: MultipartUploadLifecycleRule
            Status: Enabled
            AbortIncompleteMultipartUpload:
              DaysAfterInitiation: 1
          - Id: VeleroExpiration
            Status: Enabled
            Prefix: velero/
            ExpirationInDays: 120
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: alias/aws/s3
  ClusterS3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref ClusterS3Bucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: ForceSSLOnlyAccess
            Effect: Deny
            Principal: "*"
            Action: s3:*
            Resource:
              - !GetAtt ClusterS3Bucket.Arn
              - !Sub ${ClusterS3Bucket.Arn}/*
            Condition:
              Bool:
                aws:SecureTransport: "false"
Outputs:
  ClusterS3Bucket:
    Value: !Ref ClusterS3Bucket
EOF

  aws cloudformation deploy --region "${AWS_REGION}" \
    --stack-name "${CLUSTER_FQDN//./-}-s3" \
    --tags "Owner=${MY_EMAIL}" "Environment=dev" "Cluster=${CLUSTER_FQDN}" \
    --parameter-overrides "Name=${CLUSTER_FQDN}" \
    --template-file "${TMP_DIR}/${CLUSTER_FQDN}/s3.yaml"
fi

OpenTofu Code

All resources from this point onwards are managed by OpenTofu. Create the working directory and the main configuration file with provider versions, backend, and provider settings:

OpenTofu

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
tee "${TMP_DIR}/${CLUSTER_FQDN}/main.tf" << EOF
terraform {
  required_version = ">= 1.12.0"

  backend "s3" {
    bucket       = "${CLUSTER_FQDN}"
    key          = "terraform.tfstate"
    use_lockfile = true
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      # renovate: datasource=terraform-provider depName=hashicorp/aws
      version = "6.49.0"
    }
    helm = {
      source  = "hashicorp/helm"
      # renovate: datasource=terraform-provider depName=hashicorp/helm
      version = "3.2.0"
    }
    kubectl = {
      source  = "alekc/kubectl"
      # renovate: datasource=terraform-provider depName=alekc/kubectl
      version = "2.4.1"
    }
    random = {
      source  = "hashicorp/random"
      # renovate: datasource=terraform-provider depName=hashicorp/random
      version = "3.9.0"
    }
  }
}

provider "aws" {
  default_tags {
    tags = merge(var.tags, { Cluster = var.cluster_fqdn })
  }
}

data "aws_region" "current" {}

data "aws_caller_identity" "current" {}

provider "kubectl" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  load_config_file       = false
  lazy_load              = true
  exec {
    api_version = "client.authentication.k8s.io/v1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
  }
}

provider "helm" {
  kubernetes = {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
    exec = {
      api_version = "client.authentication.k8s.io/v1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
    }
  }
}

locals {
  cluster_name = split(".", var.cluster_fqdn)[0]
  base_domain  = join(".", slice(split(".", var.cluster_fqdn), 1, length(split(".", var.cluster_fqdn))))
  pii_block = [
    "PASSWORD", "CREDIT_DEBIT_CARD_NUMBER", "PIN",
    "INTERNATIONAL_BANK_ACCOUNT_NUMBER", "SWIFT_CODE",
    "AWS_ACCESS_KEY", "AWS_SECRET_KEY",
    "US_SOCIAL_SECURITY_NUMBER", "US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER",
    "US_BANK_ACCOUNT_NUMBER", "US_BANK_ROUTING_NUMBER",
    "CA_HEALTH_NUMBER", "CA_SOCIAL_INSURANCE_NUMBER",
    "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER", "UK_NATIONAL_INSURANCE_NUMBER",
    "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
  ]
  pii_anonymize = ["PHONE", "EMAIL", "ADDRESS", "DRIVER_ID", "LICENSE_PLATE", "VEHICLE_IDENTIFICATION_NUMBER", "MAC_ADDRESS"]
  # [rule_no, action, from_port, to_port, protocol]
  nacl_ingress = [
    [89, "deny", 22, 22, "tcp"],
    [90, "deny", 3389, 3389, "tcp"],
    [100, "allow", 443, 443, "tcp"],
    [110, "allow", 1024, 65535, "tcp"],
    [120, "allow", 53, 53, "udp"],
    [130, "allow", 123, 123, "udp"],
    [140, "allow", 1024, 65535, "udp"],
  ]
}
EOF

Define the input variables. Values are provided via TF_VAR_ environment variables - no defaults are baked into the configuration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
tee "${TMP_DIR}/${CLUSTER_FQDN}/variables.tf" << \EOF
variable "cluster_fqdn" {
  description = "FQDN of the EKS cluster (e.g. k01.k8s.mylabs.dev)"
  type        = string
}

variable "google_client_id" {
  description = "Google OAuth Client ID for OIDC authentication"
  type        = string
}

variable "google_client_secret" {
  description = "Google OAuth Client Secret for OIDC authentication"
  type        = string
  sensitive   = true
}

variable "tags" {
  description = "Tags applied to all AWS resources"
  type        = map(string)
}
EOF

Route53 and KMS key

Route53 KMS

Use the terraform-aws-modules collection to provision the Route 53 hosted zone for ${CLUSTER_FQDN}, delegate it from the parent zone, and create the KMS key for EKS secrets and EBS volume encryption:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
tee "${TMP_DIR}/${CLUSTER_FQDN}/infra-aws.tf" << \EOF
data "aws_route53_zone" "base" {
  name         = "${local.base_domain}."
  private_zone = false
}

data "aws_s3_objects" "velero_backup" {
  bucket   = var.cluster_fqdn
  prefix   = "velero/backups/cert-manager-production"
  max_keys = 1
}

module "route53_zone" {
  source        = "terraform-aws-modules/route53/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/route53/aws
  version       = "6.5.0"
  name          = var.cluster_fqdn
  force_destroy = true
}

resource "aws_route53_record" "ns_delegation" {
  zone_id = data.aws_route53_zone.base.zone_id
  name    = var.cluster_fqdn
  type    = "NS"
  ttl     = 60
  records = module.route53_zone.name_servers
}

module "kms" {
  source  = "terraform-aws-modules/kms/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/kms/aws
  version = "4.2.0"

  description             = "KMS key for ${local.cluster_name} Amazon EKS"
  deletion_window_in_days = 7
  enable_key_rotation     = true
  aliases                 = ["eks-${local.cluster_name}"]

  key_statements = [
    {
      sid = "AllowEBSEncryptionViaEC2Service"
      principals = [{ type = "AWS", identifiers = ["*"] }]
      actions = [
        "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
        "kms:GenerateDataKey*", "kms:CreateGrant", "kms:DescribeKey",
      ]
      resources = ["*"]
      condition = [
        {
          test     = "StringEquals"
          variable = "kms:ViaService"
          values   = ["ec2.${data.aws_region.current.region}.amazonaws.com"]
        },
        {
          test     = "StringEquals"
          variable = "kms:CallerAccount"
          values   = [data.aws_caller_identity.current.account_id]
        },
      ]
    },
    {
      sid = "AllowCloudWatchLogs"
      principals = [{ type = "Service", identifiers = ["logs.${data.aws_region.current.region}.amazonaws.com"] }]
      actions = [
        "kms:Encrypt*", "kms:Decrypt*", "kms:ReEncrypt*",
        "kms:GenerateDataKey*", "kms:Describe*",
      ]
      resources = ["*"]
      condition = [{
        test     = "ArnLike"
        variable = "kms:EncryptionContext:aws:logs:arn"
        values   = ["arn:aws:logs:${data.aws_region.current.region}:${data.aws_caller_identity.current.account_id}:*"]
      }]
    },
  ]
}
EOF

Amazon Bedrock

Enabling Bedrock foundation models is a one-time operation per account/region. Use the Bedrock console to opt in to the models you intend to use (Anthropic Claude, Meta Llama, Mistral, …).

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models from leading AI companies (Anthropic, Meta, Mistral, Amazon, and others) through a unified API.

Amazon Bedrock

Enable model invocation logging so every Bedrock request is captured in CloudWatch, and define a guardrail that the IAM policy will reference to enforce guardrail usage. An IAM role is required to allow Bedrock to write log events to the log group:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
tee "${TMP_DIR}/${CLUSTER_FQDN}/bedrock.tf" << \EOF
resource "aws_bedrock_guardrail" "ai_safety" {
  name                      = "${local.cluster_name}-ai-safety"
  description               = "Guardrail for AI model safety and compliance"
  blocked_input_messaging   = "Input contains blocked PII"
  blocked_outputs_messaging = "Output contains blocked PII"

  content_policy_config {
    filters_config {
      type            = "SEXUAL"
      input_strength  = "HIGH"
      output_strength = "HIGH"
    }
    filters_config {
      type            = "PROMPT_ATTACK"
      input_strength  = "HIGH"
      output_strength = "NONE"
    }
  }

  sensitive_information_policy_config {
    dynamic "pii_entities_config" {
      for_each = local.pii_block
      content {
        type   = pii_entities_config.value
        action = "BLOCK"
      }
    }
    dynamic "pii_entities_config" {
      for_each = local.pii_anonymize
      content {
        type   = pii_entities_config.value
        action = "ANONYMIZE"
      }
    }
  }
}
EOF

Amazon EKS

Provision the cluster with terraform-aws-modules/eks/aws. The module wires up the OIDC provider, addons, EKS managed node group (with Bottlerocket on Graviton), and the Pod Identity associations consumed by the addons further down the page.

terraform-aws-modules/eks

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
tee "${TMP_DIR}/${CLUSTER_FQDN}/eks.tf" << \EOF
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/vpc/aws
  version = "6.6.1"

  name = local.cluster_name
  cidr = "192.168.0.0/16"

  azs             = ["${data.aws_region.current.region}a", "${data.aws_region.current.region}b"]
  private_subnets = ["192.168.0.0/19", "192.168.32.0/19"]
  public_subnets  = ["192.168.64.0/19", "192.168.96.0/19"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  manage_default_security_group  = true
  default_security_group_ingress = []
  default_security_group_egress  = []

  # CIS AWS Foundations Benchmark v7.0 - 6.2: Ensure no Network ACLs allow
  # ingress from 0.0.0.0/0 to remote server administration ports (SSH/RDP)
  # https://docs.aws.amazon.com/securityhub/latest/userguide/ec2-controls.html#ec2-21
  manage_default_network_acl = true
  default_network_acl_ingress = [for r in local.nacl_ingress : {
    rule_no    = r[0]
    action     = r[1]
    from_port  = r[2]
    to_port    = r[3]
    protocol   = r[4]
    cidr_block = "0.0.0.0/0"
  }]

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
    "karpenter.sh/discovery"          = local.cluster_name
  }
}

# Enable Route53 DNS Resolver Logging for VPC
resource "aws_cloudwatch_log_group" "route53_resolver" {
  name              = "/aws/route53/${local.cluster_name}"
  retention_in_days = 1
  kms_key_id        = module.kms.key_arn
}

resource "aws_route53_resolver_query_log_config" "this" {
  name            = "${local.cluster_name}-dns-query-logging"
  destination_arn = aws_cloudwatch_log_group.route53_resolver.arn
}

resource "aws_route53_resolver_query_log_config_association" "this" {
  resolver_query_log_config_id = aws_route53_resolver_query_log_config.this.id
  resource_id                  = module.vpc.vpc_id
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks/aws
  version = "21.23.0"

  name                   = local.cluster_name
  kubernetes_version     = "1.35"
  endpoint_public_access = true

  vpc_id                   = module.vpc.vpc_id
  subnet_ids               = module.vpc.private_subnets
  control_plane_subnet_ids = module.vpc.private_subnets

  create_kms_key    = false
  encryption_config = {
    provider_key_arn = module.kms.key_arn
    resources        = ["secrets"]
  }

  enable_cluster_creator_admin_permissions = true

  addons = {
    coredns                = {}
    kube-proxy             = {}
    eks-pod-identity-agent = {}
    snapshot-controller    = {}
    aws-ebs-csi-driver = {
      configuration_values = jsonencode({
        defaultStorageClass = { enabled = false }
        controller          = { loggingFormat = "json" }
      })
    }
    vpc-cni = {
      before_compute = true
      configuration_values = jsonencode({
        enableNetworkPolicy = "true"
        env                 = { ENABLE_PREFIX_DELEGATION = "true" }
      })
    }
  }

  eks_managed_node_groups = {
    mng01 = {
      name           = "${local.cluster_name}-mng01"
      ami_type       = "BOTTLEROCKET_ARM_64"
      instance_types = ["t4g.medium"]
      capacity_type  = "ON_DEMAND"
      min_size       = 2
      max_size       = 3
      desired_size   = 2
      subnet_ids     = [module.vpc.private_subnets[0]]

      block_device_mappings = {
        xvda = {
          device_name = "/dev/xvda"
          ebs = {
            volume_size           = 2
            volume_type           = "gp3"
            encrypted             = true
            kms_key_id            = module.kms.key_arn
            delete_on_termination = true
          }
        }
        xvdb = {
          device_name = "/dev/xvdb"
          ebs = {
            volume_size           = 20
            volume_type           = "gp3"
            encrypted             = true
            kms_key_id            = module.kms.key_arn
            delete_on_termination = true
          }
        }
      }

      labels = { "node.kubernetes.io/lifecycle" = "on-demand" }
    }
  }

  cloudwatch_log_group_retention_in_days = 1
  cloudwatch_log_group_kms_key_id        = module.kms.key_arn
  enabled_log_types                      = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  node_security_group_additional_rules = {
    ingress_self_443 = {
      description = "Node to node HTTPS (webhooks, metrics-server, etc.)"
      protocol    = "tcp"
      from_port   = 443
      to_port     = 443
      type        = "ingress"
      self        = true
    }
  }

  node_security_group_tags = {
    "karpenter.sh/discovery" = local.cluster_name
  }
}

module "ebs_csi_pod_identity" {
  source  = "terraform-aws-modules/eks-pod-identity/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks-pod-identity/aws
  version = "2.8.1"

  name                      = "${local.cluster_name}-ebs-csi"
  attach_aws_ebs_csi_policy = true
  aws_ebs_csi_kms_arns      = [module.kms.key_arn]

  associations = {
    main = {
      cluster_name    = module.eks.cluster_name
      namespace       = "kube-system"
      service_account = "ebs-csi-controller-sa"
    }
  }
}

# Custom gp3 StorageClass with KMS encryption replaces the default gp2 class. The EBS CSI addon has defaultStorageClass disabled so this takes precedence.
resource "kubectl_manifest" "gp3" {
  yaml_body = <<-YAML
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: gp3
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
    provisioner: ebs.csi.aws.com
    reclaimPolicy: Delete
    allowVolumeExpansion: true
    volumeBindingMode: WaitForFirstConsumer
    parameters:
      type: gp3
      encrypted: "true"
      kmsKeyId: ${module.kms.key_arn}
  YAML
  depends_on = [module.eks]
}

# Default VolumeSnapshotClass for the EBS CSI driver, required by Velero to create EBS snapshots when backing up PersistentVolumes.
resource "kubectl_manifest" "vsc_ebs" {
  yaml_body = <<-YAML
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: ebs-vsc
      annotations:
        snapshot.storage.kubernetes.io/is-default-class: "true"
    driver: ebs.csi.aws.com
    deletionPolicy: Delete
  YAML
  depends_on = [module.eks]
}
EOF

AWS Load Balancer Controller

AWS Load Balancer Controller provisions ELBv2 resources (ALB/NLB) for Services, Ingresses, and Gateways.

AWS Load Balancer Controller

Install the aws-load-balancer-controller Helm chart and customize its default values:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
tee "${TMP_DIR}/${CLUSTER_FQDN}/aws-load-balancer-controller.tf" << \EOF
module "aws_lb_controller_pod_identity" {
  source  = "terraform-aws-modules/eks-pod-identity/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks-pod-identity/aws
  version = "2.8.1"

  name                            = "${local.cluster_name}-aws-lbc"
  attach_aws_lb_controller_policy = true

  associations = {
    main = {
      cluster_name    = module.eks.cluster_name
      namespace       = "aws-load-balancer-controller"
      service_account = "aws-load-balancer-controller"
    }
  }
}

resource "helm_release" "aws_load_balancer_controller" {
  # renovate: datasource=helm depName=aws-load-balancer-controller registryUrl=https://aws.github.io/eks-charts
  version          = "3.4.0"
  name             = "aws-load-balancer-controller"
  repository       = "https://aws.github.io/eks-charts"
  chart            = "aws-load-balancer-controller"
  namespace        = "aws-load-balancer-controller"
  create_namespace = true
  wait = true

  values = [<<-YAML
    clusterName: ${local.cluster_name}
    vpcId: ${module.vpc.vpc_id}
    serviceAccount:
      name: aws-load-balancer-controller
    defaultTags:
      Owner: ${var.tags["Owner"]}
      Environment: dev
      Cluster: ${var.cluster_fqdn}
  YAML
  ]

  depends_on = [
    module.aws_lb_controller_pod_identity,
    module.eks,
  ]
}
EOF

cert-manager

cert-manager adds certificates and certificate issuers as resource types in Kubernetes clusters and simplifies the process of obtaining, renewing, and using those certificates.

cert-manager

Install the cert-manager Helm chart and customize its default values. Provision the Pod Identity role granted to the cert-manager ServiceAccount (scoped to the ${CLUSTER_FQDN} hosted zone):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
tee "${TMP_DIR}/${CLUSTER_FQDN}/cert-manager.tf" << \EOF
module "cert_manager_pod_identity" {
  source  = "terraform-aws-modules/eks-pod-identity/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks-pod-identity/aws
  version = "2.8.1"

  name                       = "${local.cluster_name}-cert-manager"
  attach_cert_manager_policy = true
  cert_manager_hosted_zone_arns = [
    module.route53_zone.arn,
  ]

  associations = {
    main = {
      cluster_name    = module.eks.cluster_name
      namespace       = "cert-manager"
      service_account = "cert-manager"
    }
  }
}

resource "helm_release" "cert_manager" {
  # renovate: datasource=helm depName=cert-manager registryUrl=https://charts.jetstack.io extractVersion=^(?<version>.+)$
  version          = "v1.20.2"
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  namespace        = "cert-manager"
  create_namespace = true
  wait             = true

  values = [<<-YAML
    crds:
      enabled: true
    extraArgs:
      - --enable-certificate-owner-ref=true
    serviceAccount:
      name: cert-manager
    enableCertificateOwnerRef: true
    webhook:
      replicaCount: 2
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: cert-manager
                  app.kubernetes.io/component: webhook
              topologyKey: kubernetes.io/hostname
  YAML
  ]

  depends_on = [
    module.cert_manager_pod_identity,
    # Ensure AWS LB Controller webhook is ready before creating Services otherwise the mutating webhook "mservice.elbv2.k8s.aws" rejects requests with "no endpoints available" if the controller pod is not yet running
    helm_release.aws_load_balancer_controller,
  ]
}
EOF

Create the ClusterIssuer and Certificate resources through OpenTofu using the alekc/kubectl provider.

  • ClusterIssuer configuring Let’s Encrypt production ACME with DNS-01 challenges solved via Route 53 (using cert-manager’s Pod Identity for AWS API access).

  • Wildcard TLS certificate for *.cluster_fqdn issued by Let’s Encrypt. Only created when no Velero backup exists (count condition) - on subsequent runs the certificate+secret are restored from the Velero backup instead, avoiding unnecessary ACME rate-limit consumption. wait_for blocks until cert-manager reports the certificate as Ready so that downstream resources (Gateway TLS listeners) can reference the secret.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
tee "${TMP_DIR}/${CLUSTER_FQDN}/cert-manager-letsencrypt.tf" << \EOF
resource "kubectl_manifest" "letsencrypt_production_dns" {
  yaml_body = <<-YAML
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-production-dns
      labels:
        letsencrypt: production
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: ${var.tags["Owner"]}
        privateKeySecretRef:
          name: letsencrypt-production-dns
        solvers:
          - selector:
              dnsZones:
                - ${var.cluster_fqdn}
            dns01:
              route53: {}
  YAML

  depends_on = [helm_release.cert_manager]
}

resource "kubectl_manifest" "cert_production" {
  count = length(data.aws_s3_objects.velero_backup.keys) == 0 ? 1 : 0

  yaml_body = <<-YAML
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: cert-production
      namespace: cert-manager
      labels:
        letsencrypt: production
    spec:
      secretName: cert-production
      secretTemplate:
        labels:
          letsencrypt: production
      issuerRef:
        name: letsencrypt-production-dns
        kind: ClusterIssuer
      commonName: "*.${var.cluster_fqdn}"
      dnsNames:
        - "*.${var.cluster_fqdn}"
        - "${var.cluster_fqdn}"
  YAML

  wait_for {
    field {
      key   = "status.conditions.[0].status"
      value = "True"
    }
  }

  depends_on = [kubectl_manifest.letsencrypt_production_dns]
}
EOF

Velero

Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes.

Velero

Install the velero Helm chart and customize its default values:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
tee "${TMP_DIR}/${CLUSTER_FQDN}/velero.tf" << \EOF
data "aws_iam_policy_document" "velero" {
  statement {
    actions = [
      "ec2:DescribeVolumes",
      "ec2:DescribeSnapshots",
      "ec2:CreateTags",
      "ec2:CreateSnapshot",
      "ec2:DeleteSnapshot",
    ]
    resources = ["*"]
  }
  statement {
    actions   = ["s3:ListBucket", "s3:GetBucketLocation", "s3:ListBucketMultipartUploads"]
    resources = ["arn:aws:s3:::${var.cluster_fqdn}"]
  }
  statement {
    actions   = ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload"]
    resources = ["arn:aws:s3:::${var.cluster_fqdn}/*"]
  }
}

module "velero_pod_identity" {
  source  = "terraform-aws-modules/eks-pod-identity/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks-pod-identity/aws
  version = "2.8.1"

  name                    = "${local.cluster_name}-velero"
  attach_custom_policy    = true
  source_policy_documents = [data.aws_iam_policy_document.velero.json]

  associations = {
    main = {
      cluster_name    = module.eks.cluster_name
      namespace       = "velero"
      service_account = "velero"
    }
  }
}

resource "helm_release" "velero" {
  # renovate: datasource=helm depName=velero registryUrl=https://vmware-tanzu.github.io/helm-charts
  version          = "12.0.2"
  name             = "velero"
  repository       = "https://vmware-tanzu.github.io/helm-charts"
  chart            = "velero"
  namespace        = "velero"
  create_namespace = true
  wait             = true

  values = [<<-YAML
    initContainers:
      - name: velero-plugin-for-aws
        # renovate: datasource=github-tags depName=vmware-tanzu/velero-plugin-for-aws extractVersion=^(?<version>.+)$
        image: velero/velero-plugin-for-aws:v1.14.1
        volumeMounts:
          - mountPath: /target
            name: plugins
    configuration:
      backupStorageLocation: []
      volumeSnapshotLocation:
        - provider: aws
          config:
            region: ${data.aws_region.current.region}
    serviceAccount:
      server:
        name: velero
    credentials:
      useSecret: false
  YAML
  ]

  depends_on = [
    helm_release.cert_manager,
    module.velero_pod_identity,
  ]
}

# Create BSL separately so we can use wait_for to confirm Velero has completed at least one backup sync cycle (status.lastSyncedTime is set).
resource "kubectl_manifest" "velero_bsl" {
  yaml_body = <<-YAML
    apiVersion: velero.io/v1
    kind: BackupStorageLocation
    metadata:
      name: default
      namespace: velero
    spec:
      provider: aws
      default: true
      objectStorage:
        bucket: ${var.cluster_fqdn}
        prefix: velero
      config:
        region: ${data.aws_region.current.region}
  YAML

  wait_for {
    field {
      key        = "status.lastSyncedTime"
      value      = ".+"
      value_type = "regex"
    }
  }

  depends_on = [helm_release.velero]
}

resource "kubectl_manifest" "velero_restore_cert" {
  count = length(data.aws_s3_objects.velero_backup.keys) > 0 ? 1 : 0

  yaml_body = <<-YAML
    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: restore-cert-manager-production
      namespace: velero
      labels:
        letsencrypt: production
    spec:
      backupName: cert-manager-production
      existingResourcePolicy: update
  YAML

  wait_for {
    field {
      key   = "status.phase"
      value = "Completed"
    }
  }

  depends_on = [
    kubectl_manifest.velero_bsl,
  ]
}
EOF

Envoy Gateway

Envoy Gateway is an implementation of the Kubernetes Gateway API built on Envoy Proxy. It will terminate TLS, run the OIDC flow against Google, and forward authenticated requests to Open WebUI and other services.

Envoy Gateway

Install the gateway-helm Helm chart and customize its default values:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
tee "${TMP_DIR}/${CLUSTER_FQDN}/envoy-gateway.tf" << \EOF
resource "helm_release" "envoy_gateway" {
  # renovate: datasource=docker depName=envoyproxy/gateway-helm registryUrl=https://docker.io
  version          = "1.8.1"
  name             = "envoy-gateway"
  repository       = "oci://docker.io/envoyproxy"
  chart            = "gateway-helm"
  namespace        = "envoy-gateway-system"
  create_namespace = true
  wait             = true

  depends_on = [
    helm_release.cert_manager,
  ]
}

# Kubernetes Secret holding the Google OAuth client secret, referenced by the SecurityPolicy OIDC configuration to authenticate users via Google.
resource "kubectl_manifest" "google_oidc_client_secret" {
  yaml_body = <<-YAML
    apiVersion: v1
    kind: Secret
    metadata:
      name: google-oidc-client-secret
      namespace: envoy-gateway-system
    type: Opaque
    stringData:
      client-secret: ${var.google_client_secret}
  YAML
  sensitive_fields = ["stringData"]
  depends_on       = [helm_release.envoy_gateway]
}

# GatewayClass registers Envoy Gateway as the controller for Gateway API resources. All Gateway objects referencing the "eg" class are reconciled by the Envoy Gateway controller.
resource "kubectl_manifest" "gatewayclass" {
  yaml_body = <<-YAML
    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: eg
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
  YAML
  depends_on = [helm_release.envoy_gateway]
}

# EnvoyProxy customizes the data-plane Service created by the Gateway. Annotations instruct the AWS Load Balancer Controller to provision an internet-facing NLB with IP-mode targets.
resource "kubectl_manifest" "envoy_proxy_nlb" {
  yaml_body = <<-YAML
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: aws-nlb
      namespace: envoy-gateway-system
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: external
              service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
              service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
              service.beta.kubernetes.io/aws-load-balancer-name: eks-${local.cluster_name}
  YAML
  depends_on = [helm_release.envoy_gateway]
}

# ReferenceGrant allows the Gateway in envoy-gateway-system to reference the "cert-production" TLS Secret in the cert-manager namespace. Without this, cross-namespace Secret references are rejected by the Gateway API.
resource "kubectl_manifest" "ref_grant_cert_secret" {
  yaml_body = <<-YAML
    apiVersion: gateway.networking.k8s.io/v1beta1
    kind: ReferenceGrant
    metadata:
      name: allow-eg-to-cert-manager-secrets
      namespace: cert-manager
    spec:
      from:
        - group: gateway.networking.k8s.io
          kind: Gateway
          namespace: envoy-gateway-system
      to:
        - group: ""
          kind: Secret
          name: cert-production
  YAML
  depends_on = [helm_release.envoy_gateway]
}

# Central Gateway resource that terminates TLS for both the wildcard (*.cluster_fqdn) and apex (cluster_fqdn) hostnames. It references the NLB-backed EnvoyProxy for infrastructure and the Let's Encrypt certificate from cert-manager for TLS. All HTTPRoutes in any namespace can attach to this Gateway.
resource "kubectl_manifest" "gateway" {
  yaml_body = <<-YAML
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: eg
      namespace: envoy-gateway-system
    spec:
      gatewayClassName: eg
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: aws-nlb
      listeners:
        - name: https
          port: 443
          protocol: HTTPS
          hostname: "*.${var.cluster_fqdn}"
          tls:
            mode: Terminate
            certificateRefs:
              - name: cert-production
                namespace: cert-manager
          allowedRoutes:
            namespaces:
              from: All
        - name: https-apex
          port: 443
          protocol: HTTPS
          hostname: "${var.cluster_fqdn}"
          tls:
            mode: Terminate
            certificateRefs:
              - name: cert-production
                namespace: cert-manager
          allowedRoutes:
            namespaces:
              from: All
  YAML
  depends_on = [
    kubectl_manifest.ref_grant_cert_secret,
    kubectl_manifest.envoy_proxy_nlb,
    kubectl_manifest.gatewayclass,
  ]
}

# SecurityPolicy attached to both Gateway listeners that enforces Google OIDC authentication and JWT-based authorization. Only the email specified in var.tags["Owner"] is allowed access. Authenticated user identity is forwarded to backends via X-Forwarded-Email and X-Forwarded-User headers.
resource "kubectl_manifest" "security_policy_oidc" {
  yaml_body = <<-YAML
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: SecurityPolicy
    metadata:
      name: google-oidc
      namespace: envoy-gateway-system
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: eg
          sectionName: https
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: eg
          sectionName: https-apex
      oidc:
        provider:
          issuer: "https://accounts.google.com"
        clientID: "${var.google_client_id}"
        clientSecret:
          name: google-oidc-client-secret
        redirectURL: "https://${var.cluster_fqdn}/oauth2/callback"
        scopes: [openid, email, profile]
        cookieNames:
          accessToken: oidc-access-token
          idToken: oidc-id-token
        cookieDomain: "${var.cluster_fqdn}"
        logoutPath: "/logout"
      jwt:
        providers:
          - name: google
            issuer: "https://accounts.google.com"
            remoteJWKS:
              uri: "https://www.googleapis.com/oauth2/v3/certs"
            extractFrom:
              cookies: [oidc-id-token]
            claimToHeaders:
              - { header: X-Forwarded-Email, claim: email }
              - { header: X-Forwarded-User,  claim: name  }
      authorization:
        defaultAction: Deny
        rules:
          - name: allow-specific-email
            action: Allow
            principal:
              jwt:
                provider: google
                claims:
                  - name: email
                    values: ["${var.tags["Owner"]}"]
  YAML
  depends_on = [
    kubectl_manifest.gateway,
    kubectl_manifest.google_oidc_client_secret,
  ]
}

# HTTPRoute for the apex domain (cluster_fqdn) that redirects all traffic to the chat subdomain (chat.cluster_fqdn) with a 302 status code, providing a convenient entry point.
resource "kubectl_manifest" "apex_httproute" {
  yaml_body = <<-YAML
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: apex
      namespace: envoy-gateway-system
    spec:
      parentRefs:
        - name: eg
          namespace: envoy-gateway-system
          sectionName: https-apex
      hostnames:
        - ${var.cluster_fqdn}
      rules:
        - filters:
            - type: RequestRedirect
              requestRedirect:
                hostname: chat.${var.cluster_fqdn}
                statusCode: 302
  YAML
  depends_on = [kubectl_manifest.gateway]
}
EOF

Karpenter

Karpenter automatically scales the node pool based on pending pod requirements. The EKS module provisions the IAM roles via the karpenter sub-module.

Karpenter

Install the karpenter Helm chart and customize its default values:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
tee "${TMP_DIR}/${CLUSTER_FQDN}/karpenter.tf" << \EOF
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks/aws
  version = "21.23.0"

  cluster_name = module.eks.cluster_name

  namespace       = "karpenter"
  service_account = "karpenter"

  node_iam_role_use_name_prefix = false
  node_iam_role_name            = "KarpenterNodeRole-${local.cluster_name}"

  queue_managed_sse_enabled = false
  queue_kms_master_key_id   = module.kms.key_id
}

resource "helm_release" "karpenter" {
  # renovate: datasource=github-tags depName=aws/karpenter-provider-aws
  version          = "1.12.1"
  name             = "karpenter"
  repository       = "oci://public.ecr.aws/karpenter"
  chart            = "karpenter"
  namespace        = "karpenter"
  create_namespace = true
  wait             = true

  values = [<<-YAML
    settings:
      clusterName: ${module.eks.cluster_name}
      eksControlPlane: true
      interruptionQueue: ${module.karpenter.queue_name}
      featureGates:
        spotToSpotConsolidation: true
    serviceAccount:
      name: karpenter
  YAML
  ]

  depends_on = [
    helm_release.cert_manager,
    module.karpenter,
  ]
}

# EC2NodeClass defines the AWS-specific node configuration for Karpenter-provisioned instances: Bottlerocket AMI, VPC subnets/security groups discovered via tags, the Karpenter IAM role, and KMS-encrypted gp3 EBS volumes.
resource "kubectl_manifest" "ec2_nodeclass_default" {
  yaml_body = <<-YAML
    apiVersion: karpenter.k8s.aws/v1
    kind: EC2NodeClass
    metadata:
      name: default
    spec:
      amiFamily: Bottlerocket
      amiSelectorTerms:
        - alias: bottlerocket@latest
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: "${local.cluster_name}"
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: "${local.cluster_name}"
      role: "KarpenterNodeRole-${local.cluster_name}"
      tags:
        Name: "${local.cluster_name}-karpenter"
        Cluster: "${var.cluster_fqdn}"
      blockDeviceMappings:
        - deviceName: /dev/xvda
          ebs:
            volumeSize: 2Gi
            volumeType: gp3
            encrypted: true
            kmsKeyID: ${module.kms.key_arn}
        - deviceName: /dev/xvdb
          ebs:
            volumeSize: 20Gi
            volumeType: gp3
            encrypted: true
            kmsKeyID: ${module.kms.key_arn}
  YAML
  depends_on = [helm_release.karpenter]
}

# NodePool defines the scheduling constraints for Karpenter: instances must have >=4 GiB RAM, run in a single AZ to minimize cross-AZ costs, use cost-efficient t4g/t3a families, and prefer spot capacity with on-demand fallback.
resource "kubectl_manifest" "nodepool_default" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          requirements:
            - key: "karpenter.k8s.aws/instance-memory"
              operator: Gt
              values: ["4095"]
            - key: "topology.kubernetes.io/zone"
              operator: In
              values: ["${data.aws_region.current.region}a"]
            - key: "karpenter.k8s.aws/instance-family"
              operator: In
              values: ["t4g", "t3a"]
            - key: "karpenter.sh/capacity-type"
              operator: In
              values: ["spot", "on-demand"]
            - key: "kubernetes.io/arch"
              operator: In
              values: ["arm64", "amd64"]
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default
  YAML
  depends_on = [kubectl_manifest.ec2_nodeclass_default]
}
EOF

ExternalDNS

ExternalDNS synchronises Kubernetes Services, Ingresses, and Gateway API routes with Route 53.

ExternalDNS

Install the external-dns Helm chart and customize its default values:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
tee "${TMP_DIR}/${CLUSTER_FQDN}/external-dns.tf" << \EOF
module "external_dns_pod_identity" {
  source  = "terraform-aws-modules/eks-pod-identity/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks-pod-identity/aws
  version = "2.8.1"

  name                       = "${local.cluster_name}-external-dns"
  attach_external_dns_policy = true
  external_dns_hosted_zone_arns = [
    module.route53_zone.arn,
  ]

  associations = {
    main = {
      cluster_name    = module.eks.cluster_name
      namespace       = "external-dns"
      service_account = "external-dns"
    }
  }
}

resource "helm_release" "external_dns" {
  # renovate: datasource=helm depName=external-dns registryUrl=https://kubernetes-sigs.github.io/external-dns/
  version          = "1.21.1"
  name             = "external-dns"
  repository       = "https://kubernetes-sigs.github.io/external-dns/"
  chart            = "external-dns"
  namespace        = "external-dns"
  create_namespace = true

  values = [<<-YAML
    serviceAccount:
      name: external-dns
    interval: 20s
    policy: sync
    domainFilters:
      - ${var.cluster_fqdn}
    sources:
      - service
      - ingress
      - gateway-httproute
      - gateway-grpcroute
  YAML
  ]

  depends_on = [
    helm_release.cert_manager,
    kubectl_manifest.nodepool_default,
    module.external_dns_pod_identity,
  ]
}
EOF

LiteLLM

LiteLLM is an OpenAI-compatible proxy that supports 100+ LLM providers including Amazon Bedrock. It passes guardrailConfig inline in the Bedrock Converse API call, satisfying the IAM bedrock:GuardrailIdentifier condition. It uses EKS Pod Identity for authentication - no IAM users or long-term credentials are needed. A standalone PostgreSQL database is deployed alongside LiteLLM for storing usage data and configuration - models are configured via a static YAML file.

LiteLLM

Install litellm using Helm and customize its default values. Create a dedicated IAM role granting the LiteLLM pod permission to call the Bedrock Converse/InvokeModel APIs with guardrail enforcement, and associate it with the litellm ServiceAccount through EKS Pod Identity:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
tee "${TMP_DIR}/${CLUSTER_FQDN}/litellm.tf" << \EOF
data "aws_iam_policy_document" "bedrock_invoke" {
  statement {
    sid = "BedrockInvoke"
    actions = [
      "bedrock:InvokeModel",
      "bedrock:InvokeModelWithResponseStream",
      "bedrock:Converse",
      "bedrock:ConverseStream",
    ]
    resources = [
      "arn:aws:bedrock:*::foundation-model/*",
      "arn:aws:bedrock:*:*:inference-profile/*",
    ]
    condition {
      test     = "StringEquals"
      variable = "bedrock:GuardrailIdentifier"
      values   = [aws_bedrock_guardrail.ai_safety.guardrail_arn]
    }
  }
  statement {
    sid       = "BedrockApplyGuardrail"
    actions   = ["bedrock:ApplyGuardrail"]
    resources = [aws_bedrock_guardrail.ai_safety.guardrail_arn]
  }
  statement {
    sid = "BedrockListAndGet"
    actions = [
      "bedrock:ListFoundationModels",
      "bedrock:GetFoundationModel",
      "bedrock:ListInferenceProfiles",
      "bedrock:GetInferenceProfile",
    ]
    resources = ["*"]
  }
}

module "litellm_pod_identity" {
  source  = "terraform-aws-modules/eks-pod-identity/aws"
  # renovate: datasource=terraform-module depName=terraform-aws-modules/eks-pod-identity/aws
  version = "2.8.1"

  name                    = "${local.cluster_name}-litellm"
  attach_custom_policy    = true
  source_policy_documents = [data.aws_iam_policy_document.bedrock_invoke.json]

  associations = {
    main = {
      cluster_name    = module.eks.cluster_name
      namespace       = "litellm"
      service_account = "litellm"
    }
  }
}

# Pre-create the master key secret with a known value so both LiteLLM and
# Open WebUI can reference the same API key deterministically.
resource "random_password" "litellm_master_key" {
  length  = 32
  special = false
}

resource "helm_release" "litellm" {
  # renovate: datasource=docker depName=docker.litellm.ai/berriai/litellm-helm
  version          = "1.88.0"
  name             = "litellm"
  chart            = "oci://docker.litellm.ai/berriai/litellm-helm"
  namespace        = "litellm"
  create_namespace = true
  wait             = true

  values = [<<-YAML
    replicaCount: 1
    image:
      repository: ghcr.io/berriai/litellm-database
      pullPolicy: Always
    resources:
      requests:
        memory: 1Gi
    masterkey: sk-${random_password.litellm_master_key.result}
    serviceAccount:
      create: true
      name: litellm
    service:
      port: 4000
    db:
      deployStandalone: true
    postgresql:
      image:
        tag: latest
      auth:
        password: ${random_password.litellm_master_key.result}
        postgres-password: ${random_password.litellm_master_key.result}
    disableSchemaUpdate: false
    migrationJob:
      enabled: false
    proxy_config:
      model_list:
        - model_name: us.anthropic.claude-haiku-4-5-20251001-v1:0
          litellm_params:
            model: bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
            aws_region_name: ${data.aws_region.current.region}
            guardrailConfig:
              guardrailIdentifier: ${aws_bedrock_guardrail.ai_safety.guardrail_arn}
              guardrailVersion: "DRAFT"
              trace: "disabled"
      litellm_settings:
        drop_params: true
      general_settings:
        store_model_in_db: true
        store_prompts_in_spend_logs: true
  YAML
  ]

  depends_on = [
    kubectl_manifest.nodepool_default,
    module.litellm_pod_identity,
    helm_release.cert_manager,
  ]
}

# HTTPRoute exposes LiteLLM API through the Envoy Gateway at litellm.${cluster_fqdn}
resource "kubectl_manifest" "litellm_httproute" {
  yaml_body = <<-YAML
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: litellm
      namespace: litellm
    spec:
      parentRefs:
        - name: eg
          namespace: envoy-gateway-system
          sectionName: https
      hostnames:
        - litellm.${var.cluster_fqdn}
      rules:
        - backendRefs:
            - name: litellm
              port: 4000
  YAML
  depends_on = [
    helm_release.litellm,
    kubectl_manifest.gateway,
  ]
}
EOF

LiteLLM LiteLLM

Open WebUI

Open WebUI is a user-friendly web interface for chat-style interactions with LLMs. Install the open-webui Helm chart and customize its default values. Point it at LiteLLM’s in-cluster OpenAI-compatible endpoint and expose it through the Envoy Gateway:

Open WebUI

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
tee "${TMP_DIR}/${CLUSTER_FQDN}/open-webui.tf" << \EOF
resource "helm_release" "open_webui" {
  # renovate: datasource=helm depName=open-webui registryUrl=https://helm.openwebui.com
  version          = "14.8.0"
  name             = "open-webui"
  repository       = "https://helm.openwebui.com"
  chart            = "open-webui"
  namespace        = "open-webui"
  create_namespace = true

  values = [<<-YAML
    ollama:
      enabled: false
    pipelines:
      enabled: false
    persistence:
      enabled: false
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 2Gi
    openaiBaseApiUrl: http://litellm.litellm.svc:4000/v1
    extraEnvVars:
      - name: OPENAI_API_KEY
        value: sk-${random_password.litellm_master_key.result}
      - name: WEBUI_AUTH
        value: "false"
      - name: ENABLE_SIGNUP
        value: "false"
      - name: ENABLE_EVALUATION_ARENA_MODELS
        value: "false"
      - name: DEFAULT_MODELS
        value: us.anthropic.claude-haiku-4-5-20251001-v1:0
      - name: WEBUI_AUTH_TRUSTED_EMAIL_HEADER
        value: X-Forwarded-Email
      - name: WEBUI_AUTH_TRUSTED_NAME_HEADER
        value: X-Forwarded-User
  YAML
  ]

  depends_on = [helm_release.litellm]
}

# HTTPRoute exposing Open WebUI at chat.<cluster_fqdn>, the primary user-facing endpoint. Traffic passes through OIDC authentication enforced by the SecurityPolicy on the Gateway before reaching the Open WebUI Service.
resource "kubectl_manifest" "openwebui_httproute" {
  yaml_body = <<-YAML
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: open-webui
      namespace: open-webui
    spec:
      parentRefs:
        - name: eg
          namespace: envoy-gateway-system
          sectionName: https
      hostnames:
        - chat.${var.cluster_fqdn}
      rules:
        - backendRefs:
            - name: open-webui
              port: 80
  YAML
  depends_on = [
    helm_release.open_webui,
    kubectl_manifest.gateway,
  ]
}
EOF

Open WebUI Open WebUI

OpenTofu Code - apply

Initialise the OpenTofu working directory and apply the entire configuration in a single run:

1
2
3
4
tofu -chdir="${TMP_DIR}/${CLUSTER_FQDN}" init
if [[ ! ${MY_TASK:-} =~ delete: ]]; then
  tofu -chdir="${TMP_DIR}/${CLUSTER_FQDN}" apply -auto-approve
fi

Visit https://chat.${CLUSTER_FQDN} - you should be redirected through the Google OIDC flow by Envoy Gateway, and then land in Open WebUI with the Bedrock-backed Claude model.

Clean-up

Remove the cluster and all related resources with OpenTofu.

Clean-up

Set environment variables:

1
2
3
4
5
6
7
8
9
10
11
12
13
export AWS_REGION="${AWS_REGION:-us-east-1}"
export CLUSTER_FQDN="${CLUSTER_FQDN:-k01.k8s.mylabs.dev}"
export TF_VAR_cluster_fqdn="${CLUSTER_FQDN}"
export BASE_DOMAIN="${CLUSTER_FQDN#*.}"
export CLUSTER_NAME="${CLUSTER_FQDN%%.*}"
export MY_EMAIL="${MY_EMAIL:-petr.ruzicka@gmail.com}"
export TF_VAR_tags="{\"Owner\":\"${MY_EMAIL}\",\"Environment\":\"dev\",\"Base-Domain\":\"${BASE_DOMAIN}\",\"Managed-by\":\"opentofu\"}"
export TF_VAR_google_client_id="${GOOGLE_CLIENT_ID}"
export TF_VAR_google_client_secret="${GOOGLE_CLIENT_SECRET}"
export TMP_DIR="${TMP_DIR:-${PWD}/tmp}"
mkdir -p "${TMP_DIR}/${CLUSTER_FQDN}"
export KUBECONFIG="${KUBECONFIG:-${TMP_DIR}/${CLUSTER_FQDN}/kubeconfig.conf}"
aws eks update-kubeconfig --region "${AWS_REGION}" --name "${CLUSTER_NAME}" --kubeconfig "${KUBECONFIG}" || true

Back up the cert-manager certificate before tearing the cluster down (only if it was issued/renewed during this cluster’s lifetime - a completed CertificateRequest with the letsencrypt: production label only exists when cert-manager performed the ACME flow, not after a Velero restore):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
if kubectl get certificaterequest -n cert-manager -l letsencrypt=production -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' | grep -q "True"; then
  kubectl apply -f - << EOF
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: cert-manager-production
  namespace: velero
spec:
  ttl: 2160h
  includedNamespaces:
    - cert-manager
  includedResources:
    - certificates.cert-manager.io
    - secrets
  labelSelector:
    matchLabels:
      letsencrypt: production
EOF
fi

Recreate the OpenTofu code files:

1
2
export MY_TASK="${MISE_TASK_NAME}"
mise run "create:${MISE_TASK_NAME##*:}"

Remove the Gateway resource so the AWS Load Balancer Controller can properly delete the NLB and its security groups while still running:

1
tofu -chdir="${TMP_DIR}/${CLUSTER_FQDN}" destroy -target=kubectl_manifest.nodepool_default -target=kubectl_manifest.gateway -auto-approve || true

Terminate EC2 instances provisioned by Karpenter:

1
2
3
4
for EC2 in $(aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" "Name=tag:karpenter.sh/nodepool,Values=*" Name=instance-state-name,Values=running --query "Reservations[].Instances[].InstanceId" --output text); do
  echo "*** Removing Karpenter EC2: ${EC2}"
  aws ec2 terminate-instances --instance-ids "${EC2}"
done

Destroy the remaining infrastructure with OpenTofu:

1
2
3
4
if tofu -chdir="${TMP_DIR}/${CLUSTER_FQDN}" destroy -auto-approve; then
  aws s3 rm "s3://${CLUSTER_FQDN}/terraform.tfstate" --recursive
  rm -rf "${TMP_DIR:?}/${CLUSTER_FQDN:?}"
fi

Remove EBS volumes and snapshots related to the cluster (as a precaution):

1
2
3
4
5
6
7
8
9
for VOLUME in $(aws ec2 describe-volumes --filter "Name=tag:KubernetesCluster,Values=${CLUSTER_NAME}" "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" --query 'Volumes[].VolumeId' --output text); do
  echo "*** Removing Volume: ${VOLUME}"
  aws ec2 delete-volume --volume-id "${VOLUME}"
done

for SNAPSHOT in $(aws ec2 describe-snapshots --owner-ids self --filter "Name=tag:Name,Values=${CLUSTER_NAME}-dynamic-snapshot*" "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" --query 'Snapshots[].SnapshotId' --output text); do
  echo "*** Removing Snapshot: ${SNAPSHOT}"
  aws ec2 delete-snapshot --snapshot-id "${SNAPSHOT}"
done

Remove the CloudWatch log group:

1
2
3
if [[ "$(aws logs describe-log-groups --query "logGroups[?logGroupName==\`/aws/eks/${CLUSTER_NAME}/cluster\`] | [0].logGroupName" --output text)" = "/aws/eks/${CLUSTER_NAME}/cluster" ]]; then
  aws logs delete-log-group --log-group-name "/aws/eks/${CLUSTER_NAME}/cluster"
fi

Enjoy … 😉

This post is licensed under CC BY 4.0 by the author.