Production console in Kubernetes from your laptop

Sooner or later you need to access a production console. Some bug only happens with real records, or support asks you to fix one row by hand.

If it were a plain VM, you could SSH into it. Since it’s on Kubernetes, your next choice might be kubectl exec into a running app pod - but then your console steals threads from a pod that is serving live traffic, and the pod goes away on the next deploy, in the middle of your session. And if your memory limits are tuned tightly to the app, starting a second process is likely to OOM the whole pod.

What we can do instead: spin up a fresh throwaway pod from the same image the app runs, get your console there, and let the pod delete itself when you quit. No node access, nothing shared with live traffic, nothing to clean up.

Here is the script that I use.

To make things concrete, the example uses a Ruby on Rails console, but it can easily be swapped for something else.

The script

#!/bin/bash
set -e

# Pod name carries the developer’s username so two people don’t collide
POD_NAME="api-${USERNAME}-console"

# If a previous run crashed and left a pod behind, clear it so the name is free
kubectl delete pod "$POD_NAME" --ignore-not-found
kubectl run "$POD_NAME" --image=my-registry/api:latest --stdin --tty --rm --restart=Never --image-pull-policy Always --override-type=strategic \
--overrides='
{
  "metadata": {
    "labels": { "pod-class": "throwaway" }
  },
  "spec": {
    "activeDeadlineSeconds": 86400,
    "containers": [
      {
        "name": "'"$POD_NAME"'",
        "envFrom": [
          { "configMapRef": { "name": "api-config" } },
          { "secretRef": { "name": "api-secret" } }
        ],
        "resources": {
          "requests": {
            "cpu": "1",
            "memory": "2Gi"
          }
        }
      }
    ]
  }
}
' \
-- ./bin/rails c

Going through it piece by piece:

  • POD_NAME="api-${USERNAME}-console" - the pod name has your username in it.
  • kubectl delete pod ... --ignore-not-found before kubectl run - if a previous run crashed and left a pod behind, kill it first so the name is free. --ignore-not-found keeps it quiet when there is nothing to delete.
  • --image=...api:latest - the same image production runs.
  • --stdin --tty --rm --restart=Never - interactive terminal, auto-remove on exit, and it is a plain one-shot pod, not a Deployment that would respawn itself. The --rm is the cleanup: when kubectl run returns, the pod is deleted. The pre-run delete handles the case where it didn’t get that far last time, and activeDeadlineSeconds plus the reaper handle the case where nothing ever returns.
  • --override-type=strategic plus --overrides='...' - kubectl run alone can’t set env from configmaps or set resources, so we patch the pod spec with JSON:
    • metadata.labels pod-class: throwaway - a label so the reaper at the end of this post can find these pods and only these pods. Every throwaway pod we make gets it.
    • activeDeadlineSeconds: 86400 - hard 24 hour kill switch. If someone forgets a console open over the weekend, the cluster ends it.
    • envFrom the api-config configmap and api-secret secret - the console gets the same env as the app, so it talks to the real database, Redis, and so on.
    • resources.requests cpu 1, memory 2Gi - enough to schedule and run a console, not so much that it can’t be placed.
  • -- ./bin/rails c - the command to run in the pod.

Change that last line and you get a different tool from the same skeleton: /bin/bash for a shell in a prod-like container, rails dbconsole -e production -p for a psql against the production database.
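
Concretely, only the tail of the script changes:

-- /bin/bash                                # a shell in a prod-like container
-- ./bin/rails dbconsole -e production -p   # psql against the production database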

Why a Docker image and not just kubectl

The script above assumes a few things on your machine: gcloud installed, the right SDK version, kubectl, cluster credentials configured, and a service account key to authenticate with. A new developer or a new machine would spend an afternoon on that, and everyone’s setup would drift over time.

So we don’t ask for any of it. We bake it all into an image. Here is the Dockerfile:

FROM gcr.io/google.com/cloudsdktool/google-cloud-cli:558.0.0

# install Postgres for psql
RUN apt-get update && \
    apt-get install -y curl ca-certificates gnupg && \
    curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /usr/share/keyrings/postgresql-archive-keyring.gpg && \
    echo "deb [signed-by=/usr/share/keyrings/postgresql-archive-keyring.gpg] http://apt.postgresql.org/pub/repos/apt $(. /etc/os-release && echo "$VERSION_CODENAME")-pgdg main" > /etc/apt/sources.list.d/pgdg.list && \
    apt-get update && \
    apt-get install -y postgresql-client-18

ARG CLUSTER="app-cluster"
ARG SERVICE_ACCOUNT="kubernetes-account@my-project.iam.gserviceaccount.com"

ENV PATH="/app/bin:/root/bin:${PATH}"
ENV SERVICE_ACCOUNT=${SERVICE_ACCOUNT}

COPY gcloud_config /root/.config/gcloud/configurations/config_${CLUSTER}
COPY service-account.json /root/service-account.json

# activate default configuration
RUN gcloud config configurations activate ${CLUSTER} && \
    gcloud container clusters get-credentials ${CLUSTER} && \
    gcloud auth activate-service-account \
        ${SERVICE_ACCOUNT} \
        --key-file=/root/service-account.json

COPY bin/ /app/bin/

WORKDIR /root

What is going on:

  • The base image is Google’s google-cloud-cli, so gcloud and kubectl are already there, pinned to a known version.
  • We add postgresql-client because it is handy once you are inside the cluster.
  • gcloud_config is a ready-made gcloud configuration file kept next to the Dockerfile. It sets the project, zone, cluster, the credential file to use, and turns off update checks, prompts and usage reporting (see the sketch after this list). We COPY it into place so the image starts already configured.
  • service-account.json is the key for a service account that can get cluster credentials and run pods. We COPY it in too.
  • The RUN gcloud config configurations activate ... && gcloud container clusters get-credentials ... && gcloud auth activate-service-account ... line does the auth at build time. When the image is built it is already logged in and pointed at the cluster.
  • COPY bin/ /app/bin/ puts the scripts on the PATH (set by the ENV PATH /app/bin:... line above). bin/api-console is the script from the top of this post.
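
For reference, gcloud configurations are plain INI files. A minimal sketch of gcloud_config, using this post’s names and a placeholder zone, could look like:

[core]
project = my-project
disable_usage_reporting = true
disable_prompts = true

[compute]
zone = us-central1-a

[container]
cluster = app-cluster

[auth]
credential_file_override = /root/service-account.json

[component_manager]
disable_update_check = true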

A note on the key, because it is the obvious sharp edge. This image contains a real service account key. Anyone with access to this image has full production access, so the image needs to be in a private registry.

When a developer with access to this registry leaves the company, rotate the service account key.
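
Building and publishing it is the usual pair of commands, with the image name used throughout this post:

docker build -t my-registry/kube-rat:latest .
docker push my-registry/kube-rat:latest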

Now to use it. Because the script is on the PATH, you don’t need to start a shell in the container and then type a second command - you pass api-console as the container’s command and it runs straight away:

docker run -it --rm my-registry/kube-rat:latest api-console

That starts the container, runs api-console, which runs the kubectl run, which drops you into the Rails console in the cluster. When you exit, the pod is gone and so is the container.

The only thing you need on your laptop is Docker. Put the docker run in an alias (more on that below) and you can forget any of this exists.

So people don’t step on each other’s toes

Look at the pod name again: api-${USERNAME}-console. The username is in there on purpose.

Two people debugging the same incident will both run api-console. If the pod name were fixed, the second person’s kubectl run would either fail or, worse, the script’s “delete the old pod first” line would kill the first person’s console out from under them. With the username in the name they get api-alice-console and api-bob-console, two separate pods, no collision.

It also makes kubectl get pods readable. You see whose console is whose at a glance, instead of a wall of random suffixes.
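
During a shared incident the listing might look something like this (illustrative output):

$ kubectl get pods -l pod-class=throwaway
NAME                READY   STATUS    RESTARTS   AGE
api-alice-console   1/1     Running   0          25m
api-bob-console     1/1     Running   0          3m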

One catch: $USERNAME is not always set inside a bare container. Pass it in:

docker run -it --rm -e USERNAME=$(whoami) my-registry/kube-rat:latest api-console

Without it the pod is just api-console. It still works, but then everyone shares that one name and you are back to stepping on toes. So set it, and put it in the alias.
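
Something like this in your shell profile does it; the alias name is just a suggestion:

alias api-console='docker run -it --rm -e USERNAME=$(whoami) my-registry/kube-rat:latest api-console'

The single quotes matter: $(whoami) is expanded when the alias runs, not when it is defined.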

Heavy jobs: a bigger pod with a disk

Sometimes a console is not enough. You need to download a multi-gigabyte file or run a script that eats a lot of memory. The console pod is small and has no disk to speak of.

For that, let’s create a second script called inside-job:

#!/bin/bash
set -e

POD_NAME="kube-rat-${USERNAME}-inside-job"

kubectl delete pod "$POD_NAME" --ignore-not-found
kubectl run "$POD_NAME" --image=my-registry/kube-rat:latest --stdin --tty --rm --restart=Never --image-pull-policy Always --override-type=strategic \
--overrides='
{
  "metadata": {
    "labels": { "pod-class": "throwaway" }
  },
  "spec": {
    "containers": [
      {
        "name": "'"$POD_NAME"'",
        "resources": {
          "requests": {
            "cpu": "2",
            "memory": "8Gi"
          }
        },
        "volumeMounts": [
          {
            "mountPath": "/data",
            "name": "ephemeral-volume"
          }
        ]
      }
    ],
    "volumes": [
      {
        "name": "ephemeral-volume",
        "ephemeral": {
          "volumeClaimTemplate": {
            "metadata": {
              "labels": {
                "type": "ephemeral"
              }
            },
            "spec": {
              "accessModes": ["ReadWriteOnce"],
              "storageClassName": "app-storage-class",
              "resources": {
                "requests": {
                  "storage": "30Gi"
                }
              }
            }
          }
        }
      }
    ]
  }
}
' \
-- /bin/bash

It is the same idea as api-console, with a few changes:

  • Same pod-class: throwaway label, so the reaper covers it too. This one needs the reaper more than the console does - see below.

  • Bigger requests: cpu 2, memory 8Gi. A heavy job needs the room.

  • A volume. We use a generic ephemeral volume mounted at /data. The volumeClaimTemplate asks for a 30Gi disk from our storage class. The disk is created when the pod starts and deleted when the pod goes away. It is scratch space, not something you keep.

  • The storage class is plain:

    # https://cloud.google.com/kubernetes-engine/docs/how-to/generic-ephemeral-volumes
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: app-storage-class
    provisioner: pd.csi.storage.gke.io
    volumeBindingMode: WaitForFirstConsumer
    allowVolumeExpansion: true
    parameters:
      type: pd-balanced

    WaitForFirstConsumer so the disk is created in the same zone as the pod, and allowVolumeExpansion so you can grow it later if 30Gi turns out to be too small.

  • No activeDeadlineSeconds, in case the script takes a long time.

  • The command is /bin/bash; it just drops you into a shell so you can do whatever you want in there.
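
A typical session inside it, just to make the scratch disk concrete (the URL and file name are placeholders):

df -h /data                                      # confirm the 30Gi ephemeral disk is mounted
cd /data
curl -LO https://example.com/big-export.tar.gz   # room for multi-gigabyte downloads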

Extra credit: reap old pods

The pods are supposed to delete themselves. --rm removes them when kubectl run returns, and activeDeadlineSeconds: 86400 ends the console ones after a day no matter what. But sometimes that does not happen: kubectl loses its connection, your laptop sleeps, the command errors out in a way that skips the cleanup, and the pod just sits there. The inside-job pods have no deadline at all, so a forgotten one stays forever. So you want a backup plan that does not depend on the client behaving.

We can add a small CronJob to fix that. It runs every hour, finds our throwaway pods older than 24 hours, and deletes them. Note the -l pod-class=throwaway on the kubectl get pods - that is the label we put on the console and inside-job pods. Without it the reaper would happily delete any old pod in the namespace, including ones that are supposed to be there. A delete loop with no filter is the kind of thing that looks fine until the day it isn’t.

First the permissions, a service account that can list and delete pods:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reaper
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reaper
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reaper
subjects:
  - kind: ServiceAccount
    name: pod-reaper
    namespace: default # ServiceAccount subjects require a namespace; use the one the pods run in
roleRef:
  kind: Role
  name: pod-reaper
  apiGroup: rbac.authorization.k8s.io

Then the job itself:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-reaper
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper
          restartPolicy: Never
          containers:
            - name: pod-reaper
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  cutoff=$(date -u -d '24 hours ago' +%s)
                  kubectl get pods -l pod-class=throwaway -o json | \
                    jq -r --argjson cutoff "$cutoff" \
                      '.items[] | select((.metadata.creationTimestamp | fromdateiso8601) < $cutoff) | .metadata.name' | \
                    xargs -r kubectl delete pod
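
To see what the next run would delete without actually deleting anything, run the same pipeline minus the xargs from anywhere with kubectl and jq (note that date -d is GNU date; on macOS use gdate from coreutils):

cutoff=$(date -u -d '24 hours ago' +%s)
kubectl get pods -l pod-class=throwaway -o json | \
  jq -r --argjson cutoff "$cutoff" \
    '.items[] | select((.metadata.creationTimestamp | fromdateiso8601) < $cutoff) | .metadata.name'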