Phase-0 hazard: floating staging/production OCI tags are reapable while HRs are suspended, can freeze go-live deploy #3
Problem
The cms-plugins deploy pins the pod image by digest, but Flux resolves that digest through a floating `<branch>` OCI tag (`staging`/`production`). During Phase-0 the HelmReleases are `suspend: true` with zero running pods, so no live workload references those floating tags. Because `gitea-oci-cleanup` is fleet-aware and only auto-pins tags referenced by LIVE k8s workloads, the floating `staging`/`production` tags are currently reapable. If the cleaner reaps a floating tag before a deliberate `flux resume`, the `ImagePolicy` will have no tag to resolve at go-live → `ImageUpdateAutomation` cannot write a digest → the HelmRelease cannot roll a real image. Result: a frozen or failed first deploy that looks mysterious because the immutable `0.1.<N>` audit tags still exist in the registry.
Evidence
- `.woodpecker/container.yaml:74-82`: `TAGS="-t $IMAGE:$VERSION -t $IMAGE:$BRANCH -t $IMAGE:$BRANCH-latest"`, then `docker buildx build ... $TAGS --push`.
- `.woodpecker/container.yaml:61-73`: comment confirms `$BRANCH` is "the floating channel pointer ... what Flux's ImagePolicy tracks".
- `deploy/fleet-overlay/cms-plugins-staging/image-automation.yaml:29-37`: `filterTags.pattern: '^staging$'`, `digestReflectionPolicy: Always`.
- `deploy/fleet-overlay/cms-plugins-production/image-automation.yaml:29-37`: same with `^production$`.
- `deploy/fleet-overlay/cms-plugins-staging/helmrelease.yaml:33-34`: `tag: staging` / `digest: "" # {"$imagepolicy": "kotkan:cms-plugins-staging:digest"}`.
- `suspend: true`, 0 pods in `kotkan` → floating tags unreferenced.
- `ARCHITECTURE.md:78-101`, `DEPLOYMENT.md:22`, `DEPLOYMENT.md:116`.
Options
1. Allowlist in `gitea-oci-cleanup` (another repo: armer/cluster config): explicitly pin/keep `git.oleks.space/oleks/cms-plugins:{staging,production}` so they survive even while no live workload references them. Most robust; survives an indefinite Phase-0.
2. Pin an immutable `0.1.<N>` tag/digest for the very first resume, then switch to floating-tag tracking once a live reference exists and the cleaner auto-pins it. Repo-file-only, but it changes deploy-digest resolution semantics → owner design decision.
3. Republish the floating tags immediately before `flux resume`, so they are fresh right before they become live-referenced. Operational mitigation only; a narrow race window remains.
Recommendation
Option 1 (allowlist `cms-plugins:{staging,production}` in `gitea-oci-cleanup`) as the durable fix, plus a go-live runbook note in `DEPLOYMENT.md` that the floating tags must exist before `flux resume`. This keeps the floating-tag→digest design intact and removes the Phase-0 reapability window without touching deploy-digest semantics. Cross-repo and live-cluster verification are required, so it cannot be done as a cms-plugins repo edit.
Surfaced by the deploy-hardening review pass; deferred from auto-fix because it spans the `gitea-oci-cleanup` config and needs operational verification.
Durable fix applied (Option 1), pending deploy
Pinned the floating tags explicitly in the cleaner's allowlist so they're never reaped, plus a go-live runbook note.
- `servers/armer/scripts/registry-pins.txt`: added `container/cms-plugins==staging` and `container/cms-plugins==production` (commit `a8b8fd8`, pushed to `oleks/armer@main`). The cleaner always keeps pinned `type/name==version` tuples (`gitea-oci-cleanup.py` `load_pins`), independent of live-workload references, so the Phase-0 reapability window is closed.
- `DEPLOYMENT.md`: the first-deploy checklist now carries a go-live prerequisite: confirm both floating tags exist before `flux resume`, with the pins mitigation documented (commit `0b1f2eb` on `develop`).
⏳ Remaining to close: the pins change only takes effect after a `deploy-rs` deploy to armer (`cd ~/projects/servers/armer && nix run .#deploy`). The commit edits the declarative config, but the running `gitea-oci-cleanup` still uses the old `/etc/registry-pins.txt` until redeployed. Not deploying from here (deploy timing is yours). Once deployed, this can be closed.
Note: once the releases are eventually resumed, the fleet-aware auto-pin also covers these tags; the explicit pins are harmless to leave and protect the still-suspended window.
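The keep rule described in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `gitea-oci-cleanup.py` code: `Pin`, `parse_pins`, and `is_reapable` are hypothetical names, the skipping of blank and `#` comment lines is assumed, and only the `type/name==version` pin format is taken from this issue.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Pin(NamedTuple):
    """One `type/name==version` tuple, e.g. container/cms-plugins==staging."""
    pkg_type: str
    name: str
    version: str


def parse_pins(lines: Iterable[str]) -> set[Pin]:
    """Parse pin lines; blank lines and `#` comments are skipped (assumed behaviour)."""
    pins: set[Pin] = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        package, sep, version = line.partition("==")
        pkg_type, slash, name = package.partition("/")
        if not (sep and slash and pkg_type and name and version):
            raise ValueError(f"malformed pin: {raw!r}")
        pins.add(Pin(pkg_type, name, version))
    return pins


def is_reapable(candidate: Pin, pins: set[Pin], live_refs: set[Pin]) -> bool:
    """A tag may be reaped only if it is neither pinned nor referenced by a live workload."""
    return candidate not in pins and candidate not in live_refs


staging = Pin("container", "cms-plugins", "staging")

# Phase-0 before the fix: HelmReleases suspended, so no live references and no pins.
before_fix = is_reapable(staging, pins=set(), live_refs=set())

# After the fix: the explicit pin protects the tag even with zero live references.
pins = parse_pins([
    "container/cms-plugins==staging",
    "container/cms-plugins==production",
])
after_fix = is_reapable(staging, pins=pins, live_refs=set())
```

Under these assumptions `before_fix` is `True` and `after_fix` is `False`, which is the window the explicit pins are meant to close.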
Deployed & verified — closing
`nix run .#deploy` to armer completed (activation succeeded, magic-rollback confirmed). The running `gitea-oci-cleanup` now loads the pins from `/nix/store/23glbn9snjkv2zjsviwish2yd2r0skvv-registry-pins/etc/registry-pins.txt`, which contains the `container/cms-plugins==staging` and `container/cms-plugins==production` pins. Both floating tags are therefore in the cleaner's always-keep allowlist regardless of live-workload references, and the Phase-0 reapability window is closed. The go-live note is on `develop` (`DEPLOYMENT.md`). Resolved.
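For the go-live checklist, a small precheck along these lines could confirm that a pins file still carries both entries before `flux resume`. It is a sketch only: `missing_pins` and `check_pins_file` are hypothetical helpers, the one-pin-per-line file layout is assumed, and the path to check is left to the caller.

```python
from pathlib import Path

# The two pins named in this issue.
REQUIRED_PINS = (
    "container/cms-plugins==staging",
    "container/cms-plugins==production",
)


def missing_pins(pins_text, required=REQUIRED_PINS):
    """Return the required pins that do not appear as a line in pins_text."""
    present = {line.strip() for line in pins_text.splitlines()}
    return [pin for pin in required if pin not in present]


def check_pins_file(path):
    """Read a pins file (assumed one pin per line) and report missing required pins."""
    return missing_pins(Path(path).read_text(encoding="utf-8"))


# Example with in-memory content: only the staging pin is present.
example = missing_pins("container/cms-plugins==staging\n")
```

An empty list from `check_pins_file` means both floating tags are pinned. It does not prove the tags themselves still exist in the registry, so the runbook check in `DEPLOYMENT.md` is still needed.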