Hi there,
We’ve had issues recently with the recovery of one of our apps (cf this thread). It seemed resolved, but the problem is still hitting us.
When our backend gets into an internal error state, it becomes unresponsive and fails the healthcheck. This causes the platform to kill the container and restart it.
However, during the recovery attempt the image no longer exists in the registry, so the restart fails. The relevant logs:
TCP health check failed on port 3000.
Instance is unhealthy. Attempting recovery...
Image download failure. Image does not exist. Please check your Service configuration.
Instance stopped.
Instance created. Preparing to start...
Image download failure. Image does not exist. Please check your Service configuration.
Instance stopped.
This brings our app down permanently; the only workaround we’ve found is to manually rebuild and redeploy the image.
Timeline:
- Feb 3rd 5pm (UTC) - initial app deployment
- Feb 4th 9:30am (UTC) - app fails, recovery attempt fails
What are the registry’s image cleanup/retention policies?
For context, we also built and deployed a few images for our preview environments between the initial deployment and the recovery attempt. Could those interfere with the retention policy for our prod image? If so, how can we tag the prod image specifically to extend its retention, at least until the next prod image is built and deployed successfully?
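In the meantime, as a stopgap on our side, we’re considering keeping an independent copy of the prod image in a registry we control so recovery wouldn’t depend on the platform’s retention policy. A rough sketch of what we have in mind (all registry hostnames and tags below are placeholders, not our real configuration):

```shell
#!/usr/bin/env sh
set -eu

# Placeholder image references -- substitute the real registry paths.
PROD_IMAGE="registry.example.com/ourapp/backend:prod-stable"
BACKUP_IMAGE="backup-registry.example.com/ourapp/backend:prod-stable"

# Pull the currently deployed prod image while it still exists,
# re-tag it for our own registry, and push the copy there.
docker pull "$PROD_IMAGE"
docker tag "$PROD_IMAGE" "$BACKUP_IMAGE"
docker push "$BACKUP_IMAGE"
```

Would pointing the service at such a self-hosted copy be a supported setup, or is there a platform-native way to pin/protect the prod image instead?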
Thanks in advance!