Failed to fetch image on recovery attempt

Hi there,

We've recently had some issues with the recovery of one of our apps (cf. this thread). It seemed resolved, but the issue is still hitting us.

When our backend gets into an internal error state, it becomes unresponsive and fails the healthcheck. This causes the platform to kill the container and restart it.

However, when attempting recovery, the image no longer exists and the service just fails.

TCP health check failed on port 3000.
Instance is unhealthy. Attempting recovery...
Image download failure. Image does not exist. Please check your Service configuration.
Instance stopped.
Instance created. Preparing to start...
Image download failure. Image does not exist. Please check your Service configuration.
Instance stopped.

This brings our app down permanently, and the only workaround is to manually rebuild + redeploy the image.
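
For anyone trying to reproduce the first log line locally: a TCP health check of this kind can be approximated by attempting to open a connection to the port. The sketch below is an assumption about how such a probe behaves, not the platform's actual implementation; the function name `tcp_health_check` and the 2-second timeout are hypothetical.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 3000 matches the port named in the logs above.
    print("healthy" if tcp_health_check("127.0.0.1", 3000) else "unhealthy")
```

Note that a probe like this only shows the port is accepting connections. It says nothing about whether the image is still available for a restart, which is the part that fails here.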

Timeline:

  • Feb 3rd 5pm (UTC) - app initial deployment
  • Feb 4th 9:30am (UTC) - app fails, recovery attempt fails

What are the image cleanup/retention policies of the registry?

For context, we also built and deployed a few images for our preview environments between the initial deployment and the recovery attempt. Could that interfere with the retention policy for our prod image? If so, how can we tag the prod image to extend its retention, at least until the next prod image is built and deployed successfully?

Thanks in advance!

Hi @Sam_Groot

It looks like a bug on our side. I will check it and let you know.

Hi @Lukasz_Oles ,

Thanks for looking into this! If there is anything I can provide to help with troubleshooting, please let me know.

The deployment I mentioned earlier had the ID d89f51ee-1cdf-4450-b73d-be892d30c256.

No need, I’ve found the issue. We’ve fixed it, and tomorrow it should be released into production.

Thank you for reporting it!
