Health check failing for application startup

Hey, I’ve been using Koyeb for the past 3 months to deploy my web service.

Over the past few weeks, I’ve gradually upgraded the instance size for my deployments as the load on the machine has increased.

My latest deployment has been failing consistently since yesterday. It tells me the TCP health check on port 8000 failed.

However, I’ve already made sure of the following:

  1. All required environment variables are provided.
  2. The app works perfectly fine on my local machine.
  3. I was on the medium instance size, which was more than enough to run my current application.

I then read in a previous discussion here that the issue was the instance size. However, when I inspected my metrics, I found I’m not even using 25% of my instance’s memory. I still upgraded to the bigger instance size, though, and it’s still failing the health check.

I even tried switching to an HTTP health check, but that didn’t work either.

I also tried increasing the grace period for the health check to return. That didn’t work either.

I’m not sure what to try next.

Can anyone help?

For reference, this is the message that I am seeing:

INFO: Started server process [1]
INFO: Waiting for application startup.
TCP health check failed on port 8000. Retrying…

Hello @Ahmad_Elsaeed,

Are you sure that it listens on port 8000?

When started locally, can you check on which port uvicorn is listening?

Run something like:

docker exec <container_name_or_id> ss -tnl | grep uv

Yes, it runs on port 8000. It has been for the past 3 months, and nothing changed.

I can try to run that command though.

Yes, please do it.
You can also compare the output with the currently running deployment.
Updated command is:

ss -ltnp

I forgot the p flag.

I ran that command, and yes, it is listening on port 8000.

Hey @Lukasz_Oles, I appreciate your help :). So I ran a profiler on the app’s startup process on my local machine, and peak memory usage is around 447 MB. Nothing my instance shouldn’t be able to handle, right?
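For anyone who wants a quick ballpark check like this, something along these lines works (a rough sketch only: the import path is a placeholder, and it assumes startup-event-style handlers):

import asyncio
import resource
import sys

from myapp.main import app  # placeholder: your own FastAPI app module

async def main() -> None:
    # Run the FastAPI startup handlers (assumes on_event("startup") style)
    await app.router.startup()
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is kilobytes on Linux, bytes on macOS
    mb = peak / 1024 if sys.platform.startswith("linux") else peak / (1024 ** 2)
    print(f"peak RSS after startup: {mb:.0f} MB")

asyncio.run(main())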

I initially had an instance with 2 GB of RAM, and I’ve now upgraded to 4 GB. Still the same issue.

For reference, my app is a Python FastAPI app.

Edit: I need someone from the team to help me with this as it is a time-sensitive issue. We’re trying to roll out a new update of our service and we haven’t been able to because of this…

Edit 2: I suspect something odd is going on here. I’m monitoring the metrics while deploying the new instance and watching CPU usage. On the medium instance I was getting up to 75% CPU usage. I upgraded to the large (double the size of the medium) and it was still at 62%. I then upgraded to the XL because, according to the troubleshooting docs, the percentage shouldn’t exceed 50%, but the XL is showing 99% CPU usage. What could it be?

What does it do during startup? Is it trying to connect to something?
What happens before it starts listening on port 8000?
Maybe add some logs to the startup process to see where it hangs?

It looks like something is blocking the app from listening on port 8000, and that’s why the health check is failing.
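For example, something along these lines would show how far startup gets before it stalls (a rough sketch; the logger name and messages are just examples):

import logging
import time

from fastapi import FastAPI

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("startup")

app = FastAPI()

@app.on_event("startup")
async def timed_startup() -> None:
    t0 = time.monotonic()
    log.info("startup: begin")
    # ...call each existing startup step here, logging after each one...
    log.info("startup: done in %.1fs", time.monotonic() - t0)

uvicorn binds the port only after the startup handlers finish, so the last message that appears before the hang points at the guilty step.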

On startup, it creates 5 API endpoints and 9 FastAPI schedulers.

It doesn’t try to connect to anything external.

I added logs between the start of the app’s startup and its completion. It looks like the app runs 6 of the schedulers before startup completes.

That could be where the TCP health check fails, since the schedulers don’t complete before the health check runs. However, those schedulers haven’t changed since my last successful deployment. I’m not sure what the blocker is; I’ve even added more grace period for the TCP health check, and it still hasn’t succeeded.
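Roughly, the pattern looks like this (a simplified sketch; I’m assuming fastapi-utils-style repeat_every tasks here, and the names and intervals are made up):

from fastapi import FastAPI
from fastapi_utils.tasks import repeat_every  # assumed scheduler library

app = FastAPI()

@app.on_event("startup")
@repeat_every(seconds=300)  # default wait_first=False: the first run is scheduled immediately
async def refresh_cache() -> None:  # illustrative name, one of the 9 schedulers
    ...  # heavy work here can run before startup completes

# ...the other schedulers are registered the same way...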

I’ll now deploy with the logs and see whether they complete.

cc: @Lukasz_Oles

Hey @Lukasz_Oles, I appreciate your help. The issue is fixed.

Sharing the solution in case someone wants to refer to it later. Basically, my app is a Python FastAPI app, and it was running the schedulers between the start of the app’s startup and its completion.

In each scheduler’s decorator, I just added a parameter called “wait_first” that waits one interval before the first run, giving the app enough room to complete startup before the schedulers kick in.

That fixed my issue without upgrading the size of the instance.
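Concretely, the change was one parameter per scheduler (same sketch and illustrative names as above, assuming fastapi-utils’ repeat_every):

@app.on_event("startup")
@repeat_every(seconds=300, wait_first=True)  # sleep one interval before the first run
async def refresh_cache() -> None:
    ...  # scheduled work, unchanged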


@Ahmad_Elsaeed Thank you for sharing the solution!