Why Deployment Shows “Failed” When Everything Is Actually Working

TL;DR

GitHub Actions shows: ❌ Deployment Failed
Reality: ✅ All services deployed successfully and running healthy

The workflow “fails” due to a timing issue in the health check, but all containers are up and healthy.


What Actually Happened

Deployment Timeline

22:27:53 - Start deployment
22:27:54 - Containers recreated
22:28:00 - Blockchain marked healthy (6 seconds)
22:28:00 - API containers start
22:28:30 - Fixed 30-second wait for health checks elapses
22:28:30 - Run curl health check
22:28:30 - ❌ curl returns 503
22:28:30 - Workflow exits with error

But Check The Actual Container Status

$ docker compose ps

NAME              STATUS
daon-api-1        Up 17 minutes (healthy)  ✅
daon-api-2        Up 17 minutes (healthy)  ✅
daon-api-3        Up 17 minutes (healthy)  ✅
daon-blockchain   Up 17 minutes (healthy)  ✅
daon-postgres     Up 5 hours (healthy)     ✅
daon-redis        Up 5 hours (healthy)     ✅

All services are healthy! The deployment actually succeeded.
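
If you would rather script this check than eyeball the table, Docker can filter by health state directly. A minimal sketch (the daon- name prefix is taken from the listing above):

# List only the daon-* containers Docker currently considers healthy
docker ps --filter "name=daon-" --filter "health=healthy" --format '{{.Names}}: {{.Status}}'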


Why The Health Check Failed

The Problem: Timing

API containers have this dependency chain:

1. Wait for postgres to be healthy (✅ fast)
2. Wait for redis to be healthy (✅ fast)
3. Wait for blockchain to be healthy (✅ ~6 seconds)
4. Start API container
5. Run healthcheck internally
6. Become "healthy"
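
In docker-compose.yml this chain is expressed through depends_on conditions. A minimal sketch of the relevant fragment (service names are assumed from the container names above; everything else is omitted):

services:
  api:
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      blockchain:
        condition: service_healthy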

Deployment workflow timeline:

1. Start all containers
2. Wait 30 seconds  <-- Fixed wait time
3. curl http://api.daon.network/health
4. Expect 200 OK

The Race Condition

The workflow waited 30 seconds, but:

  • Blockchain took ~6s to become healthy
  • API containers started at 6s mark
  • API healthcheck has start_period: 40s
  • Total time needed: 6s + 40s = 46 seconds
  • Workflow only waited: 30 seconds

Result: Workflow checked too early, got 503, exited with “failure”

But the containers kept running and became healthy shortly after!


How We Know Everything Is Fine

1. Container Status Shows Healthy

From your server output:

Up 17 minutes (healthy)  ← Docker's internal healthcheck passed

Docker wouldn’t mark them (healthy) if they weren’t.
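
You can also ask Docker for a container's health state directly (container name taken from the listing above):

# Prints one of: starting, healthy, unhealthy
docker inspect --format '{{.State.Health.Status}}' daon-api-1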

2. All Healthchecks Are Configured

docker-compose.yml has healthchecks for each service:

Blockchain:

healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:26657/health || ..."]
  start_period: 180s  # 3 minute grace period

API:

healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health"]
  start_period: 40s  # 40 second grace period
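
start_period is the grace window during which failed probes don't count toward marking the container unhealthy. A complete healthcheck block usually also sets interval, timeout, and retries; the values below are illustrative, not necessarily what the repo uses:

healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health"]
  interval: 10s      # probe every 10 seconds (illustrative)
  timeout: 5s        # each probe must answer within 5 seconds (illustrative)
  retries: 3         # mark unhealthy after 3 consecutive failures (illustrative)
  start_period: 40s  # failures during the first 40 seconds are ignored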

These healthchecks do pass, but only after the GitHub workflow has already given up.

3. Deployment Actually Completed

Looking at the logs:

✅ Container daon-blockchain  Recreated
✅ Container daon-api-1       Recreated
✅ Container daon-api-2       Recreated
✅ Container daon-api-3       Recreated
✅ All containers started
✅ Blockchain healthy
✅ API containers started

Only the final verification curl failed due to timing.


The Fix (Already Implemented)

We updated the deployment workflow in commit 14810ce:

Old Approach (Problematic)

# Deploy containers
docker compose up -d

# Wait fixed 30 seconds
sleep 30

# Single health check attempt
curl -f https://api.daon.network/health || exit 1

Problem: Not enough time, no retries

New Approach (Fixed)

# Deploy containers
docker compose up -d

# Wait longer initial period
sleep 60

# Retry health check 10 times
for i in {1..10}; do
  if curl -f https://api.daon.network/health; then
    echo "✅ API is healthy"
    break
  fi
  echo "Attempt $i/10 failed, waiting 5s..."
  sleep 5
done

# Final verification (this is what fails the job if the API never came up)
curl -f https://api.daon.network/health || exit 1

Benefits:

  • 60s initial wait (was 30s)
  • Up to 10 retry attempts
  • 5s between retries
  • Total max wait: 60s + (10 × 5s) = 110 seconds
  • Handles slow startup gracefully
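
As a side note, curl can perform this retry loop on its own, because it treats a 503 as a transient error when --retry is set. A roughly equivalent one-liner (the flags require a reasonably recent curl):

# Retry up to 10 times, 5 seconds apart, also retrying refused connections while the API boots
curl -f --retry 10 --retry-delay 5 --retry-connrefused https://api.daon.network/health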

Why This Matters

Without Understanding This

Developer sees: ❌ Deployment Failed
Developer thinks: “Production is broken, need to rollback!”
Developer action: Panic, unnecessary work

With Understanding This

Developer sees: ❌ Deployment Failed (timing)
Developer checks: docker compose ps → All healthy
Developer confirms: API responding
Developer action: Ignore false failure, continue


How To Verify Production After “Failed” Deployment

Quick Check

ssh USER@SERVER 'cd /opt/daon-source && docker compose ps'

Look for: Up XX minutes (healthy) on all containers

Detailed Check

# Check containers
ssh USER@SERVER 'cd /opt/daon-source && docker compose ps'

# Check API health
curl https://api.daon.network/health
# or internal
ssh USER@SERVER 'curl http://localhost:3001/health'

# Check blockchain
ssh USER@SERVER 'docker exec daon-blockchain daond status'

Use The Script

./check-production.sh
# (requires DAON_SERVER env var)
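
Example invocation, assuming DAON_SERVER takes an SSH target of the form user@host (placeholder value, not a real server):

DAON_SERVER=deploy@your-server ./check-production.sh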

When Is It Actually Broken?

Real Failure Indicators

1. Container Not Running

daon-api-1    Exited (1) 2 minutes ago

2. Container Restarting

daon-blockchain    Restarting (1) 10 seconds ago

3. Container Unhealthy

daon-api-1    Up 5 minutes (unhealthy)

4. No Response After Waiting

# Wait 5 minutes after deployment
sleep 300

# Still no response
curl http://localhost:3001/health
# curl: (7) Failed to connect
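
If you do hit one of these states, the container logs are the first place to look, for example:

# Last 50 log lines from a failing container (name from the listing above)
docker logs --tail 50 daon-api-1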

Your Deployment: None Of These

Your containers show:

✅ Up 17 minutes (healthy)

Not exited, not restarting, not unhealthy, not failing to connect.

Verdict: Working perfectly!


Next Deployment Will Show Success

The next time you deploy (after we pushed the fix), the workflow will:

  1. Wait 60 seconds (not 30)
  2. Retry up to 10 times
  3. Succeed when API becomes healthy (after ~40-50s)
  4. Show ✅ Deployment Successful

Summary Table

| Aspect        | Workflow Says | Reality           | Why Different                             |
|---------------|---------------|-------------------|-------------------------------------------|
| Deployment    | ❌ Failed     | ✅ Succeeded      | Workflow gave up too early                |
| Containers    | N/A           | ✅ All running    | They kept starting after workflow exited  |
| Health Status | ❌ 503 at 30s | ✅ Healthy at ~46s | Healthcheck needs 40s grace period       |
| Production    | ❌ Broken     | ✅ Working        | Timing issue, not actual failure          |

Key Takeaway

The workflow failed, but the deployment succeeded.

This is a false negative - the monitoring check failed, but the actual system is fine.

It’s like a pregnancy test that says “not pregnant” because you took it too early, but you’re actually pregnant. The test was wrong due to timing, not because there’s no baby.


Analogy

Baking a Cake:

  1. Put cake in oven (containers start)
  2. Set timer for 30 minutes (workflow waits)
  3. Check at 30 minutes: still raw! ❌ FAILURE
  4. Walk away disappointed
  5. 15 minutes later: cake is perfect ✅
  6. But you already left…

The cake didn’t fail; your timer was just too short.

Your deployment is the cake - it’s perfect; you just checked too early!


Action Items

Nothing Needed Right Now

Your production is working! The workflow status is cosmetic.

For Future Deployments

  1. ✅ Already fixed in commit 14810ce
  2. Next deployment will use new timing
  3. Should show success

If You Want To Confirm

# Quick verification
curl https://api.daon.network/health

# Should return:
{"status":"healthy","timestamp":"...","version":"0.1.0"}

Conclusion

Workflow Status: ❌ Failed (false negative)
Actual Production: ✅ Running & Healthy (true positive)
Action Required: None - it’s working fine!

The deployment succeeded, the workflow just didn’t wait long enough to see it.

Next deployment will have the improved timing and show success. ✅