Why Deployment Shows “Failed” When Everything Is Actually Working
TL;DR
GitHub Actions shows: ❌ Deployment Failed
Reality: ✅ All services deployed successfully and running healthy
The workflow “fails” due to a timing issue in the health check, but all containers are up and healthy.
What Actually Happened
Deployment Timeline
22:27:53 - Start deployment
22:27:54 - Containers recreated
22:28:00 - Blockchain marked healthy (6 seconds)
22:28:00 - API containers start
22:28:30 - Fixed 30-second wait for health checks ends
22:28:30 - Run curl health check
22:28:30 - ❌ curl returns 503
22:28:30 - Workflow exits with error
But Check The Actual Container Status
$ docker compose ps
NAME              STATUS
daon-api-1        Up 17 minutes (healthy) ✅
daon-api-2        Up 17 minutes (healthy) ✅
daon-api-3        Up 17 minutes (healthy) ✅
daon-blockchain   Up 17 minutes (healthy) ✅
daon-postgres     Up 5 hours (healthy) ✅
daon-redis        Up 5 hours (healthy) ✅
All services are healthy! The deployment actually succeeded.
Why The Health Check Failed
The Problem: Timing
API containers have this dependency chain:
1. Wait for postgres to be healthy (✅ fast)
2. Wait for redis to be healthy (✅ fast)
3. Wait for blockchain to be healthy (✅ ~6 seconds)
4. Start API container
5. Run healthcheck internally
6. Become "healthy"
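In docker-compose.yml terms, that chain is usually expressed with depends_on conditions plus the service's own healthcheck. Here is a minimal sketch under that assumption - the service names (postgres, redis, blockchain, api) are illustrative, not copied from the real compose file:
# Sketch only: service names are assumed, not taken from the real docker-compose.yml
services:
  api:
    depends_on:
      postgres:
        condition: service_healthy   # step 1: wait for postgres
      redis:
        condition: service_healthy   # step 2: wait for redis
      blockchain:
        condition: service_healthy   # step 3: wait for blockchain (~6s)
    healthcheck:                      # steps 5-6: internal check with 40s grace period
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health"]
      start_period: 40s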
Deployment workflow timeline:
1. Start all containers
2. Wait 30 seconds <-- Fixed wait time
3. curl http://api.daon.network/health
4. Expect 200 OK
The Race Condition
The workflow waited 30 seconds, but:
- Blockchain took ~6s to become healthy
- API containers started at the ~6s mark
- API healthcheck has start_period: 40s
- Total time needed: 6s + 40s = 46 seconds
- Workflow only waited: 30 seconds
Result: Workflow checked too early, got 503, exited with “failure”
But the containers kept running and became healthy shortly after!
How We Know Everything Is Fine
1. Container Status Shows Healthy
From your server output:
Up 17 minutes (healthy) ← Docker's internal healthcheck passed
Docker wouldn’t mark them (healthy) if they weren’t.
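If you want more detail than the one-word status, Docker keeps a record of recent healthcheck probes per container. For example (container names taken from the ps output above; the jq pipe is optional and assumes jq is installed):
# Just the health state: healthy / unhealthy / starting
docker inspect --format '{{.State.Health.Status}}' daon-api-1

# Full health record, including the last few probe outputs and exit codes
docker inspect --format '{{json .State.Health}}' daon-api-1 | jq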
2. All Healthchecks Are Configured
docker-compose.yml has healthchecks for each service:
Blockchain:
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:26657/health || ..."]
  start_period: 180s  # 3 minute grace period
API:
healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health"]
  start_period: 40s   # 40 second grace period
These pass AFTER the GitHub workflow gave up.
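If you want to see exactly when each container flipped to healthy relative to the moment the workflow gave up, Docker emits health_status events that you can replay on the server; a quick sketch:
# Replay health-status transitions from the last 30 minutes, then keep streaming.
# Adjust --since so it covers the deployment window.
docker events --since 30m --filter 'event=health_status'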
3. Deployment Actually Completed
Looking at the logs:
✅ Container daon-blockchain Recreated
✅ Container daon-api-1 Recreated
✅ Container daon-api-2 Recreated
✅ Container daon-api-3 Recreated
✅ All containers started
✅ Blockchain healthy
✅ API containers started
Only the final verification curl failed due to timing.
The Fix (Already Implemented)
We updated the deployment workflow in commit 14810ce:
Old Approach (Problematic)
# Deploy containers
docker compose up -d
# Wait fixed 30 seconds
sleep 30
# Single health check attempt
curl -f https://api.daon.network/health || exit 1
Problem: Not enough time, no retries
New Approach (Fixed)
# Deploy containers
docker compose up -d
# Wait longer initial period
sleep 60
# Retry health check 10 times
for i in {1..10}; do
  if curl -f https://api.daon.network/health; then
    echo "✅ API is healthy"
    break
  fi
  echo "Attempt $i/10 failed, waiting 5s..."
  sleep 5
done
# Final verification
curl -f https://api.daon.network/health || exit 1
Benefits:
- 60s initial wait (was 30s)
- Up to 10 retry attempts
- 5s between retries
- Total max wait: 60s + (10 × 5s) = 110 seconds
- Handles slow startup gracefully
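A possible further refinement - not what commit 14810ce does - would be to poll until the API answers or a deadline passes, instead of relying on fixed sleeps at all. A rough bash sketch (URL and timeout values are placeholders):
# Poll until healthy or give up after a deadline - sketch only, not the committed fix.
deadline=$((SECONDS + 180))   # allow up to 3 minutes
until curl -fsS https://api.daon.network/health > /dev/null; do
  if (( SECONDS >= deadline )); then
    echo "API did not become healthy within 180s" >&2
    exit 1
  fi
  echo "Not healthy yet, retrying in 5s..."
  sleep 5
done
echo "✅ API healthy after ${SECONDS}s"
This succeeds as soon as the API is actually up instead of always paying the full fixed wait.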
Why This Matters
Without Understanding This
Developer sees: ❌ Deployment Failed
Developer thinks: “Production is broken, need to rollback!”
Developer action: Panic, unnecessary work
With Understanding This
Developer sees: ❌ Deployment Failed (timing)
Developer checks: docker compose ps → All healthy
Developer confirms: API responding
Developer action: Ignore false failure, continue
How To Verify Production After “Failed” Deployment
Quick Check
ssh USER@SERVER 'cd /opt/daon-source && docker compose ps'
Look for: Up XX minutes (healthy) on all containers
Detailed Check
# Check containers
ssh USER@SERVER 'cd /opt/daon-source && docker compose ps'
# Check API health
curl https://api.daon.network/health
# or internal
ssh USER@SERVER 'curl http://localhost:3001/health'
# Check blockchain
ssh USER@SERVER 'docker exec daon-blockchain daond status'
Use The Script
./check-production.sh
# (requires DAON_SERVER env var)
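If check-production.sh isn't handy, the checks above are easy to script yourself. A minimal sketch (the real script may differ; it assumes DAON_SERVER is set to user@host and simply reuses the commands shown above):
#!/usr/bin/env bash
# Minimal production check - sketch only, the real check-production.sh may differ.
set -euo pipefail
: "${DAON_SERVER:?Set DAON_SERVER to user@host first}"

echo "== Container status =="
ssh "$DAON_SERVER" 'cd /opt/daon-source && docker compose ps'

echo "== API health (public) =="
curl -fsS https://api.daon.network/health && echo

echo "== API health (internal) =="
ssh "$DAON_SERVER" 'curl -fsS http://localhost:3001/health' && echo

echo "== Blockchain status =="
ssh "$DAON_SERVER" 'docker exec daon-blockchain daond status'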
When Is It Actually Broken?
Real Failure Indicators
1. Container Not Running
daon-api-1 Exited (1) 2 minutes ago
2. Container Restarting
daon-blockchain Restarting (1) 10 seconds ago
3. Container Unhealthy
daon-api-1 Up 5 minutes (unhealthy)
4. No Response After Waiting
# Wait 5 minutes after deployment
sleep 300
# Still no response
curl http://localhost:3001/health
# curl: (7) Failed to connect
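On the server, you can check for these states mechanically. Empty output from both commands below means none of the container-level failure indicators apply (note that stopped containers from unrelated projects would also show up, so read the names):
# Containers that have exited or are stuck restarting
docker ps -a --filter 'status=exited' --filter 'status=restarting' --format '{{.Names}}\t{{.Status}}'

# Containers whose healthcheck is currently failing
docker ps --filter 'health=unhealthy' --format '{{.Names}}\t{{.Status}}'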
Your Deployment: None Of These
Your containers show:
✅ Up 17 minutes (healthy)
Not exited, not restarting, not unhealthy, not failing to connect.
Verdict: Working perfectly!
Next Deployment Will Show Success
The next time you deploy (after we pushed the fix), the workflow will:
- Wait 60 seconds (not 30)
- Retry up to 10 times
- Succeed when API becomes healthy (after ~40-50s)
- Show ✅ Deployment Successful
Summary Table
| Aspect | Workflow Says | Reality | Why Different |
|---|---|---|---|
| Deployment | ❌ Failed | ✅ Succeeded | Workflow gave up too early |
| Containers | N/A | ✅ All running | They kept starting after the workflow exited |
| Health Status | ❌ 503 at 30s | ✅ Healthy at ~46s | Healthcheck needs 40s grace period |
| Production | ❌ Broken | ✅ Working | Timing issue, not actual failure |
Key Takeaway
The workflow failed, but the deployment succeeded.
This is a false negative - the monitoring check failed, but the actual system is fine.
It’s like a pregnancy test that says “not pregnant” because you took it too early, but you’re actually pregnant. The test was wrong due to timing, not because there’s no baby.
Analogy
Baking a Cake:
- Put cake in oven (containers start)
- Set timer for 30 minutes (workflow waits)
- Check at 30 minutes: still raw! ❌ FAILURE
- Walk away disappointed
- 15 minutes later: cake is perfect ✅
- But you already left…
The cake didn’t fail, your timer was just too short.
Your deployment is the cake - it’s perfect, you just checked too early!
Action Items
Nothing Needed Right Now
Your production is working! The workflow status is cosmetic.
For Future Deployments
- ✅ Already fixed in commit 14810ce
- Next deployment will use the new timing
- Should show success
If You Want To Confirm
# Quick verification
curl https://api.daon.network/health
# Should return:
{"status":"healthy","timestamp":"...","version":"0.1.0"}
Conclusion
Workflow Status: ❌ Failed (false negative)
Actual Production: ✅ Running & Healthy (true positive)
Action Required: None - it’s working fine!
The deployment succeeded, the workflow just didn’t wait long enough to see it.
Next deployment will have the improved timing and show success. ✅