[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy #12424

hubertdeng123 opened this issue Dec 30, 2024

Description

Occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait still fails with an unhealthy error message. This happens when docker compose up -d --wait is run in parallel for several projects, and the services use the restart policy restart: unless-stopped. Note that this happens only occasionally, not every time.

I would expect that even if the container is unhealthy and crashes on start, --wait would account for this, since the container eventually becomes healthy after restarting itself within the wait timeout.
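
For reference, the unhealthy -> healthy transition can be watched directly by polling the same health status that --wait evaluates. This is just a rough sketch; relay-relay-1 is the container name Compose generates in my setup (see the logs below):

# Poll the container's health status until it settles.
while true; do
  status=$(docker inspect --format '{{.State.Health.Status}}' relay-relay-1 2>/dev/null)
  echo "$(date +%T) health=${status:-unknown}"
  [ "$status" = "healthy" ] && break
  sleep 2
done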

Steps To Reproduce

I have 3 config files like so:

docker-compose-redis.yml:

services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:

docker-compose-kafka.yml:

services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: [email protected]:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:

docker-compose-relay.yml:

services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:

When I run

# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!

# Wait for all up commands to complete
wait $redis_pid $kafka_pid $relay_pid

Relay sometimes fails to come up with the --wait flag, even though the container's Docker health status ends up healthy.

Logs:

Container relay-relay-1  Creating
 Container relay-relay-1  Created
 Container relay-relay-1  Starting
 Container relay-relay-1  Started
 Container relay-relay-1  Waiting
container relay-relay-1 is unhealthy
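
For reference, the eventual state can be checked with docker inspect right after the failure (using the container name from the logs above); in my case the container reports healthy even though --wait gave up:

# Check the container's final health status and restart history directly.
docker inspect --format '{{.State.Health.Status}}' relay-relay-1    # shows healthy once the restart settles
docker inspect --format '{{.RestartCount}}' relay-relay-1           # non-zero if the container crashed and was restarted
docker inspect --format '{{json .State.Health}}' relay-relay-1      # full healthcheck log, including the early failing probes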

Compose Version

2.29.7

Docker Environment

Client:
 Version:    27.2.0
 Context:    colima

Anything else?

Let me know if there is anything else I can add to help with reproducing the issue. The contents of the relay configs can be found here:
https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
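
As a stopgap, a small retry wrapper around the up call papers over the race; this is just a sketch of what I mean, not something I'd want to rely on long term:

# Retry --wait a few times, since the container usually reports healthy
# shortly after restarting itself.
for attempt in 1 2 3; do
  if docker compose -p relay -f docker-compose-relay.yml up relay -d --wait; then
    break
  fi
  echo "attempt $attempt: --wait reported unhealthy, retrying..."
  sleep 5
done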
