
Eval bug: Segmentation fault with docker aarch64 on MacOS M1 using a small test model stories15M_MOE-Q8_0.gguf #11082

Open
marcindulak opened this issue Jan 5, 2025 · 1 comment

@marcindulak

Name and Version

version: 4410 (4b0c638)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

Operating systems

Mac

GGML backends

CPU

Hardware

MacOS M1

Models

https://huggingface.co/ggml-org/stories15M_MOE stories15M_MOE-Q8_0.gguf

Problem description & steps to reproduce

  1. download the gguf model

    mkdir -p models
    MODEL=stories15M_MOE-Q8_0.gguf && curl -sL -o models/$MODEL "https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/$MODEL?download=true"
  2. run with docker on aarch64 - it fails

    docker run --platform linux/aarch64 --rm -it --name llama.cpp-full -v $PWD/models:/models ghcr.io/ggerganov/llama.cpp:full-b4410 --run -m /models/stories15M_MOE-Q8_0.gguf -p "Building a website can be done in 10 simple steps:"
    ...
    echo $?
    139

    When running the same command from a bash shell inside the container, it additionally prints "Segmentation fault (core dumped)" (a backtrace can be captured as in the sketch after these steps):

    docker run --entrypoint /bin/bash --platform linux/aarch64 --rm -it --name llama.cpp-full -v $PWD/models:/models ghcr.io/ggerganov/llama.cpp:full-b4410
    ./llama-cli -m /models/stories15M_MOE-Q8_0.gguf -p "Building a website can be done in 10 simple steps:"
  3. run with docker on amd64 - it succeeds

    docker run --platform linux/amd64 --rm -it --name llama.cpp-full -v $PWD/models:/models ghcr.io/ggerganov/llama.cpp:full-b4410 --run -m /models/stories15M_MOE-Q8_0.gguf -p "Building a website can be done in 10 simple steps:"
    # use docker stop llama.cpp-full to stop that run
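
A backtrace from inside the crashing container would help narrow this down. A rough sketch, assuming the full image is Ubuntu-based with apt available (adjust if the actual image contents differ):

    docker run --entrypoint /bin/bash --platform linux/aarch64 --rm -it \
      -v $PWD/models:/models ghcr.io/ggerganov/llama.cpp:full-b4410
    # inside the container: install gdb and re-run llama-cli under it
    apt-get update && apt-get install -y gdb
    gdb --batch -ex run -ex bt --args \
      ./llama-cli -m /models/stories15M_MOE-Q8_0.gguf \
      -p "Building a website can be done in 10 simple steps:"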

I see that the aarch64 run does not print a load_backend line, while the amd64 run does:

< build: 4410 (4b0c638b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
---
> load_backend: loaded CPU backend from ./libggml-cpu-haswell.so
> build: 4410 (4b0c638b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
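
To check whether the aarch64 image is simply missing a CPU backend library, the bundled libggml shared objects in the two images could be compared. A minimal sketch (searching from / is crude, but I don't know the exact install path inside the images):

    # list the ggml backend libraries shipped in each image
    docker run --entrypoint /bin/bash --platform linux/aarch64 --rm \
      ghcr.io/ggerganov/llama.cpp:full-b4410 -c "find / -name 'libggml*' 2>/dev/null"
    docker run --entrypoint /bin/bash --platform linux/amd64 --rm \
      ghcr.io/ggerganov/llama.cpp:full-b4410 -c "find / -name 'libggml*' 2>/dev/null"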

First Bad Commit

No response

Relevant log output

Unable to find image 'ghcr.io/ggerganov/llama.cpp:full-b4410' locally
full-b4410: Pulling from ggerganov/llama.cpp
Digest: sha256:03fd6a1abb47fc7abe25c50a7a2fb0651ef0f0ef314e5fe0c16fa80442b1f83f
Status: Downloaded newer image for ghcr.io/ggerganov/llama.cpp:full-b4410
build: 4410 (4b0c638b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 63 tensors from /models/stories15M_MOE-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 4x24M
llama_model_loader: - kv   3:                            general.license str              = mit
llama_model_loader: - kv   4:                          llama.block_count u32              = 6
llama_model_loader: - kv   5:                       llama.context_length u32              = 256
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 288
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 768
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 6
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 6
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                         llama.expert_count u32              = 4
llama_model_loader: - kv  13:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 48
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  20:                      tokenizer.ggml.scores arr[f32,32000]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,32000]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   19 tensors
llama_model_loader: - type q8_0:   44 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 256
llm_load_print_meta: n_embd           = 288
llm_load_print_meta: n_layer          = 6
llm_load_print_meta: n_head           = 6
llm_load_print_meta: n_head_kv        = 6
llm_load_print_meta: n_rot            = 48
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 48
llm_load_print_meta: n_embd_head_v    = 48
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 288
llm_load_print_meta: n_embd_v_gqa     = 288
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 768
llm_load_print_meta: n_expert         = 4
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 256
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 36.36 M
llm_load_print_meta: model size       = 36.87 MiB (8.51 BPW) 
llm_load_print_meta: general.name     = n/a
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
@noah-bytestudio

I am facing the same issue. When running the llama.cpp:server image with the --platform linux/arm64 flag, the server won't start. If I use the --platform linux/amd64 flag, the server starts, but is incredibly slow.
