
[Issue]: "Some reports are missing full content embeddings" when you use a DRIFT search #1561

Open
3 tasks done
MarkHmnv opened this issue Dec 27, 2024 · 6 comments
Labels
triage Default label assignment, indicates new issue needs reviewed by a maintainer

Comments

@MarkHmnv

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

Hi, I followed the guide from the official DRIFT search documentation; however, when I try to run the search I get the following error:

Entity count: 2539
Relationship count: 1435
Text unit records: 103
Traceback (most recent call last):
  File "...\graphrag-example\drift_search.py", line 127, in <module>
    response = asyncio.run(drift_search.asearch('What happens after my NetSuite Service Tier license is activated?'))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python\Python311\Lib\asyncio\runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "...\Python\Python311\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python\Python311\Lib\asyncio\base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "...\graphrag-example\venv\Lib\site-packages\graphrag\query\structured_search\drift_search\search.py", line 200, in asearch
    primer_context, token_ct = self.context_builder.build_context(query)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\graphrag-example\venv\Lib\site-packages\graphrag\query\structured_search\drift_search\drift_context.py", line 196, in build_context
    report_df = self.convert_reports_to_df(self.reports)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\graphrag-example\venv\Lib\site-packages\graphrag\query\structured_search\drift_search\drift_context.py", line 128, in convert_reports_to_df
    raise ValueError(
ValueError: Some reports are missing full content embeddings. 187 out of 187

Local and global search work fine. DRIFT search also works when run from the CLI, e.g.:

graphrag query --query "What happens after my NetSuite Service Tier license is activated?" --method drift

so the failure only appears when calling DRIFT search from Python.
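For context, the check that raises this ValueError can be sketched in plain Python. This is a simplified, stdlib-only stand-in with an assumed dict shape; the real `convert_reports_to_df` in graphrag's `drift_context.py` operates on `CommunityReport` objects:

```python
# Simplified stand-in for the validation in drift_context.py's
# convert_reports_to_df: every community report must carry a
# full-content embedding before DRIFT context building can proceed.
reports = [{"id": i, "full_content_embedding": None} for i in range(187)]

missing = sum(1 for r in reports if r["full_content_embedding"] is None)
try:
    if missing > 0:
        raise ValueError(
            f"Some reports are missing full content embeddings. "
            f"{missing} out of {len(reports)}"
        )
except ValueError as err:
    print(err)  # Some reports are missing full content embeddings. 187 out of 187
```

"187 out of 187" means every report came back without an embedding, which points at how the reports were loaded rather than at the index itself.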

Steps to reproduce

  1. Index your documents
  2. Run the code according to the DRIFT search guide referenced above

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: azure_openai_chat # or openai_chat
  model: ${GRAPHRAG_LLM_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  api_base: ${GRAPHRAG_API_BASE}
  api_version: ${GRAPHRAG_API_VERSION}
  organization: ${GRAPHRAG_API_ORGANIZATION}
  deployment_name: ${GRAPHRAG_LLM_MODEL}
  tokens_per_minute: 2000000
  requests_per_minute: 20000

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded

embeddings:
  async_mode: threaded
  vector_store: 
    type: lancedb
    db_uri: 'output\lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    api_base: ${GRAPHRAG_API_BASE}
    api_version: ${GRAPHRAG_API_VERSION}
    organization: ${GRAPHRAG_API_ORGANIZATION}
    deployment_name: ${GRAPHRAG_EMBEDDING_MODEL}
    tokens_per_minute: 350000
    requests_per_minute: 2100

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 1.0.1
  • Operating System: Windows 11
  • Python Version: 3.11
  • Related Issues:
@MarkHmnv MarkHmnv added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Dec 27, 2024
@entorick

Same here, by the way.

@YepJin

YepJin commented Dec 30, 2024

Same here... not sure what happens

@YepJin

YepJin commented Dec 31, 2024

I think the issue comes from the read_indexer_reports call in the example notebook: the config is not passed in there.

[screenshot]

So the full_content_embedding column is None when returned.

[screenshot]

Can you help check this example Drift_search notebook? @natoverse thanks!

@thomasjlittle

I was having the same issue, and @YepJin's suggestion of adding the config to the read_indexer_reports call worked for me. Thank you!
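If you want to confirm the fix on your own data before running the search, a quick stdlib-only diagnostic helps. The dict shape here is illustrative; in graphrag the loaded reports are `CommunityReport` objects whose attribute is `full_content_embedding`:

```python
# List the report ids that still lack a full-content embedding after
# loading, before handing the reports to the DRIFT context builder.
reports = [
    {"id": 0, "full_content_embedding": [0.1, 0.2]},
    {"id": 1, "full_content_embedding": None},
]
missing_ids = [r["id"] for r in reports if r.get("full_content_embedding") is None]
print(missing_ids)  # [1]
```

An empty list means the embeddings were loaded and the "Some reports are missing full content embeddings" check should pass.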

@xldistance

You can refer to my code; the official notebook is missing the description_embedding_store.

        # NOTE: COMMUNITY_LEVEL, llm, token_encoder, and reports are assumed to be
        # defined elsewhere (reports comes from read_indexer_reports).
        INPUT_DIR = "E:\\graphrag_kb\\input\\artifacts"
        LANCEDB_URI = "E:\\graphrag_kb\\output\\lancedb"
        COMMUNITY_REPORT_TABLE = "create_final_community_reports"
        FINAL_COMMUNITY_TABLE = "create_final_communities"
        ENTITY_TABLE = "create_final_nodes"
        ENTITY_EMBEDDING_TABLE = "create_final_entities"
        RELATIONSHIP_TABLE = "create_final_relationships"
        COVARIATE_TABLE = "create_final_covariates"
        TEXT_UNIT_TABLE = "create_final_text_units"

        text_embedder = OpenAIEmbedding(
            # local embedding model
            api_key="ollama",
            api_base="http://localhost:11434/v1",
            model="bge-m3:Q4",
            deployment_name="bge-m3:Q4",
            api_type=OpenaiApiType.OpenAI,
            max_retries=20,
        )

        entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
        entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
        entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

        description_embedding_store = LanceDBVectorStore(collection_name="default-entity-description")
        description_embedding_store.connect(db_uri=LANCEDB_URI)

        relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
        relationships = read_indexer_relationships(relationship_df)

        text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
        text_units = read_indexer_text_units(text_unit_df)

        drift_params = DRIFTSearchConfig(
            temperature=0.5,
            max_tokens=12_000,
            primer_folds=1,        # number of folds for the search primer
            drift_k_followups=3,   # number of global follow-up queries
            n_depth=3,             # depth of the hybrid search
            n=1,                   # number of hybrid searches
        )

        drift_context_builder = DRIFTSearchContextBuilder(
            chat_llm=llm,
            text_embedder=text_embedder,
            entities=entities,
            relationships=relationships,
            reports=reports,
            entity_text_embeddings=description_embedding_store,
            text_units=text_units,
            config=drift_params,
        )

        drift_serch_engine = DRIFTSearch(
            llm=llm, context_builder=drift_context_builder, token_encoder=token_encoder
        )

@xldistance

The DRIFT search answer is then retrieved like this:

        result = await drift_serch_engine.asearch(prompt)
        formatted_response = result.response
        formatted_response: str = formatted_response["nodes"][0]["answer"]
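Note that `result.response` here is a nested dict of follow-up "nodes" rather than a plain string, which is why the answer is pulled out of `nodes[0]`. A defensive version of that extraction, assuming the same shape (the helper name and sample data are hypothetical):

```python
def extract_drift_answer(response):
    """Pull the top-level answer out of a DRIFT response, which may be
    a nested dict of follow-up "nodes" or already a plain string."""
    if isinstance(response, dict):
        nodes = response.get("nodes") or []
        if nodes and "answer" in nodes[0]:
            return nodes[0]["answer"]
        return str(response)
    return response

# Example with the nested shape seen above:
sample = {"nodes": [{"query": "...", "answer": "License activation details"}]}
print(extract_drift_answer(sample))        # License activation details
print(extract_drift_answer("plain text"))  # plain text
```

This avoids a KeyError if a particular query returns a plain-string response instead of the nested structure.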
