
[Issue]: Does cache still work if extracting one-file graphRAG from a multiple files graphRAG? #819

Closed
Edwin-poying opened this issue Aug 5, 2024 · 7 comments
Labels
awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response

Comments

Edwin-poying commented Aug 5, 2024

Hi, here is the scenario I am currently facing:

I built a graphRAG index based on two distinct .txt files. Later, I wanted to see if I could build a graphRAG index based on only one of them.
After modifying the settings file so that only one file gets ingested, I ran the following command:
python -m graphrag.index --root .

I expected this would not cost much if the indexing stage could leverage the cache; however, it still makes full calls to OpenAI to build the graph.

Can someone tell me whether I did something wrong, or whether this scenario is not supported yet?

Many thanks.

@Edwin-poying Edwin-poying added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Aug 5, 2024
Edwin-poying (Author)

Here is my settings.yml file


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: ${GRAPHRAG_LLM_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    api_base: ${API_BASE}
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  


chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*0731\\.txt$"
  # file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [product,group,job,feature,case,solution]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: True
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

@Edwin-poying Edwin-poying changed the title [Issue]: Does cache still work if extracting a file graphRAG from a multiple files graphRAG? [Issue]: Does cache still work if extracting one-file graphRAG from a multiple files graphRAG? Aug 5, 2024
natoverse (Collaborator)

If your settings have not been changed, and the original file is still in the folder, then it should use the cache in several places. For example, the text units (chunks) should be identical, so graph extraction should use the cache for those. However, any new entities and relationships extracted from the second file will trigger a re-compute of the communities, and therefore all of the community summarization, which can be much of your overall expense. We're tracking more efficient incremental indexing with #741.

@natoverse natoverse added awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response and removed triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Aug 5, 2024
Edwin-poying (Author) commented Aug 6, 2024

Hi @natoverse,
I have changed the file_pattern field in the input settings to target only the specific file. Does this matter?

Below is how I changed it:
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  # file_pattern: ".*\\.txt$"
  file_pattern: ".*0731\\.txt$"

natoverse (Collaborator)

I don't think it should matter - the key to getting an accurate cache is that we hash all of the LLM params and prompt so that identical API calls are avoided. This is done per step, so individual parameter changes should only affect the steps that rely on them.
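
In other words, the cache behaves roughly like a content-addressed store keyed on the prompt plus the LLM parameters. The sketch below only illustrates that idea (it is not GraphRAG's actual hashing code): identical calls map to the same key, so the stored response can be reused instead of hitting the API again.

import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    # Illustrative only: hash the prompt together with the (sorted) LLM parameters,
    # so that an identical API call maps to the same cache key.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_a = cache_key("Extract entities from: <chunk text>", {"model": "gpt-4", "max_tokens": 4000})
key_b = cache_key("Extract entities from: <chunk text>", {"model": "gpt-4", "max_tokens": 4000})
assert key_a == key_b  # identical call -> cache hit; changing any parameter changes the key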

Edwin-poying (Author)

Thank you @natoverse for graphRAG and for your answer.

I still have one question related to this topic:

I originally generated the graphRAG index using two files; later, I decided to build one using only one of them. I am wondering whether the system needs to regenerate the entity summaries, since the description lists may change as a result of reducing the input documents. The same question applies to the summaries of relationships and claims.

natoverse (Collaborator)

The entity/relationship extraction step is separate from the summarization. When extracting, each entity and relationship is given a description by the LLM. This will get the benefit of the cache. Before creating the community reports, the descriptions for each entity are combined into a single "canonical" description. This is also done by the LLMs, and if you have new instances of the entities, it should not use the cache.
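
A rough way to picture why that summarization step can miss the cache when the input set shrinks (the keying scheme below is an assumption for illustration, not GraphRAG's actual implementation): the combined list of per-instance descriptions is part of the summarization input, so dropping a source document changes that list and the previously cached summary no longer matches.

import hashlib

def summary_cache_key(entity: str, descriptions: list[str]) -> str:
    # Illustrative only: key the "canonical description" summarization on the entity
    # plus all of its per-instance descriptions.
    joined = entity + "|" + "|".join(sorted(descriptions))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

two_files = summary_cache_key("ProductX", ["desc from file A", "desc from file B"])
one_file  = summary_cache_key("ProductX", ["desc from file A"])
assert two_files != one_file  # different description list -> cache miss, summary regenerated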

Edwin-poying (Author)

Many thanks
