
[Issue]: Does cache still work if extracting one-file graphRAG from a multiple files graphRAG? #819

Closed
Edwin-poying opened this issue Aug 5, 2024 · 7 comments
Labels
awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response

Comments

Edwin-poying commented Aug 5, 2024

Hi, here is the scenario I am currently facing:

I built a graphRAG index based on two distinct .txt files. Later, I wanted to see if I could build a graphRAG index based on only one of them.
After modifying the settings file so that only one file gets ingested, I ran the following command:
python -m graphrag.index --root .

I expected this would not cost much if the indexing stage could leverage the cache; however, it still makes full calls to OpenAI to build the graph.

Can someone tell me whether I did something wrong, or whether this scenario is not supported yet?

Many thanks.

@Edwin-poying Edwin-poying added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Aug 5, 2024
Edwin-poying (Author)

Here is my settings.yml file


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: ${GRAPHRAG_LLM_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    api_base: ${API_BASE}
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  


chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*0731\\.txt$"
  # file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [product,group,job,feature,case,solution]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: True
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

@Edwin-poying Edwin-poying changed the title [Issue]: Does cache still work if extracting a file graphRAG from a multiple files graphRAG? [Issue]: Does cache still work if extracting one-file graphRAG from a multiple files graphRAG? Aug 5, 2024
natoverse (Collaborator)

If your settings have not been changed, and the original file is still in the folder, then it should use the cache in several places. For example, the text units (chunks) should be identical, so graph extraction should use the cache for those. However, any new entities and relationships extracted from the second file will trigger a re-compute of the communities, and therefore all of the community summarization, which can be much of your overall expense. We're tracking more efficient incremental indexing with #741.

@natoverse natoverse added awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response and removed triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Aug 5, 2024
Edwin-poying (Author) commented Aug 6, 2024

Hi @natoverse,
I have changed the file_pattern field in the input settings to target only the specific file. Does this matter?

Below is how I changed it:
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  # file_pattern: ".*\\.txt$"
  file_pattern: ".*0731\\.txt$"

natoverse (Collaborator)

I don't think it should matter - the key to getting an accurate cache is that we hash all of the LLM params and prompt so that identical API calls are avoided. This is done per step, so individual parameter changes should only affect the steps that rely on them.
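
In other words, the cache behaves roughly like a content-addressed store keyed on the prompt plus the LLM parameters. The sketch below only illustrates that idea (it is not GraphRAG's actual hashing code): identical calls map to the same key, so the stored response can be reused instead of hitting the API again.

import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    # Illustrative only: hash the prompt together with the (sorted) LLM parameters,
    # so that an identical API call maps to the same cache key.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_a = cache_key("Extract entities from: <chunk text>", {"model": "gpt-4", "max_tokens": 4000})
key_b = cache_key("Extract entities from: <chunk text>", {"model": "gpt-4", "max_tokens": 4000})
assert key_a == key_b  # identical call -> cache hit; changing any parameter changes the key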

Edwin-poying (Author)

Thank you @natoverse for graphRAG and for your answer.

I still have one question related to this topic:

I originally generated the graphRAG index using two files; later, I decided to build one using only one of them. I am wondering whether the system needs to regenerate the entity summaries, since the description lists may change as a result of reducing the input documents. The same question applies to the summaries of relationships and claims.

natoverse (Collaborator)

The entity/relationship extraction step is separate from the summarization. When extracting, each entity and relationship is given a description by the LLM. This will get the benefit of the cache. Before creating the community reports, the descriptions for each entity are combined into a single "canonical" description. This is also done by the LLMs, and if you have new instances of the entities, it should not use the cache.
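
A rough way to picture why that summarization step can miss the cache when the input set shrinks (the keying scheme below is an assumption for illustration, not GraphRAG's actual implementation): the combined list of per-instance descriptions is part of the summarization input, so dropping a source document changes that list and the previously cached summary no longer matches.

import hashlib

def summary_cache_key(entity: str, descriptions: list[str]) -> str:
    # Illustrative only: key the "canonical description" summarization on the entity
    # plus all of its per-instance descriptions.
    joined = entity + "|" + "|".join(sorted(descriptions))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

two_files = summary_cache_key("ProductX", ["desc from file A", "desc from file B"])
one_file  = summary_cache_key("ProductX", ["desc from file A"])
assert two_files != one_file  # different description list -> cache miss, summary regenerated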

Edwin-poying (Author)

Many thanks
