Remove datashaper strip code #1574

Merged
merged 26 commits into from
Jan 3, 2025
Commits
c6b658a
Remove most old pipeline running code and update tests
natoverse Dec 31, 2024
96056e9
Restore update functionality
natoverse Dec 31, 2024
201d173
Remove most pipeline config references
natoverse Dec 31, 2024
e6ac3c9
Move workflows up
natoverse Dec 31, 2024
aa47297
Simplify base_text_units naming
natoverse Dec 31, 2024
21e8e2e
Simplify verb test setup
natoverse Dec 31, 2024
d916feb
Simplify table read/write
natoverse Dec 31, 2024
12fa4f7
Remove some pipeline_config refs
natoverse Dec 31, 2024
eb099c1
Remove misc unused utils/functions
natoverse Dec 31, 2024
b72f4d9
Semver
natoverse Dec 31, 2024
58ea8be
Update blog posts (#1571)
AlonsoGuevara Dec 30, 2024
6b079c0
Semver
natoverse Dec 31, 2024
15bea98
Fix filename typo
natoverse Jan 2, 2025
3096d3c
Remove runtime_storage and snapshot
natoverse Jan 2, 2025
fb646a5
Remove unused exporter
natoverse Jan 2, 2025
bb0ecdc
Improve workflow_name reuse
natoverse Jan 2, 2025
a212446
Move derive_from_rows from DataShaper
natoverse Jan 2, 2025
651748b
Migrate callbacks/progress from DataShaper
natoverse Jan 3, 2025
86933c3
Remove unused profiling
natoverse Jan 3, 2025
ad08db8
Remove DataShaper agg helper in create_base_text_units
natoverse Jan 3, 2025
4c507a2
Remove VerbParallelizationError
natoverse Jan 3, 2025
9ab7989
Remove datashaper dep and docs references
natoverse Jan 3, 2025
5299d3d
Chore/gleanings any encoding (#1569)
AlonsoGuevara Jan 2, 2025
2523a46
Basic search implementation (#1563)
gaudyb Jan 2, 2025
a190526
Remove config input models (#1570)
dworthen Jan 2, 2025
236e8ef
Bump ruff from 0.8.4 to 0.8.5 (#1579)
dependabot[bot] Jan 2, 2025
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20241227205339264730.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "new search implemented as a new option for the api"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20241231213627966329.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "Remove old pipeline runner."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20241231214323349946.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "Remove DataShaper (first steps)."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241230224307150194.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "Make gleanings independent of encoding"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20250102170720512799.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Remove config input models."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20250102232542899735.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Ruff update"
}
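
The change files above each record a `type` and a `description`. As a rough sketch of how such entries could be combined into a single version bump, the snippet below assumes the largest change type present decides the bump (major over minor over patch). The function name, the precedence table, and the starting version `1.0.0` are illustrative assumptions, not taken from this PR or from the semversioner source.

```python
import json

# Assumed precedence: the largest change type present decides the bump.
PRECEDENCE = {"patch": 0, "minor": 1, "major": 2}

# Change entries copied from the change files added in this PR.
CHANGE_FILES = [
    '{"type": "minor", "description": "new search implemented as a new option for the api"}',
    '{"type": "minor", "description": "Remove old pipeline runner."}',
    '{"type": "minor", "description": "Remove DataShaper (first steps)."}',
    '{"type": "minor", "description": "Make gleanings independent of encoding"}',
    '{"type": "patch", "description": "Remove config input models."}',
    '{"type": "patch", "description": "Ruff update"}',
]


def next_version(current: str, changes: list[dict]) -> str:
    """Return the version after applying the largest bump found in `changes`."""
    major, minor, patch = (int(part) for part in current.split("."))
    if not changes:
        return current  # nothing pending, so nothing to release
    bump = max((change["type"] for change in changes), key=PRECEDENCE.__getitem__)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


changes = [json.loads(text) for text in CHANGE_FILES]
# "1.0.0" is a hypothetical starting version used only for illustration.
print(next_version("1.0.0", changes))
```

Under this assumption, one `minor` entry is enough to produce a minor release even when `patch` entries are also pending.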
4 changes: 0 additions & 4 deletions dictionary.txt
@@ -148,10 +148,6 @@ codebases
# Microsoft
MSRC

-# Broken Upstream
-# TODO FIX IN DATASHAPER
-Arrary
-
# Prompt Inputs
ABILA
Abila
7 changes: 7 additions & 0 deletions docs/blog_posts.md
@@ -44,4 +44,11 @@
<h6>Published November 25, 2024

By [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>

- [:octicons-arrow-right-24: __Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users__](https://www.microsoft.com/en-us/research/blog/moving-to-graphrag-1-0-streamlining-ergonomics-for-developers-and-users)

---
<h6>Published December 16, 2024

By [Nathan Evans](https://www.microsoft.com/en-us/research/people/naevans/), Principal Software Architect; [Alonso Guevara Fernández](https://www.microsoft.com/en-us/research/people/alonsog/), Senior Software Engineer; [Joshua Bradley](https://www.microsoft.com/en-us/research/people/joshbradley/), Senior Data Scientist</h6>
</div>
3 changes: 1 addition & 2 deletions docs/examples_notebooks/index_migration.ipynb
@@ -206,9 +206,8 @@
"metadata": {},
"outputs": [],
"source": [
-"from datashaper import NoopVerbCallbacks\n",
-"\n",
"from graphrag.cache.factory import create_cache\n",
+"from graphrag.callbacks.noop_verb_callbacks import NoopVerbCallbacks\n",
"from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings\n",
"\n",
"# We only need to re-run the embeddings workflow, to ensure that embeddings for all required search fields are in place\n",
28 changes: 2 additions & 26 deletions docs/index/architecture.md
@@ -8,33 +8,9 @@ In order to support the GraphRAG system, the outputs of the indexing engine (in
This model is designed to be an abstraction over the underlying data storage technology, and to provide a common interface for the GraphRAG system to interact with.
In normal use-cases the outputs of the GraphRAG Indexer would be loaded into a database system, and the GraphRAG's Query Engine would interact with the database using the knowledge model data-store types.

-### DataShaper Workflows
-
-GraphRAG's Indexing Pipeline is built on top of our open-source library, [DataShaper](https://github.com/microsoft/datashaper).
-DataShaper is a data processing library that allows users to declaratively express data pipelines, schemas, and related assets using well-defined schemas.
-DataShaper has implementations in JavaScript and Python, and is designed to be extensible to other languages.
-
-One of the core resource types within DataShaper is a [Workflow](https://github.com/microsoft/datashaper/blob/main/javascript/schema/src/workflow/WorkflowSchema.ts).
-Workflows are expressed as sequences of steps, which we call [verbs](https://github.com/microsoft/datashaper/blob/main/javascript/schema/src/workflow/verbs.ts).
-Each step has a verb name and a configuration object.
-In DataShaper, these verbs model relational concepts such as SELECT, DROP, JOIN, etc.. Each verb transforms an input data table, and that table is passed down the pipeline.
-
-```mermaid
----
-title: Sample Workflow
----
-flowchart LR
-input[Input Table] --> select[SELECT] --> join[JOIN] --> binarize[BINARIZE] --> output[Output Table]
-```
-
-### LLM-based Workflow Steps
-
-GraphRAG's Indexing Pipeline implements a handful of custom verbs on top of the standard, relational verbs that our DataShaper library provides. These verbs give us the ability to augment text documents with rich, structured data using the power of LLMs such as GPT-4. We utilize these verbs in our standard workflow to extract entities, relationships, claims, community structures, and community reports and summaries. This behavior is customizable and can be extended to support many kinds of AI-based data enrichment and extraction tasks.
-
-### Workflow Graphs
+### Workflows

Because of the complexity of our data indexing tasks, we needed to be able to express our data pipeline as series of multiple, interdependent workflows.
-In the GraphRAG Indexing Pipeline, each workflow may define dependencies on other workflows, effectively forming a directed acyclic graph (DAG) of workflows, which is then used to schedule processing.
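
The architecture text above describes workflows that declare dependencies on other workflows, forming a DAG that is used to schedule processing. A minimal sketch of that idea using Python's standard-library `graphlib` follows. It is not GraphRAG's scheduler: the dependency map and every workflow name other than `create_base_text_units` are hypothetical and chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: workflow name -> names of the workflows it depends on.
WORKFLOW_DEPENDENCIES: dict[str, set[str]] = {
    "create_base_text_units": set(),
    "extract_graph": {"create_base_text_units"},
    "create_communities": {"extract_graph"},
    "create_community_reports": {"extract_graph", "create_communities"},
    "generate_text_embeddings": {"create_base_text_units", "create_community_reports"},
}


def schedule(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an execution order in which every workflow follows its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # graphlib reports the offending cycle as the second element of args.
        raise ValueError(f"workflow dependencies contain a cycle: {err.args[1]}") from err


for name in schedule(WORKFLOW_DEPENDENCIES):
    print(name)
```

A topological order is not unique, so the exact sequence printed may vary. The only guarantee is that a workflow never runs before the workflows it depends on.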

```mermaid
---
@@ -55,7 +31,7 @@ stateDiagram-v2
The primary unit of communication between workflows, and between workflow steps is an instance of `pandas.DataFrame`.
Although side-effects are possible, our goal is to be _data-centric_ and _table-centric_ in our approach to data processing.
This allows us to easily reason about our data, and to leverage the power of dataframe-based ecosystems.
-Our underlying dataframe technology may change over time, but our primary goal is to support the DataShaper workflow schema while retaining single-machine ease of use and developer ergonomics.
+Our underlying dataframe technology may change over time, but our primary goal is to support the workflow schema while retaining single-machine ease of use and developer ergonomics.

### LLM Caching

19 changes: 0 additions & 19 deletions examples/README.md

This file was deleted.

2 changes: 0 additions & 2 deletions examples/__init__.py

This file was deleted.

2 changes: 0 additions & 2 deletions examples/custom_input/__init__.py

This file was deleted.

24 changes: 0 additions & 24 deletions examples/custom_input/pipeline.yml

This file was deleted.

46 changes: 0 additions & 46 deletions examples/custom_input/run.py

This file was deleted.

3 changes: 0 additions & 3 deletions examples/single_verb/input/data.csv

This file was deleted.

12 changes: 0 additions & 12 deletions examples/single_verb/pipeline.yml

This file was deleted.

77 changes: 0 additions & 77 deletions examples/single_verb/run.py

This file was deleted.

2 changes: 0 additions & 2 deletions examples/use_built_in_workflows/__init__.py

This file was deleted.

23 changes: 0 additions & 23 deletions examples/use_built_in_workflows/pipeline.yml

This file was deleted.
