GraphQL Search API - Pagination and Sorting Constraints Cause Data Loss for Large-Scale Queries #35831
Open
Labels: content, search-github, waiting for review
Code of Conduct
What article on docs.github.com is affected?
https://docs.github.com/en/graphql/reference/queries#search
What part(s) of the article would you like to see updated?
https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories
Additional information
Issue:
The GraphQL API enforces a 1000-record limit per search query, which is fine. But when you fetch records based on criteria such as creation or update dates and specify a page size, for example "first 100", the API does not return the first 100 records per the specified criteria. The records seem to be picked at random, or may be sorted by some default relevance-based ordering, which makes pagination non-deterministic when fetching records by criteria like creation or update dates.
Impact:
Steps to reproduce
Query:
query Search {
  search(query: "org:microsoft created:>2010-01-01 sort:created-asc", type: REPOSITORY, first: 100) {
    edges {
      node {
        ... on Repository {
          url
          createdAt
        }
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
Expected:
Should return sorted records based on createdAt starting from 2010-01-01.
Or, if the query does not include sort:created-asc, it should still return the first 100 records with respect to the createdAt date, even if unsorted.
Actual:
A seemingly random set of 1000 records (100 records per page), chosen by relevance instead of the specified criteria. This makes the results appear random with respect to the createdAt date.
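To make the "Actual" behavior easier to verify, here is a minimal Python sketch that checks whether one page of a decoded response is in ascending createdAt order. It assumes the response has the shape produced by the query above (data.search.edges[].node.createdAt) and that timestamps look like "2014-03-01T12:00:00Z". The function names are hypothetical helpers, not part of any GitHub library.

```python
from datetime import datetime


def parse_created_at(value):
    # Assumption: timestamps are ISO 8601 in UTC, e.g. "2014-03-01T12:00:00Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def created_at_values(response):
    # Walk data.search.edges and collect createdAt for nodes that carry it.
    edges = response["data"]["search"]["edges"]
    return [
        parse_created_at(edge["node"]["createdAt"])
        for edge in edges
        if edge.get("node") and "createdAt" in edge["node"]
    ]


def is_sorted_ascending(values):
    # True when every value is <= the one that follows it.
    return all(a <= b for a, b in zip(values, values[1:]))
```

Running is_sorted_ascending(created_at_values(response)) on each page gives a quick yes/no on whether sort:created-asc was honored for that page.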
Rationale and Examples
The general expectation is that when the user says "Give me the first 10 repositories belonging to this organization that were created after 2010-01-01", "first 10" means the first 10 results with respect to that date (2010-01-01), even if unsorted. Instead, the API also returns repositories that were created very recently and are clearly not among the first 10 since the given date. It looks like GitHub selects whichever 10 repositories are most relevant while ignoring the query criteria based on the createdAt date.
Even if first: 10 (or first: 100) is meant only for pagination (and is not considered part of the query), the current behavior breaks usability because the paginated values are not deterministic. When the user asks for records after a certain date, the first 10 (or 100) records on the first page should be determined by that criterion and not by some default 'relevance'. At the very least, there must be some way to override the default.
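For reference, cursor-based pagination over the search connection can be sketched as follows in Python. This is an illustration under assumptions, not an official client: fetch_all and build_payload are hypothetical names, and post is a caller-supplied function that sends the payload to the GraphQL endpoint and returns the decoded JSON. The loop passes each page's endCursor as the next after argument until hasNextPage is false, which only yields a stable result set if the server's ordering is deterministic.

```python
SEARCH_QUERY = """
query Search($q: String!, $first: Int!, $after: String) {
  search(query: $q, type: REPOSITORY, first: $first, after: $after) {
    edges { node { ... on Repository { url createdAt } } }
    pageInfo { endCursor hasNextPage }
  }
}
"""


def build_payload(search_string, first=100, after=None):
    # Request body for one page; `after` is None for the first page.
    return {
        "query": SEARCH_QUERY,
        "variables": {"q": search_string, "first": first, "after": after},
    }


def fetch_all(search_string, post, first=100):
    # `post` is a hypothetical callable: payload dict -> decoded response dict.
    nodes, after = [], None
    while True:
        search = post(build_payload(search_string, first, after))["data"]["search"]
        nodes.extend(edge["node"] for edge in search["edges"])
        if not search["pageInfo"]["hasNextPage"]:
            return nodes
        after = search["pageInfo"]["endCursor"]
```

Injecting post keeps the sketch independent of any particular HTTP library and makes it easy to replay recorded pages when comparing runs.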
For example, here we select just the first 10 records, and the query returns records created as recently as 2023 -
But there are more than 100 repositories that were created just between 2010 and 2015. The first 10 results should have been from this set. See the results below -
Suggestions for Improvement:
Use Case:
We’re building an insights tool that analyzes repository metadata at scale. Due to these limitations, we’re unable to fetch data deterministically and efficiently. It seems practically impossible to fetch usable data.
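One possible workaround for a tool like this, sketched in Python under assumptions, is to split the date range into smaller windows using the created:YYYY-MM-DD..YYYY-MM-DD range qualifier and fetch each window in full. This does not fix the ordering behavior described above; it only helps if every window stays under the 1000-record limit, which is not guaranteed. The helper name yearly_queries is hypothetical.

```python
def yearly_queries(base, start_year, end_year):
    # Build one search string per calendar year, inclusive of both ends.
    # Assumption: each yearly window returns fewer than 1000 results;
    # narrower windows (months, days) would be needed otherwise.
    return [
        f"{base} created:{year}-01-01..{year}-12-31"
        for year in range(start_year, end_year + 1)
    ]
```

For example, yearly_queries("org:microsoft", 2010, 2012) produces three search strings, one per year, each of which can be paginated independently.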