GraphQL Search API - Pagination and Sorting Constraints Cause Data Loss for Large-Scale Queries #35831

abgoswami · 2025-01-04T20:57:23Z

Code of Conduct

I have read and agree to the GitHub Docs project's Code of Conduct

What article on docs.github.com is affected?

https://docs.github.com/en/graphql/reference/queries#search

What part(s) of the article would you like to see updated?

https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories

Additional information

Issue:

The GraphQL API enforces a 1000-record limit per search query, which is fine. But when you fetch records by criteria such as creation or update date and specify a page size of, for example, first: 100, the API does not return the first 100 records according to those criteria. The records appear to be picked at random, or perhaps sorted by some default relevance ranking, which makes pagination non-deterministic when fetching by criteria such as creation or update date.

Impact:

Sorting behavior ignores explicit conditions like sort:created-asc.
Even if sort:created-asc is not meant to be supported, each page should still return the first 100 records according to the query criteria rather than some unstated criterion such as 'relevance'. Even if relevance is the default, there must be some way to override that default in the query.
Paginated queries risk data loss or overlaps due to inconsistent ordering.
Fetching historical data over a large data set becomes practically impossible without hacks and workarounds, even when staying within the rate limit of 5000 requests per hour.
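
One workaround of the kind alluded to above is to split the search into narrow created: date ranges, so that each range stays under the 1000-record cap, and then sort the combined results client-side. The sketch below only builds the query strings; it makes no API calls. The function names and the 30-day window size are assumptions chosen for illustration, and a busy organization may need narrower windows.

```python
from datetime import date, timedelta


def created_windows(start, end, days):
    """Yield (first_day, last_day) pairs covering start..end inclusive."""
    cursor = start
    while cursor <= end:
        # Each window spans `days` days, clipped to the overall end date.
        last = min(cursor + timedelta(days=days - 1), end)
        yield cursor, last
        cursor = last + timedelta(days=1)


def window_queries(org, start, end, days=30):
    """Build one search string per window using the created:A..B range qualifier."""
    return [
        f"org:{org} created:{first.isoformat()}..{last.isoformat()}"
        for first, last in created_windows(start, end, days)
    ]


if __name__ == "__main__":
    for q in window_queries("microsoft", date(2010, 1, 1), date(2010, 3, 1)):
        print(q)
    # org:microsoft created:2010-01-01..2010-01-30
    # org:microsoft created:2010-01-31..2010-03-01
```

Each string would be passed as the query argument of the search field and paged through separately. This works around the cap but does not address the ordering problem itself.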

Steps to reproduce

Query:

query Search {
  search(query: "org:microsoft created:>2010-01-01 sort:created-asc", type: REPOSITORY, first: 100) {
    edges {
      node {
        ... on Repository {
          url
          createdAt
        }
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}

Expected:

Should return sorted records based on createdAt starting from 2010-01-01.
Or if the query does not include sort:created-asc, should still return the first 100 records with respect to the createdAt date even if unsorted.

Actual:

A seemingly random set of 1000 records (100 records per page), ordered by relevance instead of the specified criteria. This makes the results appear random with respect to the createdAt date.
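
One way to confirm this behavior on a given page is to check whether the returned createdAt values are non-decreasing. This is a minimal sketch, assuming the response has already been parsed into a list of createdAt strings in the form 2010-01-04T20:57:23Z. The function names and the sample timestamps are hypothetical, not real query output.

```python
from datetime import datetime


def parse_created_at(value):
    """Parse a UTC timestamp such as 2010-01-04T20:57:23Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def out_of_order(created_at):
    """Return each index whose timestamp is earlier than the one before it."""
    times = [parse_created_at(v) for v in created_at]
    return [i for i in range(1, len(times)) if times[i] < times[i - 1]]


if __name__ == "__main__":
    # Made-up values, only to show the shape of the check.
    sample = [
        "2010-01-05T00:00:00Z",
        "2023-06-01T12:00:00Z",
        "2011-03-02T08:30:00Z",
    ]
    print(out_of_order(sample))  # [2]
```

An empty list means the page is in ascending createdAt order. Any index in the result marks a record that breaks the order requested by sort:created-asc.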

Rationale and Examples

The general expectation is that when the user says "Give me the first 10 repositories belonging to this organization that were created after 2010-01-01", "first 10" means the first 10 results with respect to that date (2010-01-01), even if unsorted. Instead, the API returns even repositories that were created very recently and are clearly not among the first 10 created since the given date. It looks like GitHub selects whichever 10 repositories are most relevant while ignoring the createdAt criterion in the query.

Even if first: 10 (or first: 100) is meant only for pagination and is not considered part of the query, the current behavior breaks usability because the paginated values are not deterministic. When the user asks for records after a certain date, the first 10 (or 100) records on the first page should be determined by that criterion and not by some default 'relevance'. Or at least there must be some way to override the default.

For example, here we select just the first 10 records, and the query returns records created as recently as 2023.

But there are more than 100 repositories that were created between 2010 and 2015 alone. The first 10 results should have come from this set. See the results below.

Suggestions for Improvement:

Allow deterministic ordering using fields like createdAt and updatedAt.
Introduce a server-side cursor for pagination that honors sorting criteria.
Provide options to disable relevance-based scoring for explicit queries.
Remove or increase the 1000-record limit for queries that remain under the hourly rate limit.

Use Case:

We’re building an insights tool that analyzes repository metadata at scale. Due to these limitations, we’re unable to fetch data deterministically and efficiently. It seems practically impossible to fetch usable data.

welcome · 2025-01-04T20:57:26Z

Thanks for opening this issue. A GitHub docs team member should be by to give feedback soon. In the meantime, please check out the contributing guidelines.

COOLCY16 · 2025-01-05T03:18:25Z

Fascinating

nguyenalex836 · 2025-01-06T19:05:01Z

@abgoswami Thank you for raising this issue! I'll get this triaged for review ✨ Our team will provide feedback regarding the best next steps for this issue - thanks for your patience! 💛

abgoswami added the "content" label (This issue or pull request belongs to the Docs Content team) on Jan 4, 2025

github-actions bot added the "triage" label (Do not begin working on this issue until triaged by the team) on Jan 4, 2025

nguyenalex836 added the "search-github" (Related to GitHub search) and "waiting for review" (Issue/PR is waiting for a writer's review) labels and removed the "triage" label (Do not begin working on this issue until triaged by the team) on Jan 6, 2025
