GraphQL Search API - Pagination and Sorting Constraints Cause Data Loss for Large-Scale Queries #35831

Open
1 task done
abgoswami opened this issue Jan 4, 2025 · 3 comments
Labels
content This issue or pull request belongs to the Docs Content team search-github Related to GitHub search waiting for review Issue/PR is waiting for a writer's review


Code of Conduct

What article on docs.github.com is affected?

https://docs.github.com/en/graphql/reference/queries#search

What part(s) of the article would you like to see updated?

https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories

Additional information

Issue:

The GraphQL API enforces a 1000-record limit per search query, which is fine. But when you fetch records based on criteria such as creation or update dates and specify a page size, for example "first 100", the API does not return the first 100 records according to those criteria. The records appear to be picked at random, or perhaps sorted by some default relevance-based ranking, which makes pagination non-deterministic for date-based queries.

Impact:

  • Sorting behavior ignores explicit conditions like sort:created-asc.
  • Even if sort:created-asc is not meant to be supported, each page should still return the first 100 records according to the query criteria rather than some unstated criterion such as 'relevance'. If relevance is the default, there must be some way to override that default in the query.
  • Paginated queries risk data loss or overlaps due to inconsistent ordering.
  • Fetching historical data over a large data set becomes practically impossible without hacks and workarounds, even when staying within the rate limit of 5000 requests per hour.
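To illustrate the kind of workaround the last point refers to, here is a minimal sketch that splits a date range into smaller `created:` windows so that each search can stay under the 1000-record limit. It assumes the `created:START..END` range qualifier from GitHub search syntax. The helper names (`created_windows`, `window_queries`) and the 365-day window size are hypothetical choices for illustration, not part of any GitHub API.

```python
from datetime import date, timedelta


def created_windows(start: date, end: date, days: int):
    """Yield (window_start, window_end) pairs covering start..end inclusive."""
    cursor = start
    while cursor <= end:
        # Each window spans `days` days, clipped to the overall end date.
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


def window_queries(org: str, start: date, end: date, days: int = 365):
    """Build one search string per window (hypothetical helper)."""
    return [
        f"org:{org} created:{lo.isoformat()}..{hi.isoformat()}"
        for lo, hi in created_windows(start, end, days)
    ]


queries = window_queries("microsoft", date(2010, 1, 1), date(2011, 12, 31))
for q in queries:
    print(q)
```

Each generated string would be passed as the `query` argument of a separate search. A window that still matches more than 1000 repositories would have to be split further, which is why this is a workaround rather than a fix.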

Steps to reproduce

Query:

query Search {
  search(query: "org:microsoft created:>2010-01-01 sort:created-asc", type: REPOSITORY, first: 100) {
    edges {
      node {
        ... on Repository {
          url
          createdAt
        }
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}

Expected:

Should return records sorted by createdAt, starting from 2010-01-01.
Or, if the query does not include sort:created-asc, it should still return the first 100 records with respect to the createdAt date, even if unsorted.

Actual:

Up to 1000 records (100 records per page) that appear to be selected by relevance instead of the specified criteria. This makes the results appear random with respect to the createdAt date.
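A quick way to check the mismatch between expected and actual ordering is to test whether the createdAt values on a returned page are ascending. The sketch below assumes the response has the shape requested by the query above (`data.search.edges[].node.createdAt`). The sample payload is made up for illustration and is not real API output.

```python
from datetime import datetime


def created_dates(response: dict) -> list[datetime]:
    """Extract createdAt timestamps from a search response, in page order."""
    edges = response["data"]["search"]["edges"]
    return [
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        datetime.fromisoformat(edge["node"]["createdAt"].replace("Z", "+00:00"))
        for edge in edges
    ]


def is_ascending(values) -> bool:
    """True when every value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Made-up sample data, shaped like the response to the query above.
sample = {
    "data": {
        "search": {
            "edges": [
                {"node": {"url": "https://github.com/example/a", "createdAt": "2023-04-11T09:30:00Z"}},
                {"node": {"url": "https://github.com/example/b", "createdAt": "2012-06-02T17:05:00Z"}},
                {"node": {"url": "https://github.com/example/c", "createdAt": "2019-01-20T08:00:00Z"}},
            ]
        }
    }
}

dates = created_dates(sample)
print(is_ascending(dates))
```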

Rationale and Examples

The general expectation is that when the user says "Give me the first 10 repositories belonging to this organization that were created after 2010-01-01", "first 10" means the first 10 results with respect to that date (2010-01-01), even if unsorted. Instead, the API returns repositories that were created very recently and are clearly not among the first 10 since the given date. It looks like GitHub is selecting the 10 most relevant repositories while ignoring the createdAt criterion in the query.

Even if first: 10 (or first: 100) is meant only for pagination (and is not considered part of the query), the current behavior breaks usability because the paginated values are not deterministic. When the user asks for records after a certain date, the first 10 (or 100) records on the first page should be determined by that criterion and not by some default 'relevance'. Or at least there must be some way to override the default.

For example, here we select just the first 10 records, and the query returns repositories created as recently as 2023 -
image

But there are more than 100 repositories that were created just between 2010 and 2015. The first 10 results should have been from this set. See the results below -
image
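The overlap risk described above can be measured by walking the pages with `endCursor` and recording any repository URL that appears more than once. This is a minimal sketch: `fetch_page` is a hypothetical callable that runs the search with the given `after` cursor and returns the `search` object of the response. The fake pages below are invented so the loop can run without network access.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 10):
    """Follow endCursor across pages; return (unique_urls, duplicate_urls)."""
    seen = set()
    duplicates = []
    cursor = None
    # 10 pages of 100 matches the 1000-record limit mentioned in this issue.
    for _ in range(max_pages):
        page = fetch_page(cursor)
        for edge in page["edges"]:
            url = edge["node"]["url"]
            if url in seen:
                duplicates.append(url)
            seen.add(url)
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            break
        cursor = info["endCursor"]
    return seen, duplicates


# Invented pages keyed by cursor; "b" is deliberately repeated on page two.
fake_pages = {
    None: {
        "edges": [{"node": {"url": "a"}}, {"node": {"url": "b"}}],
        "pageInfo": {"endCursor": "c1", "hasNextPage": True},
    },
    "c1": {
        "edges": [{"node": {"url": "b"}}, {"node": {"url": "c"}}],
        "pageInfo": {"endCursor": "c2", "hasNextPage": False},
    },
}

seen, duplicates = collect_pages(fake_pages.get)
print(sorted(seen), duplicates)
```

With a real `fetch_page`, a non-empty `duplicates` list (or a `seen` count lower than the reported total) would point to inconsistent ordering between pages.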

Suggestions for Improvement:

  • Allow deterministic ordering using fields like createdAt and updatedAt.
  • Introduce a server-side cursor for pagination that honors sorting criteria.
  • Provide options to disable relevance-based scoring for explicit queries.
  • Remove or increase the 1000-record limit for queries that remain under the per-hour rate limits.

Use Case:

We’re building an insights tool that analyzes repository metadata at scale. Due to these limitations, we’re unable to fetch data deterministically and efficiently. It seems practically impossible to fetch usable data.

@abgoswami abgoswami added the content This issue or pull request belongs to the Docs Content team label Jan 4, 2025

welcome bot commented Jan 4, 2025

Thanks for opening this issue. A GitHub docs team member should be by to give feedback soon. In the meantime, please check out the contributing guidelines.

@github-actions github-actions bot added the triage Do not begin working on this issue until triaged by the team label Jan 4, 2025

COOLCY16 commented Jan 5, 2025

Fascinating

@nguyenalex836 nguyenalex836 added search-github Related to GitHub search waiting for review Issue/PR is waiting for a writer's review and removed triage Do not begin working on this issue until triaged by the team labels Jan 6, 2025
nguyenalex836 (Contributor) commented

@abgoswami Thank you for raising this issue! I'll get this triaged for review ✨ Our team will provide feedback regarding the best next steps for this issue - thanks for your patience! 💛

3 participants