
"Crawling the listing pages" example results in numerous timeouts / max retries exceeded #2790

Closed
webchick opened this issue Jan 6, 2025 · 0 comments · Fixed by #2791
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments


webchick commented Jan 6, 2025

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

I'm making my way through your excellent introduction docs and ran into an issue on the Crawling the Store page.

Under the heading "Crawling the listing page", it shows the following code example:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        // Wait for the category cards to render,
        // otherwise enqueueLinks wouldn't enqueue anything.
        await page.waitForSelector('.collection-block-item');

        // Add links to the queue, but only from
        // elements matching the provided selector.
        await enqueueLinks({
            selector: '.collection-block-item',
            label: 'CATEGORY',
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

At every other step of the tutorial, I've been able to replace the contents of my main.js file with whatever's in the code block, run node main.js, and see the app working up to that point.

However, when I do so here, I get a bunch of stuff like this:

INFO  PlaywrightCrawler: Starting the crawler.
Processing: https://warehouse-theme-metal.myshopify.com/collections
Processing: https://warehouse-theme-metal.myshopify.com/collections/a-v-receivers
Processing: https://warehouse-theme-metal.myshopify.com/collections/accessories
Processing: https://warehouse-theme-metal.myshopify.com/collections/all-tvs
# Waits for a super long time...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator('.collection-block-item') to be visible

    at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20) {"id":"42KUpkekCOQ7JH8","url":"https://warehouse-theme-metal.myshopify.com/collections/a-v-receivers","retryCount":1}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator('.collection-block-item') to be visible

    at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20) {"id":"mhaUzEY5wa6lbfF","url":"https://warehouse-theme-metal.myshopify.com/collections/accessories","retryCount":1}

...and so on, then it tries to re-queue the failed requests:

INFO  PlaywrightCrawler:Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":1,"requestsFailedPerMinute":0,"requestTotalDurationMillis":1056,"requestsTotal":1,"crawlerRuntimeMillis":60137,"retryHistogram":[1]}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator('.collection-block-item') to be visible

    at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20) {"id":"ilk5hjgD3Q1OZSS","url":"https://warehouse-theme-metal.myshopify.com/collections/amplifiers","retryCount":1}

...eventually the warnings turn into errors once the retry limit is reached:

ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
  - waiting for locator('.collection-block-item') to be visible

    at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)
    at /Users/webchick/TechAround/fun-with-scraping/node_modules/@crawlee/browser/internals/browser-crawler.js:287:87
    at wrap (/Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:54:27)
    at /Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:68:7
    at /Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:67:13
    at addTimeoutToPromise (/Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:61:10)
    at PlaywrightCrawler._runRequestHandler (/Users/webchick/TechAround/fun-with-scraping/node_modules/@crawlee/browser/internals/browser-crawler.js:287:53)
    at async PlaywrightCrawler._runRequestHandler (/Users/webchick/TechAround/fun-with-scraping/node_modules/@crawlee/playwright/internals/playwright-crawler.js:114:9)
    at async wrap (/Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:54:21) {"id":"zcSG2HO1c3fSO3W","url":"https://warehouse-theme-metal.myshopify.com/collections/audio-accessories","method":"GET","uniqueKey":"https://warehouse-theme-metal.myshopify.com/collections/audio-accessories"}

...and in the end, 31/32 requests fail:

INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":31,"retryHistogram":[1,null,null,31],"requestAvgFailedDurationMillis":30359,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":6,"requestTotalDurationMillis":942186,"requestsTotal":32,"crawlerRuntimeMillis":291681}
INFO  PlaywrightCrawler: Error analysis: {"totalErrors":31,"uniqueErrors":1,"mostCommonErrors":["31x: page.waitForSelector: Timeout 30000ms exceeded. (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)"]}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 1 succeeded, 31 failed. {"terminal":true}

Since I'm just learning Crawlee, I don't yet know enough to fix this script, but to debug it I ran it in "headful" mode, and all the category page URLs load properly.

What I suspect is happening is that on all the sub-pages the crawler sits waiting for a .collection-block-item element that never appears, because that class only ever exists on the "main" listing page at https://warehouse-theme-metal.myshopify.com/collections.
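If that suspicion is right, one possible workaround (my own sketch, not necessarily the fix merged in #2791) would be to only wait for the category cards on the top-level /collections page. The helper name and the path check below are assumptions for illustration:

```javascript
// Hypothetical helper: assume only the bare /collections path renders
// the .collection-block-item category cards, so only that page needs
// the waitForSelector call.
function shouldWaitForCategoryCards(url) {
    return new URL(url).pathname === '/collections';
}

// Inside the requestHandler this could guard the wait, e.g.:
//   if (shouldWaitForCategoryCards(request.url)) {
//       await page.waitForSelector('.collection-block-item');
//       await enqueueLinks({
//           selector: '.collection-block-item',
//           label: 'CATEGORY',
//       });
//   }
```

With a guard like this, the sub-pages would fall through the handler immediately instead of blocking for the full 30-second timeout.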

NOTE: The final code sample on that page, under "Crawling the detail pages", works fine. So running the "interim" code on its own may simply not be intended behaviour, but in case anyone else out there is curious about what happens step by step like me, I figured I'd report it. :)

Code sample

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        // Wait for the category cards to render,
        // otherwise enqueueLinks wouldn't enqueue anything.
        await page.waitForSelector('.collection-block-item');

        // Add links to the queue, but only from
        // elements matching the provided selector.
        await enqueueLinks({
            selector: '.collection-block-item',
            label: 'CATEGORY',
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

Package version

3.12.1

Node.js version

v22.11.0

Operating system

macOS 15.0.1 (24A348)

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@webchick webchick added the bug Something isn't working. label Jan 6, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 6, 2025
@B4nan B4nan self-assigned this Jan 6, 2025
B4nan pushed a commit that referenced this issue Jan 7, 2025
…xample (#2791)

I'm not sure if this is the correct fix, but it does fix the problem,
and it uses similar logic to the final script on this page, which is
working properly.

(It also works if you simply comment out the `await
page.waitForSelector('.collection-block-item');` line, but I assume
that's there for a reason.)

Old output:

```
INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":31,"retryHistogram":[1,null,null,31],"requestAvgFailedDurationMillis":30359,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":6,"requestTotalDurationMillis":942186,"requestsTotal":32,"crawlerRuntimeMillis":291681}
INFO  PlaywrightCrawler: Error analysis: {"totalErrors":31,"uniqueErrors":1,"mostCommonErrors":["31x: page.waitForSelector: Timeout 30000ms exceeded. (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)"]}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 1 succeeded, 31 failed. {"terminal":true}
```

New output:
```
INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":32,"requestsFailed":0,"retryHistogram":[32],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":302,"requestsFinishedPerMinute":340,"requestsFailedPerMinute":0,"requestTotalDurationMillis":9677,"requestsTotal":32,"crawlerRuntimeMillis":5644}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 32 succeeded, 0 failed. {"terminal":true}
```

Closes #2790