Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
I'm making my way through your excellent introduction docs and came across an issue on the Crawling the Store page.
Under the heading "Crawling the listing page", it shows the following code example:
```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        // Wait for the category cards to render,
        // otherwise enqueueLinks wouldn't enqueue anything.
        await page.waitForSelector('.collection-block-item');
        // Add links to the queue, but only from
        // elements matching the provided selector.
        await enqueueLinks({
            selector: '.collection-block-item',
            label: 'CATEGORY',
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
At every other step of the tutorial, I've been able to replace the contents of my main.js file with whatever's in that code block, run node main.js, and see the app working up to that point.
However, when I do so here, I get a bunch of stuff like this:
```
INFO PlaywrightCrawler: Starting the crawler.
Processing: https://warehouse-theme-metal.myshopify.com/collections
Processing: https://warehouse-theme-metal.myshopify.com/collections/a-v-receivers
Processing: https://warehouse-theme-metal.myshopify.com/collections/accessories
Processing: https://warehouse-theme-metal.myshopify.com/collections/all-tvs
# Waits for a super long time...
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
- waiting for locator('.collection-block-item') to be visible
at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20) {"id":"42KUpkekCOQ7JH8","url":"https://warehouse-theme-metal.myshopify.com/collections/a-v-receivers","retryCount":1}
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
- waiting for locator('.collection-block-item') to be visible
at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20) {"id":"mhaUzEY5wa6lbfF","url":"https://warehouse-theme-metal.myshopify.com/collections/accessories","retryCount":1}
```
...and so on, then it tries to re-queue the failed requests:
```
INFO PlaywrightCrawler:Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":1,"requestsFailedPerMinute":0,"requestTotalDurationMillis":1056,"requestsTotal":1,"crawlerRuntimeMillis":60137,"retryHistogram":[1]}
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
- waiting for locator('.collection-block-item') to be visible
at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20) {"id":"ilk5hjgD3Q1OZSS","url":"https://warehouse-theme-metal.myshopify.com/collections/amplifiers","retryCount":1}
```
...eventually the warnings turn into errors once the retry limit is reached:
```
ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.waitForSelector: Timeout 30000ms exceeded.
Call log:
- waiting for locator('.collection-block-item') to be visible
at PlaywrightCrawler.requestHandler (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)
at /Users/webchick/TechAround/fun-with-scraping/node_modules/@crawlee/browser/internals/browser-crawler.js:287:87
at wrap (/Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:54:27)
at /Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:68:7
at /Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:67:13
at addTimeoutToPromise (/Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:61:10)
at PlaywrightCrawler._runRequestHandler (/Users/webchick/TechAround/fun-with-scraping/node_modules/@crawlee/browser/internals/browser-crawler.js:287:53)
at async PlaywrightCrawler._runRequestHandler (/Users/webchick/TechAround/fun-with-scraping/node_modules/@crawlee/playwright/internals/playwright-crawler.js:114:9)
at async wrap (/Users/webchick/TechAround/fun-with-scraping/node_modules/@apify/timeout/cjs/index.cjs:54:21) {"id":"zcSG2HO1c3fSO3W","url":"https://warehouse-theme-metal.myshopify.com/collections/audio-accessories","method":"GET","uniqueKey":"https://warehouse-theme-metal.myshopify.com/collections/audio-accessories"}
```
...and in the end, 31/32 requests fail:
```
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":31,"retryHistogram":[1,null,null,31],"requestAvgFailedDurationMillis":30359,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":6,"requestTotalDurationMillis":942186,"requestsTotal":32,"crawlerRuntimeMillis":291681}
INFO PlaywrightCrawler: Error analysis: {"totalErrors":31,"uniqueErrors":1,"mostCommonErrors":["31x: page.waitForSelector: Timeout 30000ms exceeded. (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)"]}
INFO PlaywrightCrawler: Finished! Total 32 requests: 1 succeeded, 31 failed. {"terminal":true}
```
Since I'm just learning crawlee, I don't yet know enough to fix this script myself, but to try to debug it I ran it in "headful" mode, and all the category page URLs load properly in the browser.
What I suspect is happening: on each sub-page, the crawler sits waiting for a .collection-block-item element that never appears, because that class only exists on the "main" listing page at https://warehouse-theme-metal.myshopify.com/collections.
NOTE: The final code sample on that page, under "Crawling the detail pages", works fine. So it may simply not be intended for users to run this "interim" code on its own, but in case anyone else out there is curious about what happens step by step like me, I figured I'd report it. :)
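If that suspicion is right, one possible guard (my own untested sketch, not the documented fix; `shouldWaitForCategoryCards` is a name I made up) would be to skip the wait on requests that already carry the CATEGORY label, since only the unlabelled start URL points at the listing page:

```javascript
// Sketch of the guard only -- no crawlee import, the handler body is
// reduced to the part that matters for this bug. In the real handler,
// request.label is undefined for the start URL and 'CATEGORY' for the
// links enqueued with label: 'CATEGORY'.
function shouldWaitForCategoryCards(request) {
    // Only the unlabelled listing page contains .collection-block-item,
    // so don't wait for it on labelled (category) pages.
    return request.label === undefined;
}

async function requestHandler({ page, request, enqueueLinks }) {
    console.log(`Processing: ${request.url}`);
    if (shouldWaitForCategoryCards(request)) {
        await page.waitForSelector('.collection-block-item');
        await enqueueLinks({
            selector: '.collection-block-item',
            label: 'CATEGORY',
        });
    }
    // Category pages would be handled here instead of timing out.
}
```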
Code sample
```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        // Wait for the category cards to render,
        // otherwise enqueueLinks wouldn't enqueue anything.
        await page.waitForSelector('.collection-block-item');
        // Add links to the queue, but only from
        // elements matching the provided selector.
        await enqueueLinks({
            selector: '.collection-block-item',
            label: 'CATEGORY',
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
Package version
3.12.1
Node.js version
v22.11.0
Operating system
macOS 15.0.1 (24A348)
Apify platform
Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
…xample (#2791)
I'm not sure if this is the correct fix, but it does fix the problem,
and it uses similar logic to the final script on this page, which is
working properly.
(It also works if you simply comment out the `await
page.waitForSelector('.collection-block-item');` line, but I assume
that's there for a reason.)
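For anyone wondering what "similar logic to the final script" looks like: the final script in the docs branches on request.label rather than running one handler for every page. The snippet below is a self-contained mock of that dispatch shape (the `createMockRouter` factory is my own stand-in, not the crawlee implementation), just to show why labelled requests never touch the listing-page selector:

```javascript
// Minimal mock of label-based dispatch, mirroring the shape of a
// router with a default handler plus per-label handlers. The real
// crawler would pass { page, request, enqueueLinks } as the context.
function createMockRouter() {
    const handlers = new Map();
    let defaultHandler = null;
    const router = async (ctx) => {
        // An unlabelled request (the start URL) falls through to the
        // default handler; labelled requests get their own handler.
        const handler = handlers.get(ctx.request.label) ?? defaultHandler;
        if (!handler) throw new Error(`No handler for label: ${ctx.request.label}`);
        return handler(ctx);
    };
    router.addDefaultHandler = (fn) => { defaultHandler = fn; };
    router.addHandler = (label, fn) => { handlers.set(label, fn); };
    return router;
}

// Only the default handler would wait for .collection-block-item;
// the CATEGORY handler never looks for the listing-page selector.
const router = createMockRouter();
router.addDefaultHandler(async ({ request }) => `listing:${request.url}`);
router.addHandler('CATEGORY', async ({ request }) => `category:${request.url}`);
```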
Old output:
```
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":31,"retryHistogram":[1,null,null,31],"requestAvgFailedDurationMillis":30359,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":6,"requestTotalDurationMillis":942186,"requestsTotal":32,"crawlerRuntimeMillis":291681}
INFO PlaywrightCrawler: Error analysis: {"totalErrors":31,"uniqueErrors":1,"mostCommonErrors":["31x: page.waitForSelector: Timeout 30000ms exceeded. (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)"]}
INFO PlaywrightCrawler: Finished! Total 32 requests: 1 succeeded, 31 failed. {"terminal":true}
```
New output:
```
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":32,"requestsFailed":0,"retryHistogram":[32],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":302,"requestsFinishedPerMinute":340,"requestsFailedPerMinute":0,"requestTotalDurationMillis":9677,"requestsTotal":32,"crawlerRuntimeMillis":5644}
INFO PlaywrightCrawler: Finished! Total 32 requests: 32 succeeded, 0 failed. {"terminal":true}
```
Closes #2790