Replies: 2 comments
-
Hey. I may be missing something, but I think you are talking about using your own parser inside the request handler. Example:

```python
import asyncio

from parsel import Selector

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Parse the raw HTTP response with Parsel instead of a built-in parser.
        selector = Selector(context.http_response.read().decode())

        requests = []
        for link in selector.xpath('//a/@href').getall():
            # Keep only absolute links.
            if not link.startswith('http'):
                continue
            requests.append(link)

        await context.add_requests(requests)

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```

You can do the same thing inside `BeautifulSoupCrawler`, if for some reason you want to use it but still need Parsel:

```python
import asyncio

from parsel import Selector

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # The built-in BeautifulSoup tree is still available...
        title = context.soup.select_one('title').text.strip()
        context.log.info(f'Title {title}')

        # ...but you can also parse the raw response with Parsel.
        selector = Selector(context.http_response.read().decode())

        requests = []
        for link in selector.xpath('//a/@href').getall():
            if not link.startswith('http'):
                continue
            requests.append(link)

        await context.add_requests(requests)

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```
-
Hello, and thanks for writing your thoughts out in such detail!
First, we are not trying to completely erase Scrapy from existence. If you're comfortable using it, we're happy for you 🙂 If not, we offer an alternative.
This is not completely true: the end result is not always just "a fetched HTML page". In browser-based crawling (Playwright), the result is a loaded page in a browser instance. There are many things you can do with that which would be difficult with just a "snapshot" of the page. I'm not going into much detail here, but I can, if you're interested.
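To sketch what I mean (a hypothetical example, not from the original reply; the `button.load-more` selector is made up), with `PlaywrightCrawler` the handler gets a live Playwright page that you can keep interacting with:

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # context.page is a live Playwright page, not a static HTML snapshot:
        # you can wait for scripts, click elements, and watch the DOM change.
        await context.page.wait_for_load_state('networkidle')

        # Hypothetical interaction: reveal content that only exists after a click.
        load_more = context.page.locator('button.load-more')
        if await load_more.count() > 0:
            await load_more.first.click()

        context.log.info(f'Title after interaction: {await context.page.title()}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```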
If you're not interested in browser-based scraping, you are more than welcome to just use one of the plain HTTP-based crawlers.
The Python version of Crawlee is very new, but it is based on a JavaScript version that captures eight years of web-scraping experience we gathered at Apify. We made different choices than the authors of Scrapy, and we mostly stand by them. Not prescribing any particular code structure is one of these choices.
I disagree. As I mentioned above, the line between crawling and parsing may not always be clean. Moreover, some users appreciate being able to write all their logic in a single short function. They are of course free to structure their code according to their preference - Crawlee code is just Python.
I believe that Crawlee does not force you to adopt any particular parser, period. You are also free to use any HTTP client library that you like, as long as you implement the HTTP client interface that the crawlers expect.
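For example (a minimal sketch; the exact class names and import paths may differ between Crawlee versions), the crawler constructors take an `http_client` argument, so swapping the underlying HTTP library is a one-line change:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient  # assumed import path


async def main() -> None:
    # Pass whichever HTTP client implementation you prefer; any class that
    # implements Crawlee's HTTP client interface can be plugged in here.
    crawler = HttpCrawler(
        http_client=HttpxHttpClient(),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```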
-
I understand that you're building a modern replacement for `Scrapy`, which is amazing! However, I'd like to share some concerns about the current design.

In web scraping, we typically deal with two main parts: crawling and parsing.

- **Crawling**: This involves fetching the HTML page, which can be done using various HTTP clients like `HTTPX`, `Requests`, `Curl-Cffi`, etc., or through JavaScript rendering tools like `Playwright`. Regardless of the method, the end result is the same: a fetched HTML page.
- **Parsing**: Once we have the HTML, the next step is to parse it. There are numerous tools for this, such as `BeautifulSoup`, `Parsel`, `lxml`, and more.
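To make the distinction concrete, here is a minimal sketch (the URL is just an example): the parsing step is identical regardless of which client fetched the page.

```python
import httpx
import requests
from bs4 import BeautifulSoup
from parsel import Selector

url = 'https://crawlee.dev'

# Crawling: any HTTP client produces the same thing, an HTML string.
html_a = httpx.get(url).text
html_b = requests.get(url).text

# Parsing: any parser can consume that string, independently of the fetch.
print(Selector(html_a).xpath('//title/text()').get())
print(BeautifulSoup(html_b, 'html.parser').title.get_text())
```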
The issue with `Crawlee`'s current design is that it tightly couples crawling with parsing, which can lead to confusion and limitations. For instance:

- `BeautifulSoupCrawler`: You can only parse with BeautifulSoup when using this crawler, even though `BeautifulSoup` is just a parser and does not handle crawling.
- `ParselCrawler`: Similarly, you're restricted to parsing with `Parsel`, despite it being purely a parsing library, not a crawler.

This tightly integrated approach feels restrictive and sometimes confusing.
`Scrapy`, despite being 16 years old, has matured significantly over time. While I'm not suggesting replicating it entirely, there are some good practices we can draw inspiration from.

If `Crawlee` aims to replace `Scrapy`, it should decouple crawling and parsing, allowing them to operate independently. For example:

- `Scrapy` provides a unified interface for parsing HTML with XPath or CSS selectors. This interface covers 99% of use cases, but users are free to use other parsing tools, like BeautifulSoup, if they prefer. Crawlee could adopt a similar philosophy (see the sketch after this list).
- `Crawlee` doesn't need to enforce a tightly integrated "crawler+parser" design. Instead, it could focus on providing flexibility. For crawling, users could integrate different HTTP clients or `Playwright` via Downloader Handlers. This would allow developers to plug in their preferred crawling tools seamlessly.
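As a minimal sketch of that unified interface (the spider name and URL are placeholders), a Scrapy callback can mix CSS, XPath, and, if preferred, BeautifulSoup on the same response:

```python
from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://crawlee.dev']

    def parse(self, response):
        # Unified selector interface: CSS and XPath on the same response.
        title = response.css('title::text').get()
        links = response.xpath('//a/@href').getall()

        # Or opt out entirely and use another parser on the raw HTML.
        soup = BeautifulSoup(response.text, 'html.parser')
        first_heading = soup.h1.get_text() if soup.h1 else None

        yield {'title': title, 'links': links, 'h1': first_heading}
```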
By decoupling crawling and parsing, `Crawlee` would be more versatile and user-friendly, catering to a broader range of use cases without being overly opinionated.

Looking forward to hearing your thoughts on this.