Replies: 2 comments
-
Hey. I may be missing something, but I think you are talking about using your own parser inside the request handler. Example:

```python
import asyncio

from parsel import Selector

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Parse the raw HTTP response with Parsel instead of a built-in parser.
        selector = Selector(context.http_response.read().decode())

        requests = []
        for link in selector.xpath('//a/@href').getall():
            # Keep only absolute links.
            if not link.startswith('http'):
                continue
            requests.append(link)

        await context.add_requests(requests)

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```

You can do the same thing inside `BeautifulSoupCrawler`, if for some reason you want to use it but still need Parsel:

```python
import asyncio

from parsel import Selector

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # The built-in BeautifulSoup tree is still available...
        title = context.soup.select_one('title').text.strip()
        context.log.info(f'Title {title}')

        # ...but you can also parse the raw response with Parsel.
        selector = Selector(context.http_response.read().decode())

        requests = []
        for link in selector.xpath('//a/@href').getall():
            if not link.startswith('http'):
                continue
            requests.append(link)

        await context.add_requests(requests)

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```
-
Hello, and thanks for writing your thoughts out in such detail!
First, we are not trying to completely erase Scrapy from existence. If you're comfortable using it, we're happy for you 🙂 If not, we offer an alternative.
This is not completely true: the end result is not always just "a fetched HTML page". In browser-based crawling (Playwright), the result is a loaded page in a browser instance. There are many things you can do with that which would be difficult with just a "snapshot" of the page. I'm not going into much detail here, but I can, if you're interested.
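To sketch what I mean (a hypothetical example, not from the original reply; the `button.load-more` selector is made up), with `PlaywrightCrawler` the handler gets a live Playwright page that you can keep interacting with:

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # context.page is a live Playwright page, not a static HTML snapshot:
        # you can wait for scripts, click elements, and watch the DOM change.
        await context.page.wait_for_load_state('networkidle')

        # Hypothetical interaction: reveal content that only exists after a click.
        load_more = context.page.locator('button.load-more')
        if await load_more.count() > 0:
            await load_more.first.click()

        context.log.info(f'Title after interaction: {await context.page.title()}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```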
If you're not interested in browser-based scraping, you are more than welcome to just use one of the plain HTTP-based crawlers.
The Python version of Crawlee is very new, but it is based on a JavaScript version that captures eight years of web-scraping experience we gathered at Apify. We made different choices than the authors of Scrapy, and we mostly stand by them. Not prescribing any particular code structure is one of these choices.
I disagree. As I mentioned above, the line between crawling and parsing may not always be clean. Moreover, some users appreciate being able to write all their logic in a single short function. They are of course free to structure their code according to their preference - Crawlee code is just Python.
I believe that Crawlee does not force you to adopt any particular parser, period. You are also free to use any HTTP client library that you like, as long as you implement the HTTP client interface that the crawlers expect.
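For example (a minimal sketch; the exact class names and import paths may differ between Crawlee versions), the crawler constructors take an `http_client` argument, so swapping the underlying HTTP library is a one-line change:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient  # assumed import path


async def main() -> None:
    # Pass whichever HTTP client implementation you prefer; any class that
    # implements Crawlee's HTTP client interface can be plugged in here.
    crawler = HttpCrawler(
        http_client=HttpxHttpClient(),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```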
-
I understand that you're building a modern replacement for `Scrapy`, which is amazing! However, I'd like to share some concerns about the current design.

In web scraping, we typically deal with two main parts: crawling and parsing.

- **Crawling**: This involves fetching the HTML page, which can be done using various HTTP clients like `HTTPX`, `Requests`, `Curl-Cffi`, etc., or through JavaScript rendering tools like `Playwright`. Regardless of the method, the end result is the same: a fetched HTML page.
- **Parsing**: Once we have the HTML, the next step is to parse it. There are numerous tools for this, such as `BeautifulSoup`, `Parsel`, `lxml`, and more.
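To make the distinction concrete, here is a minimal sketch (the URL is just an example): the parsing step is identical regardless of which client fetched the page.

```python
import httpx
import requests
from bs4 import BeautifulSoup
from parsel import Selector

url = 'https://crawlee.dev'

# Crawling: any HTTP client produces the same thing, an HTML string.
html_a = httpx.get(url).text
html_b = requests.get(url).text

# Parsing: any parser can consume that string, independently of the fetch.
print(Selector(html_a).xpath('//title/text()').get())
print(BeautifulSoup(html_b, 'html.parser').title.get_text())
```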
The issue with `Crawlee`'s current design is that it tightly couples crawling with parsing, which can lead to confusion and limitations. For instance:

- `BeautifulSoupCrawler`: You can only parse with BeautifulSoup when using this crawler, even though `BeautifulSoup` is just a parser and does not handle crawling.
- `ParselCrawler`: Similarly, you're restricted to parsing with `Parsel`, despite it being purely a parsing library, not a crawler.

This tightly integrated approach feels restrictive and sometimes confusing.
`Scrapy`, despite being 16 years old, has matured significantly over time. While I'm not suggesting replicating it entirely, there are some good practices we can draw inspiration from.

If `Crawlee` aims to replace `Scrapy`, it should decouple crawling and parsing, allowing them to operate independently. For example:

- `Scrapy` provides a unified interface for parsing HTML with XPath or CSS selectors. This interface covers 99% of use cases, but users are free to use other parsing tools, like BeautifulSoup, if they prefer. Crawlee could adopt a similar philosophy (see the sketch after this list).
- `Crawlee` doesn't need to enforce a tightly integrated "crawler+parser" design. Instead, it could focus on providing flexibility. For crawling, users could integrate different HTTP clients or `Playwright` via Downloader Handlers. This would allow developers to plug in their preferred crawling tools seamlessly.
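As a minimal sketch of that unified interface (the spider name and URL are placeholders), a Scrapy callback can mix CSS, XPath, and, if preferred, BeautifulSoup on the same response:

```python
from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://crawlee.dev']

    def parse(self, response):
        # Unified selector interface: CSS and XPath on the same response.
        title = response.css('title::text').get()
        links = response.xpath('//a/@href').getall()

        # Or opt out entirely and use another parser on the raw HTML.
        soup = BeautifulSoup(response.text, 'html.parser')
        first_heading = soup.h1.get_text() if soup.h1 else None

        yield {'title': title, 'links': links, 'h1': first_heading}
```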
By decoupling crawling and parsing, `Crawlee` would be more versatile and user-friendly, catering to a broader range of use cases without being overly opinionated.

Looking forward to hearing your thoughts on this.