There's a neat little package, autoscraper, that lets you quickly build no-code web extractors.
1. Take a page with known content.
2. Say what text from it you need and what alias to bind it to. For example:

   ```json
   {
     "name": " Apple Mac Mini (256GB SSD, M1, 8GB)",
     "current_bid": "US $130.50",
     "end_of_bid": "Saturday, 11:32 PM"
   }
   ```

3. Fit the model to your known page and known data.
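A minimal sketch of those three steps in Python, assuming autoscraper's `build()` accepts a `wanted_dict` of aliases (as its README suggests); the URL is a made-up placeholder:

```python
from autoscraper import AutoScraper

# Known page with known content (placeholder URL).
url = "https://www.ebay.com/itm/1234567890"

# Bind the text we expect on that page to aliases.
wanted_dict = {
    "name": [" Apple Mac Mini (256GB SSD, M1, 8GB)"],
    "current_bid": ["US $130.50"],
    "end_of_bid": ["Saturday, 11:32 PM"],
}

scraper = AutoScraper()
# Fit: autoscraper hunts for selectors that reproduce
# the wanted text on the known page.
scraper.build(url=url, wanted_dict=wanted_dict)
```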
It then tries to find which DOM selectors yield the desired data with the best accuracy and saves them into a model object you can pickle. You probably should pickle it, since the known page may die long before the DOM changes, so it's best to keep model creation somewhere in a notebook.
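For instance, continuing the sketch above (plain `pickle` works on the scraper object; autoscraper also ships its own `save()`/`load()` helpers, if I recall its API correctly):

```python
import pickle

# Persist the fitted model; the known page may be gone
# next time, but the learned selectors survive.
with open("mac_mini.model", "wb") as f:
    pickle.dump(scraper, f)

# Later, e.g. outside the notebook:
with open("mac_mini.model", "rb") as f:
    scraper = pickle.load(f)
```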
Now you can just predict that data from new URLs/DOMs.
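Roughly like this, assuming `get_result_similar()` with `group_by_alias` behaves as documented (the new URL is again a placeholder):

```python
# A fresh listing the model has never seen.
new_url = "https://www.ebay.com/itm/0987654321"

# Re-apply the learned selectors, grouping results back
# under the aliases from wanted_dict.
result = scraper.get_result_similar(new_url, group_by_alias=True)
print(result)  # e.g. {"name": [...], "current_bid": [...], ...}
```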
I actually wonder whether the idea can be extended to also use data from the heap to get the text out, especially given that's a lot messier than hunting for a selector.
It could be prototyped as another CLI on top of the heap, HTML, and image exporting here.