FEATURE: Select which links should be used in web scraper nodes (cheerio, puppeteer, and playwright) #1566
…and playwright nodes
@HenryHengZJ Fixed the bug. The URL field in the node is now updated when it is changed in the Manage Links dialog.
awesome work @0xi4o !
@0xi4o one side topic - right now whenever users restart the apps, and ask a question, it runs through the webscraping stuff again because of the |
@HenryHengZJ Sure! I'll update it 👍
@HenryHengZJ There's already an |
it wont upsert, but it will still go through the |
@HenryHengZJ Added a condition to skip initializing web scraper nodes during prediction.
Could you add an option to exclude links ending with .jpg, .jpeg, .png, .pdf, or anything similar? Cheerio will scrape JPEGs and upsert the JPEG headers and everything after them into the vector database. I haven't found an option to prevent this. Apify's Cheerio web scraper uses: [ |
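The kind of filter requested above could be sketched roughly as follows. This is only an illustration, not code from this PR or from Apify: the names `EXCLUDED_EXTENSIONS`, `pathOf`, and `filterScrapableLinks` are hypothetical, and the extension list is an assumed default.

```typescript
// Hypothetical default list of extensions to skip; not taken from Flowise or Apify.
const EXCLUDED_EXTENSIONS: string[] = [".jpg", ".jpeg", ".png", ".gif", ".pdf"];

// Return the path part of a link, without query string or fragment.
function pathOf(link: string): string {
  try {
    // Absolute URLs: let the URL parser strip the query and fragment.
    return new URL(link).pathname;
  } catch {
    // Relative links cannot be parsed on their own, so strip them manually.
    return link.split(/[?#]/)[0];
  }
}

// Keep only links whose path does not end with an excluded extension.
function filterScrapableLinks(
  links: string[],
  excluded: string[] = EXCLUDED_EXTENSIONS
): string[] {
  return links.filter((link) => {
    const path = pathOf(link).toLowerCase();
    return !excluded.some((ext) => path.endsWith(ext));
  });
}
```

A filter like this would run on the fetched links before they are scraped, so binary files never reach the vector store. Matching on the parsed path rather than the raw string means a link such as `logo.PNG?v=2` is still caught.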
Adds a `Manage Links` button to Web Scraper nodes like Cheerio, Puppeteer, and Playwright. Until now, users couldn't choose which links to scrape. The node just picked the first `n` links (configurable in `Additional Parameters`) and scraped their content. Now users can pick which links they want to scrape.

- The `Manage Links` button opens a dialog where you can fetch all the links from the given URL and remove the ones you don't need. During upsert, only the selected links are scraped. Selecting links is optional. The nodes fall back to the old behavior when no links are selected.
- If `Relative Links Method` is set in `Additional Parameters` before selecting links, the Manage Links dialog uses the selected method to fetch the links.
- If `Relative Links Limit` is set and is smaller than the number of selected links, the limit applies to the selected links and the first `n` of them are chosen. If it's larger, all the selected links are scraped.
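The selection and limit rules described here can be sketched as a small function. This is a minimal illustration under stated assumptions, not the code from this PR: `resolveLinksToScrape` and its parameters are hypothetical names, and treating a missing or non-positive limit as "no limit" is an assumption.

```typescript
// Decide which links a web scraper node should scrape during upsert.
// selectedLinks: links kept in the Manage Links dialog (may be empty).
// crawledLinks:  links the node would discover on its own (the old behavior).
// limit:         the Relative Links Limit; assumed here to mean "no limit" when unset or <= 0.
function resolveLinksToScrape(
  selectedLinks: string[],
  crawledLinks: string[],
  limit?: number
): string[] {
  // Selecting links is optional: with no selection, fall back to the crawled links.
  const candidates = selectedLinks.length > 0 ? selectedLinks : crawledLinks;

  // Without a positive limit, scrape every candidate.
  if (limit === undefined || limit <= 0) {
    return [...candidates];
  }

  // A limit smaller than the list keeps the first `limit` links;
  // a larger limit leaves the list unchanged, because slice clamps to the array length.
  return candidates.slice(0, limit);
}
```

For example, with three selected links and a limit of 2, only the first two selected links are scraped. With a limit of 5, all three are.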