FEATURE: Select which links should be used in web scraper nodes (cheerio, puppeteer, and playwright) #1566

Merged: 0xi4o merged 14 commits into FlowiseAI:main from feature/scrapped-links on Jan 25, 2024

Conversation

0xi4o
Contributor

@0xi4o 0xi4o commented Jan 19, 2024

  • Adds a Manage Links button to web scraper nodes (Cheerio, Puppeteer, and Playwright). Until now, users couldn't choose which links to scrape: the node just picked the first n links (configurable in Additional Parameters) and scraped their content. Now users can pick which links they want to scrape.
  • Clicking the Manage Links button opens a dialog where you can fetch all the links from the given URL and remove the ones you don't need. During upsert, only the selected links are scraped. Selecting links is optional: the nodes fall back to the old behavior when no links are selected.
  • If Relative Links Method is set in Additional Parameters before selecting links, the manage links dialog will use the selected method to fetch the links.
  • If Relative Links Limit is set and is smaller than the number of selected links, the limit will apply to the selected links and the first n links are chosen. If it's larger, then all the selected links are scraped.
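
The selection and limit rules in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `resolveLinksToScrape` and `crawlFallback` are hypothetical names, and treating a missing or non-positive limit as "no limit" is an assumption.

```typescript
// Hypothetical helper: the name, signature, and limit semantics are illustrative only.
function resolveLinksToScrape(
    selectedLinks: string[],
    limit: number | undefined,
    crawlFallback: () => string[]
): string[] {
    // No links selected: fall back to the old behavior (crawl and take the first n links).
    if (selectedLinks.length === 0) {
        return crawlFallback()
    }
    // Assumption: a missing or non-positive limit means "no limit".
    // A limit at or above the number of selected links keeps all of them.
    if (limit === undefined || limit <= 0 || limit >= selectedLinks.length) {
        return selectedLinks.slice()
    }
    // Limit smaller than the selection: keep only the first n selected links.
    return selectedLinks.slice(0, limit)
}
```

With three selected links and a limit of 2, the sketch keeps the first two; with a limit of 10 it keeps all three; with an empty selection it defers entirely to the crawl callback.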

Flowise-webscraper-manage-links

Flowise-webscraper-manage-links-dialog

@0xi4o 0xi4o added the enhancement New feature or request label Jan 19, 2024
@0xi4o 0xi4o self-assigned this Jan 19, 2024
@0xi4o 0xi4o marked this pull request as draft January 19, 2024 07:46
@0xi4o 0xi4o marked this pull request as ready for review January 22, 2024 03:20
@HenryHengZJ
Contributor

Bug:

1.) Enter https://flowiseai.com into the Cheerio node:
image

2.) Open Manage Links and change the URL to https://docs.flowiseai.com:
image

3.) Fetch links

4.) Click save

5.) On the canvas, the Cheerio node URL is still https://flowiseai.com, and the Manage Links dialog shows the same old URL when you open it again:
image

@0xi4o
Contributor Author

0xi4o commented Jan 23, 2024

@HenryHengZJ Fixed the bug. The URL field in the node will be updated if we change it in the manage links dialog.

Contributor

@HenryHengZJ HenryHengZJ left a comment

awesome work @0xi4o !

@HenryHengZJ
Contributor

HenryHengZJ commented Jan 23, 2024

@0xi4o One side topic: right now, whenever users restart the app and ask a question, it runs through the web scraping again because of the init() function. Can we skip this if the request is not an upsert request? Maybe a flag passed in the options, because we only need to run the scraping when the user hits the vector/upsert API.

@0xi4o
Contributor Author

0xi4o commented Jan 24, 2024

@HenryHengZJ Sure! I'll update it 👍

@0xi4o
Contributor Author

0xi4o commented Jan 24, 2024

@HenryHengZJ There's already an isUpsert flag in the buildLangchain() in server/src/index.ts. It's set to true only during upsert operations. When calling the prediction endpoints, it's not set so it won't do upsert. This means the scraping will also only happen during upsert and not during prediction.

@HenryHengZJ
Contributor

> @HenryHengZJ There's already an isUpsert flag in the buildLangchain() in server/src/index.ts. It's set to true only during upsert operations. When calling the prediction endpoints, it's not set so it won't do upsert. This means the scraping will also only happen during upsert and not during prediction.

It won't upsert, but it will still go through the init of the Cheerio node, scraping all the links, which might take a significant portion of the time.

@0xi4o
Contributor Author

0xi4o commented Jan 25, 2024

@HenryHengZJ Added a condition to skip initializing web scraper nodes during prediction.
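
One way such a condition could look is sketched below. This is not the code from the merged commit: `FlowNode`, `WEB_SCRAPER_NODES`, and `initNodes` are hypothetical, the internal node names are assumed, and `init` is synchronous here only to keep the example short.

```typescript
// Hypothetical shape of a node in the flow; Flowise's real types differ.
interface FlowNode {
    name: string
    init: () => string
}

// Assumed internal names of the web scraper nodes.
const WEB_SCRAPER_NODES = ['cheerioWebScraper', 'puppeteerWebScraper', 'playwrightWebScraper']

// Initializes every node, except that web scraper nodes are skipped
// when the request is a prediction (isUpsert is false).
function initNodes(nodes: FlowNode[], isUpsert: boolean): string[] {
    const initialized: string[] = []
    for (const node of nodes) {
        const isScraper = WEB_SCRAPER_NODES.indexOf(node.name) !== -1
        if (isScraper && !isUpsert) {
            continue // prediction request: do not re-scrape
        }
        initialized.push(node.init())
    }
    return initialized
}
```

The point of the check is that scraping cost is paid only on the vector/upsert path, while prediction requests still initialize every other node.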

@0xi4o 0xi4o merged commit 09d2b96 into FlowiseAI:main Jan 25, 2024
2 checks passed
@0xi4o 0xi4o deleted the feature/scrapped-links branch January 25, 2024 22:38
@luc4t

luc4t commented May 1, 2024

Could you add an option to exclude links ending with .jpg, .jpeg, .png, .pdf, or anything similar?

Cheerio will scrape JPEGs and upsert the JPEG headers and everything after them into the vector database. I haven't found an option to prevent this.

Apify's Cheerio web scraper uses:

[
    {
        "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
]
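
Until such an option exists, a filter along these lines could be applied to the link list before scraping. This is only a sketch of the request above: `excludeByExtension` and `DEFAULT_EXCLUDED` are hypothetical names, not part of Flowise or Apify, and it uses a plain suffix check rather than real glob matching.

```typescript
// Hypothetical default list, mirroring the extensions in the glob above.
const DEFAULT_EXCLUDED = ['.png', '.jpg', '.jpeg', '.pdf']

// Returns the links whose path does not end with one of the excluded extensions.
function excludeByExtension(links: string[], excluded: string[] = DEFAULT_EXCLUDED): string[] {
    return links.filter((link) => {
        // Drop the query string and fragment, then compare case-insensitively.
        const path = link.replace(/[?#].*$/, '').toLowerCase()
        return !excluded.some((ext) => path.slice(-ext.length) === ext)
    })
}
```

The check runs against the URL with its query string and fragment removed, so a link such as `report.pdf?download=1` is still excluded.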


4 participants