FEATURE: Select which links should be used in web scraper nodes (cheerio, puppeteer, and playwright) #1566

0xi4o · 2024-01-19T07:46:40Z

Adds a Manage Links button to the web scraper nodes (Cheerio, Puppeteer, and Playwright). Until now, users couldn't choose which links to scrape: the node picked the first n links (n is configurable in Additional Parameters) and scraped their content. With this change, users can pick the links they want scraped.
Clicking the Manage Links button opens a dialog where you can fetch all the links from the given URL and remove the ones you don't need. During upsert, only the selected links are scraped. Selecting links is optional; the nodes fall back to the old behavior when no links are selected.
If Relative Links Method is set in Additional Parameters before selecting links, the Manage Links dialog uses that method to fetch the links.
If Relative Links Limit is set and is smaller than the number of selected links, the limit applies to the selected links and only the first n are scraped. If it's larger, all the selected links are scraped.
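
To make the selection and limit rules above concrete, here is a minimal sketch. The function name, its parameters, and the treatment of a zero or negative limit as "no limit" are assumptions for illustration, not code taken from this PR.

```typescript
// Hypothetical helper: name and signature are illustrative, not from the Flowise source.
function pickLinksToScrape(selectedLinks: string[], crawledLinks: string[], limit: number): string[] {
    // Prefer the user's selection; with no selection, fall back to the crawled links (old behavior).
    const candidates = selectedLinks.length > 0 ? selectedLinks : crawledLinks
    // A positive limit keeps only the first n links; a limit larger than the list changes nothing.
    // Assumption: a limit of 0 or less means "no limit".
    return limit > 0 ? candidates.slice(0, limit) : candidates.slice()
}

const chosenLinks = pickLinksToScrape(
    ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    ['https://example.com/', 'https://example.com/x'],
    2
)
console.log(chosenLinks) // [ 'https://example.com/a', 'https://example.com/b' ]
```

Keeping this as a pure function would let all three scraper nodes share one rule instead of repeating the limit check in each loader.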

HenryHengZJ · 2024-01-22T15:12:35Z

Bug:

1.) Enter https://flowiseai.com into the Cheerio node:

2.) Open Manage Links and change the URL to https://docs.flowiseai.com:

3.) Fetch links

4.) Click save

5.) On the canvas, the Cheerio node URL is still https://flowiseai.com, and the same old URL is shown when you click Manage Links again:

0xi4o · 2024-01-23T11:30:19Z

@HenryHengZJ Fixed the bug. The URL field in the node will be updated if we change it in the manage links dialog.
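
A rough sketch of the behavior described in this fix: when the dialog is saved, both the edited URL and the chosen links are written back to the node. The interface, field names, and function name below are hypothetical and only illustrate the idea; the actual UI code in this PR may be structured differently.

```typescript
// Hypothetical shape of a scraper node's inputs; the real Flowise field names may differ.
interface ScraperNodeInputs {
    url: string
    selectedLinks?: string[]
}

// Return a new inputs object carrying over both the URL and the links chosen in the dialog.
function applyManageLinksResult(inputs: ScraperNodeInputs, dialogUrl: string, dialogLinks: string[]): ScraperNodeInputs {
    return { ...inputs, url: dialogUrl.trim(), selectedLinks: dialogLinks.slice() }
}

const savedInputs = applyManageLinksResult(
    { url: 'https://flowiseai.com' },
    'https://docs.flowiseai.com',
    ['https://docs.flowiseai.com/getting-started'] // example link, made up for this sketch
)
console.log(savedInputs.url) // https://docs.flowiseai.com
```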

HenryHengZJ

Awesome work, @0xi4o!

HenryHengZJ · 2024-01-23T17:42:29Z

@0xi4o One side topic: right now, whenever users restart the app and ask a question, it runs through the web scraping again because of the init() function. Can we skip this if the request is not an upsert request? Maybe with a flag passed in the options, because we only need to run the scraping when the user hits the vector/upsert API.

0xi4o · 2024-01-24T06:31:44Z

@HenryHengZJ Sure! I'll update it 👍

0xi4o · 2024-01-24T07:19:13Z

@HenryHengZJ There's already an isUpsert flag in the buildLangchain() in server/src/index.ts. It's set to true only during upsert operations. When calling the prediction endpoints, it's not set so it won't do upsert. This means the scraping will also only happen during upsert and not during prediction.

HenryHengZJ · 2024-01-24T14:16:27Z

> @HenryHengZJ There's already an isUpsert flag in the buildLangchain() in server/src/index.ts. It's set to true only during upsert operations. When calling the prediction endpoints, it's not set so it won't do upsert. This means the scraping will also only happen during upsert and not during prediction.

It won't upsert, but it will still go through the init of the Cheerio node and scrape all the links, which might take a significant portion of the time.

0xi4o · 2024-01-25T05:53:54Z

@HenryHengZJ Added a condition to skip initializing web scraper nodes during prediction.

packages/server/src/utils/index.ts
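
The condition discussed in this thread could look roughly like the sketch below. It illustrates the idea only and is not the code that was committed; the node names and the function name are assumptions.

```typescript
// Assumed node names for illustration; check the actual node definitions before relying on them.
const WEB_SCRAPER_NODE_NAMES: string[] = ['cheerioWebScraper', 'puppeteerWebScraper', 'playwrightWebScraper']

// Decide whether a node's init() should run for the current request.
function shouldInitNode(nodeName: string, isUpsert: boolean): boolean {
    const isWebScraper = WEB_SCRAPER_NODE_NAMES.some((scraper) => scraper === nodeName)
    // Scraper nodes only need to run while documents are being upserted.
    return isUpsert || !isWebScraper
}

console.log(shouldInitNode('cheerioWebScraper', false)) // false
console.log(shouldInitNode('cheerioWebScraper', true)) // true
console.log(shouldInitNode('chatOpenAI', false)) // true
```

A check like this leaves every other node untouched during prediction, so only the scraping step is skipped.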

luc4t · 2024-05-01T21:19:19Z

Could you add an option to exclude links ending with .jpg, .jpeg, .png, .pdf, or anything similar?

Cheerio will scrape JPEGs and upsert the JPEG headers and everything after them into the vector database. I haven't found an option to prevent this.

Apify's Cheerio web scraper uses:

[
    {
        "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
]
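
As a rough illustration of the filtering requested here, the sketch below drops links whose path ends in one of the listed extensions before they are scraped. It is not an existing Flowise option; the function names and the fixed extension list are assumptions, and it uses a regular expression rather than glob matching.

```typescript
// Extensions to skip, mirroring the glob above (png, jpg, jpeg, pdf); matched case-insensitively.
const EXCLUDED_EXTENSIONS = /\.(png|jpe?g|pdf)$/i

// Hypothetical helper: true when the link does not point at an excluded file type.
function isScrapableLink(link: string): boolean {
    let path = link
    try {
        // Test only the path so query strings and fragments are ignored.
        path = new URL(link).pathname
    } catch {
        // Not an absolute URL: fall back to matching the raw string.
    }
    return !EXCLUDED_EXTENSIONS.test(path)
}

function filterScrapableLinks(links: string[]): string[] {
    return links.filter(isScrapableLink)
}

console.log(
    filterScrapableLinks(['https://example.com/docs', 'https://example.com/logo.JPG?v=2', 'https://example.com/manual.pdf'])
) // [ 'https://example.com/docs' ]
```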

0xi4o added 4 commits January 19, 2024 12:29

Add api endpoint for fetching links from a url

e7edbc6

Show a manage links button for web scraper nodes - cheerio, puppeteer, playwright

1b8813a

Show a dialog to fetch and manage links in web scraper nodes

9637c12

Add interface for fetching links from server

43fa116

0xi4o added the enhancement New feature or request label Jan 19, 2024

0xi4o self-assigned this Jan 19, 2024

0xi4o marked this pull request as draft January 19, 2024 07:46

Use selected links if available when scraping in cheerio, puppeteer, and playwright nodes

bfa26a7

vinodkiran requested changes Jan 20, 2024

0xi4o added 5 commits January 22, 2024 08:19

Update where loader is rendered in manage links dialog

76cb879

Update manage links button variant

62ec17d

Fix multiple calls to parseInt

bf60a1a

Set default value for get links limit in web scraper nodes - cheerio, playwright, and puppeteer

c24708f

Update console statements to use logger

193e5c4

0xi4o marked this pull request as ready for review January 22, 2024 03:20

0xi4o requested review from HenryHengZJ and vinodkiran January 22, 2024 03:24

Update input url if user changed the url in manage links dialog

6395b12

HenryHengZJ approved these changes Jan 23, 2024

Add condition to skip initializing web scraper nodes during prediction

3abfa13

HenryHengZJ reviewed Jan 25, 2024

packages/server/src/utils/index.ts (outdated, resolved)

HenryHengZJ approved these changes Jan 25, 2024

Revert adding condition to skip initialization of web scraper nodes

98acb35

Merge branch 'main' of github.com:0xi4o/Flowise into feature/scrapped-links

94d8e00

0xi4o merged commit 09d2b96 into FlowiseAI:main Jan 25, 2024
2 checks passed

0xi4o deleted the feature/scrapped-links branch January 25, 2024 22:38

luc4t mentioned this pull request May 1, 2024

Add url filtering to Cheerio scraper. Also fix multiple issues of link limit enforcement. #1417

Open
