Early browser API accesses and function calls are missed #77

Open
asumansenol opened this issue Sep 26, 2022 · 4 comments

Comments

@asumansenol

Hi!
While running some pilot crawls for our current study, we found that the TRC doesn’t collect function calls or property accesses when the call or access occurs immediately after page load. Perhaps APICallCollector doesn’t have enough time to register its breakpoints. To test this, we created two test pages that:

  1. Access window.devicePixelRatio
  2. Call toDataURL method of an HTML5 canvas element

We’ve visited the test pages using the latest version of TRC without any modification.

  1. Test page 1: The script runs 1000 ms after the page load.
  • Command: npm run crawl -- -u "https://homes.esat.kuleuven.be/~asenol/fp-test-with-timeout/" -o ./data/ -v -f -d 'apis'
  • In this case, the TRC correctly intercepts the API call and the property access.
  2. Test page 2: The script runs immediately after the page load.
  • Command: npm run crawl -- -u "https://homes.esat.kuleuven.be/~asenol/fp-test-without-timeout/" -o ./data/ -v -f -d 'apis'
  • In this case, the TRC fails to intercept the API call and the property access.
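
For readers who want to reproduce this locally, here is a minimal sketch of what such a probe script might look like. The source of the actual test pages is not shown in this thread, so `runProbe`, `scheduleProbe`, and the `delayMs` parameter are hypothetical names used only for illustration.

```javascript
// Hypothetical probe: performs the two operations described above.
// `win` and `doc` are passed in so the same code can be exercised outside a browser.
function runProbe(win, doc) {
  const ratio = win.devicePixelRatio; // property access
  const canvas = doc.createElement('canvas');
  const dataUrl = canvas.toDataURL(); // method call on a canvas element
  return { ratio, dataUrl };
}

// delayMs = 1000 mimics test page 1; delayMs = 0 mimics test page 2.
function scheduleProbe(win, doc, delayMs, onDone) {
  if (delayMs > 0) {
    win.setTimeout(() => onDone(runProbe(win, doc)), delayMs);
  } else {
    onDone(runProbe(win, doc)); // runs synchronously, as early as possible
  }
}

// Only runs inside a browser page; skipped in other environments.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  scheduleProbe(window, document, 0, (result) => console.log(result));
}
```

Changing the `0` to `1000` in the last call would give the delayed variant.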

I hope this helps. If you need any other info, just let me know.

@kdzwinel
Member

kdzwinel commented Sep 29, 2022

Hey @asumansenol , thanks for bringing this up!

I observed the same with our API collection integration test (https://github.com/duckduckgo/tracker-radar-collector/blob/main/tests/integration/apiCollection.test.js), which is somewhat flaky because of this issue.

I suspect a race condition between the API collection script setting things up (https://github.com/duckduckgo/tracker-radar-collector/blob/main/collectors/APICalls/TrackerTracker.js#L126) and scripts on the page already running.
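
The suspected race can be illustrated with a small, self-contained simulation. This is only a sketch: the collector in this thread is described as registering breakpoints, whereas the example below swaps in a plain getter wrapper to keep things runnable without a browser. `makeFakeWindow`, `instrument`, `pageScript`, and `simulate` are all hypothetical names.

```javascript
// Simplified stand-in for a page's global object.
function makeFakeWindow() {
  return { devicePixelRatio: 2 };
}

// Stand-in for collector setup: log every read of `prop` on `target`.
// (Assumption: a getter wrapper replaces the breakpoint mechanism for illustration.)
function instrument(target, prop, log) {
  let value = target[prop];
  Object.defineProperty(target, prop, {
    configurable: true,
    get() {
      log.push(prop);
      return value;
    },
    set(v) {
      value = v;
    },
  });
}

// Stand-in for a script on the page that reads the property once.
function pageScript(win) {
  return win.devicePixelRatio;
}

// Runs the two steps in the given order and returns what was logged.
function simulate(order) {
  const win = makeFakeWindow();
  const log = [];
  const steps = {
    instrument: () => instrument(win, 'devicePixelRatio', log),
    page: () => pageScript(win),
  };
  for (const step of order) steps[step]();
  return log;
}

console.log(simulate(['instrument', 'page'])); // [ 'devicePixelRatio' ]
console.log(simulate(['page', 'instrument'])); // []
```

When the page script wins the race, the access happens before any instrumentation exists, so nothing is recorded and nothing indicates that an access was missed.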

This is not a huge issue for the DDG use case: in most cases everything is ready before third-party requests load and execute, and we operate on a huge sample of sites. But I can see how this is not precise enough for other use cases.

I suspect this is fixable - I'll give it a shot next week and let you know.

@kdzwinel
Member

kdzwinel commented Mar 1, 2023

Sorry, still no solution to this. @muodov is updating APICollector for better attribution (#90), but that doesn't seem to have an effect on this issue. I suspect the solution here is to block scripts from running until all collectors are fully set up. This can be done, e.g., via Debugger.pause as soon as the page starts loading.
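
A rough sketch of that idea, under several assumptions: `cdp` is any object exposing `send`, `on`, and `off` in the style of a Puppeteer CDPSession, and each collector exposes an async `setup(cdp)` method (a hypothetical interface, not the project's actual one). `Debugger.enable`, `Debugger.pause`, `Debugger.resume`, and the `Debugger.paused` event are real CDP names, but whether the pause lands before the first page script runs is not confirmed anywhere in this thread.

```javascript
// Sketch only: request a pause, set up collectors, then resume exactly once.
async function setupWhilePaused(cdp, collectors) {
  let collectorsReady = false;
  let pausedEarly = false;

  const resumeOnce = async () => {
    cdp.off('Debugger.paused', onPaused);
    await cdp.send('Debugger.resume');
  };

  // Debugger.pause only takes effect once JavaScript executes, so resume
  // is sent in response to the paused event rather than unconditionally.
  const onPaused = () => {
    if (collectorsReady) {
      resumeOnce().catch((err) => console.error('resume failed', err));
    } else {
      pausedEarly = true; // page is held until setup finishes
    }
  };

  cdp.on('Debugger.paused', onPaused);
  await cdp.send('Debugger.enable');
  await cdp.send('Debugger.pause');

  // Hypothetical collector interface: each one has an async setup(cdp).
  await Promise.all(collectors.map((collector) => collector.setup(cdp)));
  collectorsReady = true;

  if (pausedEarly) {
    await resumeOnce();
  }
}
```

A collector that itself relies on breakpoints would also receive `Debugger.paused` events, so a real implementation would need to tell its own pauses apart from this initial one; the sketch does not attempt that.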

@muodov
Member

muodov commented Mar 1, 2023

There seems to be a problem with RequestCollector and the latest Chromium as well. I'm currently investigating, but I don't have a concrete solution yet.

@muodov
Member

muodov commented Mar 6, 2023

I think this is basically the same problem as described in puppeteer/puppeteer#8507. This was fixed in puppeteer last year, but unfortunately the fix is incompatible with our current CDP usage, as I mentioned in #84 (comment). We're exploring different options to fix this at the moment.
