Ordering by most common #90

Open
plbowers opened this issue Mar 16, 2018 · 10 comments

@plbowers

Most of the time, people using this code will want to identify bots as quickly as possible. Ordering the patterns by how commonly each bot is seen would speed up the process, letting a match succeed early and exit quickly.

I did a very quick optimization using the frequency reported on this page:

https://deviceatlas.com/blog/list-of-web-crawlers-user-agents

Then I put all your patterns (concatenated with |) into two preg_match() calls:

if (preg_match('/most|common|patterns/', $_SERVER['HTTP_USER_AGENT'])
    || preg_match('/less|common|patterns/', $_SERVER['HTTP_USER_AGENT'])) {
    // is a bot
} else {
    // isn't a bot
}

Providing a script to produce that might be a help...?
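
Something like this might do it (a rough sketch: the $mostCommonBots list below is hypothetical and would come from frequency data such as the DeviceAtlas page above):

<?php
// Split the patterns from crawler-user-agents.json into a "most common"
// and a "less common" alternation. $mostCommonBots is a hand-maintained,
// illustrative list, not something that exists in the repository.
$mostCommonBots = ['Googlebot', 'bingbot', 'Baiduspider', 'YandexBot'];

$entries = json_decode(file_get_contents('crawler-user-agents.json'), true);
$common = [];
$rare = [];
foreach ($entries as $entry) {
    $isCommon = false;
    foreach ($mostCommonBots as $bot) {
        if (stripos($entry['pattern'], $bot) !== false) {
            $isCommon = true;
            break;
        }
    }
    if ($isCommon) {
        $common[] = $entry['pattern'];
    } else {
        $rare[] = $entry['pattern'];
    }
}
// Patterns containing the delimiter would need extra escaping.
echo '/' . implode('|', $common) . "/\n"; // most common patterns
echo '/' . implode('|', $rare) . "/\n";   // less common patterns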

@monperrus
Owner

Interesting comment!

One option is to add a field ("prevalence" or "priority") that reflects this information and could be used to generate the regexp in the right order.
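
For illustration, an entry could then look like this (the prevalence field here is hypothetical; pattern and url are the kind of fields the JSON already has):

{
  "pattern": "Googlebot",
  "url": "http://www.google.com/bot.html",
  "prevalence": 1
}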

WDYT?

@plbowers
Author

Sure - that would be a good solution.

@JayBizzle
Contributor

@plbowers have you got any hard benchmark figures proving that your method would indeed be significantly faster?

@jimdigriz

jimdigriz commented Jul 18, 2018

This would be a strange optimisation to make unless more than 50% of your User-Agent tests match the crawler list. For non-crawler traffic both regex groups report 'no match', which means every alternative is tried regardless of order, so you would be optimising for something that occurs rarely, assuming your User-Agent traffic is 95%+ non-crawler.

If you are looking to lower latency, then you should look at using a language (or maybe PHP has a C extension) that lets you compile the concatenated version of the RE once (which then makes ordering irrelevant).

For example, some languages cache the compiled version automatically for you (I cannot tell whether PHP does too; see the sketch after this list):

  • Perl
  • JavaScript - I suspect that is why .compile() is now deprecated
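
If PHP's PCRE layer does cache compiled patterns (keyed by the pattern string), a single concatenated pattern would be compiled only once per process. A minimal sketch, assuming the patterns have already been read from crawler-user-agents.json:

<?php
$entries = json_decode(file_get_contents('crawler-user-agents.json'), true);
// One big alternation: ordering is irrelevant for the compile cost.
// Caveat: any pattern containing the '~' delimiter would need escaping.
$combined = '~' . implode('|', array_column($entries, 'pattern')) . '~';

// Reusing the exact same pattern string on every call lets a caching
// engine skip recompilation and pay only for the match itself.
$isBot = (bool) preg_match($combined, $_SERVER['HTTP_USER_AGENT'] ?? '');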

@Fale
Contributor

Fale commented Jul 28, 2019

I tried multiple cases using https://godoc.org/go.kelfa.io/kelfa/pkg/crawlerflagger (it's written in Go).

It exposes 2 ways to query the crawler-user-agents list:

  • ExactMatch (it uses the "instances" field)
  • RegexpMatch (it uses the "pattern" field)

I tried to match the 1st entry, the 100th entry, the 200th entry, the 300th entry, the 400th entry, and a non-existent entry; these are the results:

BenchmarkName                         Iterations         Average (nanoseconds/operation)
BenchmarkExactMatch/case0-8             10000000               182 ns/op
BenchmarkExactMatch/case101-8           10000000               128 ns/op
BenchmarkExactMatch/case200-8           10000000               137 ns/op
BenchmarkExactMatch/case300-8           10000000               124 ns/op
BenchmarkExactMatch/case400-8           20000000               113 ns/op
BenchmarkExactMatch/miss-8             200000000                 8.03 ns/op
BenchmarkRegExpMatch/case0-8             5000000               292 ns/op
BenchmarkRegExpMatch/case101-8            200000              7335 ns/op
BenchmarkRegExpMatch/case200-8            100000             12866 ns/op
BenchmarkRegExpMatch/case300-8            100000             21898 ns/op
BenchmarkRegExpMatch/case400-8             50000             26515 ns/op
BenchmarkRegExpMatch/miss-8                50000             23963 ns/op

So it seems to suggest that, at least in Go, order has no relevance at all for the "instances"-based match, while it does matter for the "pattern"-based match.

@monperrus
Owner

Interesting. Did you also test with a single pattern concatenating all patterns with |?

@Fale
Contributor

Fale commented Jul 29, 2019

At the moment there are 400+ regexps (one per entry) and then a switch to analyse which case matches.

The reason I implemented it this way is that I'm not really sure how to identify which case matched: with a single concatenated pattern it would be possible to decide whether at least one pattern matches the input string, but not which one.
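
One possible workaround (sketched in PHP to match the example earlier in the thread, but the two-stage idea ports to Go): use the concatenated pattern only as a fast yes/no filter, and run the per-pattern loop only on the rare hit to identify the entry:

<?php
$entries = json_decode(file_get_contents('crawler-user-agents.json'), true);
$combined = '~' . implode('|', array_column($entries, 'pattern')) . '~';

function matching_entry(string $ua, string $combined, array $entries): ?array
{
    if (!preg_match($combined, $ua)) {
        return null; // fast path: most traffic exits here
    }
    // Slow path, taken only on a hit: find the exact entry.
    foreach ($entries as $entry) {
        if (preg_match('~' . $entry['pattern'] . '~', $ua)) {
            return $entry;
        }
    }
    return null; // unreachable as long as $combined is built from $entries
}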

@MurugappanVR

MurugappanVR commented May 31, 2020

@monperrus Since most bot user agents contain bot, crawler, or spider, we could group all those user agents under a single bot|crawl|spider pattern; a regex like this might help. Collapsing these into a single pattern would reduce the number of patterns to be matched.

@JayBizzle
Contributor

JayBizzle commented May 31, 2020

A generic regex like that is a good idea, but you do have to be very careful not to create false positives. You can't have bot on its own as part of that regex, as there are a few genuine user-agents that have bot as part of their name, Cubot for example.

The best way to increase the performance of a regex such as this is to remove common strings from the source user-agent.

As you can see here...
https://github.com/JayBizzle/Crawler-Detect/blob/master/src/Fixtures/Exclusions.php
...we run a regex replace on the user agent first that removes any of the common matches before running the bot regexes.

We saw a 55% speed increase doing this.
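
Roughly, the approach looks like this (a minimal sketch; the exclusion and bot patterns below are illustrative, not Crawler-Detect's actual lists):

<?php
// Strip substrings common to genuine browser user-agents first,
// so the bot regexes then run against a much shorter string.
$exclusions = '~Mozilla/5\.0|AppleWebKit/[\d.]+|Chrome/[\d.]+|Safari/[\d.]+~';
$botPattern = '~Googlebot|bingbot|Baiduspider~';

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$reduced = preg_replace($exclusions, '', $ua);
$isBot = (bool) preg_match($botPattern, $reduced);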

@monperrus
Owner

Grouping patterns is done on the user side, as in @JayBizzle's example.

Note that we'd be happy to merge example code snippets for grouping in the README.
