Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompiledAutomaton not equal when NFARunAutomaton is non-null #13715

Open
ChrisHegarty opened this issue Sep 4, 2024 · 9 comments
Open

CompiledAutomaton not equal when NFARunAutomaton is non-null #13715

ChrisHegarty opened this issue Sep 4, 2024 · 9 comments

Comments

@ChrisHegarty
Copy link
Contributor

ChrisHegarty commented Sep 4, 2024

NFARunAutomaton is does not override equals, so defaults to object identity, which means that classes like CompiledAutomaton that created internal instances of it may not appear equal when they in fact are. This issue has been filed to investigate the possibility of adding a NFARunAutomaton::equals override that does something sensible with its internal state.

I ran into this issue when comparing this like with Lucene 10:

var ca1 = new CompiledAutomaton(automaton, false, true, false);
var ca2 = new CompiledAutomaton(automaton, false, true, false);
assertEquals(ca1, ca2);   // <<<< ca1 and ca2 are not equal
var q1 = new IntervalQuery("f", Intervals.regexp(new BytesRef(".*foo")));
var q2 = new IntervalQuery("f", Intervals.regexp(new BytesRef(".*foo")));
assertEquals(q1, q2);   // <<<< q1 and q2 are not equal
...
Expected :org.apache.lucene.queries.intervals.IntervalQuery<f:MultiTerm(.*foo)>
Actual   :org.apache.lucene.queries.intervals.IntervalQuery<f:MultiTerm(.*foo)>

@ChrisHegarty
Copy link
Contributor Author

@zhaih @rmuir - Just raising for your awareness.

@zhaih
Copy link
Contributor

zhaih commented Sep 4, 2024

The NFARunAutomaton internally carries a cache such that it will remember all the states that has been determinized, so it does not make sense to implement a completely strict equal method. We might be able to implement some by just comparing automaton field such that as long as the two NFARunAutomaton give the same output they are the same.

Or, maybe more strictly, if the automaton field is the same and both have empty cached states such that they're equal when just created but will be different as long as they're being run?

@dweiss
Copy link
Contributor

dweiss commented Sep 4, 2024

I'm not sure if we can guarantee true equality check for arbitrary automata. Perhaps a "shallow" check meaning equivalence of states and transitions but not the true equality in terms of accepting the same language.

@ChrisHegarty
Copy link
Contributor Author

Before considering possible fixes, do we agree that there is a problem worth fixing? I’m particular thinking of the equality of IntervalQuery.

@zhaih
Copy link
Contributor

zhaih commented Sep 4, 2024

I think it's a nice to have one. Altho for the specific IntervalQuery I feel like maybe we can directly claim they're equal if the pattern is the same and not checking the automaton at all.

@rmuir
Copy link
Member

rmuir commented Sep 4, 2024

@ChrisHegarty I think it might be an oversight IntervalQuery is getting an NFA, was that really intended?

The following other methods on intervals are powered by automatons and get a DFA:

  • Intervals.prefix()
  • Intervals.wildcard()
  • Intervals.range()
  • Intervals.fuzzyTerm()

This happens because they reuse the parsing from the associated Query classes, but this Intervals.regexp() neglects to do that and just creates its own RegExp and converts it to an NFA. Given that the default of RegexpQuery is to determinize, I think we should determinize here to avoid surprises such as this?

@rmuir
Copy link
Member

rmuir commented Sep 4, 2024

There are other related problems to fix here separately, just start with CompiledAutomaton:

  • hashCode() is inconsistent with equals(): either they both consider nfaRunAutomaton or they do not
  • ramBytesUsed doesn't consider nfaRunAutomaton

@ChrisHegarty
Copy link
Contributor Author

Did we uncover a hornets nest!?

@rmuir
Copy link
Member

rmuir commented Sep 4, 2024

That's usually how it works, it is good stuff. To be practical, for now, I'd recommend just fixing Intervals.regex to determinize() so that it matches all other queries which are using DFAs. It should be a one-line change to fix your problem because then there won't be an NFA :)

I think it is enough for lucene 10 that the queries no longer minimize() up-front, and I'm hoping we can explore taking that next step soon so that we no longer determinize() up-front either (just as needed via NFA query), but we shouldn't be doing that now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants