Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add howto for Diversification #24

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

uppinder
Copy link
Contributor

@uppinder uppinder commented Aug 4, 2018

Contains a howto for diversification with code snippets in python.

howtos/diversification.rst Outdated Show resolved Hide resolved

// Query a database and get 10 results, where 'enq' is an instantiated
// Enquire object over a database
matches = enq.get_mset(0, 10)
Copy link
Contributor

@ojwb ojwb Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than giving code in a particular language (or here what appears to be Python but with C++ comments!) it's better to use put the example code under code/ and use .. xapianexample:: to pull in parts of it. You can show the results of running the code with ..xapianrunexample:: (see for example howtos/boolean_filters.rst for uses of these).

By using these macros, people can provide translations of the examples to other languages, and so we can have a versions of this guide in C++, Python, Perl, PHP, etc.

The other thing these macros do is ensure that the shown example code actually works and produces any output shown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples.

howtos/diversification.rst Outdated Show resolved Hide resolved

// Query a database and get 10 results, where 'enq' is an instantiated
// Enquire object over a database
matches = enq.get_mset(0, 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples.


Xapian allows for diversification of documents which are stored in the form of an MSet.
This feature is a well-known technique in information retrieval used to increase
user satisfaction, especially for ambiguous queries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't explain what diversification actually does, although the glossary entry does this very well. Using similar wording may feel like duplication, but would actually be a good thing here.

howtos/diversification.rst Outdated Show resolved Hide resolved
howtos/diversification.rst Outdated Show resolved Hide resolved

Perform diversification over 'matches' and obtain an ordered list of documents::

dset = d.get_dmset(matches)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two snippets go closely together, and the general approach we've taken is to show the whole snippet and then explain it with text before and after. Further, I'd suggest that you need to explain how to use the generated dset, since otherwise people reading this document will be left with an object they'll still have to look up in the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably makes more sense to have diversification reorder the MSet. If it worked like that, then adding diversification to an existing application would just be a matter of adding a bit of code between the application code which calls get_mset() and the application code which uses the returned MSet.

With the currently implemented diversification API you'd also have to rework all the application code which uses the returned MSet to instead use a DocumentSet. That's especially annoying if you want to support diversification as an option, since then you need two versions of that code.

This would also make it easier to document here, since using the diversified MSet is then really no different to using any MSet.

For explicit clustering, I think you get a DocumentSet object for each cluster, so there it has more of a reason for the class to exist in that case. To return clustered results via an MSet we'd need to add some sort of cluster-id in the MSet or some other way to indicate what the clusters are. It also seems less likely you'd want to treat explicitly clustered results the same way as unclustered results.

As for how to do this, there's a simple API to support that which was added for letor by ayushp last year, which is based on setting new weights for each item with replace_weights() and then calling sort_by_relevance(). That could be used here, but we'd have to adjust the weights of the entries to get the desired order - for example, crudely setting weight to 1.0 / desired_rank would work.

Or perhaps better, reorder the MSet's internals more directly - diversification is part of xapian-core, so doesn't have to be constrained to use public API methods (unlike xapian-letor).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd favour a direct re-ordering in this case, which would retain the original weights.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the original weights are probably useful to some API users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at the code, and I'm thinking about refactoring it to instead provide a new diversify method on Xapian::MSet which takes the same parameters that a Diversify constructor currently does. The Diversify class would then not be needed at all.

So usage would look like:

matches = enq.get_mset(0, 10);
matches.diversify(4, 2);

And then you can display matches as you would without diversification.

Currently we effectively get a reordered list of docids out, so I think we might need to have some sort of mapping to allow us to efficiently find the MSet entry with a particular docid - that could be created as the MSet is scanned to seed the clustering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the code to change the API as above, and from updating the test code it does seem more natural.

However while checking if it gives the same results as before I get a segfault which I can reproduce with the unrefactored code:

==149288== Invalid read of size 8
==149288==    at 0x49A3A4B: Xapian::LCDClusterer::cluster(Xapian::MSet const&) (lcd_clusterer.cc:132)
==149288==    by 0x49A81FB: Xapian::Diversify::Internal::get_dmset(Xapian::MSet const&) (diversify.cc:181)
==149288==    by 0x49A880C: Xapian::Diversify::get_dmset(Xapian::MSet const&) (diversify.cc:75)
==149288==    by 0x10D657: main (quest.cc:458)
==149288==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Copy link
Contributor

@ojwb ojwb Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The segfault was a bug in LCDClusterer - the relevance weight was used as a unique key, which doesn't work well in cases where multiple documents have exactly the same weight. Fixed by xapian/xapian@be80054.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been looking at the diversification code some more. I tried some test queries and what I always seem to get out is a result with the top k results in the same order, and the rest with their order reversed. I would expect the first k to change in many cases, and the relative order of remaining entries should probably be preserved (since in the absence of any other reason to order them, the original ordering seems the way to go).

I dug in a bit and found that the diversification uses the cluster centroids, but the LCDClusterer doesn't set these, so every centroid has no terms and zero magnitude - effectively every centroid is the zero point of the term space. That would explain the lack of useful reordering.

IIRC the diversification code was written using the existing K-means clustering (which does set centroids) and then LCDClusterer was added, so I suspect that's how we ended up here.

Also I had a look at the paper, and noticed that says "The next cluster center is chosen as the one that maximizes the sum of distances to the previous center(s)" but our implementation has:

        // Select a new cluster center which is the point that is farthest away
        // from the current cluster center
        cluster_center = dist_vector.back().first;

That seems to do the right thing for the second cluster (and for the first we seem to correctly take the top ranked doc) but for the third and onwards we only consider the distance to the previous cluster centre, and not also the distances to the ones before that.

@uppinder Do you have any useful insights?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants