Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parent-child relationship #25

Closed
wants to merge 6 commits into from

Conversation

ShaneHarvey
Copy link
Contributor

@ShaneHarvey ShaneHarvey commented Nov 2, 2016

This builds on @xiaogaozi's work on #3.

Add support for parent-child mapping.

Example Usage

Let's use the data from the Elasticsearch Parent-Child Mapping docs and adapt it to MongoDB. Suppose you have a company database in MongoDB with the collections branch and employees. Use this script to load some sample data:
Note: this example assumes you have a MongoDB replica set on localhost:27017 and an Elasticsearch instance on localhost:9200.

from pymongo import MongoClient

HOST = 'mongodb://localhost:27017'
client = MongoClient(HOST)

company = client.company
branch = company.branch
employee = company.employee


def insert_parent_child(parent_doc, child_doc, parent_field):
    res = branch.insert_one(parent_doc)
    child_doc[parent_field] = res.inserted_id
    employee.insert_one(child_doc)

branch_docs = [
    {'name': 'London Westminster', 'city': 'London', 'country': 'UK'},
    {'name': 'Liverpool Central', 'city': 'Liverpool', 'country': 'UK'},
    {'name': 'Champs Elysees', 'city': 'Paris', 'country': 'France'}]
employee_docs = [
    {'name': 'Mark Thomas', 'dob': '1982-05-16', 'hobby': 'diving'},
    {'name': 'Barry Smith', 'dob': '1979-04-01', 'hobby': 'hiking'},
    {'name': 'Adrien Grand', 'dob': '1987-05-11', 'hobby': 'horses'}]

for branch_doc, employee_doc in zip(branch_docs, employee_docs):
    insert_parent_child(branch_doc, employee_doc, 'branch_id')

In the current pull request, the elastic-doc-manager does not automatically create the parent-child mapping. So create the mapping manually before running mongo-connector:

curl -XPUT 'http://localhost:9200/company' -d '{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch" 
      }
    }
  }
}'

Next, create the mongo-connector config file with parent-child mapping options:

{
  "mainAddress": "localhost:27017",
  "verbosity": 2,
  "docManagers":[
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "localhost:9200",
      "args": {
        "routing": {
          "company": {
            "employee": {
              "parentField": "branch_id"
            }
          }
        }
      }
    }
  ]
}

Next, run mongo-connector with the above config file:
mongo-connector -c parent-child-config.json

Finally, you can query Elasticsearch with your MongoDB data.

Query Elasticsearch for children by their parents:

curl -XGET 'http://localhost:9200/company/employee/_search/?pretty=1' -d '{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}'

Or, query Elasticsearch for parents by their children:

curl -XGET 'http://localhost:9200/company/branch/_search/?pretty=1' -d '{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}'

Up for discussion:

  • Should the config file format be changed?
  • Should the DocManager automatically create the parent and child index/mapping?
    • In this case, we need to add a parentType field to the config file as well as the parentField.
  • This does not support nested parent fields, eg. "parentField": "foo.bar.parent_id" will not work.
  • This does not support grandparents and grandchildren. The difficulty here is that to insert a grandchild document, we need the parent id (for parent) and the grandparent id (for routing). I propose we only support grandparent relationships when the grandchild documents contain a parentField and an additional routingField. This would avoid having to query the parent collection to find the parent's _routing to use for routing (what if the parent doesn't exist yet which may happen during the collection dump?). This would support great-grandchildren, great-great-grandchildren, etc. automatically. Extending the example above with the Elasticsearch Grandparents example:
{
  "mainAddress": "localhost:27017",
  "verbosity": 2,
  "docManagers":[
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "localhost:9200",
      "args": {
        "routing": {
          "company": {
            "branch": {
              "parentField": "country_id"
            },
            "employee": {
              "parentField": "branch_id",
              "routingField": "routing_id"
            }
          }
        }
      }
    }
  ]
}

@ShaneHarvey
Copy link
Contributor Author

@benan789 is this how you imagine grandparent relationships could work?

@benan789
Copy link

benan789 commented Nov 2, 2016

Yes, that's perfect.

'Elasticsearch to apply update', u(document_id))
return None
had_parent = '_parent' in document
old_parent = document.get('_parent')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_parent_id_from_elastic ? You could also do if old_parent: below instead of creating had_parent.

# index the new child.
updated = self.apply_update(document['_source'], update_spec)
if (had_parent and updated.get(parent_field) != old_parent) or (
not had_parent and parent_field in updated):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it sufficient to check if updated.get(parent_field) != old_parent?

refresh=(self.auto_commit_interval == 0))
parent_args = {}
if parent_id is not None:
parent_args['parent'] = parent_id
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a common pattern:

parent_id = self._get_parent_id_from_mongodb(...)
parent_args = {}
if parent_id is not None:
    parent_args['parent'] = parent_id

perhaps this could be its own helper function? like:

parent_args = self._get_parent_args(index, doc_type, doc)

# This is due to the fact that Elasticsearch needs the parent ID
# to know where to route the get request. We do not have the
# parent ID available in our update request though.
document = self._search_doc_by_id(index, doc_type, document_id)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is something the BulkBuffer can help us with, once that gets merged in.

@benan789
Copy link

Is there an ETA on this? Thanks!

@ShaneHarvey ShaneHarvey added this to the 0.3.0 milestone Nov 14, 2016
@ShaneHarvey ShaneHarvey self-assigned this Nov 14, 2016
@StefanBalea
Copy link

Will this feature be merged? Thanks!

@lnader
Copy link

lnader commented Nov 28, 2016

@ShaneHarvey - Sorry for the delay! We were able to test this and it worked.

@marcuscarr
Copy link

@ShaneHarvey - I've been testing this out for the use case my coworker @lnader asked about. It's working great! 👍

@aarestu
Copy link

aarestu commented Dec 17, 2016

so, can I use this? and is this safe to use?
thank's masters 👍

@lnader
Copy link

lnader commented Dec 17, 2016

@ShaneHarvey - When will this get merged? Thanks

@ShaneHarvey ShaneHarvey modified the milestones: 0.3.0, 0.4.0 Dec 23, 2016
@yomansk8
Copy link

yomansk8 commented Feb 9, 2017

Hello, any news on when we got this merged and we can use it ? Thanks!

@StefanBalea
Copy link

Hello, any news on when we got this merged ? Thanks! :)

@lnader
Copy link

lnader commented Mar 31, 2017

When is this going to get merged?

@dblado
Copy link

dblado commented May 23, 2017

I'd like to use this too. Any update on when it will be merged?

if parent_field is None:
return None

return self._formatter.transform_value(doc.pop(parent_field, None))
Copy link

@dblado dblado May 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think the parent field should be popped off the doc. I use the parent field in queries as a filter. I can't find anything that speaks to the performance of has_parent, but my gut says that a bool.filter.term.parent_field="value" is more performant than has_parent.

@nVitius
Copy link

nVitius commented Jul 19, 2017

Having this functionality would be super-helpful. @ShaneHarvey: any chance you could update this PR so we could look at merging it again?

@sorhent
Copy link

sorhent commented Aug 9, 2017

Hello,
I performed some ugly development in order for mongo-connector to support parent/child relationship,
routing and to fit also with oplog issue: yougov/mongo-connector#446
I could make a PR with the code as it is actually and if you want, you could improve it.
It works for me.

@eric-chao
Copy link

I tested this case under es5.5.2, not passed. :-(

2017-09-19 17:06:42,881 [CRITICAL] mongo_connector.oplog_manager:670 - Exception during collection dump
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 625, in do_dump
upsert_all(dm)
File "/usr/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 611, in upsert_all
mapped_ns, long_ts)
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 46, in wrapped
reraise(new_type, exc_value, exc_tb)
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 35, in wrapped
return f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 367, in bulk_upsert
for ok, resp in responses:
File "/usr/lib/python2.7/site-packages/elasticsearch/helpers/init.py", line 163, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/lib/python2.7/site-packages/elasticsearch/helpers/init.py", line 135, in _process_bulk_chunk
raise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
OperationFailed: (u'3 document(s) failed to index.', [{u'index': {u'status': 400, u'_type': u'employee', u'_id': u'59c0d72823ad98799d437c25', u'error': {u'index_uuid': u'na', u'index': u'company', u'reason': u'routing is required for [company]/[employee]/[59c0d72823ad98799d437c25]', u'type': u'routing_missing_exception'}, u'_index': u'company'}}, {u'index': {u'status': 400, u'_type': u'employee', u'_id': u'59c0d72823ad98799d437c27', u'error': {u'index_uuid': u'na', u'index': u'company', u'reason': u'routing is required for [company]/[employee]/[59c0d72823ad98799d437c27]', u'type': u'routing_missing_exception'}, u'_index': u'company'}}, {u'index': {u'status': 400, u'_type': u'employee', u'_id': u'59c0d72823ad98799d437c29', u'error': {u'index_uuid': u'na', u'index': u'company', u'reason': u'routing is required for [company]/[employee]/[59c0d72823ad98799d437c29]', u'type': u'routing_missing_exception'}, u'_index': u'company'}}])
2017-09-19 17:06:42,883 [ERROR] mongo_connector.oplog_manager:678 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(host=[u'localhost:27018'], document_class=dict, tz_aware=False, connect=True, replicaset=u'rs'), u'local'), u'oplog.rs')
2017-09-19 17:06:43,444 [ERROR] mongo_connector.connector:398 - MongoConnector: OplogThread <OplogThread(Thread-3, started 140243465701120)> unexpectedly stopped! Shutting down
2017-09-19 17:06:43,445 [INFO] mongo_connector.connector:459 - MongoConnector: Stopping all OplogThreads

@bartlettpsj
Copy link

Please add me to the list of folk asking when this will be merged. I need to link two huge collections and application side joins and nesting are out of the question.

@songguohfut
Copy link

hello~when will it be merged?

@ft0907
Copy link

ft0907 commented Jun 29, 2018

@eric-chao I have a similar problem. How did you solve it? Or is this branch not integrated into the main molecule, which is unable to use this function?
@ShaneHarvey The existing version does not support paternity. Would you like to join this function later? Or are there other alternative solutions?

@jaraco
Copy link
Collaborator

jaraco commented Nov 4, 2018

Sadly, this PR has some non-trivial merge conflicts with the implementation. @ShaneHarvey if you're willing to resolve the merge conflicts and if you don't think the implementation is stable enough for a release, I'd like to get it out, as it sounds like there's some good demand for it.

@jaraco jaraco removed this from the 0.4.0 milestone Nov 4, 2018
@jaraco jaraco self-assigned this Nov 4, 2018
@ShaneHarvey
Copy link
Contributor Author

I'm cleaning up some old issues highlighted by the new github dashboard UI and realized this is still open. Sorry for the huge delay here @jaraco (and everyone else waiting for this feature).

The problem with this PR is that it conflicts with the bulk/buffering work added in #15. Resolving the conflicts are definitely non-trivial. In order for this to be merged I think the buffering logic might need to be simplified to make parent-child lookup logic simpler. That said, I haven't worked on this project for quite a while and my knowledge of Elastic is also not up to date so there might be a solution that does not involve changing the bulk buffer logic.

For now I'm going to close this PR because I'm not planning to work on it. Anyone else is free to take over this work and open a new PR using the same ideas (as I did with #3).

@ShaneHarvey ShaneHarvey closed this Sep 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.