Support parent-child relationship #25

ShaneHarvey · 2016-11-02T18:11:05Z

This builds on @xiaogaozi's work on #3.

Add support for parent-child mapping.

Example Usage

Let's use the data from the Elasticsearch Parent-Child Mapping docs and adapt it to MongoDB. Suppose you have a company database in MongoDB with the collections branch and employees. Use this script to load some sample data:
Note: this example assumes you have a MongoDB replica set on localhost:27017 and an Elasticsearch instance on localhost:9200.

from pymongo import MongoClient

HOST = 'mongodb://localhost:27017'
client = MongoClient(HOST)

company = client.company
branch = company.branch
employee = company.employee


def insert_parent_child(parent_doc, child_doc, parent_field):
    res = branch.insert_one(parent_doc)
    child_doc[parent_field] = res.inserted_id
    employee.insert_one(child_doc)

branch_docs = [
    {'name': 'London Westminster', 'city': 'London', 'country': 'UK'},
    {'name': 'Liverpool Central', 'city': 'Liverpool', 'country': 'UK'},
    {'name': 'Champs Elysees', 'city': 'Paris', 'country': 'France'}]
employee_docs = [
    {'name': 'Mark Thomas', 'dob': '1982-05-16', 'hobby': 'diving'},
    {'name': 'Barry Smith', 'dob': '1979-04-01', 'hobby': 'hiking'},
    {'name': 'Adrien Grand', 'dob': '1987-05-11', 'hobby': 'horses'}]

for branch_doc, employee_doc in zip(branch_docs, employee_docs):
    insert_parent_child(branch_doc, employee_doc, 'branch_id')

In the current pull request, the elastic-doc-manager does not automatically create the parent-child mapping. So create the mapping manually before running mongo-connector:

curl -XPUT 'http://localhost:9200/company' -d '{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch" 
      }
    }
  }
}'

Next, create the mongo-connector config file with parent-child mapping options:

{
  "mainAddress": "localhost:27017",
  "verbosity": 2,
  "docManagers":[
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "localhost:9200",
      "args": {
        "routing": {
          "company": {
            "employee": {
              "parentField": "branch_id"
            }
          }
        }
      }
    }
  ]
}

Next, run mongo-connector with the above config file:
mongo-connector -c parent-child-config.json

Finally, you can query Elasticsearch with your MongoDB data.

Query Elasticsearch for children by their parents:

curl -XGET 'http://localhost:9200/company/employee/_search/?pretty=1' -d '{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}'

Or, query Elasticsearch for parents by their children:

curl -XGET 'http://localhost:9200/company/branch/_search/?pretty=1' -d '{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}'

Up for discussion:

Should the config file format be changed?
Should the DocManager automatically create the parent and child index/mapping?
- In this case, we need to add a parentType field to the config file as well as the parentField.
This does not support nested parent fields, eg. "parentField": "foo.bar.parent_id" will not work.
This does not support grandparents and grandchildren. The difficulty here is that to insert a grandchild document, we need the parent id (for parent) and the grandparent id (for routing). I propose we only support grandparent relationships when the grandchild documents contain a parentField and an additional routingField. This would avoid having to query the parent collection to find the parent's _routing to use for routing (what if the parent doesn't exist yet which may happen during the collection dump?). This would support great-grandchildren, great-great-grandchildren, etc. automatically. Extending the example above with the Elasticsearch Grandparents example:

{
  "mainAddress": "localhost:27017",
  "verbosity": 2,
  "docManagers":[
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "localhost:9200",
      "args": {
        "routing": {
          "company": {
            "branch": {
              "parentField": "country_id"
            },
            "employee": {
              "parentField": "branch_id",
              "routingField": "routing_id"
            }
          }
        }
      }
    }
  ]
}

…lection

ShaneHarvey · 2016-11-02T18:14:02Z

@benan789 is this how you imagine grandparent relationships could work?

benan789 · 2016-11-02T19:06:32Z

Yes, that's perfect.

llvtt · 2016-11-03T17:42:13Z

mongo_connector/doc_managers/elastic2_doc_manager.py

+                          'Elasticsearch to apply update', u(document_id))
+                return None
+            had_parent = '_parent' in document
+            old_parent = document.get('_parent')


_get_parent_id_from_elastic ? You could also do if old_parent: below instead of creating had_parent.

llvtt · 2016-11-03T17:46:29Z

mongo_connector/doc_managers/elastic2_doc_manager.py

+            # index the new child.
+            updated = self.apply_update(document['_source'], update_spec)
+            if (had_parent and updated.get(parent_field) != old_parent) or (
+                    not had_parent and parent_field in updated):


Isn't it sufficient to check if updated.get(parent_field) != old_parent?

llvtt · 2016-11-03T18:07:23Z

mongo_connector/doc_managers/elastic2_doc_manager.py

-                           refresh=(self.auto_commit_interval == 0))
+        parent_args = {}
+        if parent_id is not None:
+            parent_args['parent'] = parent_id


This seems to be a common pattern:

parent_id = self._get_parent_id_from_mongodb(...) parent_args = {} if parent_id is not None: parent_args['parent'] = parent_id

perhaps this could be its own helper function? like:

parent_args = self._get_parent_args(index, doc_type, doc)

llvtt · 2016-11-03T18:10:49Z

mongo_connector/doc_managers/elastic2_doc_manager.py

+            # This is due to the fact that Elasticsearch needs the parent ID
+            # to know where to route the get request. We do not have the
+            # parent ID available in our update request though.
+            document = self._search_doc_by_id(index, doc_type, document_id)


Maybe this is something the BulkBuffer can help us with, once that gets merged in.

benan789 · 2016-11-14T19:25:03Z

Is there an ETA on this? Thanks!

StefanBalea · 2016-11-21T08:15:56Z

Will this feature be merged? Thanks!

lnader · 2016-11-28T23:58:43Z

@ShaneHarvey - Sorry for the delay! We were able to test this and it worked.

marcuscarr · 2016-11-29T03:38:18Z

@ShaneHarvey - I've been testing this out for the use case my coworker @lnader asked about. It's working great! 👍

aarestu · 2016-12-17T02:20:02Z

so, can I use this? and is this safe to use?
thank's masters 👍

lnader · 2016-12-17T03:26:36Z

@ShaneHarvey - When will this get merged? Thanks

yomansk8 · 2017-02-09T09:51:16Z

Hello, any news on when we got this merged and we can use it ? Thanks!

StefanBalea · 2017-03-10T15:04:03Z

Hello, any news on when we got this merged ? Thanks! :)

lnader · 2017-03-31T21:01:25Z

When is this going to get merged?

dblado · 2017-05-23T22:18:05Z

I'd like to use this too. Any update on when it will be merged?

dblado · 2017-05-26T21:06:10Z

mongo_connector/doc_managers/elastic2_doc_manager.py

+        if parent_field is None:
+            return None
+
+        return self._formatter.transform_value(doc.pop(parent_field, None))


I actually don't think the parent field should be popped off the doc. I use the parent field in queries as a filter. I can't find anything that speaks to the performance of has_parent, but my gut says that a bool.filter.term.parent_field="value" is more performant than has_parent.

nVitius · 2017-07-19T08:00:18Z

Having this functionality would be super-helpful. @ShaneHarvey: any chance you could update this PR so we could look at merging it again?

sorhent · 2017-08-09T08:57:57Z

Hello,
I performed some ugly development in order for mongo-connector to support parent/child relationship,
routing and to fit also with oplog issue: yougov/mongo-connector#446
I could make a PR with the code as it is actually and if you want, you could improve it.
It works for me.

eric-chao · 2017-09-19T09:21:49Z

I tested this case under es5.5.2, not passed. :-(

2017-09-19 17:06:42,881 [CRITICAL] mongo_connector.oplog_manager:670 - Exception during collection dump
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 625, in do_dump
upsert_all(dm)
File "/usr/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 611, in upsert_all
mapped_ns, long_ts)
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 46, in wrapped
reraise(new_type, exc_value, exc_tb)
File "/usr/lib/python2.7/site-packages/mongo_connector/util.py", line 35, in wrapped
return f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 367, in bulk_upsert
for ok, resp in responses:
File "/usr/lib/python2.7/site-packages/elasticsearch/helpers/init.py", line 163, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/lib/python2.7/site-packages/elasticsearch/helpers/init.py", line 135, in _process_bulk_chunk
raise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
OperationFailed: (u'3 document(s) failed to index.', [{u'index': {u'status': 400, u'_type': u'employee', u'_id': u'59c0d72823ad98799d437c25', u'error': {u'index_uuid': u'na', u'index': u'company', u'reason': u'routing is required for [company]/[employee]/[59c0d72823ad98799d437c25]', u'type': u'routing_missing_exception'}, u'_index': u'company'}}, {u'index': {u'status': 400, u'_type': u'employee', u'_id': u'59c0d72823ad98799d437c27', u'error': {u'index_uuid': u'na', u'index': u'company', u'reason': u'routing is required for [company]/[employee]/[59c0d72823ad98799d437c27]', u'type': u'routing_missing_exception'}, u'_index': u'company'}}, {u'index': {u'status': 400, u'_type': u'employee', u'_id': u'59c0d72823ad98799d437c29', u'error': {u'index_uuid': u'na', u'index': u'company', u'reason': u'routing is required for [company]/[employee]/[59c0d72823ad98799d437c29]', u'type': u'routing_missing_exception'}, u'_index': u'company'}}])
2017-09-19 17:06:42,883 [ERROR] mongo_connector.oplog_manager:678 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(host=[u'localhost:27018'], document_class=dict, tz_aware=False, connect=True, replicaset=u'rs'), u'local'), u'oplog.rs')
2017-09-19 17:06:43,444 [ERROR] mongo_connector.connector:398 - MongoConnector: OplogThread <OplogThread(Thread-3, started 140243465701120)> unexpectedly stopped! Shutting down
2017-09-19 17:06:43,445 [INFO] mongo_connector.connector:459 - MongoConnector: Stopping all OplogThreads

bartlettpsj · 2017-10-26T15:02:06Z

Please add me to the list of folk asking when this will be merged. I need to link two huge collections and application side joins and nesting are out of the question.

songguohfut · 2018-04-24T12:17:07Z

hello~when will it be merged?

ft0907 · 2018-06-29T03:14:09Z

@eric-chao I have a similar problem. How did you solve it? Or is this branch not integrated into the main molecule, which is unable to use this function?
@ShaneHarvey The existing version does not support paternity. Would you like to join this function later? Or are there other alternative solutions?

jaraco · 2018-11-04T09:06:05Z

Sadly, this PR has some non-trivial merge conflicts with the implementation. @ShaneHarvey if you're willing to resolve the merge conflicts and if you don't think the implementation is stable enough for a release, I'd like to get it out, as it sounds like there's some good demand for it.

ShaneHarvey · 2019-09-09T23:21:00Z

I'm cleaning up some old issues highlighted by the new github dashboard UI and realized this is still open. Sorry for the huge delay here @jaraco (and everyone else waiting for this feature).

The problem with this PR is that it conflicts with the bulk/buffering work added in #15. Resolving the conflicts are definitely non-trivial. In order for this to be merged I think the buffering logic might need to be simplified to make parent-child lookup logic simpler. That said, I haven't worked on this project for quite a while and my knowledge of Elastic is also not up to date so there might be a solution that does not involve changing the bulk buffer logic.

For now I'm going to close this PR because I'm not planning to work on it. Anyone else is free to take over this work and open a new PR using the same ideas (as I did with #3).

xiaogaozi and others added 5 commits October 31, 2016 14:19

Support parent-child relationship (yougov#3)

68c0861

Update Elasticsearch test version to 2.4.1

1c8c559

Change parent-child mapping to be per namespace specific, not per col…

bdcf9ec

…lection

Add back parent field when updating child documents

c587b3e

Fix updating the parent field of a child document

e885889

ShaneHarvey mentioned this pull request Nov 2, 2016

Support parent-child relationship #3

Closed

ShaneHarvey added the enhancement label Nov 2, 2016

Create meta index in test setup.

d78f30c

ShaneHarvey mentioned this pull request Nov 2, 2016

Question: How do I combine/transform 2 collections in mongo into one in elastic index? yougov/mongo-connector#557

Open

llvtt suggested changes Nov 3, 2016

View reviewed changes

ShaneHarvey added this to the 0.3.0 milestone Nov 14, 2016

ShaneHarvey self-assigned this Nov 14, 2016

ShaneHarvey modified the milestones: 0.3.0, 0.4.0 Dec 23, 2016

dblado reviewed May 26, 2017

View reviewed changes

kakarukeys mentioned this pull request Jul 25, 2017

support for ES index with routing #51

Open

sliwinski-milosz mentioned this pull request Aug 8, 2017

Mongo connector is lag behind mongodb yougov/mongo-connector#446

Closed

jaraco removed this from the 0.4.0 milestone Nov 4, 2018

jaraco self-assigned this Nov 4, 2018

ShaneHarvey closed this Sep 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support parent-child relationship #25

Support parent-child relationship #25

ShaneHarvey commented Nov 2, 2016 •

edited

Loading

ShaneHarvey commented Nov 2, 2016

benan789 commented Nov 2, 2016

llvtt Nov 3, 2016

llvtt Nov 3, 2016

llvtt Nov 3, 2016

llvtt Nov 3, 2016

benan789 commented Nov 14, 2016

StefanBalea commented Nov 21, 2016

lnader commented Nov 28, 2016

marcuscarr commented Nov 29, 2016

aarestu commented Dec 17, 2016

lnader commented Dec 17, 2016

yomansk8 commented Feb 9, 2017

StefanBalea commented Mar 10, 2017

lnader commented Mar 31, 2017

dblado commented May 23, 2017

dblado May 26, 2017 •

edited

Loading

nVitius commented Jul 19, 2017

sorhent commented Aug 9, 2017

eric-chao commented Sep 19, 2017

bartlettpsj commented Oct 26, 2017

songguohfut commented Apr 24, 2018

ft0907 commented Jun 29, 2018 •

edited

Loading

jaraco commented Nov 4, 2018

ShaneHarvey commented Sep 9, 2019

Support parent-child relationship #25

Support parent-child relationship #25

Conversation

ShaneHarvey commented Nov 2, 2016 • edited Loading

Example Usage

Up for discussion:

ShaneHarvey commented Nov 2, 2016

benan789 commented Nov 2, 2016

llvtt Nov 3, 2016

Choose a reason for hiding this comment

llvtt Nov 3, 2016

Choose a reason for hiding this comment

llvtt Nov 3, 2016

Choose a reason for hiding this comment

llvtt Nov 3, 2016

Choose a reason for hiding this comment

benan789 commented Nov 14, 2016

StefanBalea commented Nov 21, 2016

lnader commented Nov 28, 2016

marcuscarr commented Nov 29, 2016

aarestu commented Dec 17, 2016

lnader commented Dec 17, 2016

yomansk8 commented Feb 9, 2017

StefanBalea commented Mar 10, 2017

lnader commented Mar 31, 2017

dblado commented May 23, 2017

dblado May 26, 2017 • edited Loading

Choose a reason for hiding this comment

nVitius commented Jul 19, 2017

sorhent commented Aug 9, 2017

eric-chao commented Sep 19, 2017

bartlettpsj commented Oct 26, 2017

songguohfut commented Apr 24, 2018

ft0907 commented Jun 29, 2018 • edited Loading

jaraco commented Nov 4, 2018

ShaneHarvey commented Sep 9, 2019

ShaneHarvey commented Nov 2, 2016 •

edited

Loading

dblado May 26, 2017 •

edited

Loading

ft0907 commented Jun 29, 2018 •

edited

Loading