Mongo connector is lag behind mongodb #446
Same issue here, any updates?
The default elastic2_doc_manager indexes documents one at a time, which is inefficient and pretty slow if you have a large volume of operations in the oplog. I was able to increase the speed quite a bit by modifying the doc manager to batch the indexing actions into a bulk request.
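The batching idea can be sketched in a few lines. This is illustrative only (the class and parameter names are not the actual elastic2_doc_manager API): per-document actions are accumulated and sent as one bulk request once a threshold is reached, instead of one network call per operation.

```python
# Minimal sketch of buffering actions for bulk indexing. `send_bulk`
# stands in for a real call such as elasticsearch.helpers.bulk().
class BulkBuffer:
    def __init__(self, flush_size=500, send_bulk=None):
        self.flush_size = flush_size
        self.actions = []
        self.send_bulk = send_bulk or (lambda actions: None)

    def add(self, action):
        # queue one oplog operation; flush when the buffer is full
        self.actions.append(action)
        if len(self.actions) >= self.flush_size:
            self.flush()

    def flush(self):
        # send everything queued so far as a single bulk request
        if self.actions:
            self.send_bulk(self.actions)
            self.actions = []

sent = []
buf = BulkBuffer(flush_size=10, send_bulk=lambda a: sent.append(list(a)))
for i in range(25):
    buf.add({"_op_type": "index", "_id": i, "doc": {"n": i}})
buf.flush()  # drain the remainder
print([len(batch) for batch in sent])  # [10, 10, 5]
```

With a flush size of 500, 10000 oplog entries become ~20 network round trips instead of 10000.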
Same issue here. @jmmk any chance you can share your patch? Thanks in advance :)
@jmmk, pull requests against elastic2-doc-manager are warmly welcomed. :-)
I was looking into this and it seems that mongo-connector itself is the one driving whether a Regardless, I think it'd be great to have either mongo_connector or the doc managers buffer updates for a given period of time and then bulk upsert them.
@luisobo @behackett I have been out of town for a few weeks, but I will try to tidy up my solution and make an upstream PR. @luisobo it's not just the upserts that can/should be batched; it's every operation. Elasticsearch allows a "bulk indexing operation" with any number of inserts/updates/deletes that will be run in the order specified (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). Mongo Connector just passes each oplog operation to the doc manager to handle however it sees fit. In order to properly replicate the database state, the operations must be run "one at a time" in the exact order they happened. So if Mongo Connector tried to do any batching logic, it might not be correct for how the downstream data store needs to handle the operations. But since we know we can batch them in a bulk indexing request and that they will be run in order, we can do that for ES to prevent unnecessary network calls.
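To make the ordering point concrete, here is a sketch of how mixed oplog operations map onto a single bulk request body. The index name and field names are made up for illustration; the point is that inserts, updates, and deletes keep their oplog order inside one NDJSON `_bulk` payload.

```python
import json

# Simulated oplog entries: insert, then update, then delete of one doc.
oplog_ops = [
    ("i", {"_id": 1, "name": "a"}),   # insert
    ("u", {"_id": 1, "name": "b"}),   # update
    ("d", {"_id": 1}),                # delete
]

lines = []
for op, doc in oplog_ops:
    meta = {"_index": "test", "_id": doc["_id"]}
    if op == "i":
        # index action: one metadata line + one source line
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps({"name": doc["name"]}))
    elif op == "u":
        # update action: one metadata line + one partial-doc line
        lines.append(json.dumps({"update": meta}))
        lines.append(json.dumps({"doc": {"name": doc["name"]}}))
    elif op == "d":
        # delete action: metadata line only
        lines.append(json.dumps({"delete": meta}))

# NDJSON body for a single POST /_bulk call, executed in order
body = "\n".join(lines) + "\n"
print(len(lines))  # 5
```

One request replaces three, and Elasticsearch applies the actions in the order they appear in the body.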
Same issue here. I can't wait till a solution comes :)
Made a quick gist to demonstrate the improvements to elastic2_doc_manager. You can see a diff of the changes I've made here: https://www.diffchecker.com/qybzkblm Note: I have not tested this code, but it should be a good starting point.
Thanks @jmmk. I'm gonna test this out.
What I've noticed is that mongo-connector, on operation == "u", calls docman.update. And the elastic2_doc_manager update function starts with self.commit(), which means it clears the buffer even before it has been filled. I am also not sure if it handles the case when the buffer is not full yet but mongo-connector has already gone through the whole oplog -> I mean, we should send a bulk and empty the buffer once mongo-connector finishes going through the oplog. EDIT: So after I:
It works very well! It is able to sync 10000 lines from the oplog in ~15 sec. I also added a check to call commit only if self.action_buffer is not empty (this is related to point 2).
@sliwinski-milosz that call to Also noticed the You can see the two changes here: https://gist.github.com/jmmk/b3342508b6a805f51101e53fb9d9df86/revisions
Superb, thank you very much @jmmk. Regarding auto commit: I don't want to call my commits periodically. Rather, I would like to call it only when some lines from the oplog have been processed. That is why I mentioned a flush inside mongo-connector after it exits the oplog loop.
I think auto commit is still your best bet, but you could just add an additional check inside commit or inside run_auto_commit:

```python
def commit(self):
    if len(self.action_buffer):
        pass  # do stuff

# OR

def run_auto_commit(self):
    if len(self.action_buffer):
        self.commit()
    if self.auto_commit_interval not in [None, 0]:
        Timer(self.auto_commit_interval, self.run_auto_commit).start()
```
@jmmk you are right. Auto commit is my best bet and I started to use it. There is only one issue in the provided code: self.action_buffer = [] needs to be moved before the first call of run_auto_commit, as it is used by self.commit(). Thank you very much again, now it works much better!
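The initialisation-order fix can be sketched as follows. The names follow the gist under discussion but are illustrative, not the upstream API; the key point is that the buffer attribute must be assigned before run_auto_commit() is first called, because run_auto_commit() -> commit() reads it.

```python
from threading import Timer

class DocManagerSketch:
    def __init__(self, auto_commit_interval=None):
        self.auto_commit_interval = auto_commit_interval
        self.action_buffer = []   # must be assigned BEFORE the next line
        self.run_auto_commit()    # safe now: commit() can read the buffer

    def commit(self):
        if self.action_buffer:
            # real code would send the buffered actions as one bulk request
            self.action_buffer = []

    def run_auto_commit(self):
        self.commit()
        # reschedule only when a periodic interval is configured
        if self.auto_commit_interval not in [None, 0]:
            Timer(self.auto_commit_interval, self.run_auto_commit).start()

dm = DocManagerSketch()           # no AttributeError on startup
dm.action_buffer.append(("index", {"_id": 1}))
dm.commit()
print(len(dm.action_buffer))  # 0
```

Swapping the two lines in `__init__` reproduces the reported crash: the first timer-driven commit would touch `self.action_buffer` before it exists.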
I faced an issue: if I do an insert and then have several updates right after EDIT:
Some workaround, before we have a better solution, would be to set a flag on insert -> insert_in_queue, and once your oplog-manager wants to do an "update", then if the flag is true -> commit before the update and set the flag back to false. Thanks to that, if you have 7000 insert operations and then one update operation, it will still be able to use bulk. P.S. Now we know why there was a commit inside the doc manager update function.
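The flag workaround can be sketched like this (class and method names are illustrative, not the real doc manager API): updates force a flush of any pending inserts, while runs of consecutive inserts still batch into a single bulk request.

```python
class DocManagerSketch:
    def __init__(self):
        self.action_buffer = []
        self.insert_in_queue = False  # the workaround flag
        self.commits = 0

    def commit(self):
        if self.action_buffer:
            self.commits += 1         # real code: send one bulk request here
            self.action_buffer = []
        self.insert_in_queue = False

    def upsert(self, doc):
        # inserts just accumulate; mark that one is pending
        self.action_buffer.append(("index", doc))
        self.insert_in_queue = True

    def update(self, doc_id, update_spec):
        # an update must see the inserted document, so flush first
        if self.insert_in_queue:
            self.commit()
        self.action_buffer.append(("update", doc_id, update_spec))

dm = DocManagerSketch()
for i in range(3):
    dm.upsert({"_id": i})
dm.update(0, {"$set": {"x": 1}})   # triggers one commit for the 3 inserts
print(dm.commits, len(dm.action_buffer))  # 1 1
```

So 7000 inserts followed by one update cost two bulk requests instead of 7001 single-document calls.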
@hungvotrung @sliwinski-milosz I forgot about that - all the inserts must be committed because the update pulls the document from Elasticsearch. It could be slower if you simply add the Alternatively, you can make some more modifications. Right now, when an update operation comes in, it immediately attempts to fetch the document from ES. You can make this faster by batching updates, but it will be more complex.
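The batched-update idea can be sketched as follows. This is a simplified model, not the real elastic2_doc_manager code: the in-memory `es_store` dict stands in for the cluster, and one multi-get replaces a round trip per update.

```python
# Simulated Elasticsearch document store (doc _id -> _source).
es_store = {1: {"name": "a"}, 2: {"name": "b"}}

# Queued update operations gathered from the oplog: (_id, partial doc).
pending_updates = [(1, {"name": "a2"}), (2, {"name": "b2"})]

# One "mget" for every document the queued updates touch
# (real code would call es.mget once instead of es.get per update).
ids = [doc_id for doc_id, _ in pending_updates]
sources = {i: dict(es_store[i]) for i in ids}

# Apply each update spec to its fetched source, preserving oplog order.
for doc_id, spec in pending_updates:
    sources[doc_id].update(spec)

# One bulk request writes all modified documents back.
es_store.update(sources)
print(es_store[1]["name"], es_store[2]["name"])  # a2 b2
```

The complexity the comment warns about is real: the fetched sources go stale if an insert for the same `_id` is still sitting in the buffer, which is exactly the bug discussed below.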
|
Summary from my test lab (mixing 1000 (i) and 13000 (u)): the first @jmmk commit already did the magic, but the later change is even better. Normal took ~40 mins to finish, but the new change only needs 5 mins (actual time is shorter because mongo-connector has to wait for the application to process the data).
Hey guys, I prepared a solution based on the hints provided by @jmmk. With my solution, elastic2_doc_manager is able to do bulk insert and update operations. It also does a multi-get request to ES to get sources for queued operations. I have tested it for a while and it seems to work pretty well. So now I need your help with testing. @hungvotrung could you please test my solution and provide that nice summary? This time there should be a lot fewer Elasticsearch calls. You can find the code here: And here is a diff in comparison to the default elastic2_doc_manager:
@sliwinski-milosz, does the code you've written pass the existing unit tests in mongo-connector and elastic2-doc-manager? Those would be good starting points for testing. You might also consider writing some tests around BulkBuffer. When you're ready, you should turn your code into a pull request against elastic2-doc-manager. @jmmk @sliwinski-milosz @hungvotrung Thank you all for your hard work on this. |
I found one case which is not handled by my solution.
I wonder if the above case is even handled in the default elastic2_doc_manager.
@sliwinski-milosz: your new code is lightning fast Edit: 2016-07-07 10:45:23,582 [INFO] elasticsearch:63 - POST http://10.10.y.z:9200/_bulk [status:200 request:0.365s]
@hungvotrung Thank you very much for testing! There are two threads: mongo-connector and a timer (for auto_commit purposes). It refreshes Elasticsearch in two cases:
Because of these two threads, it has to call refresh before mget. I put the Ad2 refresh outside the lock, just to not block mongo-connector.
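The two-thread layout can be sketched like this (illustrative names, not the upstream API): the oplog thread and the auto-commit timer share the action buffer under a lock, while the slow bulk send and index refresh happen outside the lock so the oplog thread is never blocked on I/O.

```python
import threading

class DocManagerSketch:
    def __init__(self):
        self.lock = threading.Lock()
        self.action_buffer = []
        self.sent = []

    def add_action(self, action):
        # called by the oplog thread for every replicated operation
        with self.lock:
            self.action_buffer.append(action)

    def commit(self):
        # called by the auto-commit timer thread: swap the buffer out
        # quickly under the lock...
        with self.lock:
            actions, self.action_buffer = self.action_buffer, []
        # ...then do the slow work outside the lock; real code would
        # bulk-send `actions` and refresh the index here
        self.sent.extend(actions)

dm = DocManagerSketch()
threads = [threading.Thread(target=dm.add_action, args=(i,))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
dm.commit()
print(len(dm.sent), len(dm.action_buffer))  # 50 0
```

The swap-under-lock pattern keeps the critical section tiny, which is why the refresh can sit outside the lock without losing any buffered actions.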
Bug: still related to the scenario where the application inserts a new document and then follows with several updates on that newly inserted doc. In this case mongo-connector fails to send the new doc to Elasticsearch, since mget cannot find the doc in Elasticsearch and doesn't throw any exception. To fix this (and avoid losing data) I still have to commit immediately after any insert in mongodb. Other than that, bulk insert seems to work well.
@hungvotrung that is interesting. You can print missing documents here:
I have been using that logic for 9 days without any issues (no missing documents).
The logic to handle your case is there - let's investigate the issue.
So I turned on the debug log to get more detail (1 insert followed by several updates): mget
You can see mget cannot get the newly inserted doc, hence it cannot apply the later updates, and when the bulk happens the new doc will be overwritten by the later update --> you get an empty doc in Elasticsearch. So an insert with an update right after will fail; hence it's important to commit for each insert, otherwise we need to look for another way to make sure data is replicated correctly :)
Hey @hungvotrung, I found the issue. It was about the type of the "_id" variable. The oplog (via the update() function) provides "_id" as an integer, but inside my class (locally) _source is stored under a unicode "_id". That is why it was not able to find _source locally and tried to get it from Elasticsearch instead. You can find the diff here: Could you please test it again? I made some smoke tests and it seems to work fine.
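The `_id` type mismatch can be shown with a tiny sketch (the function names are made up for illustration): if the cached `_source` is keyed by the unicode form of `_id` while the oplog hands `update()` an integer `_id`, lookups silently miss. Normalising the key to `str` on both write and read makes the cache hit.

```python
# Local cache of document sources, as kept by the doc manager sketch.
doc_sources = {}

def cache_source(doc_id, source):
    # always store under str(_id), regardless of the caller's type
    doc_sources[str(doc_id)] = source

def get_source(doc_id):
    # always look up under str(_id); int and unicode callers now agree
    return doc_sources.get(str(doc_id))

cache_source(42, {"name": "a"})   # cached while handling the insert
print(get_source(42))             # an int _id from the oplog still hits
```

Without the `str()` normalisation, `doc_sources[42]` and `doc_sources[u"42"]` are different keys, so the lookup misses and the code falls back to an mget for a document that has not been committed yet, which is exactly the silent failure observed.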
Got this exception: mongo_connector.util:92 - Fatal Exception
@hungvotrung many thanks for testing!! I found the root cause: it was related to storing _source locally. In one case I stored it after Formatter.format_document, in the other before doing so. Now in both cases I store the unformatted source. If you still want to try ;-): This time I tested it more deeply:
|
@sliwinski-milosz sweet, just did several runs, seems to work fine now
Good to hear! :) Thank you for your patience. Now it is time to prepare some tests and make a pull request. I will do that soon. |
Thanks for everyone's time! |
Hi, no updates yet at least from my side. I am still working on tests... maybe this weekend I will be able to finally make a pull request. |
@hungvotrung could you please, for the very last time, prepare your nice summary with the latest versions of mongo-connector and elastic2_doc_manager (0.2.0 vs 0.3.0)? Then I think we can close this issue. Please just remember to set autoCommitInterval.
@sliwinski-milosz
|
Hi @sorhent :) Ad. 1. I think you should rather find the reason why any of those operations fails - if you simply ignore exceptions, you will end up with mongodb<->elasticsearch not fully synchronised. Ad. 2. It seems the parent-child relationship is not ready yet: yougov/elastic2-doc-manager#25 - maybe you can help them finish the implementation :) You can check this issue as well: #678. Maybe that is the reason for your exceptions.
Hi again @sliwinski-milosz,
|
That is actually quite an interesting case :) So steps to reproduce:
@ShaneHarvey do you know if the above case is handled by mongo-connector?
@sliwinski-milosz, taking into account that the issue belongs to the https://github.com/mongodb-labs/elastic2-doc-manager repository, I created an issue related to the bulk operation (yougov/elastic2-doc-manager#52) on that repository.
@sliwinski-milosz I found an http auth problem in your doc_manager. If the
@Ricky-Hao the issue you have mentioned is not related to the lag between Elasticsearch and MongoDB. Could you please open a new issue regarding the http auth problem? As it is related to I also think this issue can be closed, as the lagging has been fixed.
Thanks for the update @sliwinski-milosz . |
Hi guys,
We are using mongo-connector to feed data from mongodb to elasticsearch. Everything went well until today, when we had a big insert/update to mongodb and mongo-connector started falling behind, taking hours to catch up.
Here is the mongo-connector config:
Elasticsearch is setup as cluster of 3 servers.
I'm looking for any way to make mongo-connector tail and update data faster.
Thanks everyone.