Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Truncate record failing when attempting to recover broken record #10286

Open
timw opened this issue Aug 26, 2024 · 5 comments
Open

Truncate record failing when attempting to recover broken record #10286

timw opened this issue Aug 26, 2024 · 5 comments
Milestone

Comments

@timw
Copy link
Contributor

timw commented Aug 26, 2024

OrientDB Version: 3.1.21-SNAPSHOT (3.1.20 + our HA/replicated stability patches)

Java Version: 1.8

OS: Linux/arm64

We have a database in production with 9 broken records: #24:5773252 through #24:5773260
We suspect they were created by an unclean shutdown of the database server.

When these records are accessed (e.g. a SELECT or a DELETE), they fail with a BufferUnderflowException.
Reading some other similar issues, the solution recommended in those is to use TRUNCATE RECORD, however this is also failing.

Results for the first broken record:

orientdb {db=index}> select from #24:5773252

!ERROR:com.orientechnologies.orient.core.exception.ODatabaseException: Error on retrieving record #24:5773252 (cluster: searchdata)
        DB name="index"
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.executeReadRecord(ODatabaseDocumentEmbedded.java:1546)
        at com.orientechnologies.orient.core.tx.OTransactionNoTx.loadRecord(OTransactionNoTx.java:123)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:1036)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:136)
        at com.orientechnologies.orient.core.sql.executor.FetchFromRidsStep$1.fetchNext(FetchFromRidsStep.java:51)
        at com.orientechnologies.orient.core.sql.executor.FetchFromRidsStep$1.hasNext(FetchFromRidsStep.java:68)
        at com.orientechnologies.orient.core.sql.parser.OLocalResultSet.fetchNext(OLocalResultSet.java:37)
        at com.orientechnologies.orient.core.sql.parser.OLocalResultSet.<init>(OLocalResultSet.java:27)
        at com.orientechnologies.orient.core.sql.parser.OSelectStatement.execute(OSelectStatement.java:276)
        at com.orientechnologies.orient.core.sql.parser.OStatement.execute(OStatement.java:72)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.query(ODatabaseDocumentEmbedded.java:585)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.select(OConsoleDatabaseApp.java:1170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.orientechnologies.common.console.OConsoleApplication.execute(OConsoleApplication.java:445)
        at com.orientechnologies.common.console.OConsoleApplication.executeCommands(OConsoleApplication.java:283)
        at com.orientechnologies.common.console.OConsoleApplication.run(OConsoleApplication.java:144)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.main(OConsoleDatabaseApp.java:159)
Caused by: java.nio.BufferUnderflowException
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:270)
        at java.nio.ByteBuffer.get(ByteBuffer.java:723)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurablePage.getBinaryValue(ODurablePage.java:204)
        at com.orientechnologies.orient.core.storage.cluster.OClusterPage.getRecordBinaryValue(OClusterPage.java:545)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readFullEntry(OPaginatedClusterV2.java:1391)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.internalReadRecord(OPaginatedClusterV2.java:700)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:657)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:640)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doReadRecord(OAbstractPaginatedStorage.java:5431)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:5141)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:1892)
        at com.orientechnologies.orient.core.db.document.SimpleRecordReader.readRecord(SimpleRecordReader.java:24)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.executeReadRecord(ODatabaseDocumentEmbedded.java:1493)
        ... 19 more

orientdb {db=index}> truncate record #24:5773252


!ERROR:com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on executing command
        DB name="index"
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:57)
        at com.orientechnologies.orient.core.sql.executor.OSingleOpExecutionPlan.executeInternal(OSingleOpExecutionPlan.java:95)
        at com.orientechnologies.orient.core.sql.parser.OSimpleExecStatement.execute(OSimpleExecStatement.java:45)
        at com.orientechnologies.orient.core.sql.parser.OStatement.execute(OStatement.java:72)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.command(ODatabaseDocumentEmbedded.java:647)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.sqlCommand(OConsoleDatabaseApp.java:3469)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.truncateRecord(OConsoleDatabaseApp.java:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.orientechnologies.common.console.OConsoleApplication.execute(OConsoleApplication.java:445)
        at com.orientechnologies.common.console.OConsoleApplication.executeCommands(OConsoleApplication.java:283)
        at com.orientechnologies.common.console.OConsoleApplication.run(OConsoleApplication.java:144)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.main(OConsoleDatabaseApp.java:159)
Caused by: com.orientechnologies.orient.core.exception.OStorageException: Exception during execution of atomic operation inside of storage index
        DB name="index"
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:144)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.deleteRecord(OAbstractPaginatedStorage.java:2002)
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:47)
        ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: 1330403569
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:252)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurablePage.getByteValue(ODurablePage.java:245)
        at com.orientechnologies.orient.core.storage.cluster.OClusterPage.getRecordByteValue(OClusterPage.java:568)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.getPhysicalPosition(OPaginatedClusterV2.java:1010)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doDeleteRecord(OAbstractPaginatedStorage.java:5386)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.lambda$deleteRecord$8(OAbstractPaginatedStorage.java:2003)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:138)
        ... 16 more

Subsequent broken records all fail similarly, but with distinct indexes:

orientdb {db=index}> truncate record #24:5773253                             


!ERROR:com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on executing command
        DB name="index"
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:57)
        at com.orientechnologies.orient.core.sql.executor.OSingleOpExecutionPlan.executeInternal(OSingleOpExecutionPlan.java:95)
        at com.orientechnologies.orient.core.sql.parser.OSimpleExecStatement.execute(OSimpleExecStatement.java:45)
        at com.orientechnologies.orient.core.sql.parser.OStatement.execute(OStatement.java:72)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.command(ODatabaseDocumentEmbedded.java:647)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.sqlCommand(OConsoleDatabaseApp.java:3469)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.truncateRecord(OConsoleDatabaseApp.java:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.orientechnologies.common.console.OConsoleApplication.execute(OConsoleApplication.java:445)
        at com.orientechnologies.common.console.OConsoleApplication.executeCommands(OConsoleApplication.java:283)
        at com.orientechnologies.common.console.OConsoleApplication.run(OConsoleApplication.java:144)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.main(OConsoleDatabaseApp.java:159)
Caused by: com.orientechnologies.orient.core.exception.OStorageException: Exception during execution of atomic operation inside of storage index
        DB name="index"
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:144)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.deleteRecord(OAbstractPaginatedStorage.java:2002)
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:47)
        ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: 1752187594
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:252)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurablePage.getByteValue(ODurablePage.java:245)
        at com.orientechnologies.orient.core.storage.cluster.OClusterPage.getRecordByteValue(OClusterPage.java:568)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.getPhysicalPosition(OPaginatedClusterV2.java:1010)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doDeleteRecord(OAbstractPaginatedStorage.java:5386)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.lambda$deleteRecord$8(OAbstractPaginatedStorage.java:2003)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:138)
        ... 16 more

Short of getting a fix for this, are there any other steps (short of exporting/importing the data) that can be taken to recover this database?
(The records in question are derived, so simply removing them is acceptable in this situation).
Unfortunately the data in the databases is confidential information, so we can't provide a copy for examination, but the engineering team supporting it is quite experienced with btree structures, so we can probably manage detailed file investigation/patching if we can get guidance on what is going on.

@timw
Copy link
Contributor Author

timw commented Aug 26, 2024

On the same database, we're observing a different persistent error (same cluster, different records):

java.lang.NullPointerException
at com.orientechnologies.orient.core.storage.cache.chm.AsyncReadCache.releaseFromRead(AsyncReadCache.java:310)
at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurableComponent.releasePageFromRead(ODurableComponent.java:167)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readFullEntry(OPaginatedClusterV2.java:1409)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.internalReadRecord(OPaginatedClusterV2.java:700)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:657)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:640)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doReadRecord(OAbstractPaginatedStorage.java:5431)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:5141)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:1892)
at com.orientechnologies.orient.server.distributed.impl.ODistributedStorage.readRecord(ODistributedStorage.java:173)
at com.orientechnologies.orient.core.db.document.SimpleRecordReader.readRecord(SimpleRecordReader.java:24)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.executeReadRecord(ODatabaseDocumentEmbedded.java:1493)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.loadRecord(OTransactionOptimistic.java:208)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.loadRecord(OTransactionOptimistic.java:174)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.loadRecord(OTransactionOptimistic.java:335)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:1036)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:136)
at com.orientechnologies.orient.core.id.ORecordId.getRecord(ORecordId.java:328)
at com.orientechnologies.orient.core.sql.executor.OResultInternal.setElement(OResultInternal.java:416)
at com.orientechnologies.orient.core.sql.executor.FetchFromIndexedFunctionStep$1.next(FetchFromIndexedFunctionStep.java:68)
at com.orientechnologies.orient.core.sql.executor.FilterByClustersStep$1.fetchNextItem(FilterByClustersStep.java:72)
at com.orientechnologies.orient.core.sql.executor.FilterByClustersStep$1.hasNext(FilterByClustersStep.java:94)
at com.orientechnologies.orient.core.sql.executor.LetExpressionStep$1.hasNext(LetExpressionStep.java:35)
at com.orientechnologies.orient.core.sql.executor.FilterStep$1.fetchNextItem(FilterStep.java:54)
at com.orientechnologies.orient.core.sql.executor.FilterStep$1.hasNext(FilterStep.java:87)
at com.orientechnologies.orient.core.sql.executor.FilterByClassStep$1.fetchNextItem(FilterByClassStep.java:50)
at com.orientechnologies.orient.core.sql.executor.FilterByClassStep$1.hasNext(FilterByClassStep.java:82)
at com.orientechnologies.orient.core.sql.executor.LetExpressionStep$1.hasNext(LetExpressionStep.java:35)
at com.orientechnologies.orient.core.sql.executor.ProjectionCalculationStep$1.hasNext(ProjectionCalculationStep.java:31)
at com.orientechnologies.orient.core.sql.executor.DistinctExecutionStep.fetchNext(DistinctExecutionStep.java:96)
at com.orientechnologies.orient.core.sql.executor.DistinctExecutionStep.access$100(DistinctExecutionStep.java:15)
at com.orientechnologies.orient.core.sql.executor.DistinctExecutionStep$1.hasNext(DistinctExecutionStep.java:53)
at com.orientechnologies.orient.core.sql.executor.ProjectionCalculationStep$1.hasNext(ProjectionCalculationStep.java:31)
at com.orientechnologies.orient.core.sql.parser.OLocalResultSet.hasNext(OLocalResultSet.java:53)
at com.orientechnologies.orient.core.sql.parser.OLocalResultSetLifecycleDecorator.hasNext(OLocalResultSetLifecycleDecorator.java:41)
at com.orientechnologies.orient.core.sql.executor.OResultSet.tryAdvance(OResultSet.java:38)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at com.orientechnologies.orient.server.OConnectionBinaryExecutor.executeQuery(OConnectionBinaryExecutor.java:1513)
at com.orientechnologies.orient.client.remote.message.OQueryRequest.execute(OQueryRequest.java:143)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:355)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:239)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:67)

@tglman tglman added this to the 3.1.x milestone Aug 26, 2024
@timw
Copy link
Contributor Author

timw commented Aug 29, 2024

@tglman - any chance of getting some pointers on this one?
Our only path to resolving this at present is migrating some of the data to another database and rebuilding the rest, which is a lot of effort.

@tglman
Copy link
Member

tglman commented Aug 29, 2024

So, going strait to how to recover:

OrientDB store the cluster data in two files one with is the id file that keep the position where the data is stored and some minimal metadata, and the other where the data is stored, the one that is failing is the one that keep the data. So should be possible to mark the record as delete and create a "leak" in the data file.

If you are happy to hack a bit in the code, this:
https://github.com/orientechnologies/orientdb/blob/3.1.x/core/src/main/java/com/orientechnologies/orient/core/storage/cluster/v2/OPaginatedClusterV2.java#L790-L843

is the part that remove the data from the data file, so is the part that is failing on truncate record. If you comment that out and make sure to keep this call: https://github.com/orientechnologies/orientdb/blob/3.1.x/core/src/main/java/com/orientechnologies/orient/core/storage/cluster/v2/OPaginatedClusterV2.java#L847

you should be able to run a truncate record that logically delete the data without actually remove that from the data file and triggering the issue.

Obviously if you do this hack use that code just to remove the failed records, do not run in production will cause record to leak data, even if you change the code to handle the previous exception without deleting it it may cause some other issues, so better to not deploy hack in production.

Going to the origin of the bug, also in this case it feel something I already saw, checking the history of the file in 3.2.x https://github.com/orientechnologies/orientdb/commits/3.2.x/core/src/main/java/com/orientechnologies/orient/core/storage/cluster/v2/OPaginatedClusterV2.java I do have a commit with a fix that is not present in 3.1.x in the specific: 21c49f9

by the way, is there anything that is holding you in 3.1.x ? it may be a good to try to switch to 3.2.x

Bye

@timw
Copy link
Contributor Author

timw commented Aug 30, 2024

Thanks @tglman

When you say do not run in production, I'm assuming you mean we could use this offline with a hacked OrientDB to fix these records, but not let that hacked delete/truncate code run in production as it would cause other issues?

Would it make sense to make this a FORCE mode on TRUNCATE RECORD?

Similarly, would this be something the check/repair database should be able to detect and rectify, or is the DB structure too broken at that point? The problem is that you can't do a complete database export with these broken records in place (thankfully the affected records in our case are generated, so we can do a partial export/import + regenerate in this case), so having a repair tool that could patch them up enough to get back to a workable state (e.g. that we could manually fix the graph on or run a REPAIR -fix-graph to completion) and then allow the user to export/import to recover the leaked record space if required would add a lot of reassurance.

We're planning to get to 3.2.x, but we need to find time to stress test the distributed code in 3.2 (which has changed a fair amount since 3.1). Given we're running into a lot of things fixed in 3.2 now, we might look to bring that forward.

@tglman
Copy link
Member

tglman commented Aug 30, 2024

Hi,

When you say do not run in production, I'm assuming you mean we could use this offline with a hacked OrientDB to fix these records, but not let that hacked delete/truncate code run in production as it would cause other issues?

Yes

Would it make sense to make this a FORCE mode on TRUNCATE RECORD?

Yes, but this require some proper development time, as today the low level code is shared between the normal delete and the truncate, so there is need of some work to spit the logic and make sure to have the option of "leak the data"

And yes as well as soon a truncate is done it could be used in a check/ repair tool.

Anyway as today there is something that can be used already in the database export that is -excludeClass and -excludeCluster that can be used to ignore all the data of a specific class/cluster, which in the case of specific data that can be rebuild will allow you to export all the rest.

Bye

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants
@tglman @timw and others