10.5 MDEV 29293 teemu #308

Open
wants to merge 13 commits into
base: 10.5

Commits on Apr 2, 2023

  1. MDEV-29293 MariaDB stuck on starting commit state

    The problem seems to be a deadlock between KILL command execution
    and a BF abort issued by an applier, where:
    * KILL has locked the victim's LOCK_thd_kill and LOCK_thd_data
    * the applier holds the InnoDB-side global lock mutex and the victim trx mutex
    * KILL is calling innobase_kill_query and is blocked by the InnoDB global lock mutex
    * the applier is in wsrep_innobase_kill_one_trx and is blocked by the victim's LOCK_thd_kill
    
    The fix in this commit removes the TOI replication of the KILL command
    and makes KILL execution a less intrusive operation.
    Aborting of the victim now happens through wsrep_abort_thd(),
    which is the same method used for aborting victims of DDL execution.
    wsrep_thd_abort() starts the victim abort from inside InnoDB,
    holding the lock_sys mutex and the victim trx mutex.
    The locking protocol is therefore the same as in the regular applier BF abort procedure.
    wsrep_abort_thd() will eventually also call THD::awake (as a regular KILL would),
    and awake is now passed the user-chosen kill signal when a KILL command is executed.
    Applier BF aborting, on the other hand, uses the KILL_QUERY_HARD signal.
    
    Notable changes in this commit:
    * The wsrep client connection's error state may remain sticky after the client connection is closed.
    The error message then pops up for the next client session when it issues its first SQL statement.
    This problem was raised by the test galera.galera_bf_kill.
    The fix is to reset the wsrep client error state before a THD is reused for the next connection.
    
    * Releasing THD locks in wsrep_abort_transaction when locking InnoDB mutexes;
    this guarantees the same locking order as with applier BF aborting.
    
    * Handling BF aborting of an idle victim of KILL QUERY (and lower signals) with the
    background rollbacker.
    Kill signals higher than KILL_CONNECTION, on the other hand, now skip background rollbacker treatment.
    This is because KILL_CONNECTION wakes up the victim so early that victim execution may
    interfere with the rollbacker execution.
    
    * wsrep-lib now uses a new branch, KILL_command, which changes server_service::background_rollback()
    to return true/false depending on whether background rollback was started.
    
    * Avoiding overwriting the victim THD's error code with a deadlock error
    if the abort was due to a manual KILL; this preserves the native error code for KILL victims.
    sjaakola committed Apr 2, 2023
    efb06ef
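
The deadlock described in the first commit message is a lock-order cycle: each thread holds a lock the other one is waiting for. The sketch below models that with plain string labels and a small wait-for graph. It is an illustration only; `LockGraph`, `add_wait`, `has_cycle`, and `deadlock_scenario` are hypothetical names, and nothing here is MariaDB or wsrep-lib code.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration: lock names are labels taken from the commit
// message, not real MariaDB mutex objects.
using LockGraph = std::map<std::string, std::set<std::string>>;

// Record that a thread holding every lock in `held` is waiting for `wanted`.
void add_wait(LockGraph &g, const std::vector<std::string> &held,
              const std::string &wanted)
{
  for (const auto &h : held)
    g[h].insert(wanted);
  g[wanted];  // make sure the wanted lock exists as a node
}

// Depth-first search; returns true if a cycle is reachable from `node`.
static bool dfs(const LockGraph &g, const std::string &node,
                std::set<std::string> &on_path, std::set<std::string> &done)
{
  if (on_path.count(node)) return true;   // back edge: cycle found
  if (done.count(node)) return false;     // already fully explored
  on_path.insert(node);
  auto it = g.find(node);
  if (it != g.end())
    for (const auto &next : it->second)
      if (dfs(g, next, on_path, done)) return true;
  on_path.erase(node);
  done.insert(node);
  return false;
}

// A cycle in the "held -> wanted" graph means the threads can deadlock.
bool has_cycle(const LockGraph &g)
{
  std::set<std::string> on_path, done;
  for (const auto &kv : g)
    if (dfs(g, kv.first, on_path, done)) return true;
  return false;
}

// The two threads from the commit message.
LockGraph deadlock_scenario()
{
  LockGraph g;
  // KILL: holds LOCK_thd_kill and LOCK_thd_data, blocked on the InnoDB
  // global lock mutex.
  add_wait(g, {"LOCK_thd_kill", "LOCK_thd_data"}, "innodb_global_lock_mutex");
  // Applier: holds the InnoDB global lock mutex and the victim trx mutex,
  // blocked on LOCK_thd_kill.
  add_wait(g, {"innodb_global_lock_mutex", "trx_mutex"}, "LOCK_thd_kill");
  return g;
}
```

Calling `has_cycle(deadlock_scenario())` reports the cycle LOCK_thd_kill -> InnoDB global lock mutex -> LOCK_thd_kill; dropping either thread's wait edge removes it.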

Commits on Apr 3, 2023

  1. 09e3f40

Commits on Apr 11, 2023

  1. 0b23494

Commits on Apr 12, 2023

  1. Reorganize locking/unlocking to happen in the same scope

    To make things more manageable, changed the code so
    that locking and unlocking happen in the same visible scope.
    temeo committed Apr 12, 2023
    4beec47
  2. Temp unlock trx mutex in wsrep_innobase_kill_one_trx()

    This is to allow mutex locking order LOCK_thd_data -> trx mutex,
    which is needed to avoid a race in wsrep_abort_transaction().
    
    The assumption is that lock_sys.mutex is enough to prevent
    the victim from changing its state.
    temeo committed Apr 12, 2023
    bec7cea
  3. Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK

    Some codepaths require more fine-grained locking, and unlocking
    LOCK_thd_kill from wsrep_thd_UNLOCK() might cause an unpleasant
    surprise.
    
    Added explicit calls to wsrep_thd_kill_LOCK/UNLOCK where needed.
    temeo committed Apr 12, 2023
    0347afe
  4. Restore assertions in wsrep_thd_bf_abort()

    Locking order for BF codepaths is now
    
    LOCK_thd_kill (can be omitted if call to awake_no_mutex is not needed)
    lock_sys.mutex
    LOCK_thd_data
    trx mutex
    temeo committed Apr 12, 2023
    710521f
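
Two of the Apr 12 commits describe a pattern that is easy to state and easy to get wrong: keep lock and unlock in the same visible scope, and temporarily release the trx mutex so that LOCK_thd_data can be taken first. A minimal sketch of that pattern with standard C++ mutexes follows. `Victim` and `reorder_locks` are hypothetical stand-ins, not the MariaDB types or functions, and the lock_sys.mutex that the commit assumes stays held across the gap is not modeled.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-ins for the victim's mutexes; not MariaDB types.
struct Victim
{
  std::mutex thd_data;  // stands in for LOCK_thd_data
  std::mutex trx;       // stands in for the trx mutex
};

// Called with `trx_lock` owning victim.trx. Temporarily releases it, then
// takes thd_data followed by trx, so the order is LOCK_thd_data -> trx.
// Returns a log of the steps; on return the caller still owns trx.
std::vector<std::string> reorder_locks(Victim &victim,
                                       std::unique_lock<std::mutex> &trx_lock)
{
  std::vector<std::string> log;
  trx_lock.unlock();
  log.push_back("unlock trx");
  {
    // RAII guard: lock and unlock of thd_data live in this one scope.
    std::lock_guard<std::mutex> data_guard(victim.thd_data);
    log.push_back("lock LOCK_thd_data");
    trx_lock.lock();
    log.push_back("lock trx");
    // ... work that needs both mutexes would go here ...
  }
  log.push_back("unlock LOCK_thd_data");
  return log;
}
```

The gap between `unlock trx` and `lock trx` is only safe if something else keeps the victim from changing state in between, which is the role the commit message assigns to lock_sys.mutex.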

Commits on Apr 13, 2023

  1. Fixed locking order to be

    lock_sys.mutex
    LOCK_thd_kill
    LOCK_thd_data
    trx.mutex
    temeo committed Apr 13, 2023
    8d6ceaf
  2. Deal with the sad fact that wsrep_abort_thd() and

    ha_abort_transaction() return without thd mutexes held.
    
    Add SR worker THDs to server_threads list so they can
    be found via find_thread_by_id() for BF aborting.
    temeo committed Apr 13, 2023
    7700544
  3. Fix kill_one_thread()

    temeo committed Apr 13, 2023
    f850240
  4. Documented wsrep_abort_thd()

    temeo committed Apr 13, 2023
    9e729cf
  5. af2e9c4
  6. Fix abort replicated

    temeo committed Apr 13, 2023
    aae411f
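
The locking order fixed in commit 8d6ceaf (lock_sys.mutex, LOCK_thd_kill, LOCK_thd_data, trx.mutex) can be expressed as a rank per lock, with a per-thread check that ranks only increase. The sketch below is one possible way to assert such an order in debug code. `LockRank` and `LockOrderChecker` are hypothetical names introduced for illustration and do not exist in MariaDB.

```cpp
#include <vector>

// Ranks follow the order listed in the commit message; lower rank first.
enum class LockRank
{
  lock_sys_mutex = 0,
  LOCK_thd_kill = 1,
  LOCK_thd_data = 2,
  trx_mutex = 3
};

// Tracks the locks one thread currently holds (hypothetical helper).
class LockOrderChecker
{
  std::vector<LockRank> held_;

public:
  // Returns true and records the lock if taking `rank` now respects the
  // order; returns false if a lock of equal or higher rank is already held.
  bool acquire(LockRank rank)
  {
    if (!held_.empty() && held_.back() >= rank)
      return false;
    held_.push_back(rank);
    return true;
  }

  // Releases the most recently acquired lock, if any.
  void release()
  {
    if (!held_.empty())
      held_.pop_back();
  }
};
```

Skipping a rank is allowed (for example lock_sys.mutex followed directly by trx.mutex); only going backwards, such as LOCK_thd_kill followed by lock_sys.mutex, is rejected.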