Add a database bootstrap guide #9390

Open · wants to merge 9 commits into main
4 changes: 4 additions & 0 deletions docs/database/README.md
@@ -304,3 +304,7 @@ is expected to migrate full mainnet data in 10 days.
## Citus Backup and Restore

Please refer to this [document](/docs/database/citus.md) for the steps.

## Bootstrap a DB from exported data

Please refer to this [document](/docs/database/bootstrap.md) for instructions.
344 changes: 344 additions & 0 deletions docs/database/bootstrap.md
@@ -0,0 +1,344 @@
# Database Bootstrap Guide

This guide provides step-by-step instructions for setting up a fresh PostgreSQL database and importing Mirror Node data into it. The process involves initializing the database, configuring environment variables, and running the import script. The data import is a long-running process, so it's recommended to run it within a `screen` or `tmux` session.

---

## Table of Contents

- [Prerequisites](#prerequisites)
- [Database Initialization](#database-initialization)
  - [1. Configure Environment Variables](#1-configure-environment-variables)
  - [2. Important Note for Google Cloud SQL Users](#2-important-note-for-google-cloud-sql-users)
  - [3. Run the Initialization Script](#3-run-the-initialization-script)
  - [4. Import the Database Schema](#4-import-the-database-schema)
- [Data Import Process](#data-import-process)
  - [1. Download the Database Export Data](#1-download-the-database-export-data)
  - [2. Download the Import Script](#2-download-the-import-script)
  - [3. Run the Import Script](#3-run-the-import-script)
- [Handling Failed Imports](#handling-failed-imports)
  - [Steps to Handle Failed Imports](#steps-to-handle-failed-imports)
- [Troubleshooting](#troubleshooting)

---

## Prerequisites

1. **Version Compatibility**

Before initializing your Mirror Node with the imported database, it's crucial to ensure version compatibility.

**MIRRORNODE_VERSION File:**

- In the database export data, there is a file named `MIRRORNODE_VERSION`.
- This file contains the version of the Mirror Node at the time of the database export.

**Importance:**

- Your Mirror Node instance must be initialized with the **same version** as specified in the `MIRRORNODE_VERSION` file.
- Using a different version may lead to compatibility issues and/or schema mismatches.

**Action Required:**

1. **Check the Mirror Node Version:**

- Open the `MIRRORNODE_VERSION` file:

```bash
cat /path/to/db_export/MIRRORNODE_VERSION
```

- Note the version number specified.

2. **PostgreSQL 16** installed and running.
3. Access to a machine where you can run the initialization and import scripts and connect to the PostgreSQL database.
4. A Google Cloud Platform (GCP) account with a valid billing account attached (required for downloading data from a Requester Pays bucket).
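
Before moving on, a quick sanity check covering items 1 and 2 can save a failed run later (the export path is illustrative):

```bash
# Confirm the export's Mirror Node version and your local PostgreSQL version.
cat /path/to/db_export/MIRRORNODE_VERSION
psql --version   # should report PostgreSQL 16.x
```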

---

## Database Initialization

### 1. Configure Environment Variables

Set the following environment variables on the machine from which you will run the initialization and import scripts. These variables allow for database connectivity and authentication.

**Database Connection Variables:**

```bash
export PGUSER="postgres"
export PGPASSWORD="YOUR_POSTGRES_PASSWORD"
export PGDATABASE="postgres"
export PGHOST="DB_IP_ADDRESS"
export PGPORT="DB_PORT"
```

- `PGUSER`: The PostgreSQL superuser with administrative privileges (typically `postgres`).
- `PGPASSWORD`: Password for the PostgreSQL superuser.
- `PGDATABASE`: The default database to connect to (`postgres` by default).
- `PGHOST`: The IP address or hostname of your PostgreSQL database server.
- `PGPORT`: The database server port number (`5432` by default).
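
With these variables exported, you can confirm connectivity before going further (`psql` picks up the `PG*` variables automatically):

```bash
# Should print the PostgreSQL server version if the connection succeeds.
psql -c 'SELECT version();'
```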



**Database User Password Variables:**

Set the following environment variables to define passwords for the various database users that will be created during initialization.

```bash
export GRAPHQL_PASSWORD="SET_PASSWORD"
export GRPC_PASSWORD="SET_PASSWORD"
export IMPORTER_PASSWORD="SET_PASSWORD"
export OWNER_PASSWORD="SET_PASSWORD"
export REST_PASSWORD="SET_PASSWORD"
export REST_JAVA_PASSWORD="SET_PASSWORD"
export ROSETTA_PASSWORD="SET_PASSWORD"
export WEB3_PASSWORD="SET_PASSWORD"
```

- Replace `SET_PASSWORD` with strong, unique passwords for each respective user.

- **Security Note:** Ensure that the passwords set in the environment variables are kept secure and not exposed in logs or command history.
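
One convenient way to generate a strong random password for each user (assuming `openssl` is available):

```bash
# Example for one variable; repeat per user and record the values securely,
# e.g. in a password manager.
export GRAPHQL_PASSWORD="$(openssl rand -base64 24)"
```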

### 2. Important Note for Google Cloud SQL Users

If you are using **Google Cloud SQL** for your PostgreSQL database, you'll need to set an additional environment variable:
```bash
export IS_GCP_CLOUD_SQL="true"
```
*Note*: For non-Google Cloud SQL environments, you do not need to set this variable, as it defaults to `false`.

### 3. Run the Initialization Script

Download the initialization script [`init.sh`](/hedera-mirror-importer/src/main/resources/db/scripts/init.sh) from the repository:
```bash
curl -O https://raw.githubusercontent.com/hashgraph/hedera-mirror-node/main/hedera-mirror-importer/src/main/resources/db/scripts/init.sh
chmod +x init.sh
```
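
The `main` branch script may drift from older exports. To match the export exactly, you can instead fetch `init.sh` at the release recorded in `MIRRORNODE_VERSION` (a sketch, assuming releases are tagged as `vX.Y.Z` in the repository; substitute the version you noted earlier):

```bash
# Hypothetical tag shown; use the version from your MIRRORNODE_VERSION file.
VERSION="v0.110.0"
curl -O "https://raw.githubusercontent.com/hashgraph/hedera-mirror-node/${VERSION}/hedera-mirror-importer/src/main/resources/db/scripts/init.sh"
```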

Run the initialization script:

```bash
./init.sh
echo "EXIT STATUS: $?"
```

- The exit status `0` indicates the script executed successfully.
- The script will create the `mirror_node` database, along with all necessary roles, users, and permissions within your PostgreSQL database, using the passwords specified in the environment variables.
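
To spot-check the result, list the roles that now exist (using `psql`'s `\du` meta-command); you should see entries for the users whose passwords you set above:

```bash
psql -c '\du'
```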

### 4. Import the Database Schema

After the initialization script completes successfully, update the environment variables to connect using the `mirror_node` user and database:

```bash
export PGUSER="mirror_node"
export PGPASSWORD="$OWNER_PASSWORD" # Use the password set for OWNER_PASSWORD
export PGDATABASE="mirror_node"
```

Import the database schema. Note that `schema.sql`, like `MIRRORNODE_VERSION`, ships with the database export data, so complete [the export download](#1-download-the-database-export-data) described below before running this step:

```bash
psql -f schema.sql
echo "EXIT STATUS: $?"
```

- Ensure the exit status is `0` to confirm the schema was imported successfully.
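
As an optional sanity check, confirm the schema's tables were created:

```bash
# Lists the tables in the mirror_node database.
psql -c '\dt' | head
```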

---

## Data Import Process

### 1. Download the Database Export Data

The Mirror Node database export data is available in a Google Cloud Storage (GCS) bucket:

- **Bucket URL:** [mirrornode-db-export](https://console.cloud.google.com/storage/browser/mirrornode-db-export)

**Important Notes:**

- The bucket is **read-only** to the public.
- It is configured as **Requester Pays**, meaning you need a GCP account with a valid billing account attached to download the data.
- You will be billed for the data transfer fees incurred during the download.

**Download Instructions:**

1. **Authenticate with GCP:**

Ensure you have the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) installed and authenticated:

```bash
gcloud auth login
gcloud config set billing/disable_usage_reporting false
```

2. **Set the Default Project:**

```bash
gcloud config set project YOUR_GCP_PROJECT_ID
```

3. **Download the Data:**

Create an empty directory to store the data and download all files and subdirectories:

```bash
mkdir -p /path/to/db_export
gsutil -u YOUR_GCP_PROJECT_ID -m cp -r gs://mirrornode-db-export/* /path/to/db_export/
```

- Replace `/path/to/db_export` with your desired directory path.
- Ensure all files and subdirectories are downloaded into this single parent directory.
- **Note:** The `-m` flag enables parallel downloads to speed up the process.
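
To verify that everything arrived, you can compare the bucket's total size with your local copy (an optional check; `gsutil du -s` sums object sizes and bills the request to your project via `-u`):

```bash
gsutil -u YOUR_GCP_PROJECT_ID du -sh gs://mirrornode-db-export
du -sh /path/to/db_export
```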

### 2. Download the Import Script

Download the import script `bootstrap.sh` from the repository:

```bash
curl -O https://raw.githubusercontent.com/hashgraph/hedera-mirror-node/main/hedera-mirror-importer/src/main/resources/db/scripts/bootstrap.sh
chmod +x bootstrap.sh
```

### 3. Run the Import Script

The import script is designed to efficiently import the Mirror Node data into your PostgreSQL database. It handles compressed CSV files and uses parallel processing to speed up the import.

**Script Summary:**

- **Name:** `bootstrap.sh`
- **Functionality:** Imports data from compressed CSV files into the PostgreSQL database using parallel processing. It processes multiple tables concurrently based on the number of CPU cores specified.
- **Requirements:** Ensure that the environment variables for database connectivity are set (`PGUSER`, `PGPASSWORD`, `PGDATABASE`, `PGHOST`, `PGPORT`).

**Instructions:**

1. **Ensure Environment Variables are Set:**

The environment variables should still be set from the previous steps. Verify them:

```bash
echo $PGUSER # Should output 'mirror_node'
echo $PGPASSWORD # Should output the password you set for OWNER_PASSWORD
echo $PGDATABASE # Should output 'mirror_node'
echo $PGHOST # Should be set to your DB IP address
```

2. **Run the Import Script within a `screen` or `tmux` Session:**

It's recommended to run the import script within a `screen` or `tmux` session, as the import process may take several hours to complete.

**Using `screen`:**

```bash
screen -S db_import
```

**Run the Import Script:**

```bash
./bootstrap.sh 8 /path/to/db_export/
```

- `8` refers to the number of CPU cores to use for parallel processing. Adjust this number based on your system's resources (`nproc` reports the number of available cores).
- `/path/to/db_export/` is the directory where you downloaded the database export data.

**Detach from the `screen` Session:**

Press `Ctrl+A` then `D`.

- This allows the import process to continue running in the background.

**Reattach to the `screen` Session Later:**

```bash
screen -r db_import
```
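
If you'd rather not manage a `screen`/`tmux` session, running the script with `nohup` in the background is a lighter-weight alternative (a sketch; the output file name here is arbitrary):

```bash
# Survives logout; follow progress with tail (Ctrl+C stops tailing, not the import).
nohup ./bootstrap.sh 8 /path/to/db_export/ > bootstrap.out 2>&1 &
tail -f bootstrap.out
```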

3. **Monitor the Import Process:**

- The script will output logs indicating the progress of the import.
- Check the `import.log` file for detailed logs and any error messages.

4. **Check the Exit Status:**

After the script completes, check the exit status:

```bash
echo "EXIT STATUS: $?"
```

- An exit status of `0` indicates the import completed successfully.
- If the exit status is not `0`, refer to the `import.log` file and `import_tracking.txt` for troubleshooting.
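
Note that `$?` reflects the last foreground command, so if the script ran in the background or you detached from the session, check the end of the log instead:

```bash
tail -n 20 import.log
```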

---

## Handling Failed Imports

During the import process, the script generates a file named `import_tracking.txt`, which logs the status of each file import. Each line in this file contains the path and name of a file, followed by its import status: `NOT_STARTED`, `IN_PROGRESS`, `IMPORTED`, or `FAILED_TO_IMPORT`.

**Statuses:**

- `NOT_STARTED`: The file has not yet been processed.
- `IN_PROGRESS`: The file is currently being imported.
- `IMPORTED`: The file was successfully imported.
- `FAILED_TO_IMPORT`: The file failed to import.

**Example of `import_tracking.txt`:**

```
/path/to/db_export/record_file.csv.gz IMPORTED
/path/to/db_export/transaction/transaction_part_1.csv.gz IMPORTED
/path/to/db_export/transaction/transaction_part_2.csv.gz FAILED_TO_IMPORT
/path/to/db_export/account.csv.gz NOT_STARTED
```
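
To list just the files that still need attention, filter the tracking file:

```bash
grep -E 'NOT_STARTED|IN_PROGRESS|FAILED_TO_IMPORT' import_tracking.txt
```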

### Steps to Handle Failed Imports

1. **Re-run the Import Script:**

- Simply re-run the import script; it will automatically skip files marked as `IMPORTED` and attempt to import files with statuses `NOT_STARTED`, `IN_PROGRESS`, or `FAILED_TO_IMPORT`.

```bash
./bootstrap.sh 8 /path/to/db_export/
```

- The script manages the import process, ensuring that only the necessary files are processed without manual intervention.

2. **Verify the Imports:**

- Check the `import_tracking.txt` and `import.log` files to ensure that all files have been imported successfully.

- If files continue to fail, review the error messages in `import.log` for troubleshooting.

**Notes on Data Consistency:**

- **System Resources:** Adjust the number of CPU cores used (`8` in the example) based on your system's capabilities to prevent overloading the server.

- **Data Integrity:** When a file import fails, the database transaction ensures that **no partial data** is committed. This means you can safely re-run the import script for failed files without worrying about duplicates or inconsistencies; the database tables remain in the same state as before the failed import attempt.

- **Concurrent Write Safety:** The script uses file locking (`flock`) to safely handle concurrent writes to `import_tracking.txt`. This prevents race conditions and ensures the tracking file remains consistent.
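
The locking pattern is roughly the following (an illustrative sketch, not the script's exact code):

```bash
# Serialize updates to the tracking file across parallel import workers.
(
  flock -x 200
  echo "/path/to/db_export/account.csv.gz IMPORTED" >> import_tracking.txt
) 200> import_tracking.lock
```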

---

## Troubleshooting

- **Connection Errors:**

- Confirm that `PGHOST` is correctly set to the IP address or hostname of your database server.
- Ensure that the database server allows connections from your client machine (a quick reachability check is sketched after this list).

- **Import Failures:**

- Check the `import.log` file generated by the import script for detailed error messages.
- Review the `import_tracking.txt` file to identify which files failed to import.

- **Interruption Handling:**

- If the import process is interrupted (e.g., due to a network issue or manual cancellation), the script updates the statuses in `import_tracking.txt` accordingly.
- Files that were in progress will be marked as `IN_PROGRESS` or remain as `NOT_STARTED` if they had not begun.
- Upon restarting the script, it will:
- Skip files marked as `IMPORTED`.
- Attempt to import files with statuses `NOT_STARTED`, `IN_PROGRESS`, or `FAILED_TO_IMPORT`.
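
For the connection check referenced above, `pg_isready` (shipped with PostgreSQL) confirms the server is accepting connections:

```bash
pg_isready -h "$PGHOST" -p "$PGPORT"
```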

---