Modules

This directory contains a set of core modules built for the Cluster Toolkit. Modules describe the building blocks of an AI/ML and HPC deployment. The expected fields in a module are listed in more detail below. Blueprints can be extended in functionality by incorporating modules from GitHub repositories.

All Modules

Modules from various sources are all listed here for visibility. Badges are used to indicate the source and status of many of these resources.

Modules listed below with the core-badge badge are located in this folder and are tested and maintained by the Cluster Toolkit team.

Modules labeled with the community-badge badge are contributed by the community (including the Cluster Toolkit team, partners, etc.). Community modules are located in the community folder.

Modules labeled with the deprecated-badge badge are now deprecated and may be removed in the future. Customers are advised to transition to alternatives.

Modules that are still in development and less stable are labeled with the experimental-badge badge.

Compute

Database

File System

Monitoring

Network

Packer

  • custom-image core-badge : Creates a custom VM Image based on the GCP HPC VM image.

Project

Pub/Sub

Remote Desktop

Scheduler

Scripts

  • startup-script core-badge : Creates a customizable startup script that can be fed into compute VMs.
  • windows-startup-script community-badge experimental-badge : Creates Windows PowerShell (PS1) scripts that can be used to customize Windows VMs and VM images.
  • htcondor-install community-badge experimental-badge : Creates a startup script to install HTCondor and exports a list of required APIs.
  • omnia-install community-badge experimental-badge deprecated-badge : Installs Slurm via Dell Omnia onto a cluster of VM instances. This module has been deprecated and will be removed on August 1, 2024.
  • pbspro-preinstall community-badge experimental-badge : Creates a Cloud Storage bucket with PBS Pro RPM packages for use by PBS clusters.
  • pbspro-install community-badge experimental-badge : Creates a Toolkit runner to install PBS Professional from RPM packages.
  • pbspro-qmgr community-badge experimental-badge : Creates a Toolkit runner to run common qmgr commands when configuring a PBS Pro cluster.
  • ramble-execute community-badge experimental-badge : Creates a startup script to execute Ramble commands on a target VM.
  • ramble-setup community-badge experimental-badge : Creates a startup script to install Ramble on an instance or a Slurm login or controller node.
  • spack-setup community-badge experimental-badge : Creates a startup script to install Spack on an instance or a Slurm login or controller node.
  • spack-execute community-badge experimental-badge : Defines a software build using Spack.
  • wait-for-startup community-badge experimental-badge : Waits for successful completion of a startup script on a compute VM.

NOTE: Slurm V4 is deprecated. If you want to use V4 modules, use the ghpc-v1.27.0 source code and build the ghpc binary from it. That source code also contains deprecated examples using V4 modules for your reference.

Module Fields

ID (Required)

The id field is used to uniquely identify and reference a defined module. IDs are used in variables and become the name of each module when writing the Terraform main.tf file. They are also used in the use and outputs lists described below.

For terraform modules, the ID will be rendered into the terraform module label at the top level main.tf file.

Source (Required)

The source is a path or URL that points to the source files for Packer or Terraform modules. A source can either be a filesystem path or a URL to a git repository:

  • Filesystem paths

    • modules embedded in the gcluster executable
    • modules in the local filesystem
  • Remote modules using Terraform URL syntax

    When modules are in a subdirectory of the git repository, a special double-slash (//) notation may be required, as described below.

An important distinction is that those URLs are natively supported by Terraform so they are not copied to your deployment directory. Packer does not have native support for git-hosted modules so the Toolkit will copy these modules into the deployment folder on your behalf.

Embedded Modules

Embedded modules are added to the gcluster binary during compilation and cannot be edited. To refer to embedded modules, set the source path to modules/<<MODULE_PATH>> or community/modules/<<MODULE_PATH>>.

The paths match the modules in the repository structure for core modules and community modules. Because the modules are embedded during compilation, your local copies may differ unless you recompile gcluster.

For example, this snippet uses the embedded pre-existing-vpc module:

  - id: network1
    source: modules/network/pre-existing-vpc

Local Modules

Local modules point to a module in the file system and can easily be edited. They are very useful during module development. To use a local module, set the source to a path starting with /, ./, or ../. For instance, the following module definition refers to the local pre-existing-vpc module.

  - id: network1
    source: ./modules/network/pre-existing-vpc

NOTE: Relative paths (beginning with . or ..) must be relative to the working directory from which gcluster is executed. This example would have to be run from a local copy of the Cluster Toolkit repository. An alternative is to use absolute paths to modules.
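
As a minimal sketch of the absolute-path alternative, the same module could be referenced as follows. The directory /home/user/cluster-toolkit is a hypothetical location for a local checkout of the repository, not a path the Toolkit requires.

```yaml
  # Hypothetical absolute path; replace it with the location of your local checkout.
  - id: network1
    source: /home/user/cluster-toolkit/modules/network/pre-existing-vpc
```

With an absolute path, the blueprint no longer depends on the directory from which gcluster is run.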

GitHub-hosted Modules and Packages

The Intel DAOS blueprint makes extensive use of GitHub-hosted Terraform and Packer modules. You may wish to use it as an example reference for this documentation.

To use a Terraform module available on GitHub, set the source to a path starting with github.com (HTTPS) or git@github.com (SSH). For instance, the following module definition sources the Toolkit vpc module:

  - id: network1
    source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/network/vpc

This example uses the double-slash notation (//) to indicate that the Toolkit is a "package" of multiple modules whose root directory is the root of the git repository. The remainder of the path indicates the sub-directory of the vpc module.

The example above uses the default main branch of the Toolkit. Specific revisions can be selected with any valid git reference (branch, commit hash, or tag). If the git reference is a tag or branch, we recommend setting &depth=1 to reduce the data transferred over the network. This option cannot be set when the reference is a commit hash. The following examples select the vpc module on the active develop branch and an older release of the filestore module:

  - id: network1
    source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/network/vpc?ref=develop
  ...
  - id: homefs
    source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/filestore?ref=v1.22.1&depth=1

Because Terraform modules natively support this syntax, gcluster will not copy GitHub-hosted modules into your deployment folder. Terraform will download them into a hidden folder when you run terraform init.

GitHub-hosted Packer modules

Packer does not natively support GitHub-hosted modules so gcluster create will copy modules into your deployment folder.

If the module uses // package notation, gcluster create will copy the entire repository to the module path: deployment_name/group_name/module_id. However, when gcluster deploy is invoked, it will run Packer from the subdirectory deployment_name/group_name/module_id/subdirectory/after/double_slash.

Referring back to the Intel DAOS blueprint, we see that it will create 2 deployment groups at pfs-daos/daos-client-image and pfs-daos/daos-server-image. However, Packer will actually be invoked from subdirectories ending in daos-client-image/images and daos-server-image/images.

If the module does not use // package notation, gcluster create will copy only the final directory in the path to deployment_name/group_name/module_id.

In all cases, gcluster create will remove the .git directory from the packer module to ensure that you can manage the entire deployment directory with its own git versioning.
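
The copy behavior described above can be sketched with a hypothetical Packer module. The repository github.com/example-org/example-repo, its images subdirectory, and the ID my-image are placeholders for illustration only.

```yaml
  # Hypothetical GitHub-hosted Packer module using // package notation.
  # Per the rules above, gcluster create would copy the entire repository to
  #   deployment_name/group_name/my-image
  # and gcluster deploy would run Packer from
  #   deployment_name/group_name/my-image/images
  - id: my-image
    source: github.com/example-org/example-repo//images
    kind: packer
```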

GitHub over SSH

Get module from GitHub over SSH:

  - id: network1
    source: git@github.com:GoogleCloudPlatform/hpc-toolkit.git//modules/network/vpc

Specific versions can be selected as for HTTPS:

  - id: network1
    source: git@github.com:GoogleCloudPlatform/hpc-toolkit.git//modules/network/vpc?ref=v1.22.1&depth=1

Generic Git Modules

To use a Terraform module available in a non-GitHub git repository such as GitLab, set the source to a path starting with git::. Two standard git protocols are supported: git::https:// for HTTPS or git::git@github.com for SSH.

Additional formatting and features after git:: are identical to that of the GitHub Modules described above.
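
For illustration, a generic git source over HTTPS might look like the sketch below. The host, organization, repository name, and tag are hypothetical placeholders; only the git:: prefix, the // notation, and the ref and depth options come from the text above.

```yaml
  # Hypothetical GitLab repository; gitlab.com/example-org/example-repo is a placeholder.
  - id: network1
    source: git::https://gitlab.com/example-org/example-repo.git//modules/network/vpc?ref=v1.0.0&depth=1
```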

Google Cloud Storage Modules

To use a Terraform module available in a Google Cloud Storage bucket, set the source to a URL with the special gcs:: prefix, followed by a GCS bucket object URL.

For example: gcs::https://www.googleapis.com/storage/v1/BUCKET_NAME/PATH_TO_MODULE
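
Placed in a blueprint, that URL would sit in the source field like any other. In this sketch BUCKET_NAME and PATH_TO_MODULE remain placeholders to fill in, and the ID network1 is only illustrative.

```yaml
  # BUCKET_NAME and PATH_TO_MODULE are placeholders, not real values.
  - id: network1
    source: gcs::https://www.googleapis.com/storage/v1/BUCKET_NAME/PATH_TO_MODULE
```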

Kind (May be Required)

kind refers to the way in which a module is deployed. Currently, kind can be either terraform or packer. It must be specified for modules of type packer. If omitted, it will default to terraform.
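
A minimal sketch of a Packer module definition follows. It assumes the custom-image module listed under Packer lives at modules/packer/custom-image; the ID image is an arbitrary illustrative name.

```yaml
  # kind must be set explicitly because this is a Packer module;
  # omitting it would make gcluster treat the module as Terraform.
  - id: image
    source: modules/packer/custom-image
    kind: packer
```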

Settings (May Be Required)

The settings field is a map that supplies any user-defined variables for each module. Settings values can be simple strings, numbers or booleans, but can also support complex data types like maps and lists of variable depth. These settings will become the values for the variables defined in either the variables.tf file for Terraform or variable.pkr.hcl file for Packer.

For some modules, there are mandatory variables that must be set, so settings is a required field in that case. In many situations, a combination of sensible defaults, deployment variables, and used modules can populate all required settings, and the settings field can therefore be omitted.
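
The sketch below shows settings of several data types on one module. machine_type appears elsewhere in this document; instance_count, disable_public_ips, and the label values are assumptions for illustration and should be checked against the module's own variables.tf.

```yaml
  - id: workstation
    source: modules/compute/vm-instance
    use: [network1]
    settings:
      machine_type: e2-medium     # string
      instance_count: 2           # number (assumed setting name)
      disable_public_ips: true    # boolean (assumed setting name)
      labels:                     # map (illustrative values)
        team: research
```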

Use (Optional)

The use field is a powerful way of linking a module to one or more other modules. When a module "uses" another module, the outputs of the used module are compared to the settings of the current module. If they have matching names and the setting has no explicit value, then it will be set to the used module's output. For example, see the following blueprint snippet:

modules:
- id: network1
  source: modules/network/vpc

- id: workstation
  source: modules/compute/vm-instance
  use: [network1]
  settings:
  ...

In this snippet, the VM instance workstation uses the outputs of vpc network1.

In this case both network_self_link and subnetwork_self_link in the workstation settings will be set to $(network1.network_self_link) and $(network1.subnetwork_self_link) which refer to the network1 outputs of the same names.
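
For comparison, the same wiring can be written without the use field by setting both values explicitly. This is a sketch of the equivalent form, not an additional requirement.

```yaml
modules:
- id: network1
  source: modules/network/vpc

- id: workstation
  source: modules/compute/vm-instance
  settings:
    # Explicit references to the network1 outputs that use would otherwise infer.
    network_self_link: $(network1.network_self_link)
    subnetwork_self_link: $(network1.subnetwork_self_link)
```

The use form is shorter and keeps working if the used module later adds outputs that match other settings.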

When inferring a setting value, gcluster consults the following sources in descending order of priority:

  1. Explicitly set in the blueprint using the settings field
  2. Output from a used module, taken in the order provided in the use list
  3. Deployment variable (vars) of the same name
  4. Default value for the setting
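
The priority order above can be sketched in a blueprint fragment. The region value and the subnetwork path are hypothetical, and the fragment omits the surrounding deployment group structure for brevity.

```yaml
vars:
  region: us-central1   # priority 3: applies to settings named region that nothing above fills

# ... (fragment: in a full blueprint, modules belong to a deployment group)
modules:
- id: network1
  source: modules/network/vpc

- id: workstation
  source: modules/compute/vm-instance
  use: [network1]
  settings:
    # Priority 1: an explicit value wins over the network1 output of the same name.
    # The path below is a hypothetical placeholder.
    subnetwork_self_link: projects/my-project/regions/us-central1/subnetworks/my-subnet
    # network_self_link is not set here, so it is taken from network1 (priority 2).
```

Any setting not covered by the first three sources falls back to the module's default value (priority 4).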

NOTE: See the network storage documentation for more information about mounting network storage file systems via the use field.

Outputs (Optional)

The outputs field adds the output of individual Terraform modules to the output of its deployment group. This makes the value available via terraform output. This can be useful for displaying the IP of a login node or printing instructions on how to use a module, as we have in the monitoring dashboard module.

The outputs field is a list whose entries can be in either of two formats: a string equal to the name of the module output, or a map specifying the name, description, and whether the value is sensitive and should be suppressed from the standard output of Terraform commands. An example is shown below that displays the internal and public IP addresses of a VM created by the vm-instance module:

  - id: vm
    source: modules/compute/vm-instance
    use:
    - network1
    settings:
      machine_type: e2-medium
    outputs:
    - internal_ip
    - name: external_ip
      description: "External IP of VM"
      sensitive: true

The outputs shown after running Terraform apply will resemble:

Apply complete! Resources: 7 added, 0 changed, 0 destroyed.

Outputs:

external_ip_vm = <sensitive>
internal_ip_vm = [
  "10.128.0.19",
]

Required Services (APIs) (Optional)

Each Toolkit module depends upon Google Cloud services ("APIs") being enabled in the project used by the AI/ML and HPC environment. For example, the creation of VMs requires the Compute Engine API (compute.googleapis.com). The startup-script module requires the Cloud Storage API (storage.googleapis.com) for storage of the scripts themselves. Each module included in the Toolkit source code describes its required APIs internally. The Toolkit will merge the requirements from all modules and automatically validate that all APIs are enabled in the project specified by $(vars.project_id).

Common Settings

The following common naming conventions should be used to decrease the verbosity needed to define a blueprint. This is intentional to allow multiple modules to share inferred settings from deployment variables or from other modules listed under the use field.

For example, if all modules are to be created in a single region, that region can be defined as a deployment variable named region, which is shared between all modules without an explicit setting. Similarly, if many modules need to be connected to the same VPC network, they all can add the vpc module ID to their use list so that network_self_link would be inferred from that vpc module rather than having to set it manually.

  • project_id: The GCP project ID in which to create the GCP resources.
  • deployment_name: The name of the current deployment of a blueprint. This can help to avoid naming conflicts of modules when multiple deployments are created from the same blueprint.
  • region: The GCP region the module will be created in.
  • zone: The GCP zone the module will be created in.
  • labels: Labels added to the module. In order to include any module in advanced monitoring, labels must be exposed. We strongly recommend that all modules expose this variable.
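
As a sketch, the common settings above are typically supplied once as deployment variables so that modules can infer them. All values shown are hypothetical placeholders.

```yaml
vars:
  project_id: my-project          # hypothetical GCP project ID
  deployment_name: demo-cluster   # hypothetical deployment name
  region: us-central1
  zone: us-central1-a
```

Modules that declare variables with these names pick up the values automatically unless a setting or a used module supplies one first.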

Writing Custom Cluster Toolkit Modules

Modules are flexible by design; however, we define some best practices for creating a new module meant to be used with the Cluster Toolkit.