Batch Replace Cluster Nodes
| sagemaker_batch_replace_cluster_nodes | R Documentation |
Replaces specific nodes within a SageMaker HyperPod cluster with new hardware
Description
Replaces specific nodes within a SageMaker HyperPod cluster with new
hardware. batch_replace_cluster_nodes terminates the specified
instances and provisions new replacement instances with the same
configuration but fresh hardware. The Amazon Machine Image (AMI) and
instance configuration remain the same.
This operation is useful for recovering from hardware failures or persistent issues that cannot be resolved through a reboot.
- Data Loss Warning: Replacing nodes destroys all instance volumes, including both root and secondary volumes. All data stored on these volumes will be permanently lost and cannot be recovered.
- To safeguard your work, back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This will help prevent any potential data loss from the instance root volume. For more information about backup, see Use the backup script provided by SageMaker HyperPod.
- If you want to invoke this API on an existing cluster, you'll first need to patch the cluster by running the UpdateClusterSoftware API. For more information about patching a cluster, see Update the SageMaker HyperPod platform software of a cluster.
- You can replace up to 25 nodes in a single request.
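Because a single request accepts at most 25 nodes, a longer list of node IDs has to be split across several requests. The following is a minimal sketch of one way to do that in base R. The helper name chunk_node_ids and the instance IDs are hypothetical and are not part of the package.

```r
# Split a character vector of node IDs into batches that respect the
# documented limit of 25 nodes per request.
# `chunk_node_ids` is a hypothetical helper, not part of paws.
chunk_node_ids <- function(node_ids, max_per_request = 25L) {
  stopifnot(is.character(node_ids), max_per_request >= 1L)
  if (length(node_ids) == 0L) {
    return(list())
  }
  # Assign each ID a batch number: the first 25 get 1, the next 25 get 2, ...
  batch_index <- ceiling(seq_along(node_ids) / max_per_request)
  unname(split(node_ids, batch_index))
}

# Made-up instance IDs, for illustration only.
ids <- sprintf("i-%017d", 1:60)
batches <- chunk_node_ids(ids)
lengths(batches)  # 25 25 10
```

Each element of batches can then be passed as NodeIds in its own request.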
Usage
sagemaker_batch_replace_cluster_nodes(ClusterName, NodeIds,
  NodeLogicalIds)
Arguments
| ClusterName | [required] The name or Amazon Resource Name (ARN) of the SageMaker HyperPod cluster containing the nodes to replace. |
| NodeIds | A list of EC2 instance IDs to replace with new hardware. You can specify between 1 and 25 instance IDs. Replace operations destroy all instance volumes (root and secondary). Ensure you have backed up any important data before proceeding. |
| NodeLogicalIds | A list of logical node IDs to replace with new hardware. You can specify between 1 and 25 logical node IDs. The |
Value
A list with the following syntax:
list(
  Successful = list(
    "string"
  ),
  Failed = list(
    list(
      NodeId = "string",
      ErrorCode = "InstanceIdNotFound"|"InvalidInstanceStatus"|"InstanceIdInUse"|"InternalServerError",
      Message = "string"
    )
  ),
  FailedNodeLogicalIds = list(
    list(
      NodeLogicalId = "string",
      ErrorCode = "InstanceIdNotFound"|"InvalidInstanceStatus"|"InstanceIdInUse"|"InternalServerError",
      Message = "string"
    )
  ),
  SuccessfulNodeLogicalIds = list(
    "string"
  )
)
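The nested list above can be awkward to inspect when several nodes fail. As a sketch only, the hypothetical helper summarize_failures below flattens the Failed and FailedNodeLogicalIds entries into one data frame. The response object is hand-built to match the documented shape; its IDs and message are invented for illustration.

```r
# Flatten the failure entries of a response into a data frame.
# `summarize_failures` is a hypothetical helper, not part of paws.
summarize_failures <- function(resp) {
  # Read one string field, tolerating missing or empty values.
  field <- function(entry, name) {
    value <- entry[[name]]
    if (length(value) == 1L) as.character(value) else NA_character_
  }
  to_rows <- function(entries, id_field) {
    if (length(entries) == 0L) {
      return(NULL)
    }
    data.frame(
      id = vapply(entries, field, character(1), name = id_field),
      id_type = id_field,
      error_code = vapply(entries, field, character(1), name = "ErrorCode"),
      message = vapply(entries, field, character(1), name = "Message"),
      stringsAsFactors = FALSE
    )
  }
  rows <- rbind(
    to_rows(resp$Failed, "NodeId"),
    to_rows(resp$FailedNodeLogicalIds, "NodeLogicalId")
  )
  if (is.null(rows)) {
    # No failures: return an empty data frame with the same columns.
    rows <- data.frame(
      id = character(0), id_type = character(0),
      error_code = character(0), message = character(0),
      stringsAsFactors = FALSE
    )
  }
  rows
}

# Hand-built response with made-up values, shaped like the syntax above.
resp <- list(
  Successful = list("i-0123456789abcdef0"),
  Failed = list(
    list(
      NodeId = "i-0fedcba9876543210",
      ErrorCode = "InvalidInstanceStatus",
      Message = "example message"
    )
  ),
  FailedNodeLogicalIds = list(),
  SuccessfulNodeLogicalIds = list()
)
failures <- summarize_failures(resp)
```

The id_type column records which of the two failure lists each row came from, so both kinds of identifier can share one table.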
Request syntax
svc$batch_replace_cluster_nodes(
  ClusterName = "string",
  NodeIds = list(
    "string"
  ),
  NodeLogicalIds = list(
    "string"
  )
)
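As a hedged illustration of the request syntax, the sketch below wraps the call in a hypothetical function, replace_nodes, that checks the documented limit of 1 to 25 instance IDs before sending the request. It assumes svc is a SageMaker client such as the one returned by paws::sagemaker(). The cluster name and instance ID in the commented usage are placeholders.

```r
# `replace_nodes` is a hypothetical wrapper, not part of paws.
# `svc` is assumed to be a SageMaker client, e.g. from paws::sagemaker().
replace_nodes <- function(svc, cluster_name, node_ids) {
  stopifnot(
    is.character(cluster_name), length(cluster_name) == 1L,
    is.character(node_ids),
    length(node_ids) >= 1L, length(node_ids) <= 25L
  )
  svc$batch_replace_cluster_nodes(
    ClusterName = cluster_name,
    NodeIds = as.list(node_ids)
  )
}

# Typical use (requires the paws package and AWS credentials).
# This terminates the listed instances and destroys their volumes,
# so back up any data first.
# svc <- paws::sagemaker()
# resp <- replace_nodes(svc, "my-cluster", "i-0123456789abcdef0")
```

Taking the client as an argument keeps the wrapper easy to exercise with a stand-in object before it is pointed at a real cluster.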