diff --git a/docs/source/compute_backends.rst b/docs/source/compute_backends.rst index 49c52f14f..0148a6615 100644 --- a/docs/source/compute_backends.rst +++ b/docs/source/compute_backends.rst @@ -44,3 +44,4 @@ Compute Backends compute_config/ibm_vpc.md compute_config/aws_ec2.md compute_config/azure_vms.md + compute_config/gcp_compute_engie.md diff --git a/docs/source/compute_config/aws_ec2.md b/docs/source/compute_config/aws_ec2.md index 0ba653a45..3d29f4368 100644 --- a/docs/source/compute_config/aws_ec2.md +++ b/docs/source/compute_config/aws_ec2.md @@ -130,6 +130,8 @@ In summary, you can use one of the following settings: |aws_ec2 | soft_dismantle_timeout | 300 |no| Time in seconds to stop the VM instance after a job **completed** its execution | |aws_ec2 | hard_dismantle_timeout | 3600 | no | Time in seconds to stop the VM instance after a job **started** its execution | |aws_ec2 | exec_mode | reuse | no | One of: **consume**, **create** or **reuse**. If set to **create**, Lithops will automatically create new VMs for each map() call based on the number of elements in iterdata. If set to **reuse** will try to reuse running workers if exist | +|aws_ec2 | extra_apt_packages | [] | no | Extra Debian/Ubuntu packages on master/worker VMs during setup (YAML list or space-separated string) | +|aws_ec2 | extra_python_packages | [] | no | Extra pip packages on master/worker VMs after Lithops (YAML list or space-separated string) | ## Additional configuration diff --git a/docs/source/compute_config/azure_vms.md b/docs/source/compute_config/azure_vms.md index 177421c77..c74b06a21 100644 --- a/docs/source/compute_config/azure_vms.md +++ b/docs/source/compute_config/azure_vms.md @@ -97,6 +97,8 @@ Edit your lithops config and add the relevant keys: |azure_vms | soft_dismantle_timeout | 300 |no| Time in seconds to stop the VM instance after a job **completed** its execution | |azure_vms | hard_dismantle_timeout | 3600 | no | Time in seconds to stop the VM instance after a job **started** its execution | |azure_vms | exec_mode | reuse | no | One of: **consume**, **create** or **reuse**. If set to **create**, Lithops will automatically create new VMs for each `map()` call based on the number of elements in `iterdata`. If set to **reuse**, Lithops will try to reuse running workers if they exist | +|azure_vms | extra_apt_packages | [] | no | Extra Debian/Ubuntu packages on master/worker VMs during setup (YAML list or space-separated string) | +|azure_vms | extra_python_packages | [] | no | Extra pip packages on master/worker VMs after Lithops (YAML list or space-separated string) | ## Consume mode diff --git a/docs/source/compute_config/gcp_compute_engie.md b/docs/source/compute_config/gcp_compute_engie.md new file mode 100644 index 000000000..676b8fb7d --- /dev/null +++ b/docs/source/compute_config/gcp_compute_engie.md @@ -0,0 +1,210 @@ +# Google Compute Engine (GCE) + +The GCP Compute Engine backend of Lithops can provide a serverless user experience on top of GCE where Lithops creates new Virtual Machines (VMs) dynamically at runtime and scales Lithops jobs against them (create and reuse modes). Alternatively Lithops can start and stop an existing VM instance (consume mode). + +The backend key is `gcp_compute_engie` (matches the Lithops module name). + +## Choose an operating system image for the VM + +Any VM needs an operating system image. By default Lithops uses Ubuntu 24.04 (`ubuntu-2404-lts-amd64`). Lithops installs required dependencies on the VM on first use (this can take a few minutes). + +To list available images: + +```bash +lithops image list -b gcp_compute_engie +``` + +Use the **Image ID** column as `source_image` in your config. + +## Installation + +1. Install GCP backend dependencies: + +```bash +python3 -m pip install lithops[gcp] +``` + +## Create and reuse modes + +In the `create` mode, Lithops automatically creates new worker VM instances at runtime, runs the job on them, and deletes the workers when the job completes (unless configured otherwise). + +In the `reuse` mode, Lithops keeps the master and worker VMs stopped between jobs and starts them again when needed. It reuses running workers when possible and only creates new workers if necessary. + +### Configuration + +1. Enable the Compute Engine API: + +```bash +gcloud services enable compute.googleapis.com --project +``` + +2. Create a service account (or use an existing one) and grant these roles on the project: + +```bash +gcloud projects add-iam-policy-binding \ + --member="serviceAccount:" \ + --role="roles/compute.admin" + +gcloud projects add-iam-policy-binding \ + --member="serviceAccount:" \ + --role="roles/storage.objectAdmin" +``` + +3. Set the Lithops config file: + +```yaml +lithops: + backend: gcp_compute_engie + +gcp: + credentials_path: + +gcp_compute_engie: + project_name: + zone: + exec_mode: reuse +``` + +Lithops attaches the service account from `credentials_path` to master and worker VMs so they can access GCS via the metadata service. You can set `service_account: ` explicitly if needed. + +### Summary of configuration keys for GCP + +|Group|Key|Default|Mandatory|Additional info| +|---|---|---|---|---| +|gcp | credentials_path | |no | Service account JSON used by Lithops on your machine and to select the VM service account | +|gcp | region | |no | GCP region. Derived from `zone` if omitted | + +### GCE - Create and Reuse Modes + +|Group|Key|Default|Mandatory|Additional info| +|---|---|---|---|---| +|gcp_compute_engie | project_name | |yes | GCP project ID | +|gcp_compute_engie | zone | |yes | Compute Engine zone, for example `us-east1-b` | +|gcp_compute_engie | region | derived from zone |no | Region used for subnet and NAT | +|gcp_compute_engie | service_account | |no | Service account email attached to VMs. Default: `client_email` from `credentials_path` | +|gcp_compute_engie | network_name | |no | Existing VPC name. If not provided, Lithops creates a new network | +|gcp_compute_engie | subnet_name | |no | Existing subnet name when using a custom VPC | +|gcp_compute_engie | source_image | ubuntu-2404-lts-amd64 |no | Boot image reference | +|gcp_compute_engie | master_instance_type | e2-small |no | Master VM machine type | +|gcp_compute_engie | worker_instance_type | e2-standard-2 |no | Worker VM machine type | +|gcp_compute_engie | ssh_username | ubuntu |no | Username to access the VM | +|gcp_compute_engie | ssh_password | |no | Password for worker VMs. If not provided, it is created randomly | +|gcp_compute_engie | ssh_key_filename | ~/.ssh/id_rsa |no | SSH private key for the master VM. If not provided, Lithops creates one | +|gcp_compute_engie | request_spot_instances | False |no | Use Spot VMs for workers | +|gcp_compute_engie | delete_on_dismantle | True |no | Delete worker VMs when stopped. Master VM is never deleted when stopped | +|gcp_compute_engie | max_workers | 100 |no | Max number of workers per `FunctionExecutor()` | +|gcp_compute_engie | worker_processes | AUTO |no | Parallel Lithops processes per worker. Default: CPUs of `worker_instance_type` | +|gcp_compute_engie | runtime | python3 |no | Runtime name. Default: python3 on the VM | +|gcp_compute_engie | auto_dismantle | True |no | If False, VMs are not stopped automatically | +|gcp_compute_engie | soft_dismantle_timeout | 300 |no | Seconds to stop the VM after a job **completed** | +|gcp_compute_engie | hard_dismantle_timeout | 3600 |no | Seconds to stop the VM after a job **started** | +|gcp_compute_engie | exec_mode | reuse |no | One of: **consume**, **create** or **reuse** | +|gcp_compute_engie | extra_apt_packages | [] |no | Extra apt packages on master/worker VMs during setup | +|gcp_compute_engie | extra_python_packages | [] |no | Extra pip packages on master/worker VMs after Lithops | + +## Consume mode + +In this mode, Lithops uses an existing VM. The VM must be reachable by SSH and have a service account with GCS access. + +### Configuration + +```yaml +lithops: + backend: gcp_compute_engie + +gcp: + credentials_path: + +gcp_compute_engie: + exec_mode: consume + project_name: + zone: + instance_name: +``` + +### Summary of configuration keys for the consume mode + +|Group|Key|Default|Mandatory|Additional info| +|---|---|---|---|---| +|gcp_compute_engie | instance_name | |yes | Existing VM instance name | +|gcp_compute_engie | project_name | |yes | GCP project ID | +|gcp_compute_engie | zone | |yes | Compute Engine zone | +|gcp_compute_engie | ssh_username | ubuntu |no | Username to access the VM | +|gcp_compute_engie | ssh_key_filename | ~/.ssh/id_rsa |no | Path to the SSH private key | +|gcp_compute_engie | worker_processes | AUTO |no | Parallel Lithops processes per worker | + +## Test Lithops + +Once you have your compute and storage backends configured, you can run a hello world function with: + +```bash +lithops hello -b gcp_compute_engie -s gcp_storage +``` + +## Viewing the execution logs + +You can view the function execution logs on your local machine using the Lithops client: + +```bash +lithops logs poll +``` + +## VM Management + +Lithops for GCE follows a master-worker architecture (1:N). + +All VMs, including the master, are automatically stopped after a configurable timeout (see hard/soft dismantle timeouts). Stopped master and worker VMs are started again on the next job in reuse mode. + +You can open an SSH session to the master VM with: + +```bash +lithops attach -b gcp_compute_engie +``` + +The master and worker VMs store Lithops service logs in `/tmp/lithops-root/*-service.log`. + +To list available workers: + +```bash +lithops worker list -b gcp_compute_engie +``` + +To list submitted jobs: + +```bash +lithops job list -b gcp_compute_engie +``` + +To delete workers only: + +```bash +lithops clean -b gcp_compute_engie -s gcp_storage +``` + +To delete workers, the master VM, and Lithops-created network resources: + +```bash +lithops clean -b gcp_compute_engie -s gcp_storage --all +``` + +## Architecture diagram + +```mermaid +flowchart TB + subgraph gcp [GCP project / region] + NET["VPC lithops-net-XXXXXX"] + SUB["Subnet lithops-net-XXXXXX-subnet"] + FW1["Firewall SSH :22 from internet"] + FW2["Firewall internal :8080/8081/6379/22"] + NAT["Cloud Router + NAT lithops-net-XXXXXX-router"] + M["Master VM lithops-master-XXXXXX\n+ ephemeral external IP"] + W["Worker VMs lithops-worker-*\nprivate IP only"] + end + LAPTOP["Your laptop"] -->|SSH :22| M + M -->|private network| W + W -->|egress| NAT + NAT --> INTERNET[Internet apt/docker/pip] + M --> SUB + W --> SUB + SUB --> NET +``` diff --git a/docs/source/compute_config/ibm_vpc.md b/docs/source/compute_config/ibm_vpc.md index 12c62ae2e..45e64701d 100644 --- a/docs/source/compute_config/ibm_vpc.md +++ b/docs/source/compute_config/ibm_vpc.md @@ -109,6 +109,8 @@ ibm_vpc: |ibm_vpc | exec_mode | reuse | no | One of: **consume**, **create** or **reuse**. If set to **create**, Lithops will automatically create new VMs for each `map()` call based on the number of elements in `iterdata`. If set to **reuse**, Lithops will try to reuse running workers if they exist | |ibm_vpc | singlesocket | False | no | Try to allocate workers with a single-socket CPU. If they end up running on multiple sockets, a warning message is printed to the user. If **True**, the standalone **workers_policy** must be set to **strict** to track worker states | |ibm_vpc | gpu | False | no | If `True`, Docker is started with GPU support. Requires the host to have the necessary hardware and software pre-configured, and a Docker image runtime with GPU support specified | +|ibm_vpc | extra_apt_packages | [] | no | Extra Debian/Ubuntu packages on master/worker VMs during setup (YAML list or space-separated string) | +|ibm_vpc | extra_python_packages | [] | no | Extra pip packages on master/worker VMs after Lithops (YAML list or space-separated string) | ## Consume mode diff --git a/docs/source/compute_config/vm.md b/docs/source/compute_config/vm.md index 5797666ad..c41b61726 100644 --- a/docs/source/compute_config/vm.md +++ b/docs/source/compute_config/vm.md @@ -54,6 +54,8 @@ In this backend, you can use any docker image that contains all the required dep |vm | ssh_key_filename | | no | Path to SSH key | |vm | runtime | python3 |no | `python3` or a docker image name | |vm | worker_processes | 1 | no | Number of Lithops processes within the VM. This can be used to parallelize function activations within the VM. It is recommended to set it to the same number of CPUs as the VM | +|vm | extra_apt_packages | [] | no | Extra Debian/Ubuntu packages during Lithops setup on the VM (YAML list or space-separated string) | +|vm | extra_python_packages | [] | no | Extra pip packages after Lithops on the VM (YAML list or space-separated string) | ## Test Lithops diff --git a/docs/source/configuration.rst b/docs/source/configuration.rst index 9106875c6..4b6be8239 100644 --- a/docs/source/configuration.rst +++ b/docs/source/configuration.rst @@ -37,6 +37,7 @@ Choose your compute and storage engines from the table below: || `IBM Virtual Private Cloud `_ || | || `AWS Elastic Compute Cloud (EC2) `_ || | || `Azure Virtual Machines `_ || | +|| `Google Compute Engine `_ || | +--------------------------------------------------------------------+--------------------------------------------------------------------+ Configuration File diff --git a/docs/source/execution_modes.rst b/docs/source/execution_modes.rst index cd058cb2b..737300a41 100644 --- a/docs/source/execution_modes.rst +++ b/docs/source/execution_modes.rst @@ -83,4 +83,4 @@ underlying infrastructure. fexec = lithops.StandaloneExecutor() -- Available backends: `IBM Virtual Private Cloud `_, `AWS Elastic Compute Cloud (EC2) `_, `Azure Virtual Machines `_, `Virtual Machine `_ +- Available backends: `IBM Virtual Private Cloud `_, `AWS Elastic Compute Cloud (EC2) `_, `Azure Virtual Machines `_, `Google Compute Engine `_, `Virtual Machine `_ diff --git a/docs/source/supported_clouds.rst b/docs/source/supported_clouds.rst index b252acf1e..fc5c21df9 100644 --- a/docs/source/supported_clouds.rst +++ b/docs/source/supported_clouds.rst @@ -52,6 +52,8 @@ Currently, Lithops for Google Cloud Platform supports these backends: - `Google Cloud Storage `_ * - `Google Cloud Run `_ - + * - `Google Compute Engine (GCE) `_ + - Microsoft Azure --------------- diff --git a/lithops/config.py b/lithops/config.py index 50e768803..522bdbf0a 100644 --- a/lithops/config.py +++ b/lithops/config.py @@ -256,6 +256,7 @@ def extract_standalone_config(config): sa_config = config[c.STANDALONE].copy() backend = config['lithops']['backend'] sa_config['backend'] = backend + sa_config['storage'] = config['lithops'].get('storage') sa_config[backend] = config[backend] if backend in config and config[backend] else {} sa_config[backend]['user_agent'] = f'lithops/{__version__}' diff --git a/lithops/constants.py b/lithops/constants.py index 84303102f..e1b29cad4 100644 --- a/lithops/constants.py +++ b/lithops/constants.py @@ -93,7 +93,9 @@ 'start_timeout': 300, 'auto_dismantle': True, 'soft_dismantle_timeout': 300, - 'hard_dismantle_timeout': 3600 + 'hard_dismantle_timeout': 3600, + 'extra_apt_packages': [], + 'extra_python_packages': [], } SERVERLESS_BACKENDS = [ @@ -118,5 +120,6 @@ 'ibm_vpc', 'aws_ec2', 'azure_vms', + 'gcp_compute_engie', 'vm' ] diff --git a/lithops/scripts/cli.py b/lithops/scripts/cli.py index 7af456ebc..fac28f962 100644 --- a/lithops/scripts/cli.py +++ b/lithops/scripts/cli.py @@ -907,7 +907,7 @@ def list_images(config, backend, region, debug): compute_config = extract_standalone_config(config) compute_handler = StandaloneHandler(compute_config) - logger.info('Listing all Ubuntu Linux 22.04 VM Images') + logger.info('Listing all Ubuntu VM images') vm_images = compute_handler.list_images() headers = ['Image Name', 'Image ID', 'Creation Date'] diff --git a/lithops/standalone/backends/aws_ec2/aws_ec2.py b/lithops/standalone/backends/aws_ec2/aws_ec2.py index 4a5eb2ec0..96379c5f4 100644 --- a/lithops/standalone/backends/aws_ec2/aws_ec2.py +++ b/lithops/standalone/backends/aws_ec2/aws_ec2.py @@ -27,7 +27,7 @@ from concurrent.futures import ThreadPoolExecutor from lithops.version import __version__ -from lithops.util.ssh_client import SSHClient +from lithops.util.ssh_client import SSHClient, ssh_boot_status_message from lithops.constants import COMPUTE_CLI_MSG, CACHE_DIR from lithops.config import load_yaml_config, dump_yaml_config from lithops.standalone.utils import CLOUD_CONFIG_WORKER, CLOUD_CONFIG_WORKER_PK, StandaloneMode, get_host_setup_script @@ -647,7 +647,7 @@ def build_image(self, image_name, script_file, overwrite, include, extra_args=[] logger.debug(f"Uploading installation script to {build_vm}") remote_script = "/tmp/install_lithops.sh" - script = get_host_setup_script() + script = get_host_setup_script(lithops_pip_spec='lithops[aws,redis]') build_vm.get_ssh_client().upload_data_to_file(script, remote_script) logger.debug("Executing Lithops installation script. Be patient, this process can take up to 3 minutes") build_vm.get_ssh_client().run_remote_command(f"chmod 777 {remote_script}; sudo {remote_script}; rm {remote_script};") @@ -1099,14 +1099,13 @@ def is_ready(self): """ Checks if the VM instance is ready to receive ssh connections """ - login_type = 'password' if 'password' in self.ssh_credentials and \ - not self.public else 'publickey' try: self.get_ssh_client().run_remote_command('id') except LithopsValidationError as err: raise err except Exception as err: - logger.debug(f'SSH to {self.public_ip if self.public else self.private_ip} failed ({login_type}): {err}') + ip = self.public_ip if self.public else self.private_ip + logger.debug(f'SSH to {ip}: {ssh_boot_status_message(err)}') self.del_ssh_client() return False return True diff --git a/lithops/standalone/backends/azure_vms/azure_vms.py b/lithops/standalone/backends/azure_vms/azure_vms.py index ba4da1d1a..e67402650 100644 --- a/lithops/standalone/backends/azure_vms/azure_vms.py +++ b/lithops/standalone/backends/azure_vms/azure_vms.py @@ -27,7 +27,7 @@ from azure.core.exceptions import ResourceNotFoundError from lithops.version import __version__ -from lithops.util.ssh_client import SSHClient +from lithops.util.ssh_client import SSHClient, ssh_boot_status_message from lithops.constants import COMPUTE_CLI_MSG, CACHE_DIR, SA_CONFIG_FILE from lithops.config import load_yaml_config, dump_yaml_config from lithops.standalone.utils import StandaloneMode @@ -752,14 +752,13 @@ def is_ready(self): """ Checks if the VM instance is ready to receive ssh connections """ - login_type = 'password' if 'password' in self.ssh_credentials and \ - not self.public else 'publickey' try: self.get_ssh_client().run_remote_command('id') except LithopsValidationError as err: raise err except Exception as err: - logger.debug(f'SSH to {self.public_ip if self.public else self.private_ip} failed ({login_type}): {err}') + ip = self.public_ip if self.public else self.private_ip + logger.debug(f'SSH to {ip}: {ssh_boot_status_message(err)}') self.del_ssh_client() return False return True diff --git a/lithops/standalone/backends/azure_vms/config.py b/lithops/standalone/backends/azure_vms/config.py index 76e8f4961..8748e8395 100644 --- a/lithops/standalone/backends/azure_vms/config.py +++ b/lithops/standalone/backends/azure_vms/config.py @@ -25,7 +25,7 @@ 'ssh_username': 'ubuntu', 'ssh_password': str(uuid.uuid4()), 'request_spot_instances': True, - 'delete_on_dismantle': False, + 'delete_on_dismantle': True, 'max_workers': 100, 'worker_processes': 'AUTO' } diff --git a/lithops/standalone/backends/gcp_compute_engie/__init__.py b/lithops/standalone/backends/gcp_compute_engie/__init__.py new file mode 100644 index 000000000..d7d7fb0f1 --- /dev/null +++ b/lithops/standalone/backends/gcp_compute_engie/__init__.py @@ -0,0 +1,3 @@ +from .gcp_compute_engie import GCPComputeEngieBackend as StandaloneBackend + +__all__ = ['StandaloneBackend'] diff --git a/lithops/standalone/backends/gcp_compute_engie/config.py b/lithops/standalone/backends/gcp_compute_engie/config.py new file mode 100644 index 000000000..5e61af1aa --- /dev/null +++ b/lithops/standalone/backends/gcp_compute_engie/config.py @@ -0,0 +1,90 @@ +# +# Copyright Cloudlab URV 2021 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import copy +import os +import uuid + +from lithops.constants import SA_DEFAULT_CONFIG_KEYS + + +DEFAULT_CONFIG_KEYS = { + 'master_instance_type': 'e2-small', + 'worker_instance_type': 'e2-standard-2', + 'boot_disk_size': 50, + 'boot_disk_type': 'pd-standard', + 'network_cidr': '10.0.0.0/16', + 'subnet_cidr': '10.0.0.0/24', + 'source_image': 'projects/ubuntu-os-cloud/global/images/family/ubuntu-2404-lts-amd64', + 'ssh_username': 'ubuntu', + 'ssh_password': str(uuid.uuid4()), + 'ssh_key_filename': '~/.ssh/id_rsa', + 'delete_on_dismantle': True, + 'max_workers': 100, + 'request_spot_instances': False, + 'worker_processes': 'AUTO' +} + +MANDATORY_PARAMETERS_1 = ('project_name', 'zone', 'instance_name') +MANDATORY_PARAMETERS_2 = ('project_name', 'zone') + + +def load_config(config_data): + if 'gcp_compute_engie' not in config_data or not config_data['gcp_compute_engie']: + raise Exception("'gcp_compute_engie' section is mandatory in the configuration") + + if 'gcp' not in config_data: + config_data['gcp'] = {} + + temp = copy.deepcopy(config_data['gcp_compute_engie']) + config_data['gcp_compute_engie'].update(config_data['gcp']) + config_data['gcp_compute_engie'].update(temp) + + if 'credentials_path' in config_data['gcp_compute_engie']: + config_data['gcp_compute_engie']['credentials_path'] = os.path.expanduser( + config_data['gcp_compute_engie']['credentials_path'] + ) + + for key in DEFAULT_CONFIG_KEYS: + if key not in config_data['gcp_compute_engie']: + config_data['gcp_compute_engie'][key] = DEFAULT_CONFIG_KEYS[key] + + if 'standalone' not in config_data or config_data['standalone'] is None: + config_data['standalone'] = {} + + for key in SA_DEFAULT_CONFIG_KEYS: + if key in config_data['gcp_compute_engie']: + config_data['standalone'][key] = config_data['gcp_compute_engie'].pop(key) + elif key not in config_data['standalone']: + config_data['standalone'][key] = SA_DEFAULT_CONFIG_KEYS[key] + + if config_data['standalone']['exec_mode'] == 'consume': + params_to_check = MANDATORY_PARAMETERS_1 + config_data['gcp_compute_engie']['max_workers'] = 1 + else: + params_to_check = MANDATORY_PARAMETERS_2 + + for param in params_to_check: + if param not in config_data['gcp_compute_engie']: + msg = f"'{param}' is mandatory in the 'gcp_compute_engie' section of the configuration" + raise Exception(msg) + + if 'region' not in config_data['gcp_compute_engie']: + zone = config_data['gcp_compute_engie']['zone'] + config_data['gcp_compute_engie']['region'] = '-'.join(zone.split('-')[:-1]) + + if 'region' not in config_data['gcp'] and 'region' in config_data['gcp_compute_engie']: + config_data['gcp']['region'] = config_data['gcp_compute_engie']['region'] diff --git a/lithops/standalone/backends/gcp_compute_engie/gcp_compute_engie.py b/lithops/standalone/backends/gcp_compute_engie/gcp_compute_engie.py new file mode 100644 index 000000000..94dfa6d75 --- /dev/null +++ b/lithops/standalone/backends/gcp_compute_engie/gcp_compute_engie.py @@ -0,0 +1,1024 @@ +# +# Copyright Cloudlab URV 2021 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import json +import time +import uuid +import logging +from concurrent.futures import ThreadPoolExecutor +from datetime import datetime +import httplib2 +import google.auth +from google.oauth2 import service_account +from google_auth_httplib2 import AuthorizedHttp +from googleapiclient.discovery import build +from googleapiclient.errors import HttpError + +from lithops.version import __version__ +from lithops.util.ssh_client import SSHClient, ssh_boot_status_message +from lithops.constants import COMPUTE_CLI_MSG, CACHE_DIR +from lithops.config import load_yaml_config, dump_yaml_config +from lithops.standalone.utils import ( + StandaloneMode, + CLOUD_CONFIG_WORKER, + CLOUD_CONFIG_WORKER_PK, +) +from lithops.standalone import LithopsValidationError + + +logger = logging.getLogger(__name__) + +INSTANCE_START_TIMEOUT = 180 +UBUNTU_OS_PROJECT = 'ubuntu-os-cloud' +UBUNTU_LTS_FAMILIES = ( + 'ubuntu-2404-lts-amd64', + 'ubuntu-2204-lts', +) + +# Scopes for the SA attached to master/worker VMs (GCS, etc. via metadata credentials). +GCE_INSTANCE_SCOPES = ['https://www.googleapis.com/auth/cloud-platform'] + + +class GCPComputeEngieBackend: + + def __init__(self, config, mode): + logger.debug("Creating GCP Compute Engine client") + self.name = 'gcp_compute_engie' + self.config = config + self.mode = mode + self.project_name = self.config['project_name'] + self.zone = self.config['zone'] + self.region = self.config.get('region') or '-'.join(self.zone.split('-')[:-1]) + self.credentials_path = self.config.get('credentials_path') + + suffix = 'vm' if self.mode == StandaloneMode.CONSUME.value else 'vpc' + self.cache_dir = os.path.join(CACHE_DIR, self.name) + self.cache_file = os.path.join(self.cache_dir, f'{self.project_name}_{self.region}_{suffix}_data') + self.gce_data = {} + self.vpc_data_type = 'provided' if 'network_name' in self.config else 'created' + self.ssh_data_type = 'provided' if 'ssh_key_filename' in self.config else 'created' + self.network_name = self.config.get('network_name') + self.network_key = None + + self.compute_client = self._create_compute_client() + self._resolve_service_account_email() + + self.master = None + self.workers = [] + self.instance_types = {} + + msg = COMPUTE_CLI_MSG.format('GCP Compute Engine') + logger.info(f"{msg} - Zone: {self.zone} - Project: {self.project_name}") + + def _resolve_service_account_email(self): + """ + VMs use the GCE metadata service for GCS credentials (not the laptop key file). + Attach the same service account as gcp.credentials_path when creating instances. + """ + if self.config.get('service_account'): + self.config['service_account_email'] = self.config['service_account'] + logger.debug( + f'VM service account (from config): {self.config["service_account_email"]}' + ) + return + + if self.credentials_path and os.path.isfile(self.credentials_path): + with open(self.credentials_path) as f: + sa_data = json.load(f) + email = sa_data.get('client_email') + if email: + self.config['service_account_email'] = email + logger.debug( + f'VM service account (from credentials_path): {email}' + ) + return + + try: + credentials, _ = google.auth.default( + scopes=['https://www.googleapis.com/auth/cloud-platform'] + ) + email = getattr(credentials, 'service_account_email', None) + if email: + self.config['service_account_email'] = email + logger.debug(f'VM service account (from ADC): {email}') + return + except Exception: + pass + + logger.warning( + 'No service account resolved for GCE VMs. Workers/master cannot access GCS ' + 'via metadata unless you set gcp_compute_engie.service_account or ' + 'gcp.credentials_path.' + ) + + def _wait_operation(self, operation_name, scope='zone'): + while True: + if scope == 'zone': + op = self.compute_client.zoneOperations().get( + project=self.project_name, zone=self.zone, operation=operation_name + ).execute() + elif scope == 'region': + op = self.compute_client.regionOperations().get( + project=self.project_name, region=self.region, operation=operation_name + ).execute() + else: + op = self.compute_client.globalOperations().get( + project=self.project_name, operation=operation_name + ).execute() + + if op['status'] == 'DONE': + if 'error' in op: + raise Exception(op['error']) + return + time.sleep(2) + + def _load_gce_data(self): + self.gce_data = load_yaml_config(self.cache_file) + if self.gce_data and 'network_name' in self.gce_data: + self.network_name = self.gce_data['network_name'] + self.network_key = self.gce_data.get('network_key') + + def _dump_gce_data(self): + dump_yaml_config(self.cache_file, self.gce_data) + + def _delete_vpc_data(self): + if os.path.exists(self.cache_file): + os.remove(self.cache_file) + + def _resource_exists(self, getter): + try: + getter() + return True + except HttpError as e: + if getattr(e.resp, 'status', None) == 404: + return False + raise + + def _create_network(self): + if 'network_name' in self.config: + logger.debug( + f'Using user-provided network {self.config["network_name"]} ' + f'(subnet {self.config.get("subnet_name", "default")})' + ) + return + + if 'network_name' in self.gce_data: + self.config['network_name'] = self.gce_data['network_name'] + self.config['subnet_name'] = self.gce_data['subnet_name'] + self.config['firewall_name'] = self.gce_data['firewall_name'] + self.config['internal_firewall_name'] = self.gce_data.get('internal_firewall_name') + self.network_name = self.config['network_name'] + logger.debug(f'Using existing network {self.config["network_name"]}') + logger.debug( + f'Using existing subnet {self.config["subnet_name"]} in region {self.region}' + ) + logger.debug(f'Using existing firewall {self.config["firewall_name"]} (SSH)') + if self.config.get('internal_firewall_name'): + logger.debug( + f'Using existing firewall {self.config["internal_firewall_name"]} ' + f'(internal ports 8080/8081/6379/22)' + ) + if self.gce_data.get('router_name'): + self.config['router_name'] = self.gce_data['router_name'] + self.config['nat_name'] = self.gce_data.get('nat_name') + logger.debug( + f'Using existing Cloud NAT router {self.config["router_name"]} ' + f'(NAT {self.config.get("nat_name")})' + ) + else: + self._create_cloud_nat() + self.gce_data['router_name'] = self.config.get('router_name') + self.gce_data['nat_name'] = self.config.get('nat_name') + self._dump_gce_data() + return + + self.network_name = f'lithops-net-{str(uuid.uuid4())[-6:]}' + assert re.match("^[a-z0-9-]*$", self.network_name), f'Network name "{self.network_name}" not valid' + self.network_key = self.network_name[-6:] + subnet_name = f'{self.network_name}-subnet' + firewall_name = f'{self.network_name}-fw' + + logger.debug( + f'Creating VPC network {self.network_name} ' + f'(CIDR {self.config["network_cidr"]})' + ) + body = { + 'name': self.network_name, + 'autoCreateSubnetworks': False + } + op = self.compute_client.networks().insert(project=self.project_name, body=body).execute() + self._wait_operation(op['name'], scope='global') + + logger.debug( + f'Creating subnet {subnet_name} in {self.region} ' + f'(CIDR {self.config["subnet_cidr"]})' + ) + body = { + 'name': subnet_name, + 'ipCidrRange': self.config['subnet_cidr'], + 'network': f'projects/{self.project_name}/global/networks/{self.network_name}', + 'region': self.region, + 'privateIpGoogleAccess': True, + } + op = self.compute_client.subnetworks().insert( + project=self.project_name, region=self.region, body=body + ).execute() + self._wait_operation(op['name'], scope='region') + + logger.debug(f'Creating firewall {firewall_name} (SSH tcp/22 from internet)') + body = { + 'name': firewall_name, + 'network': f'projects/{self.project_name}/global/networks/{self.network_name}', + 'sourceRanges': ['0.0.0.0/0'], + 'allowed': [{'IPProtocol': 'tcp', 'ports': ['22']}] + } + op = self.compute_client.firewalls().insert(project=self.project_name, body=body).execute() + self._wait_operation(op['name'], scope='global') + + internal_fw_name = f'{self.network_name}-internal-fw' + logger.debug( + f'Creating firewall {internal_fw_name} ' + f'(internal tcp 8080/8081/6379/22 from {self.config["network_cidr"]})' + ) + body = { + 'name': internal_fw_name, + 'network': f'projects/{self.project_name}/global/networks/{self.network_name}', + 'sourceRanges': [self.config['network_cidr']], + 'allowed': [{'IPProtocol': 'tcp', 'ports': ['8080', '8081', '6379', '22']}] + } + op = self.compute_client.firewalls().insert(project=self.project_name, body=body).execute() + self._wait_operation(op['name'], scope='global') + + self.config['network_name'] = self.network_name + self.config['subnet_name'] = subnet_name + self.config['firewall_name'] = firewall_name + self.config['internal_firewall_name'] = internal_fw_name + + self._create_cloud_nat() + logger.debug( + f'VPC setup complete: network={self.network_name}, ' + f'subnet={subnet_name}, router={self.config.get("router_name")}' + ) + + def _create_cloud_nat(self): + """ + Provision Cloud NAT so private worker VMs can reach the internet + (same role as IBM VPC public gateway / AWS NAT for private subnets). + The master keeps an ephemeral external IP for SSH from the client. + """ + if 'router_name' in self.config: + return + + network_name = self.network_name or self.config.get('network_name') + subnet_name = self.config.get('subnet_name') + if not network_name or not subnet_name: + return + + router_name = f'{network_name}-router' + nat_name = f'{network_name}-nat' + + if self._resource_exists( + lambda: self.compute_client.routers().get( + project=self.project_name, + region=self.region, + router=router_name, + ).execute() + ): + logger.debug(f'Using existing Cloud NAT router {router_name}') + self.config['router_name'] = router_name + self.config['nat_name'] = nat_name + return + + logger.debug( + f'Creating Cloud NAT router {router_name} with NAT {nat_name} ' + f'on subnet {subnet_name} (worker outbound internet)' + ) + network_url = ( + f'https://www.googleapis.com/compute/v1/projects/{self.project_name}' + f'/global/networks/{network_name}' + ) + region_url = ( + f'https://www.googleapis.com/compute/v1/projects/{self.project_name}' + f'/regions/{self.region}' + ) + subnet_url = ( + f'https://www.googleapis.com/compute/v1/projects/{self.project_name}' + f'/regions/{self.region}/subnetworks/{subnet_name}' + ) + body = { + 'name': router_name, + 'network': network_url, + 'region': region_url, + 'nats': [{ + 'name': nat_name, + 'natIpAllocateOption': 'AUTO_ONLY', + 'sourceSubnetworkIpRangesToNat': 'LIST_OF_SUBNETWORKS', + 'subnetworks': [{ + 'name': subnet_url, + 'sourceIpRangesToNat': ['ALL_IP_RANGES'], + }], + }], + } + op = self.compute_client.routers().insert( + project=self.project_name, + region=self.region, + body=body, + ).execute() + self._wait_operation(op['name'], scope='region') + + self.config['router_name'] = router_name + self.config['nat_name'] = nat_name + + def _create_ssh_key(self): + """ + Creates a new SSH key pair on the client (same pattern as AWS EC2 / IBM VPC). + Used for Lithops client -> master SSH; workers use the master lithops_id_rsa key. + """ + if 'ssh_key_filename' in self.gce_data and os.path.isfile(self.gce_data['ssh_key_filename']): + self.config['ssh_key_filename'] = self.gce_data['ssh_key_filename'] + return + + user_key = os.path.expanduser(self.config.get('ssh_key_filename', '~/.ssh/id_rsa')) + if os.path.isfile(user_key) and 'lithops-key-' not in os.path.basename(user_key): + logger.debug(f'Using user-provided SSH key {user_key}') + self.config['ssh_key_filename'] = user_key + return + + keyname = f'lithops-key-{str(uuid.uuid4())[-8:]}' + filename = os.path.join("~", ".ssh", f"{keyname}.{self.name}.id_rsa") + key_filename = os.path.expanduser(filename) + if not os.path.isfile(key_filename): + logger.debug("Generating new ssh key pair") + os.system(f'ssh-keygen -b 2048 -t rsa -f {key_filename} -q -N ""') + logger.debug(f"SSH key pair generated: {key_filename}") + self.config['ssh_key_filename'] = key_filename + + def _load_instance_types(self): + if 'instance_types' in self.gce_data: + self.instance_types = self.gce_data['instance_types'] + return + + self.instance_types = {} + request = self.compute_client.machineTypes().list( + project=self.project_name, zone=self.zone + ) + while request is not None: + response = request.execute() + for machine_type in response.get('items', []): + self.instance_types[machine_type['name']] = machine_type.get('guestCpus', 1) + request = self.compute_client.machineTypes().list_next( + previous_request=request, previous_response=response + ) + + def _instance_exists(self, instance_name): + return self._resource_exists( + lambda: self.compute_client.instances().get( + project=self.project_name, zone=self.zone, instance=instance_name + ).execute() + ) + + def _create_master_instance(self): + name = self.config.get('instance_name') or f'lithops-master-{self.network_key}' + self.master = GCPComputeEngieInstance(self.config, self.compute_client, public=True) + self.master.name = name + self.master.instance_type = self.config['master_instance_type'] + self.master.delete_on_dismantle = False + + if self._instance_exists(name): + logger.debug(f'Using existing master VM {name}') + self.master.get_instance_data() + if self.master.get_status() == 'TERMINATED': + logger.debug(f'Master VM {name} is stopped, starting') + self.master.start() + elif self.mode != StandaloneMode.CONSUME.value: + logger.debug(f'Creating new VM instance {name}') + self.master.create(public=True) + self.master.get_instance_data() + + def _create_compute_client(self): + if self.credentials_path and os.path.isfile(self.credentials_path): + credentials = service_account.Credentials.from_service_account_file( + self.credentials_path, + scopes=['https://www.googleapis.com/auth/cloud-platform'] + ) + else: + credentials, _ = google.auth.default( + scopes=['https://www.googleapis.com/auth/cloud-platform'] + ) + + http = AuthorizedHttp(credentials, http=httplib2.Http()) + return build('compute', 'v1', http=http, cache_discovery=False) + + def is_initialized(self): + if self.mode == StandaloneMode.CONSUME.value: + return True + return os.path.isfile(self.cache_file) + + def init(self): + logger.debug(f'Initializing GCP Compute Engine backend ({self.mode} mode)') + self._load_gce_data() + + if self.mode == StandaloneMode.CONSUME.value: + self._create_master_instance() + self.gce_data = { + 'mode': self.mode, + 'master_name': self.master.name, + 'master_id': self.master.get_instance_id() + } + self._dump_gce_data() + return + + self._create_network() + self._create_ssh_key() + if 'instance_name' not in self.config: + self.config['instance_name'] = f'lithops-master-{self.network_key}' + self._create_master_instance() + self._load_instance_types() + self.gce_data = { + 'mode': self.mode, + 'vpc_data_type': self.vpc_data_type, + 'ssh_data_type': self.ssh_data_type, + 'master_name': self.master.name, + 'master_id': self.master.get_instance_id(), + 'network_name': self.config['network_name'], + 'network_key': self.network_key, + 'subnet_name': self.config['subnet_name'], + 'firewall_name': self.config['firewall_name'], + 'internal_firewall_name': self.config['internal_firewall_name'], + 'router_name': self.config.get('router_name'), + 'nat_name': self.config.get('nat_name'), + 'ssh_key_filename': self.config['ssh_key_filename'], + 'instance_types': self.instance_types, + } + self._dump_gce_data() + + def build_image(self, **kwargs): + raise NotImplementedError(f'{self.name}.build_image() is not implemented yet') + + def delete_image(self, **kwargs): + raise NotImplementedError(f'{self.name}.delete_image() is not implemented yet') + + @staticmethod + def _format_image_timestamp(timestamp): + if not timestamp: + return 'Unknown' + try: + created_at = datetime.fromisoformat(timestamp.replace('Z', '+00:00')) + except ValueError: + return timestamp + return created_at.strftime('%Y-%m-%d %H:%M:%S') + + def _iter_project_images(self, project): + request = self.compute_client.images().list(project=project) + while request is not None: + response = request.execute() + yield from response.get('items', []) + request = self.compute_client.images().list_next( + previous_request=request, previous_response=response + ) + + def list_images(self, **kwargs): + """ + List Ubuntu LTS image families (latest) and custom Lithops images in the project. + Returns tuples of (name, image_id, creation_date) like other standalone backends. + """ + result = set() + + for family in UBUNTU_LTS_FAMILIES: + try: + image = self.compute_client.images().getFromFamily( + project=UBUNTU_OS_PROJECT, family=family + ).execute() + except HttpError as err: + if getattr(err.resp, 'status', None) == 404: + continue + raise + + created_at = self._format_image_timestamp(image.get('creationTimestamp')) + family_ref = f'projects/{UBUNTU_OS_PROJECT}/global/images/family/{family}' + result.add((image['name'], family_ref, created_at)) + + for image in self._iter_project_images(self.project_name): + name = image.get('name', '') + if 'lithops' not in name.lower(): + continue + created_at = self._format_image_timestamp(image.get('creationTimestamp')) + image_ref = f'projects/{self.project_name}/global/images/{name}' + result.add((name, image_ref, created_at)) + + return sorted(result, key=lambda x: x[2], reverse=True) + + def clean(self, **kwargs): + all_clean = kwargs.get('all', False) + if not self.gce_data: + self._load_gce_data() + + if self.mode == StandaloneMode.CONSUME.value: + self._delete_vpc_data() + return + + self._delete_vm_instances(all=all_clean) + + master_name = self.gce_data.get('master_name') or (self.master.name if self.master else None) + if master_name: + master_pk = os.path.join(self.cache_dir, f'{master_name}-id_rsa.pub') + if all_clean and os.path.isfile(master_pk): + os.remove(master_pk) + + if all_clean: + self._delete_network_resources() + self._delete_ssh_key() + self._delete_vpc_data() + + def _delete_vm_instances(self, all=False): + prefixes = ('lithops-worker-', 'lithops-master-') if all else ('lithops-worker-',) + instances = self.compute_client.instances().list( + project=self.project_name, zone=self.zone + ).execute().get('items', []) + + for ins in instances: + name = ins.get('name', '') + if not name.startswith(prefixes): + continue + logger.debug(f"Deleting VM instance {name}") + op = self.compute_client.instances().delete( + project=self.project_name, zone=self.zone, instance=name + ).execute() + self._wait_operation(op['name'], scope='zone') + + def _delete_network_resources(self): + """ + Remove Lithops-created VPC resources (reverse order of creation). + VMs must already be deleted (they hold NICs on the subnet). + """ + if self.gce_data.get('vpc_data_type') == 'provided': + return + + if not self.gce_data.get('network_name'): + logger.debug('No Lithops network in cache; skipping VPC cleanup') + return + + fw_names = [ + self.gce_data.get('firewall_name'), + self.gce_data.get('internal_firewall_name') + ] + for fw_name in fw_names: + if not fw_name: + continue + try: + logger.debug(f'Deleting firewall {fw_name}') + op = self.compute_client.firewalls().delete( + project=self.project_name, firewall=fw_name + ).execute() + self._wait_operation(op['name'], scope='global') + except HttpError as e: + if getattr(e.resp, 'status', None) != 404: + raise + + network_name = self.gce_data.get('network_name') + router_name = self.gce_data.get('router_name') or ( + f'{network_name}-router' if network_name else None + ) + if router_name: + try: + logger.debug(f'Deleting Cloud Router (NAT) {router_name}') + op = self.compute_client.routers().delete( + project=self.project_name, + region=self.region, + router=router_name, + ).execute() + self._wait_operation(op['name'], scope='region') + except HttpError as e: + if getattr(e.resp, 'status', None) != 404: + raise + + subnet_name = self.gce_data.get('subnet_name') + if subnet_name: + try: + logger.debug(f'Deleting subnet {subnet_name}') + op = self.compute_client.subnetworks().delete( + project=self.project_name, region=self.region, subnetwork=subnet_name + ).execute() + self._wait_operation(op['name'], scope='region') + except HttpError as e: + if getattr(e.resp, 'status', None) != 404: + raise + + if network_name: + try: + logger.debug(f'Deleting network {network_name}') + op = self.compute_client.networks().delete( + project=self.project_name, network=network_name + ).execute() + self._wait_operation(op['name'], scope='global') + except HttpError as e: + if getattr(e.resp, 'status', None) != 404: + raise + + def _delete_ssh_key(self): + if self.gce_data.get('ssh_data_type') == 'provided': + return + key_filename = self.gce_data.get('ssh_key_filename') + if key_filename and "lithops-key-" in key_filename: + if os.path.isfile(key_filename): + os.remove(key_filename) + if os.path.isfile(f"{key_filename}.pub"): + os.remove(f"{key_filename}.pub") + + def clear(self, job_keys=None): + """ + Delete all the workers + """ + # clear() is automatically called after get_result(), + self.dismantle(include_master=False) + + def dismantle(self, include_master=True): + """ + Stop all worker VM instances + """ + if len(self.workers) > 0: + with ThreadPoolExecutor(len(self.workers)) as ex: + ex.map(lambda worker: worker.stop(), self.workers) + self.workers = [] + + if include_master and self.master: + self.master.stop() + + def get_instance(self, name, **kwargs): + instance = GCPComputeEngieInstance(self.config, self.compute_client) + instance.name = name + for key in kwargs: + if hasattr(instance, key) and kwargs[key] is not None: + setattr(instance, key, kwargs[key]) + return instance + + def get_worker_instance_type(self): + return self.config['worker_instance_type'] + + def get_worker_cpu_count(self): + return self.instance_types.get(self.config['worker_instance_type'], 1) + + def create_worker(self, name): + """ + Creates a new worker VM instance (same flow as AWS EC2 / IBM VPC). + """ + if self.mode == StandaloneMode.CONSUME.value: + raise NotImplementedError(f'{self.name}.create_worker() not available in consume mode') + + worker = GCPComputeEngieInstance(self.config, self.compute_client, public=False) + worker.name = name + worker.instance_type = self.config['worker_instance_type'] + + user = worker.ssh_credentials['username'] + pub_key = os.path.join(self.cache_dir, f'{self.master.name}-id_rsa.pub') + ssh_public_key = None + user_data = None + + if os.path.isfile(pub_key): + with open(pub_key, 'r') as pk: + pk_data = pk.read().strip() + user_data = CLOUD_CONFIG_WORKER_PK.format(user, pk_data) + ssh_public_key = pk_data + worker.ssh_credentials['key_filename'] = '~/.ssh/lithops_id_rsa' + worker.ssh_credentials.pop('password', None) + else: + logger.error(f'Unable to locate {pub_key}') + worker.ssh_credentials.pop('key_filename', None) + token = worker.ssh_credentials['password'] + user_data = CLOUD_CONFIG_WORKER.format(user, token) + + worker.create(public=False, user_data=user_data, ssh_public_key=ssh_public_key) + self.workers.append(worker) + + def get_runtime_key(self, runtime_name, version=__version__): + runtime = runtime_name.replace('/', '-').replace(':', '-') + master_id = self.gce_data.get('master_id', self.config.get('instance_name', self.master.name)) + return os.path.join(self.name, version, master_id, runtime) + + +class GCPComputeEngieInstance: + + def __init__(self, config, compute_client, public=False): + self.config = config + self.compute_client = compute_client + self.public = public + self.name = self.config['instance_name'] + self.project_name = self.config['project_name'] + self.zone = self.config['zone'] + + self.ssh_client = None + self.instance_data = None + self.instance_id = None + self.private_ip = None + self.public_ip = None + self.delete_on_dismantle = self.config['delete_on_dismantle'] + self.instance_type = self.config['worker_instance_type'] + self.home_dir = '/home/' + self.config['ssh_username'] + + self.ssh_credentials = { + 'username': self.config['ssh_username'], + 'password': self.config['ssh_password'], + 'key_filename': self.config.get('ssh_key_filename', '~/.ssh/id_rsa') + } + + def __str__(self): + ip = self.public_ip or self.private_ip + if ip: + return f'VM instance {self.name} ({ip})' + return f'VM instance {self.name}' + + def get_ssh_client(self): + self.get_instance_data() + if self.public: + if not self.public_ip: + if self.get_status() == 'TERMINATED': + self.start() + else: + self._wait_public_ip(timeout=60) + ip_address = self.public_ip + else: + ip_address = self.private_ip + + if not ip_address: + raise Exception( + f'No IP address available for {self.name} ' + f'(status={self.get_status()}, public={self.public})' + ) + + if not self.ssh_client or self.ssh_client.ip_address != ip_address: + self.ssh_client = SSHClient(ip_address, self.ssh_credentials) + return self.ssh_client + + def del_ssh_client(self): + if self.ssh_client: + try: + self.ssh_client.close() + except Exception: + pass + self.ssh_client = None + + def is_ready(self): + try: + self.get_ssh_client().run_remote_command('id') + except LithopsValidationError as err: + raise err + except Exception as err: + ip = self.public_ip or self.private_ip + logger.debug( + f'SSH to {ip}: {ssh_boot_status_message(err)}' + ) + self.del_ssh_client() + return False + return True + + def wait_ready(self, timeout=INSTANCE_START_TIMEOUT): + logger.debug(f'Waiting {self} to become ready') + start = time.time() + if self.public: + self.get_public_ip() + else: + self.get_private_ip() + while (time.time() - start < timeout): + if self.is_ready(): + return True + time.sleep(5) + raise TimeoutError(f'Readiness probe expired on {self}') + + def get_instance_data(self): + res = self.compute_client.instances().get( + project=self.project_name, + zone=self.zone, + instance=self.name + ).execute() + self.instance_data = res + self.instance_id = str(res.get('id')) + + interfaces = res.get('networkInterfaces', []) + if interfaces: + self.private_ip = interfaces[0].get('networkIP') + access_cfg = interfaces[0].get('accessConfigs', []) + if access_cfg: + self.public_ip = access_cfg[0].get('natIP') + return self.instance_data + + def get_instance_id(self): + if not self.instance_id: + self.get_instance_data() + return self.instance_id + + def get_private_ip(self): + if not self.private_ip: + self.get_instance_data() + return self.private_ip + + def get_public_ip(self): + if not self.public_ip: + self.get_instance_data() + return self.public_ip + + def get_status(self): + try: + self.get_instance_data() + except HttpError: + return None + return self.instance_data.get('status') if self.instance_data else None + + def _wait_public_ip(self, timeout=INSTANCE_START_TIMEOUT): + start = time.time() + while time.time() - start < timeout: + self.get_instance_data() + if self.public_ip: + return self.public_ip + time.sleep(2) + raise TimeoutError(f'Public IP not available for {self.name} after {timeout}s') + + def create(self, public=False, ssh_public_key=None, user_data=None, + check_if_exists=False, **kwargs): + if self._exists(): + self.get_instance_data() + if check_if_exists: + status = self.get_status() + if status == 'TERMINATED': + logger.debug(f'VM instance {self.name} is stopped, starting') + self.start() + elif status == 'RUNNING': + logger.debug(f'VM instance {self.name} already running') + elif status in ('STAGING', 'PROVISIONING', 'REPAIRING', 'STOPPING'): + logger.debug( + f'VM instance {self.name} is {status}, waiting until running' + ) + self._wait_until_status('RUNNING') + return + logger.debug(f'VM instance {self.name} already exists') + return + + logger.debug(f'Creating new VM instance {self.name}') + + if ssh_public_key is None and 'key_filename' in self.ssh_credentials: + pub_path = os.path.expanduser(self.ssh_credentials['key_filename'] + '.pub') + if os.path.isfile(pub_path): + with open(pub_path, 'r') as pkf: + ssh_public_key = pkf.read().strip() + + network_iface = { + 'subnetwork': ( + f'projects/{self.project_name}/regions/{self.config["region"]}' + f'/subnetworks/{self.config["subnet_name"]}' + ) + } + # Master: external IP for SSH from the Lithops client. + # Workers: no external IP; outbound internet uses Cloud NAT on the subnet. + if public or self.config.get('worker_public_ip', False): + network_iface['accessConfigs'] = [{'name': 'External NAT', 'type': 'ONE_TO_ONE_NAT'}] + + body = { + 'name': self.name, + 'machineType': f'zones/{self.zone}/machineTypes/{self.instance_type}', + 'disks': [{ + 'boot': True, + 'autoDelete': True, + 'initializeParams': { + 'sourceImage': self.config['source_image'], + 'diskSizeGb': str(self.config['boot_disk_size']), + 'diskType': f'zones/{self.zone}/diskTypes/{self.config["boot_disk_type"]}' + } + }], + 'networkInterfaces': [network_iface], + 'labels': { + 'type': 'lithops-runtime' + } + } + + metadata_items = [] + if ssh_public_key: + metadata_items.append({ + 'key': 'ssh-keys', + 'value': f'{self.ssh_credentials["username"]}:{ssh_public_key}' + }) + if user_data: + metadata_items.append({'key': 'user-data', 'value': user_data}) + if metadata_items: + body['metadata'] = {'items': metadata_items} + + sa_email = self.config.get('service_account_email') + if sa_email: + body['serviceAccounts'] = [{ + 'email': sa_email, + 'scopes': GCE_INSTANCE_SCOPES, + }] + else: + logger.warning( + f'Creating VM {self.name} without a service account; ' + f'GCS access from the VM will fail' + ) + + if self.config.get('request_spot_instances', False) and not public: + body['scheduling'] = { + 'provisioningModel': 'SPOT', + 'instanceTerminationAction': 'STOP' + } + op = self.compute_client.instances().insert( + project=self.project_name, + zone=self.zone, + body=body + ).execute() + self._wait_zone_operation(op['name']) + self.get_instance_data() + + def _wait_until_status(self, target_status, timeout=INSTANCE_START_TIMEOUT): + start = time.time() + last_status = None + while time.time() - start < timeout: + status = self.get_status() + if status != last_status: + logger.debug( + f'VM instance {self.name} status: {status} ' + f'(waiting for {target_status})' + ) + last_status = status + if status == target_status: + return status + time.sleep(2) + raise TimeoutError( + f'{self.name} did not reach status {target_status} (last: {self.get_status()})' + ) + + def start(self): + status = self.get_status() + if status == 'RUNNING': + logger.debug(f'VM instance {self.name} already running') + if self.public and not self.public_ip: + self._wait_public_ip(timeout=60) + return + + logger.debug(f'Starting VM instance {self.name}') + op = self.compute_client.instances().start( + project=self.project_name, zone=self.zone, instance=self.name + ).execute() + self._wait_zone_operation(op['name']) + logger.debug(f'VM instance {self.name} start operation completed') + self.del_ssh_client() + self.public_ip = None + self._wait_until_status('RUNNING') + if self.public: + self._wait_public_ip() + ip = self.public_ip or self.private_ip + logger.debug(f'VM instance {self.name} started ({ip})') + + def stop(self): + if self.delete_on_dismantle: + return self.delete() + op = self.compute_client.instances().stop( + project=self.project_name, zone=self.zone, instance=self.name + ).execute() + self._wait_zone_operation(op['name']) + + def delete(self): + op = self.compute_client.instances().delete( + project=self.project_name, zone=self.zone, instance=self.name + ).execute() + self._wait_zone_operation(op['name']) + + def validate_capabilities(self): + pass + + def _exists(self): + try: + self.compute_client.instances().get( + project=self.project_name, zone=self.zone, instance=self.name + ).execute() + return True + except HttpError as e: + if getattr(e.resp, 'status', None) == 404: + return False + raise + + def _wait_zone_operation(self, operation_name, timeout=INSTANCE_START_TIMEOUT): + start = time.time() + while time.time() - start < timeout: + op = self.compute_client.zoneOperations().get( + project=self.project_name, zone=self.zone, operation=operation_name + ).execute() + if op['status'] == 'DONE': + if 'error' in op: + raise Exception(op['error']) + return + time.sleep(2) + raise TimeoutError( + f'Zone operation {operation_name} timed out after {timeout}s' + ) diff --git a/lithops/standalone/backends/ibm_vpc/ibm_vpc.py b/lithops/standalone/backends/ibm_vpc/ibm_vpc.py index fc6466893..bea3b5095 100644 --- a/lithops/standalone/backends/ibm_vpc/ibm_vpc.py +++ b/lithops/standalone/backends/ibm_vpc/ibm_vpc.py @@ -29,7 +29,7 @@ from concurrent.futures import ThreadPoolExecutor from lithops.version import __version__ -from lithops.util.ssh_client import SSHClient +from lithops.util.ssh_client import SSHClient, ssh_boot_status_message from lithops.constants import COMPUTE_CLI_MSG, CACHE_DIR from lithops.config import load_yaml_config, dump_yaml_config from lithops.standalone.utils import ( @@ -555,7 +555,7 @@ def build_image(self, image_name, script_file, overwrite, include, extra_args=[] logger.debug(f"Uploading installation script to {build_vm}") remote_script = "/tmp/install_lithops.sh" - script = get_host_setup_script() + script = get_host_setup_script(lithops_pip_spec='lithops[ibm,redis]') build_vm.get_ssh_client().upload_data_to_file(script, remote_script) logger.debug("Executing Lithops installation script. Be patient, this process can take up to 3 minutes") build_vm.get_ssh_client().run_remote_command(f"chmod 777 {remote_script}; sudo {remote_script}; rm {remote_script};") @@ -963,14 +963,13 @@ def is_ready(self): """ Checks if the VM instance is ready to receive ssh connections """ - login_type = 'password' if 'password' in self.ssh_credentials and \ - not self.public else 'publickey' try: self.get_ssh_client().run_remote_command('id') except LithopsValidationError as err: raise err except Exception as err: - logger.debug(f'SSH to {self.public_ip if self.public else self.private_ip} failed ({login_type}): {err}') + ip = self.public_ip if self.public else self.private_ip + logger.debug(f'SSH to {ip}: {ssh_boot_status_message(err)}') self.del_ssh_client() return False return True diff --git a/lithops/standalone/backends/vm/vm.py b/lithops/standalone/backends/vm/vm.py index 524223c5b..1ddb2e9ad 100644 --- a/lithops/standalone/backends/vm/vm.py +++ b/lithops/standalone/backends/vm/vm.py @@ -21,7 +21,7 @@ from lithops.standalone.utils import StandaloneMode from lithops.version import __version__ from lithops.constants import COMPUTE_CLI_MSG -from lithops.util.ssh_client import SSHClient +from lithops.util.ssh_client import SSHClient, ssh_boot_status_message from lithops.standalone import LithopsValidationError logger = logging.getLogger(__name__) @@ -126,7 +126,7 @@ def is_ready(self): except LithopsValidationError as e: raise e except Exception as e: - logger.debug(f'ssh to {self.public_ip} failed: {e}') + logger.debug(f'SSH to {self.public_ip}: {ssh_boot_status_message(e)}') self.del_ssh_client() return False return True diff --git a/lithops/standalone/master.py b/lithops/standalone/master.py index 57ae7d9bb..c25d7736a 100644 --- a/lithops/standalone/master.py +++ b/lithops/standalone/master.py @@ -57,7 +57,8 @@ StandaloneMode, WorkerStatus, get_host_setup_script, - get_worker_setup_script + get_worker_setup_script, + install_script_kwargs_from_config, ) os.makedirs(LITHOPS_TEMP_DIR, exist_ok=True) @@ -204,13 +205,24 @@ def check_worker(worker_data): return response +def _redis_field(value): + """ + Redis hash values must be bytes, str, int, or float. + """ + if isinstance(value, (dict, list)): + return json.dumps(value) + if isinstance(value, bool): + return str(value) + return value + + def save_worker(worker, standalone_config, work_queue_name): """ Saves the worker instance with the provided data in redis """ config = copy.deepcopy(standalone_config) del config[config['backend']] - config = {key: str(value) if isinstance(value, bool) else value for key, value in config.items()} + config = {key: _redis_field(value) for key, value in config.items()} worker_processes = CPU_COUNT if worker.config['worker_processes'] == 'AUTO' \ else worker.config['worker_processes'] @@ -303,7 +315,9 @@ def wait_worker_ready(worker): 'lithops_version': __version__ } remote_script = "/tmp/install_lithops.sh" - script = get_host_setup_script() + script = get_host_setup_script( + run_install=False, **install_script_kwargs_from_config(standalone_handler.config) + ) script += get_worker_setup_script(standalone_handler.config, vm_data) logger.debug(f'Submitting installation script to {worker}') @@ -346,7 +360,10 @@ def setup_worker_consume(standalone_handler, worker_info, work_queue_name): 'lithops_version': __version__ } worker_setup_script = "/tmp/install_lithops.sh" - script = get_worker_setup_script(standalone_handler.config, vm_data) + script = get_host_setup_script( + run_install=False, **install_script_kwargs_from_config(standalone_handler.config) + ) + script += get_worker_setup_script(standalone_handler.config, vm_data) with open(worker_setup_script, 'w') as wis: wis.write(script) diff --git a/lithops/standalone/standalone.py b/lithops/standalone/standalone.py index 3b146887c..44e3e14fb 100644 --- a/lithops/standalone/standalone.py +++ b/lithops/standalone/standalone.py @@ -40,7 +40,8 @@ StandaloneMode, LithopsValidationError, get_host_setup_script, - get_master_setup_script + get_master_setup_script, + install_script_kwargs_from_config, ) from lithops.version import __version__ @@ -435,7 +436,7 @@ def _setup_master_service(self): logger.debug('Be patient, initial installation process may take up to 3 minutes') remote_script = "/tmp/install_lithops.sh" - script = get_host_setup_script() + script = get_host_setup_script(run_install=False, **install_script_kwargs_from_config(self.config)) script += get_master_setup_script(self.config, master_data) ssh_client.upload_data_to_file(script, remote_script) diff --git a/lithops/standalone/utils.py b/lithops/standalone/utils.py index 3009ee456..d52208ec7 100644 --- a/lithops/standalone/utils.py +++ b/lithops/standalone/utils.py @@ -1,5 +1,7 @@ import os +import re import json +import shlex from enum import Enum from lithops.constants import ( @@ -101,9 +103,90 @@ class LithopsValidationError(Exception): """ -def get_host_setup_script(docker=True): +def _normalize_package_list(packages): + if not packages: + return [] + if isinstance(packages, str): + return [p.strip() for p in packages.split() if p.strip()] + return [str(p).strip() for p in packages if str(p).strip()] + + +def _format_apt_packages_for_shell(packages): + safe = [] + for package in _normalize_package_list(packages): + if not re.match(r'^[a-z0-9][a-z0-9.+~-]*$', package, re.IGNORECASE): + raise LithopsValidationError( + f'Invalid apt package name "{package}" in extra_apt_packages' + ) + safe.append(package) + return ' '.join(safe) + + +def _format_pip_packages_for_shell(packages): + quoted = [] + for package in _normalize_package_list(packages): + if re.search(r'[;&|`$(){}]', package): + raise LithopsValidationError( + f'Invalid pip package spec "{package}" in extra_python_packages' + ) + quoted.append(shlex.quote(package)) + return ' '.join(quoted) + + +def install_script_kwargs_from_config(config=None): + """ + Build keyword arguments for get_host_setup_script() from standalone config. + """ + config = config or {} + return { + 'lithops_pip_spec': lithops_pip_spec_from_config(config), + 'extra_apt_packages': _format_apt_packages_for_shell(config.get('extra_apt_packages')), + 'extra_python_packages': _format_pip_packages_for_shell(config.get('extra_python_packages')), + } + + +def lithops_pip_spec_from_config(config=None, default='lithops'): + """ + Build a minimal pip spec from lithops config (avoid lithops[all] on VMs). + Standalone master/workers always need the redis extra for the job queue. """ - Returns the script necessary for installing a lithops VM host + if not config: + return default + + extras = {'redis'} + lithops_cfg = config.get('lithops') or {} + for key in ('backend', 'storage'): + name = (config.get(key) or lithops_cfg.get(key) or '').lower() + if name.startswith('gcp'): + extras.add('gcp') + elif name.startswith('aws') or name in ('aws_s3', 'aws_sqs'): + extras.add('aws') + elif name.startswith('azure'): + extras.add('azure') + elif name.startswith('ibm'): + extras.add('ibm') + elif name.startswith('aliyun'): + extras.add('aliyun') + elif name in ('oracle', 'oci', 'oracle_storage'): + extras.add('oracle') + + cloud_extras = extras - {'redis'} + if not cloud_extras: + return 'lithops[redis]' + return f"lithops[{','.join(sorted(extras))}]" + + +def get_host_setup_script( + docker=True, + run_install=True, + lithops_pip_spec='lithops', + extra_apt_packages='', + extra_python_packages='', +): + """ + Returns the script necessary for installing a lithops VM host. + Set run_install=False when appending master/worker setup (they run install first). + extra_apt_packages/extra_python_packages are pre-validated space-separated strings. """ script = f"""#!/bin/bash mkdir -p {SA_INSTALL_DIR}; @@ -116,7 +199,34 @@ def get_host_setup_script(docker=True): done; }} + apt_install(){{ + # Serialize apt and recover from interrupted/corrupt package lists. + flock -w 600 /var/lib/dpkg/lock-frontend apt-get "$@" || {{ + echo "--> apt failed, repairing package lists and retrying" + rm -rf /var/lib/apt/lists/partial/* + apt-get clean + apt-get update + flock -w 600 /var/lib/dpkg/lock-frontend apt-get "$@" + }} + }} + + configure_redis_for_standalone(){{ + # Workers connect to the master private IP; Redis must not listen on loopback only. + if [ ! -f /etc/redis/redis.conf ]; then + return 0 + fi + echo "--> Configuring Redis for standalone workers (bind 0.0.0.0)" + sed -i -E 's/^bind .*/bind 0.0.0.0 -::1/' /etc/redis/redis.conf + if grep -q '^protected-mode yes' /etc/redis/redis.conf; then + sed -i 's/^protected-mode yes/protected-mode no/' /etc/redis/redis.conf + fi + systemctl enable redis-server.service + systemctl restart redis-server.service + }} + install_packages(){{ + set -e + export DEBIAN_FRONTEND=noninteractive export DOCKER_REQUIRED={str(docker).lower()}; command -v docker >/dev/null 2>&1 || {{ export INSTALL_DOCKER=true; export INSTALL_LITHOPS_DEPS=true;}}; command -v unzip >/dev/null 2>&1 || {{ export INSTALL_LITHOPS_DEPS=true; }}; @@ -124,38 +234,56 @@ def get_host_setup_script(docker=True): if [ "$INSTALL_DOCKER" = true ] && [ "$DOCKER_REQUIRED" = true ]; then wait_internet_connection; - echo "--> Installing Docker" - apt-get update; - apt-get install apt-transport-https ca-certificates curl software-properties-common gnupg-agent -y; - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -; - add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"; + echo "--> Installing Docker repository" + apt_install update + apt_install install -y apt-transport-https ca-certificates curl gnupg software-properties-common + curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg + DOCKER_APT="deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg]" + DOCKER_APT="$DOCKER_APT https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" + echo "$DOCKER_APT" > /etc/apt/sources.list.d/docker.list fi; if [ "$INSTALL_LITHOPS_DEPS" = true ]; then wait_internet_connection; echo "--> Installing Lithops system dependencies" - apt-get update; + apt_install update if [ "$INSTALL_DOCKER" = true ] && [ "$DOCKER_REQUIRED" = true ]; then - apt-get install unzip redis-server python3-pip docker-ce docker-ce-cli containerd.io -y --fix-missing; + apt_install install -y unzip redis-server python3-pip docker-ce docker-ce-cli containerd.io else - apt-get install unzip redis-server python3-pip -y --fix-missing; + apt_install install -y unzip redis-server python3-pip fi; - sudo systemctl enable redis-server.service; - sed -i 's/^bind 127.0.0.1 ::1/bind 0.0.0.0/' /etc/redis/redis.conf; - sudo systemctl restart redis-server.service; + configure_redis_for_standalone fi; - if [[ ! $(pip3 list|grep "lithops") ]]; then + EXTRA_APT="{extra_apt_packages}" + if [ -n "$EXTRA_APT" ]; then wait_internet_connection; - echo "--> Installing Lithops python dependencies" - pip3 install -U pip flask gevent lithops[all]; + apt_install update + echo "--> Installing extra apt packages: $EXTRA_APT" + apt_install install -y $EXTRA_APT + fi; + + if ! pip3 list 2>/dev/null | grep -q lithops; then + wait_internet_connection; + echo "--> Installing Lithops python dependencies ({lithops_pip_spec})" + export PIP_BREAK_SYSTEM_PACKAGES=1 + # --ignore-installed: do not uninstall Debian python packages (avoids RECORD errors) + pip3 install --ignore-installed -U pip + pip3 install --ignore-installed flask gevent {lithops_pip_spec} + fi; + + EXTRA_PY="{extra_python_packages}" + if [ -n "$EXTRA_PY" ]; then + echo "--> Installing extra python packages: $EXTRA_PY" + export PIP_BREAK_SYSTEM_PACKAGES=1 + pip3 install --ignore-installed $EXTRA_PY fi; }} - install_packages >> {SA_SETUP_LOG_FILE} 2>&1 - touch {SA_SETUP_DONE_FILE}; """ + if run_install: + script += f"install_packages >> {SA_SETUP_LOG_FILE} 2>&1 && touch {SA_SETUP_DONE_FILE};\n" return script @@ -182,8 +310,8 @@ def get_master_setup_script(config, vm_data): echo '{json.dumps(vm_data)}' > {SA_MASTER_DATA_FILE}; echo '{json.dumps(config)}' > {SA_CONFIG_FILE}; }} - setup_host >> {SA_SETUP_LOG_FILE} 2>&1; setup_service(){{ + configure_redis_for_standalone >> {SA_SETUP_LOG_FILE} 2>&1 echo '{MASTER_SERVICE_FILE}' > /etc/systemd/system/{MASTER_SERVICE_NAME}; chmod 644 /etc/systemd/system/{MASTER_SERVICE_NAME}; systemctl daemon-reload; @@ -191,7 +319,6 @@ def get_master_setup_script(config, vm_data): systemctl enable {MASTER_SERVICE_NAME}; systemctl start {MASTER_SERVICE_NAME}; }} - setup_service >> {SA_SETUP_LOG_FILE} 2>&1; USER_HOME=$(eval echo ~${{SUDO_USER}}); generate_ssh_key(){{ echo ' StrictHostKeyChecking no @@ -204,7 +331,10 @@ def get_master_setup_script(config, vm_data): echo '127.0.0.1 lithops-master' >> /etc/hosts; cat $USER_HOME/.ssh/id_rsa.pub >> $USER_HOME/.ssh/authorized_keys; }} - test -f $USER_HOME/.ssh/lithops_id_rsa || generate_ssh_key >> {SA_SETUP_LOG_FILE} 2>&1; + install_packages >> {SA_SETUP_LOG_FILE} 2>&1 && touch {SA_SETUP_DONE_FILE} && \\ + setup_host >> {SA_SETUP_LOG_FILE} 2>&1 && \\ + setup_service >> {SA_SETUP_LOG_FILE} 2>&1 && \\ + (test -f $USER_HOME/.ssh/lithops_id_rsa || generate_ssh_key >> {SA_SETUP_LOG_FILE} 2>&1) echo 'tail -f -n 100 /tmp/lithops-*/master-service.log'>> $USER_HOME/.bash_history """ return script @@ -237,7 +367,6 @@ def get_worker_setup_script(config, vm_data): echo '{json.dumps(vm_data)}' > {SA_WORKER_DATA_FILE}; echo '{json.dumps(config)}' > {SA_CONFIG_FILE}; }} - setup_host >> {SA_SETUP_LOG_FILE} 2>&1; USER_HOME=$(eval echo ~${{SUDO_USER}}); setup_service(){{ echo '{WORKER_SERVICE_FILE.format(cmd_pre, cmd_start, cmd_stop)}' > /etc/systemd/system/{WORKER_SERVICE_NAME}; @@ -247,6 +376,8 @@ def get_worker_setup_script(config, vm_data): systemctl enable {WORKER_SERVICE_NAME}; systemctl start {WORKER_SERVICE_NAME}; }} + install_packages >> {SA_SETUP_LOG_FILE} 2>&1 && touch {SA_SETUP_DONE_FILE} && \\ + setup_host >> {SA_SETUP_LOG_FILE} 2>&1 && \\ setup_service >> {SA_SETUP_LOG_FILE} 2>&1 echo '{vm_data['master_ip']} lithops-master' >> /etc/hosts echo 'tail -f -n 100 {SA_WORKER_LOG_FILE}'>> $USER_HOME/.bash_history diff --git a/lithops/util/ssh_client.py b/lithops/util/ssh_client.py index 4e33a8742..92c35fd07 100644 --- a/lithops/util/ssh_client.py +++ b/lithops/util/ssh_client.py @@ -4,6 +4,25 @@ logger = logging.getLogger(__name__) +# Paramiko logs full tracebacks on transient boot-time failures (banner EOF, +# port closed, etc.). Lithops already retries; keep paramiko quiet. +for _log_name in ('paramiko', 'paramiko.transport', 'paramiko.client'): + logging.getLogger(_log_name).setLevel(logging.CRITICAL) + + +def ssh_boot_status_message(err): + """ + Map transient SSH errors during VM boot to a short user-facing status. + """ + msg = str(err).lower() + if 'timed out' in msg or 'timeout' in msg: + return 'VM is starting, waiting for network/SSH' + if 'unable to connect' in msg or 'connection refused' in msg: + return 'VM is up, starting SSH service' + if 'banner' in msg or 'no existing session' in msg or 'connection reset' in msg: + return 'Configuring SSH on VM' + return str(err) + class SSHClient(): @@ -23,7 +42,11 @@ def close(self): """ Closes the SSH client connection """ - self.ssh_client.close() + if self.ssh_client: + try: + self.ssh_client.close() + except Exception: + pass self.ssh_client = None def create_client(self, timeout=2):