A tool that simplifies training RL agents for cloud resource management by bridging CloudSim Plus with Gymnasium

tgasla/rl-cloudsimplus

RL-CloudSimPlus

Codacy Badge GPLv3 License

Requirements

1. Install Docker

https://docs.docker.com/get-docker/

Warning

If you install Docker Desktop for macOS, make sure you allocate enough memory to your containers by going to Settings... > Resources and increasing the Memory Limit.

2. Install Docker Compose

https://docs.docker.com/compose/install/

3. Install Java OpenJDK 21

  • For Debian, install the openjdk-21-jdk and openjdk-21-jre packages
sudo apt install openjdk-21-jdk openjdk-21-jre
  • For macOS, using Homebrew
brew install openjdk@21

4. Set the JAVA_HOME environment variable to the right path

Important

The exact path may vary (different distro, different arch, etc.).

  • For Linux
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-<arch>
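  • For macOS (/usr/libexec/java_home is a helper program that prints the JDK path, so its output is what must be assigned)
export JAVA_HOME=$(/usr/libexec/java_home -v 21)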

Building the docker images

1. Build TensorBoard, Gateway and RL manager images

make build

Important

It is often useful to rebuild images one at a time, especially when a change affects only one part of the application. For example, when we change the gateway code, we must rebuild the gateway image before running the application. Optionally, when building, we can specify the log level of the messages printed to stdout and written to the log file. If the LOG_LEVEL environment variable is not set, the log level defaults to INFO.

[LOG_LEVEL=[TRACE|DEBUG|INFO|WARN|ERROR]] make build-gateway

Starting the TensorBoard dashboard

We have created three docker images. The gateway and manager images make up the main application and are the docker compose services we need for every experiment we want to run. The TensorBoard image serves the UI and helps us keep track of an experiment's progress. Because we do not want to shut down the visualization dashboard every time we stop an experiment, the TensorBoard image is not a docker compose service; it is started as a standalone docker container with the following command (TODO: consider changing it):

make run-tensorboard

Note

You can check that the TensorBoard dashboard is running by visiting http://localhost.

Editing the experiment configuration file

To run an experiment, first edit the configuration file located at rl-manager/mnt/config.yml.

The configuration file contains two sections: the 'common' section and the 'experiment' section. This is because we may run multiple different experiments in parallel.

  • The parameters that all experiments have in common are specified under the common section, and the parameters that differ between experiments are defined under the experiment_{id} sections.
    • If a parameter is specified in both the common section and an experiment section, the common value is ignored and the experiment value takes effect.
  • To run multiple experiments in parallel, add as many experiment sections as you want, specifying the corresponding parameters for each experiment.
  • Each experiment should have a unique experiment id, and each section should be named experiment_{id}. For simplicity, use ids starting at 1 and incrementing by 1.
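The precedence rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual loader: the function name resolve_experiments and the parameter names algorithm, timesteps and seed are hypothetical, and the dictionary stands in for whatever a YAML parser would return for config.yml.

```python
def resolve_experiments(config):
    """Return {section_name: effective_params} for every experiment_{id} section.

    Values set in an experiment section override values from 'common'.
    """
    common = config.get("common") or {}
    resolved = {}
    for name, params in config.items():
        if not name.startswith("experiment_"):
            continue  # skip 'common' and anything that is not an experiment
        # In a dict merge the right-hand side wins, so experiment values
        # take precedence over the common ones.
        resolved[name] = {**common, **(params or {})}
    return resolved


# Hypothetical parameters, written as the dict a YAML parser would produce.
config = {
    "common": {"algorithm": "PPO", "timesteps": 100_000},
    "experiment_1": {"seed": 1},
    "experiment_2": {"seed": 2, "timesteps": 50_000},
}

effective = resolve_experiments(config)
print(effective["experiment_1"])
print(effective["experiment_2"])
```

Here experiment_1 inherits both common values, while experiment_2 keeps the common algorithm but uses its own timesteps.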

There are three experiment modes: train, transfer, and test. When transfer or test mode is specified, the experiment must also define a 'train_model_dir' key naming the directory that contains the trained agent model to use.
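As a rough sketch of the rule for the three modes, the check below flags an experiment whose mode needs a trained model but does not name one. The key name mode, the function check_experiment and the directory name my_trained_model are assumptions made for illustration; only train_model_dir and the three mode names come from the text above.

```python
VALID_MODES = {"train", "transfer", "test"}
MODES_NEEDING_MODEL = {"transfer", "test"}


def check_experiment(name, params):
    """Return a list of problems found in one experiment's effective parameters."""
    problems = []
    mode = params.get("mode")  # 'mode' is an assumed key name
    if mode not in VALID_MODES:
        problems.append(f"{name}: unknown mode {mode!r}")
    elif mode in MODES_NEEDING_MODEL and not params.get("train_model_dir"):
        problems.append(f"{name}: mode {mode!r} requires train_model_dir")
    return problems


# A train experiment needs no model directory; test and transfer do.
print(check_experiment("experiment_1", {"mode": "train"}))
print(check_experiment("experiment_2", {"mode": "test"}))
print(check_experiment(
    "experiment_3",
    {"mode": "transfer", "train_model_dir": "my_trained_model"},  # hypothetical name
))
```

Running a check like this before starting the containers surfaces a missing train_model_dir early instead of partway through an experiment.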

Running an experiment

After you have finished editing the configuration file, run the following command to start the experiment(s):

make run-cpu

Note

This command runs all the docker containers in detached mode. If you want to use attached mode, run the following command instead:

make run-cpu-attached

CUDA GPU support

Experiments can also run on CUDA GPUs.

  • You need to have CUDA and nvidia-container-toolkit installed on your system.
  • Make sure to restart the docker daemon if you have just installed nvidia-container-toolkit.

To run on all available CUDA devices in detached mode, use the following command:

make run-gpu

For attached mode, use:

make run-gpu-attached

Stopping the application

If you want to stop the application and clear all the dangling containers and volumes, run the following command:

make stop

Acknowledgements

This project uses the CloudSim Plus framework: a full-featured, highly extensible, easy-to-use Java 17+ framework for modeling and simulating cloud computing infrastructure and services. The source code is available here.

The code was based on the work done by pkoperek in the following projects: