diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/404.html b/404.html new file mode 100644 index 00000000..83421cfb --- /dev/null +++ b/404.html @@ -0,0 +1,1891 @@ + + + +
+In the complex landscape of multi-task learning, AdaMerging has emerged as a method for adaptively merging model parameters to optimize performance across tasks. Unlike traditional fixed-coefficient methods, AdaMerging learns the merging coefficients automatically, offering a more refined and responsive approach [1].
+The cornerstone of AdaMerging lies in its adaptive nature, where it learns the coefficients for merging either on a task-wise or layer-wise basis. This adaptability is driven by an entropy minimization strategy applied to unlabeled test samples as a surrogate objective function, which serves to refine the merging coefficients for optimal performance.
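+Task-wise AdaMerging is formulated as:
+\[ \theta = \theta_0 + \sum_{i=1}^{n} \lambda_i \tau_i, \]
+where \(\theta_0\) denotes the pretrained parameters, \(n\) is the number of tasks, \(\lambda_i\) represents the merging coefficient for the \(i\)-th task, and \(\tau_i\) denotes the task vector for the \(i\)-th task.
+On the other hand, Layer-wise AdaMerging is articulated as:
+\[ \theta^{l} = \theta_0^{l} + \sum_{i=1}^{n} \lambda^{l}_{i} \tau^{l}_{i}, \]
+where the merging coefficient \(\lambda^{l}_{i}\) and task vector \(\tau^{l}_{i}\) are specific to each layer \(l\) of the model.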
+Because the coefficients are learned rather than fixed, AdaMerging can weight each task vector (and, in the layer-wise variant, each layer of each task vector) according to how much it helps the merged model. Entropy minimization pushes the merged model toward confident predictions on the unlabeled test samples, so the coefficients adapt to the datasets and tasks at hand without requiring labels.
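The two merging rules can be sketched on toy data. The snippet below is a minimal illustration, not FusionBench code: each model is a plain list of layers, each layer a list of floats, and every function name is hypothetical. The coefficients are given by hand here, whereas AdaMerging would learn them.

```python
# Toy illustration of task-wise vs. layer-wise merging.
# All names are hypothetical; this is not the FusionBench implementation.

def task_vectors(pretrained, finetuned_models):
    """tau_i = theta_i - theta_0, computed layer by layer."""
    return [
        [[w - w0 for w, w0 in zip(layer, layer0)]
         for layer, layer0 in zip(model, pretrained)]
        for model in finetuned_models
    ]

def task_wise_merge(pretrained, taus, lambdas):
    """theta = theta_0 + sum_i lambda_i * tau_i (one coefficient per task)."""
    merged = [list(layer) for layer in pretrained]
    for lam, tau in zip(lambdas, taus):
        for l, layer in enumerate(tau):
            for k, delta in enumerate(layer):
                merged[l][k] += lam * delta
    return merged

def layer_wise_merge(pretrained, taus, lambdas):
    """theta^l = theta_0^l + sum_i lambdas[i][l] * tau_i^l (one coefficient per task and layer)."""
    merged = [list(layer) for layer in pretrained]
    for i, tau in enumerate(taus):
        for l, layer in enumerate(tau):
            for k, delta in enumerate(layer):
                merged[l][k] += lambdas[i][l] * delta
    return merged

theta_0 = [[0.0, 0.0], [1.0]]   # pretrained model with two layers
theta_a = [[1.0, 0.0], [1.0]]   # fine-tuned on task A
theta_b = [[0.0, 2.0], [3.0]]   # fine-tuned on task B
taus = task_vectors(theta_0, [theta_a, theta_b])

merged_task = task_wise_merge(theta_0, taus, [0.5, 0.5])
merged_layer = layer_wise_merge(theta_0, taus, [[1.0, 0.0], [0.5, 1.0]])
```

The layer-wise variant has `num_models * num_layers` coefficients instead of `num_models`, which is what lets it treat shallow and deep layers differently.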
+Task-wise Coefficients.
+The figure below shows how the merging coefficient of each task vector evolves during optimization in Task-wise AdaMerging and AdaMerging++, sampled every ten steps. We consistently observe that the learned coefficients differ from one task vector to another. When the number of tasks is large, grid-searching the coefficient of each task is impractical; AdaMerging avoids this manual search.
+ +Layer-wise Coefficients. +The following Figure shows the merging coefficients learned by Layer-wise AdaMerging and AdaMerging++ on ViT-B/32 respectively. We observed that:
+Merge CLIP-ViT-B/32 models from eight downstream image classification tasks:
+fusion_bench \
+ method=adamerging \
+ method.name=clip_layer_wise_adamerging \
+ method.save_merging_weights=merging_weights.pt \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ fabric_logger.root_dir=outputs/logs/ViT-B-32 \
+ fabric_logger.name=clip_layer_wise_adamerging_adam
+
Part of the output:
+Profiler Report
+
+----------------------------------------------------------------------------------------------------------------------------------
+| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
+----------------------------------------------------------------------------------------------------------------------------------
+| Total | - | 26001 | 724.65 | 100 % |
+----------------------------------------------------------------------------------------------------------------------------------
+| backward pass | 0.060172 | 8000 | 481.38 | 66.429 |
+| forward pass | 0.016124 | 8000 | 128.99 | 17.801 |
+| data loading | 0.0063443 | 8000 | 50.754 | 7.004 |
+| merging weights | 0.050735 | 1000 | 50.735 | 7.0013 |
+| construct the wrapped model | 7.2558 | 1 | 7.2558 | 1.0013 |
+| optimizer step | 0.00098186 | 1000 | 0.98186 | 0.13549 |
+----------------------------------------------------------------------------------------------------------------------------------
+
task_wise_adamerging
+
+
+¶
entropy_loss(logits)
+
+¶Compute the entropy loss of a set of logits.
+ + +Parameters:
+logits
+ (Tensor
)
+ –
+ The logits to compute the entropy loss of.
+Returns:
+Tensor
( Tensor
+) –
+ The entropy loss of the logits.
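As a rough, dependency-free sketch of what such a loss computes, the snippet below takes a batch of logit vectors, applies a softmax to each, and returns the mean Shannon entropy. It uses plain Python lists instead of tensors, and the `eps` constant is an assumption added for numerical safety; the actual FusionBench function operates on `Tensor` inputs and may differ in detail.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_loss(batch_logits, eps=1e-8):
    """Mean Shannon entropy of softmax(logits) over a batch of logit vectors.

    `eps` (an assumption of this sketch) avoids log(0) for near-zero probabilities.
    """
    entropies = []
    for logits in batch_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p + eps) for p in probs))
    return sum(entropies) / len(entropies)

uniform = entropy_loss([[0.0, 0.0, 0.0, 0.0]])    # maximally uncertain prediction
confident = entropy_loss([[100.0, 0.0, 0.0]])     # almost one-hot prediction
```

Minimizing this quantity with respect to the merging coefficients rewards merged models that make confident predictions on unlabeled samples.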
+fusion_bench/method/adamerging/task_wise_adamerging.py
clip_task_wise_adamerging
+
+
+¶
CLIPTaskWiseAdaMergingAlgorithm
+
+
+¶
+ Bases: TaskWiseAdaMergingAlgorithm
fusion_bench/method/adamerging/clip_task_wise_adamerging.py
get_test_dataset(task)
+
+
+ cached
+
+
+¶Load the test dataset for the task. +This method is cached, so the dataset is loaded only once.
+ +fusion_bench/method/adamerging/clip_task_wise_adamerging.py
on_test_time_adaptation_start()
+
+¶Here we load the CLIP processor and construct the zero-shot classification head for each task.
+ +fusion_bench/method/adamerging/clip_task_wise_adamerging.py
layer_wise_adamerging
+
+
+¶
LayerWiseAdaMergingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
, LightningFabricMixin
, SimpleProfilerMixin
fusion_bench/method/adamerging/layer_wise_adamerging.py
construct_layer_wise_merged_model(modelpool)
+
+¶Constructs a wrapped layer-wise merged model from model pool.
+This method creates a new wrapped model by merging the layers of a pretrained model with those of several fine-tuned models.
+The merging is controlled by layer-wise weights, which is a torch.Tensor
of the shape (num_models, num_layers)
.
+The merging weights can be initialized based on a provided configuration or loaded from a file.
Parameters:
+modelpool
+ (ModelPool
)
+ –
+ An object containing the pretrained model and fine-tuned models to be merged.
+Returns:
+LayerWiseMergedModel
–
+ An instance of the merged model with layer-wise weights applied.
+fusion_bench/method/adamerging/layer_wise_adamerging.py
get_shuffled_test_loader_iter(task)
+
+
+ abstractmethod
+
+
+¶Loader of test dataset for test-time adaptation. labels are not needed.
+ + +
on_test_time_adaptation_start()
+
+¶Something to do before the test-time adaptation starts. Such as setting up the task-specific heads.
+ + +
clip_layer_wise_adamerging
+
+
+¶Example Usage:
+fusion_bench method=adamerging method.name=clip_layer_wise_adamerging method.save_merging_weights=merging_weights.pt modelpool=clip-vit-base-patch32_TA8 taskpool=clip-vit-classification_TA8 fabric_logger.root_dir=outputs/logs/ViT-B-32 fabric_logger.name=clip_layer_wise_adamerging_adam
+
CLIPLayerWiseAdaMergingAlgorithm
+
+
+¶
+ Bases: CLIPClassificationMixin
, LayerWiseAdaMergingAlgorithm
fusion_bench/method/adamerging/clip_layer_wise_adamerging.py
on_test_time_adaptation_start()
+
+¶Here we load the CLIP processor and construct the zero-shot classification head for each task.
+1. (ICLR 2024) AdaMerging: Adaptive Model Merging for Multi-Task Learning. https://openreview.net/pdf?id=nZP6NgD3QY
+2. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014.
+3. A. Tang, L. Shen, Y. Luo, N. Yin, L. Zhang, and D. Tao, “Merging Multi-Task Models via Weight-Ensembling Mixture of Experts,” ICML 2024. doi: 10.48550/arXiv.2402.00433.
+Consider a discrete categorical distribution parameterized by logits \(\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^{n}\), where \(x_i\) is the logit of the \(i\)-th category. The Gumbel-Max trick [1, 2, 3] is a reparameterization trick for sampling from the categorical distribution: draw samples from the standard Gumbel distribution \(\text{Gumbel}(\mu=0,\beta=1)\), add them to the logits, and take the argmax of the sums.
+The trick proceeds as follows: sample \(n\) Gumbel random variables \(g_1, \dots, g_n\) independently from the standard Gumbel distribution \(\text{Gumbel}(\mu=0,\beta=1)\) (draw a random sample \(u\) from the uniform distribution on the interval \((0,1)\) and transform it into a Gumbel-distributed variable \(g\) using the formula \(g=-\log(-\log u)\)), then find the index \(i\) that maximizes \(x_i + g_i\). We then have
+\[ \mathbb{P}\left(\arg\max_{j} (x_j + g_j) = i\right) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}. \]
+If we represent a sample from the categorical distribution as a one-hot vector \(\mathbf{y} = (y_1, \dots, y_n) \in \{0,1\}^n\), where \(y_i=1\) indicates that the \(i\)-th category is sampled and for all \(j\neq i\), \(y_j=0\), then we have
+\[ \mathbb{P}(y_i = 1) = \mathbb{P}\left(\arg\max_{j} (x_j + g_j) = i\right) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}. \]
+Since the \({\arg\max}\) function is piecewise constant, its derivative is zero almost everywhere and undefined at the jumps, so we cannot backpropagate useful gradients through it.
+To address this issue, Maddison et al. (2017) [4] proposed to use a continuous relaxation of the discrete categorical distribution.
+A CONCRETE random variable (CONtinuous relaxation of disCRETE random variable) relaxes the condition that the one-hot vector \(\mathbf{y}\) must be located at the vertices of the \((n-1)\)-dimensional simplex \(\Delta^{n-1}\), and instead allows \(\mathbf{y}\) to be located anywhere inside the simplex \(\Delta^{n-1}\), i.e. \(\{ y\in \mathbb{R}^n | y_i \in [0,1], \sum_{i=1}^n y_i =1 \}\).
+To sample a Concrete random variable \(\mathbf{y}\) from a distribution that is parameterized by a temperature hyperparameter \(\lambda > 0\) and a vector of logits \(\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^{n}\), we have
+\[ \mathbf{y} = \text{softmax}\left(\frac{\mathbf{x} + \mathbf{g}}{\lambda}\right), \quad \text{i.e.} \quad y_i = \frac{\exp\left((x_i + g_i)/\lambda\right)}{\sum_{j=1}^{n} \exp\left((x_j + g_j)/\lambda\right)}, \]
+where \(\mathbf{g} = (g_1, \dots, g_n)\) is a vector of Gumbel random variables that are independently sampled from the standard Gumbel distribution \(\text{Gumbel}(\mu=0,\beta=1)\).
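Both sampling procedures can be checked numerically. The sketch below uses only the Python standard library; the function names are illustrative and not taken from any library. It draws categorical samples with the Gumbel-Max trick, compares the empirical frequencies with the softmax probabilities, and then draws one relaxed (Concrete) sample by applying a temperature-scaled softmax to the same kind of perturbed logits.

```python
import math
import random

def sample_gumbel(rng):
    """Standard Gumbel sample via g = -log(-log u), with u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:          # random() may return 0.0, which would break log(u)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(logits, rng):
    """Index i maximizing x_i + g_i; distributed as softmax(logits)."""
    perturbed = [x + sample_gumbel(rng) for x in logits]
    return max(range(len(logits)), key=lambda i: perturbed[i])

def concrete_sample(logits, temperature, rng):
    """Relaxed sample y = softmax((x + g) / temperature), a point in the simplex."""
    z = [(x + sample_gumbel(rng)) / temperature for x in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
target = [0.2, 0.3, 0.5]
logits = [math.log(p) for p in target]   # softmax(logits) equals target

n_draws = 20000
counts = [0] * len(logits)
for _ in range(n_draws):
    counts[gumbel_max_sample(logits, rng)] += 1
frequencies = [c / n_draws for c in counts]   # should be close to target

soft = concrete_sample(logits, temperature=0.5, rng=rng)   # entries sum to 1
```

Lower temperatures push `soft` toward a one-hot vector, while higher temperatures flatten it toward the uniform distribution.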
+A subspace mask \(\mathbf{m}\) is a binary vector that identifies a subspace of the parameter space. +For a neural network parametrized by \(\theta\), we can use a subspace mask \(\mathbf{m}\) to identify a subspace of the parameter space \(\mathbf{\theta}\) by setting the parameters that are not in the subspace to zero, i.e. \(\mathbf{\theta} \circ \mathbf{m}\), where \(\circ\) denotes the element-wise product. +We can draw a random sample \(\mathbf{m}\) from a Bernoulli distribution \(\text{Bernoulli}(\mathbf{p}=\sigma(\mathbf{x}))\), where \(\mathbf{p}\) is the probability (\(\mathbf{x}\) denotes the logits) of each parameter being activated. However, the discrete Bernoulli distribution is not differentiable, so we cannot backpropagate the gradients through it to optimize the parameters \(\mathbf{p}\) or \(\mathbf{x}\).
+To address this issue, we introduce the Concrete mask which can be drawn from a continuous relaxation of Bernoulli distribution. Before we introduce the Concrete mask, we first review the Gumbel-Max trick in the two-class case.
+Let \(p_0\) and \(p_1\) denote the unnormalized probabilities of a Bernoulli random variable being 0 and 1, respectively, with \(x = \log(p_1/p_0)\) representing the logit. Then, the probability of the event \(m=1\) is given by
+\[ \mathbb{P}(m=1) = \frac{p_1}{p_0 + p_1} = \sigma(x), \]
+where \(\sigma\) denotes the sigmoid function.
+In the context of the Gumbel-Max trick, the occurrence of the event \(m=1\) is determined by the condition \(g_1 + \log p_1 > g_0 + \log p_0\), where \(g_0\) and \(g_1\) are two independent standard Gumbel random variables.
+Thus we have
+\[ \mathbb{P}(m=1) = \mathbb{P}(g_1 + \log p_1 > g_0 + \log p_0) = \mathbb{P}\left((g_1 - g_0) + (\log p_1 - \log p_0) > 0\right). \]
+Because the difference of two independent standard Gumbel random variables is a Logistic random variable, we can replace \(g_1 - g_0\) by \(\log u - \log(1-u)\), where \(u\) is a random variable sampled from a uniform distribution on the interval \((0,1)\).
+Substituting this into the equation above and expressing the probability in terms of the logit \(x\), we have
+\[ \mathbb{P}(m=1) = \mathbb{P}\left(\log \frac{u}{1-u} + \log \frac{\sigma(x)}{1-\sigma(x)} > 0\right), \quad u \sim \text{Uniform}(0,1). \]
+The binary Concrete distribution offers a continuous relaxation of the discrete Bernoulli random variables, which is beneficial for gradient-based optimization as it allows for the backpropagation of gradients even through the sampling process.
+Instead of making a hard decision as in the above equation, we use a temperature parameter \(\lambda\) to control the steepness of the sigmoid function, and hence control how close our 'soft' decisions are to being 'hard' decisions. The continuous version of the Bernoulli random variable is then given by
+\[ \hat{m} = \sigma\left(\left(\log \frac{u}{1-u} + \log \frac{\sigma(x)}{1-\sigma(x)}\right) / \lambda\right), \quad u \sim \text{Uniform}(0,1). \]
+As the temperature \(\lambda\) approaches zero, the sigmoid function becomes a step function, and the Concrete random variable \(\hat{m}\) becomes a Bernoulli random variable, as shown in the following Figure. In the limit when \(\lambda \to 0\), this results in sampling \(m=1\) if \(\log \frac{\sigma(x)}{1 - \sigma(x)} > -\log \frac{u}{1 - u}\), consistent with the original Gumbel-Max trick. +The binary Concrete distribution thus provides a differentiable approximation to Bernoulli random variables. +We can further binarize the Concrete mask by setting the entries with values greater than 0.5 to 1 and the rest to 0.
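A minimal sketch of drawing and binarizing a Concrete mask, using only the standard library, is given below. The function names are illustrative. It relies on the identity \(\log \frac{\sigma(x)}{1-\sigma(x)} = x\), so the logit is added directly to the Logistic noise \(\log u - \log(1-u)\) before the temperature-scaled sigmoid is applied.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def concrete_mask(logits, temperature, rng):
    """Relaxed Bernoulli mask: sigmoid((log u - log(1 - u) + x) / temperature)."""
    mask = []
    for x in logits:
        u = rng.random()
        while u == 0.0:      # avoid log(0)
            u = rng.random()
        noise = math.log(u) - math.log(1.0 - u)   # Logistic noise
        mask.append(sigmoid((noise + x) / temperature))
    return mask

def binarize(mask, threshold=0.5):
    """Hard mask: 1 where the soft value exceeds the threshold, else 0."""
    return [1 if m > threshold else 0 for m in mask]

rng = random.Random(0)
logits = [2.0] * 20000                      # every entry has P(m = 1) = sigmoid(2.0)
soft_mask = concrete_mask(logits, temperature=0.5, rng=rng)
hard_mask = binarize(soft_mask)
fraction_on = sum(hard_mask) / len(hard_mask)   # should be close to sigmoid(2.0)
```

Thresholding at 0.5 is equivalent to checking whether the noise plus the logit is positive, so the hard mask follows the original Bernoulli distribution regardless of the temperature.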
+ +Merging CLIP models on eight image classification tasks, using the concrete task arithmetic algorithm
+# tensorboard logs and learned checkpoints of the shared mask can be found at https://huggingface.co/tanganke/clip-vit-base-patch32_concrete-task-arithmetic_tblogs
+fusion_bench \
+ fabric_logger.name=ViT-B-32/concrete_task_arithmetic \
+ method=clip_concrete_task_arithmetic \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
results
+{
+ "svhn": {
+ "accuracy": 0.903003990650177,
+ "loss": 0.37700024247169495
+ },
+ "stanford_cars": {
+ "accuracy": 0.6326327323913574,
+ "loss": 1.2553859949111938
+ },
+ "resisc45": {
+ "accuracy": 0.7558730244636536,
+ "loss": 1.017554759979248
+ },
+ "eurosat": {
+ "accuracy": 0.9407407641410828,
+ "loss": 0.20871955156326294
+ },
+ "gtsrb": {
+ "accuracy": 0.8285035490989685,
+ "loss": 0.5861473679542542
+ },
+ "mnist": {
+ "accuracy": 0.9800000190734863,
+ "loss": 0.08148527890443802
+ },
+ "dtd": {
+ "accuracy": 0.5249999761581421,
+ "loss": 2.2731478214263916
+ },
+ "sun397": {
+ "accuracy": 0.6421158909797668,
+ "loss": 1.4108904600143433
+ }
+}
+
Concrete AdaMerging (Layer-wise)
+# tensorboard logs and learned checkpoints of the shared mask can be found at https://huggingface.co/tanganke/clip-vit-base-patch32_concrete-layer-wise_adamerging_tblogs
+fusion_bench \
+ fabric_logger.name=ViT-B-32/clip_concrete_layer_wise_adamerging \
+ method=clip_concrete_layer_wise_adamerging \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
+ X. Yi, S. Zheng, L. Wang, X. Wang, and L. He, “A safety realignment framework via subspace-oriented model fusion for large language models.” arXiv, May 14, 2024. doi: 10.48550/arXiv.2405.09055.
+The paper introduces a safety realignment framework for large language models via subspace-oriented model fusion (SOMF), in which the authors learn a shared mask on the weight space of the large language model. The framework combines the safeguard capabilities of the initially aligned model with the fine-tuned models to ensure safety without compromising performance on downstream tasks.
+
1. E. J. Gumbel. Statistical Theory of Extreme Values and Some Practical Applications. A Series of Lectures. Technical Report PB175818, National Bureau of Standards, Washington, D. C. Applied Mathematics Div., 1954. URL https://ntrl.ntis.gov/NTRL/dashboard/searchResults/titleDetail/PB175818.xhtml.
+2. R. Duncan Luce. Individual Choice Behavior. John Wiley, Oxford, England, 1959.
+3. Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. Advances in neural information processing systems, 27, 2014.
+4. Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, March 2017. URL http://arxiv.org/abs/1611.00712.
+The DepthUpscalingAlgorithm
is used to upscale the depth of PyTorch models. Here's a basic guide on how to use it:
First, import the necessary modules:
+from omegaconf import DictConfig
+from torch import nn
+from fusion_bench.method import DepthUpscalingAlgorithm
+from fusion_bench.modelpool import to_modelpool
+
Create an instance of DepthUpscalingAlgorithm
by passing a configuration dictionary.
+This dictionary should contain the name of the method ("depth_upscaling") and a list of layer indices that determine the upscaling pattern.
method_config = {"name": "depth_upscaling", "layer_indices": [0, 1, 1, 0]}
+algorithm = DepthUpscalingAlgorithm(DictConfig(method_config))
+
Assume we have a list of PyTorch models (nn.ModuleList
instances) that we want to upscale. Here, we're creating a list of linear models as an example:
Then, we can pass the model to the run
method of our algorithm:
The run
method will return an upscaled model. The type of the returned model will be the same as the input models (in this case, nn.ModuleList
), and its length will be determined by the layer indices specified in the method configuration.
Here we provide an example of how to use the DepthUpscalingAlgorithm
to upscale the depth of a Mistral model [1].
from omegaconf import DictConfig
+from torch import nn
+from transformers import AutoModelForCausalLM, MistralConfig, MistralForCausalLM
+from fusion_bench.method import DepthUpscalingAlgorithm
+
+# create a Mistral model
+# here we randomly initialize the model for demonstration purposes
+# in practice, you would load a pretrained model
+model_config = MistralConfig(
+ # https://huggingface.co/mistralai/Mistral-7B-v0.1/resolve/main/config.json
+ **{
+ "architectures": ["MistralForCausalLM"],
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "hidden_act": "silu",
+ "hidden_size": 4096,
+ "initializer_range": 0.02,
+ "intermediate_size": 14336,
+ "max_position_embeddings": 32768,
+ "model_type": "mistral",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 8,
+ "rms_norm_eps": 1e-05,
+ "rope_theta": 10000.0,
+ "sliding_window": 4096,
+ "tie_word_embeddings": False,
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.34.0.dev0",
+ "use_cache": True,
+ "vocab_size": 32000,
+ }
+)
+print('creating model')
+model: MistralForCausalLM = AutoModelForCausalLM.from_config(model_config)
+
+method_config = {
+ "name": "depth_upscaling",
+ "layer_indices": ["range(0,24)", "range(8,32)"],
+}
+algorithm = DepthUpscalingAlgorithm(DictConfig(method_config))
+print('upscaling model')
+upscaled_model = algorithm.run(model.model.layers)
+
+# substitute the model with the upscaled model
+model.model.layers = upscaled_model
+
The DepthUpscalingAlgorithm
is integrated into the fusion_bench
package. You can use it by specifying "depth_upscaling"
as the method name in the command line or configuration file.
name: depth_upscaling
+# this should be a list of integers or string, indicating the sequence of layers. If the entry is an integer, it will use the n-th layer of the model. If the entry is a string, it will use the layers specified by the string. The string should be a valid python expression that evaluates to a list of integers.
+# for example, ["range(0,12)", "range(6,12)"] will use the first 12 layers and the last 6 layers of the model to construct the new model
+# [0, 2, 4, "range(6,12)"] will use the 1st, 3rd, 5th, and the 7th to 12th layers of the model to construct the new model
+layer_indices: null
+
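To make the `layer_indices` convention concrete, here is a small stand-alone sketch of how such a list could be expanded into a flat sequence of layer positions. It is an illustration under stated assumptions, not the FusionBench implementation: the helper names are hypothetical, and string entries are restricted to the `range(a,b)` form and parsed with a regular expression rather than evaluated as arbitrary Python expressions.

```python
import re

# Accepts strings such as "range(0,12)"; other expressions are rejected in this sketch.
_RANGE_PATTERN = re.compile(r"^\s*range\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*$")

def expand_layer_indices(layer_indices):
    """Flatten a mixed list of ints and "range(a,b)" strings into a list of ints."""
    expanded = []
    for entry in layer_indices:
        if isinstance(entry, int):
            expanded.append(entry)
        elif isinstance(entry, str):
            match = _RANGE_PATTERN.match(entry)
            if match is None:
                raise ValueError(f"unsupported layer specification: {entry!r}")
            start, stop = int(match.group(1)), int(match.group(2))
            expanded.extend(range(start, stop))
        else:
            raise ValueError(f"unsupported layer specification: {entry!r}")
    return expanded

def upscale(layers, layer_indices):
    """Build the deeper stack by selecting layers (possibly repeated) in the given order."""
    return [layers[i] for i in expand_layer_indices(layer_indices)]

mixed = expand_layer_indices([0, 2, 4, "range(6,12)"])
stacked = expand_layer_indices(["range(0,24)", "range(8,32)"])   # 24 + 24 = 48 layers
toy = upscale(["layer0", "layer1"], [0, 1, 1, 0])
```

In this sketch a repeated index reuses the same object; an implementation working on real `nn.Module` layers would typically copy each selected layer so that the duplicates do not share parameters.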
You can then run the fusion_bench
command with the specified configuration file:
DepthUpscalingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
fusion_bench/method/depth_upscaling.py
run(modelpool)
+
+¶Executes the depth upscaling algorithm on a given model pool.
+This method checks the type of the model pool, ensures that it contains only one model, and verifies that the model is an instance of nn.ModuleList
.
Parameters:
+modelpool
+ (ModuleList | ModelPool
)
+ –
+ The pool of models to upscale. Must contain only one model.
+Returns:
+ModuleList
+ –
+ nn.ModuleList: The upscaled model.
+Raises:
+AssertionError
+ –
+ If the model pool contains more than one model or if the model is not an instance of nn.ModuleList
.
ValueError
+ –
+ If an invalid layer specification is provided in the configuration.
+fusion_bench/method/depth_upscaling.py
The Dummy Algorithm is a simple algorithm that does not perform any fusion operation. Instead, it returns a pretrained model if one is available in the model pool. If no pretrained model is available, it returns the first model in the model pool. +This algorithm is useful for testing and debugging purposes, as it allows you to quickly check if the model pool is set up correctly and the fusion process is working as expected.
+To use the Dummy Algorithm, you need to specify "dummy"
as the algorithm name.
The implementation of the Dummy Algorithm is straightforward. Here is the main method of the DummyAlgorithm
class:
DummyAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
fusion_bench/method/dummy.py
The Fisher merging algorithm [1] is a per-parameter weighted averaging method that assigns weights to the models based on the Fisher information matrix of the models on some labeled data.
+The Fisher information matrix \(F_\theta\) of a model with parameters \(\theta\) can be expressed as:
+\[ F_\theta = \mathbb{E}_{x \sim p(x)} \left[ \mathbb{E}_{y \sim p(y|x, \theta)} \left[ \nabla_\theta \log p(y|x, \theta) \, \nabla_\theta \log p(y|x, \theta)^\top \right] \right], \]
+where \(p(x)\) is the data distribution, \(p(y|x, \theta)\) is the model's output distribution, for example, the softmax output of a classification model, and \(\nabla_\theta\) is the gradient with respect to the model's parameters \(\theta\).
+The Fisher information matrix can be used to estimate the importance of each parameter in the model and thus assign weights to the models based on their Fisher information.
+In addition, the Fisher information matrix can be used to estimate the similarity between tasks, which can be useful in auxiliary-task learning and multi-task learning scenarios [2].
+As the full Fisher information matrix is often computationally expensive to compute and memory-intensive to store, we approximate it using the diagonal Fisher information matrix, which is the diagonal of the full Fisher information matrix.
+The diagonal Fisher information matrix can be computed as:
+\[ \hat{F}_\theta = \mathbb{E}_{x \sim p(x)} \left[ \mathbb{E}_{y \sim p(y|x, \theta)} \left[ \left( \nabla_\theta \log p(y|x, \theta) \right)^2 \right] \right], \]
+where the square is taken element-wise.
+Assuming we have \(n\) models with parameters \(\theta_i\) and diagonal Fisher information matrices \(\hat{F}_{\theta_i}\), the Fisher merging algorithm computes the merged model's parameters \(\theta\) as follows:
+\[ \theta^{(j)} = \frac{\sum_{i=1}^{n} \hat{F}_{\theta_i}^{(j)} \theta_i^{(j)}}{\sum_{i=1}^{n} \hat{F}_{\theta_i}^{(j)}}, \]
+where \(\theta_i\) are the parameters of the individual models, \(\hat{F}_{\theta_i}\) are the diagonal Fisher information matrices of the individual models, and \(j\) indexes the parameters of the models.
+The Fisher merging algorithm can be considered a per-weight weighted averaging method, where the weights are determined by the Fisher information of each parameter in the models.
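The per-parameter weighting can be illustrated with flat lists of floats. The sketch below is not the FusionBench implementation; the function names and the `eps` safeguard against a zero denominator are assumptions. It estimates a diagonal Fisher matrix as the mean of squared per-example gradients, then averages each parameter across models using those Fisher values as weights.

```python
def diagonal_fisher(per_example_grads):
    """Mean of element-wise squared gradients, one entry per parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[j] ** 2 for g in per_example_grads) / n for j in range(dim)]

def fisher_merge(params, fishers, eps=1e-8):
    """Fisher-weighted average of each parameter j across models.

    `eps` (an assumption of this sketch) guards against a zero denominator
    when no model assigns any Fisher mass to a parameter.
    """
    dim = len(params[0])
    merged = []
    for j in range(dim):
        numerator = sum(f[j] * p[j] for p, f in zip(params, fishers))
        denominator = sum(f[j] for f in fishers)
        merged.append(numerator / (denominator + eps))
    return merged

# Two toy models with two parameters each.
params = [[1.0, 0.0], [3.0, 4.0]]
fishers = [[1.0, 0.0], [1.0, 2.0]]
merged = fisher_merge(params, fishers)

# Diagonal Fisher estimated from two hypothetical per-example gradients.
fisher_estimate = diagonal_fisher([[1.0, 2.0], [3.0, 0.0]])
```

For the first parameter both models carry equal Fisher weight, so the result is their plain average; for the second parameter only the second model has nonzero Fisher information, so its value dominates.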
+Example of merging eight CLIP-ViT-B/32 models using Fisher merging:
+fusion_bench method=clip_fisher_merging \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
Merge eight CLIP-ViT-L/14 models using Fisher merging:
+fusion_bench \
+ method=clip_fisher_merging \
+ method.batch_size=8 method.num_workers=4 \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14
+
Merge GPT-2 models for text classification tasks:
+fusion_bench \
+ method=gpt2_fisher_merging \
+ method.num_fisher_examples=512 method.batch_size=8 \
+ modelpool=gpt-2_glue \
+ taskpool=gpt-2_glue
+
FisherMergingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
fusion_bench/method/fisher_merging/fisher_merging.py
The Fusion Algorithm
module is a core component of the FusionBench project, dedicated to the implementation and execution of various model fusion techniques.
+This module provides the mechanisms necessary to combine multiple models from the Model Pool, enabling nuanced and optimized model merging operations.
Fusion Algorithm
Module¶The module is typically invoked through a configuration-driven approach in CLI scripts, enabling users to specify fusion algorithms and parameters via YAML configuration files. This method ensures reproducibility and ease of use. +For more information, see the document of fusion_bench CLI.
+ModelFusionAlgorithm
is the base class for all fusion algorithms in the Fusion Algorithm module.
+It provides a common interface for different fusion techniques, allowing for seamless integration and execution of various algorithms.
ModelFusionAlgorithm
+
+
+¶
+ Bases: ABC
fusion_bench/method/base_algorithm.py
run(modelpool)
+
+
+ abstractmethod
+
+
+¶Fuse the models in the given model pool.
+ + +Examples:
+>>> algorithm = SimpleAverageAlgorithm()
+>>> modelpool = ModelPool()
+>>> merged_model = algorithm.run(modelpool)
+
Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to fuse.
+fusion_bench/method/base_algorithm.py
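from omegaconf import DictConfig
+
+from ..method import load_algorithm_from_config
+from ..modelpool import load_modelpool_from_config
+from ..taskpool import load_taskpool_from_config  # assumed location of the task pool loader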
+
+def run_model_fusion(cfg: DictConfig):
+ modelpool = load_modelpool_from_config(cfg.modelpool)
+ algorithm = load_algorithm_from_config(cfg.method)
+ merged_model = algorithm.run(modelpool)
+
+ if hasattr(cfg, "taskpool") and cfg.taskpool is not None:
+ taskpool = load_taskpool_from_config(cfg.taskpool)
+ taskpool.evaluate(merged_model)
+ else:
+ print("No task pool specified. Skipping evaluation.")
+
In summary, the Fusion Algorithm module is vital for the model merging operations within FusionBench, leveraging sophisticated techniques to ensure optimal fusion and performance evaluation of deep learning models. This capability makes it an indispensable tool for researchers and practitioners focusing on model fusion strategies.
+
load_algorithm_from_config(method_config)
+
+¶Loads an algorithm based on the provided configuration.
+The function checks the 'name' attribute of the configuration and returns an instance of the corresponding algorithm. +If the 'name' attribute is not found or does not match any known algorithm names, a ValueError is raised.
+ + +Parameters:
+method_config
+ (DictConfig
)
+ –
+ The configuration for the algorithm. Must contain a 'name' attribute that specifies the type of the algorithm.
+Returns:
+An instance of the specified algorithm.
+Raises:
+ValueError
+ –
+ If 'name' attribute is not found in the configuration or does not match any known algorithm names.
+fusion_bench/method/__init__.py
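The name-based dispatch described above can be sketched with a small registry. Everything in this snippet is hypothetical and self-contained: the registry, the toy algorithm class, and the `"_pretrained_"` key used to mark the pretrained model are assumptions made for illustration, and the real loader handles many more algorithm names.

```python
class ToyDummyAlgorithm:
    """Stand-in algorithm: returns the pretrained model if present, else the first model."""

    def __init__(self, config):
        self.config = config

    def run(self, modelpool):
        # "_pretrained_" is an assumed key name for the pretrained model in this sketch.
        if "_pretrained_" in modelpool:
            return modelpool["_pretrained_"]
        return next(iter(modelpool.values()))

# Hypothetical registry mapping method names to algorithm classes.
_ALGORITHMS = {"dummy": ToyDummyAlgorithm}

def load_algorithm(method_config):
    """Instantiate the algorithm named by method_config["name"], or raise ValueError."""
    name = method_config.get("name")
    if name is None:
        raise ValueError("the method configuration must contain a 'name' entry")
    if name not in _ALGORITHMS:
        raise ValueError(f"unknown algorithm name: {name!r}")
    return _ALGORITHMS[name](method_config)

algorithm = load_algorithm({"name": "dummy"})
with_pretrained = algorithm.run({"_pretrained_": "base", "task_a": "model_a"})
without_pretrained = algorithm.run({"task_a": "model_a", "task_b": "model_b"})
```

Keeping the mapping in one dictionary means an unknown name fails fast with a clear error instead of surfacing later during fusion.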
The max-model predictor algorithm is a type of ensemble method. +Formally, a max-model predictor is defined as follows:
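+Definition (Max-Model Predictor) [1].
+Given a set of predictors \(H = \{h_1, h_2, \ldots, h_n\}\), with \(h_i: \mathcal{X} \times \mathcal{Y}_i \mapsto \mathbb{R}\), the max-model predictor \(h_H\) is defined as:
+\[ h_H(x) = \arg\max_{y \in \mathcal{Y}} \; \max_{i:\, y \in \mathcal{Y}_i} h_i(x, y), \qquad \mathcal{Y} = \bigcup_{i=1}^{n} \mathcal{Y}_i. \]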
+Take the flu detection problem as an example [1].
+Doctors want to build a learning model that detects which type of virus a patient is infected with, based on her symptoms, so that appropriate treatment can be given. However, the types of influenza vary geographically (Rejmanek et al., 2015), which means the distribution of patient records collected by a hospital in California may be different from that in Florida. In an extreme case, some types are unknown to the other hospital. Assume there are 4 types of influenza in the United States. In California, 2 of the 4 types are commonly detected, while in Florida 3 of the 4 types are often detected. We assume that doctors in the two states separately trained two models \(h_{CA}\) and \(h_{FL}\), which work well locally in California and Florida respectively. However, a direct ensemble of the two local models may not work well on all the patients. Let \(h_{US}\) denote the ideal global model trained on the combination of local datasets. When we input a patient record \(x\), each model outputs its prediction as shown in the following table:
+Table: Example of flu detection on a patient \(x\) affected with type 2 flu. “−” means this model is not able to predict the corresponding class. Taking the maximal score as prediction, \(h_{FL}\) is consistent with \(h_{US}\), but the combination of the two local models \(h_{\{CA,FL\}}\) is not, since \(3/4 > 4/7\).
+Type | 1 | 2 | 3 | 4
---|---|---|---|---
\(h_{US}(x)\) | 2/10 | 4/10 | 1/10 | 3/10
\(h_{CA}(x)\) | - | - | 1/4 | 3/4
\(h_{FL}(x)\) | 2/7 | 4/7 | 1/7 | -
\(h_{\{CA,FL\}}(x)\) | 2/7 | 4/7 | 1/4 | 3/4
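The table can be reproduced with a few lines of Python. This sketch uses exact fractions and a hypothetical helper name; it takes, for every class, the maximal score among the models able to predict that class, and then returns the class with the largest combined score.

```python
from fractions import Fraction as F

def max_model_predict(scores_per_model):
    """Combine per-class scores by element-wise max, then take the argmax class.

    Each model is a dict mapping class label -> score; classes a model cannot
    predict are simply absent from its dict.
    """
    combined = {}
    for scores in scores_per_model:
        for label, score in scores.items():
            combined[label] = max(score, combined.get(label, score))
    prediction = max(combined, key=combined.get)
    return prediction, combined

h_us = {1: F(2, 10), 2: F(4, 10), 3: F(1, 10), 4: F(3, 10)}
h_ca = {3: F(1, 4), 4: F(3, 4)}
h_fl = {1: F(2, 7), 2: F(4, 7), 3: F(1, 7)}

pred_us, _ = max_model_predict([h_us])
pred_fl, _ = max_model_predict([h_fl])
pred_ca_fl, combined_ca_fl = max_model_predict([h_ca, h_fl])
```

Here `pred_us` and `pred_fl` both select type 2, while `pred_ca_fl` selects type 4 because the California model's score of 3/4 exceeds 4/7, which is the inconsistency the table caption points out.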
Here is an example of how to use the Max-Model Predictor Algorithm:
+from torch import nn
+
+from fusion_bench.method import MaxModelPredictorAlgorithm
+from fusion_bench.modelpool import ModelPool
+
+# Instantiate the MaxModelPredictorAlgorithm
+algorithm = MaxModelPredictorAlgorithm()
+
+# Assume we have a ModelPool instance that contains the models we want to ensemble.
+modelpool = ModelPool(...) # or a list of nn.Module
+
+# Run the algorithm on the model pool.
+max_model_predictor : nn.Module = algorithm.run(modelpool)
+
Configuration template for the Max Predictor Algorithm:
+ +To create a max predictor ensemble of models for a specific task, you can use the following command:
+ModelRecombinationAlgorithm
is a class used to recombine models in a model pool. Here's how to use it:
First, import the necessary modules:
+from fusion_bench.method import ModelRecombinationAlgorithm
+from fusion_bench.modelpool import ModelPool, to_modelpool
+from torch import nn
+
Create an instance of ModelRecombinationAlgorithm
:
Create a model pool using the to_modelpool
function. This function takes a list of models or a dict of models and converts it into a ModelPool
:
Use the run
method of the ModelRecombinationAlgorithm
instance to recombine the models in the model pool:
The run
method takes two arguments:
- modelpool: The model pool to recombine.
- return_modelpool (optional): A boolean indicating whether to return the entire model pool or just the first model. Defaults to True.

If return_modelpool is True, the run method returns a new ModelPool with the recombined models. If False, it returns the first model from the new model pool.
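The effect of recombination on list-like models can be sketched without PyTorch. In the snippet below, which is an illustration with hypothetical names rather than the library's code, each model is a list of layer labels, and at every depth the layers are randomly permuted across models. The last lines mimic the two return modes described above.

```python
import random

def recombine_layers(models, rng):
    """Shuffle layers across models depth by depth; returns new lists of layers."""
    num_layers = len(models[0])
    if any(len(model) != num_layers for model in models):
        raise ValueError("all models must have the same number of layers")
    recombined = [[] for _ in models]
    for depth in range(num_layers):
        column = [model[depth] for model in models]   # the layers at this depth
        rng.shuffle(column)                           # permute them across models
        for new_model, layer in zip(recombined, column):
            new_model.append(layer)
    return recombined

rng = random.Random(0)
models = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
new_pool = recombine_layers(models, rng)

return_modelpool = False
result = new_pool if return_modelpool else new_pool[0]   # whole pool vs. first model
```

Every recombined model keeps the original depth, and each depth still contains exactly one layer from each source model, only assigned to a possibly different model.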
You can check the type of the returned value to ensure that the run
method worked correctly:
Configuration template for the model recombination algorithm:
+name: model_recombination
+# if `return_modelpool` is not null, the argument `return_modelpool` passed to the `run` method will be ignored.
+return_modelpool: null
+
Construct a model recombination using our CLI tool fusion_bench
:
fusion_bench \
+ method=model_recombination \
+ method.return_modelpool=false \
+ modelpool=... \
+ taskpool=...
+
ModelRecombinationAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
Model recombination recombines the layers of the given models to create a new set of models.
+ +fusion_bench/method/model_recombination.py
run(modelpool, return_modelpool=True)
+
+¶Executes the model recombination algorithm on a given model pool.
+This method loads models from the model pool, determines their type, and applies the appropriate recombination method.
+It then creates a new model pool with the recombined models. Depending on the return_modelpool
flag, it either returns
+the entire new model pool or just the first model from it.
- For models of type nn.ModuleList, the recombination method recombine_modellist is used, where each module in the list is shuffled across the models.
- For models of type nn.ModuleDict, the recombination method recombine_modeldict is used, where each module in the dictionary is shuffled across the models.
- For models of type nn.Module, the recombination method recombine_state_dict is used, where the state dictionaries of the models are shuffled across the models.

Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to recombine.
+return_modelpool
+ (bool
, default:
+ True
+)
+ –
+ Flag indicating whether to return the entire model pool or just the first model. Defaults to True. If this algorithm is initialized with config, the value of return_modelpool
in the config will be used and this argument passed to the method will be ignored.
Returns:
+Union[Module, ModelPool]
+ –
+ Union[nn.Module, ModelPool]: The recombined model pool or the first model from the recombined pool, depending on the return_modelpool
flag.
Raises:
+ValueError
+ –
+ If the models in the model pool are of an unsupported type.
+fusion_bench/method/model_recombination.py
recombine_modellist(models)
+
+¶fusion_bench/method/model_recombination.py
recombine_modeldict(models)
+
+¶fusion_bench/method/model_recombination.py
recombine_state_dict(models)
+
+¶fusion_bench/method/model_recombination.py
Here we provide instructions on how to use the fusion_bench
command-line interface to merge models using a Mixture of Experts (MoE) approach.
The first code block is a YAML configuration file for the merging method. The name
field specifies the name of the merging method. The num_experts
field specifies the number of experts to use in the merging process. The experts_per_token
field specifies the number of experts to use per token. The save_checkpoint
field specifies the path where the merged model will be saved.
name: mixtral_for_causal_lm_moe_merging
+
+experts_per_token: 2
+# path to save the merged model, if provided
+save_checkpoint: null
+
The second code block is another YAML configuration file, this time for the model pool. The type
field specifies the type of model pool to use. The models
field is a list of models to include in the pool. Each model should have a name
and a path
, and the model is loaded from the path.
type: AutoModelForCausalLMPool
+# each model should have a name and a path, and the model is loaded from the path
+# this is equivalent to `AutoModelForCausalLM.from_pretrained(path)`
+models:
+ - name: _pretrained_
+ path: path_to_your_pretrained_model
+ - name: expert_1
+ path: path_to_your_expert_model_1
+ - name: expert_2
+ path: path_to_your_expert_model_2
+ - name: expert_3
+ path: path_to_your_expert_model_3
+ - name: expert_4
+ path: path_to_your_expert_model_4
+
Finally, the third code block is a bash command that runs the fusion_bench
command-line interface with the specified method, model pool, and task pool. The method
argument specifies the merging method to use. The modelpool
argument specifies the model pool to use. The modelpool.models.0.path
argument specifies the path to the pretrained model to use. The taskpool
argument specifies the task pool to use. In this case, a dummy task pool is used that does nothing but print the parameter counts of the merged model.
fusion_bench \
+ method=mixtral_moe_merging \
+ modelpool=mixtral_moe_merging \
+ taskpool=dummy # this is a dummy taskpool that does nothing but print the parameter counts of the merged model
+
This guide provides a step-by-step process for merging models using the fusion_bench
command-line interface. By following these instructions, you can merge your own models and save them for future use.
mixtral_merging
+
+
+¶
MixtralForCausalLMMergingAlgorithm
+
+
+¶
+ Bases: MixtralForCausalLMUpscalingAlgorithm
fusion_bench/method/mixture_of_experts/mixtral_merging.py
run(modelpool)
+
+¶Runs the merging process. It first upscales the models to MixtralForCausalLM, +then substitutes the experts of the MixtralForCausalLM with the models from the modelpool.
+ + +Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to be merged. Each model in the pool will be treated as an expert, and should be a MistralForCausalLM
or LlamaForCausalLM
.
Returns:
+MixtralForCausalLM
( MixtralForCausalLM
+) –
+ The merged model.
+fusion_bench/method/mixture_of_experts/mixtral_merging.py
MixtralMoEMergingAlgorithm
+
+
+¶
+ Bases: MixtralUpscalingAlgorithm
This class is responsible for merging models into a MixtralModel.
+ +fusion_bench/method/mixture_of_experts/mixtral_merging.py
run(modelpool)
+
+¶Runs the merging process.
+ + +Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to be merged. Each model in the pool will be treated as an expert, and should be a MistralModel
or LlamaModel
.
Returns:
+MixtralModel
( MixtralModel
+) –
+ The merged model.
+fusion_bench/method/mixture_of_experts/mixtral_merging.py
Sparse upcycling is a technique used to initialize a sparsely activated Mixture-of-Experts (MoE) model from a dense checkpoint. This approach leverages previously incurred training costs to improve the performance of large models while reducing the computational expense. In the process, dense Transformer blocks are partially replaced with MoE blocks, where the MLPs in a Transformer block are replaced by multiple experts. The experts are chosen based on routing probabilities determined by a router. The initialized MoE model is then further trained to recover the performance. This method results in improved performance for both language and vision models while using only a fraction of the original dense pretraining cost 1.
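The initialization step can be sketched in plain Python. The toy below assumes a stand-in "MLP" (elementwise scaling followed by ReLU) and a softmax router restricted to the top-k experts. All function names are hypothetical. It illustrates one consequence of copying the dense MLP into every expert: immediately after upcycling, the MoE block reproduces the dense output whatever the router selects, because the gate weights sum to one.

```python
import math
from typing import Dict, List


def dense_mlp(x: List[float], w: List[float]) -> List[float]:
    # Toy stand-in for a Transformer MLP: elementwise scale, then ReLU.
    return [max(0.0, w_i * x_i) for w_i, x_i in zip(w, x)]


def top_k_gates(logits: List[float], k: int) -> Dict[int, float]:
    # Keep the k largest router logits and apply a softmax over them only.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k):
    # Weighted sum of the selected experts' outputs.
    out = [0.0] * len(x)
    for i, gate in top_k_gates(router_logits, k).items():
        y = dense_mlp(x, experts[i])
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out


dense_w = [0.5, -1.0, 2.0]
experts = [list(dense_w) for _ in range(4)]  # every expert starts as a copy
x = [1.0, 2.0, 3.0]

dense_out = dense_mlp(x, dense_w)
moe_out = moe_forward(x, experts, router_logits=[0.1, 1.2, -0.3, 0.7], k=2)
print(dense_out, moe_out)
```

The experts only begin to differ once the upcycled model is trained further, which is the step the paragraph above describes as recovering performance.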
+Here’s an example demonstrating how to upscale a pre-trained Mistral model to a Mixtral model:
+import os
+
+from omegaconf import DictConfig
+from transformers import MistralForCausalLM
+
+from fusion_bench.method.mixture_of_experts.mixtral_upcycling import (
+ MixtralForCausalLMUpscalingAlgorithm,
+)
+from fusion_bench.utils import print_parameters
+
+# Load a pre-trained Mistral model
+pretrained_model = MistralForCausalLM.from_pretrained(
+ os.path.expanduser("path_to_mistral_model")
+)
+print("Pretrained model:")
+print_parameters(pretrained_model)
+# Output:
+# Pretrained model:
+# trainable params: 7.24B || all params: 7.24B || trainable%: 100.0000
+
+# Define the configuration for Mixtral
+config = {
+ "num_experts": 4, # Number of expert channels
+ "experts_per_token": 2, # Experts to choose per token
+}
+
+# Initialize the upscaling algorithm
+upscaling_for_causal_lm_algorithm = MixtralForCausalLMUpscalingAlgorithm(
+ DictConfig(config)
+)
+
+# Run the upscaling process to get a Mixtral model
+mixtral_for_causal_lm_model = upscaling_for_causal_lm_algorithm.run(pretrained_model)
+
+print("Mixtral model:")
+print_parameters(mixtral_for_causal_lm_model)
+# Outputs:
+# Mixtral model:
+# trainable params: 24.15B || all params: 24.15B || trainable%: 100.0000
+
+# Save the upscaled Mixtral model
+mixtral_for_causal_lm_model.save_pretrained("path_to_save_mixtral_model")
+
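The parameter counts printed above can be sanity-checked with a little arithmetic. The sketch below uses the Mistral-7B layer sizes (hidden size 4096, MLP width 14336, 32 layers) and assumes that upscaling replicates the three bias-free MLP projections once per expert and adds one bias-free linear router per layer. The router shape is an assumption based on the standard Mixtral architecture.

```python
# Layer sizes of Mistral-7B.
hidden_size = 4096
intermediate_size = 14336
num_layers = 32
num_experts = 4

# Rounded total reported by print_parameters for the dense model.
dense_total = 7.24e9

# gate_proj, up_proj and down_proj, each without bias.
mlp_params_per_layer = 3 * hidden_size * intermediate_size

# Assumption: one bias-free router of shape (hidden_size, num_experts) per layer.
router_params_per_layer = hidden_size * num_experts

# One copy of the MLP already exists in the dense model, so only
# (num_experts - 1) additional copies are added per layer.
extra = num_layers * (
    (num_experts - 1) * mlp_params_per_layer + router_params_per_layer
)
moe_total = dense_total + extra

print(f"{moe_total / 1e9:.2f}B")  # prints 24.15B
```

Almost all of the growth comes from the replicated MLPs; the routers contribute only about half a million parameters in total.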
A Jupyter notebook example is also available at our repo.
+This is a guide on how to use the fusion_bench
command-line interface to upscale a Mistral model to a Mixtral model.
The first code block is a YAML configuration file for the upscaling method. The name field specifies the name of the upscaling method. The num_experts
field specifies the number of experts to use in the upscaling process. The experts_per_token
field specifies the number of experts to use per token. The save_checkpoint
field specifies the path where the upscaled model will be saved, if provided.
name: mixtral_for_causal_lm_moe_upscaling # or "mixtral_moe_upscaling"
+
+num_experts: 4
+experts_per_token: 2
+# path to save the upscaled model
+save_checkpoint: null
+
The second code block is another YAML configuration file, this time for the model pool. The type
field specifies the type of model pool to use. The models
field is a list of models to include in the pool. Each model should have a name
and a path
, and the model is loaded from the path
.
type: AutoModelForCausalLMPool
+# each model should have a name and a path, and the model is loaded from the path
+# this is equivalent to `AutoModelForCausalLM.from_pretrained(path)`
+models:
+ - name: _pretrained_
+ path: path_to_your_pretrained_model
+
Finally, the third code block is a bash command that runs the fusion_bench command-line interface with the specified method, model pool, and task pool. The method argument specifies the upscaling method to use. The modelpool argument specifies the model pool to use. The modelpool.models.0.path argument specifies the path to the pretrained model to use. The taskpool argument specifies the task pool to use. In this case, a dummy task pool is used that does nothing but print the parameter counts of the merged model.
+fusion_bench \
+ method=mixtral_moe_upscaling \
+ modelpool=mixtral_moe_upscaling \
+ modelpool.models.0.path=path_to_your_pretrained_model \
+ taskpool=dummy # this is a dummy taskpool that does nothing but print the parameter counts of the merged model
+
mixtral_upcycling
+
+
+¶
MixtralForCausalLMUpscalingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
This class is responsible for upscaling a model to a MixtralForCausalLM. +It inherits from the ModelFusionAlgorithm class.
+ +fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
run(modelpool)
+
+¶Runs the upscaling process.
+ + +Parameters:
+modelpool
+ (ModelPool | LlamaForCausalLM | MistralForCausalLM
)
+ –
+ The model to be upscaled.
+Returns:
+MixtralForCausalLM
( MixtralForCausalLM
+) –
+ The upscaled model.
+fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
MixtralUpscalingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
This class is responsible for upscaling a model to a MixtralModel. +It inherits from the ModelFusionAlgorithm class.
+ +fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
run(modelpool)
+
+¶Runs the upscaling process.
+ + +Parameters:
+modelpool
+ (ModelPool | LlamaModel | MistralModel
)
+ –
+ The model to be upscaled.
+Returns:
+MixtralModel
( MixtralModel
+) –
+ The upscaled model.
+fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
upscale_to_mixtral_for_causal_lm(input_model, output_model)
+
+¶A helper function.
+Upscales a LlamaForCausalLM or MistralForCausalLM to a MixtralForCausalLM.
+ + +Parameters:
+input_model
+ (LlamaForCausalLM | MistralForCausalLM
)
+ –
+ The input model to be upscaled.
+output_model
+ (MixtralForCausalLM
)
+ –
+ The output model where the upscaled weights will be loaded.
+Returns:
+None
+fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
upscale_to_mixtral_model(input_model, output_model)
+
+¶A helper function.
+Upscales a LlamaModel or MistralModel to a MixtralModel.
+ + +Parameters:
+input_model
+ (LlamaModel | MistralModel
)
+ –
+ The input model to be upscaled.
+output_model
+ (MixtralModel
)
+ –
+ The output model where the upscaled weights will be loaded.
+Returns:
+None
+fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. http://arxiv.org/abs/2212.05055 ↩
+Merge CLIP-ViT-B/32 models on eight image classification tasks
+fusion_bench method=clip_regmean \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
Merge CLIP-ViT-L/14 models on eight image classification tasks
+fusion_bench \
+ method=clip_regmean \
+ method.batch_size=8 method.num_workers=4 \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14
+
Merge GPT-2 models for text classification tasks:
+ +Xisen Jin, et al. "Dataless Knowledge Fusion by Merging Weights of Language Models". http://arxiv.org/abs/2212.09849 ↩
+Simple averaging, also known in the literature as isotropic merging or ModelSoups, aims to yield a more robust and generalizable model. It is frequently employed when multiple models have been fine-tuned or independently trained from scratch. Specifically, if we have \(n\) models that share a common architecture but have different weights \(\theta_i\), the weights of the merged model \(\theta\) are computed as follows:
+\[ \theta = \frac{1}{n} \sum_{i=1}^{n} \theta_i \]
+This equation simply states that each weight of the final model is the average of the corresponding weights in the individual models. For example, if we have three models and the weight of the first neuron in the first layer is 0.1, 0.2, and 0.3 in each model respectively, the weight of that neuron in the final model will be (0.1 + 0.2 + 0.3) / 3 = 0.2.
+Simple averaging is a straightforward and scalable method for model fusion. It does not require any additional training or fine-tuning, making it a good choice when computational resources are limited and maintaining an ensemble of models is not feasible.
+This method implicitly assumes that all models are equally good. If some models are significantly better than others, a poor model may degrade the merged model, so it can be beneficial to give the better models more weight when averaging. This can be done with weighted averaging, where each model's contribution is weighted by its performance on a validation set or some other metric. See Weighted Averaging for more details.
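As a minimal illustration of both variants, the sketch below averages toy "state dicts" whose values are plain floats. The helper names `simple_average` and `weighted_average` are hypothetical and independent of the library classes.

```python
from typing import Dict, List


def simple_average(state_dicts: List[Dict[str, float]]) -> Dict[str, float]:
    # Each merged weight is the mean of the corresponding weights.
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}


def weighted_average(
    state_dicts: List[Dict[str, float]], weights: List[float]
) -> Dict[str, float]:
    # Each model contributes in proportion to its (unnormalized) weight.
    total = sum(weights)
    return {
        k: sum(w * sd[k] for w, sd in zip(weights, state_dicts)) / total
        for k in state_dicts[0]
    }


# The three-model example from the text: 0.1, 0.2 and 0.3 for the same neuron.
models = [{"layer1.w0": 0.1}, {"layer1.w0": 0.2}, {"layer1.w0": 0.3}]

merged = simple_average(models)
# Hypothetical validation scores that favor the third model.
merged_weighted = weighted_average(models, weights=[1.0, 1.0, 2.0])

print(round(merged["layer1.w0"], 6))           # 0.2
print(round(merged_weighted["layer1.w0"], 6))  # 0.225
```

With equal weights the weighted variant reduces to the simple average, so the two can share one code path in practice.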
+In this example, we will demonstrate how to use the SimpleAverageAlgorithm
class from the fusion_bench.method
module.
+This algorithm is used to merge multiple models by averaging their parameters.
from fusion_bench.method import SimpleAverageAlgorithm
+
+# Instantiate the SimpleAverageAlgorithm
+# This algorithm will be used to merge multiple models by averaging their parameters.
+algorithm = SimpleAverageAlgorithm()
+
+# Assume we have a list of PyTorch models (nn.Module instances) that we want to merge.
+# The models should all have the same architecture.
+models = [...]
+
+# Run the algorithm on the models.
+# This will return a new model that is the result of averaging the parameters of the input models.
+merged_model = algorithm.run(models)
+
The run
method of the SimpleAverageAlgorithm
class takes a list of models as input and returns a new model.
+The new model's parameters are the average of the parameters of the input models.
+This is useful in scenarios where you have trained multiple models and want to combine them into a single model that hopefully performs better than any individual model.
Configuration template for the Simple Averaging algorithm:
Use the following command to run the Simple Averaging algorithm:
+ +
SimpleAverageAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
, SimpleProfilerMixin
fusion_bench/method/simple_average.py
run(modelpool)
+
+¶Fuse the models in the given model pool using simple averaging.
+This method iterates over the names of the models in the model pool, loads each model, and appends it to a list. +It then returns the simple average of the models in the list.
+ + +Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to fuse.
+Returns:
+The fused model obtained by simple averaging.
+fusion_bench/method/simple_average.py
Ensemble methods are simple and effective ways to improve the performance of machine learning models. +They combine the outputs of multiple models to create a stronger model.
+from fusion_bench.method import EnsembleAlgorithm
+
+# Instantiate the EnsembleAlgorithm
+algorithm = EnsembleAlgorithm()
+
+# Assume we have a list of PyTorch models (nn.Module instances) that we want to ensemble.
+models = [...]
+
+# Run the algorithm on the models.
+merged_model = algorithm.run(models)
+
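Conceptually, an ensemble keeps every member model and combines their outputs at inference time. The sketch below shows one common choice, averaging the output vectors, using plain Python callables as stand-in models. The helper `ensemble_predict` is hypothetical, and averaging is an assumption about the combination rule rather than a statement of what EnsembleAlgorithm does internally.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]


def ensemble_predict(models: List[Model], x: Sequence[float]) -> List[float]:
    # Run every member model and average their outputs elementwise.
    outputs = [model(x) for model in models]
    n = len(outputs)
    return [sum(values) / n for values in zip(*outputs)]


# Three toy "classifiers" that return two logits each.
def model_a(x):
    return [x[0] + 1.0, x[1]]


def model_b(x):
    return [x[0], x[1] + 1.0]


def model_c(x):
    return [x[0] + 2.0, x[1] - 1.0]


logits = ensemble_predict([model_a, model_b, model_c], [1.0, 2.0])
print(logits)  # [2.0, 2.0]
```

Unlike weight averaging, this approach needs a forward pass through every member, so inference cost grows linearly with the number of models.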
Configuration template for the ensemble algorithm:
Create a simple ensemble of CLIP-ViT models for image classification:
Here we present the taxonomy for the SMILE upscaling method following "A Survey on Model MoErging" by Yadav et al. (2024) 2.
|  |  |  |  |  |  |
|---|---|---|---|---|---|
| Expert Training | Standard | Expert Data | Private | Routing Dataset | None |
| Input Granularity | Step | Depth Granularity | Module | Expert Selection | Sparse |
| Expert Aggregation | Output | Generalization | In-Distribution | User Dataset | Zero-Shot |
The SMILE upscaling method offers several configuration options, which are located in the config/method/
directory.
nn.Module
Upscaling:
+ This configuration is designed for upscaling any neural network module (nn.Module
).
Each configuration file contains detailed parameters and options that can be adjusted to meet the specific needs of your model and application.
+name: smile_upscaling
+
+# merge device on cuda can accelerate the SVD computation
+device: cpu
+# device to compute svd
+upscaling_accelerator: cuda
+full_matrices: true # set to false if you are sure k < rank
+
+gate_k: 1
+k: 128
+top_k: 1
+
+routing_use_diff: true
+# average the remaining part; if this is set to False, the remaining part will be kept as the base model (the pretrained model)
+average_experts: false
+
+# path to save/load the model
+model_path: null
+
name: smile_mistral_upscaling
+
+device: cpu
+accelerator: cuda
+
+# path to save/load the model
+model_path: null
+model_dtype: float16
+
+num_experts_per_tok: 1
+rank_of_router: 8
+rank_of_expert: 512
+
Evaluate single fine-tuned models and save the results to outputs/ViT-B-32/single-task/
and outputs/ViT-L-14/single-task/
for CLIP-ViT-B/32 and CLIP-ViT-L/14 models, respectively.
# evaluate single fine-tuned models
+for task in sun397 stanford-cars resisc45 eurosat svhn gtsrb mnist dtd
+do
+ fusion_bench method=dummy \
+ modelpool=clip-vit-base-patch32_individual \
+ modelpool.models.0.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/single-task/clip-vit-base-patch32_${task}.json"
+done
+
+# if you have multiple GPUs, you can run the following code to evaluate the CLIP-ViT-L/14 models in parallel
+# evaluate single fine-tuned models clip-vit-large
+tasks=(sun397 stanford-cars resisc45 eurosat svhn gtsrb mnist dtd)
+CUDA_DEVICES=(0 1 2 3 4 5 6 7) # List of CUDA devices to use
+
+for i in "${!CUDA_DEVICES[@]}"; do
+ task=${tasks[$i]}
+ CUDA_VISIBLE_DEVICES=${CUDA_DEVICES[$i]} fusion_bench method=dummy \
+ modelpool=clip-vit-large-patch14_individual \
+ modelpool.models.0.path=tanganke/clip-vit-large-patch14_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14 \
+ save_report="outputs/ViT-L-14/single-task/clip-vit-large-patch14_${task}.json" &
+done
+
Upscale eight CLIP-ViT-B/32 models with SMILE; each model is fine-tuned on a different downstream task.
+gate_k=16
+k=32
+fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/eight_tasks/gate_k\=${gate_k}_k\=${k}.json"
+
Hyperparameter search for SMILE upscaling. Pre-run results can be found in examples/smile_upscaling/clip-vit-base-patch32.ipynb
.
for gate_k in 1 2 4 8 16 32 64 128 256 512 768; do
+ for k in 4 8 16 32 64 128 -1; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/eight_tasks/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
Ablations on number of experts per token (Top-K). Pre-run results can be found in examples/smile_upscaling/clip-vit-base-patch32-ablations-topk.ipynb
.
gate_k=16
+k=32
+for top_k in 1 2 4
+do
+fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k method.top_k=$top_k \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/ablation/gate_k\=${gate_k}_k\=${k}_top_k\=${top_k}.json"
+done
+
Hyperparameter search for SMILE upscaling with CLIP-ViT-L/14 models. Pre-run results can be found in examples/smile_upscaling/clip-vit-large-patch14.ipynb
.
for gate_k in 1 2 4 8 16 32 64 128; do
+ for k in 4 8 16 32 64 128 -1; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14 \
+ save_report="outputs/ViT-L-14/eight_tasks/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
Hyperparameter search for fully fine-tuned and LoRA fine-tuned Flan-T5 models.
+Pre-run results can be found in examples/smile_upscaling/flan-t5-base.ipynb
and examples/smile_upscaling/flan-t5-base-lora16.ipynb
.
# hyperparameter search for full fine-tuned flan-t5-base
+for gate_k in 4 8 16 32; do
+ for k in 16 32 64 128; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.device=cpu \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=flan-t5-base_glue \
+ taskpool=flan-t5_glue_text_generation \
+ save_report="outputs/flan-t5-base/glue_text_generation/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
+# hyperparameter search for lora fine-tuned flan-t5-base
+for gate_k in 2 4 8; do
+ for k in 4 8 16; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=flan-t5-base_glue_lora16 \
+ taskpool=flan-t5_glue_text_generation \
+ save_report="outputs/flan-t5-base_lora16/glue_text_generation/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
Here we upscale several Mistral-7B models using SMILE. The models are trained on different tasks and are used as experts in the SMILE upscaling.
+We first provide an example of the upscaled model, in which the linear layers of the original Mistral model are upscaled into SMILE linear layers.
+import torch
+from accelerate import init_empty_weights
+from transformers import AutoConfig
+
+from fusion_bench.models.modeling_smile_mistral import (
+ SmileMistralConfig,
+ SmileMistralForCausalLM,
+)
+
+
+config = AutoConfig.from_pretrained(
+ "mistralai/Mistral-7B-v0.1"
+)
+config = SmileMistralConfig(
+ num_experts_per_tok=1,
+ rank_of_router=8,
+ rank_of_expert=8,
+ num_local_experts=3,
+ **config.to_dict()
+)
+with init_empty_weights():
+ model = SmileMistralForCausalLM(config)
+model.to(dtype=torch.float16).to_empty(device="cuda")
+
The model architecture is as follows:
+SmileMistralForCausalLM(
+ (model): SmileMistralModel(
+ (embed_tokens): Embedding(32000, 4096)
+ (layers): ModuleList(
+ (0-31): 32 x SmileMistralDecoderLayer(
+ (self_attn): SmileMistralAttention(
+ (q_proj): SingularMoELinear(in_features=4096, out_features=4096, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (k_proj): SingularMoELinear(in_features=4096, out_features=1024, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (v_proj): SingularMoELinear(in_features=4096, out_features=1024, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (o_proj): SingularMoELinear(in_features=4096, out_features=4096, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (rotary_emb): MistralRotaryEmbedding()
+ )
+ (mlp): SmileMistralMLP(
+ (gate_proj): SingularMoELinear(in_features=4096, out_features=14336, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (up_proj): SingularMoELinear(in_features=4096, out_features=14336, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (down_proj): SingularMoELinear(in_features=14336, out_features=4096, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): MistralRMSNorm()
+ (post_attention_layernorm): MistralRMSNorm()
+ )
+ )
+ (norm): MistralRMSNorm()
+ )
+ (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
+)
+
Knowing the model architecture, we can upscale the Mistral-7B models using the following steps:
+Prepare the following 4 configuration files in configs/modelpool
:
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: meta-math/MetaMath-Mistral-7B
+
+dtype: float16
+
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: cognitivecomputations/dolphin-2.1-mistral-7b
+
+dtype: float16
+
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: uukuguy/speechless-code-mistral-7b-v1.0
+
+dtype: float16
+
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: meta-math/MetaMath-Mistral-7B
+- name: expert_2
+ path: cognitivecomputations/dolphin-2.1-mistral-7b
+- name: expert_3
+ path: uukuguy/speechless-code-mistral-7b-v1.0
+
+dtype: float16
+
Upscale Mistral-7B models. The upscaled models are saved in outputs/mistral/gate_k-${gate_k}_k-${k}/version_${version}
.
function model_fusion() {
+ output_dir=outputs/mistral/gate_k-${gate_k}_k-${k}/version_${version}
+ fusion_bench \
+ method=smile_mistral_upscaling \
+ method.rank_of_router=$gate_k method.rank_of_expert=$k \
+ method.model_path=${output_dir} \
+ modelpool=smile_mistral_exp_v${version} \
+ modelpool.dtype=float32 \
+ taskpool=dummy \
+ save_report="${output_dir}/model_info.json"
+}
+
+gate_k=8
+for k in 8 16 32 64 128 256 384 512; do
+ for version in 1 2 3 4; do
+ model_fusion
+ done
+done
+
Use lm-evaluation-harness to evaluate the models. We use the default configurations for each task.
+# For some GPUs, the following environment variables need to be set
+# export NCCL_P2P_DISABLE="1"
+# export NCCL_IB_DISABLE="1"
+
+function model_eval() {
+ output_dir=outputs/mistral/gate_k-${gate_k}_k-${k}/version_${version}
+
+ # Check if ${output_dir}/${task}.json exists as a directory and return if it does
+ if [ -d "${output_dir}/${task}.json" ]; then
+ echo "Directory ${output_dir}/${task}.json already exists. Skipping evaluation."
+ return
+ fi
+
+ lm_eval --model hf \
+ --model_args pretrained=${output_dir},dtype="float16",parallelize=True \
+ --tasks ${task} \
+ --output_path ${output_dir}/${task}.json \
+ --batch_size 6
+}
+
The above function can be used to evaluate the models on a specified task.
+Pre-run results can be found in examples/smile_upscaling/mistral_gsm8k.ipynb
.
# Evaluate all the models on GSM8K task
+gate_k=8
+task=gsm8k
+for k in 8 16 32 64 128 256 384 512; do
+ for version in 1 2 3 4; do
+ model_eval
+ done
+done
+
+# Evaluate the M_{0;123} model (version 4, k=8) on truthfulqa gsm8k arc_challenge mmlu
+k=8
+version=4
+for task in truthfulqa gsm8k arc_challenge mmlu; do
+ model_eval
+done
+
The reported metrics are:
+Pre-run results can be found in examples/smile_upscaling/clip-vit-base-patch32_single-task_projection-merging.ipynb
.
# project into different subspaces
+for task in sun397 stanford-cars resisc45 eurosat svhn gtsrb mnist dtd
+do
+ # Space I
+ CUDA_VISIBLE_DEVICES=0 fusion_bench \
+ method=singular_projection_merging \
+ method.device=cuda method.rank=low method.k=-1 method.full_matrices=false \
+ modelpool=clip-vit-base-patch32_single_finetuned \
+ modelpool.models.1.name=${task} \
+ modelpool.models.1.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/single-task/projection_merging_zone1_${task}.json" &
+
+ # Space II
+ CUDA_VISIBLE_DEVICES=1 fusion_bench \
+ method=singular_projection_merging \
+ method.device=cuda method.rank=high method.k=-1 method.full_matrices=false \
+ modelpool=clip-vit-base-patch32_single_finetuned \
+ modelpool.models.1.name=${task} \
+ modelpool.models.1.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/single-task/projection_merging_zone2_${task}.json" &
+
+ # Space III
+ CUDA_VISIBLE_DEVICES=2 fusion_bench \
+ method=singular_projection_merging \
+ method.device=cuda method.rank=high method.k=-1 method.full_matrices=true \
+ modelpool=clip-vit-base-patch32_single_finetuned \
+ modelpool.models.1.name=${task} \
+ modelpool.models.1.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ save_report="outputs/ViT-B-32/single-task/projection_merging_zone23_${task}.json" &
+ wait
+done
+
singular_projection_merging
+
+
+¶
SingularProjectionMergingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
, SimpleProfilerMixin
fusion_bench/method/smile_upscaling/singular_projection_merging.py
merge(pretrained_model, finetuned_model, in_place=True)
+
+¶Merges the pretrained model with the fine-tuned model by projecting parameter differences +into the SVD subspace of the pretrained model.
+ + +Parameters:
+pretrained_model
+ (Module
)
+ –
+ The pretrained model.
+finetuned_model
+ (Module
)
+ –
+ The fine-tuned model.
+in_place
+ (bool
, default:
+ True
+)
+ –
+ If True, modifies the fine-tuned model in place. Otherwise, creates a copy.
+Returns:
+nn.Module: The merged model.
+fusion_bench/method/smile_upscaling/singular_projection_merging.py
projection_merge_linear(pretrained_model, finetuned_model, k)
+
+¶Projects the parameter differences of linear layers into the SVD subspace of the pretrained model.
+ + +Parameters:
+pretrained_model
+ (Linear
)
+ –
+ The linear layer of the pretrained model.
+finetuned_model
+ (Linear
)
+ –
+ The linear layer of the fine-tuned model.
+k
+ (int
)
+ –
+ The number of singular values to keep. If negative, it is determined based on the sum of singular values.
+Returns:
+nn.Linear: The merged linear layer with projected parameter differences.
+fusion_bench/method/smile_upscaling/singular_projection_merging.py
run(modelpool)
+
+¶Project the parameter differences into pre-trained SVD subspace. +This is an experimental method to investigate the location of task-specific knowledge.
+ +fusion_bench/method/smile_upscaling/singular_projection_merging.py
smile_upscaling
+
+
+¶
SmileMoELinear
+
+
+¶
+ Bases: Module
fusion_bench/method/smile_upscaling/smile_upscaling.py
weight
+
+
+ property
+
+
+¶Mimic a linear layer, because in some cases the user may query the device (or dtype of parameters) of the linear layer using linear_layer.weight.device.
SmileUpscalingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
, SimpleProfilerMixin
fusion_bench/method/smile_upscaling/smile_upscaling.py
merge(pretrained_model, finetuned_models, in_place=True)
+
+¶Merges the pretrained model with the fine-tuned models to create an upscaled model.
+ + +Parameters:
+pretrained_model
+ (Module
)
+ –
+ The pretrained model.
+finetuned_models
+ (List[Module]
)
+ –
+ A list of fine-tuned models.
+in_place
+ (bool
, default:
+ True
+)
+ –
+ If True, modifies the pretrained model in place. Otherwise, creates a copy.
+Returns:
+nn.Module: The merged model.
+fusion_bench/method/smile_upscaling/smile_upscaling.py
run(modelpool)
+
+¶Executes the upscaling process.
+ + +Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to be used for upscaling.
+Returns:
+nn.Module: The upscaled model.
+fusion_bench/method/smile_upscaling/smile_upscaling.py
smile_mistral_upscaling
+
+
+¶
SmileMistralUpscalingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
, SimpleProfilerMixin
fusion_bench/method/smile_upscaling/smile_mistral_upscaling.py
merge(pretrained_model, finetuned_models)
+
+¶Merges the pretrained model with the fine-tuned models to create an upscaled model.
+ + +Parameters:
+pretrained_model
+ (MistralForCausalLM
)
+ –
+ The pretrained model.
+finetuned_models
+ (List[MistralForCausalLM]
)
+ –
+ A list of fine-tuned models.
+Returns:
+SmileMistralForCausalLM
–
+ The upscaled model.
+fusion_bench/method/smile_upscaling/smile_mistral_upscaling.py
run(modelpool)
+
+¶Executes the upscaling process.
+ + +Parameters:
+modelpool
+ (ModelPool
)
+ –
+ The pool of models to be used for upscaling.
+Returns:
+SmileMistralForCausalLM
( SmileMistralForCausalLM
+) –
+ The upscaled model.
+fusion_bench/method/smile_upscaling/smile_mistral_upscaling.py
A. Tang et al. SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models. Aug, 2024. +https://arxiv.org/abs/2408.10174 ↩
+Yadav, Prateek, et al. "A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning." arXiv preprint arXiv:2408.07057 (2024). ↩
+In the rapidly advancing field of machine learning, multi-task learning has emerged as a powerful paradigm, allowing models to leverage information from multiple tasks to improve performance and generalization. One intriguing method in this domain is Task Arithmetic, which involves the combination of task-specific vectors derived from model parameters.
+ +Task Vector. A task vector encapsulates the adjustments a model needs in order to specialize in a specific task. +It is derived from the difference between a pre-trained model's parameters and those fine-tuned for a particular task. +Formally, if \(\theta_i\) represents the model parameters fine-tuned for the i-th task and \(\theta_0\) denotes the parameters of the pre-trained model, the task vector for the i-th task is defined as:
+\[ \tau_i = \theta_i - \theta_0 \]
+This representation is crucial for methods like Task Arithmetic, where multiple task vectors are aggregated and scaled to form a comprehensive multi-task model.
+Task Arithmetic1 begins by computing a task vector \(\tau_i\) for each individual task, using the set of model parameters \(\theta_0 \cup \{\theta_i\}_i\) where \(\theta_0\) is the pre-trained model and \(\theta_i\) are the fine-tuned parameters for the i-th task. +These task vectors are then aggregated to form a multi-task vector. +Subsequently, the multi-task vector is combined with the pre-trained model parameters to obtain the final multi-task model. +This process involves scaling the combined vector element-wise by a scaling coefficient (denoted as \(\lambda\)), before adding it to the initial pre-trained model parameters. +The resulting formulation for obtaining a multi-task model is expressed as
+\[ \theta = \theta_0 + \lambda \sum_i \tau_i \]
+The choice of the scaling coefficient \(\lambda\) plays a crucial role in the final model performance. Typically, \(\lambda\) is chosen based on validation set performance.
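The formulation above can be sketched in a few lines of plain Python. This is a toy illustration, not the fusion_bench implementation: state dicts are modeled as dictionaries of float lists instead of tensors, and the names `task_vector` and `task_arithmetic` are hypothetical helpers introduced only for this example.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def task_vector(pretrained: StateDict, finetuned: StateDict) -> StateDict:
    """Compute tau_i = theta_i - theta_0, parameter by parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def task_arithmetic(
    pretrained: StateDict,
    finetuned_models: List[StateDict],
    scaling_factor: float,
) -> StateDict:
    """Compute theta = theta_0 + lambda * sum_i tau_i."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for finetuned in finetuned_models:
        tau = task_vector(pretrained, finetuned)
        for name, delta in tau.items():
            merged[name] = [
                m + scaling_factor * d for m, d in zip(merged[name], delta)
            ]
    return merged


theta_0 = {"w": [1.0, 2.0]}
theta_1 = {"w": [2.0, 2.0]}  # task vector [1.0, 0.0]
theta_2 = {"w": [1.0, 4.0]}  # task vector [0.0, 2.0]
merged = task_arithmetic(theta_0, [theta_1, theta_2], scaling_factor=0.5)
print(merged)  # {'w': [1.5, 3.0]}
```

With real models the same arithmetic would be applied to each tensor in the state dict; the scaling factor plays the role of \(\lambda\).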
+To use the Task Arithmetic algorithm, you can use the TaskArithmeticAlgorithm
class from the fusion_bench.method
module.
from fusion_bench.method import TaskArithmeticAlgorithm
+from omegaconf import DictConfig
+from torch import nn
+
+# Instantiate the TaskArithmeticAlgorithm
+method_config = {'name': 'task_arithmetic', 'scaling_factor': 0.5}
+algorithm = TaskArithmeticAlgorithm(DictConfig(method_config))
+
+# Assume we have a dict of PyTorch models (nn.Module instances) that we want to merge.
+# The models should all have the same architecture.
+# the dict must contain the pre-trained model with the key '_pretrained_', and arbitrary number of fine-tuned models.
+models = {'_pretrained_': nn.Linear(10,10), 'model_1': nn.Linear(10,10), 'model_2': nn.Linear(10,10)}
+
+# Run the algorithm on the models.
+# This will return a new model that is the result of task arithmetic on the input models.
+merged_model = algorithm.run(models)
+
Configuration template for the Task Arithmetic algorithm:
+name: task_arithmetic
+scaling_factor: 0.5 # Scaling factor for task vectors
+
Use the following command to run the Task Arithmetic algorithm:
+ +For example, to run the Task Arithmetic algorithm on two models with scaling factor 0.5:
+fusion_bench method=task_arithmetic \
+ method.scaling_factor=0.5 \
+ modelpool=clip-vit-base-patch32_svhn_and_mnist \
+ taskpool=clip-vit-base-patch32_svhn_and_mnist
+
where the configuration for the model pool is:
+type: huggingface_clip_vision
+# the modelpool must contain the pre-trained model with the name '_pretrained_',
+# and arbitrary number of fine-tuned models.
+models:
+ - name: _pretrained_
+ path: openai/clip-vit-base-patch32
+ - name: svhn
+ path: tanganke/clip-vit-base-patch32_svhn
+ - name: mnist
+ path: tanganke/clip-vit-base-patch32_mnist
+
and the configuration for the task pool:
+type: clip_vit_classification
+
+dataset_type: huggingface_image_classification
+tasks:
+ - name: svhn
+ dataset:
+ type: instantiate
+ name: svhn
+ object:
+ _target_: datasets.load_dataset
+ _args_:
+ - svhn
+ - cropped_digits
+ split: test
+ - name: mnist
+ dataset:
+ name: mnist
+ split: test
+
+...
+
TaskArithmeticAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
, SimpleProfilerMixin
fusion_bench/method/task_arithmetic.py
run(modelpool)
+
+¶fusion_bench/method/task_arithmetic.py
(ICLR 2023) Editing Models with Task Arithmetic. http://arxiv.org/abs/2212.04089 ↩
+(ICLR 2024) AdaMerging: Adaptive Model Merging for Multi-Task Learning. http://arxiv.org/abs/2310.02575 ↩
+(NIPS 2023 Oral) Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard, “Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models,” doi: 10.48550/arXiv.2305.12827. ↩
+Ties-Merging1 represents a novel and structured approach to consolidating multiple task-specific models into a single, efficient multi-task model. This method employs a sequence of deliberate steps to systematically merge task vectors, ensuring that the final model effectively integrates the strengths of each individual task-specific model and resolves potential conflicts between them.
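+The Ties-Merging algorithm operates through three primary steps:
+1. Trim: remove redundant parameters from each task vector, keeping only the entries with the largest magnitudes.
+2. Elect Sign: for each parameter, choose a single sign across tasks to resolve sign conflicts.
+3. Disjoint Merge: for each parameter, average only the values whose sign agrees with the elected sign, yielding the merged task vector \(\tau\).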
+Given the final merged task vector \(\tau\), the final model is obtained in the same way as in task arithmetic. The formulation is expressed as:
+\[ \theta = \theta_0 + \lambda \tau \]
+where \(\lambda\) is a hyperparameter chosen based on the validation set to ensure the best-performing model.
+By following these structured steps, Ties-Merging effectively integrates multiple task-specific models into a unified multi-task model, balancing the contributions of each task to enhance overall performance. The process ensures that the final model retains the benefits of the pre-trained model while optimally incorporating the diverse knowledge contained within the individual task-specific models.
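As a rough illustration of the merging procedure described in the Ties-Merging paper (trim, elect sign, disjoint merge), the sketch below operates on flat lists of floats. It is not the fusion_bench implementation: `ties_merge` and `keep_ratio` are hypothetical names, and how `keep_ratio` relates to the `threshold` option of the library is an assumption left open here.

```python
from typing import List


def ties_merge(task_vectors: List[List[float]], keep_ratio: float) -> List[float]:
    """Merge flat task vectors with trim, elect-sign, and disjoint-merge steps."""
    trimmed = []
    for tau in task_vectors:
        # Trim: keep only the largest-magnitude entries of each task vector
        # (entries tied with the cutoff magnitude are kept as well).
        k = max(1, int(round(keep_ratio * len(tau))))
        cutoff = sorted((abs(v) for v in tau), reverse=True)[k - 1]
        trimmed.append([v if abs(v) >= cutoff else 0.0 for v in tau])

    merged = []
    for values in zip(*trimmed):
        # Elect sign: the sign of the summed values, i.e. the sign that
        # carries the larger total magnitude across tasks.
        sign = 1.0 if sum(values) >= 0 else -1.0
        # Disjoint merge: average only nonzero entries that agree with it.
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


tau_a = [4.0, -1.0, 0.5, 2.0]
tau_b = [-3.0, 0.5, 0.2, 2.0]
tau = ties_merge([tau_a, tau_b], keep_ratio=0.5)
print(tau)  # [4.0, 0.0, 0.0, 2.0]

# Final model: theta = theta_0 + lambda * tau
theta_0 = [1.0, 1.0, 1.0, 1.0]
scaling_factor = 0.5
theta = [p + scaling_factor * t for p, t in zip(theta_0, tau)]
print(theta)  # [3.0, 1.0, 1.0, 2.0]
```

In the first coordinate the two task vectors disagree in sign; the positive sign wins because it carries more magnitude, so only the value 4.0 contributes to the merged entry.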
+In the above figure, we show the average performance of Task Arithmetic and Ties-Merging merged models as the scaling coefficient varies. Subfigure (a), (b), (c), and (d) show the results of CLIP-ViT-B/32, CLIP-ViT-L/14, Flan-T5-base (LoRA fine-tuned), and Flan-T5-large (LoRA fine-tuned), respectively. It is evident that the merged multi-task model hits a peak in average performance across various tasks when the scaling coefficient is set around 0.3. This value was empirically selected as the scaling coefficient in our experiments. As we increase the scaling coefficient beyond this point, the average performance of the model begins to decline, eventually even falling below the level of the pre-trained model’s original performance. This suggests that too high of a scaling coefficient can have a negative impact on the knowledge that the pre-trained model initially possessed, emphasizing the importance of calibrating the scaling coefficient parameter \(\lambda\) to avoid diminishing the model’s existing strengths.
+Configuration template for the Ties-Merging algorithm:
+name: ties_merging
+# Scaling factor $\lambda$
+scaling_factor: 0.5
+threshold: 0.5
+# List of keys to remove from the state dict, default is empty
+remove_keys: []
+# Function to merge the models, default is sum. Options are 'sum', 'mean', and 'max'
+merge_func: sum
+
Use the following command to run the Ties-Merging algorithm:
+ +
TiesMergingAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
fusion_bench/method/ties_merging/ties_merging.py
run(modelpool)
+
+¶fusion_bench/method/ties_merging/ties_merging.py
(NIPS 2023) Resolving Interference When Merging Models. http://arxiv.org/abs/2306.01708 ↩
+This method is designed to handle a wide range of tasks by segregating shared information and task-specific knowledge. +It dynamically combines these elements based on the input samples.
+The Weight-Ensembling MoE module consists of three main components: the router, the pre-trained MLP weights, and a collection of task vectors. +The router, which is an MLP, processes the input data and generates routing weights. These weights determine how the knowledge from different tasks is combined. +The pre-trained MLP weights are crucial as they have been trained to recognize a wide range of data patterns. +The task vectors represent the differences between the MLPs that have been fine-tuned for specific tasks and the pre-trained ones, capturing the unique adjustments made to optimize them for specific tasks. +The routing weights are averaged across the input tokens, and these weights are used to select task vectors from a dictionary matrix. +These task vectors are then added to the pre-trained MLP weights to create input-conditioned weights.
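The combination step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the router outputs are taken as given (the router MLP itself is omitted), weights are flat lists of floats rather than tensors, and `input_conditioned_weights` is a hypothetical helper name, not part of fusion_bench.

```python
from typing import List


def input_conditioned_weights(
    pretrained: List[float],
    task_vectors: List[List[float]],
    token_routing_weights: List[List[float]],
) -> List[float]:
    """Add routing-weighted task vectors to the pre-trained weights."""
    num_tokens = len(token_routing_weights)
    num_tasks = len(task_vectors)
    # Average the per-token routing weights across the input tokens.
    avg_routing = [
        sum(token[k] for token in token_routing_weights) / num_tokens
        for k in range(num_tasks)
    ]
    # Combine the task vectors with the averaged routing weights and
    # add the result to the pre-trained weights.
    merged = list(pretrained)
    for weight, tau in zip(avg_routing, task_vectors):
        merged = [m + weight * t for m, t in zip(merged, tau)]
    return merged


pretrained = [1.0, 1.0]
task_vectors = [[2.0, 0.0], [0.0, 4.0]]   # one task vector per task
routing = [[1.0, 0.0], [0.5, 0.5]]        # one row of routing weights per token
print(input_conditioned_weights(pretrained, task_vectors, routing))  # [2.5, 2.0]
```

Because the routing weights depend on the input, the resulting MLP weights differ from sample to sample, which is what distinguishes this approach from static merging.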
+Algorithm Requirements:
+Method | +Access to labeled tasks data | +Access to validation data (labeled) | +Test time adaptation | +
---|---|---|---|
Fisher Merging | +Yes (Estimate Fisher information matrix) | +No | +No | +
RegMean | +Yes (compute Gram Matrix) | +No | +No | +
Task Arithmetic | +No | +Yes (select scaling factor) | +No | +
Ties-Merging | +No | +Yes (select scaling factor) | +No | +
AdaMerging | +No | +No | +Yes | +
Ours | +No | +No | +Yes | +
Tip for reducing the parameter count
+Here we present the parameter count for the method outlined in the original paper1. +An effective strategy to minimize the number of parameters involves employing Singular Value Decomposition (SVD) to compress the task vectors. +This approach significantly cuts down on the number of parameters while only marginally impacting performance. +For additional information, please refer to the Twin-Merging paper2, +which not only reduces the number of parameters but also conducts extensive experiments to demonstrate the effectiveness of data-adaptive merging in the language domain.
+Here is the number of parameters compared to a single pre-trained model (OpenCLIP CLIP-ViT-B/32):
+Method | +Trainable Parameters | +Total Parameters | +Parameters Reduced by Merging | +
---|---|---|---|
Single Pre-trained | +113.45M (100%) | +113.45M | +- | +
WEMoE (2-layer, 1 task) | +7.10M (4.00%) | +177.21M | +- | +
WEMoE (2-layer, 2 tasks) | +7.11M (3.04%) | +233.89M | +2*113.45-233.89=-6.99M | +
WEMoE (2-layer, 3 tasks) | +7.11M (2.45%) | +290.57M | +3*113.45-290.57=49.78M | +
WEMoE (2-layer, 4 tasks) | +7.12M (2.05%) | +347.25M | +4*113.45-347.25=106.55M | +
WEMoE (2-layer, 5 tasks) | +7.13M (1.77%) | +403.93M | +5*113.45-403.93=163.32M | +
WEMoE (2-layer, 6 tasks) | +7.14M (1.55%) | +460.61M | +6*113.45-460.61=220.09M | +
WEMoE (2-layer, 7 tasks) | +7.15M (1.38%) | +517.28M | +7*113.45-517.28=276.87M | +
WEMoE (2-layer, 8 tasks) | +7.16M (1.25%) | +573.96M | +8*113.45-573.96=333.64M | +
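The last column of the table is plain arithmetic: the parameters of n separate models minus the parameters of the merged model. The short sketch below reproduces two rows from the numbers in the table; the helper names are hypothetical and exist only for this example.

```python
SINGLE_MODEL_PARAMS_M = 113.45  # single pre-trained model, in millions


def parameters_reduced(num_tasks: int, merged_total_m: float) -> float:
    """Parameters saved (in millions) versus keeping one model per task."""
    return round(num_tasks * SINGLE_MODEL_PARAMS_M - merged_total_m, 2)


def trainable_percentage(trainable_m: float, total_m: float) -> float:
    """Share of trainable parameters in the merged model, in percent."""
    return round(100 * trainable_m / total_m, 2)


print(parameters_reduced(2, 233.89))       # -6.99 (merged model is larger)
print(parameters_reduced(8, 573.96))       # 333.64
print(trainable_percentage(7.16, 573.96))  # 1.25
```

A negative value means the merged model is larger than the individual models combined, which happens here only for the two-task case.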
The parameter counts of HuggingFace CLIP vision models (of type transformers.models.clip.modeling_clip.CLIPVisionModel
) are different from the OpenCLIP models downloaded from the task arithmetic repo, because the OpenCLIP models (of type src.modeling.ImageEncoder
) include the embedding layer for text tokens, while the HuggingFace CLIP vision models do not.
+Therefore, the relative parameter count of the upscaled model will be larger when using HuggingFace Transformers CLIP vision models than when using the OpenCLIP models.
ImageEncoder( # (1)
+ (model): CLIP(
+ (visual): VisualTransformer( # (2)
+ (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
+ (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (transformer): Transformer(
+ (resblocks): ModuleList(
+ (0-11): 12 x ResidualAttentionBlock(
+ (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (attn): MultiheadAttention(
+ (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+ )
+ (ln_attn): Identity()
+ (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (mlp): Sequential(
+ (c_fc): Linear(in_features=768, out_features=3072, bias=True)
+ (ln): Identity()
+ (gelu): QuickGELU()
+ (c_proj): Linear(in_features=3072, out_features=768, bias=True)
+ )
+ )
+ )
+ )
+ (ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ )
+ (token_embedding): Embedding(49408, 512) # (3)
+ (ln_final): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+ )
+)
+
CLIPVisionModel( # (1)
+ (vision_model): CLIPVisionTransformer(
+ (embeddings): CLIPVisionEmbeddings(
+ (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
+ (position_embedding): Embedding(50, 768)
+ )
+ (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (encoder): CLIPEncoder(
+ (layers): ModuleList(
+ (0-11): 12 x CLIPEncoderLayer(
+ (self_attn): CLIPAttention(
+ (k_proj): Linear(in_features=768, out_features=768, bias=True)
+ (v_proj): Linear(in_features=768, out_features=768, bias=True)
+ (q_proj): Linear(in_features=768, out_features=768, bias=True)
+ (out_proj): Linear(in_features=768, out_features=768, bias=True)
+ )
+ (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (mlp): CLIPMLP(
+ (activation_fn): QuickGELUActivation()
+ (fc1): Linear(in_features=768, out_features=3072, bias=True)
+ (fc2): Linear(in_features=3072, out_features=768, bias=True)
+ )
+ (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ )
+ )
+ )
+ (post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ )
+)
+
In the figure below, we show the performance of the merged models with varying numbers of steps. +In Figure (a), we merge CLIP-ViT-B/32 models with different learning rate configurations. +Figure (b) shows the performance of the merged WEMoE models with varying numbers of steps. +We observe that the performance of the merged model shows an upward trend as the number of training steps increases, and it converges rapidly, reaching a high accuracy level in just 200 steps. +Furthermore, the influence of different learning rates is not significant, suggesting that our method is insensitive to the learning rate. This is a desirable property, as it reduces the need for hyperparameter tuning.
+ +Table: Parameter comparison of WEMoE (1-layer) and WEMoE (2-layer) on CLIP-ViT-B/32 models (OpenCLIP).
+Method | +Number of Trainable Parameters | +
---|---|
AdaMerging (layer-wise) | +1.3K | +
WEMoE (1-layer) | +73.8K (0.01%) | +
WEMoE (2-layer) | +7.16M (1.25%) | +
Table: Ablation study of the router depth on the performance of the up-scaled CLIP-ViT-B/32 models (OpenCLIP).
+Method | +SUN397 | +CARS | +RESISC45 | +EuroSAT | +SVHN | +GTSRB | +MNIST | +DTD | +Avg. | +
---|---|---|---|---|---|---|---|---|---|
AdaMerging (layer-wise) | +66.6 | +68.3 | +82.4 | +92.5 | +86.5 | +93.7 | +97.7 | +61.1 | +80.9 | +
WEMoE (1-layer) | +73.2 | +76.7 | +93.8 | +98.6 | +95.7 | +98.6 | +99.5 | +74.5 | +88.3 | +
WEMoE (2-layer) | +74.1 | +77.4 | +93.7 | +99.1 | +96.2 | +98.9 | +99.6 | +76.4 | +89.4 | +
To explore the influence of router depth on the performance of the scaled-up model, we perform an ablation study where the router depth is varied. In WEMoE modules, the router is implemented as a multi-layer perceptron (MLP).
+In the above two Tables, we present additional findings to support our argument. We compare the number of trainable parameters and performance between WEMoE (1-layer) and WEMoE (2-layer). The data reveal that WEMoE (1-layer) possesses 73.8K trainable parameters, which constitute only 0.01% of the total parameters in the merged model. Notably, the performance of WEMoE (1-layer) is significantly better than AdaMerging and nearly matches that of WEMoE (2-layer) across all tasks. This evidence underscores our claim that the MoE design is crucial for performance enhancement.
+Multi-task model fusion experiment on eight image classification tasks:
+# merge eight CLIP-ViT-B/32 models using WE MoE
+fusion_bench \
+ method=weight_ensembling_moe \
+ method.name=clip_weight_ensembling_moe \
+ method.use_grad_accumulate=false \
+ method.save_checkpoint=outputs/clip-vit-base-patch32_TA8_weight_ensembling_moe_checkpoint.ckpt \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
Merge eight CLIP-ViT-L/14 models:
+# merge eight CLIP-ViT-L/14 models using WE MoE, fine-tune the routers
+fusion_bench print_config=false \
+ method=weight_ensembling_moe \
+ method.name=clip_weight_ensembling_moe \
+ method.use_grad_accumulate=true \
+ method.save_checkpoint=outputs/clip-vit-large-patch14_TA8_weight_ensembling_moe_checkpoint.ckpt \
+ method.batch_size=4 method.devices=4 \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=dummy &&
+
+# load the checkpoint and evaluate the model
+fusion_bench \
+ method=weight_ensembling_moe \
+ method.name=clip_weight_ensembling_moe \
+ method.checkpoint=outputs/clip-vit-large-patch14_TA8_weight_ensembling_moe_checkpoint.ckpt \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14
+
we_moe
+
+
+¶
WeightEnsemblingMoEAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
fusion_bench/method/we_moe/we_moe.py
construct_moe_model()
+
+
+ abstractmethod
+
+
+¶Construct the Mixture of Experts model using the models in the model pool.
+ + +
load_checkpoint(model, checkpoint)
+
+
+ abstractmethod
+
+
+¶
save_checkpoint(model, checkpoint)
+
+
+ abstractmethod
+
+
+¶
entropy_loss(logits)
+
+¶Compute the entropy loss of a set of logits.
+ + +Parameters:
+logits
+ (Tensor
)
+ –
+ The logits to compute the entropy loss of.
+Returns:
+Tensor
( Tensor
+) –
+ The entropy loss of the logits.
+fusion_bench/method/we_moe/we_moe.py
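For intuition, an entropy loss over logits can be sketched with the standard library alone. This is a minimal illustration, assuming the loss is the Shannon entropy of the softmax distribution averaged over samples; the actual implementation in we_moe.py operates on tensors and may differ in details such as numerical smoothing.

```python
import math
from typing import List


def entropy_loss(logits: List[List[float]]) -> float:
    """Mean Shannon entropy of softmax(logits), one row of logits per sample."""
    total = 0.0
    for row in logits:
        # Numerically stable softmax: subtract the row maximum first.
        row_max = max(row)
        exps = [math.exp(v - row_max) for v in row]
        normalizer = sum(exps)
        probs = [e / normalizer for e in exps]
        # Entropy of the predicted distribution (0 * log 0 treated as 0).
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(logits)


uncertain = entropy_loss([[0.0, 0.0]])    # uniform prediction: entropy ln(2)
confident = entropy_loss([[10.0, -10.0]])  # peaked prediction: entropy near 0
print(uncertain > confident)  # True
```

Minimizing this quantity on unlabeled test samples pushes the model toward confident predictions, which is why it can serve as a test-time adaptation objective.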
clip_we_moe
+
+
+¶
CLIPWeightEnsemblingMoEAlgorithm
+
+
+¶
+ Bases: WeightEnsemblingMoEAlgorithm
fusion_bench/method/we_moe/clip_we_moe.py
on_test_time_adaptation_start()
+
+¶Here we load the CLIP processor and construct the zero-shot classification head for each task.
+ +fusion_bench/method/we_moe/clip_we_moe.py
Anke Tang et al. ICML 2024. Merging Multi-Task Models via Weight-Ensembling Mixture of Experts. http://arxiv.org/abs/2402.00433 ↩
+Z. Lu, C. Fan, W. Wei, X. Qu, D. Chen, and Y. Cheng, “Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging,” doi: 10.48550/arXiv.2406.15479. ↩
+Weighted averaging is also known as weight-ensembling. +In the context of fully fine-tuned models, the weights are averaged according to their respective performance weights. Concretely, this means that if we have \(n\) models with their respective weights \(\theta_i\) and model-wise weights \(w_i\), the weights of the final model \(\theta\) are computed as:
+\[ \theta = \sum_{i=1}^{n} w_i \theta_i \]
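A minimal sketch of weighted averaging over parameters is shown below. It is a toy illustration rather than the fusion_bench implementation: state dicts are modeled as dictionaries of float lists, and `weighted_average` is a hypothetical helper whose `normalize` flag is assumed to behave like the configuration option of the same name.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def weighted_average(
    state_dicts: List[StateDict],
    weights: List[float],
    normalize: bool = True,
) -> StateDict:
    """Compute theta = sum_i w_i * theta_i for every parameter."""
    if len(state_dicts) != len(weights):
        raise ValueError("Number of weights must match number of models.")
    if normalize:
        # Rescale the model-wise weights so that they sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    merged: StateDict = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged


model_a = {"w": [1.0, 3.0]}
model_b = {"w": [3.0, 5.0]}
print(weighted_average([model_a, model_b], weights=[1.0, 1.0]))  # {'w': [2.0, 4.0]}
```

With equal weights and normalization enabled, this reduces to simple averaging of the two models.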
+Configuration template for the Weighted Averaging algorithm:
+name: weighted_average
+normalize: true # if true, the weights will be normalized before merging
+weights: # List of weights for each model
+ - 0.5
+ - 0.5
+
Use the following command to run the Weighted Averaging algorithm:
+ +Here is an example of how to use the Weighted Averaging algorithm to merge two Llama models. In particular, Llama models of the type transformers.LlamaForCausalLM
are merged using the Weighted Averaging algorithm.
fusion_bench \
+ method=weighted_average_for_llama \
+ method.merged_model_save_path=outputs/test_merged_llama_model \
+ modelpool=llama_for_causallm \
+ taskpool=dummy
+
or using the following configuration file config/llama_weighted_average.yaml
defaults:
+ - example_config
+ - override method: weighted_average_for_llama
+ - override modelpool: llama_for_causallm
+ - _self_
+
+modelpool:
+ models:
+ # the pre-trained model (base model) is optional
+ # if not provided, the first model will be used as the base model
+ - name: _pretrained_
+ path: meta-llama/Meta-Llama-3-8B
+ - name: expert_1
+ path: meta-llama/Meta-Llama-3-8B
+ - name: expert_2
+ path: meta-llama/Meta-Llama-3-8B-Instruct
+
+method:
+ normalize: true # if true, the weights will be normalized before merging
+ weights: # List of weights for each model
+ - 0.5
+ - 0.5
+ # if true, only the backbone of the model will be merged and the head will be kept as in the pre-trained model (if the pre-trained model is provided, otherwise the head of the first model will be used)
+ # if false, the whole model will be merged
+ backbone_only: true
+
+ merged_model_save_path: null
+ save_tokenizer: true
+ push_to_hub: false
+
WeightedAverageAlgorithm
+
+
+¶
+ Bases: ModelFusionAlgorithm
fusion_bench/method/weighted_average/weighted_average.py
run(modelpool)
+
+¶Fuses the models in the model pool using a weighted average approach.
+Parameters: modelpool : ModelPool + The pool of models to be fused.
+Raises: ValueError + If the number of weights does not match the number of models in the model pool.
+Returns: forward_model : torch.nn.Module + The resulting model after fusion.
+ +fusion_bench/method/weighted_average/weighted_average.py
WeightedAverageForLLama
+
+
+¶
+ Bases: ModelFusionAlgorithm
A class to perform weighted averaging of models in a LLamaForCausalLMPool.
+ + +Attributes:
+config
+ (DictConfig
)
+ –
+ Configuration parameters for the weighted averaging process.
+Methods:
+run
+ –
+ Executes the weighted averaging of models in the provided model pool (modelpool: LLamaForCausalLMPool).
+fusion_bench/method/weighted_average/llama.py
run(modelpool)
+
+¶Executes the weighted averaging of models in the provided model pool.
+ + +Parameters:
+modelpool
+ (LLamaForCausalLMPool
)
+ –
+ The pool of models to be averaged.
+Returns:
+base_model
–
+ The base model after merging the state dictionaries of the models in the pool.
+Raises:
+ValueError
+ –
+ If the number of weights does not match the number of models in the pool.
+fusion_bench/method/weighted_average/llama.py
A weighted ensemble is a machine learning technique that combines the predictions of multiple models to produce a final prediction. The idea is to leverage the strengths of each individual model to improve overall performance and robustness.
+Formally, a weighted ensemble can be defined as follows:
+Given a set of \(n\) models, each model \(f_i\) produces a prediction \(f_i(x)\) for an input \(x\). Each model \(i\) also has an associated weight \(w_i\). The final prediction \(F(x)\) of the weighted ensemble is a weighted sum of the individual model predictions:
+\[ F(x) = \sum_{i=1}^{n} w_i f_i(x) \]
+The weights \(w_i\) are typically non-negative and sum to 1 (i.e., \(\sum_{i=1}^n w_i = 1\)), which ensures that the final prediction is a convex combination of the individual model predictions. +The weights can be determined in various ways. They could be set based on the performance of the models on a validation set, or they could be learned as part of the training process. In some cases, all models might be given equal weight. +The goal of a weighted ensemble is to produce a final prediction that is more accurate or robust than any individual model. This is particularly useful when the individual models have complementary strengths and weaknesses.
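The definition above can be sketched without any framework. In this toy illustration, models are plain Python callables returning lists of floats, and `weighted_ensemble` is a hypothetical helper, not the WeightedEnsembleAlgorithm class of fusion_bench.

```python
from typing import Callable, List

Model = Callable[[float], List[float]]


def weighted_ensemble(
    models: List[Model],
    weights: List[float],
    normalize: bool = True,
) -> Model:
    """Return F with F(x) = sum_i w_i * f_i(x)."""
    if len(models) != len(weights):
        raise ValueError("Number of weights must match number of models.")
    if normalize:
        # Divide each weight by the sum of all weights so they sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x: float) -> List[float]:
        outputs = [model(x) for model in models]
        return [
            sum(w * out[j] for w, out in zip(weights, outputs))
            for j in range(len(outputs[0]))
        ]

    return predict


def f_1(x: float) -> List[float]:
    return [x, 0.0]


def f_2(x: float) -> List[float]:
    return [0.0, x]


ensemble = weighted_ensemble([f_1, f_2], weights=[1.0, 3.0])
print(ensemble(4.0))  # [1.0, 3.0]
```

Unlike weight averaging, the ensemble keeps every model and combines their outputs, so inference cost grows with the number of models.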
+The following Python code snippet demonstrates how to use the WeightedEnsembleAlgorithm
class from the fusion_bench.method
module to create a weighted ensemble of PyTorch models.
from omegaconf import DictConfig
+from fusion_bench.method import WeightedEnsembleAlgorithm
+
+# Instantiate the algorithm
+method_config = {'name': 'weighted_ensemble', 'weights': [0.3, 0.7]}
+algorithm = WeightedEnsembleAlgorithm(DictConfig(method_config))
+
+# Assume we have a list of PyTorch models (nn.Module instances) that we want to ensemble.
+models = [...]
+
+# Run the algorithm on the models.
+merged_model = algorithm.run(models)
+
Here's a step-by-step explanation:
+1. Instantiate the WeightedEnsembleAlgorithm:
+   - A dictionary method_config is created with two keys: 'name' and 'weights'. The 'name' key is set to 'weighted_ensemble', indicating the type of ensemble method to use. The 'weights' key is set to the list [0.3, 0.7], giving the weight assigned to each model in the ensemble.
+   - The method_config dictionary is converted to a DictConfig object, which is a configuration object used by the omegaconf library.
+   - The WeightedEnsembleAlgorithm is then instantiated with the DictConfig object as an argument.
+2. Assume a list of PyTorch models that you want to ensemble. This list is assigned to the variable models. The actual models are not shown in this code snippet.
+3. Run the algorithm on the models: the run method of the WeightedEnsembleAlgorithm instance is called with the models list as an argument. The result is a merged model that represents the weighted ensemble of the input models. This merged model is assigned to the variable merged_model.
Here we list the options for the weighted ensemble algorithm:
+Option | +Default | +Description | +
---|---|---|
weights | + | +A list of floats representing the weights for each model in the ensemble. | +
normalize | +True | +Whether to normalize the weights so that they sum to 1. Default is True. | +
If normalize is set to True, the weights will be normalized so that they sum to 1. Mathematically, this means that each weight \(w_i\) is divided by the sum of all weights, so that
+\[ w_i \leftarrow \frac{w_i}{\sum_{j=1}^{n} w_j} \]
Configuration template for the weighted ensemble algorithm:
+name: weighted_ensemble
+
+# this should be a list of floats, one for each model in the ensemble
+# If weights is null, the ensemble will use the default weights, which are equal weights for all models.
+weights: null
+normalize: true
+
Construct a weighted ensemble using our CLI tool fusion_bench
: