Incorporate Freemdx improvements (#4)
* Sync latest code from free-music-demixer
* Vendor zlib, add gzipped quantized weights file
sevagh committed Dec 30, 2023
1 parent 49940d8 commit 220363b
Showing 253 changed files with 70,977 additions and 1,314 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.bin.gz filter=lfs diff=lfs merge=lfs -text
8 changes: 7 additions & 1 deletion CMakeLists.txt
@@ -43,6 +43,12 @@ add_subdirectory(vendor/libnyquist)
# add library Eigen3
include_directories(vendor/eigen)

# Add subdirectory for zlib in vendor (required for umx)
include_directories(${CMAKE_BINARY_DIR}/vendor/zlib)
include_directories(vendor/zlib)
add_definitions(-D_NO_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE)
add_subdirectory(vendor/zlib)

# add OpenBLAS for blas + lapack
find_package(BLAS REQUIRED)
find_package(LAPACK REQUIRED)
@@ -58,7 +64,7 @@ file(GLOB SOURCES "src/*.cpp")

# compile library, link against libnyquist
add_library(umx.cpp.lib SHARED ${SOURCES})
target_link_libraries(umx.cpp.lib libnyquist ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES} lapacke)
target_link_libraries(umx.cpp.lib libnyquist ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES} lapacke zlibstatic)
if(OPENMP_FOUND)
target_link_libraries(umx.cpp.lib ${OpenMP_CXX_LIBRARIES})
endif()
160 changes: 104 additions & 56 deletions README.md
@@ -1,12 +1,65 @@
# umx.cpp

**:boom: :dizzy: 2023-09-10 update: Wiener-EM is now implemented for maximum performance!**
C++17 implementation of [Open-Unmix](https://github.com/sigsep/open-unmix-pytorch) (UMX), a PyTorch neural network for music demixing. It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `umxhq` to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) to implement the inference of Open-Unmix.

C++17 implementation of [Open-Unmix](https://github.com/sigsep/open-unmix-pytorch) (UMX), a PyTorch neural network for music demixing.
There are three main differences in umx.cpp that deviate from the PyTorch model:
* **Quantized and compressed weights:** the best-performing [UMX-L](https://zenodo.org/record/5069601) weights are quantized (mostly uint8; uint16 for the final four layers) and saved in the [ggml](https://github.com/ggerganov/ggml) binary file format, then gzipped. This reduces the 425 MB of UMX-L weights down to 45 MB while achieving similar performance (verified empirically using BSS metrics).
* **Segmented inference:** we borrow the overlapping segmented inference from [Demucs](https://github.com/facebookresearch/demucs/blob/main/demucs/apply.py#L264-L297) (and in turn [demucs.cpp](https://github.com/sevagh/demucs.cpp/blob/21e76ca781c4411bef073ace06d8e84c3c5c9835/src/model_apply.cpp#L180-L263)), which processes the waveform in small chunks while avoiding discontinuities at the left and right boundaries when each chunk is recombined with its neighbors (see the overlap-add sketch after this list).
* **Streaming LSTM:** following the above, since we chunk the input waveform, we can adapt the LSTM so that its temporal sequence length is the chunk length, and each chunk is _streamed_ through the LSTM; we again verified empirically with BSS metrics that this yields a similar overall SDR score while reducing the memory and computation footprints.
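
A minimal numpy sketch of that overlap-add recombination (`apply_model` is a hypothetical stand-in for running the network on one chunk; the real logic follows the Demucs code linked above):
```python
import numpy as np

def segmented_apply(waveform, apply_model, segment_samples, overlap=0.25):
    """Overlap-add inference sketch: chunk a (channels, samples) waveform,
    run the model on each chunk, and blend neighboring chunks with a
    triangular weight so the seams stay continuous."""
    channels, total = waveform.shape
    hop = int(segment_samples * (1.0 - overlap))
    out = np.zeros_like(waveform, dtype=np.float32)
    weight_sum = np.zeros(total, dtype=np.float32)
    # triangular window: peaks mid-chunk, de-emphasizes chunk edges,
    # and never reaches zero so every sample gets some weight
    ramp = np.minimum(np.arange(1, segment_samples + 1),
                      np.arange(segment_samples, 0, -1)).astype(np.float32)
    for offset in range(0, total, hop):
        chunk = waveform[:, offset:offset + segment_samples]
        w = ramp[:chunk.shape[1]]
        out[:, offset:offset + chunk.shape[1]] += apply_model(chunk) * w
        weight_sum[offset:offset + chunk.shape[1]] += w
    return out / np.maximum(weight_sum, 1e-8)
```
With the 60-second segments used below at 44.1 kHz, `segment_samples` would be `60 * 44100 = 2646000`, matching the chunk shape `(2, 2646000)` in the console log further down.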

It uses [libnyquist](https://github.com/ddiakopoulos/libnyquist) to load audio files, the [ggml](https://github.com/ggerganov/ggml) file format to serialize the PyTorch weights of `umxhq` and `umxl` to a binary file format, and [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) (+ OpenMP) to implement the inference of Open-Unmix.
## Open-Unmix (UMX-L)

The float32 weights of UMX are quantized to uint16 during the conversion to the binary ggml format. The size on disk for umx.cpp's weights files are therefore ~50% of the original weights (216MB vs. 432MB for umxl, 68MB vs. 136MB for umxhq), with identical BSS results.
MUSDB18-HQ test track 'Zeno - Signs':

'Zeno - Signs', fully segmented (60s) inference + wiener + streaming lstm + uint8/16-quantized gzipped model file:
```
vocals ==> SDR: 6.836 SIR: 16.416 ISR: 14.015 SAR: 7.065
drums ==> SDR: 7.434 SIR: 14.580 ISR: 12.057 SAR: 8.906
bass ==> SDR: 2.445 SIR: 4.817 ISR: 5.349 SAR: 3.623
other ==> SDR: 6.234 SIR: 9.421 ISR: 12.515 SAR: 7.611
```

'Zeno - Signs', fully segmented (60s) inference + wiener + streaming lstm, no uint8 quantization:
```
vocals ==> SDR: 6.830 SIR: 16.421 ISR: 14.044 SAR: 7.104
drums ==> SDR: 7.425 SIR: 14.570 ISR: 12.062 SAR: 8.905
bass ==> SDR: 2.462 SIR: 4.859 ISR: 5.346 SAR: 3.566
other ==> SDR: 6.197 SIR: 9.437 ISR: 12.519 SAR: 7.627
```

'Zeno - Signs', unsegmented inference (crashes with large tracks) w/ streaming lstm + wiener:
```
vocals ==> SDR: 6.846 SIR: 16.382 ISR: 13.897 SAR: 7.024
drums ==> SDR: 7.679 SIR: 14.462 ISR: 12.606 SAR: 9.001
bass ==> SDR: 2.386 SIR: 4.504 ISR: 5.802 SAR: 3.731
other ==> SDR: 6.020 SIR: 9.854 ISR: 11.963 SAR: 7.472
```

Original release results on 'Zeno - Signs' (no streaming LSTM, no Wiener filtering):
```
vocals ==> SDR: 6.550 SIR: 14.583 ISR: 13.820 SAR: 6.974
drums ==> SDR: 6.538 SIR: 11.209 ISR: 11.163 SAR: 8.317
bass ==> SDR: 1.646 SIR: 0.931 ISR: 5.261 SAR: 2.944
other ==> SDR: 5.190 SIR: 6.623 ISR: 10.221 SAR: 8.599
```

* Streaming UMX LSTM module for longer tracks, combined with Demucs overlapping segment inference (see the PyTorch sketch below)

Testing 'Georgia Wonder - Siren' (largest MUSDB track) for memory usage with 60s segments:
```
vocals ==> SDR: 5.858 SIR: 10.880 ISR: 14.336 SAR: 6.187
drums ==> SDR: 7.654 SIR: 14.933 ISR: 11.459 SAR: 8.466
bass ==> SDR: 7.256 SIR: 12.007 ISR: 10.743 SAR: 6.757
other ==> SDR: 4.699 SIR: 7.452 ISR: 9.142 SAR: 4.298
```

vs. pytorch inference (w/ wiener):
```
vocals ==> SDR: 5.899 SIR: 10.766 ISR: 14.348 SAR: 6.187
drums ==> SDR: 7.939 SIR: 14.676 ISR: 12.485 SAR: 8.383
bass ==> SDR: 7.576 SIR: 12.712 ISR: 11.188 SAR: 6.951
other ==> SDR: 4.624 SIR: 7.937 ISR: 8.845 SAR: 4.270
```
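
The streaming-LSTM idea can be sketched in PyTorch for illustration (the actual inference is C++/Eigen; the layer sizes below match UMX's bidirectional 3-layer LSTM, and resetting the hidden state per chunk is an assumption of this sketch):
```python
import torch

# UMX-style BLSTM: input 1024, hidden 512 per direction -> 1024 out
lstm = torch.nn.LSTM(input_size=1024, hidden_size=512,
                     num_layers=3, bidirectional=True)

def stream_chunks(chunks):
    # each chunk is treated as its own (short) sequence, so memory and
    # compute scale with the chunk length instead of the full track length
    outputs = []
    for x in chunks:      # x: (seq_len, batch, 1024) spectrogram frames
        y, _ = lstm(x)    # hidden state is not carried across chunks
        outputs.append(y)
    return torch.cat(outputs, dim=0)
```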

## Performance

@@ -74,7 +127,7 @@ $ mamba activate umxcpp
$ python -m pip install -r ./scripts/requirements.txt
```

2. Dump Open-Unmix weights to ggml files (use argument `--model=umxl`, `--model=umxhq` to switch between the two best pretrained models):
2. Dump Open-Unmix weights to ggml files (use argument `--model=umxl`, `--model=umxhq` to switch between the two best pretrained models)\*:
```
$ python ./scripts/convert-pth-to-ggml.py --model=umxl ./ggml-umxl
...
@@ -85,20 +138,20 @@ Processing variable: bn3.bias with shape: (4098,)
Processing variable: bn3.running_mean with shape: (4098,)
Processing variable: bn3.running_var with shape: (4098,)
Skipping layer bn3.num_batches_tracked
Done. Output file: ggml-umxl/ggml-model-umxl-other-u16.bin
Done. Output file: ggml-models/ggml-model-umxl-u8.bin
```
\*: :warning: my script can no longer find `umxhq` files on Zenodo, so `umxl` is the new default

This will load the model using PyTorch Torchhub (which implicitly downloads the weights files to the hidden torchhub folder), locate the weights files, and dump them using the [ggml](http://ggml.ai/) file format:
This will load the model using PyTorch Torchhub (which implicitly downloads the weights files to the hidden torchhub folder), locate the weights files, and dump them using the [ggml](http://ggml.ai/) file format with mixed uint8 and uint16 quantization, which you can then gzip:
```
$ ls -latrh ggml-umxl/
total 216M
drwxrwxr-x 2 sevagh sevagh 4.0K Jun 28 10:14 .
drwxrwxr-x 13 sevagh sevagh 4.0K Jun 30 10:57 ..
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-vocals-u16.bin
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-drums-u16.bin
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-bass-u16.bin
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-other-u16.bin
# gzip in-place
$ gzip -k ./ggml-models/ggml-model-umxl-u8.bin
$ ls -latrh ggml-models/
total 177M
-rw-rw-r-- 1 sevagh sevagh 45M Dec 30 08:25 ggml-model-umxl-u8.bin.gz
drwxrwxr-x 13 sevagh sevagh 4.0K Dec 30 09:13 ..
drwxrwxr-x 2 sevagh sevagh 4.0K Dec 30 09:33 .
-rw-rw-r-- 1 sevagh sevagh 132M Dec 30 09:33 ggml-model-umxl-u8.bin
```
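
A quick sanity check of the gzipped file from Python (a sketch; the 4-byte magic and 4-byte hidden size match the header written by the conversion script shown further down):
```python
import gzip
import struct

with gzip.open("ggml-models/ggml-model-umxl-u8.bin.gz", "rb") as f:
    magic, = struct.unpack("i", f.read(4))
    hidden_size, = struct.unpack("i", f.read(4))

assert magic == 0x756d7867  # "umxg in hex", as packed by the script
print(hidden_size)          # expect 1024 for UMX-L
```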

3. Install C++ dependencies, e.g. CMake, gcc, C++/g++, Eigen, OpenMP for your OS - my instructions are for Pop!\_OS 22.04:
@@ -122,55 +175,50 @@ Usage: ./umx.cpp.main <model dir> <wav file> <out dir>
$ ./umx.cpp.main ./ggml-umxl ./test.wav ./demix-out-umxl
umx.cpp Main driver program
Number of physical cores: 32
Input Samples: 23222488
Length in seconds: 263.294
Input Samples: 20672662
Length in seconds: 234.384
Number of channels: 2
load_umx_model: loading model
Discovered model file "../ggml-umxl/ggml-model-umxl-other-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-drums-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-vocals-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-bass-u16.bin" in model dir../ggml-umxl/
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-bass-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-drums-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-other-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-vocals-u16.bin
Decompressing model_file... ../ggml-models/ggml-model-umxl-u8.bin.gz
Checking the magic of model_file ../ggml-models/ggml-model-umxl-u8.bin.gz
Loaded umx model with hidden size 1024
Loading weights from model_file ../ggml-umxl/ggml-model-umxl-bass-u16.bin into target 0
Loading weights from model_file ../ggml-models/ggml-model-umxl-u8.bin.gz
Loading target 0
Loading tensor input_mean with shape [1487, 1]
input_mean: [ 1487, 1], type = float, 0.01 MB
Loading tensor input_scale with shape [1487, 1]
input_scale: [ 1487, 1], type = float, 0.01 MB
Loading tensor output_scale with shape [2049, 1]
output_scale: [ 2049, 1], type = float, 0.01 MB
Loading tensor output_mean with shape [2049, 1]
output_mean: [ 2049, 1], type = float, 0.01 MB
Loading tensor fc1.weight with shape [2974, 1024]
fc1.weight: [ 2974, 1024], type = float, 11.62 MB
Loading tensor bn1.weight with shape [1024, 1]
bn1.weight: [ 1024, 1], type = float, 0.00 MB
Loading tensor bn1.bias with shape [1024, 1]
bn1.bias: [ 1024, 1], type = float, 0.00 MB
Loading tensor bn1.running_mean with shape [1024, 1]
bn1.running_mean: [ 1024, 1], type = float, 0.00 MB
Loading tensor bn1.running_var with shape [1024, 1]
input_mean: [ 1487, 1], type = float, 0.00 MB
Loading target 0
... <truncated>
Loaded model (172 tensors, 215.68 MB) in 0.609294 s
Loaded model (172 tensors, 131.93 MB) in 1.271085 s
umx_model_load returned true
Computing STFT
spec shape: (incl 2 chan) 11340 x 2049
Computing STFT magnitude
Computing STFT phase
Running inference with Eigen matrices
Writing wav file "./demix-out-umxl/target_0.wav" to ./demix-out-umxl
Per-segment progress: 0.166667
2., apply model w/ split, offset: 0, chunk shape: (2, 2646000)
Generating spectrograms
populate eigen matrixxf
Input scaling
Target 0 fc1
Target 0 bn1
Target 0 lstm
Target 0 fc2
Target 0 bn2
Target 0 fc3
Target 0 bn3
Target 0 output scaling
Multiply mix mag with computed mask
Multiply mix mag with computed mask
... <truncated>
Getting complex spec from wiener filtering
Wiener-EM: Getting first estimates from naive mix-phase
Wiener-EM: Scaling down by max_abs
Wiener-EM: Initialize tensors
... <truncated>
Getting waveforms from istft
Writing wav file Writing wav file Writing wav file "./umx-cpp-out/target_2.wav""./umx-cpp-out/target_3.wav" to ./umx-cpp-out
to ./umx-cpp-out
"./umx-cpp-out/target_1.wav" to ./umx-cpp-out
Writing wav file "./umx-cpp-out/target_0.wav" to ./umx-cpp-out
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_2.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_1.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_3.wav" to ./demix-out-umxl
Encoder Status: 0
```
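
The "naive mix-phase" first estimate that seeds Wiener-EM in the log above amounts to applying each target's real-valued magnitude mask to the complex mixture spectrogram while keeping the mixture phase; a sketch:
```python
import numpy as np

def mix_phase_estimate(mix_spec, mask):
    # scale the mixture magnitude by the real-valued soft mask and
    # reuse the mixture phase; equivalent to simply mask * mix_spec
    return mask * np.abs(mix_spec) * np.exp(1j * np.angle(mix_spec))
```
Wiener-EM then iteratively refines these per-target estimates, as the "Wiener-EM: ..." log lines show.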

3 changes: 3 additions & 0 deletions ggml-models/ggml-model-umxl-u8.bin.gz
Git LFS file not shown
scripts/convert-pth-to-ggml.py
@@ -10,17 +10,17 @@
from pathlib import Path


def quantize(array):
def quantize(array, qtype=np.uint8):
# Calculate min and max of the array
min_val = np.min(array)
max_val = np.max(array)

# Calculate scale and offset for quantization
scale = (max_val - min_val) / 65535.0
scale = (max_val - min_val) / (float)(np.iinfo(qtype).max-1)
offset = min_val

# Quantize array
quantized_array = np.round((array - offset) / scale).astype(np.uint16)
quantized_array = np.round((array - offset) / scale).astype(qtype)

# Return quantized array, scale and offset
return quantized_array, scale, offset
@@ -72,7 +72,7 @@ def dequantize(quantized_array, scale, offset):
if __name__ == '__main__':
# add argparse to pick between umxhq and umxl models
parser = argparse.ArgumentParser(description='Convert Open Unmix PyTorch models to GGML')
parser.add_argument('--model', type=str, choices=('umxhq', 'umxl'), help='(umxhq, umxl)', default='umxhq')
parser.add_argument('--model', type=str, choices=('umxhq', 'umxl'), help='umxhq, umxl (default)', default='umxl')
parser.add_argument("dest_dir", type=str, help="destination path for the converted model")

args = parser.parse_args()
@@ -91,12 +91,21 @@ def dequantize(quantized_array, scale, offset):
# get torchub path
torchhub_path = Path(torch.hub.get_dir()) / "checkpoints"

for target_name, target_model in model.items():
# let's write it all to one file
# copied from ggerganov/whisper.cpp convert-pt-to-ggml.py
dest_name = dir_out / f"ggml-model-{args.model}-u8.bin"

fout = dest_name.open("wb")
fout.write(struct.pack("i", 0x756d7867)) # magic: umxg in hex

# we want the order of bass, drums, other, vocals

#for i, (target_name, target_model) in enumerate(model.items()):
for i, target_name in enumerate(["bass", "drums", "other", "vocals"]):
target_model = model[target_name]
print(f"Converting target {target_name}")
print(target_model)

dest_name = dir_out / f"ggml-model-{args.model}-{target_name}-u16.bin"

fname_inp = torchhub_path / HUB_PATHS[args.model][target_name]

# try to load PyTorch binary data
@@ -112,12 +121,11 @@

#print(checkpoint.keys())
hidden_size = checkpoint['fc1.weight'].shape[0]
print(f"HIDDEN SIZE: {hidden_size}")

# copied from ggerganov/whisper.cpp convert-pt-to-ggml.py
fout = dest_name.open("wb")
fout.write(struct.pack("i", 0x756d7867)) # magic: umxg in hex
fout.write(struct.pack("i", hidden_size)) # hidden size
if i == 0:
# we only want to write this once
fout.write(struct.pack("i", hidden_size)) # hidden size
print(f"HIDDEN SIZE: {hidden_size}")

# write layers
for name in checkpoint.keys():
@@ -131,10 +139,15 @@

data = data.astype(np.float32)

# cast type to a uint16 to perform quantization
# cast type to a uint8 to perform quantization
# take into account the min/max values of the data for each tensor
# and use that for appropriate quantization
quantized_data, scale, offset = quantize(data)

if any([x in name for x in ["bn2", "bn3", "fc2", "fc3"]]):
# if bn3 or fc3 in name
quantized_data, scale, offset = quantize(data, qtype=np.uint16)
else:
quantized_data, scale, offset = quantize(data)

# header
str_ = name.encode('utf-8')
@@ -146,7 +159,7 @@ def dequantize(quantized_array, scale, offset):
# data
quantized_data.tofile(fout)

fout.close()
fout.close()

print("Done. Output file: " , dest_name)
print("")
print("Done. Output file: " , dest_name)
print("")
