Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some performance optimizations in decoders #369

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

Fe-r-oz
Copy link
Contributor

@Fe-r-oz Fe-r-oz commented Sep 26, 2024

This PR is about removing the performance detriments because of allocations introduced by vcat and hcat in the decoders. In addition, evaluate_guesses is also made allocation free. Attempt is made to improve the decode and batchdecode as well. Instead of unnecessary copying, @view has been used. I have also checked this there is not any JET errors introduced as well. Resource: Performance Tips guide for better understanding of the context.

@Fe-r-oz
Copy link
Contributor Author

Fe-r-oz commented Sep 26, 2024

There is some improvement for most of the evaluate_decoders setup after improving evaluate_guesses and removing vcat and hcat from decoders.

Update #2: The table below shows memory and time ration after improving these: removed vcat, removed hcat, evaluate_decoders, improve decode, improve batchdecode Table taken from performance tracking: https://github.com/QuantumSavory/QuantumClifford.jl/actions/runs/11054157322/job/30710270440?pr=369#step:6:1141

                                                      ID                 time ratio               memory ratio
  –––––––––––––––––––––––––––––––––––––––––––––––––––––– –––––––––––––––––––––––––– ––––––––––––––––––––––––––
            ["circuitsim", "pftrajectories", "q1001_r1"] 0.91 (5%) :whitecheckmark:                  1.00 (1%)
             ["circuitsim", "pftrajectories", "q101_r1"] 0.92 (5%) :whitecheckmark:                  1.00 (1%)
        ["clifford", "dense", "cnot250_on_diag500_stab"] 0.84 (5%) :whitecheckmark:                  1.00 (1%)
       ["clifford", "dense", "dense500_on_diag500_stab"] 0.84 (5%) :whitecheckmark:                  1.00 (1%)
             ["ecc", "evaluate_decoder", "shor_bp_comm"] 0.84 (5%) :whitecheckmark: 0.77 (1%) :whitecheckmark:
         ["ecc", "evaluate_decoder", "shor_bp_naivesyn"] 0.84 (5%) :whitecheckmark: 0.74 (1%) :whitecheckmark:
          ["ecc", "evaluate_decoder", "shor_bp_shorsyn"] 0.76 (5%) :whitecheckmark: 0.74 (1%) :whitecheckmark:
           ["ecc", "evaluate_decoder", "shor_pybp_comm"]              1.18 (5%) :x:              1.18 (1%) :x:
       ["ecc", "evaluate_decoder", "shor_pybp_naivesyn"]              1.14 (5%) :x:              1.17 (1%) :x:
        ["ecc", "evaluate_decoder", "shor_pybp_shorsyn"]              1.16 (5%) :x:              1.17 (1%) :x:
          ["ecc", "evaluate_decoder", "shor_table_comm"]                  0.96 (5%) 0.83 (1%) :whitecheckmark:
      ["ecc", "evaluate_decoder", "shor_table_naivesyn"] 0.94 (5%) :whitecheckmark: 0.84 (1%) :whitecheckmark:
       ["ecc", "evaluate_decoder", "shor_table_shorsyn"]                  0.95 (5%) 0.87 (1%) :whitecheckmark:
       ["ecc", "evaluate_decoder", "toric8_bp_naivesyn"]              1.09 (5%) :x: 0.98 (1%) :whitecheckmark:
        ["ecc", "evaluate_decoder", "toric8_bp_shorsyn"]              1.06 (5%) :x: 0.99 (1%) :whitecheckmark:
         ["ecc", "evaluate_decoder", "toric8_pybp_comm"]                  1.04 (5%)              1.08 (1%) :x:
     ["ecc", "evaluate_decoder", "toric8_pybp_naivesyn"]                  1.00 (5%)              1.07 (1%) :x:
      ["ecc", "evaluate_decoder", "toric8_pybp_shorsyn"]                  0.99 (5%)              1.06 (1%) :x:
      ["ecc", "evaluate_decoder", "toric8_pymatch_comm"]              1.07 (5%) :x:              1.75 (1%) :x:
  ["ecc", "evaluate_decoder", "toric8_pymatch_naivesyn"]                  1.00 (5%)              1.52 (1%) :x:
   ["ecc", "evaluate_decoder", "toric8_pymatch_shorsyn"]                  0.99 (5%)              1.38 (1%) :x:
        ["ecc", "evaluate_decoder", "toric8_table_comm"]              1.10 (5%) :x: 0.96 (1%) :whitecheckmark:
    ["ecc", "evaluate_decoder", "toric8_table_naivesyn"]                  0.99 (5%) 0.98 (1%) :whitecheckmark:
     ["ecc", "evaluate_decoder", "toric8_table_shorsyn"]                  0.99 (5%) 0.99 (1%) :whitecheckmark:
                                 ["pauli", "mul", "100"] 0.71 (5%) :whitecheckmark:                  1.00 (1%)
                   ["stabilizer", "canon", "md_rref500"] 0.87 (5%) :whitecheckmark:                  1.00 (1%)
                ["stabilizer", "tensor", "diag_pow5_20"]              1.09 (5%) :x:                  1.00 (1%)
                     ["stabilizer", "tensor", "pow5_20"]              1.10 (5%) :x:                  1.00 (1%)

@Fe-r-oz Fe-r-oz marked this pull request as ready for review September 26, 2024 14:43
@Fe-r-oz
Copy link
Contributor Author

Fe-r-oz commented Sep 26, 2024

I think this is ready for review. I hope that these improvements results for evaluate decoders (#369 (comment)) are plausible and sensible. Thank you!

Copy link
Member

@Krastanov Krastanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a good idea!

Currently I do not believe that most of these changes do much though.

I like the

    row_x = @view syndrome_samples[:,1:d.nx]
    row_z = @view syndrome_samples[:,d.nx+1:end]

and the improvement to evaluate_gueses, but I do not believe the other changes help and would prefer to have them reverted.

Could you do microbenchmarks to prove what works here?

src/ecc/decoder_pipeline.jl Outdated Show resolved Hide resolved
src/ecc/decoder_pipeline.jl Outdated Show resolved Hide resolved
src/ecc/decoder_pipeline.jl Outdated Show resolved Hide resolved
Comment on lines +176 to +189
for i in 1:nsamples
is_decoded = true
for j in 1:size(faults_matrix, 1)
sum_mod = 0
@inbounds @simd for k in 1:size(faults_matrix, 2)
sum_mod += faults_matrix[j, k] * guesses[i, k]
end
sum_mod %= 2
if sum_mod != measured_faults[i, j]
is_decoded = false
break
end
end
decoded += is_decoded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems promising, but is it actually faster on its own? It probably needs to be benchmarked separately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the performance improvements occurred due to evaluate guesses. I noticed that the in-place approach to vcat and hcat slowed things down, increased memory/time ratio, that was why the improvements was not that visible. I have reverted back as per your comments. Now, it looks better: #369 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the evaluate_guess as a separate commit so before and after performance are present in the log:
Before evaluate guesses: Results, After evaluate guesses: Results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#Micro benchmarks: evaluate guesses 

julia> using BenchmarkTools
julia> n_faults = 500;
julia> n_guesses = 100;
julia> measured_faults = rand(0:1, n_guesses, n_faults);
julia> guesses = rand(0:1, n_guesses, n_faults);
julia> faults_matrix = rand(0:1, n_faults, n_faults);
julia> function evaluate_guesses_view(measured_faults, guesses, faults_matrix)
           nsamples = size(guesses, 1)
           guess_faults = (faults_matrix * guesses') .% 2
           decoded = 0
           for i in 1:nsamples
               (@view guess_faults[:,i]) == (@view measured_faults[i,:]) && (decoded += 1)
           end
           return (nsamples - decoded) / nsamples
       end
julia> function evaluate_guesses_loop(measured_faults, guesses, faults_matrix)
           nsamples = size(guesses, 1)
           decoded = 0
           for i in 1:nsamples
               is_decoded = true
               for j in 1:size(faults_matrix, 1)
                   sum_mod = 0
                   @inbounds @simd for k in 1:size(faults_matrix, 2)
                       sum_mod += faults_matrix[j, k] * guesses[i, k]
                   end
                   sum_mod %= 2
                   if sum_mod != measured_faults[i, j]
                       is_decoded = false
                       break
                   end
               end
               decoded += is_decoded
           end
           return (nsamples - decoded) / nsamples
       end
julia> @benchmark evaluate_guesses_view($measured_faults, $guesses, $faults_matrix)
BenchmarkTools.Trial: 423 samples with 1 evaluation.
 Range (min … max):   9.462 ms … 100.313 ms  ┊ GC (min … max): 0.00% … 87.81%
 Time  (median):     11.613 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.808 ms ±   4.717 ms  ┊ GC (mean ± σ):  2.29% ±  5.55%

  ▂▂ ▅█▇▄                      ▂                                
  ██▅████▅▄▂▃▄▃▄▄▃▄▄▅▇▆▇▃▇▇▄█▇▇█▆▅█▆▇▆▇▆▅▇▄▃▂▅▂▃▃▃▂▂▃▂▃▁▂▁▁▁▂▂ ▄
  9.46 ms         Histogram: frequency by time         15.5 ms <

 Memory estimate: 812.11 KiB, allocs estimate: 8.

julia> @benchmark evaluate_guesses_loop($measured_faults, $guesses, $faults_matrix)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  119.402 μs …  1.440 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     127.033 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.949 μs ± 49.806 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▇▃▄▃▂▁▁  ▁▁▁▂▁     ▁                                      ▁
  █▆▄███████████████████████████▇▇██▇▇▇▆▆▇▇▆▆▅▅▅▄▆▇▆▄▆▄▄▄▄▅▃▃▅ █
  119 μs        Histogram: log(frequency) by time       246 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.


# Micro benchmarks attached, with and without view:

julia> syndrome_samples = rand(1000, 1000) ;
julia> d = Dict(:nx => 500);
julia> function with_view(syndrome_samples, d)
           row_x = @view syndrome_samples[:, 1:d[:nx]]
           row_z = @view syndrome_samples[:, d[:nx]+1:end]
           return row_x, row_z
       end
julia> function without_view(syndrome_samples, d)
           row_x = syndrome_samples[:, 1:d[:nx]]
           row_z = syndrome_samples[:, d[:nx]+1:end]
           return row_x, row_z
       end

julia> @benchmark with_view($syndrome_samples, $d)

       # Benchmarking without @view
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  24.653 ns … 225.071 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.912 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.309 ns ±   7.016 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▇▄▄▄                           ▁                           ▂
  █▁█████▄▄▃▃▄▅▄▃▃▁▃▁▃▄▇▇▇▆▄▇▆▆█▇▇▇██▇█▆▆▅▅▅▄▄▆▆▇▆▅▅▆▇▇▄▄▆▅▆▆▇ █
  24.7 ns       Histogram: log(frequency) by time        60 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark without_view($syndrome_samples, $d)
BenchmarkTools.Trial: 3337 samples with 1 evaluation.
 Range (min … max):  995.098 μs …   7.976 ms  ┊ GC (min … max): 0.00% … 85.61%
 Time  (median):       1.064 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.491 ms ± 795.193 μs  ┊ GC (mean ± σ):  6.96% ± 13.21%

  ██▆▄▂                     ▁▃▃▂                ▂▃▂▁    ▁▂▃▂    ▁
  ██████▆▆▅▆▃▅▃▃▁▁▅▁▁▃▄▄▃▃▁▆█████▇▄▄▄▃▅▃▁▁▁▁▁▁▁▇█████▆▅▆█████▇▆ █
  995 μs        Histogram: log(frequency) by time       3.33 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

@Fe-r-oz
Copy link
Contributor Author

Fe-r-oz commented Sep 26, 2024

Thanks for your comments. As requested, MicroBenchmarks are attached here: #369 (comment).

I have removed all the unnecessary changes except the @views and evaluate guesses. The performance tracking improvements seems to have improved. Now, all of theevaluate decoderssetups show improvements.

Table taken from: https://github.com/QuantumSavory/QuantumClifford.jl/actions/runs/11060472698/job/30731074834?pr=369#step:6:1140

                                                      ID                 time ratio               memory ratio
  –––––––––––––––––––––––––––––––––––––––––––––––––––––– –––––––––––––––––––––––––– ––––––––––––––––––––––––––
          ["circuitsim", "pftrajectories", "q1001_r100"] 0.94 (5%) :whitecheckmark:                  1.00 (1%)
        ["clifford", "dense", "cnot250_on_diag500_stab"] 0.84 (5%) :whitecheckmark:                  1.00 (1%)
       ["clifford", "dense", "dense500_on_diag500_stab"] 0.84 (5%) :whitecheckmark:                  1.00 (1%)
             ["ecc", "evaluate_decoder", "shor_bp_comm"] 0.85 (5%) :whitecheckmark: 0.77 (1%) :whitecheckmark:
         ["ecc", "evaluate_decoder", "shor_bp_naivesyn"] 0.84 (5%) :whitecheckmark: 0.75 (1%) :whitecheckmark:
          ["ecc", "evaluate_decoder", "shor_bp_shorsyn"] 0.77 (5%) :whitecheckmark: 0.74 (1%) :whitecheckmark:
           ["ecc", "evaluate_decoder", "shor_pybp_comm"] 0.94 (5%) :whitecheckmark: 0.93 (1%) :whitecheckmark:
       ["ecc", "evaluate_decoder", "shor_pybp_naivesyn"] 0.93 (5%) :whitecheckmark: 0.92 (1%) :whitecheckmark:
        ["ecc", "evaluate_decoder", "shor_pybp_shorsyn"] 0.94 (5%) :whitecheckmark: 0.92 (1%) :whitecheckmark:
          ["ecc", "evaluate_decoder", "shor_table_comm"] 0.95 (5%) :whitecheckmark: 0.83 (1%) :whitecheckmark:
      ["ecc", "evaluate_decoder", "shor_table_naivesyn"] 0.94 (5%) :whitecheckmark: 0.84 (1%) :whitecheckmark:
       ["ecc", "evaluate_decoder", "shor_table_shorsyn"]                  0.95 (5%) 0.87 (1%) :whitecheckmark:
       ["ecc", "evaluate_decoder", "toric8_bp_naivesyn"]                  1.02 (5%) 0.98 (1%) :whitecheckmark:
        ["ecc", "evaluate_decoder", "toric8_bp_shorsyn"]              1.07 (5%) :x: 0.98 (1%) :whitecheckmark:
         ["ecc", "evaluate_decoder", "toric8_pybp_comm"]                  1.01 (5%) 0.95 (1%) :whitecheckmark:
     ["ecc", "evaluate_decoder", "toric8_pybp_naivesyn"]                  0.99 (5%) 0.95 (1%) :whitecheckmark:
      ["ecc", "evaluate_decoder", "toric8_pybp_shorsyn"]                  1.01 (5%) 0.95 (1%) :whitecheckmark:
      ["ecc", "evaluate_decoder", "toric8_pymatch_comm"]              1.06 (5%) :x: 0.91 (1%) :whitecheckmark:
  ["ecc", "evaluate_decoder", "toric8_pymatch_naivesyn"]                  1.00 (5%) 0.94 (1%) :whitecheckmark:
   ["ecc", "evaluate_decoder", "toric8_pymatch_shorsyn"]                  0.99 (5%) 0.96 (1%) :whitecheckmark:
        ["ecc", "evaluate_decoder", "toric8_table_comm"]              1.09 (5%) :x: 0.96 (1%) :whitecheckmark:
    ["ecc", "evaluate_decoder", "toric8_table_naivesyn"]                  1.00 (5%) 0.98 (1%) :whitecheckmark:
     ["ecc", "evaluate_decoder", "toric8_table_shorsyn"]                  1.00 (5%) 0.99 (1%) :whitecheckmark:
                            ["pauli", "mul", "20000000"] 0.93 (5%) :whitecheckmark:                  1.00 (1%)
                   ["stabilizer", "canon", "md_cano500"] 0.92 (5%) :whitecheckmark:                  1.00 (1%)
                     ["stabilizer", "tensor", "pow5_20"]              1.06 (5%) :x:                  1.00 (1%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants