rebalance command

Jamie Alquiza edited this page May 13, 2021 · 22 revisions

Rebalance

rebalance is used for:

  • targeted broker storage rebalancing*
  • incremental scaling

*In contrast to storage rebalancing in rebuild (which requires that 100% of partitions for a targeted topic are relocated), rebalance performs partial partition rebalancing, moving partitions from the most storage-utilized brokers to the least storage-utilized brokers.

Rebalance works by examining the free storage on all referenced brokers and selecting those whose free storage is more than 20% below the harmonic mean (the threshold is configurable via the --storage-threshold parameter). For each broker targeted for partition offloading, partitions are planned for relocation to the least-utilized, most-suitable destination.
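The selection step can be sketched roughly as follows. This is an illustrative Python sketch, not topicmappr's implementation: the function name, the broker IDs and the free storage figures are hypothetical, and the handling of brokers sitting exactly on the cutoff is an assumption.

```python
from statistics import harmonic_mean

def offload_targets(free_gb, threshold=0.20):
    """Return broker IDs whose free storage is more than `threshold`
    (a fraction, e.g. 0.20 for 20%) below the harmonic mean."""
    hmean = harmonic_mean(list(free_gb.values()))
    cutoff = hmean * (1 - threshold)
    # Assumption: a broker exactly at the cutoff is not targeted.
    return sorted(b for b, free in free_gb.items() if free < cutoff)

# Hypothetical free storage per broker ID, in GB.
free_gb = {1001: 1000.0, 1002: 1000.0, 1003: 500.0, 1004: 2000.0}
print(offload_targets(free_gb))  # [1003]
```

Here the harmonic mean is roughly 888.9GB, so the 20% cutoff is roughly 711.1GB and only broker 1003 falls below it.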

Destination broker suitability is determined as either:

  • (locality scoped) the least utilized broker with the same rack.id as the offload target
  • (non locality scoped) the least utilized broker that wouldn't result in duplicate rack.id values in the resulting replica list
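The two suitability rules above might look like this in code. It is a minimal sketch under stated assumptions: "least utilized" is read as "most free storage", and the function name, data layout and example values are hypothetical rather than taken from topicmappr.

```python
def pick_destination(source, replicas, free_gb, rack, locality_scoped):
    """Pick a destination broker for the replica currently held by `source`.

    replicas: current replica list for the partition (includes `source`)
    free_gb:  {broker ID: free storage in GB}
    rack:     {broker ID: rack.id}
    Returns a broker ID, or None if no broker is suitable.
    """
    others = [b for b in replicas if b != source]
    other_racks = {rack[b] for b in others}
    candidates = []
    for b in free_gb:
        if b == source or b in others:
            continue  # already holds a replica of this partition
        if locality_scoped:
            suitable = rack[b] == rack[source]   # stay within the same rack.id
        else:
            suitable = rack[b] not in other_racks  # no duplicate rack.id values
        if suitable:
            candidates.append(b)
    if not candidates:
        return None
    # Assumption: least utilized means most free storage.
    return max(candidates, key=lambda b: free_gb[b])

# Hypothetical cluster: four brokers across three racks.
free_gb = {1: 100.0, 2: 500.0, 3: 900.0, 4: 700.0}
rack = {1: "a", 2: "a", 3: "b", 4: "c"}
print(pick_destination(1, [1, 3], free_gb, rack, locality_scoped=True))   # 2
print(pick_destination(1, [1, 3], free_gb, rack, locality_scoped=False))  # 4
```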

A partition relocation plan is then computed. The relocation planner runs in a fair-share, first-fit descending fashion: it iterates over each broker targeted for partition offloading and plans a relocation for the largest partition that won't exceed the upper/lower storage bounds. The storage bounds are determined by the --tolerance parameter (the default being automatic, optimal selection). Each broker is allowed to schedule at most one partition relocation before the scheduler moves on to the next broker. The offload broker list is iterated over repeatedly until no more relocations can be scheduled. The relocation plan is then translated to a partition map and stored for the user to apply (as a kafka-reassign-partitions compatible file).
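The planning loop described above can be approximated with the sketch below. It is a simplification for illustration only: rack constraints are ignored, any other broker may act as a destination, and all names and figures are hypothetical.

```python
def plan_relocations(offload, partitions, free_gb, mean_gb, tolerance):
    """Plan relocations in repeated passes over the offload brokers.

    offload:    broker IDs targeted for offloading, in iteration order
    partitions: {broker ID: [(partition name, size in GB), ...]}
    free_gb:    {broker ID: free storage in GB}
    """
    upper = mean_gb * (1 + tolerance)  # sources may not end up above this
    lower = mean_gb * (1 - tolerance)  # destinations may not end up below this
    free = dict(free_gb)
    # Largest partitions first (the "descending" part).
    remaining = {
        b: sorted(partitions[b], key=lambda p: p[1], reverse=True) for b in offload
    }
    plan = []
    scheduled = True
    while scheduled:
        scheduled = False
        for src in offload:
            for name, size in remaining[src]:
                if free[src] + size > upper:
                    continue  # would free too much storage on the source
                dests = [b for b in free if b != src and free[b] - size >= lower]
                if not dests:
                    continue  # no destination can absorb this partition
                dst = max(dests, key=free.get)  # most free storage
                free[src] += size
                free[dst] -= size
                remaining[src].remove((name, size))
                plan.append((src, name, dst))
                scheduled = True
                break  # fair share: one relocation per broker per pass
    return plan

# Hypothetical input: brokers 1 and 2 are offload targets, mean free is 1000GB.
free_gb = {1: 700, 2: 800, 3: 1500}
partitions = {
    1: [("p0", 300), ("p1", 150), ("p2", 500)],
    2: [("p3", 200), ("p4", 400)],
}
print(plan_relocations([1, 2], partitions, free_gb, mean_gb=1000, tolerance=0.1))
# [(1, 'p0', 3), (2, 'p3', 3)]
```

In this run the largest partitions (p2, p4) are skipped because moving them would push the source above the upper bound, so the next largest that fits is chosen instead.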

Usage Notes

  • Rebalance takes an input topic list (similar to rebuild: comma delimited with regex support) and a broker list. Typically the broker list would include all brokers that the target topic(s) currently occupy. Removing brokers is not allowed in rebalance; only adding additional, new brokers is permitted. All 'mapped' brokers (that is, brokers that hold at least one partition for any topic referenced in the --topics input) can be automatically referenced with -1 as an input to --brokers. -1 automatically expands to the mapped broker IDs.

  • Rebalance uses the same broker/topic metrics mechanism as rebuild (both of which can be supplemented with metricsfetcher).

  • As an alternative to --storage-threshold, brokers with less than a given amount of free storage (in gigabytes) can be targeted for offload using the --storage-threshold-gb flag.

  • Relocations can be scoped by rack.id via the --locality-scoped flag. For instance, if rack.id values reflected physical data centers, performing a rebalance with a locality scope would rebalance partitions among the brokers within each data center in isolation.

  • The --tolerance flag specifies the upper and lower storage bounds: limits on how much data can be moved from offload targets and to destination targets, expressed as a distance (in percent) from the arithmetic mean of free storage. For example, with a tolerance of 10% and a broker mean free storage of 800GB, a partition cannot be moved when:

    • the source free storage would exceed 880GB (mean+10%)
    • the destination free storage would drop below 720GB (mean-10%)
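The arithmetic behind these bounds can be written out as a short sketch. The function names are hypothetical, and whether the comparisons are inclusive at the exact boundary is an assumption.

```python
def storage_bounds(mean_free_gb, tolerance):
    """Return (upper, lower) free storage bounds around the mean."""
    upper = mean_free_gb * (1 + tolerance)  # limit for sources
    lower = mean_free_gb * (1 - tolerance)  # limit for destinations
    return upper, lower

def move_allowed(src_free, dst_free, size, mean_free_gb, tolerance):
    """True if moving `size` GB keeps both brokers within the bounds."""
    upper, lower = storage_bounds(mean_free_gb, tolerance)
    return src_free + size <= upper and dst_free - size >= lower

upper, lower = storage_bounds(800, 0.10)
print(round(upper, 2), round(lower, 2))         # 880.0 720.0
print(move_allowed(700, 1000, 150, 800, 0.10))  # True: source ends at 850GB
print(move_allowed(700, 1000, 200, 800, 0.10))  # False: source would reach 900GB
```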

Specifying a value of 0 (the default) results in topicmappr automatically choosing an optimal tolerance value. It does this by computing a map for every tolerance value between 1 and 100 in parallel, then choosing the result with the lowest resulting storage utilization {range spread, std. deviation}.
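A rough sketch of this search is shown below. Several details are assumptions made for illustration: the candidate values are treated as percentages (0.01 to 1.00), results are ranked by range spread first and standard deviation second, a population standard deviation is used, and `plan_for` is a hypothetical callable standing in for the real planner.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import pstdev

def score(free_after):
    """Score a result as (range spread in percent, std. deviation)."""
    values = list(free_after.values())
    spread = (max(values) - min(values)) / min(values) * 100
    return (spread, pstdev(values))

def best_tolerance(plan_for, candidates=range(1, 101)):
    """Evaluate every candidate tolerance in parallel and keep the best.

    plan_for: hypothetical callable taking a tolerance (fraction) and
              returning {broker ID: estimated free storage in GB}.
    """
    tolerances = [c / 100 for c in candidates]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(plan_for, tolerances))
    scored = [(score(r), t) for t, r in zip(tolerances, results)]
    return min(scored)[1]

# Toy planner for demonstration: balance is best at a 7% tolerance.
def toy_planner(tolerance):
    return {1: 1000 + abs(tolerance - 0.07) * 1000, 2: 1000}

print(best_tolerance(toy_planner))  # 0.07
```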

Rebalancing Example

Fetching up-to-date metrics data with metricsfetcher:

$ metricsfetcher --broker-storage-query "avg:system.disk.free{cluster:kafka-test,device:/data}" --partition-size-query "max:kafka.log.partition.size{cluster:kafka-test} by {topic,partition}"
Submitting max:kafka.log.partition.size{cluster:kafka-test} by {topic,partition}.rollup(avg, 3600)
success
Submitting avg:system.disk.free{cluster:kafka-test,device:/data} by {broker_id}.rollup(avg, 3600)
success

Data written to ZooKeeper

Running rebalance for "test-topic" and providing all of the brokers that "test-topic" partitions reside on (the --brokers flag is given an explicit broker list here but, as described in the usage notes, --brokers -1 would yield the same results):

$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202,1203,1205,1208,1209,1211,1212,1213,1214,1215,1216,1217,1220,1223,1224,1225,1234,1235,1236,1247,1254,1255,1256,1267,1376 --storage-threshold 0.05 --tolerance 0.2 | grep -v no-op

Topics:
  test-topic

Validating broker list:
  OK

Rebalance parameters:
  Free storage mean, harmonic mean: 2299.03GB, 2199.97GB
  Broker free storage limits (with a 20.00% tolerance from mean):
    Sources limited to <= 2758.83GB
    Destinations limited to >= 1839.22GB

Brokers targeted for partition offloading (>= 5.00% threshold below hmean):
  1203
  1209
  1211
  1212
  1214
  1217
  1224
  1225
  1247
  1255
  1256
  1376

Broker 1203 relocations planned:
  [800.20GB] test-topic p117 -> 1200

Broker 1209 relocations planned:
  [827.74GB] test-topic p119 -> 1235

Broker 1211 relocations planned:
  [602.12GB] test-topic p125 -> 1236

Broker 1212 relocations planned:
  [825.81GB] test-topic p22 -> 1208

Broker 1214 relocations planned:
  [678.96GB] test-topic p59 -> 1213
  [510.32GB] test-topic p37 -> 1213

Broker 1217 relocations planned:
  [none]

Broker 1224 relocations planned:
  [692.60GB] test-topic p118 -> 1220

Broker 1225 relocations planned:
  [255.21GB] test-topic p75 -> 1216

Broker 1247 relocations planned:
  [none]

Broker 1255 relocations planned:
  [660.11GB] test-topic p20 -> 1235

Broker 1256 relocations planned:
  [none]

Broker 1376 relocations planned:
  [none]

Partition map changes:
  test-topic p20: [1255 1203] -> [1235 1203] replaced broker
  test-topic p22: [1211 1212] -> [1211 1208] replaced broker
  test-topic p37: [1217 1214] -> [1217 1213] replaced broker
  test-topic p59: [1236 1214] -> [1236 1213] replaced broker
  test-topic p75: [1225 1209] -> [1216 1209] replaced broker
  test-topic p117: [1203 1247] -> [1200 1247] replaced broker
  test-topic p118: [1247 1224] -> [1247 1220] replaced broker
  test-topic p119: [1225 1209] -> [1225 1235] replaced broker
  test-topic p125: [1212 1211] -> [1212 1236] replaced broker

Broker distribution:
  degree [min/max/avg]: 2/7/4.30 -> 2/7/4.81
  -
  Broker 1200 - leader: 5, follower: 3, total: 8
  Broker 1201 - leader: 4, follower: 4, total: 8
  Broker 1202 - leader: 5, follower: 5, total: 10
  Broker 1203 - leader: 4, follower: 5, total: 9
  Broker 1205 - leader: 5, follower: 5, total: 10
  Broker 1208 - leader: 4, follower: 5, total: 9
  Broker 1209 - leader: 5, follower: 4, total: 9
  Broker 1211 - leader: 5, follower: 4, total: 9
  Broker 1212 - leader: 5, follower: 4, total: 9
  Broker 1213 - leader: 4, follower: 6, total: 10
  Broker 1214 - leader: 5, follower: 3, total: 8
  Broker 1215 - leader: 5, follower: 5, total: 10
  Broker 1216 - leader: 6, follower: 5, total: 11
  Broker 1217 - leader: 5, follower: 5, total: 10
  Broker 1220 - leader: 5, follower: 5, total: 10
  Broker 1223 - leader: 5, follower: 5, total: 10
  Broker 1224 - leader: 5, follower: 4, total: 9
  Broker 1225 - leader: 4, follower: 5, total: 9
  Broker 1234 - leader: 5, follower: 5, total: 10
  Broker 1235 - leader: 4, follower: 6, total: 10
  Broker 1236 - leader: 4, follower: 6, total: 10
  Broker 1247 - leader: 5, follower: 5, total: 10
  Broker 1254 - leader: 5, follower: 5, total: 10
  Broker 1255 - leader: 4, follower: 5, total: 9
  Broker 1256 - leader: 5, follower: 5, total: 10
  Broker 1267 - leader: 5, follower: 4, total: 9
  Broker 1376 - leader: 5, follower: 5, total: 10

Storage free change estimations:
  range: 2031.15GB -> 971.02GB
  range spread: 130.47% -> 53.45%
  std. deviation: 521.41GB -> 305.21GB
  -
  Broker 1200: 3587.97 -> 2787.77 (-800.20GB, -22.30%)
  Broker 1201: 2708.39 -> 2708.39 (+0.00GB, 0.00%)
  Broker 1202: 2209.01 -> 2209.01 (+0.00GB, 0.00%)
  Broker 1203: 1865.20 -> 2665.40 (+800.20GB, 42.90%)
  Broker 1205: 2120.30 -> 2120.30 (+0.00GB, 0.00%)
  Broker 1208: 3224.55 -> 2398.75 (-825.81GB, -25.61%)
  Broker 1209: 1912.19 -> 2739.93 (+827.74GB, 43.29%)
  Broker 1211: 1873.23 -> 2475.35 (+602.12GB, 32.14%)
  Broker 1212: 1916.88 -> 2742.69 (+825.81GB, 43.08%)
  Broker 1213: 3165.90 -> 1976.62 (-1189.28GB, -37.57%)
  Broker 1214: 1556.82 -> 2746.10 (+1189.28GB, 76.39%)
  Broker 1215: 2091.04 -> 2091.04 (+0.00GB, 0.00%)
  Broker 1216: 2150.41 -> 1895.21 (-255.21GB, -11.87%)
  Broker 1217: 1816.75 -> 1816.75 (+0.00GB, 0.00%)
  Broker 1220: 2877.80 -> 2185.20 (-692.60GB, -24.07%)
  Broker 1223: 2347.95 -> 2347.95 (+0.00GB, 0.00%)
  Broker 1224: 1977.97 -> 2670.58 (+692.60GB, 35.02%)
  Broker 1225: 1960.09 -> 2215.30 (+255.21GB, 13.02%)
  Broker 1234: 2109.06 -> 2109.06 (+0.00GB, 0.00%)
  Broker 1235: 3369.32 -> 1881.47 (-1487.85GB, -44.16%)
  Broker 1236: 2656.35 -> 2054.22 (-602.12GB, -22.67%)
  Broker 1247: 1956.20 -> 1956.20 (+0.00GB, 0.00%)
  Broker 1254: 2416.52 -> 2416.52 (+0.00GB, 0.00%)
  Broker 1255: 1850.83 -> 2510.94 (+660.11GB, 35.67%)
  Broker 1256: 1986.07 -> 1986.07 (+0.00GB, 0.00%)
  Broker 1267: 2301.33 -> 2301.33 (+0.00GB, 0.00%)
  Broker 1376: 2065.64 -> 2065.64 (+0.00GB, 0.00%)

New partition maps:
  test-topic.json
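For orientation, a reassignment file accepted by kafka-reassign-partitions has the general shape built by the sketch below, shown here with three of the partition changes listed above. This illustrates the standard Kafka reassignment JSON layout only; the exact contents and field ordering of the file topicmappr writes are not reproduced here, and the helper name is hypothetical.

```python
import json

def reassignment_json(topic, replica_lists):
    """Build a Kafka partition reassignment document.

    replica_lists: {partition number: [broker IDs in replica order]}
    """
    partitions = [
        {"topic": topic, "partition": p, "replicas": replicas}
        for p, replicas in sorted(replica_lists.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions}, indent=2)

# New replica lists for p20, p22 and p37 from the "Partition map changes" output.
changes = {20: [1235, 1203], 22: [1211, 1208], 37: [1217, 1213]}
doc = reassignment_json("test-topic", changes)
print(doc)
```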

Results after applying test-topic.json (red bars indicate start, finish events from autothrottle):

Scaling Example

NOTE: this has been deprecated in favor of the scale subcommand.

The rebalance command can effectively be used for scaling a topic incrementally (introducing new brokers in addition to existing brokers). This is done by providing the existing brokers list hosting a topic along with additional brokers.

The default --storage-threshold of 0.2 is best suited for targeting moderate to extreme outlier brokers in a normal rebalance scenario. In a scaling scenario, it is usually desirable to draw partitions from most or all of the original brokers and relocate them to the newly provided brokers.

There are several ways to do this:

  • setting --storage-threshold to 0 to automatically target all original brokers (preferred)
  • setting an explicit --storage-threshold-gb value
  • lowering the --storage-threshold value

If a scale up is intended that will target all original brokers, it's highly recommended to add an equal number of brokers per rack.id used. Otherwise, brokers will not be able to schedule relocations unless --locality-scoped is set to false.

Lastly, it's likely that a non-default --tolerance value will be optimal. In testing, scaling an existing broker pool that was mostly in balance showed optimal partition placement with a tolerance value of 0.02.

Example running a scale up where the broker list includes the original 18 brokers a topic was mapped to with an additional 6 new brokers:

$ topicmappr rebalance --topics test-topic --brokers 1652,1653,1654,1655,1656,1657,1658,1659,1660,1661,1662,1663,1664,1665,1666,1667,1668,1669,1670,1671,1672,1673,1674,1675 --storage-threshold 0 --tolerance 0.02 | grep -v no-op

Topics:
  test-topic

Validating broker list:
  New broker 1670
  New broker 1675
  New broker 1671
  New broker 1673
  New broker 1674
  New broker 1672
  -
  6 additional brokers added
  -
  OK

Rebalance parameters:
  Free storage mean, harmonic mean: 2319.90GB, 2170.92GB
  Broker free storage limits (with a 2.00% tolerance from mean):
    Sources limited to <= 2366.29GB
    Destinations limited to >= 2273.50GB

Brokers targeted for partition offloading (>= 0.00% threshold below hmean):
  1652
  1653
  1654
  1655
  1656
  1657
  1658
  1659
  1660
  1661
  1662
  1663
  1664
  1665
  1666
  1667
  1668
  1669

Broker 1660 relocations planned:
  [191.10GB] test-topic p17 -> 1671
  [181.75GB] test-topic p67 -> 1674
  [176.98GB] test-topic p13 -> 1671

Broker 1659 relocations planned:
  [181.50GB] test-topic p69 -> 1674
  [168.80GB] test-topic p10 -> 1671
  [155.98GB] test-topic p15 -> 1674

Broker 1661 relocations planned:
  [202.70GB] test-topic p8 -> 1674
  [184.24GB] test-topic p70 -> 1674

Broker 1653 relocations planned:
  [162.43GB] test-topic p7 -> 1673
  [158.50GB] test-topic p65 -> 1675
  [116.04GB] test-topic p39 -> 1675

Broker 1667 relocations planned:
  [181.61GB] test-topic p16 -> 1672
  [172.74GB] test-topic p11 -> 1670
  [98.39GB] test-topic p118 -> 1670

Broker 1664 relocations planned:
  [216.87GB] test-topic p18 -> 1671
  [151.79GB] test-topic p19 -> 1671

Broker 1658 relocations planned:
  [184.24GB] test-topic p70 -> 1675
  [181.75GB] test-topic p67 -> 1673

Broker 1657 relocations planned:
  [216.87GB] test-topic p18 -> 1673
  [202.79GB] test-topic p68 -> 1675

Broker 1654 relocations planned:
  [181.50GB] test-topic p69 -> 1675
  [178.05GB] test-topic p6 -> 1673
  [57.76GB] test-topic p96 -> 1673

Broker 1668 relocations planned:
  [191.10GB] test-topic p17 -> 1670
  [178.05GB] test-topic p6 -> 1672

Broker 1662 relocations planned:
  [149.87GB] test-topic p14 -> 1674
  [142.93GB] test-topic p56 -> 1674

Broker 1666 relocations planned:
  [202.79GB] test-topic p68 -> 1672
  [154.61GB] test-topic p73 -> 1670
  [45.10GB] test-topic p45 -> 1672

Broker 1669 relocations planned:
  [190.34GB] test-topic p12 -> 1670
  [168.56GB] test-topic p66 -> 1672

Broker 1655 relocations planned:
  [202.70GB] test-topic p8 -> 1675
  [168.56GB] test-topic p66 -> 1673

Broker 1663 relocations planned:
  [155.98GB] test-topic p15 -> 1670
  [142.54GB] test-topic p5 -> 1670
  [57.76GB] test-topic p96 -> 1672

Broker 1656 relocations planned:
  [190.34GB] test-topic p12 -> 1673
  [157.66GB] test-topic p9 -> 1675

Broker 1665 relocations planned:
  [157.66GB] test-topic p9 -> 1672
  [149.70GB] test-topic p57 -> 1672

Broker 1652 relocations planned:
  [172.74GB] test-topic p11 -> 1671
  [111.72GB] test-topic p59 -> 1671

Partition map changes:
  test-topic p5: [1663 1655] -> [1670 1655] replaced broker
  test-topic p6: [1654 1668] -> [1673 1672] replaced broker
  test-topic p7: [1653 1669] -> [1673 1669] replaced broker
  test-topic p8: [1655 1661] -> [1675 1674] replaced broker
  test-topic p9: [1656 1665] -> [1675 1672] replaced broker
  test-topic p10: [1667 1659] -> [1667 1671] replaced broker
  test-topic p11: [1652 1667] -> [1671 1670] replaced broker
  test-topic p12: [1669 1656] -> [1670 1673] replaced broker
  test-topic p13: [1660 1654] -> [1671 1654] replaced broker
  test-topic p14: [1657 1662] -> [1657 1674] replaced broker
  test-topic p15: [1659 1663] -> [1674 1670] replaced broker
  test-topic p16: [1661 1667] -> [1661 1672] replaced broker
  test-topic p17: [1668 1660] -> [1670 1671] replaced broker
  test-topic p18: [1664 1657] -> [1671 1673] replaced broker
  test-topic p19: [1666 1664] -> [1666 1671] replaced broker
  test-topic p39: [1665 1653] -> [1665 1675] replaced broker
  test-topic p45: [1656 1666] -> [1656 1672] replaced broker
  test-topic p56: [1658 1662] -> [1658 1674] replaced broker
  test-topic p57: [1665 1660] -> [1672 1660] replaced broker
  test-topic p59: [1663 1652] -> [1663 1671] replaced broker
  test-topic p65: [1652 1653] -> [1652 1675] replaced broker
  test-topic p66: [1669 1655] -> [1672 1673] replaced broker
  test-topic p67: [1660 1658] -> [1674 1673] replaced broker
  test-topic p68: [1657 1666] -> [1675 1672] replaced broker
  test-topic p69: [1659 1654] -> [1674 1675] replaced broker
  test-topic p70: [1661 1658] -> [1674 1675] replaced broker
  test-topic p73: [1666 1660] -> [1670 1660] replaced broker
  test-topic p96: [1654 1663] -> [1673 1672] replaced broker
  test-topic p118: [1667 1664] -> [1670 1664] replaced broker

Broker distribution:
  degree [min/max/avg]: 7/11/8.89 -> 4/10/7.92
  -
  Broker 1652 - leader: 6, follower: 6, total: 12
  Broker 1653 - leader: 6, follower: 5, total: 11
  Broker 1654 - leader: 5, follower: 6, total: 11
  Broker 1655 - leader: 6, follower: 6, total: 12
  Broker 1656 - leader: 6, follower: 6, total: 12
  Broker 1657 - leader: 6, follower: 6, total: 12
  Broker 1658 - leader: 7, follower: 5, total: 12
  Broker 1659 - leader: 5, follower: 7, total: 12
  Broker 1660 - leader: 5, follower: 6, total: 11
  Broker 1661 - leader: 6, follower: 6, total: 12
  Broker 1662 - leader: 7, follower: 6, total: 13
  Broker 1663 - leader: 6, follower: 5, total: 11
  Broker 1664 - leader: 7, follower: 5, total: 12
  Broker 1665 - leader: 6, follower: 6, total: 12
  Broker 1666 - leader: 7, follower: 5, total: 12
  Broker 1667 - leader: 6, follower: 5, total: 11
  Broker 1668 - leader: 6, follower: 6, total: 12
  Broker 1669 - leader: 5, follower: 8, total: 13
  Broker 1670 - leader: 5, follower: 2, total: 7
  Broker 1671 - leader: 3, follower: 4, total: 7
  Broker 1672 - leader: 2, follower: 6, total: 8
  Broker 1673 - leader: 3, follower: 4, total: 7
  Broker 1674 - leader: 4, follower: 3, total: 7
  Broker 1675 - leader: 3, follower: 4, total: 7

Storage free change estimations:
  range: 330.33GB -> 149.22GB
  range spread: 19.12% -> 6.70%
  std. deviation: 79.92GB -> 38.49GB
  -
  Broker 1652: 2057.61 -> 2342.07 (+284.46GB, 13.82%)
  Broker 1653: 1894.79 -> 2331.75 (+436.96GB, 23.06%)
  Broker 1654: 1943.69 -> 2361.00 (+417.31GB, 21.47%)
  Broker 1655: 1969.27 -> 2340.53 (+371.26GB, 18.85%)
  Broker 1656: 2007.44 -> 2355.43 (+347.99GB, 17.34%)
  Broker 1657: 1943.51 -> 2363.17 (+419.65GB, 21.59%)
  Broker 1658: 1941.90 -> 2307.89 (+365.99GB, 18.85%)
  Broker 1659: 1778.32 -> 2284.60 (+506.28GB, 28.47%)
  Broker 1660: 1727.29 -> 2277.11 (+549.82GB, 31.83%)
  Broker 1661: 1841.17 -> 2228.11 (+386.94GB, 21.02%)
  Broker 1662: 1957.37 -> 2250.16 (+292.79GB, 14.96%)
  Broker 1663: 2005.48 -> 2361.76 (+356.28GB, 17.77%)
  Broker 1664: 1921.78 -> 2290.43 (+368.65GB, 19.18%)
  Broker 1665: 2021.37 -> 2328.72 (+307.35GB, 15.21%)
  Broker 1666: 1958.24 -> 2360.74 (+402.50GB, 20.55%)
  Broker 1667: 1903.49 -> 2356.23 (+452.73GB, 23.78%)
  Broker 1668: 1948.37 -> 2317.52 (+369.15GB, 18.95%)
  Broker 1669: 1958.26 -> 2317.16 (+358.89GB, 18.33%)
  Broker 1670: 3483.02 -> 2377.33 (-1105.69GB, -31.75%)
  Broker 1671: 3483.02 -> 2293.05 (-1189.97GB, -34.16%)
  Broker 1672: 3483.02 -> 2341.81 (-1141.22GB, -32.77%)
  Broker 1673: 3483.02 -> 2327.28 (-1155.75GB, -33.18%)
  Broker 1674: 3483.02 -> 2284.06 (-1198.97GB, -34.42%)
  Broker 1675: 3483.02 -> 2279.61 (-1203.42GB, -34.55%)

New partition maps:
  test-topic.json

After applying the map:

[Screenshot: 2018-11-21, 12:26:37 PM]

Leadership Optimization

While running any of the above operations, it's possible to optimize each broker's leader to follower ratio as a final step using the --optimize-leadership flag.

See the leadership optimization section in the Rebuild command documentation.

Troubleshooting

Enabling --verbose prints placement decision information for each offload target and each partition.

An offload target will not list any partitions scheduled for relocation when:

  • It has few, large partitions and even the smallest one available would free up too much storage on the source or consume too much on any destination.
  • All partitions examined were too large to find an optimal relocation. Increasing the --partition-limit flag beyond the default of 30 increases the likelihood of finding a possible relocation (if the broker holds more than 30 partitions).
  • No suitable destination brokers have enough free storage. Possible actions:
    • adding additional brokers to the congested rack.id locality
    • disabling locality scoping (--locality-scoped=false)

Storage utilization range isn't improving

The storage range is a key metric in improving storage balance. Sometimes a poor range can be a result of offload targets being unable to schedule relocations (see above). Factors such as partition counts, distribution, sizes, broker counts, replica locality and other constraints make this a difficult problem to optimize for.
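The range figures in the "Storage free change estimations" output can be reproduced from the per-broker values. The formula below (range divided by the minimum free storage) is inferred from the rebalancing example's numbers rather than taken from topicmappr's source, so treat it as an assumption; the function name is hypothetical.

```python
def storage_range(free_gb):
    """Return (range in GB, range spread in percent) for free storage values."""
    values = list(free_gb)
    lo, hi = min(values), max(values)
    rng = hi - lo
    return rng, rng / lo * 100

# Min and max free storage before the rebalance (brokers 1214 and 1200).
rng, spread = storage_range([1556.82, 3587.97])
print(f"{rng:.2f}GB, {spread:.2f}%")  # 2031.15GB, 130.47%

# Min and max free storage after the rebalance (brokers 1217 and 1200).
rng, spread = storage_range([1816.75, 2787.77])
print(f"{rng:.2f}GB, {spread:.2f}%")  # 971.02GB, 53.45%
```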

Likewise, which brokers to target for offloading is an influencing factor. Larger --storage-threshold values (such as the default 20%) are intended to target outlier brokers. If balance is somewhat good to begin with, lower values (such as 5% in the example) can be used to target more brokers, which opens more opportunity for improved balance. At some point, it may be best to use the rebuild command with the storage placement functionality and just build a storage optimal map from scratch on a new set of target brokers.