[controller] Harden update-store workflow #1091

Open · wants to merge 1 commit into main
Conversation

@nisargthakkar nisargthakkar commented Jul 31, 2024

Harden update-store workflow

This PR unifies the UpdateStore logic between the child and parent controllers. We've faced many issues due to the structure of the code (especially VeniceParentHelixAdmin), where certain store-update operations could put a store into an unhealthy state. These are all the changes in this PR:

  1. Add a new class UpdateStoreUtils that is called by both parent and child controllers to apply the store update and perform the necessary validations.
  2. Perform store config validation after applying all configs. This helps check for config compatibility. The following validations are currently performed:
    1. Batch-only:
      1. Inc push is not allowed
      2. WC is not allowed
      3. A/A is not allowed
    2. Hybrid:
      1. Rewind time in seconds cannot be negative
      2. Both offset lag threshold and producer time lag threshold cannot be negative
      3. In multi-region mode, inc push is not allowed for NON_AGGREGATE data replication policy
      4. ACTIVE_ACTIVE data replication policy is only supported when Active-Active replication is enabled
      5. Partition count must be greater than 0
    3. In single-region mode, incremental push is allowed when the store is hybrid
    4. In multi-region mode, incremental push is allowed when the store is hybrid and either of the following are true:
      1. Store has active-active replication enabled
      2. Store doesn't have active-active replication, but DataReplicationPolicy is AGGREGATE
    5. Storage quota cannot be less than 0
    6. Read quota cannot be less than 0
    7. Read quota cannot be larger than "router count * max_per_router_quota"
    8. Store cannot be active-active replication enabled if it is not also native-replication enabled
    9. Active-Active is not supported when amplification factor is more than 1
    10. Write-compute is not supported when amplification factor is more than 1
    11. Store must have a partitioner-config set
    12. Validate if the partitioner can be built with the specified partitioner configs
    13. Latest superset value schema id must map to a valid value schema id
      14. Max compaction lag cannot be less than min compaction lag
    15. ETL Proxy user must be set. (We might want to deprecate this and remove the corresponding tests, but we have plans to get KIF-based ETL that could benefit from the ETLConfig in store configs)
  3. Added a new concept of "primary controller". This is the controller that is allowed to perform inferred updates. In a multi-region mode, this is the parent controller. In a single-region mode, this is a child controller.
  4. Previously each "feature" update incurred an update to the store metadata on Zk - and the corresponding watchers would get invoked. This change replaces all those updates with a single Zk update, which is done after the resulting configs are validated.
  5. The ordinal from BackupStrategy enum was being used to write to the admin channel. This is problematic when the enums evolve. Added a getValue function and used that instead.
  6. A new config: controller.external.superset.schema.generation.enabled has been added to replace controller.parent.external.superset.schema.generation.enabled because external superset schema generation must be allowed in single-region mode also. controller.parent.external.superset.schema.generation.enabled has been marked deprecated, but it has not been completely removed yet for backward compatibility reasons.
  7. Due to the changes to inferred configs, a large chunk of tests started failing as they had incorrect setups. Fixed most of such test setups.
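As a hedged illustration of change 5 above, the explicit-value pattern might look roughly like this (an illustrative sketch, not Venice's actual BackupStrategy enum): writing `ordinal()` to the admin channel breaks if constants are ever reordered or removed, so each constant carries a stable wire value instead.

```java
// Sketch only: names and values are hypothetical, not Venice's actual enum.
public class BackupStrategyValueSketch {
  public enum BackupStrategy {
    KEEP_MIN_VERSIONS(0), DELETE_ON_NEW_PUSH_START(1);

    private final int value;

    BackupStrategy(int value) {
      this.value = value;
    }

    // Stable value for serialization; unlike ordinal(), it survives
    // reordering or insertion of new constants.
    public int getValue() {
      return value;
    }

    // Reverse lookup when reading the value back off the wire.
    public static BackupStrategy fromValue(int value) {
      for (BackupStrategy s: values()) {
        if (s.value == value) {
          return s;
        }
      }
      throw new IllegalArgumentException("Unknown BackupStrategy value: " + value);
    }
  }
}
```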

Some side-effects of this change are:

  1. Controller now checks if the store schema can support WC at the time of enabling the WC config. There were a few tests that also used unsupported value schemas for WC.
  2. Persona validation was previously only present in the parent controller in multi-region mode. It now works in single-region mode as well.
  3. SupersetSchemaGeneratorWithCustomProp had a bug where if the first schema has a custom prop, the future superset schema generation would fail as Avro doesn't allow overriding custom props. This got caught as the update store logic now also tries creating superset schema if a store enabled RC or WC, or if it previously had a superset schema.

There are a few other changes that we should do, but are not done in this PR:
All major operations (creating a store, deleting a store, adding schemas) should only be allowed on the parent controller. We should exclude some system stores from this check, as we do for the check that allows batch pushes through the admin channel in the child controller.

Recommendation for Reviewers

I recommend going through the changes in at least two passes. In the first pass, look through all the code that has been purely deleted. (Skip ParentControllerConfigUpdateUtils, as it has been renamed to PrimaryControllerConfigUpdateUtils and its contents were partially moved to UpdateStoreUtils, so GitHub doesn't detect it as a rename.) While doing this pass, you can also glance through the various small changes that do not need much pondering.

In the second pass, follow this review order:

  1. VeniceControllerClusterConfig, VeniceControllerConfig, VeniceControllerMultiClusterConfig
  2. VeniceControllerService
  3. Admin
  4. UpdateStoreUtils
  5. PrimaryControllerConfigUpdateUtils
  6. AdminUtils
  7. UpdateStoreWrapper
  8. VeniceHelixAdmin
  9. VeniceParentHelixAdmin
  10. SupersetSchemaGeneratorWithCustomProp
  11. All the test modifications
  12. All newly added tests

How was this PR tested?

Added new tests. Modified existing tests. GH CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk2 branch 4 times, most recently from 9edf55a to a019bc8 Compare August 1, 2024 17:15
@nisargthakkar nisargthakkar changed the title [WIP] Harden update-store workflow [controller] Harden update-store workflow Aug 1, 2024
@nisargthakkar nisargthakkar marked this pull request as ready for review August 1, 2024 19:27

@FelixGV FelixGV left a comment


Thanks! This looks like a good change! I didn't have time to look at it fully yet, but I left one minor comment so far.


@mynameborat mynameborat left a comment


I did a pass. I would recommend having another pair of eyes, especially on the following files, due to my limited context in the code base:

  • VeniceParentHelixAdmin
  • VeniceHelixAdmin
  • UpdateStoreUtils

@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk2 branch 3 times, most recently from a296585 to 33d1620 Compare September 19, 2024 21:11
@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk2 branch 5 times, most recently from 3fa383a to e4ebdbe Compare October 1, 2024 02:04

@mynameborat mynameborat left a comment


Looks good. I mostly made sure that the checks and the description in the summary match the code changes.

Final comments on the strategy for releasing these changes:

  1. Do we have branch-based releases for complex changes/features?
  2. Merging this to master and then finding issues during certification could be difficult to deal with or mitigate: the surface area of the changes is large, and accurately reverting such a large commit (assuming other changes pile in) might be tricky.

How do you plan to certify this with our current certification flow?

Comment on lines +135 to +136
if (params.getRegionsFilter().isPresent()) {
Set<String> regionsFilter = parseRegionsFilterList(params.getRegionsFilter().get());

Can we get rid of this conditional to make this more readable? Perhaps something like

Set<String> regionsFilter = params.getRegionsFilter().map(this::parseRegionsFilterList).orElse(Collections.emptySet());

This way you have a top-level early exit when the regions filter does not contain the region name.

clusterName,
regionsFilter,
multiClusterConfigs.getRegionName());
return null;

Do we expect the caller to handle null here?

Comment on lines +303 to +349
addToUpdatedConfigs(
updatedConfigs,
BACKUP_VERSION_RETENTION_MS,
backupVersionRetentionMs,
updatedStore::setBackupVersionRetentionMs);
addToUpdatedConfigs(
updatedConfigs,
NATIVE_REPLICATION_SOURCE_FABRIC,
nativeReplicationSourceFabric,
updatedStore::setNativeReplicationSourceFabric);
addToUpdatedConfigs(
updatedConfigs,
LATEST_SUPERSET_SCHEMA_ID,
latestSupersetSchemaId,
updatedStore::setLatestSuperSetValueSchemaId);
addToUpdatedConfigs(
updatedConfigs,
MIN_COMPACTION_LAG_SECONDS,
minCompactionLagSeconds,
updatedStore::setMinCompactionLagSeconds);
addToUpdatedConfigs(
updatedConfigs,
MAX_COMPACTION_LAG_SECONDS,
maxCompactionLagSeconds,
updatedStore::setMaxCompactionLagSeconds);
addToUpdatedConfigs(updatedConfigs, MAX_RECORD_SIZE_BYTES, maxRecordSizeBytes, updatedStore::setMaxRecordSizeBytes);
addToUpdatedConfigs(
updatedConfigs,
MAX_NEARLINE_RECORD_SIZE_BYTES,
maxNearlineRecordSizeBytes,
updatedStore::setMaxNearlineRecordSizeBytes);
addToUpdatedConfigs(
updatedConfigs,
UNUSED_SCHEMA_DELETION_ENABLED,
unusedSchemaDeletionEnabled,
updatedStore::setUnusedSchemaDeletionEnabled);
addToUpdatedConfigs(
updatedConfigs,
BLOB_TRANSFER_ENABLED,
blobTransferEnabled,
updatedStore::setBlobTransferEnabled);
addToUpdatedConfigs(
updatedConfigs,
STORAGE_NODE_READ_QUOTA_ENABLED,
storageNodeReadQuotaEnabled,
updatedStore::setStorageNodeReadQuotaEnabled);
addToUpdatedConfigs(updatedConfigs, REGULAR_VERSION_ETL_ENABLED, regularVersionETLEnabled, regularVersionETL -> {

Is there a way to group these configurations into some sort of logical category? That way we could break this method into smaller methods, where each smaller method takes care of updating its logical group of configurations.

It would definitely make this more readable, and it would let developers making changes within a group of configurations reason about the potential implications and validations.
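The grouping idea could look roughly like this. All names here are illustrative (not Venice's actual API); the helper mirrors the spirit of addToUpdatedConfigs, applying a setter and recording the config name only when a new value was provided, with one method per logical group of configs.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of splitting the big update method into per-group helpers.
public class GroupedConfigUpdateSketch {
  private final Set<String> updatedConfigs = new HashSet<>();

  // Apply the setter and record the config name only if a new value was given.
  private <T> void addToUpdatedConfigs(String name, Optional<T> newValue, Consumer<T> setter) {
    newValue.ifPresent(v -> {
      setter.accept(v);
      updatedConfigs.add(name);
    });
  }

  // One helper per logical group, e.g. all compaction-related configs together.
  public void applyCompactionConfigs(Optional<Long> minLagSec, Optional<Long> maxLagSec, StoreStub store) {
    addToUpdatedConfigs("min.compaction.lag.seconds", minLagSec, store::setMinCompactionLagSeconds);
    addToUpdatedConfigs("max.compaction.lag.seconds", maxLagSec, store::setMaxCompactionLagSeconds);
  }

  public Set<String> getUpdatedConfigs() {
    return updatedConfigs;
  }

  // Minimal stand-in for the store object being updated.
  public static class StoreStub {
    public long minCompactionLagSeconds;
    public long maxCompactionLagSeconds;

    public void setMinCompactionLagSeconds(long v) { minCompactionLagSeconds = v; }
    public void setMaxCompactionLagSeconds(long v) { maxCompactionLagSeconds = v; }
  }
}
```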

throw new VeniceHttpException(HttpStatus.SC_BAD_REQUEST, errorMessage, ErrorType.INVALID_CONFIG);
}

if (updatedStore.isHybrid() && newPartitionCount == 0) {

Is it okay for non-hybrid stores to have newPartitionCount = 0?

Comment on lines +1072 to +1078
if (newPartitionCount < minPartitionNum && newPartitionCount != 0) {
throw new VeniceHttpException(
HttpStatus.SC_BAD_REQUEST,
"Partition count must be at least " + minPartitionNum + " for store: " + storeName
+ ". If a specific partition count is not required, set it to 0.",
ErrorType.INVALID_CONFIG);
}

Can we be consistent with the rest and log the error before throwing the exception?

Comment on lines +1169 to +1173
if (oldStore.getPartitionerConfig() == null) {
originalPartitionerConfig = new PartitionerConfigImpl();
} else {
originalPartitionerConfig = oldStore.getPartitionerConfig();
}

Minor: some of the places below start with an initialization, then check for null and reinitialize with the default in the event of null (e.g., ln: 1186, ln: 1197).

I personally prefer Optional.ofNullable(...).orElse(defaultInstance), but feel free to just make them consistent with the rest at the very least.

Comment on lines +90 to +112
public static boolean isIncrementalPushSupported(
boolean multiRegion,
boolean activeActiveReplicationEnabled,
HybridStoreConfig hybridStoreConfig) {
// Only hybrid stores can support incremental push
if (!AdminUtils.isHybrid(hybridStoreConfig)) {
return false;
}

// If the system is running in multi-region mode, we need to validate the data replication policies
if (!multiRegion) {
return true;
}

// A/A can always support incremental push
if (activeActiveReplicationEnabled) {
return true;
}

DataReplicationPolicy dataReplicationPolicy = hybridStoreConfig.getDataReplicationPolicy();
return dataReplicationPolicy == DataReplicationPolicy.AGGREGATE
|| dataReplicationPolicy == DataReplicationPolicy.NONE;
}

Should we leverage this in UpdateStoreUtils for validating the scenarios where incremental push is allowed? The checks seem redundant.


sendAdminMessageAndWaitForConsumed(clusterName, storeName, message);
} finally {
releaseAdminMessageLock(clusterName, storeName);

Can we guarantee that we don't run into a scenario where the admin is stuck and this block never executes? Is there some sort of timed lock release?
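The timed-lock idea raised here could be sketched as follows. This is a hypothetical illustration (not Venice's actual admin-message lock, whose acquire/release helpers are shown above): tryLock with a timeout bounds how long a caller can block, and the finally block guarantees the lock is released on every exit path, even when sending the message throws.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a timed acquire so a stuck admin cannot hold callers forever.
public class TimedAdminLockSketch {
  private final ReentrantLock adminMessageLock = new ReentrantLock();

  // Returns false if the lock could not be acquired within timeoutMs,
  // surfacing the failure to the caller instead of hanging indefinitely.
  public boolean sendWithTimedLock(Runnable sendAdminMessage, long timeoutMs) throws InterruptedException {
    if (!adminMessageLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      return false;
    }
    try {
      sendAdminMessage.run();
      return true;
    } finally {
      // Runs even if sendAdminMessage throws, so the lock is never leaked.
      adminMessageLock.unlock();
    }
  }
}
```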
