Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for intra.broker.goals in anomaly detection / self healing #1721

Open
wants to merge 8 commits into
base: migrate_to_kafka_2_4
Choose a base branch
from

Conversation

ecojan
Copy link

@ecojan ecojan commented Oct 26, 2021

This PR resolves #1264 by enabling the self-healing for intra-broker goals. This however does not account for race conditions between Cluster level Goals and will only trigger after said goals are finished.
The scope of this PR is to allow self-healing on IntraBroker goals for now (as future work might make this obsolete as intra-broker goals and regular goals might be made to work together).

New configuration:

anomaly.detection.intra.broker.goals= List of intra-broker goals for the Anomaly detector to look for
self.healing.intra.broker.goals= List of Intra broker Goals for self-healing
self.healing.intra.broker.goal.violation.enabled= If the self-healing should be enabled for intra-broker-goals

New Anomaly Type: INTRA_BROKER_GOAL_VIOLATION

A new type of Anomaly that needs to be treated differently from the regular GOAL_VIOLATIONS.
When configuring for the rebalance parameters for this violation, we are skipping hard goals check and ignoring proposalsCache.
We are skipping hard goals check because intra-broker goals may not be hard goals (at the current given time).
We are ignoring proposalsCache because the proposals model takes into account all given default goals. If we add the intraBrokersGoals this will not work properly as currently, disk-granularity goals do not work with broker-granularity goals.

New Anomaly detectors: IntraBrokerGoalViolationDetector

This anomaly detector is only looking for the anomaly.detection.intra.broker.goal and unlike the GoalViolationDetector, it's using a cluster model that is generating replica placement on disks as well.
This anomaly detector mainly functions for INTRA_BROKER_GOAL_VIOLATIONS.

What is missing from this feature

  • missing representation on INTRA_BROKER_GOAL_VIOLATIONS in CruiseControl UI when an anomaly is detected (as this is a structure in cruise-control-ui project)

*/
public GoalBasedOperationRunnable(KafkaCruiseControl kafkaCruiseControl,
OperationFuture future,
List<String> goals,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see dryRun boolean in other version of constructor. Do we need to support dryRun here too ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not, that one is used for a user request (ex. request coming from UI for example)

/**
* Constructor to be used for creating a runnable for self-healing.
*/
public GoalBasedOperationRunnable(KafkaCruiseControl kafkaCruiseControl,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to have same order of parameters as in other constructor.

@@ -252,6 +257,20 @@ private void alertGoalViolation(AnomalyType anomalyType, final String localHostn
}
}

private void alertIntraBrokerGoalViolation(AnomalyType anomalyType, final String localHostname,
List<AlertaMessage> alertaMessages, IntraBrokerGoalViolations goalViolations) {
Map<Boolean, List<String>> violations = goalViolations.violatedGoalsByFixability();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some duplication in logic for this and method above. WDYT about extracting logic into private method and calling with required params ?

* This class will be scheduled to run periodically to check if the given goals are violated or not. An alert will be
* triggered if one of the goals is not met.
*/
public class IntraBrokerGoalViolationDetector extends AbstractAnomalyDetector implements Runnable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of logic in common with GoalViolationDetector class. wdyt about extracting into a common(abstract) class and then override only necessary pieces ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted this into a BaseGoalViolationDetector class, please have a look at let me know what you think.

@ecojan ecojan force-pushed the 1264-intra-broker-self-healing branch from e47231b to 6516586 Compare November 22, 2021 11:44
@ecojan
Copy link
Author

ecojan commented Nov 22, 2021

Hi @efeg can you do a review of this PR please?
I would also like to bring up the #1724 as the integrationTests task runs for 3-5 minutes per test, if the build goes beyond 10 minutes without outputting, it will fail (see last build). Could please let me know if there was a different error here?
I tried building both this and the default branch locally, both took ~ 38minutes (+/- a few seconds) and finished successfully.

@efeg efeg requested a review from zornhsu December 16, 2021 21:42
Copy link
Contributor

@zornhsu zornhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, @ecojan! I haven't finished reviewing all the PR, but do let me know what you think about refactoring AbstractGoalViolations.

* @throws TimeoutException If broker capacity resolver is unable to resolve broker capacity in time.
* @throws BrokerCapacityResolutionException If broker capacity resolver fails to resolve broker capacity.
*/
public ClusterModel clusterModel(long now,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public ClusterModel clusterModel(long now,
public ClusterModel clusterModel(long nowMs,

@@ -443,6 +443,37 @@ public ClusterModel clusterModel(long nowMs,
return clusterModel;
Copy link
Contributor

@zornhsu zornhsu Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return clusterModel;
return clusterModel(nowMs, requirements, allowCapacityEstimation, false, operationProgress);

Would something go wrong if we dedup this method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure on this one,
We have already generated an initial cluster model in this Constructor. (see line 435)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear. Github doesn't allow me to drag L435-442.

How about we delete L435-442, and replace L443 with my suggested change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Updated this bit

Comment on lines 461 to 453
boolean allowCapacityEstimation,
boolean populateReplicaPlacementInfo,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the order align with line 540.

Suggested change
boolean allowCapacityEstimation,
boolean populateReplicaPlacementInfo,
boolean populateReplicaPlacementInfo,
boolean allowCapacityEstimation,

Comment on lines 451 to 443
* @param allowCapacityEstimation whether allow capacity estimation in cluster model if the underlying live broker capacity is unavailable.
* @param populateReplicaPlacementInfo whether populate replica placement information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the order align with line 540.

Comment on lines 516 to 518
boolean allowCapacityEstimation,
OperationProgress operationProgress,
boolean populateReplicaPlacementInfo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the order align with line 540.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up removing this Constructor since we have one already with the same signature.

_ignoreProposalCache = ignoreProposalCache;
_destinationBrokerIds = SELF_HEALING_DESTINATION_BROKER_IDS;
_isRebalanceDiskMode = isRebalanceDiskMode;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra whitespace

Comment on lines 75 to 80
boolean isRebalanceDiskMode,
boolean ignoreProposalCache,
boolean skipHardGoalCheck,
boolean stopOngoingExecution) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
boolean isRebalanceDiskMode,
boolean ignoreProposalCache,
boolean skipHardGoalCheck,
boolean stopOngoingExecution) {
boolean stopOngoingExecution,
boolean skipHardGoalCheck,
boolean ignoreProposalCache,
boolean isRebalanceDiskMode) {

Good to keep same parameter order as GoalBasedOperationRunnable.java line 103.
Also adding new parameters to the end, according to existing assigning order.



/**
* A class that holds all the goal violations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A class that holds all the goal violations.
* A class that holds all the intra broker goal violations.

/**
* A class that holds all the goal violations.
*/
public class IntraBrokerGoalViolations extends KafkaAnomaly {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same comment as there a lot of dup between this class and GoalViolations. How about making another abstract class?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, will create an abstract class.

Comment on lines +80 to +81
case INTRA_BROKER_GOAL_VIOLATION:
IntraBrokerGoalViolations intraBrokerGoalViolations = (IntraBrokerGoalViolations) _anomalyState.anomaly();
Copy link
Contributor

@zornhsu zornhsu Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can combine both cases once we have the abstract class?

Suggested change
case INTRA_BROKER_GOAL_VIOLATION:
IntraBrokerGoalViolations intraBrokerGoalViolations = (IntraBrokerGoalViolations) _anomalyState.anomaly();
case GOAL_VIOLATION:
case INTRA_BROKER_GOAL_VIOLATION:
IntraBrokerGoalViolations intraBrokerGoalViolations = (AbstractGoalViolations) _anomalyState.anomaly();

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @zornhsu , returned a bit to this.
My small concern on this is that it's used in other parts of the application and I would avoid obfuscating the difference between the two, maybe reuse the logic but keep them separate. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ecojan, I think what zorn suggested makes more sense. Do you have an example that merging them would cause confusions?

@ecojan
Copy link
Author

ecojan commented Jan 20, 2022

Hey @zornhsu thanks for the review, will try and get around to it the following days

@ecojan ecojan force-pushed the 1264-intra-broker-self-healing branch from 6516586 to 37d010b Compare February 7, 2022 17:20
Copy link
Contributor

@zornhsu zornhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new commits, @ecojan! What do you think about my previous comment to abstract BrokerGoalViolations for IntraBrokerGoalViolations to extend?

#1721

@@ -443,6 +443,37 @@ public ClusterModel clusterModel(long nowMs,
return clusterModel;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear. Github doesn't allow me to drag L435-442.

How about we delete L435-442, and replace L443 with my suggested change?

@ecojan
Copy link
Author

ecojan commented Feb 12, 2022

Thanks for the new commits, @ecojan! What do you think about my previous comment to abstract BrokerGoalViolations for IntraBrokerGoalViolations to extend?

#1721

Hey @zornhsu sorry for not replying to the initial question, yes I makes perfect sense to refactor AbstractGoalViolations.

Have a few changes locally already.

@chen-anders
Copy link

Just stumbled across this and saw that this would be really useful with some of the JBOD Kafka clusters our team is currently managing. Is there anything else (besides cruise-control UI stuff) preventing this from being merged in (vs ppl who want to use it having to maintain an ongoing fork)?

@ecojan
Copy link
Author

ecojan commented May 4, 2022

@chen-anders all that's left is perhaps some small modifications and cleanup + another round of reviews

@ecojan ecojan force-pushed the 1264-intra-broker-self-healing branch from 27e433b to 05ff397 Compare May 4, 2022 11:36
@ecojan
Copy link
Author

ecojan commented May 4, 2022

Did a rebase with the latest changes from the default branch.

@amuraru
Copy link
Contributor

amuraru commented Aug 14, 2022

@CCisGG I see you are an active maintainer lately - is this something you wanna help us review please? thanks!

@CCisGG CCisGG requested review from CCisGG and wyuka August 14, 2022 17:33
@CCisGG
Copy link
Contributor

CCisGG commented Aug 14, 2022

@amuraru Sure, this seems to be a request from multiple cc users. I'll take some time to get the context and review the latest update. Meanwhile, @wyuka please feel free to take a pass.

@@ -553,7 +574,11 @@ public void run() {
// We compute the proposal even if there is not enough modeled partitions.
ModelCompletenessRequirements requirements = _loadMonitor.meetCompletenessRequirements(_defaultModelCompletenessRequirements)
? _defaultModelCompletenessRequirements : _requirementsWithAvailableValidWindows;
ClusterModel clusterModel = _loadMonitor.clusterModel(_time.milliseconds(), requirements, _allowCapacityEstimation, operationProgress);

// We check for Intra broker goals among Default goals - if we have intra broker goals, set replicaPlacementInfo to true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "set replicaPlacementInfo to true" -> "set populateReplicaPlacementInfo to true".

@@ -15,6 +15,8 @@

import static com.linkedin.cruisecontrol.CruiseControlUtils.utcDateFor;
import static com.linkedin.kafka.cruisecontrol.detector.notifier.KafkaAnomalyType.GOAL_VIOLATION;
import static com.linkedin.kafka.cruisecontrol.detector.notifier.KafkaAnomalyType.INTRA_BROKER_GOAL_VIOLATION;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: redundant new lines. Just curious did you enable code style check? The code style should force to avoid 2 consecutive blank lines.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did, checkstyle passed (running the lastest version that exists in the default branch as well). Also, build process has a checkStyle task in it as well and I noticed this is a recurring thing in other java files as well (just following the status quo).
Let me know if it should be different.

Comment on lines +80 to +81
case INTRA_BROKER_GOAL_VIOLATION:
IntraBrokerGoalViolations intraBrokerGoalViolations = (IntraBrokerGoalViolations) _anomalyState.anomaly();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ecojan, I think what zorn suggested makes more sense. Do you have an example that merging them would cause confusions?

* @param goalViolations Goal violations to check whether there are unfixable goals.
* @return True if the given goal violations contain unfixable goals, false otherwise.
*/
public static boolean hasUnfixableGoals(IntraBrokerGoalViolations goalViolations) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that better to change the input param to type AbstractGoalViolations? That way this method can help to check both interBrokerGoals and intraBrokerGoals

@@ -177,10 +64,10 @@ public void run() {
}

Set<Integer> excludedBrokersForLeadership = _excludeRecentlyDemotedBrokers ? executorState.recentlyDemotedBrokers()
: Collections.emptySet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change

@@ -209,7 +96,7 @@ public void run() {
_lastCheckedModelGeneration = clusterModel.generation();
}
newModelNeeded = optimizeForGoal(clusterModel, goal, goalViolations, excludedBrokersForLeadership, excludedBrokersForReplicaMove,
checkPartitionsWithRFGreaterThanNumRacks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change.

}

@Override
public void run() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method looks like highly duplicated with the GoalViolationDetector. I think we should avoid it. E.g. maybe extract the common logic into the parent class.

boolean checkPartitionsWithRFGreaterThanNumRacks)
throws KafkaCruiseControlException {
if (clusterModel.topics().isEmpty()) {
LOG.info("Skipping goal violation detection because the cluster model does not have any topic.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: skip intra broker goal violation detector.

}
}

protected boolean optimizeForGoal(ClusterModel clusterModel,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not able to use the method in BaseGoalViolationDetector?


protected boolean optimizeForGoal(ClusterModel clusterModel,
Goal goal,
GoalViolations goalViolations,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the type to AbstractGoalViolations, maybe we can reuse this method for both GoalViolations and IntraBrokerGoalViolations?

@CCisGG
Copy link
Contributor

CCisGG commented Aug 14, 2022

Took a first pass and left some comments. Besides, I have some general comments for the commits so far:

  1. From my understand, the only difference between the intraBrokerGoalViolation and (InterBroker)GoalViolation, is "when configuring for the rebalance parameters for this intraBrokerGoalViolation, we are skipping hard goals check and ignoring proposalsCache". In that case, I think we can re-use a lot of code in GoalViolation and GoalViolationDetector. However, I can still see lots of duplicate lines in current commits. To me, it seems like the AbstractGoalViolation and BaseGoalViolationDetector class are not fully utilized yet.

  2. We need to make it more clear that "GoalViolations"/"GoalViolationDetector" are only for inter broker goals. For people who have enough background for cruise-control, they may be familiar with "goals only means interBrokerGoals", but for people don't have much background, this can be pretty confusing. I'd recommend to explicitly clarify that as much as we can. E.g. in JavaDocs, Configuration.md, and cruisecontrol.properties file.

@ecojan
Copy link
Author

ecojan commented Aug 16, 2022

Thank you for the review @CCisGG , will go through the proposed changes and come back this week.

@tkornai
Copy link

tkornai commented Feb 16, 2023

Hi @ecojan, thanks for your work on this very useful feature. Do you think that you'll have the opportunity to come back to this PR in the future?

@ecojan
Copy link
Author

ecojan commented Feb 16, 2023

Hi @tkornai
Unfortunately I had some priority changes in my schedule and did not get around to finish this one.
I can try in the near future, but I can't promise anything. If anybody else is open to continue the effort, I can help a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to use intra.broker.goals in anomaly detection / self healing
7 participants