
Add support for Conditional Aggregation Analyzer #571

Closed · wants to merge 3 commits

Conversation

joshuazexter
Contributor

@joshuazexter joshuazexter commented Jul 29, 2024

This pull request introduces the ConditionalAggregationAnalyzer, a tool for dynamic data aggregation based on user-specified conditions within Apache Spark DataFrames. It extends the Deequ library's ability to perform customized metric calculations and aggregations, making it useful wherever conditional data aggregation is required.

Core Features:

  • Custom Aggregation Logic: Users can pass a lambda function that specifies how data should be aggregated. This function is applied to a DataFrame to compute a state representing the aggregation result.
  • Generic Metric Computation: After aggregation, the analyzer computes metrics from the aggregated data state, enabling easy integration with existing monitoring or reporting systems.
  • Versatility in Use Cases: Whether it's analyzing sales data, customer feedback, or operational metrics, this analyzer provides the tools necessary to extract meaningful insights from complex datasets.
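To make the first two points concrete, the following is a minimal sketch of a user-supplied aggregation lambda and the state it produces. The type `AggregatedMetricState` and the column name `category` are assumptions for illustration; the actual state type defined in this PR may differ.

```scala
// Hypothetical sketch — AggregatedMetricState is an assumed name, not
// necessarily the type introduced by this PR.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.count

// A state holding the aggregation result: row counts per group plus a total.
case class AggregatedMetricState(counts: Map[String, Long], total: Long)

// A user-supplied lambda that aggregates a DataFrame into that state,
// here counting rows per content category.
val aggregateByCategory: DataFrame => AggregatedMetricState = { df =>
  val rows   = df.groupBy("category").agg(count("*").as("cnt")).collect()
  val counts = rows.map(r => r.getString(0) -> r.getLong(1)).toMap
  AggregatedMetricState(counts, counts.values.sum)
}
```

From such a state, per-group percentages (e.g. engagement share per category) can be derived by dividing each group's count by the total.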

Usage Examples:
Included in the pull request are unit tests that demonstrate potential use cases:

  1. Content Engagement Metrics:
  • Use Case: Media companies and content providers often need to measure the engagement levels of various content types across different platforms to optimize their offerings.
  • Example: Use the analyzer to aggregate views, likes, and shares of articles or videos across different content categories (e.g., sports, news, entertainment) to calculate engagement percentages that help in identifying the most popular content types.
  2. Operational Efficiency Monitoring:
  • Use Case: In manufacturing or IT operations, monitoring the efficiency of processes or systems is crucial. This analyzer can aggregate operational data to track efficiency metrics like downtime, throughput, or error rates.
  • Example: Aggregate and compute the frequency of downtime incidents across different machines or systems to identify patterns or potential areas for maintenance improvements.

How It Can Be Used:
To use the ConditionalAggregationAnalyzer, developers will need to:

  1. Define a lambda function that implements the aggregation logic for their data, typically over a specific column.
  2. Instantiate the analyzer with this function, specifying the relevant metric names and instances.
  3. Apply the analyzer to a DataFrame within a Spark session to compute and retrieve metrics.
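The three steps above can be sketched end to end as follows. Note that this is a hypothetical sketch: the constructor arguments and the method name `computeMetricFrom` are assumptions for illustration and may not match the exact API introduced in this PR.

```scala
// Hypothetical end-to-end usage; the exact ConditionalAggregationAnalyzer
// signature is assumed, not confirmed by this PR description.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("ConditionalAggregationExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Sample engagement data: content category and view counts.
val data = Seq(
  ("sports", 120L), ("news", 80L), ("sports", 30L), ("entertainment", 50L)
).toDF("category", "views")

// 1. Define a lambda with the aggregation logic: total views per category.
val totalViewsByCategory: DataFrame => Map[String, Long] = { df =>
  df.groupBy("category")
    .agg(sum("views").as("totalViews"))
    .collect()
    .map(r => r.getString(0) -> r.getLong(1))
    .toMap
}

// 2. Instantiate the analyzer with the function, metric name, and instance
//    (constructor shape assumed for this sketch).
val analyzer = ConditionalAggregationAnalyzer(
  totalViewsByCategory,
  metricName = "ContentEngagement",
  instance   = "views"
)

// 3. Apply the analyzer to the DataFrame to compute and retrieve the metric.
val metric = analyzer.computeMetricFrom(data)
println(metric)
```

Running this against a local Spark session would yield a metric built from the per-category view totals (e.g. sports → 150, news → 80, entertainment → 50).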

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
