Spark and Data Sketches: Taming count-distinct
Distinct counting is a commonly used metric in trend analysis to measure popularity or performance. Although it may seem like a simple problem, the challenge quickly grows as the amount of data grows. Counting the exact number of distinct values can consume a significant amount of resources while taking a long time even when using a parallelized processing engine. To address this challenge, you can use probabilistic algorithms
Probabilistic algorithms Probabilistic algorithms, such as Data Sketches, can be an excellent solution if the results can tolerate slight inaccuracies (with mathematically proven error bounds).
[Read More]