Spark and Data Sketches: Taming count-distinct

Distinct counting is a commonly used metric in trend analysis to measure popularity or performance. Although it may seem like a simple problem, the challenge grows quickly with the volume of data. Counting the exact number of distinct values can consume significant resources and take a long time, even on a parallelized processing engine. Probabilistic algorithms, such as Data Sketches, can be an excellent solution to this challenge if the results can tolerate slight inaccuracies (with mathematically proven error bounds). [Read More]

Common Pitfalls to Avoid When Publishing Artifacts on Maven Central

Are you planning to publish your first artifact to Maven Central and make it available to a wider audience? Congratulations, you’re taking an important step in contributing to the open-source community! However, the process may not be as straightforward as you expect. In this post, we’ll go over some common pitfalls to avoid to make your publishing experience smoother. Before we dive into the details, it’s essential to note that the official documentation on Maven Central is a great resource. [Read More]