The HyperLogLog algorithm is a probabilistic cardinality estimator used to approximate the number of distinct elements in a multiset. It is particularly useful for large data sets where counting the exact cardinality is impractical due to memory constraints.
Key Features and Operations
Memory Efficiency: HyperLogLog uses significantly less memory than exact cardinality calculation methods, making it suitable for large data sets.
Accuracy: The algorithm provides a typical accuracy (standard error) of 2% for cardinalities over 10^9 using only 1.5 kB of memory.
Functions:
hll_accumulate (add): Add new elements to the set. Initializes the counter.
hll_estimate (count): Obtain the cardinality of the set.
hll_combine (merge): Obtain the union of two sets.
hll_export: Export the HLL Sketch to a persistent, dialect-specific format.
hll_import: Import an HLL Sketch from a persistent, dialect-specific format.
Data Structure
The HyperLogLog algorithm uses an array M of m counters (or "registers") initialized to 0. This array is called the HyperLogLog sketch of the multiset S. See the Wikipedia article for further details.
Using HyperLogLog in Malloy Schemas and Queries
In Malloy, you can use the HyperLogLog algorithm to efficiently estimate the cardinality of large data sets. By incorporating HyperLogLog into your schemas and queries, you can gain insights into the distinct elements of your data without sacrificing performance or memory.
Supported Databases
Malloy HyperLogLog functions can only be used with the following databases:
Presto-Trino See the Malloy Presto-Trino Page.
BigQuery. See the Malloy BigQuery Page.
Snowflake. See the Malloy Snowflake Page.
HLL Functions
The Malloy language supports the following HyperLogLog functions.
hll_accumulate
run: flights -> { group_by: dep_date is dep_time::date carrier, origin, destination aggregate: flight_count is count() aircraft_count_hll is hll_accumulate(tail_num) } -> { group_by: dep_date aggregate: flight_count is flight_count.sum() aircraft_count is hll_estimate(hll_combine(aircraft_count_hll)) }
Add new elements to the set. Initializes the counter.
hll_estimate
Returns an estimate of the number of distinct elements in the data set. See the previous example.
hll_combine
Merges HLL++ sketches of the same underlying type into a new sketch. See the previous example.
hll_export
run: flights -> { group_by: dep_date is dep_time::date carrier, origin, destination aggregate: flight_count is count() aircraft_count_hll_persist is hll_export(hll_accumulate(tail_num)) } -> { group_by: dep_date aggregate: flight_count is flight_count.sum() aircraft_count is hll_estimate(hll_combine(hll_import(aircraft_count_hll_persist))) }
Exports the HLL Sketch to a persistent format. The output type is database-specific.
hll_import
Imports an HLL Sketch from a persistent, database-specific format. See the previous example.