Malloy Documentation
search

The HyperLogLog algorithm is a probabilistic cardinality estimator used to approximate the number of distinct elements in a multiset. It is particularly useful for large data sets where counting the exact cardinality is impractical due to memory constraints.

Key Features and Operations

  • Memory Efficiency: HyperLogLog uses significantly less memory than exact cardinality calculation methods, making it suitable for large data sets.

  • Accuracy: The algorithm provides a typical accuracy (standard error) of 2% for cardinalities over 10^9 using only 1.5 kB of memory.

  • Functions:

    • hll_accumulate (add): Add new elements to the set. Initializes the counter.

    • hll_estimate (count): Obtain the cardinality of the set.

    • hll_combine (merge): Obtain the union of two sets.

    • hll_export: Export the HLL Sketch to a persistent, dialect-specific format.

    • hll_import: Import an HLL Sketch from a persistent, dialect-specific format.

Data Structure

The HyperLogLog algorithm uses an array M of m counters (or "registers") initialized to 0. This array is called the HyperLogLog sketch of the multiset S. See the Wikipedia article for further details.

Using HyperLogLog in Malloy Schemas and Queries

In Malloy, you can use the HyperLogLog algorithm to efficiently estimate the cardinality of large data sets. By incorporating HyperLogLog into your schemas and queries, you can gain insights into the distinct elements of your data without sacrificing performance or memory.

Supported Databases

Malloy HyperLogLog functions can only be used with the following databases:

HLL Functions

The Malloy language supports the following HyperLogLog functions.

hll_accumulate

run: flights -> {
    group_by:
        dep_date is dep_time::date
        carrier, origin, destination
    aggregate:
        flight_count is count()
        aircraft_count_hll is hll_accumulate(tail_num)
} -> {
    group_by:
        dep_date
    aggregate:
        flight_count is flight_count.sum()
        aircraft_count is hll_estimate(hll_combine(aircraft_count_hll))
}

Add new elements to the set. Initializes the counter.

hll_estimate

Returns an estimate of the number of distinct elements in the data set. See the previous example.

hll_combine

Merges HLL++ sketches of the same underlying type into a new sketch. See the previous example.

hll_export

run: flights -> {
    group_by:
        dep_date is dep_time::date
        carrier, origin, destination
    aggregate:
        flight_count is count()
        aircraft_count_hll_persist is hll_export(hll_accumulate(tail_num))
} -> {
    group_by:
        dep_date
    aggregate:
        flight_count is flight_count.sum()
        aircraft_count is hll_estimate(hll_combine(hll_import(aircraft_count_hll_persist)))
}

Exports the HLL Sketch to a persistent format. The output type is database-specific.

hll_import

Imports an HLL Sketch from a persistent, database-specific format. See the previous example.