Aggregates

VSCode

Malloy supports the standard aggregate functions count, sum, avg, min, and max. When these are used in a field's definition, they make that field a measure.

Basic Syntax

count

count()
source.count()
join_name.count()

The count aggregate function may be used to count the number of records appearing in a source.

run: flights -> {
  aggregate: flight_count is count()
}

QUERY RESULTS

[
  {
    "flight_count": 344827
  }
]

SELECT 
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base

count(expr)

count(expression)
source.count(expression)
join_name.field.count()

Distinct counts may be used to count the number of distinct values of a particular expression within a source.

run: order_items -> {
  aggregate: order_count is count(order_id)
}

QUERY RESULTS

[
  {
    "order_count": 264071
  }
]

SELECT 
   count(distinct base."order_id") as "order_count"
FROM '../data/order_items.parquet' as base

sum

sum(expression)
source.sum(expression)
join_name.sum(expression)
join_name.field.sum()

The sum function may be used to compute the sum of all records of a particular field.

run: flights -> {
  aggregate: total_distance is sum(distance)
}

QUERY RESULTS

[
  {
    "total_distance": 255337195
  }
]

SELECT 
   COALESCE(SUM(base."distance"),0) as "total_distance"
FROM '../data/flights.parquet' as base

avg

avg(expression)
source.avg(expression)
join_name.avg(expression)
join_name.field.avg()

The avg function may be used to compute the average of all records of a particular field.

run: aircraft -> {
  aggregate: average_seats is avg(aircraft_models.seats)
}

QUERY RESULTS

[
  {
    "average_seats": 8.671575437621561
  }
]

SELECT 
   AVG(aircraft_models_0."seats") as "average_seats"
FROM '../data/aircraft.parquet' as base
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"

min

min(expression)
source.min(expression)
join_name.field.min()

The min function may be used to compute the minimum of all records of a particular field.

run: order_items -> {
  aggregate: cheapest_price is min(sale_price)
}

QUERY RESULTS

[
  {
    "cheapest_price": 0.49000000953674316
  }
]

SELECT 
   min(base."sale_price") as "cheapest_price"
FROM '../data/order_items.parquet' as base

max

max(expression)
source.max(expression)
join_name.field.max()

The max function may be used to compute the maximum of all records of a particular field.

run: flights -> {
  aggregate: longest_distance is max(distance)
}

QUERY RESULTS

[
  {
    "longest_distance": 4243
  }
]

SELECT 
   max(base."distance") as "longest_distance"
FROM '../data/flights.parquet' as base

In SQL, some kinds of aggregations are difficult to express because locality of aggregation is restricted to the top level of a query. Malloy offers more control over this behavior, allowing these types of analysis to be expressed much more easily.

The Problem

Suppose you were interested in learning more about the number of seats on commercial aircraft. First you might look at the average number of seats on all registered aircraft.

To do this, you would start with the aircraft table and join in aircraft_models to get access to the number of seats, then take the average of aircraft_models.seats.

SELECT
  AVG(aircraft_models.seats)
FROM aircraft
LEFT JOIN aircraft_models
  ON aircraft.aircraft_model_code = aircraft_models.aircraft_model_code

run: aircraft -> {
  aggregate: average_seats is avg(aircraft_models.seats)
}

QUERY RESULTS

[
  {
    "average_seats": 8.671575437621561
  }
]

SELECT 
   AVG(aircraft_models_0."seats") as "average_seats"
FROM '../data/aircraft.parquet' as base
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"

You're also interested in knowing the average number of seats on the kinds of aircraft that are in use, or in other words, the average number of seats of the aircraft models of registered aircraft.

To do this, you might decide to start with the aircraft_models table instead.

SELECT AVG(seats)
FROM aircraft_models

run: aircraft_models -> {
  aggregate: average_seats is avg(seats)
}

QUERY RESULTS

[
  {
    "average_seats": 7.482757480028448
  }
]

SELECT 
   AVG(base."seats") as "average_seats"
FROM '../data/aircraft_models.parquet' as base

However, this isn't actually the number you were interested in, because this measures the average number of seats across all aircraft models, not just the ones with actively-registered aircraft.

Unfortunately, SQL doesn't have any native constructs to compute this value, and in practice analysts often resort to complicated fact tables to perform this kind of query.

The Solution

Malloy introduces the concept of aggregate locality, meaning that aggregates can be computed with respect to different points in the data graph. In the following query, average_seats is computed with respect to aircraft_models, yielding the the average number of seats on aircraft models of aircraft listed in the aircraft table.

run: aircraft -> {
  aggregate: average_seats is aircraft_models.avg(aircraft_models.seats)
}

QUERY RESULTS

[
  {
    "average_seats": 12.919491525423728
  }
]

SELECT 
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a
        )
      ) as "average_seats"
FROM '../data/aircraft.parquet' as base
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"

Using a bare aggregate function, e.g. avg(expression), will aggregate with respect to the current source (in the above example, that would be aircraft). The syntax source.avg(expr) makes that aggregate locality explicit. When an aggregate computation crosses a join relationship, explicit aggregate locality is required. See below for more details.

For convenience, aircraft_models.avg(aircraft_models.seats) can be written as aircraft_models.seats.avg().

Aggregates that Support Locality

The aggregate functions that support locality are count (non-distinct), sum, and avg.

The min and max aggregates do not support aggregate locality because the minimum and maximum values are the same regardless of where they are computed. Local aggregation removes duplicate values (those corresponding to the same row in the aggregate source location), and minimum and maximum values do not change if values are repeated more than once.

Required Explicit Aggregate Locality

Specifying aggregate locality is required when the aggregate computation crosses a join relationship accesses nested data. For symmetric aggregate functions min, max, and count, this never occurs, and so it is always allowed to use the regular min(expression) syntax. For asymmetric aggregates avg and sum, any time the aggregate expression (or dimension) uses a join or nested field, using the syntax sum(expression) will result in a warning suggesting the use of source.sum(expression) instead.

run: aircraft -> {
  aggregate:
    // Symmetric aggregate functions can use implicit locality
    okay_1 is min(aircraft_models.seats) 
    // Asymmetric aggregate functions must use explicit locality
    warning is avg(aircraft_models.seats)
    // Using `source` for explicity locality to produce a weighted average
    okay_2 is source.avg(aircraft_models.seats)
    // Using `aircraft_models` for explicit locality
    okay_3 is aircraft_models.avg(aircraft_models.seats)
}

QUERY RESULTS

[
  {
    "okay_1": 0,
    "warning": 8.671575437621561,
    "okay_2": 8.671575437621561,
    "okay_3": 12.919491525423728
  }
]

SELECT 
   min(aircraft_models_0."seats") as "okay_1",
   AVG(aircraft_models_0."seats") as "warning",
   AVG(aircraft_models_0."seats") as "okay_2",
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a
        )
      ) as "okay_3"
FROM '../data/aircraft.parquet' as base
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"

Examples

The following queries show six ways of calculating the average number of seats.

run: flights -> {
  aggregate:
    aircraft_models_avg_seats is aircraft.aircraft_models.seats.avg()
    aircraft_avg_models_seats is aircraft.avg(aircraft.aircraft_models.seats)
    avg_aircraft_models_seats is source.avg(aircraft.aircraft_models.seats)
}

QUERY RESULTS

[
  {
    "aircraft_models_avg_seats": 146.89743589743588,
    "aircraft_avg_models_seats": 146.25454545454545,
    "avg_aircraft_models_seats": 149.00187920319465
  }
]

SELECT 
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a
        )
      ) as "aircraft_models_avg_seats",
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_0."tail_num", val: aircraft_models_0."seats"})) a
        )
      ) as "aircraft_avg_models_seats",
   AVG(aircraft_models_0."seats") as "avg_aircraft_models_seats"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/aircraft.parquet' AS aircraft_0
  ON aircraft_0."tail_num"=base."tail_num"
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=aircraft_0."aircraft_model_code"

run: aircraft -> {
  aggregate:
    models_avg_seats is aircraft_models.seats.avg()
    avg_models_seats is source.avg(aircraft_models.seats)
}

QUERY RESULTS

[
  {
    "models_avg_seats": 12.919491525423728,
    "avg_models_seats": 8.671575437621561
  }
]

SELECT 
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a
        )
      ) as "models_avg_seats",
   AVG(aircraft_models_0."seats") as "avg_models_seats"
FROM '../data/aircraft.parquet' as base
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"

run: aircraft_models -> {
  aggregate: avg_seats is avg(seats)
}

QUERY RESULTS

[
  {
    "avg_seats": 7.482757480028448
  }
]

SELECT 
   AVG(base."seats") as "avg_seats"
FROM '../data/aircraft_models.parquet' as base

This table summarizes the meaning of each of these calculations.

Field Definition and Source	Is the average number of seats...
`avg(seats)` in `aircraft_models`	...of all aircraft models.
`source.avg(aircraft_models.seats)` in `aircraft`	...on aircraft.
`aircraft_models.seats.avg()` in `aircraft`	...of the aircraft models of aircraft.
`source.avg(aircraft.aircraft_models.seats)` in `flights`	...on flights.
`aircraft.avg(aircraft.aircraft_models.seats)` in `flights`	...on aircraft that fly.
`aircraft.aircraft_models.seats.avg()` in `flights`	...of the aircraft models of aircraft that fly.

Illegal Asymmetric Aggregate Computations

There are a few circumstances when an asymmetric aggregate computation is illegal, based on the join usage of that computation:

When the aggregate crosses a join_many relationship forward
When the aggregate crosses a join_cross relationship forward or backward
When the aggregate crosses a repeated nested relationship forward

"Forward" verses "backward" refers to the difference between join name.sum(field in source) (forward) and source.sum(join name.field in join) (backward).

These cases don't carry meaning and are impossible to compute. Malloy will yield an error like "Cannot compute sum across join_many relationship join_name; use join_name.field.sum()".

Aggregates on Fields

Aggregating "on a field," e.g. aircraft_models.seats.avg() is exactly equivalent to aggregating that field with respect to its direct parent source, e.g. aircraft_models.avg(aircraft_models.seats). This syntax is supported for the aggregate functions which benefit from aggregate locality and require a field, avg and sum.

One case to make note of is when the field is a dimension which uses exactly one join pattern, in which case it is equivalent to aggregating with respect to that join pattern:

run: aircraft extend {
  dimension: aircraft_model_seats is aircraft_models.seats
} -> {
  aggregate:
    avg_on_source is aircraft_models.avg(aircraft_models.seats)
    avg_on_field is aircraft_models.seats.avg()
    avg_on_dimension is aircraft_model_seats.avg()
}

QUERY RESULTS

[
  {
    "avg_on_source": 12.919491525423728,
    "avg_on_field": 12.919491525423728,
    "avg_on_dimension": 12.919491525423728
  }
]

SELECT 
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a
        )
      ) as "avg_on_source",
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a
        )
      ) as "avg_on_field",
   (
        SELECT AVG(a.val) as value
        FROM (
          SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: (aircraft_models_0."seats")})) a
        )
      ) as "avg_on_dimension"
FROM '../data/aircraft.parquet' as base
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"

Ungrouped Aggregates

In a query which is grouped by multiple dimensions, it is often useful to be able to perform an aggregate calculation on sub-groups to determine subtotals. The all() and exclude() functions in Malloy allow control over grouping and ungrouping, making this simple:

run: airports -> {
  group_by: state, faa_region
  aggregate:
    count_airports is count()
    overall_airports is all(count())
    # percent
    percent_of_total is count() / all(count())
    airports_in_region is all(count(), faa_region)
    # percent
    percent_in_region is count() / all(count(), faa_region)
}

QUERY RESULTS

[
  {
    "state": "TX",
    "faa_region": "ASW",
    "count_airports": 1845,
    "overall_airports": 19793,
    "percent_of_total": 0.09321477289950993,
    "airports_in_region": 3268,
    "percent_in_region": 0.5645654834761322
  },
  {
    "state": "CA",
    "faa_region": "AWP",
    "count_airports": 984,
    "overall_airports": 19793,
    "percent_of_total": 0.049714545546405295,
    "airports_in_region": 1503,
    "percent_in_region": 0.654690618762475
  },
  {
    "state": "IL",
    "faa_region": "AGL",
    "count_airports": 890,
    "overall_airports": 19793,
    "percent_of_total": 0.04496539180518365,
    "airports_in_region": 4437,
    "percent_in_region": 0.2005859815190444
  },
  {
    "state": "FL",
    "faa_region": "ASO",
    "count_airports": 856,
    "overall_airports": 19793,
    "percent_of_total": 0.04324761279240135,
    "airports_in_region": 2924,
    "percent_in_region": 0.292749658002736
  },
  {
    "state": "PA",
    "faa_region": "AEA",
    "count_airports": 804,
    "overall_airports": 19793,
    "percent_of_total": 0.040620421361087254,
    "airports_in_region": 2586,
    "percent_in_region": 0.3109048723897912
  }
]

WITH __stage0 AS (
  SELECT
    group_set,
    CASE WHEN group_set=2 THEN
      base."state"
      END as "state__2",
    CASE WHEN group_set IN (2,1) THEN
      base."faa_region"
      END as "faa_region__2",
    CASE WHEN group_set=2 THEN
      COUNT(1)
      END as "count_airports__2",
    MAX(CASE WHEN group_set=0 THEN
      COUNT(1)
      END) OVER () as "overall_airports__2",
    CASE WHEN group_set=2 THEN
      COUNT(1)
      END*1.0/MAX(CASE WHEN group_set=0 THEN
      COUNT(1)
      END) OVER () as "percent_of_total__2",
    MAX(CASE WHEN group_set=1 THEN
      COUNT(1)
      END) OVER (PARTITION BY CASE WHEN group_set IN (2,1) THEN
      base."faa_region"
      END) as "airports_in_region__2",
    CASE WHEN group_set=2 THEN
      COUNT(1)
      END*1.0/MAX(CASE WHEN group_set=1 THEN
      COUNT(1)
      END) OVER (PARTITION BY CASE WHEN group_set IN (2,1) THEN
      base."faa_region"
      END) as "percent_in_region__2"
  FROM '../data/airports.parquet' as base
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set  ) as group_set
  GROUP BY 1,2,3
)
SELECT
  "state__2" as "state",
  "faa_region__2" as "faa_region",
  MAX(CASE WHEN group_set=2 THEN "count_airports__2" END) as "count_airports",
  MAX(CASE WHEN group_set=2 THEN "overall_airports__2" END) as "overall_airports",
  MAX(CASE WHEN group_set=2 THEN "percent_of_total__2" END) as "percent_of_total",
  MAX(CASE WHEN group_set=2 THEN "airports_in_region__2" END) as "airports_in_region",
  MAX(CASE WHEN group_set=2 THEN "percent_in_region__2" END) as "percent_in_region"
FROM __stage0
WHERE group_set NOT IN (0,1)
GROUP BY 1,2
ORDER BY 3 desc NULLS LAST

Read more about Ungrouped Aggregates here.