Malloy supports the standard aggregate functions count
, sum
, avg
, min
, and max
. When these are used in a field's definition, they make that field a measure.
Basic Syntax
count
count() source.count() join_name.count()
The count
aggregate function may be used to count the number of records appearing in a source.
run: flights -> { aggregate: flight_count is count() }
[ { "flight_count": 344827 } ]
SELECT COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base
count(expr)
count(expression) source.count(expression) join_name.field.count()
Distinct counts may be used to count the number of distinct values of a particular expression within a source.
run: order_items -> { aggregate: order_count is count(order_id) }
[ { "order_count": 264071 } ]
SELECT count(distinct base."order_id") as "order_count" FROM '../data/order_items.parquet' as base
sum
sum(expression) source.sum(expression) join_name.sum(expression) join_name.field.sum()
The sum
function may be used to compute the sum of all records of a particular field.
run: flights -> { aggregate: total_distance is sum(distance) }
[ { "total_distance": 255337195 } ]
SELECT COALESCE(SUM(base."distance"),0) as "total_distance" FROM '../data/flights.parquet' as base
avg
avg(expression) source.avg(expression) join_name.avg(expression) join_name.field.avg()
The avg
function may be used to compute the average of all records of a particular field.
run: aircraft -> { aggregate: average_seats is avg(aircraft_models.seats) }
[ { "average_seats": 8.671575437621561 } ]
SELECT AVG(aircraft_models_0."seats") as "average_seats" FROM '../data/aircraft.parquet' as base LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"
min
min(expression) source.min(expression) join_name.field.min()
The min
function may be used to compute the minimum of all records of a particular field.
run: order_items -> { aggregate: cheapest_price is min(sale_price) }
[ { "cheapest_price": 0.49000000953674316 } ]
SELECT min(base."sale_price") as "cheapest_price" FROM '../data/order_items.parquet' as base
max
max(expression) source.max(expression) join_name.field.max()
The max
function may be used to compute the maximum of all records of a particular field.
run: flights -> { aggregate: longest_distance is max(distance) }
[ { "longest_distance": 4243 } ]
SELECT max(base."distance") as "longest_distance" FROM '../data/flights.parquet' as base
Aggregate Locality
In SQL, some kinds of aggregations are difficult to express because locality of aggregation is restricted to the top level of a query. Malloy offers more control over this behavior, allowing these types of analysis to be expressed much more easily.
The Problem
Suppose you were interested in learning more about the number of seats on commercial aircraft. First you might look at the average number of seats on all registered aircraft.
To do this, you would start with the aircraft
table and join in aircraft_models
to get access to the number of seats, then take
the average of aircraft_models.seats
.
SELECT AVG(aircraft_models.seats) FROM aircraft LEFT JOIN aircraft_models ON aircraft.aircraft_model_code = aircraft_models.aircraft_model_code
run: aircraft -> { aggregate: average_seats is avg(aircraft_models.seats) }
[ { "average_seats": 8.671575437621561 } ]
SELECT AVG(aircraft_models_0."seats") as "average_seats" FROM '../data/aircraft.parquet' as base LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"
You're also interested in knowing the average number of seats on the kinds of aircraft that are in use, or in other words, the average number of seats of the aircraft models of registered aircraft.
To do this, you might decide to start with the aircraft_models
table instead.
SELECT AVG(seats) FROM aircraft_models
run: aircraft_models -> { aggregate: average_seats is avg(seats) }
[ { "average_seats": 7.482757480028448 } ]
SELECT AVG(base."seats") as "average_seats" FROM '../data/aircraft_models.parquet' as base
However, this isn't actually the number you were interested in, because this measures the average number of seats across all aircraft models, not just the ones with actively-registered aircraft.
Unfortunately, SQL doesn't have any native constructs to compute this value, and in practice analysts often resort to complicated fact tables to perform this kind of query.
The Solution
Malloy introduces the concept of aggregate locality, meaning that aggregates can be computed with respect to different points in the data graph. In the following query, average_seats
is computed with respect to aircraft_models
,
yielding the the average number of seats on aircraft models of aircraft listed in the aircraft
table.
run: aircraft -> { aggregate: average_seats is aircraft_models.avg(aircraft_models.seats) }
[ { "average_seats": 12.919491525423728 } ]
SELECT ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a ) ) as "average_seats" FROM '../data/aircraft.parquet' as base LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"
Using a bare aggregate function, e.g. avg(expression)
, will aggregate with respect to the current source (in the above example, that would be aircraft
). The syntax source.avg(expr)
makes that aggregate locality explicit. When an aggregate computation crosses a join relationship, explicit aggregate locality is required. See below for more details.
For convenience, aircraft_models.avg(aircraft_models.seats)
can be written as aircraft_models.seats.avg()
.
Aggregates that Support Locality
The aggregate functions that support locality are count
(non-distinct), sum
, and avg
.
The min
and max
aggregates do not support aggregate locality because the minimum and maximum values are the same regardless of where they are computed. Local aggregation removes duplicate values (those corresponding to the same row in the aggregate source location), and minimum and maximum values do not change if values are repeated more than once.
Required Explicit Aggregate Locality
Specifying aggregate locality is required when the aggregate computation crosses a join relationship accesses nested data. For symmetric aggregate functions min
, max
, and count
, this never occurs, and so it is always allowed to use the regular min(expression)
syntax.
For asymmetric aggregates avg
and sum
, any time the aggregate expression (or dimension) uses a join or nested field, using the syntax sum(expression)
will result in a warning suggesting the use of source.sum(expression)
instead.
run: aircraft -> { aggregate: // Symmetric aggregate functions can use implicit locality okay_1 is min(aircraft_models.seats) // Asymmetric aggregate functions must use explicit locality warning is avg(aircraft_models.seats) // Using `source` for explicity locality to produce a weighted average okay_2 is source.avg(aircraft_models.seats) // Using `aircraft_models` for explicit locality okay_3 is aircraft_models.avg(aircraft_models.seats) }
[ { "okay_1": 0, "warning": 8.671575437621561, "okay_2": 8.671575437621561, "okay_3": 12.919491525423728 } ]
SELECT min(aircraft_models_0."seats") as "okay_1", AVG(aircraft_models_0."seats") as "warning", AVG(aircraft_models_0."seats") as "okay_2", ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a ) ) as "okay_3" FROM '../data/aircraft.parquet' as base LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"
Examples
The following queries show six ways of calculating the average number of seats.
run: flights -> { aggregate: aircraft_models_avg_seats is aircraft.aircraft_models.seats.avg() aircraft_avg_models_seats is aircraft.avg(aircraft.aircraft_models.seats) avg_aircraft_models_seats is source.avg(aircraft.aircraft_models.seats) }
[ { "aircraft_models_avg_seats": 146.89743589743588, "aircraft_avg_models_seats": 146.25454545454545, "avg_aircraft_models_seats": 149.00187920319465 } ]
SELECT ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a ) ) as "aircraft_models_avg_seats", ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_0."tail_num", val: aircraft_models_0."seats"})) a ) ) as "aircraft_avg_models_seats", AVG(aircraft_models_0."seats") as "avg_aircraft_models_seats" FROM '../data/flights.parquet' as base LEFT JOIN '../data/aircraft.parquet' AS aircraft_0 ON aircraft_0."tail_num"=base."tail_num" LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=aircraft_0."aircraft_model_code"
run: aircraft -> { aggregate: models_avg_seats is aircraft_models.seats.avg() avg_models_seats is source.avg(aircraft_models.seats) }
[ { "models_avg_seats": 12.919491525423728, "avg_models_seats": 8.671575437621561 } ]
SELECT ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a ) ) as "models_avg_seats", AVG(aircraft_models_0."seats") as "avg_models_seats" FROM '../data/aircraft.parquet' as base LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"
run: aircraft_models -> { aggregate: avg_seats is avg(seats) }
[ { "avg_seats": 7.482757480028448 } ]
SELECT AVG(base."seats") as "avg_seats" FROM '../data/aircraft_models.parquet' as base
This table summarizes the meaning of each of these calculations.
Field Definition and Source | Is the average number of seats... |
---|---|
avg(seats) in aircraft_models |
...of all aircraft models. |
source.avg(aircraft_models.seats) in aircraft |
...on aircraft. |
aircraft_models.seats.avg() in aircraft |
...of the aircraft models of aircraft. |
source.avg(aircraft.aircraft_models.seats) in flights |
...on flights. |
aircraft.avg(aircraft.aircraft_models.seats) in flights |
...on aircraft that fly. |
aircraft.aircraft_models.seats.avg() in flights |
...of the aircraft models of aircraft that fly. |
Illegal Asymmetric Aggregate Computations
There are a few circumstances when an asymmetric aggregate computation is illegal, based on the join usage of that computation:
When the aggregate crosses a
join_many
relationship forwardWhen the aggregate crosses a
join_cross
relationship forward or backwardWhen the aggregate crosses a repeated nested relationship forward
"Forward" verses "backward" refers to the difference between join name.sum(field in source)
(forward) and source.sum(join name.field in join)
(backward).
These cases don't carry meaning and are impossible to compute. Malloy will yield an error like "Cannot compute sum
across join_many
relationship join_name
; use join_name.field.sum()
".
Aggregates on Fields
Aggregating "on a field," e.g. aircraft_models.seats.avg()
is exactly equivalent to aggregating that field with respect to its direct parent source, e.g. aircraft_models.avg(aircraft_models.seats)
. This syntax is supported for the aggregate functions which benefit from aggregate locality and require a field, avg
and sum
.
One case to make note of is when the field is a dimension which uses exactly one join pattern, in which case it is equivalent to aggregating with respect to that join pattern:
run: aircraft extend { dimension: aircraft_model_seats is aircraft_models.seats } -> { aggregate: avg_on_source is aircraft_models.avg(aircraft_models.seats) avg_on_field is aircraft_models.seats.avg() avg_on_dimension is aircraft_model_seats.avg() }
[ { "avg_on_source": 12.919491525423728, "avg_on_field": 12.919491525423728, "avg_on_dimension": 12.919491525423728 } ]
SELECT ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a ) ) as "avg_on_source", ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: aircraft_models_0."seats"})) a ) ) as "avg_on_field", ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:aircraft_models_0."aircraft_model_code", val: (aircraft_models_0."seats")})) a ) ) as "avg_on_dimension" FROM '../data/aircraft.parquet' as base LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=base."aircraft_model_code"
Ungrouped Aggregates
In a query which is grouped by multiple dimensions, it is often useful to be able to perform an aggregate calculation on sub-groups to determine subtotals. The all()
and exclude()
functions in Malloy allow control over grouping and ungrouping, making this simple:
run: airports -> { group_by: state, faa_region aggregate: count_airports is count() overall_airports is all(count()) # percent percent_of_total is count() / all(count()) airports_in_region is all(count(), faa_region) # percent percent_in_region is count() / all(count(), faa_region) }
[ { "state": "TX", "faa_region": "ASW", "count_airports": 1845, "overall_airports": 19793, "percent_of_total": 0.09321477289950993, "airports_in_region": 3268, "percent_in_region": 0.5645654834761322 }, { "state": "CA", "faa_region": "AWP", "count_airports": 984, "overall_airports": 19793, "percent_of_total": 0.049714545546405295, "airports_in_region": 1503, "percent_in_region": 0.654690618762475 }, { "state": "IL", "faa_region": "AGL", "count_airports": 890, "overall_airports": 19793, "percent_of_total": 0.04496539180518365, "airports_in_region": 4437, "percent_in_region": 0.2005859815190444 }, { "state": "FL", "faa_region": "ASO", "count_airports": 856, "overall_airports": 19793, "percent_of_total": 0.04324761279240135, "airports_in_region": 2924, "percent_in_region": 0.292749658002736 }, { "state": "PA", "faa_region": "AEA", "count_airports": 804, "overall_airports": 19793, "percent_of_total": 0.040620421361087254, "airports_in_region": 2586, "percent_in_region": 0.3109048723897912 } ]
WITH __stage0 AS ( SELECT group_set, CASE WHEN group_set=2 THEN base."state" END as "state__2", CASE WHEN group_set IN (2,1) THEN base."faa_region" END as "faa_region__2", CASE WHEN group_set=2 THEN COUNT(1) END as "count_airports__2", MAX(CASE WHEN group_set=0 THEN COUNT(1) END) OVER () as "overall_airports__2", CASE WHEN group_set=2 THEN COUNT(1) END*1.0/MAX(CASE WHEN group_set=0 THEN COUNT(1) END) OVER () as "percent_of_total__2", MAX(CASE WHEN group_set=1 THEN COUNT(1) END) OVER (PARTITION BY CASE WHEN group_set IN (2,1) THEN base."faa_region" END) as "airports_in_region__2", CASE WHEN group_set=2 THEN COUNT(1) END*1.0/MAX(CASE WHEN group_set=1 THEN COUNT(1) END) OVER (PARTITION BY CASE WHEN group_set IN (2,1) THEN base."faa_region" END) as "percent_in_region__2" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set ) as group_set GROUP BY 1,2,3 ) SELECT "state__2" as "state", "faa_region__2" as "faa_region", MAX(CASE WHEN group_set=2 THEN "count_airports__2" END) as "count_airports", MAX(CASE WHEN group_set=2 THEN "overall_airports__2" END) as "overall_airports", MAX(CASE WHEN group_set=2 THEN "percent_of_total__2" END) as "percent_of_total", MAX(CASE WHEN group_set=2 THEN "airports_in_region__2" END) as "airports_in_region", MAX(CASE WHEN group_set=2 THEN "percent_in_region__2" END) as "percent_in_region" FROM __stage0 WHERE group_set NOT IN (0,1) GROUP BY 1,2 ORDER BY 3 desc NULLS LAST
Read more about Ungrouped Aggregates here.