In a query which is grouped by multiple dimensions, it is often useful to be able to perform an aggregate calculation on sub-groups.
Ungrouped Aggregate Functions
all
all(expr) all(aggregate_expr, grouping_dimension, ...)
The all()
function will perform the specified aggregate computation aggregate_expr
, ignoring the grouping in the
current query to provide an overall value.
run: airports -> { group_by: faa_region aggregate: percent_of_total is count() / all(count())*100.0 }
[ { "faa_region": "AGL", "percent_of_total": 22.41701611680897 }, { "faa_region": "ASW", "percent_of_total": 16.51088768756631 }, { "faa_region": "ASO", "percent_of_total": 14.772899509927754 }, { "faa_region": "AEA", "percent_of_total": 13.065225079573587 }, { "faa_region": "ANM", "percent_of_total": 10.619916131965846 } ]
WITH __stage0 AS ( SELECT group_set, CASE WHEN group_set=1 THEN base."faa_region" END as "faa_region__1", (CASE WHEN group_set=1 THEN COUNT(1) END*1.0/MAX(CASE WHEN group_set=0 THEN COUNT(1) END) OVER ())*100.0::DOUBLE as "percent_of_total__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2 ) SELECT "faa_region__1" as "faa_region", MAX(CASE WHEN group_set=1 THEN "percent_of_total__1" END) as "percent_of_total" FROM __stage0 WHERE group_set NOT IN (0) GROUP BY 1 ORDER BY 2 desc NULLS LAST
When grouping_dimension
s are provided, all()
will preserve grouping by the named dimensions. For example, in the query below, grouping by faa_region
is preserved, while state
is ungrouped.
run: airports -> { group_by: faa_region, state aggregate: airports_in_state is count() airports_in_region is all(count(), faa_region) order_by: airports_in_state desc }
[ { "faa_region": "ASW", "state": "TX", "airports_in_state": 1845, "airports_in_region": 3268 }, { "faa_region": "AWP", "state": "CA", "airports_in_state": 984, "airports_in_region": 1503 }, { "faa_region": "AGL", "state": "IL", "airports_in_state": 890, "airports_in_region": 4437 }, { "faa_region": "ASO", "state": "FL", "airports_in_state": 856, "airports_in_region": 2924 }, { "faa_region": "AEA", "state": "PA", "airports_in_state": 804, "airports_in_region": 2586 } ]
WITH __stage0 AS ( SELECT group_set, CASE WHEN group_set IN (2,1) THEN base."faa_region" END as "faa_region__2", CASE WHEN group_set=2 THEN base."state" END as "state__2", CASE WHEN group_set=2 THEN COUNT(1) END as "airports_in_state__2", MAX(CASE WHEN group_set=1 THEN COUNT(1) END) OVER (PARTITION BY CASE WHEN group_set IN (2,1) THEN base."faa_region" END) as "airports_in_region__2" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set ) as group_set GROUP BY 1,2,3 ) SELECT "faa_region__2" as "faa_region", "state__2" as "state", MAX(CASE WHEN group_set=2 THEN "airports_in_state__2" END) as "airports_in_state", MAX(CASE WHEN group_set=2 THEN "airports_in_region__2" END) as "airports_in_region" FROM __stage0 WHERE group_set NOT IN (0,1) GROUP BY 1,2 ORDER BY 3 desc NULLS LAST
Dimensions named in all()
must be included in a group_by
in the current query (in other words, they must be output fields).
exclude
exclude(aggregate_expr, ungroup_dimension, ...)
Similar to all()
, exclude()
allows you to control which grouping dimensions are
used to compute aggregate_expr
. In this case, dimensions which should not be used are listed. For example, these two aggregates will do the exact same thing:
run: airports -> { group_by: faa_region, fac_type aggregate: count_airports is count() count_in_region_exclude is exclude(count(), fac_type) count_in_region_all is all(count(), faa_region) }
[ { "faa_region": "AGL", "fac_type": "AIRPORT", "count_airports": 3443, "count_in_region_exclude": 4437, "count_in_region_all": 4437 }, { "faa_region": "ASW", "fac_type": "AIRPORT", "count_airports": 2341, "count_in_region_exclude": 3268, "count_in_region_all": 3268 }, { "faa_region": "ASO", "fac_type": "AIRPORT", "count_airports": 2038, "count_in_region_exclude": 2924, "count_in_region_all": 2924 }, { "faa_region": "AEA", "fac_type": "AIRPORT", "count_airports": 1525, "count_in_region_exclude": 2586, "count_in_region_all": 2586 }, { "faa_region": "ANM", "fac_type": "AIRPORT", "count_airports": 1524, "count_in_region_exclude": 2102, "count_in_region_all": 2102 } ]
WITH __stage0 AS ( SELECT group_set, CASE WHEN group_set IN (3,1,2) THEN base."faa_region" END as "faa_region__3", CASE WHEN group_set=3 THEN base."fac_type" END as "fac_type__3", CASE WHEN group_set=3 THEN COUNT(1) END as "count_airports__3", MAX(CASE WHEN group_set=1 THEN COUNT(1) END) OVER (PARTITION BY CASE WHEN group_set IN (3,1,2) THEN base."faa_region" END) as "count_in_region_exclude__3", MAX(CASE WHEN group_set=2 THEN COUNT(1) END) OVER (PARTITION BY CASE WHEN group_set IN (3,1,2) THEN base."faa_region" END) as "count_in_region_all__3" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,3,1)) as group_set ) as group_set GROUP BY 1,2,3 ) SELECT "faa_region__3" as "faa_region", "fac_type__3" as "fac_type", MAX(CASE WHEN group_set=3 THEN "count_airports__3" END) as "count_airports", MAX(CASE WHEN group_set=3 THEN "count_in_region_exclude__3" END) as "count_in_region_exclude", MAX(CASE WHEN group_set=3 THEN "count_in_region_all__3" END) as "count_in_region_all" FROM __stage0 WHERE group_set NOT IN (0,1,2) GROUP BY 1,2 ORDER BY 3 desc NULLS LAST
The main difference is that in a nested query, it is legal to name a grouping dimension from an outer query which contains the inner query.
As with all()
, ungroup_dimension
s must be output fields.