Ungrouped Aggregates

VSCode

In a query which is grouped by multiple dimensions, it is often useful to be able to perform an aggregate calculation on sub-groups.

Ungrouped Aggregate Functions

all

all(expr)
all(aggregate_expr, grouping_dimension, ...)

The all() function will perform the specified aggregate computation aggregate_expr, ignoring the grouping in the current query to provide an overall value.

run: airports -> {
  group_by: faa_region 
  aggregate: percent_of_total is count() / all(count())*100.0
}

QUERY RESULTS

[
  {
    "faa_region": "AGL",
    "percent_of_total": 22.41701611680897
  },
  {
    "faa_region": "ASW",
    "percent_of_total": 16.51088768756631
  },
  {
    "faa_region": "ASO",
    "percent_of_total": 14.772899509927754
  },
  {
    "faa_region": "AEA",
    "percent_of_total": 13.065225079573587
  },
  {
    "faa_region": "ANM",
    "percent_of_total": 10.619916131965846
  }
]

WITH __stage0 AS (
  SELECT
    group_set,
    CASE WHEN group_set=1 THEN
      base."faa_region"
      END as "faa_region__1",
    (CASE WHEN group_set=1 THEN
      COUNT(1)
      END*1.0/MAX(CASE WHEN group_set=0 THEN
      COUNT(1)
      END) OVER ())*100.0::DOUBLE as "percent_of_total__1"
  FROM '../data/airports.parquet' as base
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set  ) as group_set
  GROUP BY 1,2
)
SELECT
  "faa_region__1" as "faa_region",
  MAX(CASE WHEN group_set=1 THEN "percent_of_total__1" END) as "percent_of_total"
FROM __stage0
WHERE group_set NOT IN (0)
GROUP BY 1
ORDER BY 2 desc NULLS LAST

When grouping_dimensions are provided, all() will preserve grouping by the named dimensions. For example, in the query below, grouping by faa_region is preserved, while state is ungrouped.

run: airports -> {
  group_by: faa_region, state
  aggregate:
    airports_in_state is count()
    airports_in_region is all(count(), faa_region)
  order_by: airports_in_state desc
}

QUERY RESULTS

[
  {
    "faa_region": "ASW",
    "state": "TX",
    "airports_in_state": 1845,
    "airports_in_region": 3268
  },
  {
    "faa_region": "AWP",
    "state": "CA",
    "airports_in_state": 984,
    "airports_in_region": 1503
  },
  {
    "faa_region": "AGL",
    "state": "IL",
    "airports_in_state": 890,
    "airports_in_region": 4437
  },
  {
    "faa_region": "ASO",
    "state": "FL",
    "airports_in_state": 856,
    "airports_in_region": 2924
  },
  {
    "faa_region": "AEA",
    "state": "PA",
    "airports_in_state": 804,
    "airports_in_region": 2586
  }
]

WITH __stage0 AS (
  SELECT
    group_set,
    CASE WHEN group_set IN (2,1) THEN
      base."faa_region"
      END as "faa_region__2",
    CASE WHEN group_set=2 THEN
      base."state"
      END as "state__2",
    CASE WHEN group_set=2 THEN
      COUNT(1)
      END as "airports_in_state__2",
    MAX(CASE WHEN group_set=1 THEN
      COUNT(1)
      END) OVER (PARTITION BY CASE WHEN group_set IN (2,1) THEN
      base."faa_region"
      END) as "airports_in_region__2"
  FROM '../data/airports.parquet' as base
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set  ) as group_set
  GROUP BY 1,2,3
)
SELECT
  "faa_region__2" as "faa_region",
  "state__2" as "state",
  MAX(CASE WHEN group_set=2 THEN "airports_in_state__2" END) as "airports_in_state",
  MAX(CASE WHEN group_set=2 THEN "airports_in_region__2" END) as "airports_in_region"
FROM __stage0
WHERE group_set NOT IN (0,1)
GROUP BY 1,2
ORDER BY 3 desc NULLS LAST

Dimensions named in all() must be included in a group_by in the current query (in other words, they must be output fields).

exclude

exclude(aggregate_expr, ungroup_dimension, ...)

Similar to all(), exclude() allows you to control which grouping dimensions are used to compute aggregate_expr. In this case, dimensions which should not be used are listed. For example, these two aggregates will do the exact same thing:

run: airports -> {
  group_by: faa_region, fac_type
  aggregate:
    count_airports is count()
    count_in_region_exclude is exclude(count(), fac_type)
    count_in_region_all is all(count(), faa_region)
}

QUERY RESULTS

[
  {
    "faa_region": "AGL",
    "fac_type": "AIRPORT",
    "count_airports": 3443,
    "count_in_region_exclude": 4437,
    "count_in_region_all": 4437
  },
  {
    "faa_region": "ASW",
    "fac_type": "AIRPORT",
    "count_airports": 2341,
    "count_in_region_exclude": 3268,
    "count_in_region_all": 3268
  },
  {
    "faa_region": "ASO",
    "fac_type": "AIRPORT",
    "count_airports": 2038,
    "count_in_region_exclude": 2924,
    "count_in_region_all": 2924
  },
  {
    "faa_region": "AEA",
    "fac_type": "AIRPORT",
    "count_airports": 1525,
    "count_in_region_exclude": 2586,
    "count_in_region_all": 2586
  },
  {
    "faa_region": "ANM",
    "fac_type": "AIRPORT",
    "count_airports": 1524,
    "count_in_region_exclude": 2102,
    "count_in_region_all": 2102
  }
]

WITH __stage0 AS (
  SELECT
    group_set,
    CASE WHEN group_set IN (3,1,2) THEN
      base."faa_region"
      END as "faa_region__3",
    CASE WHEN group_set=3 THEN
      base."fac_type"
      END as "fac_type__3",
    CASE WHEN group_set=3 THEN
      COUNT(1)
      END as "count_airports__3",
    MAX(CASE WHEN group_set=1 THEN
      COUNT(1)
      END) OVER (PARTITION BY CASE WHEN group_set IN (3,1,2) THEN
      base."faa_region"
      END) as "count_in_region_exclude__3",
    MAX(CASE WHEN group_set=2 THEN
      COUNT(1)
      END) OVER (PARTITION BY CASE WHEN group_set IN (3,1,2) THEN
      base."faa_region"
      END) as "count_in_region_all__3"
  FROM '../data/airports.parquet' as base
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,3,1)) as group_set  ) as group_set
  GROUP BY 1,2,3
)
SELECT
  "faa_region__3" as "faa_region",
  "fac_type__3" as "fac_type",
  MAX(CASE WHEN group_set=3 THEN "count_airports__3" END) as "count_airports",
  MAX(CASE WHEN group_set=3 THEN "count_in_region_exclude__3" END) as "count_in_region_exclude",
  MAX(CASE WHEN group_set=3 THEN "count_in_region_all__3" END) as "count_in_region_all"
FROM __stage0
WHERE group_set NOT IN (0,1,2)
GROUP BY 1,2
ORDER BY 3 desc NULLS LAST

The main difference is that in a nested query, it is legal to name a grouping dimension from an outer query which contains the inner query.

As with all(), ungroup_dimensions must be output fields.