Window functions in Malloy are expressed in calculation fields, using the calculate:
keyword. Calculation fields operate on the results of a aggregation or selection, while still operating within the same query stage. Logically these calculation operations occur "after" the other operations, so their exact semantics can be challenging to understand. For a full list of the window functions that Malloy supports, visit our function reference documentation.
Here is a simple example:
run: flights -> { where: carrier = 'WN' group_by: dep_year is dep_time.year aggregate: flight_count order_by: dep_year asc calculate: year_change is flight_count - lag(flight_count) }
[ { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 13133, "year_change": null }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 14421, "year_change": 1288 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 14708, "year_change": 287 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 14300, "year_change": -408 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 14640, "year_change": 340 } ]
SELECT DATE_TRUNC('year', base."dep_time") as "dep_year", COUNT(1) as "flight_count", (COUNT(1))-(LAG((COUNT(1))) OVER( ORDER BY DATE_TRUNC('year', base."dep_time") asc NULLS LAST )) as "year_change" FROM '../data/flights.parquet' as base WHERE base."carrier"='WN' GROUP BY 1 ORDER BY 1 asc NULLS LAST
In this query, we compute flight_count
as an aggregate, and then compute the year-over-year change, year_change
in the calculate:
clause, using the lag window function.
Ordering
Some window functions in SQL, such as lag, require an ORDER BY
clause to determine behavior. In Malloy, such analytic functions use the ordering of the query itself. For example:
run: flights -> { where: carrier = 'WN' group_by: dep_year is dep_time.year aggregate: flight_count order_by: dep_year asc calculate: year_change is flight_count - lag(flight_count) }
[ { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 13133, "year_change": null }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 14421, "year_change": 1288 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 14708, "year_change": 287 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 14300, "year_change": -408 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 14640, "year_change": 340 } ]
SELECT DATE_TRUNC('year', base."dep_time") as "dep_year", COUNT(1) as "flight_count", (COUNT(1))-(LAG((COUNT(1))) OVER( ORDER BY DATE_TRUNC('year', base."dep_time") asc NULLS LAST )) as "year_change" FROM '../data/flights.parquet' as base WHERE base."carrier"='WN' GROUP BY 1 ORDER BY 1 asc NULLS LAST
lag(flight_count)
for each row is calculated with respect to dep_year asc
. The value of lag(flight_count)
for the year 2001 is the value of flight_count
for the year 2000.
Note that Malloy queries use a default ordering if none is explicitly specified:
If a time dimension is specified in the query, default ordering is by "descending time", showing newest rows first
If no time is specified, but there is a numeric measure, default ordering is by "measure values descending", showing largest values first
Ordering can also be changed by adding an order_by
property on the calculation. This allows you to control the ordering of the input to the window function independently from the ordering of the final results.
##! experimental run: flights -> { where: carrier = 'WN' group_by: dep_year is dep_time.year aggregate: flight_count calculate: year_change is flight_count - lag(flight_count) { order_by: dep_year asc } order_by: dep_year desc }
[ { "dep_year": "2005-01-01T00:00:00.000Z", "flight_count": 17549, "year_change": 2909 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 14640, "year_change": 340 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 14300, "year_change": -408 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 14708, "year_change": 287 }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 14421, "year_change": 1288 } ]
SELECT DATE_TRUNC('year', base."dep_time") as "dep_year", COUNT(1) as "flight_count", (COUNT(1))-(LAG((COUNT(1))) OVER( ORDER BY (DATE_TRUNC('year', base."dep_time")) ASC )) as "year_change" FROM '../data/flights.parquet' as base WHERE base."carrier"='WN' GROUP BY 1 ORDER BY 1 desc NULLS LAST
In this example, the window function is calculated in ascending order with respect to dep_year
, but the results are output in descending order (most recent year first).
Partitioning
Some window function operations require the specification of partition_by
clauses in SQL. For example, calculating year-over-year changes for a specific time span, such as "number of flights in March of this year versus March of last year". In Malloy, these can be specified by adding a partition_by
property to the calculation.
##! experimental run: flights -> { where: carrier = 'WN' group_by: dep_year is dep_time.year dep_month is month(dep_time) aggregate: flight_count calculate: year_change is flight_count - lag(flight_count) { partition_by: dep_month order_by: dep_year } order_by: dep_year, dep_month }
[ { "dep_year": "2000-01-01T00:00:00.000Z", "dep_month": 1, "flight_count": 1114, "year_change": null }, { "dep_year": "2000-01-01T00:00:00.000Z", "dep_month": 2, "flight_count": 1088, "year_change": null }, { "dep_year": "2000-01-01T00:00:00.000Z", "dep_month": 3, "flight_count": 1117, "year_change": null }, { "dep_year": "2000-01-01T00:00:00.000Z", "dep_month": 4, "flight_count": 1097, "year_change": null }, { "dep_year": "2000-01-01T00:00:00.000Z", "dep_month": 5, "flight_count": 1151, "year_change": null } ]
SELECT DATE_TRUNC('year', base."dep_time") as "dep_year", EXTRACT(month FROM base."dep_time") as "dep_month", COUNT(1) as "flight_count", (COUNT(1))-(LAG((COUNT(1))) OVER(PARTITION BY (EXTRACT(month FROM base."dep_time")) ORDER BY (DATE_TRUNC('year', base."dep_time")) )) as "year_change" FROM '../data/flights.parquet' as base WHERE base."carrier"='WN' GROUP BY 1,2 ORDER BY 1 ASC NULLS LAST,2 ASC NULLS LAST
This query modifies the window by adding a partition_by
clause on the month number, and an order_by
on the year. This means each window contains only data for a given month number, and the lag
function grabs the value for flight_count
of the prior year within that window.
Field References in Calculations
Because calculations operate logically on the output of the grouping/projecting/aggregating operations, field references behave differently inside a calculate:
block. In particular, field references refer by default to output names from those operations.
run: flights -> { group_by: lower_carrier is lower(carrier) calculate: prev_carrier is lag(lower_carrier) }
[ { "lower_carrier": "aa", "prev_carrier": null }, { "lower_carrier": "as", "prev_carrier": "aa" }, { "lower_carrier": "b6", "prev_carrier": "as" }, { "lower_carrier": "co", "prev_carrier": "b6" }, { "lower_carrier": "dl", "prev_carrier": "co" } ]
SELECT LOWER(base."carrier") as "lower_carrier", LAG((LOWER(base."carrier"))) OVER( ORDER BY LOWER(base."carrier") asc NULLS LAST ) as "prev_carrier" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 1 asc NULLS LAST
In a group_by
or aggregate
, you cannot reference lower_carrier
because it is not a field defined inside of flights
, but in calculate
, you can.
For a detailed explanation of the exact semantics, see the evaluation space documentation.