The basic syntax for a query in Malloy consists of a source and a view consisting of one or more stages separated by ->
. The shape of the data defined in the original source is transformed by each stage.
run: flights -> { group_by: carrier aggregate: flight_count }
[ { "carrier": "WN", "flight_count": 88751 }, { "carrier": "US", "flight_count": 37683 }, { "carrier": "AA", "flight_count": 34577 }, { "carrier": "NW", "flight_count": 33580 }, { "carrier": "UA", "flight_count": 32757 } ]
SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Sources
Every query begins with a source, which can be thought of as a table with an associated collection of calculations and join relationships. The most common kind of source is a table source, e.g. duckdb.table('flights.csv')
; see the Sources section for more information on creating and extending sources.
Generally, ad-hoc queries will use one of these kinds of sources directly:
run: duckdb.table('../data/flights.parquet') -> { aggregate: total_flight_count is count() }
[ { "total_flight_count": 344827 } ]
SELECT COUNT(1) as "total_flight_count" FROM '../data/flights.parquet' as base
However, as analysis becomes more involved, it is often useful to define reusable computations in a named source, then use that named source as the basis for queries.
source: flights2 is duckdb.table('../data/flights.parquet') extend { measure: flight_count is count() total_distance is sum(distance) } run: flights2 -> { group_by: carrier aggregate: flight_count, total_distance }
[ { "carrier": "WN", "flight_count": 88751, "total_distance": 54619152 }, { "carrier": "US", "flight_count": 37683, "total_distance": 23721642 }, { "carrier": "AA", "flight_count": 34577, "total_distance": 37684885 }, { "carrier": "NW", "flight_count": 33580, "total_distance": 33376503 }, { "carrier": "UA", "flight_count": 32757, "total_distance": 38882934 } ]
SELECT base."carrier" as "carrier", COUNT(1) as "flight_count", COALESCE(SUM(base."distance"),0) as "total_distance" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
It is also sometimes useful to define a query and then later use that query as the source for another query:
query: flights_by_carrier is duckdb.table('../data/flights.parquet') -> { group_by: carrier aggregate: flight_count is count() } run: flights_by_carrier -> { select: carrier limit: 2 }
[ { "carrier": "DL" }, { "carrier": "WN" } ]
WITH __stage0 AS ( SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ) SELECT base."carrier" as "carrier" FROM __stage0 as base LIMIT 2
Views
A view defines a transformation (or series of transformations) to perform on a source. In a query, everything after the first ->
is a view.
// source -> view run: flights -> { select: * }
Views can be defined in a source extension to enable reusability and nesting. You can always copy and paste the view of a query into the definition of the source of that query (or an extension of that source).