The basic syntax for a query in Malloy consists of a source and a view consisting of one or more stages separated by ->
. The shape of the data defined in the original source is transformed by each stage.
run: flights -> { group_by: carrier aggregate: flight_count }
carrier | flight_count |
---|---|
WN | 88,751 |
US | 37,683 |
AA | 34,577 |
NW | 33,580 |
UA | 32,757 |
[ { "carrier": "WN", "flight_count": 88751 }, { "carrier": "US", "flight_count": 37683 }, { "carrier": "AA", "flight_count": 34577 }, { "carrier": "NW", "flight_count": 33580 }, { "carrier": "UA", "flight_count": 32757 } ]
SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Sources
Every query begins with a source, which can be thought of as a table with an associated collection of calculations and join relationships. The most common kind of source is a table source, e.g. duckdb.table('flights.csv')
; see the Sources section for more information on creating and extending sources.
Generally, ad-hoc queries will use one of these kinds of sources directly:
run: duckdb.table('../data/flights.parquet') -> { aggregate: total_flight_count is count() }
total_flight_count |
---|
344,827 |
[ { "total_flight_count": 344827 } ]
SELECT COUNT(1) as "total_flight_count" FROM '../data/flights.parquet' as base
However, as analysis becomes more involved, it is often useful to define reusable computations in a named source, then use that named source as the basis for queries.
source: flights2 is duckdb.table('../data/flights.parquet') extend { measure: flight_count is count() total_distance is sum(distance) } run: flights2 -> { group_by: carrier aggregate: flight_count, total_distance }
carrier | flight_count | total_distance |
---|---|---|
WN | 88,751 | 54,619,152 |
US | 37,683 | 23,721,642 |
AA | 34,577 | 37,684,885 |
NW | 33,580 | 33,376,503 |
UA | 32,757 | 38,882,934 |
[ { "carrier": "WN", "flight_count": 88751, "total_distance": 54619152 }, { "carrier": "US", "flight_count": 37683, "total_distance": 23721642 }, { "carrier": "AA", "flight_count": 34577, "total_distance": 37684885 }, { "carrier": "NW", "flight_count": 33580, "total_distance": 33376503 }, { "carrier": "UA", "flight_count": 32757, "total_distance": 38882934 } ]
SELECT base."carrier" as "carrier", COUNT(1) as "flight_count", COALESCE(SUM(base."distance"),0) as "total_distance" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
It is also sometimes useful to define a query and then later use that query as the source for another query:
query: flights_by_carrier is duckdb.table('../data/flights.parquet') -> { group_by: carrier aggregate: flight_count is count() } run: flights_by_carrier -> { select: carrier limit: 2 }
carrier |
---|
RU |
OH |
[ { "carrier": "RU" }, { "carrier": "OH" } ]
WITH __stage0 AS ( SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ) SELECT base."carrier" as "carrier" FROM __stage0 as base LIMIT 2
Views
A view defines a transformation (or series of transformations) to perform on a source. In a query, everything after the first ->
is a view.
// source -> view run: flights -> { select: * }
Views can be defined in a source extension to enable reusability and nesting. You can always copy and paste the view of a query into the definition of the source of that query (or an extension of that source).
source: flights3 is flights extend { view: by_carrier2 is { group_by: carrier aggregate: flight_count } } run: flights3 -> by_carrier2
carrier | flight_count |
---|---|
WN | 88,751 |
US | 37,683 |
AA | 34,577 |
NW | 33,580 |
UA | 32,757 |
[ { "carrier": "WN", "flight_count": 88751 }, { "carrier": "US", "flight_count": 37683 }, { "carrier": "AA", "flight_count": 34577 }, { "carrier": "NW", "flight_count": 33580 }, { "carrier": "UA", "flight_count": 32757 } ]
SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Queries that reference a view by name can add additional stages after those defined in the view.
run: flights3 -> by_carrier2 -> { select: carrier limit: 1 }
carrier |
---|
DL |
[ { "carrier": "DL" } ]
WITH __stage0 AS ( SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ) SELECT base."carrier" as "carrier" FROM __stage0 as base LIMIT 1
Refinements, Query partials and Shorthand
Query blocks write be written in parts, those parts can be either views in the source and combined with the refinement operator +
The query below
run: flights -> { group_by: carrier aggregate: flight_count limit: 10 }
can be written as below. Each of the parts of a query can be separated into a 'partial query'.
run: flights -> {group_by: carrier} + {aggregate: flight_count} + {limit: 10}
The query can also be written as:
run: flights -> carrier + flight_count + {limit: 10}
In the query above, the refienment operator +
combines the parameters of a query so the query can be built from parts. Measure, dimension and views declared in the source can also be used. Referencing dimension carrier
expands to a partial of {group_by: x}
. Referencing measure flight_count
expands to a partial query of {aggregate: flight_count}
For more information about defining, reusing, nesting, and refining views, see the Views section.