This document will assumes a working knowledge of SQL and will rapidly take you through some of Malloy's key language features.
Using this Guide
For every Malloy Query you can see the formatted result, or raw result as JSON, or the SQL used to produce the result.
Click tab to to see the HTML, JSON or SQL result: 👈👈
SQL SELECT vs Malloy's run:
The statement to run a query in Malloy is run:
. There are two types of queries in Malloy, reductions which have group_by:
or aggregate:
statements, and projections which have select:
statements and do not group or aggregate results.
Projection: SELECT with no GROUP BY
In SQL
SELECT code, full_name, state, faa_region, fac_type, elevation FROM `malloy-data.faa.airports` ORDER BY code
Equivalent in Malloy
run: duckdb.table('../data/airports.parquet') -> { select: code, full_name, state, faa_region, fac_type, elevation order_by: code }
[ { "code": "00A", "full_name": "TOTAL RF", "state": "PA", "faa_region": "AEA", "fac_type": "HELIPORT", "elevation": 11 }, { "code": "00C", "full_name": "ANIMAS AIR PARK", "state": "CO", "faa_region": "ANM", "fac_type": "AIRPORT", "elevation": 6684 }, { "code": "00CA", "full_name": "GOLDSTONE /GTS/", "state": "CA", "faa_region": "AWP", "fac_type": "AIRPORT", "elevation": 3038 }, { "code": "00E", "full_name": "AT&T - APACHE JUNCTION", "state": "AZ", "faa_region": "AWP", "fac_type": "HELIPORT", "elevation": 2527 }, { "code": "00F", "full_name": "TCJC-NORTHEAST CAMPUS", "state": "TX", "faa_region": "ASW", "fac_type": "HELIPORT", "elevation": 600 } ]
SELECT base."code" as "code", base."full_name" as "full_name", base."state" as "state", base."faa_region" as "faa_region", base."fac_type" as "fac_type", base."elevation" as "elevation" FROM '../data/airports.parquet' as base ORDER BY 1 ASC NULLS LAST
Reduction: SELECT with GROUP BY and/or aggregation
In SQL
SELECT base.fac_type as fac_type, COUNT( 1) as airport_count FROM `malloy-data.faa.airports` as base WHERE base.state='CA' GROUP BY 1 ORDER BY 2 desc
Equivalent in Malloy
run: duckdb.table('../data/airports.parquet') -> { group_by: fac_type aggregate: airport_count is count() where: state = 'CA' order_by: airport_count desc }
[ { "fac_type": "AIRPORT", "airport_count": 569 }, { "fac_type": "HELIPORT", "airport_count": 396 }, { "fac_type": "SEAPLANE BASE", "airport_count": 12 }, { "fac_type": "GLIDERPORT", "airport_count": 3 }, { "fac_type": "STOLPORT", "airport_count": 2 } ]
SELECT base."fac_type" as "fac_type", COUNT(1) as "airport_count" FROM '../data/airports.parquet' as base WHERE base."state"='CA' GROUP BY 1 ORDER BY 2 desc NULLS LAST
Source: A data source for queries
Malloy separates a query's view from the source of the data. A source can be thought of as a table and a collection of computations and relationships which are relevant to that table. (Source Documentation).
Fields can be defined as part of a source.
A
measure:
is a declared aggregate calculation (think function that operates across the table) which can be used inaggregate:
elements in a query stageA
dimension:
is a declared scalar calculation which that can be used ingroup_by:
orselect:
elements of a query stage
source: airports is duckdb.table('../data/airports.parquet') extend { dimension: elevation_in_meters is elevation * 0.3048 dimension: state_and_county is concat(state,' - ', county) measure: airport_count is count() measure: avg_elevation_in_meters is elevation_in_meters.avg() }
Querying Against a Source
Queries can be run against source:
objects and can utilize the modeled fields from that source, as well as introduce new ones. (Query Documentation)
using the above declared airports
source
run: airports -> { limit: 10 where: fac_type = 'HELIPORT' group_by: state aggregate: airport_count // <-- declared in source avg_elevation_in_meters // <-- declared in source }
[ { "state": "TX", "airport_count": 435, "avg_elevation_in_meters": 138.68680275862064 }, { "state": "CA", "airport_count": 396, "avg_elevation_in_meters": 276.40203030303024 }, { "state": "PA", "airport_count": 307, "avg_elevation_in_meters": 216.25510358306178 }, { "state": "FL", "airport_count": 280, "avg_elevation_in_meters": 18.100765714285675 }, { "state": "NJ", "airport_count": 247, "avg_elevation_in_meters": 48.149761943319795 } ]
SELECT base."state" as "state", COUNT(1) as "airport_count", AVG((base."elevation"*0.3048::DOUBLE)) as "avg_elevation_in_meters" FROM '../data/airports.parquet' as base WHERE base."fac_type"='HELIPORT' GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 10
Dimensional calculations are no different from columns
using the above declared airports
source
run: airports -> { group_by: state_and_county // <-- declared in source aggregate: airport_count order_by: 1 desc }
[ { "state_and_county": "WY - WESTON", "airport_count": 3 }, { "state_and_county": "WY - WASHAKIE", "airport_count": 4 }, { "state_and_county": "WY - UINTA", "airport_count": 3 }, { "state_and_county": "WY - TETON", "airport_count": 7 }, { "state_and_county": "WY - SWEETWATER", "airport_count": 4 } ]
SELECT CONCAT(base."state",' - ',base."county") as "state_and_county", COUNT(1) as "airport_count" FROM '../data/airports.parquet' as base GROUP BY 1 ORDER BY 1 desc NULLS LAST
Defining Views in a Source
A source can also contain a set of useful views relating to that source, which can be run in queries.
using the above declared airports
source
source: airports2 is duckdb.table('../data/airports.parquet') extend { measure: airport_count is count() view: by_state is { // <-- can be called by name group_by: state aggregate: airport_count } }
Executing Views
The simplest form of a query in Malloy is the name of a source, the query operator ->
, and the name of one of its contained views.
using the above declared airports
source
run: airports2 -> by_state
[ { "state": "TX", "airport_count": 1845 }, { "state": "CA", "airport_count": 984 }, { "state": "IL", "airport_count": 890 }, { "state": "FL", "airport_count": 856 }, { "state": "PA", "airport_count": 804 } ]
SELECT base."state" as "state", COUNT(1) as "airport_count" FROM '../data/airports.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Filtering a Source
You can filter a source by adding a filter expression using the where:
keyword in an extension. In this example, we filter an existing airports
source and then use this extended version to execute the by_state
view in a query. For more information on filtering, see the Filters section.
run: airports2 extend { where: fac_type = 'SEAPLANE BASE' // <- run the query with an added filter } -> by_state
[ { "state": "AK", "airport_count": 104 }, { "state": "MN", "airport_count": 72 }, { "state": "FL", "airport_count": 43 }, { "state": "ME", "airport_count": 38 }, { "state": "NY", "airport_count": 23 } ]
SELECT base."state" as "state", COUNT(1) as "airport_count" FROM '../data/airports.parquet' as base WHERE base."fac_type"='SEAPLANE BASE' GROUP BY 1 ORDER BY 2 desc NULLS LAST
Filtering Measures
The input to an aggregate computation can be filtered.
using the above declared airports
source
run: airports -> { group_by: state aggregate: airport_count aggregate: heliport_count is airport_count { where: fac_type = 'HELIPORT' } // <-- add a filter }
[ { "state": "TX", "airport_count": 1845, "heliport_count": 435 }, { "state": "CA", "airport_count": 984, "heliport_count": 396 }, { "state": "IL", "airport_count": 890, "heliport_count": 245 }, { "state": "FL", "airport_count": 856, "heliport_count": 280 }, { "state": "PA", "airport_count": 804, "heliport_count": 307 } ]
SELECT base."state" as "state", COUNT(1) as "airport_count", (COUNT(CASE WHEN base."fac_type"='HELIPORT' THEN 1 END)) as "heliport_count" FROM '../data/airports.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Composing with Views
For the next section assume the following source declaration.
source: airports3 is duckdb.table('../data/airports.parquet') extend { measure: airport_count is count() measure: avg_elevation is elevation.avg() view: top_5_states is { group_by: state aggregate: airport_count limit: 5 } view: by_facility_type is { group_by: fac_type aggregate: airport_count } }
The nest:
property embeds one view in another
Malloy allows you to create nested subtables easily in a query. In the case below, the top level view groups by state and the nested view groups by facility type. This mechanism is really useful for understanding data and creating complex data structures. (Nesting Documentation)
using the above declared airports
source
run: airports3 -> { group_by: state aggregate: airport_count limit: 5 nest: by_facility_type is { group_by: fac_type aggregate: airport_count } }
[ { "state": "TX", "airport_count": 1845, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 1389 }, { "fac_type": "HELIPORT", "airport_count": 435 }, { "fac_type": "ULTRALIGHT", "airport_count": 8 }, { "fac_type": "STOLPORT", "airport_count": 8 }, { "fac_type": "GLIDERPORT", "airport_count": 5 } ] }, { "state": "CA", "airport_count": 984, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 569 }, { "fac_type": "HELIPORT", "airport_count": 396 }, { "fac_type": "SEAPLANE BASE", "airport_count": 12 }, { "fac_type": "GLIDERPORT", "airport_count": 3 }, { "fac_type": "ULTRALIGHT", "airport_count": 2 }, { "fac_type": "STOLPORT", "airport_count": 2 } ] }, { "state": "IL", "airport_count": 890, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 625 }, { "fac_type": "HELIPORT", "airport_count": 245 }, { "fac_type": "SEAPLANE BASE", "airport_count": 8 }, { "fac_type": "ULTRALIGHT", "airport_count": 6 }, { "fac_type": "BALLOONPORT", "airport_count": 2 }, { "fac_type": "GLIDERPORT", "airport_count": 2 }, { "fac_type": "STOLPORT", "airport_count": 2 } ] }, { "state": "FL", "airport_count": 856, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 511 }, { "fac_type": "HELIPORT", "airport_count": 280 }, { "fac_type": "SEAPLANE BASE", "airport_count": 43 }, { "fac_type": "STOLPORT", "airport_count": 13 }, { "fac_type": "ULTRALIGHT", "airport_count": 5 }, { "fac_type": "GLIDERPORT", "airport_count": 4 } ] }, { "state": "PA", "airport_count": 804, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 468 }, { "fac_type": "HELIPORT", "airport_count": 307 }, { "fac_type": "ULTRALIGHT", "airport_count": 13 }, { "fac_type": "SEAPLANE BASE", "airport_count": 10 }, { "fac_type": "STOLPORT", "airport_count": 3 }, { "fac_type": "GLIDERPORT", "airport_count": 3 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."state" as "state__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."fac_type" END as "fac_type__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,4 ) SELECT "state__0" as "state", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "fac_type": "fac_type__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1),[]) as "by_facility_type" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 5
Queries can contain multiple nested views.
using the above declared airports
source
run: airports3 -> { group_by: faa_region aggregate: airport_count nest: top_5_states nest: by_facility_type }
[ { "faa_region": "AGL", "airport_count": 4437, "top_5_states": [ { "state": "IL", "airport_count": 890 }, { "state": "OH", "airport_count": 749 }, { "state": "IN", "airport_count": 643 }, { "state": "WI", "airport_count": 543 }, { "state": "MN", "airport_count": 507 } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 3443 }, { "fac_type": "HELIPORT", "airport_count": 826 }, { "fac_type": "SEAPLANE BASE", "airport_count": 119 }, { "fac_type": "ULTRALIGHT", "airport_count": 30 }, { "fac_type": "STOLPORT", "airport_count": 11 }, { "fac_type": "GLIDERPORT", "airport_count": 4 }, { "fac_type": "BALLOONPORT", "airport_count": 4 } ] }, { "faa_region": "ASW", "airport_count": 3268, "top_5_states": [ { "state": "TX", "airport_count": 1845 }, { "state": "LA", "airport_count": 500 }, { "state": "OK", "airport_count": 443 }, { "state": "AR", "airport_count": 299 }, { "state": "NM", "airport_count": 181 } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 2341 }, { "fac_type": "HELIPORT", "airport_count": 861 }, { "fac_type": "ULTRALIGHT", "airport_count": 32 }, { "fac_type": "SEAPLANE BASE", "airport_count": 19 }, { "fac_type": "STOLPORT", "airport_count": 9 }, { "fac_type": "GLIDERPORT", "airport_count": 6 } ] }, { "faa_region": "ASO", "airport_count": 2924, "top_5_states": [ { "state": "FL", "airport_count": 856 }, { "state": "GA", "airport_count": 440 }, { "state": "NC", "airport_count": 400 }, { "state": "TN", "airport_count": 285 }, { "state": "AL", "airport_count": 260 } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 2038 }, { "fac_type": "HELIPORT", "airport_count": 770 }, { "fac_type": "SEAPLANE BASE", "airport_count": 57 }, { "fac_type": "STOLPORT", "airport_count": 33 }, { "fac_type": "ULTRALIGHT", "airport_count": 17 }, { "fac_type": "GLIDERPORT", "airport_count": 8 }, { "fac_type": "BALLOONPORT", "airport_count": 1 } ] }, { "faa_region": "AEA", "airport_count": 2586, "top_5_states": [ { "state": "PA", "airport_count": 804 }, { "state": "NY", "airport_count": 576 }, { "state": "VA", "airport_count": 421 }, { "state": "NJ", "airport_count": 378 }, { "state": "MD", "airport_count": 229 } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 1525 }, { "fac_type": "HELIPORT", "airport_count": 964 }, { "fac_type": "SEAPLANE BASE", "airport_count": 61 }, { "fac_type": "ULTRALIGHT", "airport_count": 18 }, { "fac_type": "STOLPORT", "airport_count": 8 }, { "fac_type": "GLIDERPORT", "airport_count": 7 }, { "fac_type": "BALLOONPORT", "airport_count": 3 } ] }, { "faa_region": "ANM", "airport_count": 2102, "top_5_states": [ { "state": "WA", "airport_count": 484 }, { "state": "OR", "airport_count": 441 }, { "state": "CO", "airport_count": 425 }, { "state": "MT", "airport_count": 259 }, { "state": "ID", "airport_count": 238 } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 1524 }, { "fac_type": "HELIPORT", "airport_count": 527 }, { "fac_type": "SEAPLANE BASE", "airport_count": 25 }, { "fac_type": "STOLPORT", "airport_count": 13 }, { "fac_type": "ULTRALIGHT", "airport_count": 8 }, { "fac_type": "GLIDERPORT", "airport_count": 4 }, { "fac_type": "BALLOONPORT", "airport_count": 1 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."faa_region" as "faa_region__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."state" END as "state__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1", CASE WHEN group_set=2 THEN base."fac_type" END as "fac_type__2", CASE WHEN group_set=2 THEN COUNT(1) END as "airport_count__2" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set ) as group_set GROUP BY 1,2,4,6 ) SELECT "faa_region__0" as "faa_region", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "state": "state__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:5],[]) as "top_5_states", COALESCE(LIST({ "fac_type": "fac_type__2", "airport_count": "airport_count__2"} ORDER BY "airport_count__2" desc NULLS LAST) FILTER (WHERE group_set=2),[]) as "by_facility_type" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST
Views can be nested to any level of depth.
using the above declared airports
source
run: airports3 -> { group_by: faa_region aggregate: airport_count nest: by_state_and_county is { group_by: state aggregate: airport_count nest: by_county is { group_by: county aggregate: airport_count limit: 4 } } nest: by_facility_type }
[ { "faa_region": "AGL", "airport_count": 4437, "by_state_and_county": [ { "state": "IL", "airport_count": 890, "by_county": [ { "county": "COOK", "airport_count": 51 }, { "county": "LA SALLE", "airport_count": 39 }, { "county": "MC HENRY", "airport_count": 29 }, { "county": "DE KALB", "airport_count": 27 } ] }, { "state": "OH", "airport_count": 749, "by_county": [ { "county": "CUYAHOGA", "airport_count": 27 }, { "county": "FRANKLIN", "airport_count": 27 }, { "county": "STARK", "airport_count": 23 }, { "county": "MONTGOMERY", "airport_count": 22 } ] }, { "state": "IN", "airport_count": 643, "by_county": [ { "county": "MARION", "airport_count": 27 }, { "county": "ALLEN", "airport_count": 24 }, { "county": "HAMILTON", "airport_count": 20 }, { "county": "MARSHALL", "airport_count": 18 } ] }, { "state": "WI", "airport_count": 543, "by_county": [ { "county": "DANE", "airport_count": 30 }, { "county": "WALWORTH", "airport_count": 22 }, { "county": "WINNEBAGO", "airport_count": 17 }, { "county": "ST CROIX", "airport_count": 16 } ] }, { "state": "MN", "airport_count": 507, "by_county": [ { "county": "ST LOUIS", "airport_count": 28 }, { "county": "HENNEPIN", "airport_count": 23 }, { "county": "CROW WING", "airport_count": 17 }, { "county": "DAKOTA", "airport_count": 17 } ] }, { "state": "MI", "airport_count": 489, "by_county": [ { "county": "OAKLAND", "airport_count": 25 }, { "county": "KENT", "airport_count": 24 }, { "county": "WAYNE", "airport_count": 17 }, { "county": "ST CLAIR", "airport_count": 15 } ] }, { "state": "ND", "airport_count": 436, "by_county": [ { "county": "CASS", "airport_count": 39 }, { "county": "RICHLAND", "airport_count": 20 }, { "county": "MC LEAN", "airport_count": 20 }, { "county": "WARD", "airport_count": 19 } ] }, { "state": "SD", "airport_count": 180, "by_county": [ { "county": "PENNINGTON", "airport_count": 7 }, { "county": "MINNEHAHA", "airport_count": 7 }, { "county": "MEADE", "airport_count": 7 }, { "county": "BRULE", "airport_count": 6 } ] } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 3443 }, { "fac_type": "HELIPORT", "airport_count": 826 }, { "fac_type": "SEAPLANE BASE", "airport_count": 119 }, { "fac_type": "ULTRALIGHT", "airport_count": 30 }, { "fac_type": "STOLPORT", "airport_count": 11 }, { "fac_type": "GLIDERPORT", "airport_count": 4 }, { "fac_type": "BALLOONPORT", "airport_count": 4 } ] }, { "faa_region": "ASW", "airport_count": 3268, "by_state_and_county": [ { "state": "TX", "airport_count": 1845, "by_county": [ { "county": "HARRIS", "airport_count": 135 }, { "county": "TARRANT", "airport_count": 63 }, { "county": "DENTON", "airport_count": 53 }, { "county": "DALLAS", "airport_count": 42 } ] }, { "state": "LA", "airport_count": 500, "by_county": [ { "county": "PLAQUEMINES", "airport_count": 31 }, { "county": "VERMILION", "airport_count": 29 }, { "county": "CALCASIEU", "airport_count": 23 }, { "county": "LAFOURCHE", "airport_count": 21 } ] }, { "state": "OK", "airport_count": 443, "by_county": [ { "county": "OKLAHOMA", "airport_count": 31 }, { "county": "TULSA", "airport_count": 25 }, { "county": "ROGERS", "airport_count": 16 }, { "county": "DELAWARE", "airport_count": 13 } ] }, { "state": "AR", "airport_count": 299, "by_county": [ { "county": "PULASKI", "airport_count": 20 }, { "county": "BENTON", "airport_count": 19 }, { "county": "LONOKE", "airport_count": 13 }, { "county": "MISSISSIPPI", "airport_count": 10 } ] }, { "state": "NM", "airport_count": 181, "by_county": [ { "county": "CATRON", "airport_count": 13 }, { "county": "LINCOLN", "airport_count": 10 }, { "county": "SANTA FE", "airport_count": 9 }, { "county": "BERNALILLO", "airport_count": 9 } ] } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 2341 }, { "fac_type": "HELIPORT", "airport_count": 861 }, { "fac_type": "ULTRALIGHT", "airport_count": 32 }, { "fac_type": "SEAPLANE BASE", "airport_count": 19 }, { "fac_type": "STOLPORT", "airport_count": 9 }, { "fac_type": "GLIDERPORT", "airport_count": 6 } ] }, { "faa_region": "ASO", "airport_count": 2924, "by_state_and_county": [ { "state": "FL", "airport_count": 856, "by_county": [ { "county": "PALM BEACH", "airport_count": 45 }, { "county": "DADE", "airport_count": 44 }, { "county": "POLK", "airport_count": 43 }, { "county": "MARION", "airport_count": 37 } ] }, { "state": "GA", "airport_count": 440, "by_county": [ { "county": "FULTON", "airport_count": 22 }, { "county": "PIKE", "airport_count": 17 }, { "county": "CARROLL", "airport_count": 14 }, { "county": "FAYETTE", "airport_count": 12 } ] }, { "state": "NC", "airport_count": 400, "by_county": [ { "county": "WAKE", "airport_count": 15 }, { "county": "MECKLENBURG", "airport_count": 14 }, { "county": "ROWAN", "airport_count": 14 }, { "county": "UNION", "airport_count": 14 } ] }, { "state": "TN", "airport_count": 285, "by_county": [ { "county": "SHELBY", "airport_count": 24 }, { "county": "DAVIDSON", "airport_count": 17 }, { "county": "KNOX", "airport_count": 14 }, { "county": "RUTHERFORD", "airport_count": 10 } ] }, { "state": "AL", "airport_count": 260, "by_county": [ { "county": "BALDWIN", "airport_count": 25 }, { "county": "JEFFERSON", "airport_count": 19 }, { "county": "MOBILE", "airport_count": 17 }, { "county": "MADISON", "airport_count": 14 } ] }, { "state": "MS", "airport_count": 243, "by_county": [ { "county": "WASHINGTON", "airport_count": 15 }, { "county": "HARRISON", "airport_count": 12 }, { "county": "HINDS", "airport_count": 11 }, { "county": "BOLIVAR", "airport_count": 9 } ] }, { "state": "KY", "airport_count": 202, "by_county": [ { "county": "JEFFERSON", "airport_count": 13 }, { "county": "FAYETTE", "airport_count": 7 }, { "county": "MC LEAN", "airport_count": 6 }, { "county": "BOONE", "airport_count": 6 } ] }, { "state": "SC", "airport_count": 189, "by_county": [ { "county": "GREENVILLE", "airport_count": 13 }, { "county": "BEAUFORT", "airport_count": 9 }, { "county": "LEXINGTON", "airport_count": 9 }, { "county": "CHARLESTON", "airport_count": 9 } ] }, { "state": "PR", "airport_count": 40, "by_county": [ { "county": "--PUERTO RICO", "airport_count": 40 } ] }, { "state": "VI", "airport_count": 9, "by_county": [ { "county": "-VIRGIN ISLANDS-", "airport_count": 9 } ] } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 2038 }, { "fac_type": "HELIPORT", "airport_count": 770 }, { "fac_type": "SEAPLANE BASE", "airport_count": 57 }, { "fac_type": "STOLPORT", "airport_count": 33 }, { "fac_type": "ULTRALIGHT", "airport_count": 17 }, { "fac_type": "GLIDERPORT", "airport_count": 8 }, { "fac_type": "BALLOONPORT", "airport_count": 1 } ] }, { "faa_region": "AEA", "airport_count": 2586, "by_state_and_county": [ { "state": "PA", "airport_count": 804, "by_county": [ { "county": "BUCKS", "airport_count": 55 }, { "county": "MONTGOMERY", "airport_count": 44 }, { "county": "ALLEGHENY", "airport_count": 31 }, { "county": "CHESTER", "airport_count": 27 } ] }, { "state": "NY", "airport_count": 576, "by_county": [ { "county": "SUFFOLK", "airport_count": 34 }, { "county": "ERIE", "airport_count": 26 }, { "county": "DUTCHESS", "airport_count": 20 }, { "county": "NIAGARA", "airport_count": 20 } ] }, { "state": "VA", "airport_count": 421, "by_county": [ { "county": "FAUQUIER", "airport_count": 23 }, { "county": "SHENANDOAH", "airport_count": 13 }, { "county": "ACCOMACK", "airport_count": 12 }, { "county": "FAIRFAX", "airport_count": 12 } ] }, { "state": "NJ", "airport_count": 378, "by_county": [ { "county": "MONMOUTH", "airport_count": 31 }, { "county": "MIDDLESEX", "airport_count": 29 }, { "county": "BURLINGTON", "airport_count": 29 }, { "county": "HUNTERDON", "airport_count": 27 } ] }, { "state": "MD", "airport_count": 229, "by_county": [ { "county": "BALTIMORE", "airport_count": 24 }, { "county": "ANNE ARUNDEL", "airport_count": 14 }, { "county": "CARROLL", "airport_count": 14 }, { "county": "PRINCE GEORGES", "airport_count": 14 } ] }, { "state": "WV", "airport_count": 116, "by_county": [ { "county": "KANAWHA", "airport_count": 8 }, { "county": "MASON", "airport_count": 7 }, { "county": "RALEIGH", "airport_count": 6 }, { "county": "PRESTON", "airport_count": 6 } ] }, { "state": "DE", "airport_count": 42, "by_county": [ { "county": "KENT", "airport_count": 17 }, { "county": "SUSSEX", "airport_count": 14 }, { "county": "NEW CASTLE", "airport_count": 11 } ] }, { "state": "DC", "airport_count": 20, "by_county": [ { "county": "WASHINGTON", "airport_count": 18 }, { "county": "ARLINGTON", "airport_count": 1 }, { "county": "LOUDOUN", "airport_count": 1 } ] } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 1525 }, { "fac_type": "HELIPORT", "airport_count": 964 }, { "fac_type": "SEAPLANE BASE", "airport_count": 61 }, { "fac_type": "ULTRALIGHT", "airport_count": 18 }, { "fac_type": "STOLPORT", "airport_count": 8 }, { "fac_type": "GLIDERPORT", "airport_count": 7 }, { "fac_type": "BALLOONPORT", "airport_count": 3 } ] }, { "faa_region": "ANM", "airport_count": 2102, "by_state_and_county": [ { "state": "WA", "airport_count": 484, "by_county": [ { "county": "KING", "airport_count": 61 }, { "county": "PIERCE", "airport_count": 27 }, { "county": "SPOKANE", "airport_count": 26 }, { "county": "SNOHOMISH", "airport_count": 25 } ] }, { "state": "OR", "airport_count": 441, "by_county": [ { "county": "CLACKAMAS", "airport_count": 34 }, { "county": "LINN", "airport_count": 26 }, { "county": "LANE", "airport_count": 24 }, { "county": "WASHINGTON", "airport_count": 24 } ] }, { "state": "CO", "airport_count": 425, "by_county": [ { "county": "WELD", "airport_count": 40 }, { "county": "EL PASO", "airport_count": 26 }, { "county": "ADAMS", "airport_count": 23 }, { "county": "JEFFERSON", "airport_count": 21 } ] }, { "state": "MT", "airport_count": 259, "by_county": [ { "county": "FLATHEAD", "airport_count": 20 }, { "county": "LEWIS AND CLARK", "airport_count": 16 }, { "county": "MISSOULA", "airport_count": 14 }, { "county": "GALLATIN", "airport_count": 11 } ] }, { "state": "ID", "airport_count": 238, "by_county": [ { "county": "VALLEY", "airport_count": 27 }, { "county": "KOOTENAI", "airport_count": 21 }, { "county": "IDAHO", "airport_count": 18 }, { "county": "BONNER", "airport_count": 18 } ] }, { "state": "UT", "airport_count": 140, "by_county": [ { "county": "SALT LAKE", "airport_count": 21 }, { "county": "SAN JUAN", "airport_count": 13 }, { "county": "UTAH", "airport_count": 10 }, { "county": "IRON", "airport_count": 9 } ] }, { "state": "WY", "airport_count": 115, "by_county": [ { "county": "LARAMIE", "airport_count": 12 }, { "county": "CAMPBELL", "airport_count": 9 }, { "county": "CARBON", "airport_count": 9 }, { "county": "PARK", "airport_count": 9 } ] } ], "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 1524 }, { "fac_type": "HELIPORT", "airport_count": 527 }, { "fac_type": "SEAPLANE BASE", "airport_count": 25 }, { "fac_type": "STOLPORT", "airport_count": 13 }, { "fac_type": "ULTRALIGHT", "airport_count": 8 }, { "fac_type": "GLIDERPORT", "airport_count": 4 }, { "fac_type": "BALLOONPORT", "airport_count": 1 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."faa_region" as "faa_region__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set IN (1,2) THEN base."state" END as "state__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1", CASE WHEN group_set=2 THEN base."county" END as "county__2", CASE WHEN group_set=2 THEN COUNT(1) END as "airport_count__2", CASE WHEN group_set=3 THEN base."fac_type" END as "fac_type__3", CASE WHEN group_set=3 THEN COUNT(1) END as "airport_count__3" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,3,1)) as group_set ) as group_set GROUP BY 1,2,4,6,8 ) , __stage1 AS ( SELECT CASE WHEN group_set=2 THEN 1 ELSE group_set END as group_set, "faa_region__0" as "faa_region__0", FIRST("airport_count__0") FILTER (WHERE "airport_count__0" IS NOT NULL) as "airport_count__0", CASE WHEN group_set IN (1,2) THEN "state__1" END as "state__1", FIRST("airport_count__1") FILTER (WHERE "airport_count__1" IS NOT NULL) as "airport_count__1", COALESCE(LIST({ "county": "county__2", "airport_count": "airport_count__2"} ORDER BY "airport_count__2" desc NULLS LAST) FILTER (WHERE group_set=2)[1:4],[]) as "by_county__1", CASE WHEN group_set=3 THEN "fac_type__3" END as "fac_type__3", FIRST("airport_count__3") FILTER (WHERE "airport_count__3" IS NOT NULL) as "airport_count__3" FROM __stage0 GROUP BY 1,2,4,7 ) SELECT "faa_region__0" as "faa_region", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "state": "state__1", "airport_count": "airport_count__1", "by_county": "by_county__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1),[]) as "by_state_and_county", COALESCE(LIST({ "fac_type": "fac_type__3", "airport_count": "airport_count__3"} ORDER BY "airport_count__3" desc NULLS LAST) FILTER (WHERE group_set=3),[]) as "by_facility_type" FROM __stage1 GROUP BY 1 ORDER BY 2 desc NULLS LAST
Refining a View
The gesture + { refinements }
allows you to base a new view on an existing view while adding new refinements to the query terms.
For example we can add a limit and an order by to by_state
run: airports2 -> by_state + { order_by: state desc // <-- add order by to query limit: 2 }
[ { "state": "WY", "airport_count": 115 }, { "state": "WV", "airport_count": 116 } ]
SELECT base."state" as "state", COUNT(1) as "airport_count" FROM '../data/airports.parquet' as base GROUP BY 1 ORDER BY 1 desc NULLS LAST LIMIT 2
is the same as
run: airports -> { group_by: state aggregate: airport_count order_by: state desc limit: 2 }
[ { "state": "WY", "airport_count": 115 }, { "state": "WV", "airport_count": 116 } ]
SELECT base."state" as "state", COUNT(1) as "airport_count" FROM '../data/airports.parquet' as base GROUP BY 1 ORDER BY 1 desc NULLS LAST LIMIT 2
You can add a measure or dimension
run: airports3 -> by_facility_type + { aggregate: avg_elevation }
[ { "fac_type": "AIRPORT", "airport_count": 13925, "avg_elevation": 1237.0441651705567 }, { "fac_type": "HELIPORT", "airport_count": 5135, "avg_elevation": 950.5125608568646 }, { "fac_type": "SEAPLANE BASE", "airport_count": 473, "avg_elevation": 488.82241014799155 }, { "fac_type": "ULTRALIGHT", "airport_count": 125, "avg_elevation": 806.144 }, { "fac_type": "STOLPORT", "airport_count": 86, "avg_elevation": 1375.046511627907 } ]
SELECT base."fac_type" as "fac_type", COUNT(1) as "airport_count", AVG(base."elevation") as "avg_elevation" FROM '../data/airports.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
You can nest another view
run: airports3 -> top_5_states + { nest: by_facility_type }
[ { "state": "TX", "airport_count": 1845, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 1389 }, { "fac_type": "HELIPORT", "airport_count": 435 }, { "fac_type": "ULTRALIGHT", "airport_count": 8 }, { "fac_type": "STOLPORT", "airport_count": 8 }, { "fac_type": "GLIDERPORT", "airport_count": 5 } ] }, { "state": "CA", "airport_count": 984, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 569 }, { "fac_type": "HELIPORT", "airport_count": 396 }, { "fac_type": "SEAPLANE BASE", "airport_count": 12 }, { "fac_type": "GLIDERPORT", "airport_count": 3 }, { "fac_type": "STOLPORT", "airport_count": 2 }, { "fac_type": "ULTRALIGHT", "airport_count": 2 } ] }, { "state": "IL", "airport_count": 890, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 625 }, { "fac_type": "HELIPORT", "airport_count": 245 }, { "fac_type": "SEAPLANE BASE", "airport_count": 8 }, { "fac_type": "ULTRALIGHT", "airport_count": 6 }, { "fac_type": "BALLOONPORT", "airport_count": 2 }, { "fac_type": "GLIDERPORT", "airport_count": 2 }, { "fac_type": "STOLPORT", "airport_count": 2 } ] }, { "state": "FL", "airport_count": 856, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 511 }, { "fac_type": "HELIPORT", "airport_count": 280 }, { "fac_type": "SEAPLANE BASE", "airport_count": 43 }, { "fac_type": "STOLPORT", "airport_count": 13 }, { "fac_type": "ULTRALIGHT", "airport_count": 5 }, { "fac_type": "GLIDERPORT", "airport_count": 4 } ] }, { "state": "PA", "airport_count": 804, "by_facility_type": [ { "fac_type": "AIRPORT", "airport_count": 468 }, { "fac_type": "HELIPORT", "airport_count": 307 }, { "fac_type": "ULTRALIGHT", "airport_count": 13 }, { "fac_type": "SEAPLANE BASE", "airport_count": 10 }, { "fac_type": "STOLPORT", "airport_count": 3 }, { "fac_type": "GLIDERPORT", "airport_count": 3 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."state" as "state__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."fac_type" END as "fac_type__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,4 ) SELECT "state__0" as "state", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "fac_type": "fac_type__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1),[]) as "by_facility_type" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 5
Changing the inner and outer query in the example above reveals very different information.
run: airports3 -> by_facility_type + { nest: top_5_states }
[ { "fac_type": "AIRPORT", "airport_count": 13925, "top_5_states": [ { "state": "TX", "airport_count": 1389 }, { "state": "IL", "airport_count": 625 }, { "state": "CA", "airport_count": 569 }, { "state": "OH", "airport_count": 537 }, { "state": "FL", "airport_count": 511 } ] }, { "fac_type": "HELIPORT", "airport_count": 5135, "top_5_states": [ { "state": "TX", "airport_count": 435 }, { "state": "CA", "airport_count": 396 }, { "state": "PA", "airport_count": 307 }, { "state": "FL", "airport_count": 280 }, { "state": "NJ", "airport_count": 247 } ] }, { "fac_type": "SEAPLANE BASE", "airport_count": 473, "top_5_states": [ { "state": "AK", "airport_count": 104 }, { "state": "MN", "airport_count": 72 }, { "state": "FL", "airport_count": 43 }, { "state": "ME", "airport_count": 38 }, { "state": "NY", "airport_count": 23 } ] }, { "fac_type": "ULTRALIGHT", "airport_count": 125, "top_5_states": [ { "state": "LA", "airport_count": 18 }, { "state": "IN", "airport_count": 17 }, { "state": "PA", "airport_count": 13 }, { "state": "TX", "airport_count": 8 }, { "state": "AZ", "airport_count": 7 } ] }, { "fac_type": "STOLPORT", "airport_count": 86, "top_5_states": [ { "state": "FL", "airport_count": 13 }, { "state": "TN", "airport_count": 9 }, { "state": "TX", "airport_count": 8 }, { "state": "CO", "airport_count": 6 }, { "state": "NC", "airport_count": 4 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."fac_type" as "fac_type__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."state" END as "state__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,4 ) SELECT "fac_type__0" as "fac_type", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "state": "state__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:5],[]) as "top_5_states" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST
Joining
First let's model some simple tables... (Join Documentation)
Carrier table
simple source declaration used in example below
source: carriers is duckdb.table('../data/carriers.parquet') extend { measure: carrier_count is count() } run: carriers -> { select: * }
[ { "code": "EV", "name": "Atlantic Southeast Airlines", "nickname": "Atlantic Southeast" }, { "code": "NW", "name": "Northwest Airlines", "nickname": "Northwest" }, { "code": "AA", "name": "American Airlines", "nickname": "American" }, { "code": "FL", "name": "Airtran Airways Corporation", "nickname": "Airtran" }, { "code": "B6", "name": "Jetblue Airways", "nickname": "Jetblue" } ]
SELECT base."code" as "code", base."name" as "name", base."nickname" as "nickname" FROM '../data/carriers.parquet' as base
Flights table
simple source declaration used in example below
source: flights is duckdb.table('../data/flights.parquet') extend { measure: flight_count is count() } run: flights -> { select: id2, tail_num, dep_time, carrier, origin, destination, distance, dep_delay limit: 10 }
[ { "id2": 30272525, "tail_num": "N806MD", "dep_time": "2004-11-18T22:32:00.000Z", "carrier": "US", "origin": "PHL", "destination": "ABE", "distance": 55, "dep_delay": -3 }, { "id2": 29742442, "tail_num": "N806MD", "dep_time": "2004-10-12T20:46:00.000Z", "carrier": "US", "origin": "PHL", "destination": "ABE", "distance": 55, "dep_delay": 6 }, { "id2": 30270885, "tail_num": "N816MA", "dep_time": "2004-11-24T10:20:00.000Z", "carrier": "US", "origin": "PHL", "destination": "ABE", "distance": 55, "dep_delay": 0 }, { "id2": 28344746, "tail_num": "N806MD", "dep_time": "2004-08-31T20:30:00.000Z", "carrier": "US", "origin": "PHL", "destination": "ABE", "distance": 55, "dep_delay": 0 }, { "id2": 27898410, "tail_num": "N806MD", "dep_time": "2004-07-27T10:21:00.000Z", "carrier": "US", "origin": "PHL", "destination": "ABE", "distance": 55, "dep_delay": -4 } ]
SELECT base."id2" as "id2", base."tail_num" as "tail_num", base."dep_time" as "dep_time", base."carrier" as "carrier", base."origin" as "origin", base."destination" as "destination", base."distance" as "distance", base."dep_delay" as "dep_delay" FROM '../data/flights.parquet' as base LIMIT 10
Declare a Join
Join carriers to flights. Each flight has one carrier so we use join_one:
.
(Join Documentation)
source: carriers2 is duckdb.table('../data/carriers.parquet') extend { measure: carrier_count is count() } source: flights2 is duckdb.table('../data/flights.parquet') extend { join_one: carriers2 on carrier = carriers2.code measure: flight_count is count() total_distance is distance.sum() avg_distance is distance.avg() }
Query the joined tables
using the above declared flights
source
run: flights2 -> { group_by: carriers2.nickname aggregate: flight_count total_distance avg_distance }
[ { "nickname": "Southwest", "flight_count": 88751, "total_distance": 54619152, "avg_distance": 615.4201304774031 }, { "nickname": "USAir", "flight_count": 37683, "total_distance": 23721642, "avg_distance": 629.5051349414855 }, { "nickname": "American", "flight_count": 34577, "total_distance": 37684885, "avg_distance": 1089.8830147207682 }, { "nickname": "Northwest", "flight_count": 33580, "total_distance": 33376503, "avg_distance": 993.9399344848124 }, { "nickname": "United", "flight_count": 32757, "total_distance": 38882934, "avg_distance": 1187.0114479347926 } ]
SELECT carriers2_0."nickname" as "nickname", COUNT(1) as "flight_count", COALESCE(SUM(base."distance"),0) as "total_distance", AVG(base."distance") as "avg_distance" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers2_0 ON base."carrier"=carriers2_0."code" GROUP BY 1 ORDER BY 2 desc NULLS LAST
Aggregates can be computed from anywhere in the Join Tree
using the above declared flights
source
run: flights2 -> { limit: 10 group_by: origin aggregate: carriers2.carrier_count // <-- calculation in joined table nest: top_3_carriers is { limit: 3 group_by: carriers2.nickname aggregate: flight_count total_distance avg_distance } }
[ { "origin": "DFW", "carrier_count": 13, "top_3_carriers": [ { "nickname": "American", "flight_count": 8742, "total_distance": 8419987, "avg_distance": 963.1648364218714 }, { "nickname": "American Eagle", "flight_count": 6146, "total_distance": 1464011, "avg_distance": 238.20549951187763 }, { "nickname": "Delta", "flight_count": 668, "total_distance": 563628, "avg_distance": 843.754491017964 } ] }, { "origin": "ATL", "carrier_count": 12, "top_3_carriers": [ { "nickname": "Delta", "flight_count": 8419, "total_distance": 5793004, "avg_distance": 688.0869461931346 }, { "nickname": "Atlantic Southeast", "flight_count": 7392, "total_distance": 1473933, "avg_distance": 199.39569805194805 }, { "nickname": "USAir", "flight_count": 521, "total_distance": 257058, "avg_distance": 493.3934740882917 } ] }, { "origin": "LAX", "carrier_count": 12, "top_3_carriers": [ { "nickname": "Southwest", "flight_count": 4282, "total_distance": 2637054, "avg_distance": 615.8463334890238 }, { "nickname": "United", "flight_count": 2319, "total_distance": 3441449, "avg_distance": 1484.022854678741 }, { "nickname": "American", "flight_count": 1951, "total_distance": 3451194, "avg_distance": 1768.9359302921578 } ] }, { "origin": "IND", "carrier_count": 12, "top_3_carriers": [ { "nickname": "Southwest", "flight_count": 632, "total_distance": 486197, "avg_distance": 769.2990506329114 }, { "nickname": "Northwest", "flight_count": 344, "total_distance": 183698, "avg_distance": 534.0058139534884 }, { "nickname": "USAir", "flight_count": 244, "total_distance": 110811, "avg_distance": 454.14344262295083 } ] }, { "origin": "SEA", "carrier_count": 12, "top_3_carriers": [ { "nickname": "Alaska", "flight_count": 3030, "total_distance": 2633111, "avg_distance": 869.0135313531354 }, { "nickname": "Southwest", "flight_count": 1409, "total_distance": 1064397, "avg_distance": 755.4272533711852 }, { "nickname": "Northwest", "flight_count": 751, "total_distance": 1228918, "avg_distance": 1636.375499334221 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."origin" as "origin__0", CASE WHEN group_set=0 THEN COUNT(DISTINCT carriers2_0."__distinct_key") END as "carrier_count__0", CASE WHEN group_set=1 THEN carriers2_0."nickname" END as "nickname__1", CASE WHEN group_set=1 THEN COUNT(1) END as "flight_count__1", CASE WHEN group_set=1 THEN COALESCE(SUM(base."distance"),0) END as "total_distance__1", CASE WHEN group_set=1 THEN AVG(base."distance") END as "avg_distance__1" FROM '../data/flights.parquet' as base LEFT JOIN (SELECT GEN_RANDOM_UUID() as "__distinct_key", x.* FROM '../data/carriers.parquet' as x) AS carriers2_0 ON base."carrier"=carriers2_0."code" CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,4 ) SELECT "origin__0" as "origin", MAX(CASE WHEN group_set=0 THEN "carrier_count__0" END) as "carrier_count", COALESCE(LIST({ "nickname": "nickname__1", "flight_count": "flight_count__1", "total_distance": "total_distance__1", "avg_distance": "avg_distance__1"} ORDER BY "flight_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:3],[]) as "top_3_carriers" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 10
More Complex Joins
The most common join pattern is a foreign key join. Malloy uses the with:
to declare these and generates more efficient SQL when these joins are used.
In the example below, we use a with:
join for carriers
and then model the more complex relationship with the flights
originating from each airport
using on:
.
Many flights
have the same
airport
as their origin so we use join_many:
.
source: carriers4 is duckdb.table('../data/carriers.parquet') extend { primary_key: code measure: carrier_count is count() } source: flights4 is duckdb.table('../data/flights.parquet') extend { join_one: carriers4 with carrier // <-- each flight has 1 carrier measure: flight_count is count() total_distance is distance.sum() avg_distance is distance.avg() } source: airports4 is duckdb.table('../data/airports.parquet') extend { join_many: flights4 on code = flights4.origin // <-- each airport has many flights measure: airport_count is count() dimension: elevation_in_meters is elevation * 0.3048 measure: avg_elevation_in_meters is elevation_in_meters.avg() view: by_state is { group_by: state aggregate: airport_count } }
Calculations work properly regardless of where you are in the graph
This query is very difficult to express in SQL. Malloy's understanding of source relationships allows it to compute aggregate computations at any node of the join path, unlike SQL which can only do aggregate computation at the. outermost level. (Aggregate Documentation)
using the above declared airports
source
run: airports4 -> { group_by: state aggregate: flights4.carriers4.carrier_count // <-- 3 levels flights4.flight_count flights4.total_distance airport_count avg_elevation_in_meters // <-- symmetric calculation }
[ { "state": "TX", "carrier_count": 14, "flight_count": 40085, "total_distance": 24171182, "airport_count": 1845, "avg_elevation_in_meters": 273.7994445528456 }, { "state": "CA", "carrier_count": 14, "flight_count": 40670, "total_distance": 37690414, "airport_count": 984, "avg_elevation_in_meters": 331.53319024390237 }, { "state": "IN", "carrier_count": 13, "flight_count": 2324, "total_distance": 1495747, "airport_count": 643, "avg_elevation_in_meters": 231.50816174183532 }, { "state": "LA", "carrier_count": 13, "flight_count": 4246, "total_distance": 2644460, "airport_count": 500, "avg_elevation_in_meters": 19.65045599999995 }, { "state": "OH", "carrier_count": 12, "flight_count": 7327, "total_distance": 3562086, "airport_count": 749, "avg_elevation_in_meters": 279.49753057409856 } ]
SELECT base."state" as "state", COUNT(DISTINCT carriers4_0."code") as "carrier_count", COUNT(DISTINCT flights4_0."__distinct_key") as "flight_count", COALESCE(SUM(flights4_0."distance"),0) as "total_distance", COUNT(DISTINCT base."__distinct_key") as "airport_count", ( SELECT AVG(a.val) as value FROM ( SELECT UNNEST(list(distinct {key:base."__distinct_key", val: (base."elevation"*0.3048::DOUBLE)})) a ) ) as "avg_elevation_in_meters" FROM (SELECT GEN_RANDOM_UUID() as "__distinct_key", x.* FROM '../data/airports.parquet' as x) as base LEFT JOIN (SELECT GEN_RANDOM_UUID() as "__distinct_key", x.* FROM '../data/flights.parquet' as x) AS flights4_0 ON base."code"=flights4_0."origin" LEFT JOIN '../data/carriers.parquet' AS carriers4_0 ON carriers4_0."code"=flights4_0."carrier" GROUP BY 1 ORDER BY 2 desc NULLS LAST
Pipelines
The output of a query can be used as the source for the next query.
Assume the following query as a starting point.
source: airports5 is duckdb.table('../data/airports.parquet') extend { measure: airport_count is count() } run: airports5 -> { where: fac_type = 'HELIPORT' group_by: state aggregate: airport_count nest: top_3_county is { limit: 3 group_by: county aggregate: airport_count } }
[ { "state": "TX", "airport_count": 435, "top_3_county": [ { "county": "HARRIS", "airport_count": 110 }, { "county": "TARRANT", "airport_count": 35 }, { "county": "DALLAS", "airport_count": 32 } ] }, { "state": "CA", "airport_count": 396, "top_3_county": [ { "county": "LOS ANGELES", "airport_count": 151 }, { "county": "ORANGE", "airport_count": 47 }, { "county": "SAN BERNARDINO", "airport_count": 24 } ] }, { "state": "PA", "airport_count": 307, "top_3_county": [ { "county": "MONTGOMERY", "airport_count": 29 }, { "county": "ALLEGHENY", "airport_count": 22 }, { "county": "PHILADELPHIA", "airport_count": 22 } ] }, { "state": "FL", "airport_count": 280, "top_3_county": [ { "county": "PALM BEACH", "airport_count": 30 }, { "county": "DADE", "airport_count": 27 }, { "county": "ORANGE", "airport_count": 24 } ] }, { "state": "NJ", "airport_count": 247, "top_3_county": [ { "county": "MIDDLESEX", "airport_count": 26 }, { "county": "MONMOUTH", "airport_count": 23 }, { "county": "SOMERSET", "airport_count": 23 } ] } ]
WITH __stage0 AS ( SELECT group_set, base."state" as "state__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."county" END as "county__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set WHERE base."fac_type"='HELIPORT' GROUP BY 1,2,4 ) SELECT "state__0" as "state", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "county": "county__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:3],[]) as "top_3_county" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST
Un-nesting in a pipeline flattens the table
Queries can be chained together (pipelined), the output of one becoming the input of the next one, by simply adding another ->
operator and a new query definition.
source: airports6 is duckdb.table('../data/airports.parquet') extend { measure: airport_count is count() } run: airports6 -> { where: fac_type = 'HELIPORT' group_by: state aggregate: airport_count nest: top_3_county is { limit: 3 group_by: county aggregate: airport_count } } -> { select: state top_3_county.county airports_in_state is airport_count airports_in_county is top_3_county.airport_count percent_of_state is top_3_county.airport_count / airport_count }
[ { "state": "MI", "county": "OAKLAND", "airports_in_state": 87, "airports_in_county": 18, "percent_of_state": 0.20689655172413793 }, { "state": "MI", "county": "WAYNE", "airports_in_state": 87, "airports_in_county": 12, "percent_of_state": 0.13793103448275862 }, { "state": "MI", "county": "KENT", "airports_in_state": 87, "airports_in_county": 8, "percent_of_state": 0.09195402298850575 }, { "state": "TX", "county": "HARRIS", "airports_in_state": 435, "airports_in_county": 110, "percent_of_state": 0.25287356321839083 }, { "state": "TX", "county": "TARRANT", "airports_in_state": 435, "airports_in_county": 35, "percent_of_state": 0.08045977011494253 } ]
WITH __stage0 AS ( SELECT group_set, base."state" as "state__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."county" END as "county__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set WHERE base."fac_type"='HELIPORT' GROUP BY 1,2,4 ) , __stage1 AS ( SELECT "state__0" as "state", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "county": "county__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:3],[]) as "top_3_county" FROM __stage0 GROUP BY 1 ) SELECT base."state" as "state", top_3_county_0."county" as "county", base."airport_count" as "airports_in_state", top_3_county_0."airport_count" as "airports_in_county", top_3_county_0."airport_count"*1.0/base."airport_count" as "percent_of_state" FROM __stage1 as base LEFT JOIN LATERAL (SELECT UNNEST(base."top_3_county"), 1 as ignoreme) as top_3_county_0_outer(top_3_county_0,ignoreme) ON top_3_county_0_outer.ignoreme=1
Pipelines can be named as views in sources
Pipelines can do pretty complex things. They can be built into source objects.
source: airports7 is duckdb.table('../data/airports.parquet') extend { measure: airport_count is count() view: county_rollup is { where: fac_type = 'HELIPORT' group_by: state aggregate: airport_count nest: top_3_county is { limit: 3 group_by: county aggregate: airport_count } } -> { select: state top_3_county.county airports_in_state is airport_count airports_in_county is top_3_county.airport_count percent_of_state is top_3_county.airport_count/airport_count } } run: airports7 -> county_rollup
[ { "state": "CT", "county": "HARTFORD", "airports_in_state": 89, "airports_in_county": 30, "percent_of_state": 0.33707865168539325 }, { "state": "CT", "county": "NEW HAVEN", "airports_in_state": 89, "airports_in_county": 20, "percent_of_state": 0.2247191011235955 }, { "state": "CT", "county": "FAIRFIELD", "airports_in_state": 89, "airports_in_county": 18, "percent_of_state": 0.20224719101123595 }, { "state": "OH", "county": "CUYAHOGA", "airports_in_state": 201, "airports_in_county": 22, "percent_of_state": 0.10945273631840796 }, { "state": "OH", "county": "FRANKLIN", "airports_in_state": 201, "airports_in_county": 20, "percent_of_state": 0.09950248756218906 } ]
WITH __stage0 AS ( SELECT group_set, base."state" as "state__0", CASE WHEN group_set=0 THEN COUNT(1) END as "airport_count__0", CASE WHEN group_set=1 THEN base."county" END as "county__1", CASE WHEN group_set=1 THEN COUNT(1) END as "airport_count__1" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set WHERE base."fac_type"='HELIPORT' GROUP BY 1,2,4 ) , __stage1 AS ( SELECT "state__0" as "state", MAX(CASE WHEN group_set=0 THEN "airport_count__0" END) as "airport_count", COALESCE(LIST({ "county": "county__1", "airport_count": "airport_count__1"} ORDER BY "airport_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:3],[]) as "top_3_county" FROM __stage0 GROUP BY 1 ) SELECT base."state" as "state", top_3_county_0."county" as "county", base."airport_count" as "airports_in_state", top_3_county_0."airport_count" as "airports_in_county", top_3_county_0."airport_count"*1.0/base."airport_count" as "percent_of_state" FROM __stage1 as base LEFT JOIN LATERAL (SELECT UNNEST(base."top_3_county"), 1 as ignoreme) as top_3_county_0_outer(top_3_county_0,ignoreme) ON top_3_county_0_outer.ignoreme=1
Extending Sources
Similar to query refinement, a source can be extended with the extend { ... }
gesture to create a new version of the source with additional properties.
source: new name is old name extend { where: some data limit measure: new field declarations }
Sources based on Queries
Sometimes it is helpful to pre-process data before modeling it out with a source. To do this, you can first define a model-level query using a query statement, then extend the query to add reusable definitions.
Model-Level Queries
Here, we define a model-level query q_airport_facts
based on the flights.parquet
table.
query: q_airport_facts is duckdb.table('../data/flights.parquet') -> { group_by: flight_year is dep_time.year origin carrier aggregate: num_flights is count() distance is distance.sum() } run: q_airport_facts query: q_airport_facts2 is duckdb.table('../data/flights.parquet') -> { group_by: flight_year is dep_time.year origin carrier aggregate: num_flights is count() distance is distance.sum() }
[ { "flight_year": "2005-01-01T00:00:00.000Z", "origin": "CLE", "carrier": "RU", "num_flights": 1307, "distance": 501816 }, { "flight_year": "2005-01-01T00:00:00.000Z", "origin": "DFW", "carrier": "MQ", "num_flights": 1503, "distance": 370998 }, { "flight_year": "2005-01-01T00:00:00.000Z", "origin": "LBB", "carrier": "WN", "num_flights": 39, "distance": 14747 }, { "flight_year": "2005-01-01T00:00:00.000Z", "origin": "PHX", "carrier": "HP", "num_flights": 528, "distance": 375489 }, { "flight_year": "2005-01-01T00:00:00.000Z", "origin": "PDX", "carrier": "WN", "num_flights": 229, "distance": 137081 } ]
SELECT DATE_TRUNC('year', base."dep_time") as "flight_year", base."origin" as "origin", base."carrier" as "carrier", COUNT(1) as "num_flights", COALESCE(SUM(base."distance"),0) as "distance" FROM '../data/flights.parquet' as base GROUP BY 1,2,3 ORDER BY 1 desc NULLS LAST
Source based on a query
Next, we can define a source based on q_airport_facts
to add reusable fields and source-level queries.
source: airport_facts is q_airport_facts extend { measure: flight_count is num_flights.sum() measure: total_distance is distance.sum() view: flights_by_year is { group_by: flight_year aggregate: flight_count carrier_count is count(carrier) origin_count is count(origin) } view: flights_by_origin is { group_by: origin aggregate: flight_count carrier_count is count(carrier) } }
Querying the Summary source
run: airport_facts -> flights_by_origin
[ { "origin": "ATL", "flight_count": 17875, "carrier_count": 12 }, { "origin": "DFW", "flight_count": 17782, "carrier_count": 13 }, { "origin": "ORD", "flight_count": 14214, "carrier_count": 10 }, { "origin": "PHX", "flight_count": 12476, "carrier_count": 12 }, { "origin": "LAS", "flight_count": 11096, "carrier_count": 10 } ]
WITH __stage0 AS ( SELECT DATE_TRUNC('year', base."dep_time") as "flight_year", base."origin" as "origin", base."carrier" as "carrier", COUNT(1) as "num_flights", COALESCE(SUM(base."distance"),0) as "distance" FROM '../data/flights.parquet' as base GROUP BY 1,2,3 ) SELECT base."origin" as "origin", COALESCE(SUM(base."num_flights"),0) as "flight_count", count(distinct base."carrier") as "carrier_count" FROM __stage0 as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Other Interesting Language Features:
SQL Sources
See the SQL Sources section.