This guide expands on information shared in Malloy for SQL Folks.
CTE and Refactoring for Reusability.
Malloy has the notion of query
s and source
s. It can be a little confusing when to use one or the others.
A query
is a transformation that returns a table. If you are doing a transfomation of data to be joined in another query or source, make a query
.
A source
is datasource, like a table with added declarations that can be used against the data it contains.
A source
can also be joined with sources and queries to be a network of data that you can query.
Example SQL
This SQL block should be used as a reference when reading the section below it.
-- conneciton: duckdb WITH orig_facts as ( -- (1) CTE computes a rollup of flights SELECT origin as code, COUNT(*) as total_flights FROM '../data/flights.parquet' as flights GROUP BY 1 ) SELECT orig.code, orig.city, orig_facts.total_flights as total_flights_to_origin, carriers.nickname, count(*) as flight_count, count(*)/(orig_facts.total_flights*1.0) as percent_of_flights_to_origin -- (2) division is integer by default FROM '../data/flights.parquet' as flights LEFT JOIN '../data/airports.parquet' as orig ON flights.origin = orig.code LEFT JOIN orig_facts ON flights.origin = orig_facts.code LEFT JOIN '../data/carriers.parquet' as carriers ON flights.carrier = carriers.code GROUP BY 1,2,3,4 ORDER BY 5 DESC LIMIT 10
Translate the structure of the query.
For each table in the query, make a
source:
and add relevent calculations. Aggregate calculations are declared asmeasure:
s. Scalar calculations are declared asdimension:
s. These calculations can be used in queries, but simply naming them. We make sources for the flights, airports and carriers tables.For each CTE, declare a
query:
. In SQL, each CTE stage produces a table that is used in subsequent stages in the query. It works the same in malloy. CTEs are declared and named in aquery:
block. They usually use one of the previously definedsource:
s as a base. These queries are usually result infact
tables as they show up at some level of granularity. We declareorigin_facts
.Finally, declare the
source:
that you are going base your complex queries oflights_explore
.flights_explore
inherits fromflights
and adds joins to the above tables.
source: flights is duckdb.table('../data/flights.parquet') extend { -- (1) measure: flight_count is count() total_distance is distance.sum() -- add some measures avg_distance is distance.avg() rename: origin_code is origin destination_code is destination } source: airports is duckdb.table('../data/airports.parquet') extend { dimension: name is concat(code, '-', city) measure: airport_count is count() avg_elevation is elevation.avg() } source: carriers is duckdb.table('../data/carriers.parquet') extend { measure: carrier_count is count() } query: origin_facts is flights -> { -- (2) CTEs become fact queries group_by: code is origin_code aggregate: total_flights is flight_count } source: flights_explore is flights extend { -- (3), (4) join_one: carriers on carrier = carriers.code join_one: origin is airports on origin_code = origin.code join_one: origin_facts on origin_code = origin_facts.code }
For each table in the query, make a
source:
. If there are aggregate calculations that are relevent to the source declare them measures or dimensions.For each CTE, declare a
query:
. Usually these are 'fact' tables of some kind.Declare a new source that extends the root table in the query, that joins eveything together into a network.
extend:
lets you inhert from an exising source.
Writing the query becomes easy.
run: flights_explore -> { group_by: origin.code origin.city total_flights_to_origin is origin_facts.total_flights carriers.nickname aggregate: flight_count percent_of_flights_to_origin is flight_count/origin_facts.total_flights limit: 10 }
[ { "code": "DFW", "city": "DALLAS-FORT WORTH", "total_flights_to_origin": 17782, "nickname": "American", "flight_count": 8742, "percent_of_flights_to_origin": 0.49162074007423234 }, { "code": "MSP", "city": "MINNEAPOLIS", "total_flights_to_origin": 9762, "nickname": "Northwest", "flight_count": 8662, "percent_of_flights_to_origin": 0.887318172505634 }, { "code": "ATL", "city": "ATLANTA", "total_flights_to_origin": 17875, "nickname": "Delta", "flight_count": 8419, "percent_of_flights_to_origin": 0.470993006993007 }, { "code": "ATL", "city": "ATLANTA", "total_flights_to_origin": 17875, "nickname": "Atlantic Southeast", "flight_count": 7392, "percent_of_flights_to_origin": 0.4135384615384615 }, { "code": "ORD", "city": "CHICAGO", "total_flights_to_origin": 14214, "nickname": "United", "flight_count": 6802, "percent_of_flights_to_origin": 0.4785422822569298 } ]
WITH __stage0 AS ( SELECT base."origin" as "code", (COUNT(1)) as "total_flights" FROM '../data/flights.parquet' as base GROUP BY 1 ) SELECT origin_0."code" as "code", origin_0."city" as "city", origin_facts_0."total_flights" as "total_flights_to_origin", carriers_0."nickname" as "nickname", COUNT(1) as "flight_count", (COUNT(1))*1.0/origin_facts_0."total_flights" as "percent_of_flights_to_origin" FROM '../data/flights.parquet' as base LEFT JOIN __stage0 AS origin_facts_0 ON base."origin"=origin_facts_0."code" LEFT JOIN '../data/airports.parquet' AS origin_0 ON base."origin"=origin_0."code" LEFT JOIN '../data/carriers.parquet' AS carriers_0 ON base."carrier"=carriers_0."code" GROUP BY 1,2,3,4 ORDER BY 5 desc NULLS LAST LIMIT 10
This semantic model can be used to generate lots of interesting queries.
run: flights_explore -> { group_by: origin.name aggregate: flight_count, carriers.carrier_count limit: 10 # list_detail -- (1) nest: carriers is { -- (2) group_by: carriers.nickname aggregate: flight_count } }
[ { "name": "ATL-ATLANTA", "flight_count": 17875, "carrier_count": 12, "carriers": [ { "nickname": "Delta", "flight_count": 8419 }, { "nickname": "Atlantic Southeast", "flight_count": 7392 }, { "nickname": "USAir", "flight_count": 521 }, { "nickname": "American", "flight_count": 338 }, { "nickname": "Northwest", "flight_count": 310 }, { "nickname": "United", "flight_count": 258 }, { "nickname": "Continental", "flight_count": 248 }, { "nickname": "Comair", "flight_count": 239 }, { "nickname": "Continental Express", "flight_count": 81 }, { "nickname": "American Eagle", "flight_count": 46 }, { "nickname": "Jetblue", "flight_count": 16 }, { "nickname": "America West", "flight_count": 7 } ] }, { "name": "DFW-DALLAS-FORT WORTH", "flight_count": 17782, "carrier_count": 13, "carriers": [ { "nickname": "American", "flight_count": 8742 }, { "nickname": "American Eagle", "flight_count": 6146 }, { "nickname": "Delta", "flight_count": 668 }, { "nickname": "USAir", "flight_count": 657 }, { "nickname": "Atlantic Southeast", "flight_count": 395 }, { "nickname": "United", "flight_count": 303 }, { "nickname": "Northwest", "flight_count": 282 }, { "nickname": "America West", "flight_count": 275 }, { "nickname": "ATA", "flight_count": 132 }, { "nickname": "Continental", "flight_count": 111 }, { "nickname": "Continental Express", "flight_count": 67 }, { "nickname": "Comair", "flight_count": 3 }, { "nickname": "Alaska", "flight_count": 1 } ] }, { "name": "ORD-CHICAGO", "flight_count": 14214, "carrier_count": 10, "carriers": [ { "nickname": "United", "flight_count": 6802 }, { "nickname": "American", "flight_count": 5143 }, { "nickname": "American Eagle", "flight_count": 985 }, { "nickname": "USAir", "flight_count": 502 }, { "nickname": "Northwest", "flight_count": 297 }, { "nickname": "Delta", "flight_count": 218 }, { "nickname": "Continental", "flight_count": 141 }, { "nickname": "Continental Express", "flight_count": 90 }, { "nickname": "America West", "flight_count": 19 }, { "nickname": "Comair", "flight_count": 17 } ] }, { "name": "PHX-PHOENIX", "flight_count": 12476, "carrier_count": 12, "carriers": [ { "nickname": "Southwest", "flight_count": 6456 }, { "nickname": "America West", "flight_count": 3696 }, { "nickname": "United", "flight_count": 662 }, { "nickname": "Northwest", "flight_count": 480 }, { "nickname": "American", "flight_count": 355 }, { "nickname": "Alaska", "flight_count": 346 }, { "nickname": "USAir", "flight_count": 316 }, { "nickname": "Delta", "flight_count": 90 }, { "nickname": "ATA", "flight_count": 42 }, { "nickname": "Continental", "flight_count": 17 }, { "nickname": "Continental Express", "flight_count": 9 }, { "nickname": "Jetblue", "flight_count": 7 } ] }, { "name": "LAS-LAS VEGAS", "flight_count": 11096, "carrier_count": 10, "carriers": [ { "nickname": "Southwest", "flight_count": 6292 }, { "nickname": "United", "flight_count": 1340 }, { "nickname": "America West", "flight_count": 1221 }, { "nickname": "Northwest", "flight_count": 633 }, { "nickname": "Alaska", "flight_count": 418 }, { "nickname": "American", "flight_count": 391 }, { "nickname": "Delta", "flight_count": 376 }, { "nickname": "USAir", "flight_count": 223 }, { "nickname": "Jetblue", "flight_count": 146 }, { "nickname": "ATA", "flight_count": 56 } ] } ]
WITH __stage0 AS ( SELECT group_set, CONCAT(origin_0."code",'-',origin_0."city") as "name__0", CASE WHEN group_set=0 THEN COUNT(1) END as "flight_count__0", CASE WHEN group_set=0 THEN COUNT(DISTINCT carriers_0."__distinct_key") END as "carrier_count__0", CASE WHEN group_set=1 THEN carriers_0."nickname" END as "nickname__1", CASE WHEN group_set=1 THEN COUNT(1) END as "flight_count__1" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_0 ON base."origin"=origin_0."code" LEFT JOIN (SELECT GEN_RANDOM_UUID() as "__distinct_key", x.* FROM '../data/carriers.parquet' as x) AS carriers_0 ON base."carrier"=carriers_0."code" CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,5 ) SELECT "name__0" as "name", MAX(CASE WHEN group_set=0 THEN "flight_count__0" END) as "flight_count", MAX(CASE WHEN group_set=0 THEN "carrier_count__0" END) as "carrier_count", COALESCE(LIST({ "nickname": "nickname__1", "flight_count": "flight_count__1"} ORDER BY "flight_count__1" desc NULLS LAST) FILTER (WHERE group_set=1),[]) as "carriers" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 10
Another example query
run: flights_explore -> { group_by: carrier is carriers.nickname aggregate: flight_count # list_detail nest: by_origin is { group_by: origin.name aggregate: flight_count limit: 10 } # line_chart nest: by_month is { group_by: dep_year is dep_time.year aggregate: flight_count } }
[ { "carrier": "Southwest", "flight_count": 88751, "by_origin": [ { "name": "PHX-PHOENIX", "flight_count": 6456 }, { "name": "LAS-LAS VEGAS", "flight_count": 6292 }, { "name": "BWI-BALTIMORE", "flight_count": 5471 }, { "name": "MDW-CHICAGO", "flight_count": 4702 }, { "name": "LAX-LOS ANGELES", "flight_count": 4282 }, { "name": "OAK-OAKLAND", "flight_count": 3964 }, { "name": "HOU-HOUSTON", "flight_count": 3219 }, { "name": "BNA-NASHVILLE", "flight_count": 3170 }, { "name": "SAN-SAN DIEGO", "flight_count": 2865 }, { "name": "MCI-KANSAS CITY", "flight_count": 2583 } ], "by_month": [ { "dep_year": "2005-01-01T00:00:00.000Z", "flight_count": 17549 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 14640 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 14300 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 14708 }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 14421 }, { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 13133 } ] }, { "carrier": "USAir", "flight_count": 37683, "by_origin": [ { "name": "CLT-CHARLOTTE", "flight_count": 6427 }, { "name": "PHL-PHILADELPHIA", "flight_count": 5779 }, { "name": "PIT-PITTSBURGH", "flight_count": 4137 }, { "name": "DCA-WASHINGTON", "flight_count": 2900 }, { "name": "LGA-NEW YORK", "flight_count": 1412 }, { "name": "BOS-BOSTON", "flight_count": 1227 }, { "name": "DFW-DALLAS-FORT WORTH", "flight_count": 657 }, { "name": "RDU-RALEIGH/DURHAM", "flight_count": 632 }, { "name": "IAH-HOUSTON", "flight_count": 600 }, { "name": "ATL-ATLANTA", "flight_count": 521 } ], "by_month": [ { "dep_year": "2005-01-01T00:00:00.000Z", "flight_count": 7446 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 5388 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 4321 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 6501 }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 6866 }, { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 7161 } ] }, { "carrier": "American", "flight_count": 34577, "by_origin": [ { "name": "DFW-DALLAS-FORT WORTH", "flight_count": 8742 }, { "name": "ORD-CHICAGO", "flight_count": 5143 }, { "name": "LAX-LOS ANGELES", "flight_count": 1951 }, { "name": "LGA-NEW YORK", "flight_count": 1070 }, { "name": "MIA-MIAMI", "flight_count": 1063 }, { "name": "JFK-NEW YORK", "flight_count": 943 }, { "name": "STL-ST LOUIS", "flight_count": 712 }, { "name": "SFO-SAN FRANCISCO", "flight_count": 688 }, { "name": "AUS-AUSTIN", "flight_count": 671 }, { "name": "BOS-BOSTON", "flight_count": 605 } ], "by_month": [ { "dep_year": "2005-01-01T00:00:00.000Z", "flight_count": 5543 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 5604 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 5566 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 5963 }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 5851 }, { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 6050 } ] }, { "carrier": "Northwest", "flight_count": 33580, "by_origin": [ { "name": "MSP-MINNEAPOLIS", "flight_count": 8662 }, { "name": "DTW-DETROIT", "flight_count": 6604 }, { "name": "MEM-MEMPHIS", "flight_count": 1984 }, { "name": "LAX-LOS ANGELES", "flight_count": 993 }, { "name": "MCO-ORLANDO", "flight_count": 847 }, { "name": "SEA-SEATTLE", "flight_count": 751 }, { "name": "LGA-NEW YORK", "flight_count": 682 }, { "name": "BOS-BOSTON", "flight_count": 667 }, { "name": "LAS-LAS VEGAS", "flight_count": 633 }, { "name": "DCA-WASHINGTON", "flight_count": 601 } ], "by_month": [ { "dep_year": "2005-01-01T00:00:00.000Z", "flight_count": 5869 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 6306 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 5187 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 5081 }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 5381 }, { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 5756 } ] }, { "carrier": "United", "flight_count": 32757, "by_origin": [ { "name": "ORD-CHICAGO", "flight_count": 6802 }, { "name": "DEN-DENVER", "flight_count": 4997 }, { "name": "IAD-WASHINGTON", "flight_count": 2541 }, { "name": "SFO-SAN FRANCISCO", "flight_count": 2464 }, { "name": "LAX-LOS ANGELES", "flight_count": 2319 }, { "name": "LAS-LAS VEGAS", "flight_count": 1340 }, { "name": "MCO-ORLANDO", "flight_count": 852 }, { "name": "SEA-SEATTLE", "flight_count": 728 }, { "name": "PHX-PHOENIX", "flight_count": 662 }, { "name": "SAN-SAN DIEGO", "flight_count": 641 } ], "by_month": [ { "dep_year": "2005-01-01T00:00:00.000Z", "flight_count": 6876 }, { "dep_year": "2004-01-01T00:00:00.000Z", "flight_count": 6817 }, { "dep_year": "2003-01-01T00:00:00.000Z", "flight_count": 6068 }, { "dep_year": "2002-01-01T00:00:00.000Z", "flight_count": 5590 }, { "dep_year": "2001-01-01T00:00:00.000Z", "flight_count": 3785 }, { "dep_year": "2000-01-01T00:00:00.000Z", "flight_count": 3621 } ] } ]
WITH __stage0 AS ( SELECT group_set, carriers_0."nickname" as "carrier__0", CASE WHEN group_set=0 THEN COUNT(1) END as "flight_count__0", CASE WHEN group_set=1 THEN CONCAT(origin_0."code",'-',origin_0."city") END as "name__1", CASE WHEN group_set=1 THEN COUNT(1) END as "flight_count__1", CASE WHEN group_set=2 THEN DATE_TRUNC('year', base."dep_time") END as "dep_year__2", CASE WHEN group_set=2 THEN COUNT(1) END as "flight_count__2" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers_0 ON base."carrier"=carriers_0."code" LEFT JOIN '../data/airports.parquet' AS origin_0 ON base."origin"=origin_0."code" CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set ) as group_set GROUP BY 1,2,4,6 ) SELECT "carrier__0" as "carrier", MAX(CASE WHEN group_set=0 THEN "flight_count__0" END) as "flight_count", COALESCE(LIST({ "name": "name__1", "flight_count": "flight_count__1"} ORDER BY "flight_count__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:10],[]) as "by_origin", COALESCE(LIST({ "dep_year": "dep_year__2", "flight_count": "flight_count__2"} ORDER BY "dep_year__2" desc NULLS LAST) FILTER (WHERE group_set=2),[]) as "by_month" FROM __stage0 GROUP BY 1 ORDER BY 2 desc NULLS LAST