Joins in Malloy differ from SQL joins. When two sources are joined, Malloy retains the graph nature and hierarchy of the the data relationships. This is unlike SQL, which flattens everything into a single table space.
Aggregate calculations navigate this graph to deduce the locality of computation, so they are always computed correctly regardless of join pattern, avoiding the fan and chasm traps.
In Malloy, syntaxes for join are:
join_one: source name [is source] on boolean expression join_one: source name [is source] with foreign key expression join_many: source name [is source] on boolean expression join_cross: source name [is source] [on boolean expression]
Examples of the above, with orders
as the implied source:
join_one: users is duckdb.table('data/users.parquet') on user_id = users.id join_one: users on user_id = users.id join_one: users with user_id join_many: order_items on order_items.id = id join_cross: order_items2 is duckdb.table('data/order_items.parquet') on user_id = order_items2.user_id
join_one:
- the table we are joining has one row for each row in the source table.
join_many:
- the table we are joining has many rows for each row in the source table
join_cross:
- the join is a cross product and there will be many rows in each side of the join.
Malloy's joins are left outer joins by default.
Since Malloy deals in graphs, some SQL Join types don't make sense (RIGHT JOIN
, for example).
Join Types
Foreign Key to Primary Key
The easiest, most error-proof way to perform a join is using the following syntax:
join_one: source with foreign key
To join based on a foreign key through the primary_key
of a joined source, use with
to specify an expression, which could be as simple as a field name in the source. This expression is matched against the declared primary_key
of the joined source. Sources without a primary_key
cannot use with
joins.
source: users is duckdb.table('../data/users.parquet') extend { primary_key: id } source: order_items is duckdb.table('../data/order_items.parquet') extend { join_one: users with user_id }
This is simply a shortcut, when joining based on the primary key of a joined source. It is exactly equivalent to the on
join written like this.
source: order_items is duckdb.table('../data/order_items.parquet') extend { join_one: users on user_id = users.id }
Naming Joined Sources
If no alias is specified using is
, the name of the join will be the name of the source being joined.
source: carriers is duckdb.table('../data/carriers.parquet') extend { primary_key: code measure: carrier_count is count() } source: flights is duckdb.table('../data/flights.parquet') extend { join_one: carriers with carrier }
To give the joined source a different name within the context source, use is
to alias it.
source: airports is duckdb.table('../data/airports.parquet') extend { primary_key: code } source: flights2 is duckdb.table('../data/flights.parquet') extend { join_one: origin_airport is airports with origin }
In-line Joins
Sources do not need to be modeled before they are used in a join, though the join must be named using is
.
source: flights3 is duckdb.table('../data/flights.parquet') extend { join_one: carriers is duckdb.table('../data/carriers.parquet') extend { primary_key: code } with carrier }
Using Fields from Joined Sources
When a source is joined in, its fields become nested within the parent source. Fields from joined sources can be referenced using .
:
run: flights3 -> { group_by: carriers.nickname aggregate: flight_count is count() }
[ { "nickname": "Southwest", "flight_count": 88751 }, { "nickname": "USAir", "flight_count": 37683 }, { "nickname": "American", "flight_count": 34577 }, { "nickname": "Northwest", "flight_count": 33580 }, { "nickname": "United", "flight_count": 32757 } ]
SELECT carriers_0."nickname" as "nickname", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers_0 ON carriers_0."code"=base."carrier" GROUP BY 1 ORDER BY 2 desc NULLS LAST
Measures defined in joined sources may be used in addition to dimensions.
run: flights -> { group_by: destination aggregate: carriers.carrier_count }
[ { "destination": "DFW", "carrier_count": 13 }, { "destination": "ATL", "carrier_count": 12 }, { "destination": "IND", "carrier_count": 12 }, { "destination": "LAX", "carrier_count": 12 }, { "destination": "SEA", "carrier_count": 12 } ]
SELECT base."destination" as "destination", COUNT(DISTINCT carriers_0."code") as "carrier_count" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers_0 ON carriers_0."code"=base."carrier" GROUP BY 1 ORDER BY 2 desc NULLS LAST
Join Example
This example demonstrates the definition of several different joins in a model and their use in a query.
Entire subtrees of data can be joined. In the example below, aircraft
joins aircraft_models
, and flights
joins aircraft
and airports
. The tree nature of the join relationship
retained.
group_by: aircraft.aircraft_models.manufacturer
source: aircraft_models is duckdb.table('../data/aircraft_models.parquet') extend { primary_key: aircraft_model_code measure: aircraft_model_count is count() } /* Individual airplanes */ source: aircraft is duckdb.table('../data/aircraft.parquet') extend { primary_key: tail_num measure: aircraft_count is count() join_one: aircraft_models with aircraft_model_code } /* The airports that the aircraft fly to and from */ source: airports2 is duckdb.table('../data/airports.parquet') extend { primary_key: code measure: airport_count is count() } source: flights4 is duckdb.table('../data/flights.parquet') extend { join_one: origin_airport is airports2 with origin join_one: destination_airport is airports2 with destination join_one: aircraft with tail_num } run: flights4 -> { group_by: aircraft.aircraft_models.manufacturer aggregate: flight_count is count() aircraft_count is aircraft.count() aircraft_model_count is aircraft.aircraft_models.count() }
[ { "manufacturer": "BOEING", "flight_count": 183236, "aircraft_count": 23, "aircraft_model_count": 17 }, { "manufacturer": "AIRBUS INDUSTRIE", "flight_count": 55994, "aircraft_count": 8, "aircraft_model_count": 5 }, { "manufacturer": "MCDONNELL DOUGLAS", "flight_count": 39106, "aircraft_count": 5, "aircraft_model_count": 2 }, { "manufacturer": "EMBRAER", "flight_count": 21799, "aircraft_count": 7, "aircraft_model_count": 5 }, { "manufacturer": "AIRBUS", "flight_count": 7699, "aircraft_count": 3, "aircraft_model_count": 2 } ]
SELECT aircraft_models_0."manufacturer" as "manufacturer", COUNT(1) as "flight_count", COUNT(DISTINCT aircraft_0."tail_num") as "aircraft_count", COUNT(DISTINCT aircraft_models_0."aircraft_model_code") as "aircraft_model_count" FROM '../data/flights.parquet' as base LEFT JOIN '../data/aircraft.parquet' AS aircraft_0 ON aircraft_0."tail_num"=base."tail_num" LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0 ON aircraft_models_0."aircraft_model_code"=aircraft_0."aircraft_model_code" GROUP BY 1 ORDER BY 2 desc NULLS LAST
For more examples and how to reason about aggregation across joins, review the Aggregates section.
Inner Joins
Inner join are essentially left joins with an additional condition that the parent table has matches in the joined table. The example below functions logically as an INNER JOIN
, returning only users that have at least one row in the orders table, and only orders that have an associated user.
source: users4 is duckdb.table('../data/users.parquet') extend { join_many: orders is duckdb.table('../data/order_items.parquet') on id = orders.user_id where: orders.user_id != null }
Joins from Queries
Queries may be used as a source for joins, either as-is or extended. They may be referenced by name (as in the example below) or written in-line.
query: aircraft_facts is duckdb.table('../data/flights.parquet') -> { group_by: tail_num aggregate: lifetime_flights is count() lifetime_distance is distance.sum() } source: flights5 is duckdb.table('../data/flights.parquet') extend { join_one: aircraft_facts on tail_num = aircraft_facts.tail_num } run: flights5 -> { group_by: carrier nest: top_aircraft is { group_by: tail_num, aircraft_facts.lifetime_flights limit: 5 order_by: lifetime_flights desc } }
[ { "carrier": "AA", "top_aircraft": [ { "tail_num": "N499AA", "lifetime_flights": 7977 }, { "tail_num": "N583AA", "lifetime_flights": 7930 }, { "tail_num": "N216AA", "lifetime_flights": 7679 }, { "tail_num": "N434AA", "lifetime_flights": 7067 }, { "tail_num": "N304AA", "lifetime_flights": 2289 } ] }, { "carrier": "AS", "top_aircraft": [ { "tail_num": "N974AS", "lifetime_flights": 8453 } ] }, { "carrier": "B6", "top_aircraft": [ { "tail_num": "N533JB", "lifetime_flights": 4393 }, { "tail_num": "N624JB", "lifetime_flights": 449 } ] }, { "carrier": "CO", "top_aircraft": [ { "tail_num": "N47332", "lifetime_flights": 7139 } ] }, { "carrier": "DL", "top_aircraft": [ { "tail_num": "N242DL", "lifetime_flights": 10788 }, { "tail_num": "N171DZ", "lifetime_flights": 8134 }, { "tail_num": "N313DL", "lifetime_flights": 7867 }, { "tail_num": "N752AT", "lifetime_flights": 4995 }, { "tail_num": "N720DA", "lifetime_flights": 346 } ] } ]
WITH __stage0 AS ( SELECT base."tail_num" as "tail_num", COUNT(1) as "lifetime_flights", COALESCE(SUM(base."distance"),0) as "lifetime_distance" FROM '../data/flights.parquet' as base GROUP BY 1 ) , __stage1 AS ( SELECT group_set, base."carrier" as "carrier__0", CASE WHEN group_set=1 THEN base."tail_num" END as "tail_num__1", CASE WHEN group_set=1 THEN aircraft_facts_0."lifetime_flights" END as "lifetime_flights__1" FROM '../data/flights.parquet' as base LEFT JOIN __stage0 AS aircraft_facts_0 ON base."tail_num"=aircraft_facts_0."tail_num" CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,3,4 ) SELECT "carrier__0" as "carrier", COALESCE(LIST({ "tail_num": "tail_num__1", "lifetime_flights": "lifetime_flights__1"} ORDER BY "lifetime_flights__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:5],[]) as "top_aircraft" FROM __stage1 GROUP BY 1 ORDER BY 1 asc NULLS LAST