Joins

VSCode

Joins in Malloy differ from SQL joins. When two sources are joined, Malloy retains the graph nature and hierarchy of the the data relationships. This is unlike SQL, which flattens everything into a single table space.

Aggregate calculations navigate this graph to deduce the locality of computation, so they are always computed correctly regardless of join pattern, avoiding the fan and chasm traps.

In Malloy, syntaxes for join are:

join_one: source name [is source] on boolean expression
join_one: source name [is source] with foreign key expression
join_many: source name [is source] on boolean expression
join_cross: source name [is source] [on boolean expression]

Examples of the above, with orders as the implied source:

join_one: users is duckdb.table('data/users.parquet') on user_id = users.id
join_one: users on user_id = users.id
join_one: users with user_id
join_many: order_items on order_items.id = id
join_cross: order_items2 is duckdb.table('data/order_items.parquet') on user_id = order_items2.user_id

join_one: - the table we are joining has one row for each row in the source table.

join_many: - the table we are joining has many rows for each row in the source table

join_cross: - the join is a cross product and there will be many rows in each side of the join.

Malloy's joins are left outer joins by default. Since Malloy deals in graphs, some SQL Join types don't make sense (RIGHT JOIN, for example).

Join Types

Foreign Key to Primary Key

The easiest, most error-proof way to perform a join is using the following syntax:

join_one: source with foreign key

To join based on a foreign key through the primary_key of a joined source, use with to specify an expression, which could be as simple as a field name in the source. This expression is matched against the declared primary_key of the joined source. Sources without a primary_key cannot use with joins.

source: users is duckdb.table('../data/users.parquet') extend {
  primary_key: id
}

source: order_items is duckdb.table('../data/order_items.parquet') extend {
  join_one: users with user_id
}

This is simply a shortcut, when joining based on the primary key of a joined source. It is exactly equivalent to the on join written like this.

source: order_items is duckdb.table('../data/order_items.parquet') extend {
  join_one: users on user_id = users.id
}

Naming Joined Sources

If no alias is specified using is, the name of the join will be the name of the source being joined.

source: carriers is duckdb.table('../data/carriers.parquet') extend {
  primary_key: code
  measure: carrier_count is count()
}

source: flights is duckdb.table('../data/flights.parquet') extend {
  join_one: carriers with carrier
}

To give the joined source a different name within the context source, use is to alias it.

source: airports is duckdb.table('../data/airports.parquet') extend {
  primary_key: code
}

source: flights2 is duckdb.table('../data/flights.parquet') extend {
  join_one: origin_airport is airports with origin
}

In-line Joins

Sources do not need to be modeled before they are used in a join, though the join must be named using is.

source: flights3 is duckdb.table('../data/flights.parquet') extend {
  join_one: carriers is 
    duckdb.table('../data/carriers.parquet') extend { primary_key: code } with carrier
}

Using Fields from Joined Sources

When a source is joined in, its fields become nested within the parent source. Fields from joined sources can be referenced using .:

run: flights3 -> {
  group_by: carriers.nickname
  aggregate: flight_count is count()
}

QUERY RESULTS

[
  {
    "nickname": "Southwest",
    "flight_count": 88751
  },
  {
    "nickname": "USAir",
    "flight_count": 37683
  },
  {
    "nickname": "American",
    "flight_count": 34577
  },
  {
    "nickname": "Northwest",
    "flight_count": 33580
  },
  {
    "nickname": "United",
    "flight_count": 32757
  }
]

SELECT 
   carriers_0."nickname" as "nickname",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/carriers.parquet' AS carriers_0
  ON carriers_0."code"=base."carrier"
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Measures defined in joined sources may be used in addition to dimensions.

run: flights -> {
  group_by: destination
  aggregate: carriers.carrier_count
}

QUERY RESULTS

[
  {
    "destination": "DFW",
    "carrier_count": 13
  },
  {
    "destination": "SEA",
    "carrier_count": 12
  },
  {
    "destination": "PHX",
    "carrier_count": 12
  },
  {
    "destination": "IAH",
    "carrier_count": 12
  },
  {
    "destination": "ATL",
    "carrier_count": 12
  }
]

SELECT 
   base."destination" as "destination",
   COUNT(DISTINCT carriers_0."code") as "carrier_count"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/carriers.parquet' AS carriers_0
  ON carriers_0."code"=base."carrier"
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Join Example

This example demonstrates the definition of several different joins in a model and their use in a query. Entire subtrees of data can be joined. In the example below, aircraft joins aircraft_models, and flights joins aircraft and airports. The tree nature of the join relationship retained.

group_by: aircraft.aircraft_models.manufacturer

source: aircraft_models is duckdb.table('../data/aircraft_models.parquet') extend {
  primary_key: aircraft_model_code
  measure: aircraft_model_count is count()
}

/* Individual airplanes */
source: aircraft is duckdb.table('../data/aircraft.parquet') extend {
  primary_key: tail_num
  measure: aircraft_count is count()
  join_one: aircraft_models with aircraft_model_code
}

/* The airports that the aircraft fly to and from */
source: airports2 is duckdb.table('../data/airports.parquet') extend {
  primary_key: code
  measure: airport_count is count()
}

source: flights4 is duckdb.table('../data/flights.parquet') extend {
  join_one: origin_airport is airports2 with origin
  join_one: destination_airport is airports2 with destination
  join_one: aircraft with tail_num
}

run: flights4 -> {
  group_by: aircraft.aircraft_models.manufacturer
  aggregate:
    flight_count is count()
    aircraft_count is aircraft.count()
    aircraft_model_count is aircraft.aircraft_models.count()
}

QUERY RESULTS

[
  {
    "manufacturer": "BOEING",
    "flight_count": 183236,
    "aircraft_count": 23,
    "aircraft_model_count": 17
  },
  {
    "manufacturer": "AIRBUS INDUSTRIE",
    "flight_count": 55994,
    "aircraft_count": 8,
    "aircraft_model_count": 5
  },
  {
    "manufacturer": "MCDONNELL DOUGLAS",
    "flight_count": 39106,
    "aircraft_count": 5,
    "aircraft_model_count": 2
  },
  {
    "manufacturer": "EMBRAER",
    "flight_count": 21799,
    "aircraft_count": 7,
    "aircraft_model_count": 5
  },
  {
    "manufacturer": "AIRBUS",
    "flight_count": 7699,
    "aircraft_count": 3,
    "aircraft_model_count": 2
  }
]

SELECT 
   aircraft_models_0."manufacturer" as "manufacturer",
   COUNT(1) as "flight_count",
   COUNT(DISTINCT aircraft_0."tail_num") as "aircraft_count",
   COUNT(DISTINCT aircraft_models_0."aircraft_model_code") as "aircraft_model_count"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/aircraft.parquet' AS aircraft_0
  ON aircraft_0."tail_num"=base."tail_num"
 LEFT JOIN '../data/aircraft_models.parquet' AS aircraft_models_0
  ON aircraft_models_0."aircraft_model_code"=aircraft_0."aircraft_model_code"
GROUP BY 1
ORDER BY 2 desc NULLS LAST

For more examples and how to reason about aggregation across joins, review the Aggregates section.

Inner Joins

Inner join are essentially left joins with an additional condition that the parent table has matches in the joined table. The example below functions logically as an INNER JOIN, returning only users that have at least one row in the orders table, and only orders that have an associated user.

source: users4 is duckdb.table('../data/users.parquet') extend {
  join_many: orders is duckdb.table('../data/order_items.parquet') on id = orders.user_id
  where: orders.user_id is not null
}

Joins from Queries

Queries may be used as a source for joins, either as-is or extended. They may be referenced by name (as in the example below) or written in-line.

query: aircraft_facts is duckdb.table('../data/flights.parquet') -> {
  group_by: tail_num
  aggregate:
    lifetime_flights is count()
    lifetime_distance is distance.sum()
}

source: flights5 is duckdb.table('../data/flights.parquet') extend {
  join_one: aircraft_facts on tail_num = aircraft_facts.tail_num
}

run: flights5 -> {
  group_by: carrier
  nest: top_aircraft is {
    group_by: tail_num, aircraft_facts.lifetime_flights
    limit: 5
    order_by: lifetime_flights desc
  }
}

QUERY RESULTS

[
  {
    "carrier": "AA",
    "top_aircraft": [
      {
        "tail_num": "N499AA",
        "lifetime_flights": 7977
      },
      {
        "tail_num": "N583AA",
        "lifetime_flights": 7930
      },
      {
        "tail_num": "N216AA",
        "lifetime_flights": 7679
      },
      {
        "tail_num": "N434AA",
        "lifetime_flights": 7067
      },
      {
        "tail_num": "N304AA",
        "lifetime_flights": 2289
      }
    ]
  },
  {
    "carrier": "AS",
    "top_aircraft": [
      {
        "tail_num": "N974AS",
        "lifetime_flights": 8453
      }
    ]
  },
  {
    "carrier": "B6",
    "top_aircraft": [
      {
        "tail_num": "N533JB",
        "lifetime_flights": 4393
      },
      {
        "tail_num": "N624JB",
        "lifetime_flights": 449
      }
    ]
  },
  {
    "carrier": "CO",
    "top_aircraft": [
      {
        "tail_num": "N47332",
        "lifetime_flights": 7139
      }
    ]
  },
  {
    "carrier": "DL",
    "top_aircraft": [
      {
        "tail_num": "N242DL",
        "lifetime_flights": 10788
      },
      {
        "tail_num": "N171DZ",
        "lifetime_flights": 8134
      },
      {
        "tail_num": "N313DL",
        "lifetime_flights": 7867
      },
      {
        "tail_num": "N752AT",
        "lifetime_flights": 4995
      },
      {
        "tail_num": "N720DA",
        "lifetime_flights": 346
      }
    ]
  }
]

WITH __stage0 AS (
  SELECT 
     base."tail_num" as "tail_num",
     COUNT(1) as "lifetime_flights",
     COALESCE(SUM(base."distance"),0) as "lifetime_distance"
  FROM '../data/flights.parquet' as base
  GROUP BY 1
)
, __stage1 AS (
  SELECT
    group_set,
    base."carrier" as "carrier__0",
    CASE WHEN group_set=1 THEN
      base."tail_num"
      END as "tail_num__1",
    CASE WHEN group_set=1 THEN
      aircraft_facts_0."lifetime_flights"
      END as "lifetime_flights__1"
  FROM '../data/flights.parquet' as base
   LEFT JOIN __stage0 AS aircraft_facts_0
    ON base."tail_num"=aircraft_facts_0."tail_num"
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set  ) as group_set
  GROUP BY 1,2,3,4
)
SELECT
  "carrier__0" as "carrier",
  COALESCE(LIST({
    "tail_num": "tail_num__1", 
    "lifetime_flights": "lifetime_flights__1"}  ORDER BY  "lifetime_flights__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:5],[]) as "top_aircraft"
FROM __stage1
GROUP BY 1
ORDER BY 1 asc NULLS LAST