Malloy for dbt Users

VSCode

This guide maps common dbt patterns to their Malloy equivalents, using the flights dataset.

Staging Models

In dbt, "staging models" are the first transformation layer—typically one model per source table, doing light cleanup like renaming columns or casting types. The rest of your project builds on these staged versions rather than referencing raw tables directly. In Malloy, there's no separate staging step. You define a source pointing at the raw table and add transformations inline. The source definition is your staging layer and source extensions allow you to build on the initial source.

dbt:

-- models/staging/stg_flights.sql
select id2, origin, destination, carrier, distance, dep_time
from {{ source('raw', 'flights') }}

Malloy:

source: flights is duckdb.table('../data/flights.parquet')

For transformations at the source level, use extend:

source: flights2 is duckdb.table('../data/flights.parquet') extend {
  dimension:
    flight_year is year(dep_time)
    is_long_haul is distance > 1000
}

run: flights2 -> {
  group_by: flight_year, is_long_haul
  aggregate: flight_count is count()
}

QUERY RESULTS

[
  {
    "flight_year": 2005,
    "is_long_haul": false,
    "flight_count": 54681
  },
  {
    "flight_year": 2004,
    "is_long_haul": false,
    "flight_count": 52213
  },
  {
    "flight_year": 2003,
    "is_long_haul": false,
    "flight_count": 43623
  },
  {
    "flight_year": 2001,
    "is_long_haul": false,
    "flight_count": 35928
  },
  {
    "flight_year": 2002,
    "is_long_haul": false,
    "flight_count": 35695
  }
]

SELECT 
   EXTRACT(year FROM base."dep_time") as "flight_year",
   base."distance">1000 as "is_long_haul",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1,2
ORDER BY 3 desc NULLS LAST

Type Casting

Cast columns to the correct types inline:

source: orders is duckdb.table('raw_orders.parquet') extend {
  dimension:
    order_id is id::number
    order_date is raw_date::date
    total_amount is amount::number
}

Joins

dbt:

select f.*, c.nickname as carrier_name
from {{ ref('stg_flights') }} f
left join {{ ref('stg_carriers') }} c on f.carrier = c.code

Malloy:

source: carriers is duckdb.table('../data/carriers.parquet') extend {
  primary_key: code
}

source: flights3 is duckdb.table('../data/flights.parquet') extend {
  join_one: carriers with carrier
}

run: flights3 -> {
  group_by: carriers.nickname
  aggregate: flight_count is count()
  limit: 5
}

QUERY RESULTS

[
  {
    "nickname": "Southwest",
    "flight_count": 88751
  },
  {
    "nickname": "USAir",
    "flight_count": 37683
  },
  {
    "nickname": "American",
    "flight_count": 34577
  },
  {
    "nickname": "Northwest",
    "flight_count": 33580
  },
  {
    "nickname": "United",
    "flight_count": 32757
  }
]

SELECT 
   carriers_0."nickname" as "nickname",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/carriers.parquet' AS carriers_0
  ON carriers_0."code"=base."carrier"
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 5

join_one indicates a many-to-one relationship (one carrier per flight). Use join_many for one-to-many relationships.

Denormalizing Data

To flatten joined data into a wide table, use select: * with joined fields:

source: orders is duckdb.table('orders.parquet') extend {
  join_one: customers with customer_id
  join_one: products with product_id
}

run: orders -> {
  select:
    *
    customers.customer_name
    customers.region
    products.product_name
    products.category
}

Reusable Transformations (Macros)

dbt:

-- macros/distance_category.sql
{% macro distance_category(column_name) %}
  case
    when {{ column_name }} < 500 then 'short'
    when {{ column_name }} < 1500 then 'medium'
    else 'long'
  end
{% endmacro %}

-- models/flights.sql
select {{ distance_category('distance') }} as distance_category

Malloy uses pick expressions:

source: flights4 is duckdb.table('../data/flights.parquet') extend {
  dimension: distance_category is
    pick 'short' when distance < 500
    pick 'medium' when distance < 1500
    else 'long'
}

run: flights4 -> {
  group_by: distance_category
  aggregate: flight_count is count()
}

QUERY RESULTS

[
  {
    "distance_category": "short",
    "flight_count": 154125
  },
  {
    "distance_category": "medium",
    "flight_count": 151531
  },
  {
    "distance_category": "long",
    "flight_count": 39171
  }
]

SELECT 
   CASE WHEN (base."distance"<500) THEN 'short' WHEN (base."distance"<1500) THEN 'medium' ELSE 'long' END as "distance_category",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Chained Transformations

dbt requires CTEs when columns reference other computed columns:

with with_hour as (
  select *, extract(hour from dep_time) as dep_hour
  from flights
),
with_period as (
  select *,
    case
      when dep_hour < 6 then 'red-eye'
      when dep_hour < 12 then 'morning'
      when dep_hour < 18 then 'afternoon'
      else 'evening'
    end as time_of_day
  from with_hour
)
select * from with_period

Malloy dimensions can reference earlier dimensions directly:

source: flights5 is duckdb.table('../data/flights.parquet') extend {
  dimension:
    dep_hour is hour(dep_time)
    time_of_day is
      pick 'red-eye' when dep_hour < 6
      pick 'morning' when dep_hour < 12
      pick 'afternoon' when dep_hour < 18
      else 'evening'
}

run: flights5 -> {
  group_by: time_of_day
  aggregate: flight_count is count()
}

QUERY RESULTS

[
  {
    "time_of_day": "morning",
    "flight_count": 131249
  },
  {
    "time_of_day": "afternoon",
    "flight_count": 128362
  },
  {
    "time_of_day": "evening",
    "flight_count": 77275
  },
  {
    "time_of_day": "red-eye",
    "flight_count": 7941
  }
]

SELECT 
   CASE WHEN ((EXTRACT(hour FROM base."dep_time"))<6) THEN 'red-eye' WHEN ((EXTRACT(hour FROM base."dep_time"))<12) THEN 'morning' WHEN ((EXTRACT(hour FROM base."dep_time"))<18) THEN 'afternoon' ELSE 'evening' END as "time_of_day",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Aggregations

dbt:

select
  carrier,
  count(*) as flight_count,
  sum(distance) as total_distance,
  avg(distance) as avg_distance
from {{ ref('flights') }}
group by 1

Malloy:

source: flights6 is duckdb.table('../data/flights.parquet') extend {
  measure:
    flight_count is count()
    total_distance is sum(distance)
    avg_distance is avg(distance)
}

run: flights6 -> {
  group_by: carrier
  aggregate: flight_count, total_distance, avg_distance
  limit: 5
}

QUERY RESULTS

[
  {
    "carrier": "WN",
    "flight_count": 88751,
    "total_distance": 54619152,
    "avg_distance": 615.4201304774031
  },
  {
    "carrier": "US",
    "flight_count": 37683,
    "total_distance": 23721642,
    "avg_distance": 629.5051349414855
  },
  {
    "carrier": "AA",
    "flight_count": 34577,
    "total_distance": 37684885,
    "avg_distance": 1089.8830147207682
  },
  {
    "carrier": "NW",
    "flight_count": 33580,
    "total_distance": 33376503,
    "avg_distance": 993.9399344848124
  },
  {
    "carrier": "UA",
    "flight_count": 32757,
    "total_distance": 38882934,
    "avg_distance": 1187.0114479347926
  }
]

SELECT 
   base."carrier" as "carrier",
   COUNT(1) as "flight_count",
   COALESCE(SUM(base."distance"),0) as "total_distance",
   AVG(base."distance") as "avg_distance"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 5

Measures are defined once and reused across queries.

Documentation

dbt:

models:
  - name: flights
    description: "Flight records"
    columns:
      - name: distance
        description: "Flight distance in miles"

Malloy uses tags for documentation:

source: flights is duckdb.table('flights.parquet') extend {
  # "Flight distance in miles"
  dimension: distance
}

Imports and Refs

dbt:

from {{ ref('stg_flights') }}

Malloy:

import "staging/flights.malloy"

source: flight_metrics is flights extend {
  measure: on_time_rate is avg(pick 1 when dep_delay <= 0 else 0)
}

Raw SQL

For database-specific features not covered by Malloy:

source: flight_sample is duckdb.sql("""
  SELECT origin, destination, distance, dep_time
  FROM '../data/flights.parquet'
  WHERE distance > 2000
  LIMIT 1000
""") extend {
  measure: flight_count is count()
}

run: flight_sample -> {
  group_by: origin
  aggregate: flight_count
  limit: 5
}

QUERY RESULTS

[
  {
    "origin": "SFO",
    "flight_count": 159
  },
  {
    "origin": "SEA",
    "flight_count": 143
  },
  {
    "origin": "MSP",
    "flight_count": 137
  },
  {
    "origin": "LAX",
    "flight_count": 126
  },
  {
    "origin": "SLC",
    "flight_count": 79
  }
]

SELECT 
   base."origin" as "origin",
   COUNT(1) as "flight_count"
FROM (
  SELECT origin, destination, distance, dep_time
  FROM '../data/flights.parquet'
  WHERE distance > 2000
  LIMIT 1000
) as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 5

Materialization

To persist query results as tables, views, or files, use MalloySQL—which embeds Malloy queries in DDL:

-- connection:duckdb

import "models/flights.malloy"

CREATE TABLE daily_stats AS
%{
  flights -> {
    group_by: dep_time.day
    aggregate: flight_count, total_distance
  }
}%

See Transform & Materialize for full details on MalloySQL syntax, exports, and views.

Running Queries

dbt:

dbt compile
dbt run --select flights

Malloy:

malloy-cli run models/flights.malloy --query summary
malloy-cli compile models/flights.malloy --query summary  # SQL only

The VS Code extension runs queries directly with real-time error checking.

Next Steps

Transform & Materialize - CLI setup and MalloySQL patterns
Quick Start: Modeling - Build your first semantic model