Jump to the section that matches your symptom:
| Symptom | Problem |
|---|---|
| Row count multiplied after join | Non-Unique Join Key |
| Aggregates inflated 10x | Grain Mismatch |
| Query takes minutes | Slow Query |
| Reference to undefined object | Circular References |
| Date filter syntax questions | Date Arithmetic |
| Can't reference joined field in join | Multi-hop Joins |
| "Field 'x' not found in source 'y'" | Common Errors |
## Problem 1: Non-Unique Join Key

### Symptoms

- Query returns 500,000 rows when you expect 50,000
- Aggregates are 10x larger than expected
- A single row multiplies into many after the join

### Diagnosis

Test join cardinality:

```malloy
source: airports_p1 is duckdb.table('../data/airports.parquet') extend {
  primary_key: code
}

source: flights_p1 is duckdb.table('../data/flights.parquet') extend {
  join_one: airports_p1 with origin
}

run: flights_p1 -> {
  where: id2 = 30272525 // ONE row
  aggregate:
    source_count is count()
    joined_count is airports_p1.count()
  // If source_count = 1 and joined_count = 1, your join is correct
}
```
Result:

```json
[ { "source_count": 1, "joined_count": 1 } ]
```

Generated SQL:

```sql
SELECT
  COUNT(1) as "source_count",
  COUNT(DISTINCT airports_p1_0."code") as "joined_count"
FROM '../data/flights.parquet' as base
LEFT JOIN '../data/airports.parquet' AS airports_p1_0
  ON airports_p1_0."code" = base."origin"
WHERE base."id2" = 30272525
```
If `joined_count` > `source_count` for a `join_one`, your join key isn't unique.
### The Problem

Your join key doesn't uniquely identify rows in the target table. Consider a flight segments table where each flight can have multiple segments (legs):

```malloy
// WRONG - flight_segments is keyed by (flight_id + segment_number)
join_one: flight_segments on id2 = flight_segments.flight_id
// One flight matches multiple segment rows → fan-out

// CORRECT - include all keys that form the unique identifier
join_one: flight_segments on
  id2 = flight_segments.flight_id
  and segment_num = flight_segments.segment_number
```
### Fix

1. Identify the primary key of the table you're joining to
2. Include ALL columns that form that key in your join condition
3. Re-test cardinality using the diagnostic query above
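To confirm that a candidate key really is unique before relying on it, you can group by it and look for duplicates. A minimal sketch, assuming a `flight_segments` source with `flight_id` and `segment_number` fields:

```malloy
// Any group with more than one row means
// (flight_id, segment_number) is NOT a unique key
run: flight_segments -> {
  group_by: flight_id, segment_number
  aggregate: row_count is count()
  having: row_count > 1
}
```

If this query returns zero rows, the composite key is unique and safe to use in a `join_one`.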
## Problem 2: Grain Mismatch

### Symptoms

- Joining a daily/monthly summary table to raw transaction data
- Every row gets the same aggregated value
- Totals are wildly inflated
### The Problem

You have two tables at different granularity:

- **Raw table:** one row per event (e.g., every flight, every order)
- **Aggregated table:** one row per time period (e.g., daily totals, monthly stats)

When you join them, every raw row matches the same summary row. Here, we join carrier-level stats to raw flights, then group by origin: each AA flight adds AA's entire carrier total, each UA flight adds UA's total, and so on:
```malloy
source: carrier_stats is duckdb.sql("""
  SELECT carrier, COUNT(*) as carrier_total
  FROM '../data/flights.parquet'
  GROUP BY carrier
""")

source: flights_p2 is duckdb.table('../data/flights.parquet') extend {
  join_one: carrier_stats on carrier = carrier_stats.carrier
}

// WRONG: grouping by origin but summing carrier totals
run: flights_p2 -> {
  group_by: origin
  aggregate:
    actual_flights is count()
    wrong_total is carrier_stats.carrier_total.sum()
  limit: 5
}
```

Result:

```json
[
  { "origin": "ATL", "actual_flights": 17875, "wrong_total": 244590 },
  { "origin": "DFW", "actual_flights": 17782, "wrong_total": 251234 },
  { "origin": "ORD", "actual_flights": 14214, "wrong_total": 223979 },
  { "origin": "PHX", "actual_flights": 12476, "wrong_total": 308769 },
  { "origin": "LAS", "actual_flights": 11096, "wrong_total": 285556 }
]
```

Generated SQL:

```sql
SELECT
  base."origin" as "origin",
  COUNT(1) as "actual_flights",
  COALESCE((
    SELECT SUM(a.val) as value
    FROM (
      SELECT UNNEST(list(distinct {key: carrier_stats_0."__distinct_key", val: carrier_stats_0."carrier_total"})) a
    )
  ), 0) as "wrong_total"
FROM '../data/flights.parquet' as base
LEFT JOIN (
  SELECT GEN_RANDOM_UUID() as "__distinct_key", x.*
  FROM (
    SELECT carrier, COUNT(*) as carrier_total
    FROM '../data/flights.parquet'
    GROUP BY carrier
  ) as x
) AS carrier_stats_0 ON base."carrier" = carrier_stats_0."carrier"
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 5
```
### Fix

Don't join aggregated tables to raw tables. Query them independently: use the raw table for granular questions and compute aggregations directly from it.
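In practice, that means asking each question at its natural grain. A sketch using the `flights_p2` source defined above (the join simply goes unused):

```malloy
// Per-origin counts straight from the raw table
run: flights_p2 -> {
  group_by: origin
  aggregate: actual_flights is count()
  limit: 5
}

// Per-carrier totals, also from the raw table -- no summary join needed
run: flights_p2 -> {
  group_by: carrier
  aggregate: carrier_total is count()
}
```

Each query aggregates exactly the rows in its own groups, so no total is double-counted.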
## Problem 3: Slow Query

### Symptoms

- Queries take minutes instead of seconds
- Timeouts on large datasets
### Fix

#### Option 0: Check your join cardinality

Join fan-out is a common cause of slow queries. If a `join_one` isn't actually one-to-one, your dataset explodes silently. See Non-Unique Join Key to diagnose.

#### Option 1: Add filters to reduce data scanned

```malloy
run: flights -> {
  where: dep_time ? @2003 // Filter to one year
  aggregate: flight_count
}
```
#### Option 2: Materialize a view of the data

If the same aggregation is queried repeatedly (across dashboards, by multiple users, or by LLM agents), pre-aggregate your data using MalloySQL:

```sql
-- connection:duckdb

import "models/flights.malloy"

CREATE TABLE monthly_stats AS %{
  flights -> {
    group_by: dep_time.month
    aggregate: flight_count, total_distance
  }
}%
```
Then model against the materialized table for faster queries.
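The materialized table can then be modeled like any other. A sketch, assuming the `CREATE TABLE` above ran and that the grouped column materializes under the name `dep_time_month` (the column names here are assumptions, not guaranteed output):

```malloy
// Hypothetical model over the pre-aggregated table
source: monthly_stats_src is duckdb.table('monthly_stats') extend {
  measure: total_flights is flight_count.sum()
}

run: monthly_stats_src -> {
  group_by: dep_time_month
  aggregate: total_flights
}
```

Because the heavy aggregation already happened at materialization time, this query only scans one row per month.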
See Transform & Materialize for details on materializing views with the Malloy CLI.
## Problem 4: Circular References

### Symptoms

- Can't define two sources that each join to the other

### The Problem

Malloy compiles linearly. You can't define orders joining to customers while customers also joins to orders: the second source doesn't exist yet when the first is compiled.
### Fix

Define base sources first, then extend them with joins. For cyclical relationships, add a third pass:

```malloy
// Base sources
source: _orders is duckdb.table('orders') extend { ... }
source: _customers is duckdb.table('customers') extend { ... }

// Modeled sources with one direction of joins
source: orders_modeled is _orders extend {
  join_one: customer is _customers on customer_id = customer.id
}
source: customers_modeled is _customers extend { ... }

// Final sources with cyclical joins added
source: customers is customers_modeled extend {
  join_many: orders is orders_modeled on id = orders.customer_id
}
```
## Problem 5: Date Arithmetic

### Symptoms

- "How do I filter to the last 30 days?"
- "How do I do `current_date - interval '50 days'`?"
### Fix

Use Malloy's time literals and ranges:

```malloy
run: events -> {
  // Last 30 days
  where: event_date > @now - 30 days

  // Specific range
  where: event_date ? @2024-01-01 to @2024-03-31

  // Last complete month
  where: event_date ? @now.month - 1 month

  aggregate: event_count is count()
}
```
See Timestamp Operations for more patterns.
## Problem 6: Multi-hop Joins

### Symptoms

- Need to join A to C through intermediate table B
- "Cannot reference joined_table.joined_field in join condition"

### The Problem

You can't reference a joined table's fields in another join condition:

```malloy
// This doesn't work
source: orders is duckdb.table('orders') extend {
  join_one: customers with customer_id
  // Can't use customers.region in this join condition
  join_one: shipments on shipments.region = customers.region
}
```
### Fix

#### Option 1: Flatten the join path

Join directly if possible:

```malloy
source: orders is duckdb.table('orders') extend {
  join_one: customers with customer_id
  join_one: shipments with shipment_id // Direct join
}
```
#### Option 2: Denormalize in SQL source

```malloy
source: orders_with_region is duckdb.sql("""
  SELECT o.*, c.region
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id
""") extend {
  join_one: shipments on region = shipments.region
}
```
Note: Denormalization sacrifices the live semantic relationship. If `customers.region` changes in the source table, the denormalized source won't reflect it until you rebuild the SQL query.
#### Option 3: Use separate sources

If you need different join paths for different questions, create purpose-specific sources rather than one uber-source.
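For instance, one source per join path. A sketch, assuming `customers` and `shipments` sources already exist with declared primary keys (table and key names are illustrative):

```malloy
// Each source answers one family of questions,
// instead of a single uber-source carrying every join
source: orders_by_customer is duckdb.table('orders') extend {
  join_one: customers with customer_id
}

source: orders_by_shipment is duckdb.table('orders') extend {
  join_one: shipments with shipment_id
}
```

Queries about customers use the first source; queries about shipments use the second. Neither pays for joins it doesn't need.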
## Common Error Messages

Quick reference for error strings and their likely causes:

| Error Message | Likely Cause | Solution |
|---|---|---|
| `Field 'x' not found in source 'y'` | Typo in field name, or missing import | Check spelling; verify the import includes the source |
| `Cannot use aggregate in dimension` | Using a measure where a dimension is expected | Define as `measure:` not `dimension:`, or move to an `aggregate:` block |
| `Reference to undefined object` | Circular references | See Circular References |
| `Source 'x' not found` | Missing import statement | Add `import "path/to/file.malloy"` |
| `Cannot join on expression` | Using a joined field in another join condition | See Multi-hop Joins |
| `Primary key required for join` | Missing `primary_key` declaration on joined source | Add `primary_key: field_name` to the target source |
| Syntax error on `year`, `date`, `timestamp` | Using a reserved word as field name | Use backticks: `` `year` ``, `` `date` ``, `` `timestamp` `` |