A semantic model is the interface to your data—it defines what a "customer" is, how "revenue" is calculated, which tables relate to each other. Instead of scattering this logic across SQL queries, you define it once and reuse it everywhere.
In the previous page, Querying a Semantic Model, we learned how to query an existing model. Now we'll learn how to build one.
1. Define Your Sources
A source is a table plus metadata. Start with just the table:
source: flights is duckdb.table('../data/flights.parquet') extend {} source: airports is duckdb.table('../data/airports.parquet') extend {} source: carriers is duckdb.table('../data/carriers.parquet') extend {}
That's it. You now have sources you can query.
The example above uses DuckDB with local files. To connect to other databases (Postgres, BigQuery, Snowflake, etc.), see Database Support.
2. Explore with VS Code
Before writing any model code, use the VS Code extension to understand your data.
Schema View
Click Schema in the sidebar to see:
All columns and their types
Sample values
What you're working with
Once you've defined a source, you can also open Explorer (the visual query builder) directly in VS Code—click the source name and select Explore from the menu.
Preview Data
You can also click Preview just above the source: definition in VS Code. This runs a quick select on the top 20 rows from the source. Or in code:
run: duckdb.table('../data/flights.parquet') -> { select: * limit: 5 }[ { "arr_delay": -6, "arr_time": "2004-11-18T23:09:00.000Z", "cancelled": "N", "carrier": "US", "dep_delay": -3, "dep_time": "2004-11-18T22:32:00.000Z", "destination": "ABE", "distance": 55, "diverted": "N", "flight_num": "1692", "flight_time": 15, "id2": 30272525, "origin": "PHL", "tail_num": "N806MD", "taxi_in": 4, "taxi_out": 18 }, { "arr_delay": 0, "arr_time": "2004-10-12T21:28:00.000Z", "cancelled": "N", "carrier": "US", "dep_delay": 6, "dep_time": "2004-10-12T20:46:00.000Z", "destination": "ABE", "distance": 55, "diverted": "N", "flight_num": "1650", "flight_time": 18, "id2": 29742442, "origin": "PHL", "tail_num": "N806MD", "taxi_in": 4, "taxi_out": 20 }, { "arr_delay": 2, "arr_time": "2004-11-24T11:14:00.000Z", "cancelled": "N", "carrier": "US", "dep_delay": 0, "dep_time": "2004-11-24T10:20:00.000Z", "destination": "ABE", "distance": 55, "diverted": "N", "flight_num": "1616", "flight_time": 19, "id2": 30270885, "origin": "PHL", "tail_num": "N816MA", "taxi_in": 5, "taxi_out": 30 }, { "arr_delay": -19, "arr_time": "2004-08-31T21:06:00.000Z", "cancelled": "N", "carrier": "US", "dep_delay": 0, "dep_time": "2004-08-31T20:30:00.000Z", "destination": "ABE", "distance": 55, "diverted": "N", "flight_num": "1650", "flight_time": 17, "id2": 28344746, "origin": "PHL", "tail_num": "N806MD", "taxi_in": 4, "taxi_out": 15 }, { "arr_delay": -19, "arr_time": "2004-07-27T10:59:00.000Z", "cancelled": "N", "carrier": "US", "dep_delay": -4, "dep_time": "2004-07-27T10:21:00.000Z", "destination": "ABE", "distance": 55, "diverted": "N", "flight_num": "1643", "flight_time": 17, "id2": 27898410, "origin": "PHL", "tail_num": "N806MD", "taxi_in": 4, "taxi_out": 17 } ]
SELECT base."arr_delay" as "arr_delay", base."arr_time" as "arr_time", base."cancelled" as "cancelled", base."carrier" as "carrier", base."dep_delay" as "dep_delay", base."dep_time" as "dep_time", base."destination" as "destination", base."distance" as "distance", base."diverted" as "diverted", base."flight_num" as "flight_num", base."flight_time" as "flight_time", base."id2" as "id2", base."origin" as "origin", base."tail_num" as "tail_num", base."taxi_in" as "taxi_in", base."taxi_out" as "taxi_out" FROM '../data/flights.parquet' as base LIMIT 5
Run Ad-hoc Queries
run: duckdb.table('../data/flights.parquet') -> { group_by: carrier aggregate: flight_count is count() limit: 10 }
[ { "carrier": "WN", "flight_count": 88751 }, { "carrier": "US", "flight_count": 37683 }, { "carrier": "AA", "flight_count": 34577 }, { "carrier": "NW", "flight_count": 33580 }, { "carrier": "UA", "flight_count": 32757 } ]
SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 10
Explore until you understand:
What one row represents (the "grain")
What columns exist
What questions you want to answer
See Queries for the full query syntax.
3. Add Measures
Measures are reusable aggregations. Add them when you find yourself writing the same count() or sum() repeatedly.
source: flights is duckdb.table('../data/flights.parquet') extend { measure: flight_count is count() total_distance is sum(distance) avg_distance is avg(distance) } run: flights -> { aggregate: flight_count, total_distance, avg_distance }
[ { "flight_count": 344827, "total_distance": 255337195, "avg_distance": 740.4791243145113 } ]
SELECT COUNT(1) as "flight_count", COALESCE(SUM(base."distance"),0) as "total_distance", AVG(base."distance") as "avg_distance" FROM '../data/flights.parquet' as base
Filtered Measures
You can also apply filters directly to measures:
source: flights2 is duckdb.table('../data/flights.parquet') extend { measure: flight_count is count() delayed_flights is count() { where: dep_delay > 15 } long_flights is count() { where: distance > 1000 } delay_rate is delayed_flights / flight_count * 100 } run: flights2 -> { aggregate: flight_count, delayed_flights, delay_rate }
[ { "flight_count": 344827, "delayed_flights": 53524, "delay_rate": 15.52198638737686 } ]
SELECT COUNT(1) as "flight_count", COUNT(CASE WHEN base."dep_delay">15 THEN 1 END) as "delayed_flights", ((COUNT(CASE WHEN base."dep_delay">15 THEN 1 END))*1.0/(COUNT(1)))*100 as "delay_rate" FROM '../data/flights.parquet' as base
Beyond count() and sum(), Malloy supports avg(), min(), max(), and more. Use count(field) for distinct counts. See Aggregates for the full list.
4. Add Dimensions
Dimensions are reusable groupings. Add them when you keep grouping by the same expression.
source: flights3 is duckdb.table('../data/flights.parquet') extend { dimension: flight_year is dep_time.year flight_month is dep_time.month measure: flight_count is count() } run: flights3 -> { group_by: flight_year, flight_month aggregate: flight_count order_by: flight_year, flight_month }
[ { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-01-01T00:00:00.000Z", "flight_count": 4020 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-02-01T00:00:00.000Z", "flight_count": 3876 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-03-01T00:00:00.000Z", "flight_count": 4252 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-04-01T00:00:00.000Z", "flight_count": 3778 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-05-01T00:00:00.000Z", "flight_count": 4047 } ]
SELECT DATE_TRUNC('year', base."dep_time") as "flight_year", DATE_TRUNC('month', base."dep_time") as "flight_month", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1,2 ORDER BY 1 ASC NULLS LAST,2 ASC NULLS LAST
Category Dimensions
Create buckets from numeric columns:
source: flights4 is duckdb.table('../data/flights.parquet') extend { dimension: distance_bucket is pick 'Short' when distance < 500 pick 'Medium' when distance < 1000 else 'Long' measure: flight_count is count() } run: flights4 -> { group_by: distance_bucket aggregate: flight_count }
[ { "distance_bucket": "Short", "flight_count": 154125 }, { "distance_bucket": "Medium", "flight_count": 100700 }, { "distance_bucket": "Long", "flight_count": 90002 } ]
SELECT CASE WHEN (base."distance"<500) THEN 'Short' WHEN (base."distance"<1000) THEN 'Medium' ELSE 'Long' END as "distance_bucket", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ORDER BY 2 desc NULLS LAST
Dimensions can be any expression—date extractions, string manipulations, case logic, and more.
5. Add Joins
Joins connect related tables. Once defined, all joined data becomes queryable—you define the relationships once and never have to write join logic in queries again.
source: carriers is duckdb.table('../data/carriers.parquet') extend { primary_key: code // Tells Malloy this field uniquely identifies rows } source: flights5 is duckdb.table('../data/flights.parquet') extend { join_one: carriers with carrier // joins where carrier = carriers.code measure: flight_count is count() } run: flights5 -> { group_by: carriers.nickname aggregate: flight_count limit: 10 }
[ { "nickname": "Southwest", "flight_count": 88751 }, { "nickname": "USAir", "flight_count": 37683 }, { "nickname": "American", "flight_count": 34577 }, { "nickname": "Northwest", "flight_count": 33580 }, { "nickname": "United", "flight_count": 32757 } ]
SELECT carriers_0."nickname" as "nickname", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers_0 ON carriers_0."code"=base."carrier" GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 10
Join Types
join_one: Each flight has one carrier (many-to-one). Use when querying from the "many" side.join_many: Each carrier has many flights (one-to-many). Use when querying from the "one" side—e.g., starting from carriers and analyzing their flights.
Multiple Joins to Same Table
Use an alias when joining a table with itself:
source: airports2 is duckdb.table('../data/airports.parquet') extend { primary_key: code } source: flights6 is duckdb.table('../data/flights.parquet') extend { join_one: origin_airport is airports2 with origin join_one: dest_airport is airports2 with destination measure: flight_count is count() } run: flights6 -> { group_by: origin_city is origin_airport.city dest_city is dest_airport.city aggregate: flight_count limit: 10 }
[ { "origin_city": "WASHINGTON", "dest_city": "NEW YORK", "flight_count": 2195 }, { "origin_city": "NEW YORK", "dest_city": "WASHINGTON", "flight_count": 2174 }, { "origin_city": "NEW YORK", "dest_city": "BOSTON", "flight_count": 1292 }, { "origin_city": "BOSTON", "dest_city": "NEW YORK", "flight_count": 1278 }, { "origin_city": "LOS ANGELES", "dest_city": "LAS VEGAS", "flight_count": 1073 } ]
SELECT origin_airport_0."city" as "origin_city", dest_airport_0."city" as "dest_city", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_airport_0 ON origin_airport_0."code"=base."origin" LEFT JOIN '../data/airports.parquet' AS dest_airport_0 ON dest_airport_0."code"=base."destination" GROUP BY 1,2 ORDER BY 3 desc NULLS LAST LIMIT 10
Malloy handles aggregates correctly across joins—no more fanout bugs. It automatically adjusts aggregations based on join cardinality, so count() returns the right number even through one-to-many joins. See symmetric aggregates for details.
See Joins for more join patterns.
6. Add Views
Views are saved queries. Add them when you have analysis patterns you'll reuse.
source: carriers3 is duckdb.table('../data/carriers.parquet') extend { primary_key: code } source: flights7 is duckdb.table('../data/flights.parquet') extend { join_one: carriers3 with carrier dimension: flight_year is dep_time.year flight_month is dep_time.month measure: flight_count is count() total_distance is sum(distance) avg_distance is avg(distance) delayed_flights is count() { where: dep_delay > 15 } delay_rate is delayed_flights / flight_count * 100 view: monthly_summary is { group_by: flight_year, flight_month aggregate: flight_count, total_distance order_by: flight_year, flight_month } view: carrier_performance is { group_by: carriers3.nickname aggregate: flight_count, delay_rate order_by: flight_count desc } } run: flights7 -> monthly_summary
[ { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-01-01T00:00:00.000Z", "flight_count": 4020, "total_distance": 3425091 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-02-01T00:00:00.000Z", "flight_count": 3876, "total_distance": 3181806 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-03-01T00:00:00.000Z", "flight_count": 4252, "total_distance": 3568248 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-04-01T00:00:00.000Z", "flight_count": 3778, "total_distance": 3208048 }, { "flight_year": "2000-01-01T00:00:00.000Z", "flight_month": "2000-05-01T00:00:00.000Z", "flight_count": 4047, "total_distance": 3319093 } ]
SELECT DATE_TRUNC('year', base."dep_time") as "flight_year", DATE_TRUNC('month', base."dep_time") as "flight_month", COUNT(1) as "flight_count", COALESCE(SUM(base."distance"),0) as "total_distance" FROM '../data/flights.parquet' as base GROUP BY 1,2 ORDER BY 1 ASC NULLS LAST,2 ASC NULLS LAST
malloy run: flights7 -> carrier_performance
Ad-hoc Queries Still Work
Views don't limit you. You can still write any query:
run: flights7 -> { where: dep_time ? @2003-01 group_by: carriers3.nickname aggregate: flight_count, avg_distance order_by: flight_count desc limit: 5 }
[ { "nickname": "Southwest", "flight_count": 1209, "avg_distance": 643.7229114971051 }, { "nickname": "Atlantic Southeast", "flight_count": 547, "avg_distance": 212.12797074954295 }, { "nickname": "United", "flight_count": 524, "avg_distance": 1187.3034351145038 }, { "nickname": "American", "flight_count": 422, "avg_distance": 1235.521327014218 }, { "nickname": "Northwest", "flight_count": 419, "avg_distance": 1063.6229116945108 } ]
SELECT carriers3_0."nickname" as "nickname", COUNT(1) as "flight_count", AVG(base."distance") as "avg_distance" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers3_0 ON carriers3_0."code"=base."carrier" WHERE (base."dep_time">=TIMESTAMP '2003-01-01 00:00:00') and (base."dep_time"<TIMESTAMP '2003-02-01 00:00:00') GROUP BY 1 ORDER BY 2 desc NULLS LAST LIMIT 5
See Views for more view patterns including nested views.
Complete Example
In practice, put your source definitions in a .malloy file and import them into notebooks for analysis. Here's what a complete model looks like:
source: carriers_full is duckdb.table('../data/carriers.parquet') extend { primary_key: code } source: airports_full is duckdb.table('../data/airports.parquet') extend { primary_key: code } source: flights_full is duckdb.table('../data/flights.parquet') extend { join_one: carriers_full with carrier join_one: origin_airport is airports_full with origin join_one: dest_airport is airports_full with destination dimension: flight_year is dep_time.year flight_month is dep_time.month distance_bucket is pick 'Short' when distance < 500 pick 'Medium' when distance < 1000 else 'Long' measure: flight_count is count() total_distance is sum(distance) avg_distance is avg(distance) delayed_flights is count() { where: dep_delay > 15 } delay_rate is delayed_flights / flight_count * 100 view: monthly_summary is { group_by: flight_year, flight_month aggregate: flight_count, total_distance order_by: flight_year, flight_month } view: carrier_stats is { group_by: carriers_full.nickname aggregate: flight_count, avg_distance, delay_rate order_by: flight_count desc } view: route_analysis is { group_by: origin_city is origin_airport.city dest_city is dest_airport.city aggregate: flight_count, avg_distance order_by: flight_count desc limit: 20 } } run: flights_full -> carrier_stats
[ { "nickname": "Southwest", "flight_count": 88751, "avg_distance": 615.4201304774031, "delay_rate": 16.919245980326984 }, { "nickname": "USAir", "flight_count": 37683, "avg_distance": 629.5051349414855, "delay_rate": 15.261523764031526 }, { "nickname": "American", "flight_count": 34577, "avg_distance": 1089.8830147207682, "delay_rate": 14.859588743962751 }, { "nickname": "Northwest", "flight_count": 33580, "avg_distance": 993.9399344848124, "delay_rate": 13.764145324597976 }, { "nickname": "United", "flight_count": 32757, "avg_distance": 1187.0114479347926, "delay_rate": 15.346338187257686 } ]
SELECT carriers_full_0."nickname" as "nickname", COUNT(1) as "flight_count", AVG(base."distance") as "avg_distance", ((COUNT(CASE WHEN base."dep_delay">15 THEN 1 END))*1.0/(COUNT(1)))*100 as "delay_rate" FROM '../data/flights.parquet' as base LEFT JOIN '../data/carriers.parquet' AS carriers_full_0 ON carriers_full_0."code"=base."carrier" GROUP BY 1 ORDER BY 2 desc NULLS LAST
The Workflow
Define sources - Point at your tables
Explore - Use VS Code to understand data
Query - Write ad-hoc queries to answer questions
Notice patterns - Same aggregations? Same groupings?
Promote - Move repeated patterns to measures/dimensions
Add joins - When you need data from other tables
Save views - For queries you'll reuse
Build iteratively. Start simple, add what you need.
Learn More
This guide covers the basics. See the language reference for filters, functions, window calculations, and nested views. For common analytics patterns, see percent of total, year-over-year, and cohort analysis.
Next Steps
Advanced Modeling - Deep dive into nesting, pipelines, and complex joins
Troubleshooting - Fix common modeling issues
Publishing - Deploy your model
Language Reference - Full syntax documentation