Building a Semantic Model

VSCode

A semantic model is the interface to your data—it defines what a "customer" is, how "revenue" is calculated, which tables relate to each other. Instead of scattering this logic across SQL queries, you define it once and reuse it everywhere.

In the previous page, Querying a Semantic Model, we learned how to query an existing model. Now we'll learn how to build one.

1. Define Your Sources

A source is a table plus metadata. Start with just the table:

source: flights is duckdb.table('../data/flights.parquet') extend {}
source: airports is duckdb.table('../data/airports.parquet') extend {}
source: carriers is duckdb.table('../data/carriers.parquet') extend {}

That's it. You now have sources you can query.

The example above uses DuckDB with local files. To connect to other databases (Postgres, BigQuery, Snowflake, etc.), see Database Support.

2. Explore with VS Code

Before writing any model code, use the VS Code extension to understand your data.

Schema View

Click Schema in the sidebar to see:

All columns and their types
Sample values
What you're working with

Once you've defined a source, you can also open Explorer (the visual query builder) directly in VS Code—click the source name and select Explore from the menu.

Preview Data

You can also click Preview just above the source: definition in VS Code. This runs a quick select on the top 20 rows from the source. Or in code:

run: duckdb.table('../data/flights.parquet') -> { select: * limit: 5 }

QUERY RESULTS

[
  {
    "arr_delay": "-6",
    "arr_time": "2004-11-18T23:09:00.000Z",
    "cancelled": "N",
    "carrier": "US",
    "dep_delay": "-3",
    "dep_time": "2004-11-18T22:32:00.000Z",
    "destination": "ABE",
    "distance": "55",
    "diverted": "N",
    "flight_num": "1692",
    "flight_time": "15",
    "id2": "30272525",
    "origin": "PHL",
    "tail_num": "N806MD",
    "taxi_in": "4",
    "taxi_out": "18"
  },
  {
    "arr_delay": "0",
    "arr_time": "2004-10-12T21:28:00.000Z",
    "cancelled": "N",
    "carrier": "US",
    "dep_delay": "6",
    "dep_time": "2004-10-12T20:46:00.000Z",
    "destination": "ABE",
    "distance": "55",
    "diverted": "N",
    "flight_num": "1650",
    "flight_time": "18",
    "id2": "29742442",
    "origin": "PHL",
    "tail_num": "N806MD",
    "taxi_in": "4",
    "taxi_out": "20"
  },
  {
    "arr_delay": "2",
    "arr_time": "2004-11-24T11:14:00.000Z",
    "cancelled": "N",
    "carrier": "US",
    "dep_delay": "0",
    "dep_time": "2004-11-24T10:20:00.000Z",
    "destination": "ABE",
    "distance": "55",
    "diverted": "N",
    "flight_num": "1616",
    "flight_time": "19",
    "id2": "30270885",
    "origin": "PHL",
    "tail_num": "N816MA",
    "taxi_in": "5",
    "taxi_out": "30"
  },
  {
    "arr_delay": "-19",
    "arr_time": "2004-08-31T21:06:00.000Z",
    "cancelled": "N",
    "carrier": "US",
    "dep_delay": "0",
    "dep_time": "2004-08-31T20:30:00.000Z",
    "destination": "ABE",
    "distance": "55",
    "diverted": "N",
    "flight_num": "1650",
    "flight_time": "17",
    "id2": "28344746",
    "origin": "PHL",
    "tail_num": "N806MD",
    "taxi_in": "4",
    "taxi_out": "15"
  },
  {
    "arr_delay": "-19",
    "arr_time": "2004-07-27T10:59:00.000Z",
    "cancelled": "N",
    "carrier": "US",
    "dep_delay": "-4",
    "dep_time": "2004-07-27T10:21:00.000Z",
    "destination": "ABE",
    "distance": "55",
    "diverted": "N",
    "flight_num": "1643",
    "flight_time": "17",
    "id2": "27898410",
    "origin": "PHL",
    "tail_num": "N806MD",
    "taxi_in": "4",
    "taxi_out": "17"
  }
]

SELECT 
   base."arr_delay" as "arr_delay",
   base."arr_time" as "arr_time",
   base."cancelled" as "cancelled",
   base."carrier" as "carrier",
   base."dep_delay" as "dep_delay",
   base."dep_time" as "dep_time",
   base."destination" as "destination",
   base."distance" as "distance",
   base."diverted" as "diverted",
   base."flight_num" as "flight_num",
   base."flight_time" as "flight_time",
   base."id2" as "id2",
   base."origin" as "origin",
   base."tail_num" as "tail_num",
   base."taxi_in" as "taxi_in",
   base."taxi_out" as "taxi_out"
FROM '../data/flights.parquet' as base
LIMIT 5

Run Ad-hoc Queries

run: duckdb.table('../data/flights.parquet') -> {
  group_by: carrier
  aggregate: flight_count is count()
  limit: 10
}

QUERY RESULTS

[
  {
    "carrier": "WN",
    "flight_count": 88751
  },
  {
    "carrier": "US",
    "flight_count": 37683
  },
  {
    "carrier": "AA",
    "flight_count": 34577
  },
  {
    "carrier": "NW",
    "flight_count": 33580
  },
  {
    "carrier": "UA",
    "flight_count": 32757
  }
]

SELECT 
   base."carrier" as "carrier",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 10

Explore until you understand:

What one row represents (the "grain")
What columns exist
What questions you want to answer

See Queries for the full query syntax.

3. Add Measures

Measures are reusable aggregations. Add them when you find yourself writing the same count() or sum() repeatedly.

source: flights is duckdb.table('../data/flights.parquet') extend {
  measure:
    flight_count is count()
    total_distance is sum(distance)
    avg_distance is avg(distance)
}

run: flights -> { aggregate: flight_count, total_distance, avg_distance }

QUERY RESULTS

[
  {
    "flight_count": 344827,
    "total_distance": "255337195",
    "avg_distance": 740.4791243145113
  }
]

SELECT 
   COUNT(1) as "flight_count",
   COALESCE(SUM(base."distance"),0) as "total_distance",
   AVG(base."distance") as "avg_distance"
FROM '../data/flights.parquet' as base

Filtered Measures

You can also apply filters directly to measures:

source: flights2 is duckdb.table('../data/flights.parquet') extend {
  measure:
    flight_count is count()
    delayed_flights is count() { where: dep_delay > 15 }
    long_flights is count() { where: distance > 1000 }
    delay_rate is delayed_flights / flight_count * 100
}

run: flights2 -> { aggregate: flight_count, delayed_flights, delay_rate }

QUERY RESULTS

[
  {
    "flight_count": 344827,
    "delayed_flights": 53524,
    "delay_rate": 15.52198638737686
  }
]

SELECT 
   COUNT(1) as "flight_count",
   COUNT(CASE WHEN base."dep_delay">15 THEN 1 END) as "delayed_flights",
   ((COUNT(CASE WHEN base."dep_delay">15 THEN 1 END))*1.0/(COUNT(1)))*100 as "delay_rate"
FROM '../data/flights.parquet' as base

Beyond count() and sum(), Malloy supports avg(), min(), max(), and more. Use count(field) for distinct counts. See Aggregates for the full list.

4. Add Dimensions

Dimensions are reusable groupings. Add them when you keep grouping by the same expression.

source: flights3 is duckdb.table('../data/flights.parquet') extend {
  dimension:
    flight_year is dep_time.year
    flight_month is dep_time.month

  measure:
    flight_count is count()
}

run: flights3 -> {
  group_by: flight_year, flight_month
  aggregate: flight_count
  order_by: flight_year, flight_month
}

QUERY RESULTS

[
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-01-01T00:00:00.000Z",
    "flight_count": 4020
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-02-01T00:00:00.000Z",
    "flight_count": 3876
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-03-01T00:00:00.000Z",
    "flight_count": 4252
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-04-01T00:00:00.000Z",
    "flight_count": 3778
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-05-01T00:00:00.000Z",
    "flight_count": 4047
  }
]

SELECT 
   DATE_TRUNC('year', base."dep_time") as "flight_year",
   DATE_TRUNC('month', base."dep_time") as "flight_month",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1,2
ORDER BY 1 ASC NULLS LAST,2 ASC NULLS LAST

Category Dimensions

Create buckets from numeric columns:

source: flights4 is duckdb.table('../data/flights.parquet') extend {
  dimension:
    distance_bucket is
      pick 'Short' when distance < 500
      pick 'Medium' when distance < 1000
      else 'Long'

  measure: flight_count is count()
}

run: flights4 -> {
  group_by: distance_bucket
  aggregate: flight_count
}

QUERY RESULTS

[
  {
    "distance_bucket": "Short",
    "flight_count": 154125
  },
  {
    "distance_bucket": "Medium",
    "flight_count": 100700
  },
  {
    "distance_bucket": "Long",
    "flight_count": 90002
  }
]

SELECT 
   CASE WHEN (base."distance"<500) THEN 'Short' WHEN (base."distance"<1000) THEN 'Medium' ELSE 'Long' END as "distance_bucket",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Dimensions can be any expression—date extractions, string manipulations, case logic, and more.

5. Add Joins

Joins connect related tables. Once defined, all joined data becomes queryable—you define the relationships once and never have to write join logic in queries again.

source: carriers is duckdb.table('../data/carriers.parquet') extend {
  primary_key: code  // Tells Malloy this field uniquely identifies rows
}

source: flights5 is duckdb.table('../data/flights.parquet') extend {
  join_one: carriers with carrier  // joins where carrier = carriers.code

  measure: flight_count is count()
}

run: flights5 -> {
  group_by: carriers.nickname
  aggregate: flight_count
  limit: 10
}

QUERY RESULTS

[
  {
    "nickname": "Southwest",
    "flight_count": 88751
  },
  {
    "nickname": "USAir",
    "flight_count": 37683
  },
  {
    "nickname": "American",
    "flight_count": 34577
  },
  {
    "nickname": "Northwest",
    "flight_count": 33580
  },
  {
    "nickname": "United",
    "flight_count": 32757
  }
]

SELECT 
   carriers_0."nickname" as "nickname",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/carriers.parquet' AS carriers_0
  ON carriers_0."code"=base."carrier"
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 10

Join Types

join_one: Each flight has one carrier (many-to-one). Use when querying from the "many" side.
join_many: Each carrier has many flights (one-to-many). Use when querying from the "one" side—e.g., starting from carriers and analyzing their flights.

Multiple Joins to Same Table

Use an alias when joining a table with itself:

source: airports2 is duckdb.table('../data/airports.parquet') extend {
  primary_key: code
}

source: flights6 is duckdb.table('../data/flights.parquet') extend {
  join_one: origin_airport is airports2 with origin
  join_one: dest_airport is airports2 with destination

  measure: flight_count is count()
}

run: flights6 -> {
  group_by:
    origin_city is origin_airport.city
    dest_city is dest_airport.city
  aggregate: flight_count
  limit: 10
}

QUERY RESULTS

[
  {
    "origin_city": "WASHINGTON",
    "dest_city": "NEW YORK",
    "flight_count": 2195
  },
  {
    "origin_city": "NEW YORK",
    "dest_city": "WASHINGTON",
    "flight_count": 2174
  },
  {
    "origin_city": "NEW YORK",
    "dest_city": "BOSTON",
    "flight_count": 1292
  },
  {
    "origin_city": "BOSTON",
    "dest_city": "NEW YORK",
    "flight_count": 1278
  },
  {
    "origin_city": "LOS ANGELES",
    "dest_city": "LAS VEGAS",
    "flight_count": 1073
  }
]

SELECT 
   origin_airport_0."city" as "origin_city",
   dest_airport_0."city" as "dest_city",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_airport_0
  ON origin_airport_0."code"=base."origin"
 LEFT JOIN '../data/airports.parquet' AS dest_airport_0
  ON dest_airport_0."code"=base."destination"
GROUP BY 1,2
ORDER BY 3 desc NULLS LAST
LIMIT 10

Malloy handles aggregates correctly across joins—no more fanout bugs. It automatically adjusts aggregations based on join cardinality, so count() returns the right number even through one-to-many joins. See symmetric aggregates for details.

See Joins for more join patterns.

6. Add Views

Views are saved queries. Add them when you have analysis patterns you'll reuse.

source: carriers3 is duckdb.table('../data/carriers.parquet') extend {
  primary_key: code
}

source: flights7 is duckdb.table('../data/flights.parquet') extend {
  join_one: carriers3 with carrier

  dimension:
    flight_year is dep_time.year
    flight_month is dep_time.month

  measure:
    flight_count is count()
    total_distance is sum(distance)
    avg_distance is avg(distance)
    delayed_flights is count() { where: dep_delay > 15 }
    delay_rate is delayed_flights / flight_count * 100

  view: monthly_summary is {
    group_by: flight_year, flight_month
    aggregate: flight_count, total_distance
    order_by: flight_year, flight_month
  }

  view: carrier_performance is {
    group_by: carriers3.nickname
    aggregate: flight_count, delay_rate
    order_by: flight_count desc
  }
}

run: flights7 -> monthly_summary

QUERY RESULTS

[
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-01-01T00:00:00.000Z",
    "flight_count": 4020,
    "total_distance": "3425091"
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-02-01T00:00:00.000Z",
    "flight_count": 3876,
    "total_distance": "3181806"
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-03-01T00:00:00.000Z",
    "flight_count": 4252,
    "total_distance": "3568248"
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-04-01T00:00:00.000Z",
    "flight_count": 3778,
    "total_distance": "3208048"
  },
  {
    "flight_year": "2000-01-01T00:00:00.000Z",
    "flight_month": "2000-05-01T00:00:00.000Z",
    "flight_count": 4047,
    "total_distance": "3319093"
  }
]

SELECT 
   DATE_TRUNC('year', base."dep_time") as "flight_year",
   DATE_TRUNC('month', base."dep_time") as "flight_month",
   COUNT(1) as "flight_count",
   COALESCE(SUM(base."distance"),0) as "total_distance"
FROM '../data/flights.parquet' as base
GROUP BY 1,2
ORDER BY 1 ASC NULLS LAST,2 ASC NULLS LAST

malloy run: flights7 -> carrier_performance

Ad-hoc Queries Still Work

Views don't limit you. You can still write any query:

run: flights7 -> {
  where: dep_time ? @2003-01
  group_by: carriers3.nickname
  aggregate: flight_count, avg_distance
  order_by: flight_count desc
  limit: 5
}

QUERY RESULTS

[
  {
    "nickname": "Southwest",
    "flight_count": 1209,
    "avg_distance": 643.7229114971051
  },
  {
    "nickname": "Atlantic Southeast",
    "flight_count": 547,
    "avg_distance": 212.12797074954295
  },
  {
    "nickname": "United",
    "flight_count": 524,
    "avg_distance": 1187.3034351145038
  },
  {
    "nickname": "American",
    "flight_count": 422,
    "avg_distance": 1235.521327014218
  },
  {
    "nickname": "Northwest",
    "flight_count": 419,
    "avg_distance": 1063.6229116945108
  }
]

SELECT 
   carriers3_0."nickname" as "nickname",
   COUNT(1) as "flight_count",
   AVG(base."distance") as "avg_distance"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/carriers.parquet' AS carriers3_0
  ON carriers3_0."code"=base."carrier"
WHERE (base."dep_time">=TIMESTAMP '2003-01-01 00:00:00') and (base."dep_time"<TIMESTAMP '2003-02-01 00:00:00')
GROUP BY 1
ORDER BY 2 desc NULLS LAST
LIMIT 5

See Views for more view patterns including nested views.

Complete Example

In practice, put your source definitions in a .malloy file and import them into notebooks for analysis. Here's what a complete model looks like:

source: carriers_full is duckdb.table('../data/carriers.parquet') extend {
  primary_key: code
}
source: airports_full is duckdb.table('../data/airports.parquet') extend {
  primary_key: code
}

source: flights_full is duckdb.table('../data/flights.parquet') extend {
  join_one: carriers_full with carrier
  join_one: origin_airport is airports_full with origin
  join_one: dest_airport is airports_full with destination

  dimension:
    flight_year is dep_time.year
    flight_month is dep_time.month
    distance_bucket is
      pick 'Short' when distance < 500
      pick 'Medium' when distance < 1000
      else 'Long'

  measure:
    flight_count is count()
    total_distance is sum(distance)
    avg_distance is avg(distance)
    delayed_flights is count() { where: dep_delay > 15 }
    delay_rate is delayed_flights / flight_count * 100

  view: monthly_summary is {
    group_by: flight_year, flight_month
    aggregate: flight_count, total_distance
    order_by: flight_year, flight_month
  }

  view: carrier_stats is {
    group_by: carriers_full.nickname
    aggregate: flight_count, avg_distance, delay_rate
    order_by: flight_count desc
  }

  view: route_analysis is {
    group_by:
      origin_city is origin_airport.city
      dest_city is dest_airport.city
    aggregate: flight_count, avg_distance
    order_by: flight_count desc
    limit: 20
  }
}

run: flights_full -> carrier_stats

QUERY RESULTS

[
  {
    "nickname": "Southwest",
    "flight_count": 88751,
    "avg_distance": 615.4201304774031,
    "delay_rate": 16.919245980326984
  },
  {
    "nickname": "USAir",
    "flight_count": 37683,
    "avg_distance": 629.5051349414855,
    "delay_rate": 15.261523764031526
  },
  {
    "nickname": "American",
    "flight_count": 34577,
    "avg_distance": 1089.8830147207682,
    "delay_rate": 14.859588743962751
  },
  {
    "nickname": "Northwest",
    "flight_count": 33580,
    "avg_distance": 993.9399344848124,
    "delay_rate": 13.764145324597976
  },
  {
    "nickname": "United",
    "flight_count": 32757,
    "avg_distance": 1187.0114479347926,
    "delay_rate": 15.346338187257686
  }
]

SELECT 
   carriers_full_0."nickname" as "nickname",
   COUNT(1) as "flight_count",
   AVG(base."distance") as "avg_distance",
   ((COUNT(CASE WHEN base."dep_delay">15 THEN 1 END))*1.0/(COUNT(1)))*100 as "delay_rate"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/carriers.parquet' AS carriers_full_0
  ON carriers_full_0."code"=base."carrier"
GROUP BY 1
ORDER BY 2 desc NULLS LAST

The Workflow

Define sources - Point at your tables
Explore - Use VS Code to understand data
Query - Write ad-hoc queries to answer questions
Notice patterns - Same aggregations? Same groupings?
Promote - Move repeated patterns to measures/dimensions
Add joins - When you need data from other tables
Save views - For queries you'll reuse

Build iteratively. Start simple, add what you need.

Learn More

This guide covers the basics. See the language reference for filters, functions, window calculations, and nested views. For common analytics patterns, see percent of total, year-over-year, and cohort analysis.

Next Steps

Advanced Modeling - Deep dive into nesting, pipelines, and complex joins
Troubleshooting - Fix common modeling issues
Publishing - Deploy your model
Language Reference - Full syntax documentation