Malloy for SQL Folks

VSCode

This document will to move you quickly into the Malloy language by showing how different queries translate SQL into Malloy.

Let's start with a basic query.

SQL queries can be directly translated into Malloy

-- connection: duckdb
SELECT 
  flights.origin as "Origin Code",
  orig.city,
  orig.full_name as "airport name",
  count(*) as flight_count,
  count(distinct tail_num) as aircraft_count,
  avg(distance) as avg_distance
  count(CASE WHEN dep_delay > 30 THEN 1 END) as delayed_flight_count
  count(CASE WHEN dep_delay > 30 THEN 1 END)/count(*) as percent_delayed
FROM '../data/flights.parquet' as flights
LEFT JOIN '../data/airports.parquet' as orig ON flights.origin = orig.code
WHERE destination = 'SFO'
GROUP BY 1,2,3
ORDER BY count(*) desc
LIMIT 10

Translation of SQL into Malloy

In converting from SQL to Malloy, we refactor the query into two parts. The resuable components go into a source: block, The transformation goes into a query block.

 source: flights is                     -- (1) refactor the reusable parts 
    duckdb.table('../data/flights.parquet') -- (2) get schema
  extend {                              -- (3) add definitions
    join_one: orig is                   -- (3) joins need 'one' or 'many'
      duckdb.table('../data/airports.parquet') 
        on origin = orig.code
    dimension: 
        is_delayed is dep_delay > 30    -- (4) reusable scalcar calculation 
    measure:                            -- (5) aggregate definition
      flight_count is count()           -- (6) Mostly the same as count(*) 
      delayed_flight_count is 
        flight_count {where: is_delayed}  -- (7) filtered aggregates
      # percent                         -- (8) tell the renderer to show the result as a percent
      percent_delayed is              
        delayed_flight_count / flight_count  -- ** reuse calculations
}
 
run: flights -> {                       -- (9) FROM is first
  where: destination = 'SFO'
  group_by:
    `Origin Code` is origin             --  (10) 'is' vs 'as' 
                                        --  (11) the root table has no alias 
    orig.city                   --  Output column name is 'city', like SQL.
    `Airport Name` is orig.full_name    -- (12) Backtick to quote names 
  aggregate:
    flight_count            
    aircraft_count is count(tail_num)   -- (13) exactly the same as count(distinct tail_num) 
    avg_distance is distance.avg()      -- (14) aggregate locality
    delayed_flight_count
    percent_delayed
  order_by: flight_count desc           -- (15) uses the output name instead of expression or column ordinal
  limit: 10    
}

	Concept	Description
(1)	Reusable Components	The `source:` object containts the reusable parts of the query. In a query, joins and aggregate expressions and scalar calculations are often resuable between queries.
(2)	Get Schema	Malloy is strongly typed. Malloy reads the schema definition from tables during the compilation process
(3)	Joins need more info	Joins need an additional piece of information. Malloy guarentees that aggregate calculations won't be affected by by joins. In order to make this promise, when joining you have to tell Malloy if that data will fan out. Use `join_one:` if there won't be an increase in the number of rows after the join. `join_many:` if the join could cause there to me more rows.
(4)	`dimension:`	Dimensions are reusable scalar expressions that are written in terms of other definitions in the source. In this case we're making boolean `is_delayed` that can be used like a column in the flights table
(5)	`measure:`	Measures are reusable aggregate calculations.
(6)	count() rationalized	`count()` works a lot like `count(*)` does in conceptually SQL. `count()` it will return the number of rows in the root table that match, even if the data fans out. `alias.count()` will return the number of rows in a joined alias, again regardless of the fan out pattern.
(7)	Filtered Aggregates	All Aggregregate expressions can be simply filtered
(8)	Annotations	Any named object in Malloy can be annotated. Annotations come back as part of the compilation process. In this case we are annotation to tell the Malloy renderer to show the result as a percentage.
(9)	FROM is first	The order of declarations in Malloy flows in an more logical in that the from starts first. Malloy often looks like from -> {where: group_by:, aggregate: order_by: limit:}
(10)	'is' vs 'AS'	Malloy proritizes readiblity. Placing the name of the thing being declared makes the code easier to read. In most places where SQL would use 'AS' malloy uses 'is' and reverses the declaration order.
(11)	Aliases are Heirachial	Aliases in Malloy work as a heirachy as opposed to a flat namespacein SQL. In SQL every table has an alias. In Malloy, the root table has no alias (it does, but it is always called `source`). Joins in Malloy can be nested to create heirachies. In Malloy `orders.users.address` is how you might chase throuh join aliases.
(12)	Human identifiers	You can use arbitraray characters including spaces in identifier names. Available characters vary by SQL engine.
(13)	count distinct	`count(tail_num)` is a `count(distinct)` and exactly like SQL. SQL's `count(foo)` can be written as `count() {where: foo is not null}`
(14)	Relation aware agggregates	In SQL aggregate are against the resulting joined matrix. In Malloy aggregate are computed against the table they are joined from. In Malloy you can peform a sum or average anyplace in the join heiracty.
(15)	Order By uses output names	In a SQL order by, you have to either specify the expression for a column or the oridinal number of the column. In Malloy you write the order_by: using the name of the column as it will appear in the output.

Basic Structure of Malloy

The code above is an example of a query. Queries are invoked with run: that takes a query block as a parameter.

Malloy has essentially two types of objects, query: and source:. A query: is named a transformation. A source: is an object or network of objects that can be transformed. For example, a table is a source: and the output of a query is a source:

Malloy's concept of query: pretty much maps to a SQL's concecpt of a SELECT statement..

A source: in Malloy acts and an API to a dataset. A source: declaration is much like a function library in imperitive languages.

Like function libraries, sources hide complexity.

Query Block

The query bock is used to transform data.

A query block is always preceeded by the -> operator.

In the query example above, the query block follows the run:.

Components of a query block include group_by:, aggregate:, select:, order_by:, limit:, calculate:, where:, having: and index:.

The output of a query block is a table (a source: actually). The column names in the result table follow SQLs example in that the the last part unaliased name is the column name.

The query blocks can named and later joined, like SQL's WITH name AS (SELECT ...)

  query: airports_by_state is  duckdb.table(airports.parquet) -> {
     group_by: state, fac_type
     aggregate: airport_count is count()
  }

Queries can be chained in a pipeline, for example.

  run: duckdb.table(airports.parquet) -> {
     group_by: state, fac_type
     aggregate: airport_count is count()
  } -> {
    aggregte: total_airports is airport_count.sum()
  }

`source:` Declaration Block

A source: declaration block adds reusable calculations to an existing source, like a table. These declaration operate like methods or properties in object oriented languages (with the self being the table).

A declaration block is always preceeded by an extend.

In the query above the source declaration follows the extend:.

Source blocks can contain measure: (aggregate cacluation), dimension: (scalar calculation), where: (a filter to always apply to a source), primary_key:, view: (a predefined query), join_xxx: (declares relationship to another source).

source: airports is duckdb.table('airports.parquet') extend {
  primay_key: code
  dimension: city_state is concat(state,',',city)
  measure:
    airport_count is count()
    average_elevation is elevation.avg()
}

Once a source is declared, queries are represented more simply. Calculations can be used in any number of different queries.

Malloy	SQL	Notes
`run: airports -> { aggregate: airport_count }`	`SELECT count( * ) as airport_count FROM airports.parquet`	The calculation `airport_count` comes from the declaration in the source.
`run: airports -> { group_by: city_state aggregate: airport_count }`	`SELECT concat(state,',',city) as city_state count( * ) as airport_count FROM airports.parquet GROUP BY 1 ORDER BY count(*) desc`	Malloy has resonable defaults for ordering queries.

             Malloy

SQL

Notes

run: airports -> {
 aggregate: airport_count
}

SELECT
  count( * ) as airport_count
FROM airports.parquet

The calculation airport_count comes from the declaration in the source.

run: airports -> {
  group_by: city_state
 aggregate: airport_count
}

SELECT
  concat(state,',',city) as city_state
  count( * ) as airport_count
FROM airports.parquet
GROUP BY 1
ORDER BY count(*) desc

Malloy has resonable defaults for ordering queries.

Joins are declarations.

In Malloy, joins usually happen in sources (though they can appear in queries also). When writing a malloy query, a join in a source: is only invoked in a query if some member of element of the join is referenced . In order to accuratly compute aggregate calculations in joined sources, Malloy needs to know if a join fans out the data (potentially produces more rows after the join is invoked).

Assuming the following source

source: flights is duckdb.table('flights.parquet') extend {
  join_one: orig is duckdb.table('airports.parquet') on origin_code = orig.code
  measure:
    flight_count is count()
}

Malloy	SQL	Notes
`run: flights -> { aggregate: flight_count }`	`SELECT count( * ) as flight_count FROM airports.parquet`	Notice the join is NOT invoked
`run: flights -> { group_by: orig.city aggregate: fliight_count }`	`SELECT orig.city count( * ) as flight_count FROM flights.parquet as flights LEFT JOIN airports.parquet as orig ON origin_code=orig.code GROUP BY 1 ORDER BY count(*) DESC`	Note that the join is invoked because `orig.city` is referenced

             Malloy

SQL

Notes

run: flights -> {
  aggregate: flight_count
}

SELECT
  count( * ) as flight_count
 FROM airports.parquet

Notice the join is NOT invoked

run: flights -> {
  group_by: orig.city
  aggregate: fliight_count
}

SELECT
  orig.city 
  count( * ) as flight_count
 FROM flights.parquet as flights
LEFT JOIN airports.parquet as orig 
  ON origin_code=orig.code
GROUP BY 1
ORDER BY count(*) DESC

Note that the join is invoked because orig.city is referenced

Note: In SQL dialects that support nested/repeated structures and arrays, joins are autmatically declared.

Types

In Malloy, type are number, string, boolean, date, timestamp, record and array. Databases have many varieties of these types. Malloy attempts to keep the calculations within the native type system. You can cast to database native types in your Malloy expressions.

Malloy's boolean is two state when being returned in a query (it will never return null in a select: or a group_by:, only false.)

Cast and Safe Cast

In Malloy :: is the cast operator.
::: is the safe cast operator if the SQL dialect supports the operation.
You can cast to any native SQL type by placing the type name in quotes.

foo::"BIGINT"

Expressions

Expression and malloy are pretty similar to SQL. Where we've change the language, we've done it to make it more readable, modern or consistent and composable. If you write the SQL expression, generally, the compiler will tell you how to write the Malloy expression.

This is intended to serve as a quick reference, more complete documentation can be found here.

SQL	Malloy	Description / Docs
`SUM(), AVG(), MAX(), MIN(), COUNT(), etc`	`sum(), avg(), max(), min(), count(), etc...`	Basic SQL aggregations are supported verbatim, but it’s worth learning about Malloy’s additional aggregate locality / symmetric aggregate handling.
`CASE` `WHEN size_n < 3 THEN 'S'` `WHEN size_n <5 THEN 'M'` `ELSE 'L' END`	`size_n ?` `pick 'S' when < 3` `pick 'M' when <5` `else 'L'`	Pick is Malloy’s improvement of SQL’s `CASE` statement. This example also introduces the `?` Apply operator, which "applies" a value to another value, condition, or computation. This is most often used with partial comparisons or alternations.
`COUNT(CASE WHEN status = 'Returned' THEN 1 END),` `AVG(CASE WHEN brand = 'Levi's' THEN price END)`	`count() { where: status = 'Returned' }` `avg_price { where: brand = 'Levi\'s' }`	Aggregates may be filtered using filter expressions. Doc
`CAST(distance AS string),` `distance::string`	`distance::string`	Type Cast. Also worth reviewing Types doc.

Malloy has its own standard function library.

Builtin functions work the same across SQL Dialects

In order for Malloy code to execute the same on many SQL engines, Malloy has a set of functions that works the same on all SQL Engines.

Calling Native database functions

In Malloy can can call native database functions or write arbitrary SQL expressions and use them in dimensions.

Malloy	Description / Docs
`sinh!number(x)`	Call, the 'native `sinh()` function it returns a number
`sinh!(x)`	Call, the 'native `sinh()` function it returns a number because `x` is of type number
`json_extract(j, '$.x'):::string`	Call the native `json_extract` function and safe cast the results to a string

Arbitrary SQL Expressions in Dimensions.

You can use arbitrary SQL in expressions using the sql_XXX functions. The mechanism is helpful for coorelated, array mapping functions and more. sql_number writes an expression that returns a numeric scalar. The """ is another form of string quoting in Malloy. ${TABLE} is substituted with the current source's table alias in SQL.

Code	Use
`foo_count is sql_number("""(SELECT COUNT(*) FROM UNNEST${TABLE}.foo)""")`	Coorelaed subquery
`input_tables_string is sql_string("""ARRAY_JOIN(TRANSFORM(${TABLE}.inputs, (x) -> (x.table_name)),',')""")`	Presto Lambda

Working with Time

The Time Expressions reference contains substantially more detail and examples.

SQL	Malloy	Docs
`TIMESTAMP_TRUNC(created_at, WEEK),` `DATE_TRUNC(order_items.shipped_at, MONTH)`	`created_at.week` `shipped_at.month`	Truncation
`EXTRACT(DAYOFWEEK FROM shipped_at),` `EXTRACT(HOUR FROM created_at)`	`day_of_week(shipped_at)` `hour(created_at)`	Extraction
`DATE_DIFF(DATE(CURRENT_TIMESTAMP()),DATE(created_at), DAY),` `TIMESTAMP_DIFF(TIMESTAMP(shipped_at), created_at, HOUR)`	`days(created_at to now)` `hours(created_at to shipped_at)`	Date Diff / Intervals
`created_at >= TIMESTAMP('2003-01-01', 'UTC') AND created_at < TIMESTAMP('2004-01-01', 'UTC')` `created_at >= TIMESTAMP(DATETIME_SUB(DATETIME(CURRENT_TIMESTAMP()),INTERVAL 1 YEAR)) AND created_at < TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_SUB(DATETIME( CURRENT_TIMESTAMP()),INTERVAL 1 YEAR))),INTERVAL 1 YEAR))` `(EXTRACT(DAYOFWEEK FROM created_at)) NOT IN (1,7)`	`created_at ? @2003` `created_at ? now - 1 year for 1 year` `not(day_of_week(created_at) = 1 \| 7)`	Filter Expressions, Apply Operator

Window Functions

SQL	Malloy	Docs
`FIRST_VALUE(product_brand) OVER (PARTITION BY user_id ORDER BY created_at ASC)`	`calculate: first_value(product_brand) {partition_by: user_id order_by: created_at asc}`	Names of fields in `parition_by:` are from the output space