Malloy Documentation
search

This document will to move you quickly into the Malloy language by showing how different queries translate SQL into Malloy.

Let's start with a basic query.

SQL queries can be directly translated into Malloy

-- connection: duckdb
SELECT 
  flights.origin as "Origin Code",
  orig.city,
  orig.full_name as "airport name",
  count(*) as flight_count,
  count(distinct tail_num) as aircraft_count,
  avg(distance) as avg_distance
  count(CASE WHEN dep_delay > 30 THEN 1 END) as delayed_flight_count
  count(CASE WHEN dep_delay > 30 THEN 1 END)/count(*) as percent_delayed
FROM '../data/flights.parquet' as flights
LEFT JOIN '../data/airports.parquet' as orig ON flights.origin = orig.code
WHERE destination = 'SFO'
GROUP BY 1,2,3
ORDER BY count(*) desc
LIMIT 10

Translation of SQL into Malloy

In converting from SQL to Malloy, we refactor the query into two parts. The resuable components go into a source: block, The transformation goes into a query block.

 source: flights is                     -- (1) refactor the reusable parts 
    duckdb.table('../data/flights.parquet') -- (2) get schema
  extend {                              -- (3) add definitions
    join_one: orig is                   -- (3) joins need 'one' or 'many'
      duckdb.table('../data/airports.parquet') 
        on origin = orig.code
    dimension: 
        is_delayed is dep_delay > 30    -- (4) reusable scalcar calculation 
    measure:                            -- (5) aggregate definition
      flight_count is count()           -- (6) Mostly the same as count(*) 
      delayed_flight_count is 
        flight_count {where: is_delayed}  -- (7) filtered aggregates
      # percent                         -- (8) tell the renderer to show the result as a percent
      percent_delayed is              
        delayed_flight_count / flight_count  -- ** reuse calculations
}
 
run: flights -> {                      -- (9) FROM is first
  where: destination = 'SFO'
  group_by:
    `Origin Code` is origin             --  (10) 'is' vs 'as' 
                                        --  (11) the root table has no alias 
    orig.city                   --  Output column name is 'city', like SQL.
    `Airport Name` is orig.full_name    -- (12) Backtick to quote names 
  aggregate:
    flight_count            
    aircraft_count is count(tail_num)   -- (13) exactly the same as count(distinct tail_num) 
    avg_distance is distance.avg()      -- (14) aggregate locality
    delayed_flight_count
    percent_delayed
  order_by: flight_count desc           -- (15) uses the output name instead of expression or column ordinal
  limit: 10    
}
Concept Description
(1) Reusable Components The source: object containts the reusable parts of the query. In a query, joins and aggregate expressions and scalar calculations are often resuable between queries.
(2) Get Schema Malloy is strongly typed. Malloy reads the schema definition from tables during the compilation process
(3) Joins need more info Joins need an additional piece of information. Malloy guarentees that aggregate calculations won't be affected by by joins. In order to make this promise, when joining you have to tell Malloy if that data will fan out. Use join_one: if there won't be an increase in the number of rows after the join. join_many: if the join could cause there to me more rows.
(4) dimension: Dimensions are reusable scalar expressions that are written in terms of other definitions in the source. In this case we're making boolean is_delayed that can be used like a column in the flights table
(5) measure: Measures are reusable aggregate calculations.
(6) count() rationalized count() works a lot like count(*) does in conceptually SQL. count() it will return the number of rows in the root table that match, even if the data fans out. alias.count() will return the number of rows in a joined alias, again regardless of the fan out pattern.
(7) Filtered Aggregates All Aggregregate expressions can be simply filtered
(8) Annotations Any named object in Malloy can be annotated. Annotations come back as part of the compilation process. In this case we are annotation to tell the Malloy renderer to show the result as a percentage.
(9) FROM is first The order of declarations in Malloy flows in an more logical in that the from starts first. Malloy often looks like from -> {where: group_by:, aggregate: order_by: limit:}
(10) 'is' vs 'AS' Malloy proritizes readiblity. Placing the name of the thing being declared makes the code easier to read. In most places where SQL would use 'AS' malloy uses 'is' and reverses the declaration order.
(11) Aliases are Heirachial Aliases in Malloy work as a heirachy as opposed to a flat namespacein SQL. In SQL every table has an alias. In Malloy, the root table has no alias (it does, but it is always called source). Joins in Malloy can be nested to create heirachies. In Malloy orders.users.address is how you might chase throuh join aliases.
(12) Human identifiers You can use arbitraray characters including spaces in identifier names. Available characters vary by SQL engine.
(13) count distinct count(tail_num) is a count(distinct) and exactly like SQL. SQL's count(foo) can be written as count() {where: foo != null}
(14) Relation aware agggregates In SQL aggregate are against the resulting joined matrix. In Malloy aggregate are computed against the table they are joined from. In Malloy you can peform a sum or average anyplace in the join heiracty.
(15) Order By uses output names In a SQL order by, you have to either specify the expression for a column or the oridinal number of the column. In Malloy you write the order_by: using the name of the column as it will appear in the output.

Basic Structure of Malloy

The code above is an example of a query. Queries are invoked with run: that takes a query block as a parameter.

Malloy has essentially two types of objects, query: and source:. A query: is named a transformation. A source: is an object or network of objects that can be transformed. For example, a table is a source: and the output of a query is a source:

Malloy's concept of query: pretty much maps to a SQL's concecpt of a SELECT statement..

A source: in Malloy acts and an API to a dataset. A source: declaration is much like a function library in imperitive languages.

Like function libraries, sources hide complexity.

Query Block

The query bock is used to transform data.

A query block is always preceeded by the -> operator.

In the query example above, the query block follows the run:.

Components of a query block include group_by:, aggregate:, select:, order_by:, limit:, calculate:, where:, having: and index:.

The output of a query block is a table (a source: actually). The column names in the result table follow SQLs example in that the the last part unaliased name is the column name.

The query blocks can named and later joined, like SQL's WITH name AS (SELECT ...)

  query: airports_by_state is  duckdb.table(airports.parquet) -> {
     group_by: state, fac_type
     aggregate: airport_count is count()
  }

Queries can be chained in a pipeline, for example.

  run: duckdb.table(airports.parquet) -> {
     group_by: state, fac_type
     aggregate: airport_count is count()
  } -> {
    aggregte: total_airports is airport_count.sum()
  }

source: Declaration Block

A source: declaration block adds reusable calculations to an existing source, like a table. These declaration operate like methods or properties in object oriented languages (with the self being the table).

A declaration block is always preceeded by an extend.

In the query above the source declaration follows the extend:.

Source blocks can contain measure: (aggregate cacluation), dimension: (scalar calculation), where: (a filter to always apply to a source), primary_key:, view: (a predefined query), join_xxx: (declares relationship to another source).

source: airports is duckdb.table('airports.parquet') extend {
  primay_key: code
  dimension: city_state is concat(state,',',city)
  measure:
    airport_count is count()
    average_elevation is elevation.avg()
}

Once a source is declared, queries are represented more simply. Calculations can be used in any number of different queries.

             Malloy             
          SQL          
Notes
run: airports -> {
aggregate: airport_count
}
SELECT
count( * ) as airport_count
FROM airports.parquet
The calculation airport_count comes from the declaration in the source.
run: airports -> {
group_by: city_state
aggregate: airport_count
}
SELECT
concat(state,',',city) as city_state
count( * ) as airport_count
FROM airports.parquet
GROUP BY 1
ORDER BY count(*) desc
Malloy has resonable defaults for ordering queries.

Joins are declarations.

In Malloy, joins usually happen in sources (though they can appear in queries also). When writing a malloy query, a join in a source: is only invoked in a query if some member of element of the join is referenced . In order to accuratly compute aggregate calculations in joined sources, Malloy needs to know if a join fans out the data (potentially produces more rows after the join is invoked).

Assuming the following source

source: flights is duckdb.table('flights.parquet') extend {
  join_one: orig is duckdb.table('airports.parquet') on origin_code = orig.code
  measure:
    flight_count is count()
}
             Malloy             
          SQL          
Notes
run: flights -> {
aggregate: flight_count
}
SELECT
count( * ) as flight_count
FROM airports.parquet
Notice the join is NOT invoked
run: flights -> {
group_by: orig.city
aggregate: fliight_count
}
SELECT
orig.city
count( * ) as flight_count
FROM flights.parquet as flights
LEFT JOIN airports.parquet as orig
ON origin_code=orig.code
GROUP BY 1
ORDER BY count(*) DESC
Note that the join is invoked because orig.city is referenced

Note: In SQL dialects that support nested/repeated structures and arrays, joins are autmatically declared.

Types

In Malloy, type are number, string, boolean, date, timestamp, record and array. Databases have many varieties of these types. Malloy attempts to keep the calculations within the native type system. You can cast to database native types in your Malloy expressions.

Malloy's boolean is two state when being returned in a query (it will never return null in a select: or a group_by:, only false.)

Cast and Safe Cast

  • In Malloy :: is the cast operator.

  • ::: is the safe cast operator if the SQL dialect supports the operation.

  • You can cast to any native SQL type by placing the type name in quotes.

    foo::"BIGINT"

Expressions

Expression and malloy are pretty similar to SQL. Where we've change the language, we've done it to make it more readable, modern or consistent and composable. If you write the SQL expression, generally, the compiler will tell you how to write the Malloy expression.

This is intended to serve as a quick reference, more complete documentation can be found here.

SQL Malloy Description / Docs
SUM(), AVG(), MAX(), MIN(), COUNT(), etc 
 sum(), avg(), max(), min(), count(), etc... 
Basic SQL aggregations are supported verbatim, but it’s worth learning about Malloy’s additional aggregate locality / symmetric aggregate handling.
 CASE   WHEN size_n < 3 THEN 'S'   WHEN size_n <5 THEN 'M' ELSE 'L' END 
 size_n ?   pick 'S' when < 3   pick 'M' when <5  else 'L' 
Pick is Malloy’s improvement of SQL’s CASE statement. This example also introduces the ? Apply operator, which "applies" a value to another value, condition, or computation. This is most often used with partial comparisons or alternations.
 COUNT(CASE WHEN status = 'Returned' THEN 1 END), AVG(CASE WHEN brand = 'Levi's' THEN price END) 
 count() { where: status = 'Returned' } avg_price { where: brand = 'Levi\'s' } 
Aggregates may be filtered using filter expressions. Doc
 CAST(distance AS string), distance::string 
distance::string
Type Cast. Also worth reviewing Types doc.

Malloy has its own standard function library.

Builtin functions work the same across SQL Dialects

In order for Malloy code to execute the same on many SQL engines, Malloy has a set of functions that works the same on all SQL Engines.

Calling Native database functions

In Malloy can can call native database functions or write arbitrary SQL expressions and use them in dimensions.

Malloy Description / Docs
 sinh!number(x) 
Call, the 'native sinh() function it returns a number
 sinh!(x) 
Call, the 'native sinh() function it returns a number because x is of type number
 json_extract(j, '$.x'):::string 
Call the native json_extract function and safe cast the results to a string

Arbitrary SQL Expressions in Dimensions.

You can use arbitrary SQL in expressions using the sql_XXX functions. The mechanism is helpful for coorelated, array mapping functions and more. sql_number writes an expression that returns a numeric scalar. The """ is another form of string quoting in Malloy. ${TABLE} is substituted with the current source's table alias in SQL.

Code Use
foo_count is sql_number("""(SELECT COUNT(*) FROM UNNEST${TABLE}.foo)""")
Coorelaed subquery
input_tables_string is sql_string("""ARRAY_JOIN(TRANSFORM(${TABLE}.inputs, (x) -> (x.table_name)),',')""")
Presto Lambda

Working with Time

The Time Expressions reference contains substantially more detail and examples.

SQL Malloy Docs
 TIMESTAMP_TRUNC(created_at, WEEK), DATE_TRUNC(order_items.shipped_at, MONTH) 
 created_at.week shipped_at.month 
Truncation
 EXTRACT(DAYOFWEEK FROM shipped_at), EXTRACT(HOUR FROM created_at)
 day_of_week(shipped_at) hour(created_at) 
Extraction
 DATE_DIFF(DATE(CURRENT_TIMESTAMP()),DATE(created_at), DAY), TIMESTAMP_DIFF(TIMESTAMP(shipped_at), created_at, HOUR)
 days(created_at to now) hours(created_at to shipped_at) 
Date Diff / Intervals
created_at >= TIMESTAMP('2003-01-01', 'UTC') AND created_at < TIMESTAMP('2004-01-01', 'UTC') 
created_at >= TIMESTAMP(DATETIME_SUB(DATETIME(CURRENT_TIMESTAMP()),INTERVAL 1 YEAR)) AND created_at < TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_SUB(DATETIME( CURRENT_TIMESTAMP()),INTERVAL 1 YEAR))),INTERVAL 1 YEAR)) 
(EXTRACT(DAYOFWEEK FROM created_at)) NOT IN (1,7) 
 created_at ? @2003 

 created_at ? now - 1 year for 1 year 

not(day_of_week(created_at) = 1 | 7) 
Filter Expressions, Apply Operator

Window Functions

SQL Malloy Docs
FIRST_VALUE(product_brand) OVER (PARTITION BY user_id ORDER BY created_at ASC) 
calculate: first_value(product_brand) {partition_by: user_id order_by: created_at asc}
Names of fields in parition_by: are from the output space