This document will to move you quickly into the Malloy language by showing how different queries translate SQL into Malloy.
Let's start with a basic query.
SQL queries can be directly translated into Malloy
-- connection: duckdb SELECT flights.origin as "Origin Code", orig.city, orig.full_name as "airport name", count(*) as flight_count, count(distinct tail_num) as aircraft_count, avg(distance) as avg_distance count(CASE WHEN dep_delay > 30 THEN 1 END) as delayed_flight_count count(CASE WHEN dep_delay > 30 THEN 1 END)/count(*) as percent_delayed FROM '../data/flights.parquet' as flights LEFT JOIN '../data/airports.parquet' as orig ON flights.origin = orig.code WHERE destination = 'SFO' GROUP BY 1,2,3 ORDER BY count(*) desc LIMIT 10
Translation of SQL into Malloy
In converting from SQL to Malloy, we refactor the query into two parts. The resuable components go into a source:
block, The transformation goes into a query block.
source: flights is -- (1) refactor the reusable parts duckdb.table('../data/flights.parquet') -- (2) get schema extend { -- (3) add definitions join_one: orig is -- (3) joins need 'one' or 'many' duckdb.table('../data/airports.parquet') on origin = orig.code dimension: is_delayed is dep_delay > 30 -- (4) reusable scalcar calculation measure: -- (5) aggregate definition flight_count is count() -- (6) Mostly the same as count(*) delayed_flight_count is flight_count {where: is_delayed} -- (7) filtered aggregates # percent -- (8) tell the renderer to show the result as a percent percent_delayed is delayed_flight_count / flight_count -- ** reuse calculations } run: flights -> { -- (9) FROM is first where: destination = 'SFO' group_by: `Origin Code` is origin -- (10) 'is' vs 'as' -- (11) the root table has no alias orig.city -- Output column name is 'city', like SQL. `Airport Name` is orig.full_name -- (12) Backtick to quote names aggregate: flight_count aircraft_count is count(tail_num) -- (13) exactly the same as count(distinct tail_num) avg_distance is distance.avg() -- (14) aggregate locality delayed_flight_count percent_delayed order_by: flight_count desc -- (15) uses the output name instead of expression or column ordinal limit: 10 }
Concept | Description | |
---|---|---|
(1) | Reusable Components | The source: object containts the reusable parts of the query. In a query, joins and aggregate expressions and scalar calculations are often resuable between queries. |
(2) | Get Schema | Malloy is strongly typed. Malloy reads the schema definition from tables during the compilation process |
(3) | Joins need more info | Joins need an additional piece of information. Malloy guarentees that aggregate calculations won't be affected by by joins. In order to make this promise, when joining you have to tell Malloy if that data will fan out. Use join_one: if there won't be an increase in the number of rows after the join. join_many: if the join could cause there to me more rows. |
(4) | dimension: |
Dimensions are reusable scalar expressions that are written in terms of other definitions in the source. In this case we're making boolean is_delayed that can be used like a column in the flights table |
(5) | measure: |
Measures are reusable aggregate calculations. |
(6) | count() rationalized | count() works a lot like count(*) does in conceptually SQL. count() it will return the number of rows in the root table that match, even if the data fans out. alias.count() will return the number of rows in a joined alias, again regardless of the fan out pattern. |
(7) | Filtered Aggregates | All Aggregregate expressions can be simply filtered |
(8) | Annotations | Any named object in Malloy can be annotated. Annotations come back as part of the compilation process. In this case we are annotation to tell the Malloy renderer to show the result as a percentage. |
(9) | FROM is first | The order of declarations in Malloy flows in an more logical in that the from starts first. Malloy often looks like from -> {where: group_by:, aggregate: order_by: limit:} |
(10) | 'is' vs 'AS' | Malloy proritizes readiblity. Placing the name of the thing being declared makes the code easier to read. In most places where SQL would use 'AS' malloy uses 'is' and reverses the declaration order. |
(11) | Aliases are Heirachial | Aliases in Malloy work as a heirachy as opposed to a flat namespacein SQL. In SQL every table has an alias. In Malloy, the root table has no alias (it does, but it is always called source ). Joins in Malloy can be nested to create heirachies. In Malloy orders.users.address is how you might chase throuh join aliases. |
(12) | Human identifiers | You can use arbitraray characters including spaces in identifier names. Available characters vary by SQL engine. |
(13) | count distinct | count(tail_num) is a count(distinct) and exactly like SQL. SQL's count(foo) can be written as count() {where: foo != null} |
(14) | Relation aware agggregates | In SQL aggregate are against the resulting joined matrix. In Malloy aggregate are computed against the table they are joined from. In Malloy you can peform a sum or average anyplace in the join heiracty. |
(15) | Order By uses output names | In a SQL order by, you have to either specify the expression for a column or the oridinal number of the column. In Malloy you write the order_by: using the name of the column as it will appear in the output. |
Basic Structure of Malloy
The code above is an example of a query. Queries are invoked with run:
that takes a query block as a parameter.
Malloy has essentially two types of objects, query:
and source:
. A query:
is named a transformation. A source:
is an object or network of objects that can be transformed. For example, a table is a source:
and the output of a query is a source:
Malloy's concept of query:
pretty much maps to a SQL's concecpt of a SELECT statement..
A source:
in Malloy acts and an API to a dataset. A source:
declaration is much like a function library in imperitive languages.
Like function libraries, sources hide complexity.
Query Block
The query bock is used to transform data.
A query block is always preceeded by the ->
operator.
In the query example above, the query block follows the run:
.
Components of a query block include group_by:
, aggregate:
, select:
, order_by:
, limit:
, calculate:
, where:
, having:
and index:
.
The output of a query block is a table (a source:
actually). The column names in the result table follow SQLs example in that the the last part unaliased name is the column name.
The query blocks can named and later joined, like SQL's WITH name AS (SELECT ...)
query: airports_by_state is duckdb.table(airports.parquet) -> { group_by: state, fac_type aggregate: airport_count is count() }
Queries can be chained in a pipeline, for example.
run: duckdb.table(airports.parquet) -> { group_by: state, fac_type aggregate: airport_count is count() } -> { aggregte: total_airports is airport_count.sum() }
source:
Declaration Block
A source:
declaration block adds reusable calculations to an existing source, like a table. These declaration operate like methods or properties in object oriented languages (with the self being the table).
A declaration block is always preceeded by an extend
.
In the query above the source declaration follows the extend:
.
Source blocks can contain measure:
(aggregate cacluation), dimension:
(scalar calculation), where:
(a filter to always apply to a source), primary_key:
, view:
(a predefined query), join_xxx:
(declares relationship to another source).
source: airports is duckdb.table('airports.parquet') extend { primay_key: code dimension: city_state is concat(state,',',city) measure: airport_count is count() average_elevation is elevation.avg() }
Once a source is declared, queries are represented more simply. Calculations can be used in any number of different queries.
Malloy |
SQL |
Notes |
---|---|---|
|
|
The calculation airport_count comes from the declaration in the source. |
|
|
Malloy has resonable defaults for ordering queries. |
Joins are declarations.
In Malloy, joins usually happen in sources (though they can appear in queries also). When writing a malloy query, a join in a source:
is only invoked in a query if some member of element of the join is referenced . In order to accuratly compute aggregate calculations in joined sources, Malloy needs to know if a join fans out the data (potentially produces more rows after the join is invoked).
Assuming the following source
source: flights is duckdb.table('flights.parquet') extend { join_one: orig is duckdb.table('airports.parquet') on origin_code = orig.code measure: flight_count is count() }
Malloy |
SQL |
Notes |
---|---|---|
|
|
Notice the join is NOT invoked |
|
|
Note that the join is invoked because orig.city is referenced |
Note: In SQL dialects that support nested/repeated structures and arrays, joins are autmatically declared.
Types
In Malloy, type are number
, string
, boolean
, date
, timestamp
, record
and array
. Databases have many varieties of these types. Malloy attempts to keep the calculations within the native type system. You can cast to database native types in your Malloy expressions.
Malloy's boolean
is two state when being returned in a query (it will never return null in a select:
or a group_by:
, only false.)
Cast and Safe Cast
In Malloy
::
is the cast operator.:::
is the safe cast operator if the SQL dialect supports the operation.You can cast to any native SQL type by placing the type name in quotes.
foo::"BIGINT"
Expressions
Expression and malloy are pretty similar to SQL. Where we've change the language, we've done it to make it more readable, modern or consistent and composable. If you write the SQL expression, generally, the compiler will tell you how to write the Malloy expression.
This is intended to serve as a quick reference, more complete documentation can be found here.
SQL | Malloy | Description / Docs |
---|---|---|
|
|
Basic SQL aggregations are supported verbatim, but it’s worth learning about Malloy’s additional aggregate locality / symmetric aggregate handling. |
|
|
Pick is Malloy’s improvement of SQL’s CASE statement. This example also introduces the ? Apply operator, which "applies" a value to another value, condition, or computation. This is most often used with partial comparisons or alternations. |
|
|
Aggregates may be filtered using filter expressions. Doc |
|
|
Type Cast. Also worth reviewing Types doc. |
Malloy has its own standard function library.
Builtin functions work the same across SQL Dialects
In order for Malloy code to execute the same on many SQL engines, Malloy has a set of functions that works the same on all SQL Engines.
Calling Native database functions
In Malloy can can call native database functions or write arbitrary SQL expressions and use them in dimensions.
Malloy | Description / Docs |
---|---|
|
Call, the 'native sinh() function it returns a number |
|
Call, the 'native sinh() function it returns a number because x is of type number |
|
Call the native json_extract function and safe cast the results to a string |
Arbitrary SQL Expressions in Dimensions.
You can use arbitrary SQL in expressions using the sql_XXX
functions. The mechanism is helpful for coorelated, array mapping functions and more. sql_number
writes an expression that returns a numeric scalar. The """
is another form of string quoting in Malloy. ${TABLE}
is substituted with the current source's table alias in SQL.
Code | Use |
---|---|
|
Coorelaed subquery |
|
Presto Lambda |
Working with Time
The Time Expressions reference contains substantially more detail and examples.
SQL | Malloy | Docs |
---|---|---|
|
|
Truncation |
|
|
Extraction |
|
|
Date Diff / Intervals |
|
|
Filter Expressions, Apply Operator |
Window Functions
SQL | Malloy | Docs |
---|---|---|
|
|
Names of fields in parition_by: are from the output space |