Queries

VSCode

The basic syntax for a query in Malloy consists of a source and a view consisting of one or more stages separated by ->. The shape of the data defined in the original source is transformed by each stage.

run: flights -> { 
  group_by: carrier 
  aggregate: flight_count
}

QUERY RESULTS

[
  {
    "carrier": "WN",
    "flight_count": 88751
  },
  {
    "carrier": "US",
    "flight_count": 37683
  },
  {
    "carrier": "AA",
    "flight_count": 34577
  },
  {
    "carrier": "NW",
    "flight_count": 33580
  },
  {
    "carrier": "UA",
    "flight_count": 32757
  }
]

SELECT 
   base."carrier" as "carrier",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Sources

Every query begins with a source, which can be thought of as a table with an associated collection of calculations and join relationships. The most common kind of source is a table source, e.g. duckdb.table('flights.csv'); see the Sources section for more information on creating and extending sources.

Generally, ad-hoc queries will use one of these kinds of sources directly:

run: duckdb.table('../data/flights.parquet') -> { 
  aggregate: total_flight_count is count() 
}

QUERY RESULTS

[
  {
    "total_flight_count": 344827
  }
]

SELECT 
   COUNT(1) as "total_flight_count"
FROM '../data/flights.parquet' as base

However, as analysis becomes more involved, it is often useful to define reusable computations in a named source, then use that named source as the basis for queries.

source: flights2 is duckdb.table('../data/flights.parquet') extend { 
  measure: 
    flight_count is count() 
    total_distance is sum(distance)
}

run: flights2 -> {
  group_by: carrier
  aggregate: flight_count, total_distance
}

QUERY RESULTS

[
  {
    "carrier": "WN",
    "flight_count": 88751,
    "total_distance": 54619152
  },
  {
    "carrier": "US",
    "flight_count": 37683,
    "total_distance": 23721642
  },
  {
    "carrier": "AA",
    "flight_count": 34577,
    "total_distance": 37684885
  },
  {
    "carrier": "NW",
    "flight_count": 33580,
    "total_distance": 33376503
  },
  {
    "carrier": "UA",
    "flight_count": 32757,
    "total_distance": 38882934
  }
]

SELECT 
   base."carrier" as "carrier",
   COUNT(1) as "flight_count",
   COALESCE(SUM(base."distance"),0) as "total_distance"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

It is also sometimes useful to define a query and then later use that query as the source for another query:

query: flights_by_carrier is duckdb.table('../data/flights.parquet') -> {
  group_by: carrier
  aggregate: flight_count is count()
}

run: flights_by_carrier -> { 
  select: carrier
  limit: 2 
}

QUERY RESULTS

[
  {
    "carrier": "RU"
  },
  {
    "carrier": "OH"
  }
]

WITH __stage0 AS (
  SELECT 
     base."carrier" as "carrier",
     COUNT(1) as "flight_count"
  FROM '../data/flights.parquet' as base
  GROUP BY 1
)
SELECT 
   base."carrier" as "carrier"
FROM __stage0 as base
LIMIT 2

Views

A view defines a transformation (or series of transformations) to perform on a source. In a query, everything after the first -> is a view.

//   source  -> view
run: flights -> { select: * }

Views can be defined in a source extension to enable reusability and nesting. You can always copy and paste the view of a query into the definition of the source of that query (or an extension of that source).

source: flights3 is flights extend {
  view: by_carrier2 is { 
    group_by: carrier 
    aggregate: flight_count
  }
}

run: flights3 -> by_carrier2

QUERY RESULTS

[
  {
    "carrier": "WN",
    "flight_count": 88751
  },
  {
    "carrier": "US",
    "flight_count": 37683
  },
  {
    "carrier": "AA",
    "flight_count": 34577
  },
  {
    "carrier": "NW",
    "flight_count": 33580
  },
  {
    "carrier": "UA",
    "flight_count": 32757
  }
]

SELECT 
   base."carrier" as "carrier",
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Queries that reference a view by name can add additional stages after those defined in the view.

run: flights3 -> by_carrier2 -> {
  select: carrier
  limit: 1
}

QUERY RESULTS

[
  {
    "carrier": "DL"
  }
]

WITH __stage0 AS (
  SELECT 
     base."carrier" as "carrier",
     COUNT(1) as "flight_count"
  FROM '../data/flights.parquet' as base
  GROUP BY 1
)
SELECT 
   base."carrier" as "carrier"
FROM __stage0 as base
LIMIT 1

Refinements, Query partials and Shorthand

Query blocks write be written in parts, those parts can be either views in the source and combined with the refinement operator +

The query below

run: flights -> {
  group_by: carrier
  aggregate: flight_count
  limit: 10
}

can be written as below. Each of the parts of a query can be separated into a 'partial query'.

run: flights -> 
  {group_by: carrier}
  + {aggregate: flight_count}
  + {limit: 10}

The query can also be written as:

run: flights -> 
  carrier
  + flight_count
  + {limit: 10}

In the query above, the refinement operator + combines the parameters of a query so the query can be built from parts. Measure, dimension and views declared in the source can also be used. Referencing dimension carrier expands to a partial of {group_by: x}. Referencing measure flight_count expands to a partial query of {aggregate: flight_count}

For more information about defining, reusing, nesting, and refining views, see the Views section.