Malloy Documentation
search

Filtering which data is used in a query is an incredibly important aspect of data analysis. Malloy makes it easy to target specific parts of a query to apply individual filters.

Filter Syntax

Regardless of the placement of a filter, the syntax looks the same.

{ where: filter_one, filter_two }

Each filter can be any expression of type boolean, whether that's a boolean field is_commercial_flight, a comparison distance > 1000, or any of the other kinds of boolean expressions that Malloy supports. For examples see the table below, or for detailed information on the kinds of expressions Malloy supports, see the Expressions section.

The comma-separated list of filters are anded together, so where: x, y is equivalent to where: x and y.

Filter Placement

A filter can be applied to the source of a query, to just one stage of a query, or even to a particular field or expression (measure or nested view).

Filtering a Query's Source

When filtering a query's source, the filter applies to the whole query.

document
run: flights extend { where: distance > 1000 } -> { aggregate: flight_count } 
QUERY RESULTS
[
  {
    "flight_count": 90002
  }
]
SELECT 
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
WHERE base."distance">1000

Filtering in a Query Stage

A filter can also be applied to an individual query stage.

document
run: flights -> {
  group_by: carrier
  aggregate: flight_count
} -> {
  where: carrier ? 'UA' | 'AA'
  select: carrier, flight_count
}
QUERY RESULTS
[
  {
    "carrier": "UA",
    "flight_count": 32757
  },
  {
    "carrier": "AA",
    "flight_count": 34577
  }
]
WITH __stage0 AS (
  SELECT 
     base."carrier" as "carrier",
     COUNT(1) as "flight_count"
  FROM '../data/flights.parquet' as base
  GROUP BY 1
)
SELECT 
   base."carrier" as "carrier",
   base."flight_count" as "flight_count"
FROM __stage0 as base
WHERE base."carrier" IN ('UA','AA')

Filtering Aggregate Calculations

Any measure can be filtered by adding a where clause.

document
run: flights -> {
  aggregate:
    ca_flights is flight_count { where: origin.state = 'CA' }
    ny_flights is count() { where: origin.state = 'NY' }
    avg_km_from_ca is avg(distance / 0.621371) { where: origin.state = 'CA' }
}
QUERY RESULTS
[
  {
    "ca_flights": 40670,
    "ny_flights": 16641,
    "avg_km_from_ca": 1491.4398916693121
  }
]
SELECT 
   (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights",
   COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END) as "ny_flights",
   AVG(CASE WHEN origin_0."state"='CA' THEN base."distance"*1.0/0.621371::DOUBLE END) as "avg_km_from_ca"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_0
  ON origin_0."code"=base."origin"

Filtering Measures

Even complex measures can be filtered. A common use case is to create a filtered measure and then create that as a percent of total.

document
// add a couple of measures to the `flights` source
source: my_flights is flights extend {
  measure: delayed_flights is flight_count { where: dep_delay > 30 }
  measure: percent_delayed is delayed_flights / flight_count
}

run: my_flights -> {
  aggregate:
    ca_flights is flight_count { where: origin.state = 'CA' }
    ca_delayed_flights is delayed_flights { where: origin.state = 'CA' }
    ca_percent_delayed is percent_delayed { where: origin.state = 'CA' }
    ny_flights is flight_count { where: origin.state = 'NY' }
    ny_delayed_flights is delayed_flights { where: origin.state = 'NY' }
    ny_percent_delayed is percent_delayed { where: origin.state = 'NY' }
}
QUERY RESULTS
[
  {
    "ca_flights": 40670,
    "ca_delayed_flights": 3471,
    "ca_percent_delayed": 0.08534546348659947,
    "ny_flights": 16641,
    "ny_delayed_flights": 1368,
    "ny_percent_delayed": 0.0822065981611682
  }
]
SELECT 
   (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights",
   ((COUNT(CASE WHEN (origin_0."state"='CA')
  AND (base."dep_delay">30) THEN 1 END))) as "ca_delayed_flights",
   (((COUNT(CASE WHEN (origin_0."state"='CA')
  AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END))) as "ca_percent_delayed",
   (COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END)) as "ny_flights",
   ((COUNT(CASE WHEN (origin_0."state"='NY')
  AND (base."dep_delay">30) THEN 1 END))) as "ny_delayed_flights",
   (((COUNT(CASE WHEN (origin_0."state"='NY')
  AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END))) as "ny_percent_delayed"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_0
  ON origin_0."code"=base."origin"

Filtering Nested Views

Nested views can also be filtered using refinement.

document
source: my_flights2 is flights extend {
  measure: delayed_flights is flight_count { where: dep_delay > 30 }
  view: delay_stats is {
    aggregate:
      flight_count
      delayed_flights
      percent_delayed is delayed_flights / flight_count
  }
}

run: my_flights2 -> {
  nest:
    ca_stats is delay_stats + { where: origin.state = 'CA' }
    ny_stats is delay_stats + { where: origin.state = 'NY' }
}
QUERY RESULTS
[
  {
    "ca_stats": {
      "flight_count": 40670,
      "delayed_flights": 3471,
      "percent_delayed": 0.08534546348659947
    },
    "ny_stats": {
      "flight_count": 16641,
      "delayed_flights": 1368,
      "percent_delayed": 0.0822065981611682
    }
  }
]
WITH __stage0 AS (
  SELECT
    group_set,
    CASE WHEN group_set=1 THEN
      COUNT(1)
      END as "flight_count__1",
    (CASE WHEN group_set=1 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END) as "delayed_flights__1",
    ((CASE WHEN group_set=1 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END))*1.0/(CASE WHEN group_set=1 THEN
      COUNT(1)
      END) as "percent_delayed__1",
    CASE WHEN group_set=2 THEN
      COUNT(1)
      END as "flight_count__2",
    (CASE WHEN group_set=2 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END) as "delayed_flights__2",
    ((CASE WHEN group_set=2 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END))*1.0/(CASE WHEN group_set=2 THEN
      COUNT(1)
      END) as "percent_delayed__2"
  FROM '../data/flights.parquet' as base
   LEFT JOIN '../data/airports.parquet' AS origin_0
    ON origin_0."code"=base."origin"
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set  ) as group_set
  WHERE ((group_set NOT IN (1) OR (group_set IN (1) AND origin_0."state"='CA')))
  AND ((group_set NOT IN (2) OR (group_set IN (2) AND origin_0."state"='NY')))
  GROUP BY 1
)
SELECT
  COALESCE(FIRST({"flight_count": "flight_count__1" , "delayed_flights": "delayed_flights__1" , "percent_delayed": "percent_delayed__1" }) FILTER(WHERE group_set=1), {"flight_count": NULL, "delayed_flights": NULL, "percent_delayed": NULL}) as "ca_stats",
  COALESCE(FIRST({"flight_count": "flight_count__2" , "delayed_flights": "delayed_flights__2" , "percent_delayed": "percent_delayed__2" }) FILTER(WHERE group_set=2), {"flight_count": NULL, "delayed_flights": NULL, "percent_delayed": NULL}) as "ny_stats"
FROM __stage0

Filtering null values

Malloy has the keyword null to represent the null value.

To check to see if a value is null, simply compare it to null, there is no IS NULL or IS NOT NULL operator in Malloy.

dimension: name_not_present is name = null
dimension: name_present is name != null

Common Patterns in Filters

This section describes some of the more common patterns used in filter expressions. For a more detailed description of the possible kinds of expressions, see the Expressions section.

Comparisons

All the usual comparison operators behave as expected, and are some of the most common kinds of filters.

Example Meaning
size = 10 Does size equal 10
size > 10 Is size > 10
size != 10 size is not equal to 10

Combining Filters

Filters can be logically combined using and, or, and not.

Operation Example
Logical union distance_miles > 1000 or duration_hours > 8
Logical conjunction size >= 10 and size < 100
Logical negation not is_commercial_flight

Ranges

A range of numeric or time values can be constructed with the tooperator, e.g. 10 to 100. The ~ operator will check to see if a value is within a range.

Example Meaning
size ~ 10 to 20 size is in the range [10, 20)
event_time ~ @2003 event_time occurs within the year 2003

String "Like" Matching

When comparing strings, the = operator checks for pure equality, whereas the ~ and !~ operators, LIKE and NOT LIKE.

Example Meaning
name ~ 'M%' The first letter of name is M
name !~ '%Z%' name does not contain a Z

In the right hand (pattern) string, the following syntax is used:

  • A percent sign % matches any number of characters

  • An underscore _ matches a single character

Regular Expressions

When the right hand side of a ~ or !~ operator is a regular expression, Malloy checks whether the left hand side matches that regular expression. In Standard SQL, Malloy uses the REGEXP_COMPARE function.

Example Meaning
state ~ r'^(CA|NY)$' state is 'CA' or 'NY'
name !~ r'Z$' name does not end with a Z

Alternation

Checking equality against multiple possible values is extremely common, and can be achieved succinctly using the apply operator and alternation.

Example Meaning
color ? 'red' | 'green' | 'blue' color is one of the primary colors
email ~ '%.com' | '%.org' email has .com or .org domain

Examples of Filter Expressions

Use this table as a quick reference for common types of filter expressions.

Example Meaning
size = 10 size is equal to 10
size > 10 size is greater than 10
size != 10 size is not equal to 10
size: 10 to 100 size is greater than or equal to 10 and less than 100
size >= 10 and size < 100 size is greater than or equal to 10 and less than 100
size ? >= 10 & < 100 size is greater than or equal to 10 and less than 100
color ? 'red' | 'green' | 'blue' color is red, green or blue
size ? 14 | 42 | > 100 size is 14, 42, or greater than 100
time ? @2003 to @2013 time is between the years 2003 and 2013 (excluding 2013)
time > 1591129283 + 10 hours time greater than 10 hours after 1591129283 (epoch timestamp)
time ? now.date time is today
time ? now.month - 1 time is in the previous calendar month
name ~ 'M%' first letter of name is a capital M
name !~ '%z%' name does not contain a lower case Z
state ~ r'^(CA|NY)$' state is 'CA' or 'NY'
name !~ r'Z$' name does not end with a Z