Filters

VSCode

Filtering which data is used in a query is an incredibly important aspect of data analysis. Malloy makes it easy to target specific parts of a query to apply individual filters.

Filter Syntax

Regardless of the placement of a filter, the syntax looks the same.

{ where: filter_one, filter_two }

Each filter can be any expression of type boolean, whether that's a boolean field is_commercial_flight, a comparison distance > 1000, or any of the other kinds of boolean expressions that Malloy supports. For examples see the table below, or for detailed information on the kinds of expressions Malloy supports, see the Expressions section.

The comma-separated list of filters are anded together, so where: x, y is equivalent to where: x and y.

Filter Placement

A filter can be applied to the source of a query, to just one stage of a query, or even to a particular field or expression (measure or nested view).

Filtering a Query's Source

When filtering a query's source, the filter applies to the whole query.

run: flights extend { where: distance > 1000 } -> { aggregate: flight_count }

QUERY RESULTS

[
  {
    "flight_count": 90002
  }
]

SELECT 
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
WHERE base."distance">1000

Filtering in a Query Stage

A filter can also be applied to an individual query stage.

run: flights -> {
  group_by: carrier
  aggregate: flight_count
} -> {
  where: carrier ? 'UA' | 'AA'
  select: carrier, flight_count
}

QUERY RESULTS

[
  {
    "carrier": "AA",
    "flight_count": 34577
  },
  {
    "carrier": "UA",
    "flight_count": 32757
  }
]

WITH __stage0 AS (
  SELECT 
     base."carrier" as "carrier",
     COUNT(1) as "flight_count"
  FROM '../data/flights.parquet' as base
  GROUP BY 1
)
SELECT 
   base."carrier" as "carrier",
   base."flight_count" as "flight_count"
FROM __stage0 as base
WHERE base."carrier" IN ('UA','AA')

Filtering Aggregate Calculations

Any measure can be filtered by adding a where clause.

run: flights -> {
  aggregate:
    ca_flights is flight_count { where: origin.state = 'CA' }
    ny_flights is count() { where: origin.state = 'NY' }
    avg_km_from_ca is avg(distance / 0.621371) { where: origin.state = 'CA' }
}

QUERY RESULTS

[
  {
    "ca_flights": 40670,
    "ny_flights": 16641,
    "avg_km_from_ca": 1491.4398916693121
  }
]

SELECT 
   (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights",
   COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END) as "ny_flights",
   AVG(CASE WHEN origin_0."state"='CA' THEN base."distance"*1.0/0.621371::DOUBLE END) as "avg_km_from_ca"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_0
  ON origin_0."code"=base."origin"

Filtering Measures

Even complex measures can be filtered. A common use case is to create a filtered measure and then create that as a percent of total.

// add a couple of measures to the `flights` source
source: my_flights is flights extend {
  measure: delayed_flights is flight_count { where: dep_delay > 30 }
  measure: percent_delayed is delayed_flights / flight_count
}

run: my_flights -> {
  aggregate:
    ca_flights is flight_count { where: origin.state = 'CA' }
    ca_delayed_flights is delayed_flights { where: origin.state = 'CA' }
    ca_percent_delayed is percent_delayed { where: origin.state = 'CA' }
    ny_flights is flight_count { where: origin.state = 'NY' }
    ny_delayed_flights is delayed_flights { where: origin.state = 'NY' }
    ny_percent_delayed is percent_delayed { where: origin.state = 'NY' }
}

QUERY RESULTS

[
  {
    "ca_flights": 40670,
    "ca_delayed_flights": 3471,
    "ca_percent_delayed": 0.08534546348659947,
    "ny_flights": 16641,
    "ny_delayed_flights": 1368,
    "ny_percent_delayed": 0.0822065981611682
  }
]

SELECT 
   (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights",
   ((COUNT(CASE WHEN (origin_0."state"='CA')
  AND (base."dep_delay">30) THEN 1 END))) as "ca_delayed_flights",
   (((COUNT(CASE WHEN (origin_0."state"='CA')
  AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END))) as "ca_percent_delayed",
   (COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END)) as "ny_flights",
   ((COUNT(CASE WHEN (origin_0."state"='NY')
  AND (base."dep_delay">30) THEN 1 END))) as "ny_delayed_flights",
   (((COUNT(CASE WHEN (origin_0."state"='NY')
  AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END))) as "ny_percent_delayed"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_0
  ON origin_0."code"=base."origin"

Filtering Nested Views

Nested views can also be filtered using refinement.

source: my_flights2 is flights extend {
  measure: delayed_flights is flight_count { where: dep_delay > 30 }
  view: delay_stats is {
    aggregate:
      flight_count
      delayed_flights
      percent_delayed is delayed_flights / flight_count
  }
}

run: my_flights2 -> {
  nest:
    ca_stats is delay_stats + { where: origin.state = 'CA' }
    ny_stats is delay_stats + { where: origin.state = 'NY' }
}

QUERY RESULTS

[
  {
    "ca_stats": {
      "flight_count": 40670,
      "delayed_flights": 3471,
      "percent_delayed": 0.08534546348659947
    },
    "ny_stats": {
      "flight_count": 16641,
      "delayed_flights": 1368,
      "percent_delayed": 0.0822065981611682
    }
  }
]

WITH __stage0 AS (
  SELECT
    group_set,
    CASE WHEN group_set=1 THEN
      COUNT(1)
      END as "flight_count__1",
    (CASE WHEN group_set=1 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END) as "delayed_flights__1",
    ((CASE WHEN group_set=1 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END))*1.0/(CASE WHEN group_set=1 THEN
      COUNT(1)
      END) as "percent_delayed__1",
    CASE WHEN group_set=2 THEN
      COUNT(1)
      END as "flight_count__2",
    (CASE WHEN group_set=2 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END) as "delayed_flights__2",
    ((CASE WHEN group_set=2 THEN
      COUNT(CASE WHEN base."dep_delay">30 THEN 1 END)
      END))*1.0/(CASE WHEN group_set=2 THEN
      COUNT(1)
      END) as "percent_delayed__2"
  FROM '../data/flights.parquet' as base
   LEFT JOIN '../data/airports.parquet' AS origin_0
    ON origin_0."code"=base."origin"
  CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set  ) as group_set
  WHERE ((group_set NOT IN (1) OR (group_set IN (1) AND origin_0."state"='CA')))
  AND ((group_set NOT IN (2) OR (group_set IN (2) AND origin_0."state"='NY')))
  GROUP BY 1
)
SELECT
  COALESCE(FIRST({"flight_count": "flight_count__1" , "delayed_flights": "delayed_flights__1" , "percent_delayed": "percent_delayed__1" }) FILTER(WHERE group_set=1), {"flight_count": NULL, "delayed_flights": NULL, "percent_delayed": NULL}) as "ca_stats",
  COALESCE(FIRST({"flight_count": "flight_count__2" , "delayed_flights": "delayed_flights__2" , "percent_delayed": "percent_delayed__2" }) FILTER(WHERE group_set=2), {"flight_count": NULL, "delayed_flights": NULL, "percent_delayed": NULL}) as "ny_stats"
FROM __stage0

Filtering null values

To check to see if a value is null, use IS NULL or IS NOT NULL.

dimension: name_not_present is name is null
dimension: name_present is name is not null

Common Patterns in Filters

This section describes some of the more common patterns used in filter expressions. For a more detailed description of the possible kinds of expressions, see the Expressions section.

Comparisons

All the usual comparison operators behave as expected, and are some of the most common kinds of filters.

Example	Meaning
`size = 10`	Does `size` equal 10
`size > 10`	Is `size` > 10
`size != 10`	`size` is not equal to 10

Combining Filters

Filters can be logically combined using and, or, and not.

Operation	Example
Logical union	`distance_miles > 1000 or duration_hours > 8`
Logical conjunction	`size >= 10 and size < 100`
Logical negation	`not is_commercial_flight`

Ranges

A range of numeric or time values can be constructed with the tooperator, e.g. 10 to 100. The ~ operator will check to see if a value is within a range.

Example	Meaning
`size ~ 10 to 20`	`size` is in the range [10, 20)
`event_time ~ @2003`	`event_time` occurs within the year 2003

String "Like" Matching

When comparing strings, the = operator checks for pure equality, whereas the ~ and !~ operators, LIKE and NOT LIKE.

Example	Meaning
`name ~ 'M%'`	The first letter of `name` is M
`name !~ '%Z%'`	`name` does not contain a Z

In the right hand (pattern) string, the following syntax is used:

A percent sign % matches any number of characters
An underscore _ matches a single character

Regular Expressions

When the right hand side of a ~ or !~ operator is a regular expression, Malloy checks whether the left hand side matches that regular expression. In Standard SQL, Malloy uses the REGEXP_COMPARE function.

Example	Meaning
`state ~ r'^(CA\|NY)$'`	`state` is `'CA'` or `'NY'`
`name !~ r'Z$'`	`name` does not end with a Z

Alternation

Checking equality against multiple possible values is extremely common, and can be achieved succinctly using the apply operator and alternation.

Example	Meaning
`color ? 'red' \| 'green' \| 'blue'`	`color` is one of the primary colors
`email ~ '%.com' \| '%.org'`	`email` has `.com` or `.org` domain

Examples of Filter Expressions

Use this table as a quick reference for common types of filter expressions.

Example	Meaning
`size = 10`	`size` is equal to 10
`size > 10`	`size` is greater than 10
`size != 10`	`size` is not equal to 10
`size: 10 to 100`	`size` is greater than or equal to 10 and less than 100
`size >= 10 and size < 100`	`size` is greater than or equal to 10 and less than 100
`size ? >= 10 & < 100`	`size` is greater than or equal to 10 and less than 100
`color ? 'red' \| 'green' \| 'blue'`	`color` is red, green or blue
`size ? 14 \| 42 \| > 100`	`size` is 14, 42, or greater than 100
`time ? @2003 to @2013`	`time` is between the years 2003 and 2013 (excluding 2013)
`time > 1591129283 + 10 hours`	`time` greater than 10 hours after 1591129283 (epoch timestamp)
`time ? now.date`	`time` is today
`time ? now.month - 1`	`time` is in the previous calendar month
`name ~ 'M%'`	first letter of `name` is a capital M
`name !~ '%z%'`	`name` does not contain a lower case Z
`state ~ r'^(CA\|NY)$'`	`state` is 'CA' or 'NY'
`name !~ r'Z$'`	`name` does not end with a Z