Filtering which data is used in a query is an incredibly important aspect of data analysis. Malloy makes it easy to target specific parts of a query to apply individual filters.
Filter Syntax
Regardless of the placement of a filter, the syntax looks the same.
{ where: filter_one, filter_two }
Each filter can be any expression of type boolean
, whether that's a boolean field is_commercial_flight
, a comparison distance > 1000
, or any of the other kinds of boolean expressions that Malloy supports. For examples see the table below, or for detailed information on the kinds of expressions Malloy supports, see the Expressions section.
The comma-separated list of filters are and
ed together, so where: x, y
is equivalent to where: x and y
.
Filter Placement
A filter can be applied to the source of a query, to just one stage of a query, or even to a particular field or expression (measure or nested view).
Filtering a Query's Source
When filtering a query's source, the filter applies to the whole query.
run: flights extend { where: distance > 1000 } -> { aggregate: flight_count }
[ { "flight_count": 90002 } ]
SELECT COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base WHERE base."distance">1000
Filtering in a Query Stage
A filter can also be applied to an individual query stage.
run: flights -> { group_by: carrier aggregate: flight_count } -> { where: carrier ? 'UA' | 'AA' select: carrier, flight_count }
[ { "carrier": "UA", "flight_count": 32757 }, { "carrier": "AA", "flight_count": 34577 } ]
WITH __stage0 AS ( SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ) SELECT base."carrier" as "carrier", base."flight_count" as "flight_count" FROM __stage0 as base WHERE base."carrier" IN ('UA','AA')
Filtering Aggregate Calculations
Any measure can be filtered by adding a where clause.
run: flights -> { aggregate: ca_flights is flight_count { where: origin.state = 'CA' } ny_flights is count() { where: origin.state = 'NY' } avg_km_from_ca is avg(distance / 0.621371) { where: origin.state = 'CA' } }
[ { "ca_flights": 40670, "ny_flights": 16641, "avg_km_from_ca": 1491.4398916693121 } ]
SELECT (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights", COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END) as "ny_flights", AVG(CASE WHEN origin_0."state"='CA' THEN base."distance"*1.0/0.621371::DOUBLE END) as "avg_km_from_ca" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_0 ON origin_0."code"=base."origin"
Filtering Measures
Even complex measures can be filtered. A common use case is to create a filtered measure and then create that as a percent of total.
// add a couple of measures to the `flights` source source: my_flights is flights extend { measure: delayed_flights is flight_count { where: dep_delay > 30 } measure: percent_delayed is delayed_flights / flight_count } run: my_flights -> { aggregate: ca_flights is flight_count { where: origin.state = 'CA' } ca_delayed_flights is delayed_flights { where: origin.state = 'CA' } ca_percent_delayed is percent_delayed { where: origin.state = 'CA' } ny_flights is flight_count { where: origin.state = 'NY' } ny_delayed_flights is delayed_flights { where: origin.state = 'NY' } ny_percent_delayed is percent_delayed { where: origin.state = 'NY' } }
[ { "ca_flights": 40670, "ca_delayed_flights": 3471, "ca_percent_delayed": 0.08534546348659947, "ny_flights": 16641, "ny_delayed_flights": 1368, "ny_percent_delayed": 0.0822065981611682 } ]
SELECT (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights", ((COUNT(CASE WHEN (origin_0."state"='CA') AND (base."dep_delay">30) THEN 1 END))) as "ca_delayed_flights", (((COUNT(CASE WHEN (origin_0."state"='CA') AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END))) as "ca_percent_delayed", (COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END)) as "ny_flights", ((COUNT(CASE WHEN (origin_0."state"='NY') AND (base."dep_delay">30) THEN 1 END))) as "ny_delayed_flights", (((COUNT(CASE WHEN (origin_0."state"='NY') AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END))) as "ny_percent_delayed" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_0 ON origin_0."code"=base."origin"
Filtering Nested Views
Nested views can also be filtered using refinement.
source: my_flights2 is flights extend { measure: delayed_flights is flight_count { where: dep_delay > 30 } view: delay_stats is { aggregate: flight_count delayed_flights percent_delayed is delayed_flights / flight_count } } run: my_flights2 -> { nest: ca_stats is delay_stats + { where: origin.state = 'CA' } ny_stats is delay_stats + { where: origin.state = 'NY' } }
[ { "ca_stats": { "flight_count": 40670, "delayed_flights": 3471, "percent_delayed": 0.08534546348659947 }, "ny_stats": { "flight_count": 16641, "delayed_flights": 1368, "percent_delayed": 0.0822065981611682 } } ]
WITH __stage0 AS ( SELECT group_set, CASE WHEN group_set=1 THEN COUNT(1) END as "flight_count__1", (CASE WHEN group_set=1 THEN COUNT(CASE WHEN base."dep_delay">30 THEN 1 END) END) as "delayed_flights__1", ((CASE WHEN group_set=1 THEN COUNT(CASE WHEN base."dep_delay">30 THEN 1 END) END))*1.0/(CASE WHEN group_set=1 THEN COUNT(1) END) as "percent_delayed__1", CASE WHEN group_set=2 THEN COUNT(1) END as "flight_count__2", (CASE WHEN group_set=2 THEN COUNT(CASE WHEN base."dep_delay">30 THEN 1 END) END) as "delayed_flights__2", ((CASE WHEN group_set=2 THEN COUNT(CASE WHEN base."dep_delay">30 THEN 1 END) END))*1.0/(CASE WHEN group_set=2 THEN COUNT(1) END) as "percent_delayed__2" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_0 ON origin_0."code"=base."origin" CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,2,1)) as group_set ) as group_set WHERE ((group_set NOT IN (1) OR (group_set IN (1) AND origin_0."state"='CA'))) AND ((group_set NOT IN (2) OR (group_set IN (2) AND origin_0."state"='NY'))) GROUP BY 1 ) SELECT COALESCE(FIRST({"flight_count": "flight_count__1" , "delayed_flights": "delayed_flights__1" , "percent_delayed": "percent_delayed__1" }) FILTER(WHERE group_set=1), {"flight_count": NULL, "delayed_flights": NULL, "percent_delayed": NULL}) as "ca_stats", COALESCE(FIRST({"flight_count": "flight_count__2" , "delayed_flights": "delayed_flights__2" , "percent_delayed": "percent_delayed__2" }) FILTER(WHERE group_set=2), {"flight_count": NULL, "delayed_flights": NULL, "percent_delayed": NULL}) as "ny_stats" FROM __stage0
Filtering null values
Malloy has the keyword null
to represent the null value.
To check to see if a value is null
, simply compare it to null
, there is no IS NULL
or IS NOT NULL
operator in Malloy.
dimension: name_not_present is name = null dimension: name_present is name != null
Common Patterns in Filters
This section describes some of the more common patterns used in filter expressions. For a more detailed description of the possible kinds of expressions, see the Expressions section.
Comparisons
All the usual comparison operators behave as expected, and are some of the most common kinds of filters.
Example | Meaning |
---|---|
size = 10 |
Does size equal 10 |
size > 10 |
Is size > 10 |
size != 10 |
size is not equal to 10 |
Combining Filters
Filters can be logically combined using and
, or
, and not
.
Operation | Example |
---|---|
Logical union | distance_miles > 1000 or duration_hours > 8 |
Logical conjunction | size >= 10 and size < 100 |
Logical negation | not is_commercial_flight |
Ranges
A range of numeric or time values can be constructed
with the to
operator, e.g. 10 to 100
. The ~
operator will check to
see if a value is within a range.
Example | Meaning |
---|---|
size ~ 10 to 20 |
size is in the range [10, 20) |
event_time ~ @2003 |
event_time occurs within the year 2003 |
String "Like" Matching
When comparing strings, the =
operator checks for pure equality, whereas the ~
and !~
operators, LIKE
and NOT LIKE
.
Example | Meaning |
---|---|
name ~ 'M%' |
The first letter of name is M |
name !~ '%Z%' |
name does not contain a Z |
In the right hand (pattern) string, the following syntax is used:
A percent sign
%
matches any number of charactersAn underscore
_
matches a single character
Regular Expressions
When the right hand side of a ~
or !~
operator is a regular expression,
Malloy checks whether the left hand side matches that regular expression. In Standard SQL, Malloy uses the REGEXP_COMPARE
function.
Example | Meaning |
---|---|
state ~ r'^(CA|NY)$' |
state is 'CA' or 'NY' |
name !~ r'Z$' |
name does not end with a Z |
Alternation
Checking equality against multiple possible values is extremely common, and can be achieved succinctly using the apply operator and alternation.
Example | Meaning |
---|---|
color ? 'red' | 'green' | 'blue' |
color is one of the primary colors |
email ~ '%.com' | '%.org' |
email has .com or .org domain |
Examples of Filter Expressions
Use this table as a quick reference for common types of filter expressions.
Example | Meaning |
---|---|
size = 10 |
size is equal to 10 |
size > 10 |
size is greater than 10 |
size != 10 |
size is not equal to 10 |
size: 10 to 100 |
size is greater than or equal to 10 and less than 100 |
size >= 10 and size < 100 |
size is greater than or equal to 10 and less than 100 |
size ? >= 10 & < 100 |
size is greater than or equal to 10 and less than 100 |
color ? 'red' | 'green' | 'blue' |
color is red, green or blue |
size ? 14 | 42 | > 100 |
size is 14, 42, or greater than 100 |
time ? @2003 to @2013 |
time is between the years 2003 and 2013 (excluding 2013) |
time > 1591129283 + 10 hours |
time greater than 10 hours after 1591129283 (epoch timestamp) |
time ? now.date |
time is today |
time ? now.month - 1 |
time is in the previous calendar month |
name ~ 'M%' |
first letter of name is a capital M |
name !~ '%z%' |
name does not contain a lower case Z |
state ~ r'^(CA|NY)$' |
state is 'CA' or 'NY' |
name !~ r'Z$' |
name does not end with a Z |