Filtering which data is used in a query is an incredibly important aspect of data analysis. Malloy makes it easy to target specific parts of a query to apply individual filters.
Filter Syntax
Regardless of the placement of a filter, the syntax looks the same.
{ where: filter_one, filter_two }
Each filter can be any expression of type boolean
, whether that's a boolean field is_commercial_flight
, a comparison distance > 1000
, or any of the other kinds of boolean expressions that Malloy supports. For examples see the table below, or for detailed information on the kinds of expressions Malloy supports, see the Expressions section.
The comma-separated list of filters are and
ed together, so where: x, y
is equivalent to where: x and y
.
Filter Placement
A filter can be applied to the source of a query, to just one stage of a query, or even to a particular field or expression (measure or nested view).
Filtering a Query's Source
When filtering a query's source, the filter applies to the whole query.
run: flights extend { where: distance > 1000 } -> { aggregate: flight_count }
[ { "flight_count": 90002 } ]
SELECT COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base WHERE base."distance">1000
Filtering in a Query Stage
A filter can also be applied to an individual query stage.
run: flights -> { group_by: carrier aggregate: flight_count } -> { where: carrier ? 'UA' | 'AA' select: carrier, flight_count }
[ { "carrier": "UA", "flight_count": 32757 }, { "carrier": "AA", "flight_count": 34577 } ]
WITH __stage0 AS ( SELECT base."carrier" as "carrier", COUNT(1) as "flight_count" FROM '../data/flights.parquet' as base GROUP BY 1 ) SELECT base."carrier" as "carrier", base."flight_count" as "flight_count" FROM __stage0 as base WHERE base."carrier" IN ('UA','AA')
Filtering Aggregate Calculations
Any measure can be filtered by adding a where clause.
run: flights -> { aggregate: ca_flights is flight_count { where: origin.state = 'CA' } ny_flights is count() { where: origin.state = 'NY' } avg_km_from_ca is avg(distance / 0.621371) { where: origin.state = 'CA' } }
[ { "ca_flights": 40670, "ny_flights": 16641, "avg_km_from_ca": 1491.4398916693121 } ]
SELECT (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights", COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END) as "ny_flights", AVG(CASE WHEN origin_0."state"='CA' THEN base."distance"*1.0/0.621371::DOUBLE END) as "avg_km_from_ca" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_0 ON origin_0."code"=base."origin"
Filtering Measures
Even complex measures can be filtered. A common use case is to create a filtered measure and then create that as a percent of total.
// add a couple of measures to the `flights` source source: my_flights is flights extend { measure: delayed_flights is flight_count { where: dep_delay > 30 } measure: percent_delayed is delayed_flights / flight_count } run: my_flights -> { aggregate: ca_flights is flight_count { where: origin.state = 'CA' } ca_delayed_flights is delayed_flights { where: origin.state = 'CA' } ca_percent_delayed is percent_delayed { where: origin.state = 'CA' } ny_flights is flight_count { where: origin.state = 'NY' } ny_delayed_flights is delayed_flights { where: origin.state = 'NY' } ny_percent_delayed is percent_delayed { where: origin.state = 'NY' } }
[ { "ca_flights": 40670, "ca_delayed_flights": 3471, "ca_percent_delayed": 0.08534546348659947, "ny_flights": 16641, "ny_delayed_flights": 1368, "ny_percent_delayed": 0.0822065981611682 } ]
SELECT (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights", ((COUNT(CASE WHEN (origin_0."state"='CA') AND (base."dep_delay">30) THEN 1 END))) as "ca_delayed_flights", (((COUNT(CASE WHEN (origin_0."state"='CA') AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END))) as "ca_percent_delayed", (COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END)) as "ny_flights", ((COUNT(CASE WHEN (origin_0."state"='NY') AND (base."dep_delay">30) THEN 1 END))) as "ny_delayed_flights", (((COUNT(CASE WHEN (origin_0."state"='NY') AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END))) as "ny_percent_delayed" FROM '../data/flights.parquet' as base LEFT JOIN '../data/airports.parquet' AS origin_0 ON origin_0."code"=base."origin"
Filtering Nested Views
Nested views can also be filtered using refinement.
source: my_flights2 is flights extend { measure: delayed_flights is flight_count { where: dep_delay > 30 } view: delay_stats is { aggregate: flight_count delayed_flights percent_delayed is delayed_flights / flight_count } } run: my_flights2 -> { nest: ca_stats is delay_stats + { where: origin.state = 'CA' } ny_stats is delay_stats + { where: origin.state = 'NY' } }