Malloy Documentation
search

Filtering which data is used in a query is an incredibly important aspect of data analysis. Malloy makes it easy to target specific parts of a query to apply individual filters.

Filter Syntax

Regardless of the placement of a filter, the syntax looks the same.

{ where: filter_one, filter_two }

Each filter can be any expression of type boolean, whether that's a boolean field is_commercial_flight, a comparison distance > 1000, or any of the other kinds of boolean expressions that Malloy supports. For examples see the table below, or for detailed information on the kinds of expressions Malloy supports, see the Expressions section.

The comma-separated list of filters are anded together, so where: x, y is equivalent to where: x and y.

Filter Placement

A filter can be applied to the source of a query, to just one stage of a query, or even to a particular field or expression (measure or nested view).

Filtering a Query's Source

When filtering a query's source, the filter applies to the whole query.

document
run: flights extend { where: distance > 1000 } -> { aggregate: flight_count } 
QUERY RESULTS
[
  {
    "flight_count": 90002
  }
]
SELECT 
   COUNT(1) as "flight_count"
FROM '../data/flights.parquet' as base
WHERE base."distance">1000

Filtering in a Query Stage

A filter can also be applied to an individual query stage.

document
run: flights -> {
  group_by: carrier
  aggregate: flight_count
} -> {
  where: carrier ? 'UA' | 'AA'
  select: carrier, flight_count
}
QUERY RESULTS
[
  {
    "carrier": "UA",
    "flight_count": 32757
  },
  {
    "carrier": "AA",
    "flight_count": 34577
  }
]
WITH __stage0 AS (
  SELECT 
     base."carrier" as "carrier",
     COUNT(1) as "flight_count"
  FROM '../data/flights.parquet' as base
  GROUP BY 1
)
SELECT 
   base."carrier" as "carrier",
   base."flight_count" as "flight_count"
FROM __stage0 as base
WHERE base."carrier" IN ('UA','AA')

Filtering Aggregate Calculations

Any measure can be filtered by adding a where clause.

document
run: flights -> {
  aggregate:
    ca_flights is flight_count { where: origin.state = 'CA' }
    ny_flights is count() { where: origin.state = 'NY' }
    avg_km_from_ca is avg(distance / 0.621371) { where: origin.state = 'CA' }
}
QUERY RESULTS
[
  {
    "ca_flights": 40670,
    "ny_flights": 16641,
    "avg_km_from_ca": 1491.4398916693121
  }
]
SELECT 
   (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights",
   COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END) as "ny_flights",
   AVG(CASE WHEN origin_0."state"='CA' THEN base."distance"*1.0/0.621371::DOUBLE END) as "avg_km_from_ca"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_0
  ON origin_0."code"=base."origin"

Filtering Measures

Even complex measures can be filtered. A common use case is to create a filtered measure and then create that as a percent of total.

document
// add a couple of measures to the `flights` source
source: my_flights is flights extend {
  measure: delayed_flights is flight_count { where: dep_delay > 30 }
  measure: percent_delayed is delayed_flights / flight_count
}

run: my_flights -> {
  aggregate:
    ca_flights is flight_count { where: origin.state = 'CA' }
    ca_delayed_flights is delayed_flights { where: origin.state = 'CA' }
    ca_percent_delayed is percent_delayed { where: origin.state = 'CA' }
    ny_flights is flight_count { where: origin.state = 'NY' }
    ny_delayed_flights is delayed_flights { where: origin.state = 'NY' }
    ny_percent_delayed is percent_delayed { where: origin.state = 'NY' }
}
QUERY RESULTS
[
  {
    "ca_flights": 40670,
    "ca_delayed_flights": 3471,
    "ca_percent_delayed": 0.08534546348659947,
    "ny_flights": 16641,
    "ny_delayed_flights": 1368,
    "ny_percent_delayed": 0.0822065981611682
  }
]
SELECT 
   (COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END)) as "ca_flights",
   ((COUNT(CASE WHEN (origin_0."state"='CA')
  AND (base."dep_delay">30) THEN 1 END))) as "ca_delayed_flights",
   (((COUNT(CASE WHEN (origin_0."state"='CA')
  AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='CA' THEN 1 END))) as "ca_percent_delayed",
   (COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END)) as "ny_flights",
   ((COUNT(CASE WHEN (origin_0."state"='NY')
  AND (base."dep_delay">30) THEN 1 END))) as "ny_delayed_flights",
   (((COUNT(CASE WHEN (origin_0."state"='NY')
  AND (base."dep_delay">30) THEN 1 END)))*1.0/(COUNT(CASE WHEN origin_0."state"='NY' THEN 1 END))) as "ny_percent_delayed"
FROM '../data/flights.parquet' as base
 LEFT JOIN '../data/airports.parquet' AS origin_0
  ON origin_0."code"=base."origin"

Filtering Nested Views

Nested views can also be filtered using refinement.

document
source: my_flights2 is flights extend {
  measure: delayed_flights is flight_count { where: dep_delay > 30 }
  view: delay_stats is {
    aggregate:
      flight_count
      delayed_flights
      percent_delayed is delayed_flights / flight_count
  }
}

run: my_flights2 -> {
  nest:
    ca_stats is delay_stats + { where: origin.state = 'CA' }
    ny_stats is delay_stats + { where: origin.state = 'NY' }
}
QUERY RESULTS