Malloy has a special query operator that builds _Dimensional Search Indexes for sources. A Dimensional Search Index is a table with 4 columns. Dimensional indexes are useful for a variety of things including filtering suggestions and LLMs.
When filtering data, you might know a term, but not necessarily which column in the one of the join data contains it. Indexing the data on field names and high cardinality fields let's you qucikly find the term and the associated value.
Indexing could be used by LLMs to find the interesting column/term mapping in the data set.
fieldName - The path to the column in the source
fieldValue - The dimensional value for the field (or range if fieldType is not a string)
fieldType - Type type of the column.
weight - a weighting to use as to the importance the distinct dimensional value. Defaults to cardinality of the field.
Simple Example
We're going to take the airports table and index it. The results are an un ordered list of distinct fieldName/fieldValue pairs appear in the table. The weight, in this case is the number of rows that partciular occurs on.
run: duckdb.table('../data/airports.parquet') -> { index: * }
[ { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": null, "weight": 12040 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "02/2000", "weight": 54 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "11/1986", "weight": 119 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "11/1992", "weight": 92 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "10/1999", "weight": 41 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "12/1992", "weight": 86 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "09/1994", "weight": 20 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "08/1990", "weight": 36 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "01/1990", "weight": 15 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "08/1981", "weight": 40 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "12/1986", "weight": 56 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "11/1987", "weight": 25 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "04/1996", "weight": 3 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "01/1994", "weight": 36 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "04/1999", "weight": 45 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "03/1992", "weight": 17 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "07/1995", "weight": 15 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "04/1982", "weight": 92 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "05/1991", "weight": 31 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "03/1998", "weight": 62 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "01/1987", "weight": 31 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "05/1993", "weight": 8 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "02/1987", "weight": 129 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "10/1987", "weight": 62 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "09/1999", "weight": 31 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "07/1982", "weight": 80 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "01/1998", "weight": 28 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "04/1990", "weight": 16 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "12/1983", "weight": 77 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "12/1997", "weight": 31 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "09/1990", "weight": 11 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "07/1983", "weight": 20 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "09/1991", "weight": 14 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "09/1995", "weight": 15 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "05/1988", "weight": 20 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "06/2000", "weight": 59 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "06/1982", "weight": 13 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "11/1993", "weight": 32 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "11/1994", "weight": 4 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "10/1989", "weight": 25 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "10/1992", "weight": 35 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": "02/1991", "weight": 6 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "BETHEL", "weight": 34 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "W ALEUTIAN ISLS", "weight": 3 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "ATLANTA", "weight": 780 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "JACKSONVILLE", "weight": 467 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "KANSAS CITY", "weight": 713 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "HOUSTON", "weight": 765 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "LAS VEGAS", "weight": 136 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "LOS ANGELES", "weight": 498 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "KLAMATH FALLS", "weight": 260 }, { "fieldName": "c_ldg_rts", "fieldPath": "c_ldg_rts", "fieldType": "string", "fieldValue": "Y", "weight": 283 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "E", "weight": 2127 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "SE", "weight": 2281 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "NE", "weight": 2393 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "W", "weight": 2065 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "NNW", "weight": 11 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "NNE", "weight": 13 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "LU 06/1986", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "CS 05/1973", "weight": 84 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "LS 05/1973", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "AS 06/1987", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "BS 03/1978", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "CS 03/1973", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "LU 01/1980", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "LU 03/1999", "weight": 1 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "LU 06/1975", "weight": 2 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": "DS 05/1973", "weight": 26 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "TAORA IS MALOELAP ATOLL", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "NAMORIK ATOLL", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "JABOR JALUIT ATOLL", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "ROI-NAMUR", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "JOHNSTON ISLAND", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "KWAJALEIN", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "PALMER", "weight": 17 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "DELTA JUNCTION", "weight": 6 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "HOMER", "weight": 10 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "TRAPPER CREEK/TALKEETNA", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SHAKTOOLIK", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "CAPE BEAUFORT", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "KAKE", "weight": 2 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "LIME VILLAGE", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SUTTON", "weight": 5 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "DIOMEDE", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "HEALY", "weight": 3 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SHISHMAREF", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "CHALKYITSIK", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "NUNAPITCHUK", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "BIG MOUNTAIN", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "OUZINKIE", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "ST MICHAEL", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "BIG LAKE", "weight": 5 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "MC GRATH", "weight": 3 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "ATKA", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "STEBBINS", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "CHEFORNAK", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "KODIAK", "weight": 5 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "BOUNDARY", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "BREVIG MISSION", "weight": 1 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "NIKOLSKI", "weight": 1 } ]
WITH __stage0 AS ( SELECT group_set, CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldName", CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldPath", CASE group_set WHEN 0 THEN 'string' WHEN 1 THEN 'string' WHEN 2 THEN 'string' WHEN 3 THEN 'string' WHEN 4 THEN 'number' WHEN 5 THEN 'string' WHEN 6 THEN 'string' WHEN 7 THEN 'string' WHEN 8 THEN 'string' WHEN 9 THEN 'string' WHEN 10 THEN 'string' WHEN 11 THEN 'number' WHEN 12 THEN 'string' WHEN 13 THEN 'string' WHEN 14 THEN 'string' WHEN 15 THEN 'string' WHEN 16 THEN 'string' WHEN 17 THEN 'string' WHEN 18 THEN 'number' WHEN 19 THEN 'string' WHEN 20 THEN 'number' WHEN 21 THEN 'number' WHEN 22 THEN 'string' WHEN 23 THEN 'string' WHEN 24 THEN 'string' WHEN 25 THEN 'string' WHEN 26 THEN 'string' END as "fieldType", CASE group_set WHEN 99999 THEN CAST(NULL as VARCHAR) WHEN 0 THEN base."act_date" WHEN 1 THEN base."aero_cht" WHEN 2 THEN base."c_ldg_rts" WHEN 3 THEN base."cbd_dir" WHEN 5 THEN base."cert" WHEN 6 THEN base."city" WHEN 7 THEN base."cntl_twr" WHEN 8 THEN base."code" WHEN 9 THEN base."county" WHEN 10 THEN base."cust_intl" WHEN 12 THEN base."faa_dist" WHEN 13 THEN base."faa_region" WHEN 14 THEN base."fac_type" WHEN 15 THEN base."fac_use" WHEN 16 THEN base."fed_agree" WHEN 17 THEN base."full_name" WHEN 19 THEN base."joint_use" WHEN 22 THEN base."major" WHEN 23 THEN base."mil_rts" WHEN 24 THEN base."own_type" WHEN 25 THEN base."site_number" WHEN 26 THEN base."state" END as "fieldValue", COUNT(*) as "weight", CASE group_set WHEN 99999 THEN '' WHEN 4 THEN MIN(CAST(base."cbd_dist" as VARCHAR)) || ' to ' || CAST(MAX(base."cbd_dist") as VARCHAR) WHEN 11 THEN MIN(CAST(base."elevation" as VARCHAR)) || ' to ' || CAST(MAX(base."elevation") as VARCHAR) WHEN 18 THEN MIN(CAST(base."id" as VARCHAR)) || ' to ' || CAST(MAX(base."id") as VARCHAR) WHEN 20 THEN MIN(CAST(base."latitude" as VARCHAR)) || ' to ' || CAST(MAX(base."latitude") as VARCHAR) WHEN 21 THEN MIN(CAST(base."longitude" as VARCHAR)) || ' to ' || CAST(MAX(base."longitude") as VARCHAR) END as "fieldRange" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,27,1)) as group_set ) as group_set GROUP BY 1,2,3,4,5 ) SELECT "fieldName", "fieldPath", "fieldType", COALESCE("fieldValue", "fieldRange") as "fieldValue", "weight" FROM __stage0
Add Ordering
Adding a second query stage to filter on string columns and ordering by weight descending shows us the most common fieldName/fieldValue pairs in the dataset.
All Malloy queries run as a single SQL query. The index:
operator is no different. Click the SQL tab to see how this works.
run: duckdb.table('../data/airports.parquet') -> { index: * } -> { where: fieldType = 'string' select: * order_by: weight desc }
[ { "fieldName": "major", "fieldPath": "major", "fieldType": "string", "fieldValue": "N", "weight": 19523 }, { "fieldName": "cert", "fieldPath": "cert", "fieldType": "string", "fieldValue": null, "weight": 19142 }, { "fieldName": "cntl_twr", "fieldPath": "cntl_twr", "fieldType": "string", "fieldValue": "N", "weight": 19124 }, { "fieldName": "fed_agree", "fieldPath": "fed_agree", "fieldType": "string", "fieldValue": null, "weight": 16252 }, { "fieldName": "c_ldg_rts", "fieldPath": "c_ldg_rts", "fieldType": "string", "fieldValue": null, "weight": 15145 }, { "fieldName": "cust_intl", "fieldPath": "cust_intl", "fieldType": "string", "fieldValue": null, "weight": 15145 }, { "fieldName": "joint_use", "fieldPath": "joint_use", "fieldType": "string", "fieldValue": null, "weight": 14804 }, { "fieldName": "mil_rts", "fieldPath": "mil_rts", "fieldType": "string", "fieldValue": null, "weight": 14716 }, { "fieldName": "fac_use", "fieldPath": "fac_use", "fieldType": "string", "fieldValue": "PR", "weight": 14428 }, { "fieldName": "own_type", "fieldPath": "own_type", "fieldType": "string", "fieldValue": "PR", "weight": 14306 }, { "fieldName": "fac_type", "fieldPath": "fac_type", "fieldType": "string", "fieldValue": "AIRPORT", "weight": 13925 }, { "fieldName": "act_date", "fieldPath": "act_date", "fieldType": "string", "fieldValue": null, "weight": 12040 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "NONE", "weight": 7085 }, { "fieldName": "fac_use", "fieldPath": "fac_use", "fieldType": "string", "fieldValue": "PU", "weight": 5365 }, { "fieldName": "own_type", "fieldPath": "own_type", "fieldType": "string", "fieldValue": "PU", "weight": 5174 }, { "fieldName": "fac_type", "fieldPath": "fac_type", "fieldType": "string", "fieldValue": "HELIPORT", "weight": 5135 }, { "fieldName": "joint_use", "fieldPath": "joint_use", "fieldType": "string", "fieldValue": "N", "weight": 4779 }, { "fieldName": "cust_intl", "fieldPath": "cust_intl", "fieldType": "string", "fieldValue": "N", "weight": 4575 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "AGL", "weight": 4437 }, { "fieldName": "c_ldg_rts", "fieldPath": "c_ldg_rts", "fieldType": "string", "fieldValue": "N", "weight": 4365 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "N", "weight": 3694 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "ASW", "weight": 3268 }, { "fieldName": "mil_rts", "fieldPath": "mil_rts", "fieldType": "string", "fieldValue": "Y", "weight": 2958 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "ASO", "weight": 2924 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "AEA", "weight": 2586 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "NW", "weight": 2438 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "SW", "weight": 2427 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "NE", "weight": 2393 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "SE", "weight": 2281 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "S", "weight": 2155 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "E", "weight": 2127 }, { "fieldName": "mil_rts", "fieldPath": "mil_rts", "fieldType": "string", "fieldValue": "N", "weight": 2119 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "ANM", "weight": 2102 }, { "fieldName": "cbd_dir", "fieldPath": "cbd_dir", "fieldType": "string", "fieldValue": "W", "weight": 2065 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "TX", "weight": 1845 }, { "fieldName": "fed_agree", "fieldPath": "fed_agree", "fieldType": "string", "fieldValue": "NGY", "weight": 1682 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "NEW YORK", "weight": 1581 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "ACE", "weight": 1579 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "CHI", "weight": 1528 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "AWP", "weight": 1503 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "CHICAGO", "weight": 1480 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "DETROIT", "weight": 1262 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "DET", "weight": 1240 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "SEA", "weight": 1162 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "MSP", "weight": 1046 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "ATL", "weight": 1029 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "CA", "weight": 984 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "NYC", "weight": 957 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "ORL", "weight": 905 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "IL", "weight": 890 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "DALLAS-FT WORTH", "weight": 888 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "WASHINGTON", "weight": 887 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "FL", "weight": 856 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "ST LOUIS", "weight": 847 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "HAR", "weight": 846 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "PA", "weight": 804 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "ATLANTA", "weight": 780 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "HOUSTON", "weight": 765 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "ANE", "weight": 763 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "SEATTLE", "weight": 749 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "OH", "weight": 749 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "KANSAS CITY", "weight": 713 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "TWIN CITIES", "weight": 699 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "DEN", "weight": 679 }, { "fieldName": "cntl_twr", "fieldPath": "cntl_twr", "fieldType": "string", "fieldValue": "Y", "weight": 669 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "DCA", "weight": 667 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "IN", "weight": 643 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "CINCINNATI", "weight": 624 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "BIS", "weight": 616 }, { "fieldName": "faa_region", "fieldPath": "faa_region", "fieldType": "string", "fieldValue": "AAL", "weight": 608 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "AK", "weight": 608 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "MEMPHIS", "weight": 605 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "SFO", "weight": 585 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "NY", "weight": 576 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "OMAHA", "weight": 551 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "WI", "weight": 543 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "MO", "weight": 537 }, { "fieldName": "fed_agree", "fieldPath": "fed_agree", "fieldType": "string", "fieldValue": "N", "weight": 515 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "MN", "weight": 507 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "SAN ANTONIO", "weight": 503 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "JAN", "weight": 502 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "LA", "weight": 500 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "LOS ANGELES", "weight": 498 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "MI", "weight": 489 }, { "fieldName": "faa_dist", "fieldPath": "faa_dist", "fieldType": "string", "fieldValue": "MEM", "weight": 487 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "WA", "weight": 484 }, { "fieldName": "fac_type", "fieldPath": "fac_type", "fieldType": "string", "fieldValue": "SEAPLANE BASE", "weight": 473 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "JACKSONVILLE", "weight": 467 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "NEW ORLEANS", "weight": 444 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "OK", "weight": 443 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "OR", "weight": 441 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "GA", "weight": 440 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "ND", "weight": 436 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "SAN FRANCISCO", "weight": 427 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "CO", "weight": 425 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "VA", "weight": 421 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "GREEN BAY", "weight": 418 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "KS", "weight": 415 }, { "fieldName": "state", "fieldPath": "state", "fieldType": "string", "fieldValue": "NC", "weight": 400 }, { "fieldName": "aero_cht", "fieldPath": "aero_cht", "fieldType": "string", "fieldValue": "CHARLOTTE", "weight": 380 } ]
WITH __stage0 AS ( SELECT group_set, CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldName", CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldPath", CASE group_set WHEN 0 THEN 'string' WHEN 1 THEN 'string' WHEN 2 THEN 'string' WHEN 3 THEN 'string' WHEN 4 THEN 'number' WHEN 5 THEN 'string' WHEN 6 THEN 'string' WHEN 7 THEN 'string' WHEN 8 THEN 'string' WHEN 9 THEN 'string' WHEN 10 THEN 'string' WHEN 11 THEN 'number' WHEN 12 THEN 'string' WHEN 13 THEN 'string' WHEN 14 THEN 'string' WHEN 15 THEN 'string' WHEN 16 THEN 'string' WHEN 17 THEN 'string' WHEN 18 THEN 'number' WHEN 19 THEN 'string' WHEN 20 THEN 'number' WHEN 21 THEN 'number' WHEN 22 THEN 'string' WHEN 23 THEN 'string' WHEN 24 THEN 'string' WHEN 25 THEN 'string' WHEN 26 THEN 'string' END as "fieldType", CASE group_set WHEN 99999 THEN CAST(NULL as VARCHAR) WHEN 0 THEN base."act_date" WHEN 1 THEN base."aero_cht" WHEN 2 THEN base."c_ldg_rts" WHEN 3 THEN base."cbd_dir" WHEN 5 THEN base."cert" WHEN 6 THEN base."city" WHEN 7 THEN base."cntl_twr" WHEN 8 THEN base."code" WHEN 9 THEN base."county" WHEN 10 THEN base."cust_intl" WHEN 12 THEN base."faa_dist" WHEN 13 THEN base."faa_region" WHEN 14 THEN base."fac_type" WHEN 15 THEN base."fac_use" WHEN 16 THEN base."fed_agree" WHEN 17 THEN base."full_name" WHEN 19 THEN base."joint_use" WHEN 22 THEN base."major" WHEN 23 THEN base."mil_rts" WHEN 24 THEN base."own_type" WHEN 25 THEN base."site_number" WHEN 26 THEN base."state" END as "fieldValue", COUNT(*) as "weight", CASE group_set WHEN 99999 THEN '' WHEN 4 THEN MIN(CAST(base."cbd_dist" as VARCHAR)) || ' to ' || CAST(MAX(base."cbd_dist") as VARCHAR) WHEN 11 THEN MIN(CAST(base."elevation" as VARCHAR)) || ' to ' || CAST(MAX(base."elevation") as VARCHAR) WHEN 18 THEN MIN(CAST(base."id" as VARCHAR)) || ' to ' || CAST(MAX(base."id") as VARCHAR) WHEN 20 THEN MIN(CAST(base."latitude" as VARCHAR)) || ' to ' || CAST(MAX(base."latitude") as VARCHAR) WHEN 21 THEN MIN(CAST(base."longitude" as VARCHAR)) || ' to ' || CAST(MAX(base."longitude") as VARCHAR) END as "fieldRange" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,27,1)) as group_set ) as group_set GROUP BY 1,2,3,4,5 ) , __stage1 AS ( SELECT "fieldName", "fieldPath", "fieldType", COALESCE("fieldValue", "fieldRange") as "fieldValue", "weight" FROM __stage0 ) SELECT base."fieldName" as "fieldName", base."fieldPath" as "fieldPath", base."fieldType" as "fieldType", base."fieldValue" as "fieldValue", base."weight" as "weight" FROM __stage1 as base WHERE base."fieldType"='string' ORDER BY 5 desc NULLS LAST
Index For Filtering User Interfaces
Indexes can be used find the best way to filter a dataset. For example supposed we'd like to find 'SANTA CRUZ' in the dataset. Upon approaching the dataset, but we don't which column might contain it. In a UI you might imagine that you type 'SANTA' and let have suggestons for values that might be appropriate. In the results we can see that top value, 'SANTA ROSA', appears as county on 26 rows in the table. We can also see that 'SANTA CRUZ' is both a city
and a county
..
run: duckdb.table('../data/airports.parquet') -> { index: * } -> { where: fieldValue ~ r'SANTA' select: * order_by: weight desc limit: 15 }
[ { "fieldName": "county", "fieldPath": "county", "fieldType": "string", "fieldValue": "SANTA ROSA", "weight": 26 }, { "fieldName": "county", "fieldPath": "county", "fieldType": "string", "fieldValue": "SANTA BARBARA", "weight": 22 }, { "fieldName": "county", "fieldPath": "county", "fieldType": "string", "fieldValue": "SANTA CRUZ", "weight": 10 }, { "fieldName": "county", "fieldPath": "county", "fieldType": "string", "fieldValue": "SANTA CLARA", "weight": 10 }, { "fieldName": "county", "fieldPath": "county", "fieldType": "string", "fieldValue": "SANTA FE", "weight": 9 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA ANA", "weight": 6 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA FE", "weight": 6 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA BARBARA", "weight": 5 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA ROSA", "weight": 4 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA MARIA", "weight": 3 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA YNEZ", "weight": 3 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA ELENA", "weight": 2 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA CRUZ", "weight": 2 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "SANTA YSABEL", "weight": 2 }, { "fieldName": "city", "fieldPath": "city", "fieldType": "string", "fieldValue": "RANCHO SANTA MARGARITA", "weight": 2 } ]
WITH __stage0 AS ( SELECT group_set, CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldName", CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldPath", CASE group_set WHEN 0 THEN 'string' WHEN 1 THEN 'string' WHEN 2 THEN 'string' WHEN 3 THEN 'string' WHEN 4 THEN 'number' WHEN 5 THEN 'string' WHEN 6 THEN 'string' WHEN 7 THEN 'string' WHEN 8 THEN 'string' WHEN 9 THEN 'string' WHEN 10 THEN 'string' WHEN 11 THEN 'number' WHEN 12 THEN 'string' WHEN 13 THEN 'string' WHEN 14 THEN 'string' WHEN 15 THEN 'string' WHEN 16 THEN 'string' WHEN 17 THEN 'string' WHEN 18 THEN 'number' WHEN 19 THEN 'string' WHEN 20 THEN 'number' WHEN 21 THEN 'number' WHEN 22 THEN 'string' WHEN 23 THEN 'string' WHEN 24 THEN 'string' WHEN 25 THEN 'string' WHEN 26 THEN 'string' END as "fieldType", CASE group_set WHEN 99999 THEN CAST(NULL as VARCHAR) WHEN 0 THEN base."act_date" WHEN 1 THEN base."aero_cht" WHEN 2 THEN base."c_ldg_rts" WHEN 3 THEN base."cbd_dir" WHEN 5 THEN base."cert" WHEN 6 THEN base."city" WHEN 7 THEN base."cntl_twr" WHEN 8 THEN base."code" WHEN 9 THEN base."county" WHEN 10 THEN base."cust_intl" WHEN 12 THEN base."faa_dist" WHEN 13 THEN base."faa_region" WHEN 14 THEN base."fac_type" WHEN 15 THEN base."fac_use" WHEN 16 THEN base."fed_agree" WHEN 17 THEN base."full_name" WHEN 19 THEN base."joint_use" WHEN 22 THEN base."major" WHEN 23 THEN base."mil_rts" WHEN 24 THEN base."own_type" WHEN 25 THEN base."site_number" WHEN 26 THEN base."state" END as "fieldValue", COUNT(*) as "weight", CASE group_set WHEN 99999 THEN '' WHEN 4 THEN MIN(CAST(base."cbd_dist" as VARCHAR)) || ' to ' || CAST(MAX(base."cbd_dist") as VARCHAR) WHEN 11 THEN MIN(CAST(base."elevation" as VARCHAR)) || ' to ' || CAST(MAX(base."elevation") as VARCHAR) WHEN 18 THEN MIN(CAST(base."id" as VARCHAR)) || ' to ' || CAST(MAX(base."id") as VARCHAR) WHEN 20 THEN MIN(CAST(base."latitude" as VARCHAR)) || ' to ' || CAST(MAX(base."latitude") as VARCHAR) WHEN 21 THEN MIN(CAST(base."longitude" as VARCHAR)) || ' to ' || CAST(MAX(base."longitude") as VARCHAR) END as "fieldRange" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,27,1)) as group_set ) as group_set GROUP BY 1,2,3,4,5 ) , __stage1 AS ( SELECT "fieldName", "fieldPath", "fieldType", COALESCE("fieldValue", "fieldRange") as "fieldValue", "weight" FROM __stage0 ) SELECT base."fieldName" as "fieldName", base."fieldPath" as "fieldPath", base."fieldType" as "fieldType", base."fieldValue" as "fieldValue", base."weight" as "weight" FROM __stage1 as base WHERE REGEXP_MATCHES(base."fieldValue",'SANTA') ORDER BY 5 desc NULLS LAST LIMIT 15
We can then write a simple query to show the rows. It turns out that 'SANTA CRUZ' is a county in both California and Arizona.
run: duckdb.table('../data/airports.parquet') -> { where: county ~ 'SANTA CRUZ' select: * }
[ { "act_date": "10/1987", "aero_cht": "PHOENIX", "c_ldg_rts": null, "cbd_dir": "E", "cbd_dist": 1, "cert": null, "city": "TUBAC", "cntl_twr": "N", "code": "2AZ8", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 3200, "faa_dist": "NONE", "faa_region": "AWP", "fac_type": "ULTRALIGHT", "fac_use": "PR", "fed_agree": null, "full_name": "TUBAC ULTRALIGHT FLIGHTPARK", "id": 1444, "joint_use": null, "latitude": 31.61, "longitude": -111.03, "major": "N", "mil_rts": null, "own_type": "PR", "site_number": "00811.*U", "state": "AZ" }, { "act_date": null, "aero_cht": "PHOENIX", "c_ldg_rts": "N", "cbd_dir": "NE", "cbd_dist": 7, "cert": null, "city": "NOGALES", "cntl_twr": "N", "code": "OLS", "county": "SANTA CRUZ", "cust_intl": "Y", "elevation": 3955, "faa_dist": "NONE", "faa_region": "AWP", "fac_type": "AIRPORT", "fac_use": "PU", "fed_agree": "NGY", "full_name": "NOGALES INTL", "id": 1312, "joint_use": "N", "latitude": 31.41, "longitude": -110.84, "major": "N", "mil_rts": "Y", "own_type": "PU", "site_number": "00739.*A", "state": "AZ" }, { "act_date": null, "aero_cht": "LOS ANGELES", "c_ldg_rts": null, "cbd_dir": null, "cbd_dist": 0, "cert": null, "city": "SAN NICOLAS ISLAND", "cntl_twr": "Y", "code": "NSI", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 504, "faa_dist": "NONE", "faa_region": "AWP", "fac_type": "AIRPORT", "fac_use": "PR", "fed_agree": null, "full_name": "SAN NICOLAS ISLAND NOLF", "id": 2287, "joint_use": null, "latitude": 33.23, "longitude": -119.45, "major": "N", "mil_rts": "Y", "own_type": "MN", "site_number": "02220.*A", "state": "CA" }, { "act_date": "07/1986", "aero_cht": "SAN FRANCISCO", "c_ldg_rts": null, "cbd_dir": "NW", "cbd_dist": 8, "cert": null, "city": "SANTA CRUZ", "cntl_twr": "N", "code": "CL77", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 2020, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "AIRPORT", "fac_use": "PR", "fed_agree": null, "full_name": "BONNY DOON VILLAGE", "id": 2307, "joint_use": null, "latitude": 37.07, "longitude": -122.12, "major": "N", "mil_rts": null, "own_type": "PR", "site_number": "02241.1*A", "state": "CA" }, { "act_date": null, "aero_cht": "SAN FRANCISCO", "c_ldg_rts": null, "cbd_dir": "NE", "cbd_dist": 3, "cert": null, "city": "SANTA CRUZ", "cntl_twr": "N", "code": "CA37", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 115, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "HELIPORT", "fac_use": "PR", "fed_agree": null, "full_name": "DOMINICAN SANTA CRUZ HOSPITAL", "id": 2308, "joint_use": null, "latitude": 36.99, "longitude": -121.98, "major": "N", "mil_rts": null, "own_type": "PR", "site_number": "02241.12*H", "state": "CA" }, { "act_date": null, "aero_cht": "SAN FRANCISCO", "c_ldg_rts": null, "cbd_dir": "NW", "cbd_dist": 6, "cert": null, "city": "DAVENPORT", "cntl_twr": "N", "code": "6Q6", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 125, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "AIRPORT", "fac_use": "PR", "fed_agree": null, "full_name": "LAS TRANCAS", "id": 1693, "joint_use": "N", "latitude": 37.08, "longitude": -122.27, "major": "N", "mil_rts": "N", "own_type": "PR", "site_number": "01486.9*A", "state": "CA" }, { "act_date": null, "aero_cht": "SAN FRANCISCO", "c_ldg_rts": null, "cbd_dir": "NW", "cbd_dist": 5, "cert": null, "city": "WATSONVILLE", "cntl_twr": "N", "code": "CA65", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 480, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "HELIPORT", "fac_use": "PR", "fed_agree": null, "full_name": "ALTA VISTA", "id": 2441, "joint_use": "N", "latitude": 36.97, "longitude": -121.86, "major": "N", "mil_rts": "N", "own_type": "PR", "site_number": "02429.1*H", "state": "CA" }, { "act_date": "09/1987", "aero_cht": "SAN FRANCISCO", "c_ldg_rts": null, "cbd_dir": "E", "cbd_dist": 0, "cert": null, "city": "WATSONVILLE", "cntl_twr": "N", "code": "CL99", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 111, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "HELIPORT", "fac_use": "PR", "fed_agree": null, "full_name": "WATSONVILLE COMMUNITY HOSPITAL", "id": 2442, "joint_use": null, "latitude": 36.93, "longitude": -121.77, "major": "N", "mil_rts": null, "own_type": "PU", "site_number": "02429.11*H", "state": "CA" }, { "act_date": null, "aero_cht": "SAN FRANCISCO", "c_ldg_rts": null, "cbd_dir": "W", "cbd_dist": 4, "cert": null, "city": "WATSONVILLE", "cntl_twr": "N", "code": "CA66", "county": "SANTA CRUZ", "cust_intl": null, "elevation": 70, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "AIRPORT", "fac_use": "PR", "fed_agree": null, "full_name": "MONTEREY BAY ACADEMY", "id": 2443, "joint_use": "N", "latitude": 36.9, "longitude": -121.84, "major": "N", "mil_rts": "N", "own_type": "PR", "site_number": "02429.2*A", "state": "CA" }, { "act_date": null, "aero_cht": "SAN FRANCISCO", "c_ldg_rts": "N", "cbd_dir": "NW", "cbd_dist": 3, "cert": null, "city": "WATSONVILLE", "cntl_twr": "N", "code": "WVI", "county": "SANTA CRUZ", "cust_intl": "N", "elevation": 160, "faa_dist": "SFO", "faa_region": "AWP", "fac_type": "AIRPORT", "fac_use": "PU", "fed_agree": "NGPRY", "full_name": "WATSONVILLE MUNI", "id": 2440, "joint_use": "N", "latitude": 36.93, "longitude": -121.78, "major": "N", "mil_rts": "Y", "own_type": "PU", "site_number": "02429.*A", "state": "CA" } ]
SELECT base."act_date" as "act_date", base."aero_cht" as "aero_cht", base."c_ldg_rts" as "c_ldg_rts", base."cbd_dir" as "cbd_dir", base."cbd_dist" as "cbd_dist", base."cert" as "cert", base."city" as "city", base."cntl_twr" as "cntl_twr", base."code" as "code", base."county" as "county", base."cust_intl" as "cust_intl", base."elevation" as "elevation", base."faa_dist" as "faa_dist", base."faa_region" as "faa_region", base."fac_type" as "fac_type", base."fac_use" as "fac_use", base."fed_agree" as "fed_agree", base."full_name" as "full_name", base."id" as "id", base."joint_use" as "joint_use", base."latitude" as "latitude", base."longitude" as "longitude", base."major" as "major", base."mil_rts" as "mil_rts", base."own_type" as "own_type", base."site_number" as "site_number", base."state" as "state" FROM '../data/airports.parquet' as base WHERE base."county" LIKE 'SANTA CRUZ'
Indexing to show top values for each dimension
It is often difficult to approach a new dataset. The index operator provides an intersting way to quickly gain an understanding of the dataset. By piping the results of an index another stage, we can quickly see all the interesting values for each of the interesting dimesions. Again, the weight shows the number of rows for that particular dimension/value.
run: duckdb.table('../data/airports.parquet') -> { index: * } -> { group_by: fieldName nest: values is { group_by: fieldValue, weight order_by: weight desc limit: 10 } order_by: fieldName }
[ { "fieldName": "act_date", "values": [ { "fieldValue": null, "weight": 12040 }, { "fieldValue": "04/1993", "weight": 145 }, { "fieldValue": "02/1987", "weight": 129 }, { "fieldValue": "03/1993", "weight": 127 }, { "fieldValue": "11/1990", "weight": 122 }, { "fieldValue": "11/1986", "weight": 119 }, { "fieldValue": "02/1982", "weight": 118 }, { "fieldValue": "08/1988", "weight": 115 }, { "fieldValue": "11/1988", "weight": 113 }, { "fieldValue": "03/1988", "weight": 113 } ] }, { "fieldName": "aero_cht", "values": [ { "fieldValue": "NEW YORK", "weight": 1581 }, { "fieldValue": "CHICAGO", "weight": 1480 }, { "fieldValue": "DETROIT", "weight": 1262 }, { "fieldValue": "DALLAS-FT WORTH", "weight": 888 }, { "fieldValue": "WASHINGTON", "weight": 887 }, { "fieldValue": "ST LOUIS", "weight": 847 }, { "fieldValue": "ATLANTA", "weight": 780 }, { "fieldValue": "HOUSTON", "weight": 765 }, { "fieldValue": "SEATTLE", "weight": 749 }, { "fieldValue": "KANSAS CITY", "weight": 713 } ] }, { "fieldName": "c_ldg_rts", "values": [ { "fieldValue": null, "weight": 15145 }, { "fieldValue": "N", "weight": 4365 }, { "fieldValue": "Y", "weight": 283 } ] }, { "fieldName": "cbd_dir", "values": [ { "fieldValue": "N", "weight": 3694 }, { "fieldValue": "NW", "weight": 2438 }, { "fieldValue": "SW", "weight": 2427 }, { "fieldValue": "NE", "weight": 2393 }, { "fieldValue": "SE", "weight": 2281 }, { "fieldValue": "S", "weight": 2155 }, { "fieldValue": "E", "weight": 2127 }, { "fieldValue": "W", "weight": 2065 }, { "fieldValue": null, "weight": 119 }, { "fieldValue": "SSW", "weight": 20 } ] }, { "fieldName": "cbd_dist", "values": [ { "fieldValue": "0 to 73", "weight": 19793 } ] }, { "fieldName": "cert", "values": [ { "fieldValue": null, "weight": 19142 }, { "fieldValue": "AS 05/1973", "weight": 94 }, { "fieldValue": "CS 05/1973", "weight": 84 }, { "fieldValue": "BS 05/1973", "weight": 81 }, { "fieldValue": "LU 05/1973", "weight": 37 }, { "fieldValue": "LU 08/1990", "weight": 33 }, { "fieldValue": "DS 05/1973", "weight": 26 }, { "fieldValue": "AU 05/1973", "weight": 25 }, { "fieldValue": "ES 05/1973", "weight": 15 }, { "fieldValue": "LU 11/1974", "weight": 9 } ] }, { "fieldName": "city", "values": [ { "fieldValue": "HOUSTON", "weight": 108 }, { "fieldValue": "LOS ANGELES", "weight": 60 }, { "fieldValue": "COLUMBUS", "weight": 47 }, { "fieldValue": "SPRINGFIELD", "weight": 45 }, { "fieldValue": "JACKSON", "weight": 42 }, { "fieldValue": "GREENVILLE", "weight": 38 }, { "fieldValue": "WASHINGTON", "weight": 38 }, { "fieldValue": "CLINTON", "weight": 37 }, { "fieldValue": "PHOENIX", "weight": 37 }, { "fieldValue": "PHILADELPHIA", "weight": 35 } ] }, { "fieldName": "cntl_twr", "values": [ { "fieldValue": "N", "weight": 19124 }, { "fieldValue": "Y", "weight": 669 } ] }, { "fieldName": "code", "values": [ { "fieldValue": "N18", "weight": 1 }, { "fieldValue": "P16", "weight": 1 }, { "fieldValue": "KWA", "weight": 1 }, { "fieldValue": "8K9", "weight": 1 }, { "fieldValue": "AK18", "weight": 1 }, { "fieldValue": "7AK4", "weight": 1 }, { "fieldValue": "28AK", "weight": 1 }, { "fieldValue": "AK42", "weight": 1 }, { "fieldValue": "4K5", "weight": 1 }, { "fieldValue": "9A8", "weight": 1 } ] }, { "fieldName": "county", "values": [ { "fieldValue": "WASHINGTON", "weight": 214 }, { "fieldValue": "JEFFERSON", "weight": 199 }, { "fieldValue": "LOS ANGELES", "weight": 176 }, { "fieldValue": "MONTGOMERY", "weight": 154 }, { "fieldValue": "JACKSON", "weight": 140 }, { "fieldValue": "FRANKLIN", "weight": 140 }, { "fieldValue": "HARRIS", "weight": 137 }, { "fieldValue": "MARION", "weight": 127 }, { "fieldValue": "ORANGE", "weight": 125 }, { "fieldValue": "MARICOPA", "weight": 117 } ] }, { "fieldName": "cust_intl", "values": [ { "fieldValue": null, "weight": 15145 }, { "fieldValue": "N", "weight": 4575 }, { "fieldValue": "Y", "weight": 73 } ] }, { "fieldName": "elevation", "values": [ { "fieldValue": "-1 to 12442", "weight": 19793 } ] }, { "fieldName": "faa_dist", "values": [ { "fieldValue": "NONE", "weight": 7085 }, { "fieldValue": "CHI", "weight": 1528 }, { "fieldValue": "DET", "weight": 1240 }, { "fieldValue": "SEA", "weight": 1162 }, { "fieldValue": "MSP", "weight": 1046 }, { "fieldValue": "ATL", "weight": 1029 }, { "fieldValue": "NYC", "weight": 957 }, { "fieldValue": "ORL", "weight": 905 }, { "fieldValue": "HAR", "weight": 846 }, { "fieldValue": "DEN", "weight": 679 } ] }, { "fieldName": "faa_region", "values": [ { "fieldValue": "AGL", "weight": 4437 }, { "fieldValue": "ASW", "weight": 3268 }, { "fieldValue": "ASO", "weight": 2924 }, { "fieldValue": "AEA", "weight": 2586 }, { "fieldValue": "ANM", "weight": 2102 }, { "fieldValue": "ACE", "weight": 1579 }, { "fieldValue": "AWP", "weight": 1503 }, { "fieldValue": "ANE", "weight": 763 }, { "fieldValue": "AAL", "weight": 608 }, { "fieldValue": null, "weight": 23 } ] }, { "fieldName": "fac_type", "values": [ { "fieldValue": "AIRPORT", "weight": 13925 }, { "fieldValue": "HELIPORT", "weight": 5135 }, { "fieldValue": "SEAPLANE BASE", "weight": 473 }, { "fieldValue": "ULTRALIGHT", "weight": 125 }, { "fieldValue": "STOLPORT", "weight": 86 }, { "fieldValue": "GLIDERPORT", "weight": 37 }, { "fieldValue": "BALLOONPORT", "weight": 12 } ] }, { "fieldName": "fac_use", "values": [ { "fieldValue": "PR", "weight": 14428 }, { "fieldValue": "PU", "weight": 5365 } ] }, { "fieldName": "fed_agree", "values": [ { "fieldValue": null, "weight": 16252 }, { "fieldValue": "NGY", "weight": 1682 }, { "fieldValue": "N", "weight": 515 }, { "fieldValue": "NGY3", "weight": 219 }, { "fieldValue": "NGPY", "weight": 171 }, { "fieldValue": "NY1", "weight": 123 }, { "fieldValue": "N1", "weight": 97 }, { "fieldValue": "1", "weight": 96 }, { "fieldValue": "NGSY", "weight": 87 }, { "fieldValue": "NGPY3", "weight": 84 } ] }, { "fieldName": "full_name", "values": [ { "fieldValue": "MEMORIAL HOSPITAL", "weight": 21 }, { "fieldValue": "SMITH", "weight": 14 }, { "fieldValue": "JOHNSON", "weight": 13 }, { "fieldValue": "ST MARY'S HOSPITAL", "weight": 11 }, { "fieldValue": "MILLER", "weight": 10 }, { "fieldValue": "DAVIS FIELD", "weight": 10 }, { "fieldValue": "HILLTOP", "weight": 10 }, { "fieldValue": "TAYLOR", "weight": 9 }, { "fieldValue": "WILSON", "weight": 9 }, { "fieldValue": "WILLIAMS", "weight": 9 } ] }, { "fieldName": "id", "values": [ { "fieldValue": "1 to 19793", "weight": 19793 } ] }, { "fieldName": "joint_use", "values": [ { "fieldValue": null, "weight": 14804 }, { "fieldValue": "N", "weight": 4779 }, { "fieldValue": "Y", "weight": 210 } ] }, { "fieldName": "latitude", "values": [ { "fieldValue": "-14.18 to 71.28", "weight": 19793 } ] }, { "fieldName": "longitude", "values": [ { "fieldValue": "-100.0 to 174.11", "weight": 19793 } ] }, { "fieldName": "major", "values": [ { "fieldValue": "N", "weight": 19523 }, { "fieldValue": "Y", "weight": 270 } ] }, { "fieldName": "mil_rts", "values": [ { "fieldValue": null, "weight": 14716 }, { "fieldValue": "Y", "weight": 2958 }, { "fieldValue": "N", "weight": 2119 } ] }, { "fieldName": "own_type", "values": [ { "fieldValue": "PR", "weight": 14306 }, { "fieldValue": "PU", "weight": 5174 }, { "fieldValue": "MR", "weight": 128 }, { "fieldValue": "MA", "weight": 107 }, { "fieldValue": "MN", "weight": 78 } ] }, { "fieldName": "site_number", "values": [ { "fieldValue": "51518.4*A", "weight": 1 }, { "fieldValue": "50320.13*A", "weight": 1 }, { "fieldValue": "50146.06*A", "weight": 1 }, { "fieldValue": "50140.71*H", "weight": 1 }, { "fieldValue": "50215.2*A", "weight": 1 }, { "fieldValue": "50529.78*H", "weight": 1 }, { "fieldValue": "50075.1*A", "weight": 1 }, { "fieldValue": "50870.24*A", "weight": 1 }, { "fieldValue": "50429.*A", "weight": 1 }, { "fieldValue": "50684.4*A", "weight": 1 } ] }, { "fieldName": "state", "values": [ { "fieldValue": "TX", "weight": 1845 }, { "fieldValue": "CA", "weight": 984 }, { "fieldValue": "IL", "weight": 890 }, { "fieldValue": "FL", "weight": 856 }, { "fieldValue": "PA", "weight": 804 }, { "fieldValue": "OH", "weight": 749 }, { "fieldValue": "IN", "weight": 643 }, { "fieldValue": "AK", "weight": 608 }, { "fieldValue": "NY", "weight": 576 }, { "fieldValue": "WI", "weight": 543 } ] }, { "fieldName": null, "values": [ { "fieldValue": null, "weight": 19793 } ] } ]
WITH __stage0 AS ( SELECT group_set, CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldName", CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldPath", CASE group_set WHEN 0 THEN 'string' WHEN 1 THEN 'string' WHEN 2 THEN 'string' WHEN 3 THEN 'string' WHEN 4 THEN 'number' WHEN 5 THEN 'string' WHEN 6 THEN 'string' WHEN 7 THEN 'string' WHEN 8 THEN 'string' WHEN 9 THEN 'string' WHEN 10 THEN 'string' WHEN 11 THEN 'number' WHEN 12 THEN 'string' WHEN 13 THEN 'string' WHEN 14 THEN 'string' WHEN 15 THEN 'string' WHEN 16 THEN 'string' WHEN 17 THEN 'string' WHEN 18 THEN 'number' WHEN 19 THEN 'string' WHEN 20 THEN 'number' WHEN 21 THEN 'number' WHEN 22 THEN 'string' WHEN 23 THEN 'string' WHEN 24 THEN 'string' WHEN 25 THEN 'string' WHEN 26 THEN 'string' END as "fieldType", CASE group_set WHEN 99999 THEN CAST(NULL as VARCHAR) WHEN 0 THEN base."act_date" WHEN 1 THEN base."aero_cht" WHEN 2 THEN base."c_ldg_rts" WHEN 3 THEN base."cbd_dir" WHEN 5 THEN base."cert" WHEN 6 THEN base."city" WHEN 7 THEN base."cntl_twr" WHEN 8 THEN base."code" WHEN 9 THEN base."county" WHEN 10 THEN base."cust_intl" WHEN 12 THEN base."faa_dist" WHEN 13 THEN base."faa_region" WHEN 14 THEN base."fac_type" WHEN 15 THEN base."fac_use" WHEN 16 THEN base."fed_agree" WHEN 17 THEN base."full_name" WHEN 19 THEN base."joint_use" WHEN 22 THEN base."major" WHEN 23 THEN base."mil_rts" WHEN 24 THEN base."own_type" WHEN 25 THEN base."site_number" WHEN 26 THEN base."state" END as "fieldValue", COUNT(*) as "weight", CASE group_set WHEN 99999 THEN '' WHEN 4 THEN MIN(CAST(base."cbd_dist" as VARCHAR)) || ' to ' || CAST(MAX(base."cbd_dist") as VARCHAR) WHEN 11 THEN MIN(CAST(base."elevation" as VARCHAR)) || ' to ' || CAST(MAX(base."elevation") as VARCHAR) WHEN 18 THEN MIN(CAST(base."id" as VARCHAR)) || ' to ' || CAST(MAX(base."id") as VARCHAR) WHEN 20 THEN MIN(CAST(base."latitude" as VARCHAR)) || ' to ' || CAST(MAX(base."latitude") as VARCHAR) WHEN 21 THEN MIN(CAST(base."longitude" as VARCHAR)) || ' to ' || CAST(MAX(base."longitude") as VARCHAR) END as "fieldRange" FROM '../data/airports.parquet' as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,27,1)) as group_set ) as group_set GROUP BY 1,2,3,4,5 ) , __stage1 AS ( SELECT "fieldName", "fieldPath", "fieldType", COALESCE("fieldValue", "fieldRange") as "fieldValue", "weight" FROM __stage0 ) , __stage2 AS ( SELECT group_set, base."fieldName" as "fieldName__0", CASE WHEN group_set=1 THEN base."fieldValue" END as "fieldValue__1", CASE WHEN group_set=1 THEN base."weight" END as "weight__1" FROM __stage1 as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,3,4 ) SELECT "fieldName__0" as "fieldName", COALESCE(LIST({ "fieldValue": "fieldValue__1", "weight": "weight__1"} ORDER BY "weight__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:10],[]) as "values" FROM __stage2 GROUP BY 1 ORDER BY 1 ASC NULLS LAST
Sampling
With large datasets, you can also sample a small subsection using the sample:
parameter. Sampled indexes are great at identifing the important low cardinality fields.
run: duckdb.table('../data/airports.parquet') -> { index: * sample: 5000 // sample only 5000 rows } -> { group_by: fieldName nest: values is { group_by: fieldValue, weight order_by: weight desc limit: 10 } order_by: fieldName }
[ { "fieldName": "act_date", "values": [ { "fieldValue": null, "weight": 3063 }, { "fieldValue": "04/1993", "weight": 41 }, { "fieldValue": "08/1988", "weight": 37 }, { "fieldValue": "02/1982", "weight": 36 }, { "fieldValue": "02/1987", "weight": 33 }, { "fieldValue": "11/1988", "weight": 30 }, { "fieldValue": "02/1998", "weight": 29 }, { "fieldValue": "11/1986", "weight": 29 }, { "fieldValue": "08/1994", "weight": 26 }, { "fieldValue": "01/1992", "weight": 26 } ] }, { "fieldName": "aero_cht", "values": [ { "fieldValue": "NEW YORK", "weight": 417 }, { "fieldValue": "CHICAGO", "weight": 391 }, { "fieldValue": "DETROIT", "weight": 340 }, { "fieldValue": "WASHINGTON", "weight": 237 }, { "fieldValue": "ST LOUIS", "weight": 235 }, { "fieldValue": "DALLAS-FT WORTH", "weight": 230 }, { "fieldValue": "TWIN CITIES", "weight": 193 }, { "fieldValue": "SEATTLE", "weight": 192 }, { "fieldValue": "HOUSTON", "weight": 176 }, { "fieldValue": "KANSAS CITY", "weight": 173 } ] }, { "fieldName": "c_ldg_rts", "values": [ { "fieldValue": null, "weight": 3855 }, { "fieldValue": "N", "weight": 1083 }, { "fieldValue": "Y", "weight": 62 } ] }, { "fieldName": "cbd_dir", "values": [ { "fieldValue": "N", "weight": 914 }, { "fieldValue": "NW", "weight": 622 }, { "fieldValue": "SW", "weight": 622 }, { "fieldValue": "NE", "weight": 582 }, { "fieldValue": "SE", "weight": 573 }, { "fieldValue": "S", "weight": 560 }, { "fieldValue": "W", "weight": 550 }, { "fieldValue": "E", "weight": 537 }, { "fieldValue": null, "weight": 22 }, { "fieldValue": "SSW", "weight": 5 } ] }, { "fieldName": "cbd_dist", "values": [ { "fieldValue": "0 to 62", "weight": 5000 } ] }, { "fieldName": "cert", "values": [ { "fieldValue": null, "weight": 4837 }, { "fieldValue": "AS 05/1973", "weight": 24 }, { "fieldValue": "BS 05/1973", "weight": 17 }, { "fieldValue": "CS 05/1973", "weight": 15 }, { "fieldValue": "LU 08/1990", "weight": 10 }, { "fieldValue": "ES 05/1973", "weight": 8 }, { "fieldValue": "DS 05/1973", "weight": 7 }, { "fieldValue": "LU 05/1973", "weight": 7 }, { "fieldValue": "AU 05/1973", "weight": 2 }, { "fieldValue": "LU 11/1974", "weight": 2 } ] }, { "fieldName": "city", "values": [ { "fieldValue": "HOUSTON", "weight": 20 }, { "fieldValue": "LOS ANGELES", "weight": 14 }, { "fieldValue": "SPRINGFIELD", "weight": 11 }, { "fieldValue": "MIAMI", "weight": 11 }, { "fieldValue": "ANCHORAGE", "weight": 11 }, { "fieldValue": "JACKSON", "weight": 11 }, { "fieldValue": "MOUNT VERNON", "weight": 10 }, { "fieldValue": "SALEM", "weight": 10 }, { "fieldValue": "CLINTON", "weight": 10 }, { "fieldValue": "LANCASTER", "weight": 10 } ] }, { "fieldName": "cntl_twr", "values": [ { "fieldValue": "N", "weight": 4837 }, { "fieldValue": "Y", "weight": 163 } ] }, { "fieldName": "code", "values": [ { "fieldValue": "CT33", "weight": 1 }, { "fieldValue": "NE41", "weight": 1 }, { "fieldValue": "OVL", "weight": 1 }, { "fieldValue": "81NJ", "weight": 1 }, { "fieldValue": "50F", "weight": 1 }, { "fieldValue": "9OK6", "weight": 1 }, { "fieldValue": "1KY7", "weight": 1 }, { "fieldValue": "DE21", "weight": 1 }, { "fieldValue": "TX14", "weight": 1 }, { "fieldValue": "TE11", "weight": 1 } ] }, { "fieldName": "county", "values": [ { "fieldValue": "WASHINGTON", "weight": 53 }, { "fieldValue": "JEFFERSON", "weight": 47 }, { "fieldValue": "LOS ANGELES", "weight": 46 }, { "fieldValue": "JACKSON", "weight": 46 }, { "fieldValue": "MONTGOMERY", "weight": 42 }, { "fieldValue": "WAYNE", "weight": 36 }, { "fieldValue": "ORANGE", "weight": 35 }, { "fieldValue": "FRANKLIN", "weight": 34 }, { "fieldValue": "MARICOPA", "weight": 33 }, { "fieldValue": "MADISON", "weight": 30 } ] }, { "fieldName": "cust_intl", "values": [ { "fieldValue": null, "weight": 3854 }, { "fieldValue": "N", "weight": 1127 }, { "fieldValue": "Y", "weight": 19 } ] }, { "fieldName": "elevation", "values": [ { "fieldValue": "-179 to 12442", "weight": 5000 } ] }, { "fieldName": "faa_dist", "values": [ { "fieldValue": "NONE", "weight": 1752 }, { "fieldValue": "CHI", "weight": 397 }, { "fieldValue": "DET", "weight": 339 }, { "fieldValue": "SEA", "weight": 293 }, { "fieldValue": "MSP", "weight": 269 }, { "fieldValue": "NYC", "weight": 260 }, { "fieldValue": "ATL", "weight": 234 }, { "fieldValue": "HAR", "weight": 225 }, { "fieldValue": "ORL", "weight": 207 }, { "fieldValue": "DCA", "weight": 185 } ] }, { "fieldName": "faa_region", "values": [ { "fieldValue": "AGL", "weight": 1165 }, { "fieldValue": "ASW", "weight": 808 }, { "fieldValue": "AEA", "weight": 703 }, { "fieldValue": "ASO", "weight": 680 }, { "fieldValue": "ANM", "weight": 529 }, { "fieldValue": "AWP", "weight": 381 }, { "fieldValue": "ACE", "weight": 380 }, { "fieldValue": "ANE", "weight": 200 }, { "fieldValue": "AAL", "weight": 146 }, { "fieldValue": null, "weight": 8 } ] }, { "fieldName": "fac_type", "values": [ { "fieldValue": "AIRPORT", "weight": 3473 }, { "fieldValue": "HELIPORT", "weight": 1353 }, { "fieldValue": "SEAPLANE BASE", "weight": 114 }, { "fieldValue": "ULTRALIGHT", "weight": 29 }, { "fieldValue": "STOLPORT", "weight": 18 }, { "fieldValue": "GLIDERPORT", "weight": 9 }, { "fieldValue": "BALLOONPORT", "weight": 4 } ] }, { "fieldName": "fac_use", "values": [ { "fieldValue": "PR", "weight": 3686 }, { "fieldValue": "PU", "weight": 1314 } ] }, { "fieldName": "fed_agree", "values": [ { "fieldValue": null, "weight": 4145 }, { "fieldValue": "NGY", "weight": 403 }, { "fieldValue": "N", "weight": 121 }, { "fieldValue": "NGY3", "weight": 46 }, { "fieldValue": "NGPY", "weight": 43 }, { "fieldValue": "NY1", "weight": 32 }, { "fieldValue": "NGSY", "weight": 30 }, { "fieldValue": "N1", "weight": 23 }, { "fieldValue": "1", "weight": 22 }, { "fieldValue": "NGRY", "weight": 18 } ] }, { "fieldName": "full_name", "values": [ { "fieldValue": "MEMORIAL HOSPITAL", "weight": 7 }, { "fieldValue": "ST FRANCIS HOSPITAL", "weight": 6 }, { "fieldValue": "JOHNSON", "weight": 5 }, { "fieldValue": "FLYING W", "weight": 5 }, { "fieldValue": "DAVIS", "weight": 5 }, { "fieldValue": "ANDERSON", "weight": 4 }, { "fieldValue": "ST MARY'S HOSPITAL", "weight": 4 }, { "fieldValue": "JACKSON", "weight": 4 }, { "fieldValue": "NELSON", "weight": 4 }, { "fieldValue": "PROVIDENCE HOSPITAL", "weight": 4 } ] }, { "fieldName": "id", "values": [ { "fieldValue": "100 to 19791", "weight": 5000 } ] }, { "fieldName": "joint_use", "values": [ { "fieldValue": null, "weight": 3744 }, { "fieldValue": "N", "weight": 1205 }, { "fieldValue": "Y", "weight": 51 } ] }, { "fieldName": "latitude", "values": [ { "fieldValue": "11.23 to 70.91", "weight": 5000 } ] }, { "fieldName": "longitude", "values": [ { "fieldValue": "-100.0 to 171.73", "weight": 5000 } ] }, { "fieldName": "major", "values": [ { "fieldValue": "N", "weight": 4928 }, { "fieldValue": "Y", "weight": 72 } ] }, { "fieldName": "mil_rts", "values": [ { "fieldValue": null, "weight": 3729 }, { "fieldValue": "Y", "weight": 709 }, { "fieldValue": "N", "weight": 562 } ] }, { "fieldName": "own_type", "values": [ { "fieldValue": "PR", "weight": 3663 }, { "fieldValue": "PU", "weight": 1257 }, { "fieldValue": "MR", "weight": 34 }, { "fieldValue": "MA", "weight": 32 }, { "fieldValue": "MN", "weight": 14 } ] }, { "fieldName": "site_number", "values": [ { "fieldValue": "02798.68*A", "weight": 1 }, { "fieldValue": "08643.2*A", "weight": 1 }, { "fieldValue": "02865.*H", "weight": 1 }, { "fieldValue": "08711.*C", "weight": 1 }, { "fieldValue": "14748.*A", "weight": 1 }, { "fieldValue": "25903.*A", "weight": 1 }, { "fieldValue": "02814.01*H", "weight": 1 }, { "fieldValue": "14637.01*A", "weight": 1 }, { "fieldValue": "21694.*A", "weight": 1 }, { "fieldValue": "24070.26*H", "weight": 1 } ] }, { "fieldName": "state", "values": [ { "fieldValue": "TX", "weight": 449 }, { "fieldValue": "CA", "weight": 251 }, { "fieldValue": "IL", "weight": 232 }, { "fieldValue": "PA", "weight": 213 }, { "fieldValue": "FL", "weight": 199 }, { "fieldValue": "OH", "weight": 194 }, { "fieldValue": "IN", "weight": 168 }, { "fieldValue": "NY", "weight": 162 }, { "fieldValue": "AK", "weight": 146 }, { "fieldValue": "LA", "weight": 144 } ] }, { "fieldName": null, "values": [ { "fieldValue": null, "weight": 5000 } ] } ]
WITH __stage0 AS ( SELECT * from (SELECT * FROM '../data/airports.parquet' USING SAMPLE 5000) as x limit 100000 ) , __stage1 AS ( SELECT group_set, CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldName", CASE group_set WHEN 0 THEN 'act_date' WHEN 1 THEN 'aero_cht' WHEN 2 THEN 'c_ldg_rts' WHEN 3 THEN 'cbd_dir' WHEN 4 THEN 'cbd_dist' WHEN 5 THEN 'cert' WHEN 6 THEN 'city' WHEN 7 THEN 'cntl_twr' WHEN 8 THEN 'code' WHEN 9 THEN 'county' WHEN 10 THEN 'cust_intl' WHEN 11 THEN 'elevation' WHEN 12 THEN 'faa_dist' WHEN 13 THEN 'faa_region' WHEN 14 THEN 'fac_type' WHEN 15 THEN 'fac_use' WHEN 16 THEN 'fed_agree' WHEN 17 THEN 'full_name' WHEN 18 THEN 'id' WHEN 19 THEN 'joint_use' WHEN 20 THEN 'latitude' WHEN 21 THEN 'longitude' WHEN 22 THEN 'major' WHEN 23 THEN 'mil_rts' WHEN 24 THEN 'own_type' WHEN 25 THEN 'site_number' WHEN 26 THEN 'state' END as "fieldPath", CASE group_set WHEN 0 THEN 'string' WHEN 1 THEN 'string' WHEN 2 THEN 'string' WHEN 3 THEN 'string' WHEN 4 THEN 'number' WHEN 5 THEN 'string' WHEN 6 THEN 'string' WHEN 7 THEN 'string' WHEN 8 THEN 'string' WHEN 9 THEN 'string' WHEN 10 THEN 'string' WHEN 11 THEN 'number' WHEN 12 THEN 'string' WHEN 13 THEN 'string' WHEN 14 THEN 'string' WHEN 15 THEN 'string' WHEN 16 THEN 'string' WHEN 17 THEN 'string' WHEN 18 THEN 'number' WHEN 19 THEN 'string' WHEN 20 THEN 'number' WHEN 21 THEN 'number' WHEN 22 THEN 'string' WHEN 23 THEN 'string' WHEN 24 THEN 'string' WHEN 25 THEN 'string' WHEN 26 THEN 'string' END as "fieldType", CASE group_set WHEN 99999 THEN CAST(NULL as VARCHAR) WHEN 0 THEN base."act_date" WHEN 1 THEN base."aero_cht" WHEN 2 THEN base."c_ldg_rts" WHEN 3 THEN base."cbd_dir" WHEN 5 THEN base."cert" WHEN 6 THEN base."city" WHEN 7 THEN base."cntl_twr" WHEN 8 THEN base."code" WHEN 9 THEN base."county" WHEN 10 THEN base."cust_intl" WHEN 12 THEN base."faa_dist" WHEN 13 THEN base."faa_region" WHEN 14 THEN base."fac_type" WHEN 15 THEN base."fac_use" WHEN 16 THEN base."fed_agree" WHEN 17 THEN base."full_name" WHEN 19 THEN base."joint_use" WHEN 22 THEN base."major" WHEN 23 THEN base."mil_rts" WHEN 24 THEN base."own_type" WHEN 25 THEN base."site_number" WHEN 26 THEN base."state" END as "fieldValue", COUNT(*) as "weight", CASE group_set WHEN 99999 THEN '' WHEN 4 THEN MIN(CAST(base."cbd_dist" as VARCHAR)) || ' to ' || CAST(MAX(base."cbd_dist") as VARCHAR) WHEN 11 THEN MIN(CAST(base."elevation" as VARCHAR)) || ' to ' || CAST(MAX(base."elevation") as VARCHAR) WHEN 18 THEN MIN(CAST(base."id" as VARCHAR)) || ' to ' || CAST(MAX(base."id") as VARCHAR) WHEN 20 THEN MIN(CAST(base."latitude" as VARCHAR)) || ' to ' || CAST(MAX(base."latitude") as VARCHAR) WHEN 21 THEN MIN(CAST(base."longitude" as VARCHAR)) || ' to ' || CAST(MAX(base."longitude") as VARCHAR) END as "fieldRange" FROM __stage0 as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,27,1)) as group_set ) as group_set GROUP BY 1,2,3,4,5 ) , __stage2 AS ( SELECT "fieldName", "fieldPath", "fieldType", COALESCE("fieldValue", "fieldRange") as "fieldValue", "weight" FROM __stage1 ) , __stage3 AS ( SELECT group_set, base."fieldName" as "fieldName__0", CASE WHEN group_set=1 THEN base."fieldValue" END as "fieldValue__1", CASE WHEN group_set=1 THEN base."weight" END as "weight__1" FROM __stage2 as base CROSS JOIN (SELECT UNNEST(GENERATE_SERIES(0,1,1)) as group_set ) as group_set GROUP BY 1,2,3,4 ) SELECT "fieldName__0" as "fieldName", COALESCE(LIST({ "fieldValue": "fieldValue__1", "weight": "weight__1"} ORDER BY "weight__1" desc NULLS LAST) FILTER (WHERE group_set=1)[1:10],[]) as "values" FROM __stage3 GROUP BY 1 ORDER BY 1 ASC NULLS LAST
A More Complex Example
The rest of this pages uses the model below. The data is an excerpt from the IMDB. The Malloy schema for this model is shown on the right. The core value is movies, but joined at the principals (the people that worked on the movie) and the people (the actual data about the individuals).
We use the measure total_ratings
to determin a movie's popularity. An individual's popularity is determined by the some of all the ratings of the movies a person has worked on.
source: movies is duckdb.table('../data/titles.parquet') extend { join_many: principals is duckdb.table('../data/principals.parquet') extend { join_one: people is duckdb.table('../data/names.parquet') on nconst = people.nconst } on tconst = principals.tconst measure: total_ratings is numVotes.sum() }
Weights can be any measure
Often a row count will work nicely as a weight, but sometimes there is something better. In movies, for example the sum of the number of votes will not only find the interesting most interesting movies but will also find the most interesting people. For example the most interesting people in the dataset.
run: movies -> { group_by: principals.people.primaryName aggregate: total_ratings }
[ { "primaryName": "Brad Pitt", "total_ratings": 18269386 }, { "primaryName": "Stan Lee", "total_ratings": 18229893 }, { "primaryName": "John Williams", "total_ratings": 17651379 }, { "primaryName": "Leonardo DiCaprio", "total_ratings": 17032662 }, { "primaryName": "Tom Hanks", "total_ratings": 16650835 }, { "primaryName": "Christopher Nolan", "total_ratings": 15368415 }, { "primaryName": "Steven Spielberg", "total_ratings": 15069082 }, { "primaryName": "Robert De Niro", "total_ratings": 14629861 }, { "primaryName": "Christian Bale", "total_ratings": 13730681 }, { "primaryName": "Samuel L. Jackson", "total_ratings": 13423134 }, { "primaryName": "Robert Downey Jr.", "total_ratings": 12685331 }, { "primaryName": "George Lucas", "total_ratings": 12584331 }, { "primaryName": "Matt Damon", "total_ratings": 12509393 }, { "primaryName": "Quentin Tarantino", "total_ratings": 12348394 }, { "primaryName": "Bruce Willis", "total_ratings": 12045753 }, { "primaryName": "Jack Kirby", "total_ratings": 11901244 }, { "primaryName": "Johnny Depp", "total_ratings": 11755228 }, { "primaryName": "Morgan Freeman", "total_ratings": 11435905 }, { "primaryName": "Tom Cruise", "total_ratings": 11290573 }, { "primaryName": "Thomas Newman", "total_ratings": 11056371 }, { "primaryName": "Hans Zimmer", "total_ratings": 10957053 }, { "primaryName": "Scott Rudin", "total_ratings": 10793692 }, { "primaryName": "Charles Roven", "total_ratings": 10778794 }, { "primaryName": "Bob Kane", "total_ratings": 10685405 }, { "primaryName": "Ian McKellen", "total_ratings": 10552024 }, { "primaryName": "Chris Evans", "total_ratings": 10497785 }, { "primaryName": "Harrison Ford", "total_ratings": 10449365 }, { "primaryName": "Scarlett Johansson", "total_ratings": 10128373 }, { "primaryName": "David S. Goyer", "total_ratings": 10022517 }, { "primaryName": "Jonathan Nolan", "total_ratings": 9823211 }, { "primaryName": "Martin Scorsese", "total_ratings": 9692050 }, { "primaryName": "Peter Jackson", "total_ratings": 9612306 }, { "primaryName": "Orlando Bloom", "total_ratings": 9482690 }, { "primaryName": "James Newton Howard", "total_ratings": 9473293 }, { "primaryName": "Stephen King", "total_ratings": 9454020 }, { "primaryName": "Hugh Jackman", "total_ratings": 9409702 }, { "primaryName": "Kevin Feige",