malloydata / malloy

Malloy is an experimental language for describing data relationships and transformations.

Home Page: http://www.malloydata.dev

License: MIT License

JavaScript 0.43% TypeScript 97.62% ANTLR 1.15% Shell 0.44% Nix 0.01% PEG.js 0.09% HTML 0.10% CSS 0.13% MDX 0.03%

data database sql data-visualization malloy semantic-modeling

malloy's Introduction

Malloy

Malloy is an experimental language for describing data relationships and transformations. It is both a semantic modeling language and a querying language that runs queries against a relational database. Malloy currently supports BigQuery and Postgres, as well as querying Parquet and CSV files via DuckDB.

Click here to try Malloy in your browser!

Installing Malloy

The easiest way to try Malloy is with our VS Code Extension, which provides a place to create Malloy models, execute queries, get help, and more. VS Code is a text editor and IDE (integrated development environment) that runs on your desktop or in your browser. A few ways to install the extension:

To get to know the Malloy language, follow our Quickstart.

Note: The Malloy VSCode Extension tracks a small amount of anonymous usage data. You can opt out in the extension settings. Learn more.

Join the Community

Join our Malloy Slack Community! Use this community to ask questions, meet other Malloy users, and share ideas with one another.
Use GitHub issues in this Repo to provide feedback, suggest improvements, report bugs, and start new discussions.

Resources

Documentation:

Malloy Language - A quick introduction to the language
eCommerce Example Analysis - a walkthrough of the basics on an ecommerce dataset (BigQuery public dataset)
Modeling Walkthrough - introduction to modeling via the Iowa liquor sales public data set (BigQuery public dataset)

YouTube - Watch demos / walkthroughs of Malloy

Contributing

If you would like to work on Malloy, take a look at the instructions for developing Malloy.

To report security issues please see our security policy.

Malloy is not an officially supported Google product.

Syntax Example

Here is a simple example of a Malloy query:

run: bigquery.table('malloy-data.faa.flights') -> {
  where: origin ? 'SFO'
  group_by: carrier
  aggregate:
    flight_count is count()
    average_flight_time is flight_time.avg()
}

In SQL this would be expressed:

SELECT
   carrier,
   COUNT(*) as flight_count,
   AVG(flight_time) as average_flight_time
FROM `malloy-data.faa.flights`
WHERE origin = 'SFO'
GROUP BY carrier
ORDER BY flight_count desc         -- malloy automatically orders by the first aggregate

Learn more about the syntax and language features of Malloy in the Quickstart.

malloy's Issues

Improve "raw" error location references

based on @mtoy-googly-moogly :

The errors in the edit window are a problem. There are "raw" errors, with code locations and error messages, and "cooked" errors, which are only really useful in a terminal, and it looks like the editor is displaying the cooked errors?

What is not yet accurate in the "raw" errors are the code locations. Was waiting for someone to need that.

Sorts with bar_chart renderer are incorrect

With a query along the lines of

explore 'malloy-303216.fluidstate.items' 
| reduce 
  category 
  item_count is count()
  top_items is (reduce 
    top 5
    item 
    item_count is count()
  )

By default, top_items will be ordered by top_items.item_count, but the bar_chart renderer is instead ordering by category.

Feature Request : Provide a Semantic Model Diagram

Can you please provide a command to render an ER diagram from the relationships? It would make working with the model much easier.

Make refining or re-opening a defined object possible

For example, limiting, ordering, and/or doing top n of a turtle; or even potentially modifying which fields are in it or its definitions.

Circular dependencies in imports sometimes breaks the build

NOTE: references to code here are from #71.

I'm encountering a bug that appears to be tied to load order and circular dependencies of imports in lang. I've isolated the issue in this branch. Mostly the changes consist of deleting extraneous things (calls to run queries, etc.) so that the issue can be isolated.

Clone this branch and run yarn; yarn build. Then run this single test:

yarn workspace malloy jest /field-symbols.spec.ts

This test completes successfully. Now comment out line 65 in malloy_query - this is a dummy line I've added to instantiate a MalloyTranslator. So, we're just removing the lines that instantiate a MalloyTranslator in this class:

const parse = new MalloyTranslator("internal://query/1", {
  URLs: { ["internal://query/1"]: "test" },
});

Re-run the test:

yarn run v1.22.10
$ /Users/bporterfield/dev/malloy/node_modules/.bin/jest /field-symbols.spec.ts
 FAIL  src/lang/field-symbols.spec.ts
  ● Test suite failed to run

    TypeError: Class extends value undefined is not a constructor or null

      131 | }
      132 |
    > 133 | class ConstantFieldSpace extends FieldSpace {
          |                                  ^
      134 |   constructor() {
      135 |     super({
      136 |       type: "struct",

      at Object.<anonymous> (src/lang/ast/ast-expr.ts:133:34)
      at Object.<anonymous> (src/lang/ast/index.ts:16:1)
      at Object.<anonymous> (src/lang/field-space.ts:17:1)
      at Object.<anonymous> (src/lang/space-field.ts:16:1)
      at Object.<anonymous> (src/lang/field-symbols.spec.ts:15:1)

Test Suites: 1 failed, 1 total
Tests:       0 total
Snapshots:   0 total
Time:        1.168 s
Ran all test suites matching /\/field-symbols.spec.ts/i.

This message appears to be indicative of circular references, and note that those references can be imports, not just types. Here is an article about the issue, and another follow-up article that does make some sense to me.
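As a rough illustration of why load order matters (this is not code from the Malloy repo), the sketch below simulates a CommonJS-style loader in a single file. The module names and the `makeLoader` helper are hypothetical; the only point is that a module caught in a cycle can be observed while its exports are still empty, so a class that extends an imported base sees `undefined`.

```typescript
// Minimal CommonJS-style loader: a module's exports object is cached
// *before* its body runs, so a cycle can observe a half-filled object.
type Exports = Record<string, unknown>;
type Factory = (load: (name: string) => Exports, moduleExports: Exports) => void;

function makeLoader(factories: Map<string, Factory>): (name: string) => Exports {
  const cache = new Map<string, Exports>();
  const load = (name: string): Exports => {
    const cached = cache.get(name);
    if (cached !== undefined) return cached; // possibly still incomplete
    const factory = factories.get(name);
    if (factory === undefined) throw new Error(`unknown module: ${name}`);
    const moduleExports: Exports = {};
    cache.set(name, moduleExports);
    factory(load, moduleExports);
    return moduleExports;
  };
  return load;
}

// Hypothetical two-module cycle, loosely mirroring field-space.ts <-> ast.
const factories = new Map<string, Factory>([
  ["field-space", (load, moduleExports) => {
    load("ast"); // runs the body of "ast" before FieldSpace is defined
    moduleExports.FieldSpace = class FieldSpace {};
  }],
  ["ast", (load, moduleExports) => {
    // Undefined at this point when "field-space" was the entry point.
    const Base = load("field-space").FieldSpace as new () => object;
    moduleExports.ConstantFieldSpace = class ConstantFieldSpace extends Base {};
  }],
]);

function tryLoad(entry: string): string {
  try {
    makeLoader(factories)(entry);
    return "ok";
  } catch (e) {
    return e instanceof TypeError ? "TypeError" : "other error";
  }
}

const viaFieldSpace = tryLoad("field-space"); // cycle is observed mid-load
const viaAst = tryLoad("ast"); // same modules, different entry point
```

The same two modules load cleanly from one entry point and fail from the other, which is consistent with a seemingly unrelated import change flipping the result of a test.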

How would removing this reference in a completely different class impact this test? The test in question loads space-field.ts, which then loads malloy_query.ts. Today, malloy_query.ts then loads MalloyTranslator, which in turn must pull in some classes that would otherwise not yet be defined in a different loading context. Once MalloyTranslator is no longer loaded in malloy_query.ts, the order of imports seems to result in the issue described in the article.

I've also added the Madge package in this branch, which can be used to detect circular dependencies. Here is the output:

➜  malloy git:(circularReferenceBug) ✗ yarn madge --extensions ts,tsx,js,jsx --circular packages/malloy/src
yarn run v1.22.10
$ /Users/bporterfield/dev/malloy/node_modules/.bin/madge --extensions ts,tsx,js,jsx --circular packages/malloy/src
Processed 47 files (1.8s) (35 warnings)

✖ Found 6 circular dependencies!

1) db/bigquery.ts > malloy.ts
2) lang/ast/index.ts > lang/ast/ast-main.ts
3) lang/field-space.ts > lang/space-field.ts
4) lang/lib/Malloy/MalloyParser.ts > lang/lib/Malloy/MalloyListener.ts
5) lang/lib/Malloy/MalloyParser.ts > lang/lib/Malloy/MalloyVisitor.ts
6) lang/ast/index.ts > lang/ast/ast-time-expr.ts

Node describes how this issue can happen here, and they also mention that "Careful planning is required to allow cyclic module dependencies to work correctly within an application." It's unclear to me what other JS runtimes do under the same conditions, but we may see more of these issues if we attempt to run in-browser, for example.

I imagine we will hit new similar issues if we don't add madge to our tests and have some strategy for ordering imports, perhaps as suggested by the second article. I attempted to resolve the issues myself, but there are a lot of import references between classes or functions in /ast, space-field.ts, and field-space.ts, malloy_types.ts and malloy_query.ts and I'm unsure what the best organization might be. Happy to help work / implement potential solutions.
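For readers unfamiliar with what a check like this does, here is a minimal sketch of cycle detection over an import graph using depth-first search. It is not madge's implementation, and the graph edges are illustrative only (loosely based on two of the cycles listed above). It reports one cycle per back edge it meets, not every elementary cycle.

```typescript
// Import graph: file -> files it imports.
type Graph = Map<string, string[]>;

function findCycles(graph: Graph): string[][] {
  const cycles: string[][] = [];
  const done = new Set<string>(); // nodes whose imports are fully explored
  const stack: string[] = []; // current DFS path

  const visit = (node: string): void => {
    const at = stack.indexOf(node);
    if (at !== -1) {
      // Back edge: the path from `at` to the top of the stack is a cycle.
      cycles.push(stack.slice(at));
      return;
    }
    if (done.has(node)) return;
    stack.push(node);
    for (const dep of graph.get(node) ?? []) visit(dep);
    stack.pop();
    done.add(node);
  };

  Array.from(graph.keys()).forEach((node) => visit(node));
  return cycles;
}

// Illustrative edges only; not the real import graph of the package.
const graph: Graph = new Map<string, string[]>([
  ["lang/field-space.ts", ["lang/space-field.ts"]],
  ["lang/space-field.ts", ["lang/field-space.ts", "malloy_query.ts"]],
  ["malloy_query.ts", []],
  ["db/bigquery.ts", ["malloy.ts"]],
  ["malloy.ts", ["db/bigquery.ts"]],
]);

const cycles = findCycles(graph);
```

A test that fails whenever `cycles` is non-empty would be one way to keep new cycles from creeping in.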

Typed SQL Functions

The output type of a SQL function is always taken to be the type of its first argument, so to use the result in anything you generally have to cast it.

Add support for correlated subqueries

Examples of why one might want this:

In where clauses

SELECT 100.0*SUM(num_tested_lines) / SUM(num_relevant_lines)
FROM code_coverage_reports
WHERE timestamp = (SELECT max(timestamp) from code_coverage_reports)

particularly with in:

SELECT column_name(s)
FROM table_name
WHERE column_name IN (SELECT STATEMENT);

Sequencing things (note: this isn't how we'd advise people sequence things in BQ so this is more for a future where there is other dialect support, such as MySQL)

  - dimension: user_order_sequence_number
    type: number
    sql: |
      (
        SELECT COUNT(*)
        FROM orders o
        WHERE o.id <= ${TABLE}.id
          AND o.user_id = ${TABLE}.user_id
      )

Specify in the documentation that you first need to fork the repository

Before running yarn install and yarn build you need to fork the repository - I think it's worth specifying this explicitly in the documentation, given the audience for Malloy is not necessarily used to this kind of process.

Pick allows multiple whens of same value

This is currently valid:

color is color_code:
  pick 'Blue' when = 7
  pick 'Red' when = 7

It might be nice to warn/fail here?
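One possible shape for such a warning, as a sketch only: the `PickClause` interface and `duplicateWhens` function below are hypothetical and do not reflect the translator's real AST. The sketch assumes pick behaves like a SQL CASE, where the first matching clause wins, so a repeated condition can never fire.

```typescript
// Hypothetical, simplified shape of one `pick ... when ...` clause.
interface PickClause {
  pick: string; // value produced, e.g. 'Blue'
  when: string; // condition text, e.g. '= 7'
}

// Returns each `when` condition that appears more than once,
// comparing conditions after collapsing whitespace.
function duplicateWhens(clauses: PickClause[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const clause of clauses) {
    const key = clause.when.replace(/\s+/g, " ").trim();
    if (seen.has(key)) dupes.add(key);
    seen.add(key);
  }
  return Array.from(dupes);
}

const dupes = duplicateWhens([
  { pick: "Blue", when: "= 7" },
  { pick: "Red", when: "= 7" },
]);
```

Comparing condition text is deliberately naive; overlapping ranges or alternations would need a comparison on parsed values.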

join: [...] syntax

Currently joins work when I write join: on each line.

Should join: [...] work?

Window Functions

Add support for window functions. A few we'd probably like to support:

LAG / LEAD : "how long did it take for User to get from step A to step B?" "How long are people typically idle after they do x action?" "What was the previous status of this item?"
FIRST_VALUE / LAST_VALUE / NTH_VALUE - important for event/session data. "what was the first action User took? What was the first/last action or landing/exit page in User's session?"; grabbing the "first" ever referrer for a user across many sessions, filling in each row of a historical state table from EAV data. etc
RANK / ROW_NUMBER - often for sequencing things like transactions "What percent of purchases come from first-time vs newish vs seasoned customers?" or just for generating a primary key where there isn't one, or even for funnel analysis / finding given value in a partition
Percentile functions (including MEDIAN), helpful for visualizing the data and seeing the distribution

Add support for UNION

Implement "malloy sort order"

Per @mtoy-googly-moogly :

Default sort order should be the default, QueryBase also now has by?: string so the user can change which is the magic field.

Project is not working correctly in pipelines...

This fails to compile. Changing project to group_by works.

      query: table('malloytest.aircraft')->{
        aggregate: f is count(*)
      }->{
        project: f2 is f+1
      }

This also fails. Might be the same thing. Changing 'project' to 'group_by' allows these to compile.

      query: table('malloytest.airports')->{
        aggregate: airport_count is count()
        nest: pipe_turtle is {
          group_by: [
            state
            county
          ]
          aggregate: a is count()
        } -> {
          project: [
            state is upper(state)
            a
          ]
        } -> {
          group_by: state
          aggregate: total_airports is a.sum()
        }
      }

ambiguous output field name error when using fields from different shapes with the same name

This throws the error Error: Ambiguous output field name 'full_name'. :

  farthest_flights is (reduce top 5 order by distance desc
    origin.full_name
    destination.full_name
    distance
  )

workaround is to alias both:

  farthest_flights is (reduce top 5 order by distance desc
    origin_name is origin.full_name
    destination_name is destination.full_name
    distance
  )

But it would be nicer not to need to.
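To make the collision concrete, here is a small sketch. It assumes (this is not confirmed by the issue) that an un-aliased reference is output under the last segment of its path; `ambiguousOutputNames` is a hypothetical helper, not part of Malloy.

```typescript
// Given the field references in a query, return the output names that
// more than one reference would produce.
function ambiguousOutputNames(fields: string[]): string[] {
  const counts = new Map<string, number>();
  for (const field of fields) {
    // "origin.full_name" -> "full_name"; "distance" stays "distance".
    const outputName = field.slice(field.lastIndexOf(".") + 1);
    counts.set(outputName, (counts.get(outputName) ?? 0) + 1);
  }
  return Array.from(counts)
    .filter(([, n]) => n > 1)
    .map(([outputName]) => outputName);
}

const clashes = ambiguousOutputNames([
  "origin.full_name",
  "destination.full_name",
  "distance",
]);
```

Under that assumption both joined fields map to `full_name`, which is why aliasing each one resolves the error.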

Allow management of max query billing settings in plugin

https://cloud.google.com/bigquery/docs/best-practices-costs#limit_query_costs_by_restricting_the_number_of_bytes_billed

Support for numeric datatype

Currently cannot query fields of type: numeric, nor Preview any table that contains one

differences in limiting behavior running turtle with and without `reduce` are confusing

It feels odd to me that these produce differently limited results.

by_state is defined:

by_state is (reduce
    state
    airport_count

Make wild card tables in BigQuery work

When we encounter a * in a table name while looking up the struct def, we need to use a different code path to get the table structure and add a special '_TABLE_SUFFIX' dimension.

not possible to reference/turtle a named query from a joined shape

trying to run

  origin_dashboard is (reduce top 5
    origin.full_name
    flight_count
    average_distance
    origin.airports_by_state
    )
  )

and it's saying "path not found state" (which is used in origin.airports_by_state). relevant parts of the model below:

define airports is (explore 'malloy-data.faa.airports'
  primary key code
  
  airport_count is count()

  airports_by_state is (reduce
    state
    airport_count 
  )
);


define flights is (explore 'malloy-data.faa.flights'
  origin_code renames origin
  destination_code renames destination

  origin is join airports on origin_code
  destination is join airports on destination_code

  average_distance is distance.avg()
  flight_count is count()

  origin_dashboard is (reduce top 5
    origin.full_name
    flight_count
    average_distance
    origin.airports_by_state
    )
  )
);

define jetblue_dashboard is (
  flights | carrier_dashboard : [carriers.nickname: 'Jetblue']
);

Allow `query: from(...)`

A pretty good use case for this is joining the results of one query with another without having to create an explore.

query: from(aircraft_models->{
  group_by: m is manufacturer
  aggregate: num_models is count(*)
  }){
  join: seats is from(
    aircraft_models->{
      group_by: m is manufacturer
      aggregate: total_seats is seats.sum()
    }
  ) on m
}
-> {
  project: [
    m
    num_models
    seats.total_seats
  ]
  order_by: 2 desc
  limit: 1
}

Feature Request : Provide Binary Distribution

I tried to build it on Windows 10 and got errors when running yarn build. Could we please have a binary distribution? We are analysts, not software engineers :)

Support colons in table names?

BigQuery public datasets by default use colons. Here is the hackernews dataset: bigquery-public-data:hacker_news and here is a table in it (taken from the BigQuery Console): bigquery-public-data:hacker_news.comments.

A new user would rightfully copy this table name into Malloy, but the colon causes Malloy to throw up. The workaround is to replace the colon with a dot, but this won't be obvious to new users. cc @lloydtabb @anikaks
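The manual workaround can be sketched as a small normalization step. `normalizeTableName` is a hypothetical helper, not an existing Malloy function; it only rewrites the first colon, on the assumption that the console form is `project:dataset.table`.

```typescript
// Turn a console-style id (`project:dataset.table`) into the dotted form.
function normalizeTableName(tableName: string): string {
  const colon = tableName.indexOf(":");
  if (colon === -1) return tableName; // already dotted
  return tableName.slice(0, colon) + "." + tableName.slice(colon + 1);
}

const normalized = normalizeTableName("bigquery-public-data:hacker_news.comments");
// normalized is "bigquery-public-data.hacker_news.comments"
```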

Wrong error message for reference before definition

Hey @christopherswenson ... I need help.

If I have this in the IDE:

explore: contributions is table('bigquery-public-data.fec.indiv20') {
  measure: total_amt is sum(transaction_amt)
  dimension: candidate_id is REGEXP_EXTRACT(memo_text, r'(C\d\d\d\d+)')
  join: candidate is candidates on candidate_id
}

explore: candidates is table('bigquery-public-data.fec.cn20') {
  primary_key: cand_pcc
}

query: contributions -> {
  top: 5
  where: candidate_id != NULL
  group_by: candidate_id
  aggregate: total_amt
  order_by: total_amt desc
}

Then I get a lineless internal error about dialects, because of the forward reference.

But if I write a test:

    const m = new BetaModel(`
      explore: newAB is a { join: newB is bb on astring }
      explore: newB is b
    `);
    expect(m).not.toCompile();
    console.log(m.prettyErrors());

It passes and produces this output:

 console.log
   FILE: internal://test/root
   line 2: Undefined data source 'bb'
     |     explore: newAB is a { join: newB is bb on astring }
     |                                         ^

     at Object.<anonymous> (src/lang/beta.spec.ts:579:13)

https://github.com/looker-open-source/malloy/blob/main/packages/malloy/src/lang/beta.spec.ts#L576-L585

I don't understand how my test is different than what the IDE does. It calls .translate until it is final and looks at the error log.

can't query dataset.INFORMATION_SCHEMA.COLUMNS

Today we allow querying a table via full path (project.dataset.table) or via "local to your primary project" path (dataset.table). The way we parse this in both cases makes accessing INFORMATION_SCHEMA.COLUMNS unworkable:

we expect 3-part paths to start with project, so explore 'examples.INFORMATION_SCHEMA.COLUMNS' doesn't work
we expect 4-part paths to not be real paths, so explore 'malloy-303216.examples.INFORMATION_SCHEMA.COLUMNS' doesn't work
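One way the two failing cases could be handled, as a sketch under stated assumptions: `TablePath` and `parseTablePath` are hypothetical names, and the function only covers the dataset-qualified INFORMATION_SCHEMA forms mentioned in this issue, treating `INFORMATION_SCHEMA.<view>` as a single table name.

```typescript
interface TablePath {
  project?: string; // omitted means "use the primary project"
  dataset: string;
  table: string; // may be "INFORMATION_SCHEMA.COLUMNS"
}

function parseTablePath(path: string): TablePath {
  const parts = path.split(".");
  const schemaAt = parts.indexOf("INFORMATION_SCHEMA");
  // Number of leading segments that name the project and/or dataset.
  const prefixLen = schemaAt === -1 ? parts.length - 1 : schemaAt;
  if (prefixLen < 1 || prefixLen > 2) {
    throw new Error(`Unsupported table path: ${path}`);
  }
  const prefix = parts.slice(0, prefixLen);
  const table = parts.slice(prefixLen).join(".");
  const dataset = prefix[prefix.length - 1];
  return prefixLen === 2
    ? { project: prefix[0], dataset, table }
    : { dataset, table };
}

const local = parseTablePath("examples.INFORMATION_SCHEMA.COLUMNS");
const full = parseTablePath("malloy-303216.examples.INFORMATION_SCHEMA.COLUMNS");
const plain = parseTablePath("malloy-data.faa.flights");
```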

make the error when attempting to join on a field that shares namespace with a table a bit more obvious

I'm not sure if this is easy or even possible, but I got hung up writing:

    origin is join airports on origin
    destination is join airports on destination

and getting Error: origin is not of type a scalar. This error didn't mean a lot to me.

I figured out that I needed to rename the origin and destination fields to de-crowd the namespace:

    origin_code renames origin
    destination_code renames destination

    origin is join airports on origin_code
    destination is join airports on destination_code

Bad parse causes syntax highlighting to disappear

@christopherswenson had written:
@mtoy-googly-moogly It seems like we're no longer highlighting when there's a bad parse again.

I can look into this, but wanted to give you a heads up.

@mtoy-googly-moogly had responded:
What I thought you asked was for me to do the "highlight", which only depends on the token stream, even if the parse failed. I believe this is what is implemented. The "symbols" pass is only run if the parse runs without errors.

Are you saying that the "highlight" recognizer is not running when there are syntax errors, but it once did?

Are you saying the "highlight" recognizer has always been skipped on syntax errors?

Are you saying that I misunderstood you and you expect the "symbols" recognizer to be run, even when the parse returns syntax errors?

Show Raw/JSON/SQL/Metadata in VSCode plugin

Quoting malloy keywords that SQL wants

Per @mtoy-googly-moogly:

This does not parse because "day" is a malloy keyword

DATE_DIFF(d1, d2, DAY)

This does not parse because now that day is not a keyword, malloy is pretty sure it is a field name, and there is no field called "DAY"

DATE_DIFF(d1, d2, `DAY`)

I think we can hack fix this one by not recognizing SQL keywords as field names, but I think that just kicks the can down the road to the next problem. It feels like a little big brain thinking about quoting is needed.

Icons on buttons in schema viewer have disappeared

They're completely invisible until you hover over them, then it goes a darker grey and can be pressed but still no icon

should have a play button:

should have a refresh button:

Pick with mixed return types should generate helpful error

Currently, the WHEN part of a pick only accepts a partial for numbers, but allows values for strings.

This does not work:

  cause is cause_code:
    pick 'Unknown' when 0
    pick 'Earthquake' when 1
    else 'Not Valid'

This does:

  cause is cause_code:
    pick 'Unknown' when = 0
    pick 'Earthquake' when = 1
    else 'Not Valid'

The value works when using a string, for example:

color_type is color:
  pick 'simple primary' when 'red' | 'green' | 'blue'
  else 'complex'

Plugin seems to interfere with cmd+enter to commit in git menu

Feedback from a tester:

I did have an issue with it when i switched back to normal coding. when i did cmd+enter, which is normally what I do to commit in the VS Code git panel, it would try and run a query or something weird and error. so I had to disable the extension

Provide way to see query history

A simple way to browse and re-open past queries--not necessarily saved/named ones, just past things I ran and whether they did so successfully--would be powerful!

syntax errors result in Run code lenses appearing everywhere

I noticed this while watching Anicia recently, but here's one example (missing a comma between filters)

nest: should allow arrays.

From the airports model. We should allow arrays in nest: blocks.

  query: airports_by_region is { 
    where: faa_region != null
    group_by: faa_region
    aggregate: airport_count
    nest: by_fac_type
    nest: by_state
    nest: major_airports
    -- nest: [
    --   by_fac_type
    --   by_state
    --   major_airports 
    -- ]
  }

URL encoding messing with data styles

I'm getting this error:

Error loading data style 'scratch.styles.json': Error: cannot open file:///Users/anikaks/Library/Application%2520Support/Malloy/thelook_anika/scratch.styles.json. Detail: Unable to read file '/Users/anikaks/Library/Application%20Support/Malloy/thelook_anika/scratch.styles.json' (Error: Unable to resolve non-existing file '/Users/anikaks/Library/Application%20Support/Malloy/thelook_anika/scratch.styles.json')

and my styles aren't coming through for my local malloy files. Everything in the _samples directory is fine. Chatted with @christopherswenson who thinks it's related to URL encoding.

Todd encountered the same issue trying to use sample models in a renamed folder.
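The `%2520` in the URL next to `%20` in the file path is the usual signature of a path being percent-encoded twice. The sketch below reproduces the pattern with the standard `encodeURI` and `decodeURI` functions; the path is a made-up example, and where the double encoding actually happens in the extension is not established here.

```typescript
// A path with a space, similar in shape to the one in the error message.
const filePath = "/Users/me/Library/Application Support/Malloy/scratch.styles.json";

const once = encodeURI(filePath); // space becomes %20
const twice = encodeURI(once); // the "%" of %20 becomes %25, giving %2520

// Decoding a doubly encoded path only once leaves a literal "%20" in it,
// which then does not match any file on disk.
const decodedOnce = decodeURI(twice);
const decodedTwice = decodeURI(decodedOnce);
```

Here `decodedOnce` still contains `Application%20Support`, which matches the "non-existing file" path in the error, while `decodedTwice` is the original path.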

Îñţérñåţîöñåļîžåţîöñ support

Several Îñţérñåţîöñåļîžåţîöñ (i18n) issues I can think of:

different collation order for different human languages in "order by"
  ref https://en.wikipedia.org/wiki/Alphabetical_order
  ref https://unicode.org/reports/tr10/
  ref https://icu4c-demos.unicode.org/icu-bin/collation.html
full Unicode support in regular expressions
  ref https://unicode.org/reports/tr18/
  ref https://icu4c-demos.unicode.org/icu-bin/redemo
formatting data types as text in human language by selecting a locale
  number format
  currency format
  date/time format
  interval format
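To illustrate the collation point only, here is a small sketch using the standard `Intl.Collator` API. It shows how a plain code-unit sort and a locale-aware sort can disagree; it says nothing about how Malloy or the underlying database would implement this.

```typescript
const words = ["z", "ä", "a"];

// Default sort compares UTF-16 code units, so "ä" (U+00E4) lands after "z".
const byCodeUnit = words.slice().sort();

// A locale-aware collator places "ä" next to "a" for German.
const german = words.slice().sort(new Intl.Collator("de").compare);
```

Other locales can order the same strings differently again, which is why the collation would need to be selectable in "order by".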

Results window looks wonky in dark themes

Lindsey encountered this using Spacegray theme in VSCode

Throw better error when an alias on a field is omitted

2022-01-28 NOTE: I don't think this specific error appears anymore but we still could use better errors for this problem

People frequently forget to alias fields that need them. PARSE ERROR: no viable alternative at input. is not particularly helpful in understanding where you've gone wrong.

... if one solution is to bring back commas in field lists, and stop requiring an alias for everything... I could get on board with that! commas really aren't so bad in the scheme of things (especially since they're needed all sorts of other places).

pick is too picky about types -- NULL should match all types

per @mtoy-googly-moogly : pick when true then 'a' else NULL will error because pick wants all values to be compatible and it isn't very smart

Timezone Conversion

Timezones are a major annoyance of writing SQL, so it would be useful to be able to have Malloy handle timezone conversion. To be determined if it would make sense for this to be at the model/explore/field level (perhaps all options would eventually be needed?).

Issue with re-aliasing on joins

Filing on behalf of Bryan W: "I added a join and I wanted to re-alias it, but when using it in the field list, it still requires the old alias."

Should be able to use reduce top 1 order_facts.* instead, but it doesn’t work.

define order_items_new is ( order_items
   first_order is (
       reduce top 1
           order_facts_pk.*
   )
   joins
       order_facts is order_facts_pk on order_id
);
explore order_items_new | reduce user_id first_order

Allow ordering by a field that is not in the SELECT clause for a query

This is legal in BQ:

SELECT
  origin
FROM `malloy-data.faa.flights`
GROUP BY 1
ORDER BY COUNT(*)
LIMIT 10

But this errors in Malloy:
explore flights | reduce carrier order by flight_count

"Error: Internal Error, field Not defined flight_count"

Send debug info to OUTPUT window

Per @lloydtabb :
It would be great if we could somehow log debug output to the OUTPUT window below. For example, I'm running a query and I'm getting only a few results. What did the query look like?

When we have exceptions, it would be nice not to have to chase down a logfile on a disk somewhere (I can't seem to find one).

allow turtles to be defined as refinements of existing turtles

not implemented in betalang, requested as a feature twice already, not doing it right now because bug fixes and tests are more important, but should get to it asap.

explore: betterThing is thing {
  query: betterTurtle is existingTurtle { ... with refinements ... }
}

project ** won't run a query with ambiguous field names even though it's legal in BQ

If you run project ** against a defined shape and the tables have the same field names (which happens a lot with names like id and created_at), you get an error and the query refuses to run.

This runs fine in the BQ console:

SELECT * FROM `lookerdata.liquor.order_items3` oi
JOIN `lookerdata.liquor.users` users ON users.id = oi.user_id
LIMIT 10

BigQuery just appends a _1 to the second id and created_at fields.
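A sketch of how the same renaming could be mimicked when building the output field list. `dedupeColumnNames` is a hypothetical helper; the issue only confirms the `_1` suffix for the second occurrence, so numbering later repeats `_2`, `_3`, and so on is an assumption.

```typescript
// Make column names unique by suffixing repeats with _1, _2, ...
function dedupeColumnNames(names: string[]): string[] {
  const used = new Set<string>();
  const counts = new Map<string, number>();
  return names.map((columnName) => {
    let candidate = columnName;
    let n = counts.get(columnName) ?? 0;
    // Keep bumping the suffix until the candidate is unused.
    while (used.has(candidate)) {
      n += 1;
      candidate = `${columnName}_${n}`;
    }
    counts.set(columnName, n);
    used.add(candidate);
    return candidate;
  });
}

const columns = dedupeColumnNames(["id", "created_at", "user_id", "id", "created_at"]);
// columns is ["id", "created_at", "user_id", "id_1", "created_at_1"]
```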

Allow reference before definition

per @mtoy-googly-moogly :

define reference is (explore definition)
define definition is (explore 'table')

This kind of thing is illegal currently but should be legal.

BIG piece of work, but want to get to this "someday."

My additional note:

I seem to run into this a lot. I think the reason I care about this the most is that currently I kind of have to put my favorite shape/explore that I join everything to at the bottom of my model file, where it's becoming increasingly hard to find as I add more and more joined tables / definitions.

allow boolean as dimensional value for bar (and other) charts

This does not render a bar chart and throws error invalid type for bar chart renderer

  has_symmetric_aggregates is query ~ '%md5%'  

  by_symmetric_aggregates is (reduce 
    has_symmetric_aggregates is has_symmetric_aggregates
    query_count
  )

This works:

  has_symmetric_aggregates is query ~ '%md5%'  

  by_symmetric_aggregates is (reduce 
    has_symmetric_aggregates is has_symmetric_aggregates::string 
    query_count
  )

Provide way to see tables available on connection

For the case where someone is getting to know a new dataset (or just can't remember what all their tables are called), being able to browse through schemas/projects/tables is extremely useful. Introducing an easy "show tables" and preview functionality without needing to know about the table and "export" it would be really powerful, even if it's a samples model of the info schema. If you're not already VERY comfortable with a database, it's generally necessary to be able to look at all the tables you have. It doesn't need to be fancy at all but for exploring a new database/dataset this feels important--otherwise folks will need to have a SQL runner open in tandem to look at.

Support for datetime datatype

Shows up in SQL text as BigQuery type "DATETIME" not supported by Malloy as timestamp

Prevents Preview or any sort of PROJECT *