werunom / tree-sitter

This project forked from tree-sitter/tree-sitter

An incremental parsing system for programming tools

License: MIT License

C 25.26% Python 1.39% Shell 1.63% Batchfile 0.14% C++ 71.57%

tree-sitter

Tree-sitter is a C library for incremental parsing, intended to be used via bindings to higher-level languages. It can be used to build a concrete syntax tree for a program and efficiently update the syntax tree as the program is edited. This makes it suitable for use in text-editing programs.

Tree-sitter uses an incremental LR parsing algorithm, as described in the paper Incremental Analysis of Real Programming Languages by Tim Wagner & Susan Graham. It handles ambiguity at compile-time via precedence annotations, and at run-time via the GLR algorithm. This allows it to generate a fast parser for any language that can be described with a context-free grammar.

Installation

script/configure # Generate a Makefile
make             # Build static libraries for the compiler and runtime

Overview

Tree-sitter consists of two libraries. The first library, libcompiler, can be used to generate a parser for a language by supplying a context-free grammar describing the language. Once the parser has been generated, libcompiler is no longer needed.

The second library, libruntime, is used in combination with the parsers generated by libcompiler, to generate syntax trees based on text documents, and keep the syntax trees up-to-date as changes are made to the documents.

Writing a grammar

Tree-sitter's grammars are specified as JSON strings. This format allows them to be easily created and manipulated in high-level languages like JavaScript. The structure of a grammar is formally specified by this JSON schema. You can generate a parser for a grammar using the ts_compile_grammar function provided by libcompiler.

Here's a simple example of using ts_compile_grammar to create a parser for basic arithmetic expressions. It uses a C++11 raw string literal so that the JSON grammar can span multiple lines without escaping.

// arithmetic_grammar.cc

#include <stdio.h>
#include "tree_sitter/compiler.h"

int main() {
  TSCompileResult result = ts_compile_grammar(R"JSON(
    {
      "name": "arithmetic",

      // Things that can appear anywhere in the language, like comments
      // and whitespace, are expressed as 'extras'.
      "extras": [
        {"type": "PATTERN", "value": "\\s"},
        {"type": "SYMBOL", "name": "comment"}
      ],

      "rules": {

        // The first rule listed in the grammar becomes the 'start rule'.
        "expression": {
          "type": "CHOICE",
          "members": [
            {"type": "SYMBOL", "name": "sum"},
            {"type": "SYMBOL", "name": "product"},
            {"type": "SYMBOL", "name": "number"},
            {"type": "SYMBOL", "name": "variable"},
            {
              "type": "SEQ",
              "members": [
                {"type": "STRING", "value": "("},
                {"type": "SYMBOL", "name": "expression"},
                {"type": "STRING", "value": ")"}
              ]
            }
          ]
        },

        // Tokens like '+' and '*' are described directly within the
        // grammar's rules, as opposed to in a separate lexer description.
        "sum": {
          "type": "PREC_LEFT",
          "value": 1,
          "content": {
            "type": "SEQ",
            "members": [
              {"type": "SYMBOL", "name": "expression"},
              {"type": "STRING", "value": "+"},
              {"type": "SYMBOL", "name": "expression"}
            ]
          }
        },

        // Ambiguities can be resolved at compile time by assigning precedence
        // values to rule subtrees.
        "product": {
          "type": "PREC_LEFT",
          "value": 2,
          "content": {
            "type": "SEQ",
            "members": [
              {"type": "SYMBOL", "name": "expression"},
              {"type": "STRING", "value": "*"},
              {"type": "SYMBOL", "name": "expression"}
            ]
          }
        },

        // Tokens can be specified using ECMAScript regexps.
        "number": {"type": "PATTERN", "value": "\\d+"},
        "comment": {"type": "PATTERN", "value": "#.*"},
        "variable": {"type": "PATTERN", "value": "[a-zA-Z]\\w*"}
      }
    }
  )JSON");

  if (result.error_type != TSCompileErrorTypeNone) {
    fprintf(stderr, "Compilation failed: %s\n", result.error_message);
    return 1;
  }

  puts(result.code);

  return 0;
}

To create the parser, compile this file like this:

clang++ -std=c++11 \
  -I tree-sitter/include \
  arithmetic_grammar.cc \
  "$(find tree-sitter/out/Release -name libcompiler.a)" \
  -o arithmetic_grammar

Then run the executable to print out the C code for the parser:

./arithmetic_grammar > arithmetic_parser.c

Using the parser

Providing the text to parse

Text input is provided to a tree-sitter parser via a TSInput struct, which contains function pointers for seeking to positions in the text and for reading chunks of text. The text can be encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse text that is stored in your own data structure.
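
The idea of handing the parser a pair of seek and read callbacks can be sketched without the library. The block below is only an illustration under stated assumptions: SketchInput, ChunkedText, and the function names are hypothetical stand-ins, not the declarations from tree_sitter/runtime.h, and the real TSInput fields and signatures may differ. It shows text stored as separate chunks being exposed through function pointers, so that a consumer can pull it piece by piece without the text ever being copied into one contiguous buffer first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the idea behind TSInput: a payload plus
   function pointers for seeking and for reading the next chunk.
   These are NOT the declarations found in tree_sitter/runtime.h. */
typedef struct {
  void *payload;
  /* Returns the next chunk and stores its length; length 0 means end. */
  const char *(*read)(void *payload, size_t *bytes_read);
  /* Moves the read position to a byte offset; returns 1 on success. */
  int (*seek)(void *payload, size_t byte_offset);
} SketchInput;

/* Example storage (an assumption): text kept as an array of chunks. */
typedef struct {
  const char **chunks;
  size_t chunk_count;
  size_t chunk_index;  /* chunk that the next read starts in */
  size_t chunk_offset; /* byte offset inside that chunk */
} ChunkedText;

const char *chunked_read(void *payload, size_t *bytes_read) {
  ChunkedText *text = (ChunkedText *)payload;
  /* Skip empty chunks so that a length of 0 only ever means "end". */
  while (text->chunk_index < text->chunk_count) {
    const char *chunk = text->chunks[text->chunk_index] + text->chunk_offset;
    text->chunk_index++;
    text->chunk_offset = 0;
    if (*chunk != '\0') {
      *bytes_read = strlen(chunk);
      return chunk;
    }
  }
  *bytes_read = 0;
  return "";
}

int chunked_seek(void *payload, size_t byte_offset) {
  ChunkedText *text = (ChunkedText *)payload;
  size_t i;
  for (i = 0; i < text->chunk_count; i++) {
    size_t length = strlen(text->chunks[i]);
    if (byte_offset < length) {
      text->chunk_index = i;
      text->chunk_offset = byte_offset;
      return 1;
    }
    byte_offset -= length;
  }
  /* Seeking exactly to the end succeeds; seeking past it fails. */
  text->chunk_index = text->chunk_count;
  text->chunk_offset = 0;
  return byte_offset == 0;
}

/* Copies the text from `offset` onward into `out` (NUL-terminated),
   pulling it chunk by chunk the way a consumer of the interface would.
   Returns the number of bytes written, excluding the terminator. */
size_t sketch_read_from(SketchInput input, size_t offset,
                        char *out, size_t capacity) {
  size_t written = 0, length = 0;
  assert(out != NULL && capacity > 0);
  out[0] = '\0';
  if (!input.seek(input.payload, offset)) return 0;
  for (;;) {
    const char *chunk = input.read(input.payload, &length);
    if (length == 0) break;
    if (written + length >= capacity) length = capacity - 1 - written;
    memcpy(out + written, chunk, length);
    written += length;
    if (written == capacity - 1) break;
  }
  out[written] = '\0';
  return written;
}
```

A caller would fill a ChunkedText with its chunks, wrap it in a SketchInput together with chunked_read and chunked_seek, and pass that to sketch_read_from. A rope or gap buffer could be exposed the same way by swapping in different callbacks.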

Querying the syntax tree

The libruntime API provides a DOM-style interface for inspecting syntax trees. Functions like ts_node_child(node, index) and ts_node_next_sibling(node) expose every node in the concrete syntax tree. This is useful for operations like syntax-highlighting, which operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like ts_node_named_child(node, index) and ts_node_next_named_sibling(node). These functions don't expose nodes that were specified in the grammar as anonymous tokens, like ( and +. This is useful when analyzing the meaning of a document.

// test_parser.c

#include <assert.h>
#include <string.h>
#include <stdio.h>
#include "tree_sitter/runtime.h"

// Declare the language function that was generated from your grammar.
TSLanguage *tree_sitter_arithmetic();

int main() {
  TSDocument *document = ts_document_new();
  ts_document_set_language(document, tree_sitter_arithmetic());
  ts_document_set_input_string(document, "a + b * 5");
  ts_document_parse(document);

  TSNode root_node = ts_document_root_node(document);
  assert(!strcmp(ts_node_type(root_node, document), "expression"));
  assert(ts_node_named_child_count(root_node) == 1);

  TSNode sum_node = ts_node_named_child(root_node, 0);
  assert(!strcmp(ts_node_type(sum_node, document), "sum"));
  assert(ts_node_named_child_count(sum_node) == 2);

  TSNode product_node = ts_node_child(ts_node_named_child(sum_node, 1), 0);
  assert(!strcmp(ts_node_type(product_node, document), "product"));
  assert(ts_node_named_child_count(product_node) == 2);

  printf("Syntax tree: %s\n", ts_node_string(root_node, document));
  ts_document_free(document);
  return 0;
}

To demo this parser's capabilities, compile this program like this:

clang \
  -I tree-sitter/include \
  test_parser.c arithmetic_parser.c \
  "$(find tree-sitter/out/Release -name libruntime.a)" \
  -o test_parser

./test_parser
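
The difference between the two traversal styles described in this section, visiting every child versus visiting only named children, can also be sketched on a toy tree that does not depend on the library. ToyNode and the toy_ functions below are hypothetical and exist only for illustration; they are not TSNode or the ts_node_ functions, and they say nothing about how libruntime stores nodes internally.

```c
#include <stddef.h>

/* Hypothetical toy node, not the library's TSNode. */
typedef struct ToyNode {
  const char *type;
  int is_named; /* 0 for anonymous tokens such as "+" or "(" */
  const struct ToyNode *children;
  size_t child_count;
} ToyNode;

/* Concrete view: every child, including anonymous tokens. */
size_t toy_child_count(const ToyNode *node) {
  return node->child_count;
}

const ToyNode *toy_child(const ToyNode *node, size_t index) {
  return index < node->child_count ? &node->children[index] : NULL;
}

/* Abstract view: anonymous tokens are skipped when counting. */
size_t toy_named_child_count(const ToyNode *node) {
  size_t i, count = 0;
  for (i = 0; i < node->child_count; i++) {
    if (node->children[i].is_named) count++;
  }
  return count;
}

/* Returns the index-th named child, or NULL if there is none. */
const ToyNode *toy_named_child(const ToyNode *node, size_t index) {
  size_t i;
  for (i = 0; i < node->child_count; i++) {
    if (!node->children[i].is_named) continue;
    if (index == 0) return &node->children[i];
    index--;
  }
  return NULL;
}
```

For a toy "sum" node with the children variable, +, variable, the concrete view reports three children while the named view reports two, which matches the named child count of 2 that test_parser.c asserts for sum_node.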
