tsuki-lang / tsuki

An elegant, robust, and efficient programming language, that just lets you get things done.

License: MIT License

Rust 100.00%

tsuki's Introduction

tsuki

A programming language that focuses on being fun to program in, and aiding developers in writing more robust software, all while maintaining high performance.

The compiler is still in its infancy, and it'll probably take me a while before it's actually usable. In the meantime, you can check out the spec, which lays out the general feature set and vision of the language.

Compiling

Right now compiling tsuki isn't exactly the most trivial of tasks, and Windows is not yet supported.

Step 0. Install a C (and C++) compiler.

tsuki depends on libc and uses whatever C compiler is available on the system as cc to link executables. This can be overridden using the $TSUKI_CC or $CC environment variables, in that order of priority. The C++ compiler is necessary to build LLVM.
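
The lookup order described above can be sketched in Rust roughly like this. This is not the compiler's actual code; the function names `resolve_cc_with` and `resolve_cc` are made up for illustration, and the environment lookup is passed in as a closure so the priority logic can be checked without touching the real environment.

```rust
use std::env;

/// Picks the C compiler used for linking: TSUKI_CC first, then CC, then plain `cc`.
/// `lookup` is a hypothetical hook standing in for the environment.
fn resolve_cc_with(lookup: impl Fn(&str) -> Option<String>) -> String {
    lookup("TSUKI_CC")
        .or_else(|| lookup("CC"))
        .unwrap_or_else(|| String::from("cc"))
}

/// The same lookup against the real process environment.
fn resolve_cc() -> String {
    resolve_cc_with(|name| env::var(name).ok())
}
```

With neither variable set this falls back to `cc`; setting only `CC=gcc` yields `gcc`, and setting both lets `TSUKI_CC` win.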

Step 1. Compile LLVM 12.

The best way to get LLVM for tsuki is to build it manually. I had pretty bad experiences with using repository LLVM, with problems ranging from missing static libraries on Ubuntu, no llvm-config on Windows, to random SIGILLs after a month of hiatus on Arch.

So here's uncle Liquid's method of obtaining LLVM:

# This is where we're going to install LLVM, so change this to some sensible path.
# bash - in this case you also need to add this to .bashrc
export LLVM_SYS_120_PREFIX=$HOME/llvm
# fish
set -Ux LLVM_SYS_120_PREFIX ~/llvm

# Now it's time to get LLVM. We'll use their GitHub releases for that.
mkdir -p ~/llvm
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-12.0.1/llvm-12.0.1.src.tar.xz
tar xJf llvm-12.0.1.src.tar.xz

# Now let's get the build going.
cd llvm-12.0.1.src
mkdir -p build
cd build
# If doing a release build, remove LLVM_ENABLE_ASSERTIONS, and set CMAKE_BUILD_TYPE to Release.
# Also, if compiling for other platforms such as aarch64, change the target in LLVM_TARGETS_TO_BUILD.
# You can find a list of all available targets, as well as some other build options, here:
# https://llvm.org/docs/GettingStarted.html#local-llvm-configuration
cmake .. \
   -D CMAKE_INSTALL_PREFIX=$LLVM_SYS_120_PREFIX \
   -D CMAKE_BUILD_TYPE=Debug \
   -D LLVM_ENABLE_ASSERTIONS=1 \
   -D LLVM_TARGETS_TO_BUILD=X86 \
   -G Ninja
# To reduce memory usage during the process of compiling LLVM, clang with the mold linker can be
# used. Grab mold here:
# https://github.com/rui314/mold
# And add the flags:
# -D CMAKE_C_COMPILER=clang
# -D CMAKE_CXX_COMPILER=clang++
# -D CMAKE_CXX_LINK_FLAGS=-fuse-ld=mold
# As far as I know it's not possible to use mold with gcc.

# IMPORTANT:
# When not using clang+mold, open a task manager or system monitor. You're going to want to look
# after your memory usage. If it starts growing rapidly, cancel the build and use --parallel 1.
# Linking with GNU ld uses up a lot of memory, so it's better to let it run a single linker at a
# time.
cmake --build . --target install --parallel 8

Maybe someday I'll make a dedicated script for this, but today is not that day.

Step 2. Compile and run.

With all that, running tsuki should be as simple as:

cargo run

Using the compiler

While still in its early stages, the compiler is able to compile arbitrary user code into a working executable. The most basic usage of the compiler would be:

$ tsuki --package-name main --package-root src --main-file src/main.tsu
# or, abbreviated:
$ tsuki -p main -r src -m src/main.tsu

--package-name specifies the name of the output file, and is also used for name mangling.

Refer to the code examples in code to see what's currently implemented or being worked on.

tsuki's Issues

Number literals allow for multiple consecutive underscores

This should not be allowed:

1____000

Additionally, this lack of restrictions around underscores creates an ambiguity when used with floating point exponents:

1_000_e+1

SemLiterals interprets this as if e+1 were a type suffix, which it obviously isn't.
Also, maybe there needs to be a better way of specifying the digit part and the suffix part, directly from the lexer?
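
One way the lexer could hand over the digit part and the suffix part separately, and reject bad underscores while at it, is sketched below in Rust. None of these names (`split_digits`, `check_digit_separators`, `UnderscoreError`) exist in the compiler; they are assumptions for the sake of the example, and only the leading integer digits are handled.

```rust
/// Hypothetical error kinds for misplaced digit separators.
#[derive(Debug, PartialEq)]
enum UnderscoreError {
    Leading,
    Trailing,
    Consecutive,
}

/// Splits a literal into its leading run of digits/underscores and whatever follows
/// (a fraction, an exponent, or a type suffix).
fn split_digits(literal: &str) -> (&str, &str) {
    let end = literal
        .find(|c: char| !(c.is_ascii_digit() || c == '_'))
        .unwrap_or(literal.len());
    literal.split_at(end)
}

/// Checks that every `_` in the digit part sits between two digits.
fn check_digit_separators(digits: &str) -> Result<(), UnderscoreError> {
    if digits.starts_with('_') {
        return Err(UnderscoreError::Leading);
    }
    if digits.ends_with('_') {
        return Err(UnderscoreError::Trailing);
    }
    if digits.contains("__") {
        return Err(UnderscoreError::Consecutive);
    }
    Ok(())
}
```

Under these rules 1____000 is rejected as Consecutive, and 1_000_e+1 splits into the digit part 1_000_ (rejected as Trailing) and the remainder e+1, so the exponent is never mistaken for a suffix.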

Enforce UTF-8

Split SemTypes into multiple files

If SemTypes keeps growing like this, it'll quickly get quite unwieldy and hard to maintain.

Closures

Closures are a pretty essential feature for things like iterator adapters. There are a few considerations in mind when it comes to closures:

- How do we make them not require the heap? I'd rather have closures be quite cheap in cost.
  - One possible approach is to take Rust's compiler-generated opaque type idea, where to use closures, you need to create a generic parameter that allows for any type to be passed, or optionally, use trait pointers.
- I think we should allow for explicitly preventing closures from storing pointers, as it often causes more trouble than it's worth (lifetime hell). Or maybe create two constraints, fun and fun rc, the first of which permits any variables but acts like a pointer (can only be used as an argument and not stored anywhere), the second only permits functions that do not capture external variables by reference (ie. the latter can be stored in an rc).
  - I'm not a fan of fun rc because, with its primary purpose being storage in rcs, rc fun rc (Float): Float does not look very pretty. Maybe rc fun (Float): Float should imply that the inner fun is rc?
- There needs to be a way of specifying that variables should be moved into the closure, explicitly.

Here's a preview of what such closure constraints would look like.

fun map[T, U, F](sequence: Seq[T], fn: F): Seq[U]
where
   F: fun (T): U

   var result = Seq[U].with_capacity(sequence.len)
   for element in sequence.iter_move
      val mapped = fn(element)
      result.push(mapped)
   result

The above defines a function map, which accepts the argument fn, whose type F must satisfy fun (T): U. Now, every function that takes a T and returns a U satisfies the fun (T): U constraint, but all values that are only known to have said constraint act like pointers in terms of lifetimes. We can work with them, pass them by argument, and return them, but not store them in external locations.
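
For comparison, this is roughly what the same function looks like in Rust, whose opaque closure types are the inspiration mentioned earlier. It is only an analogue: Vec stands in for Seq, and FnMut is my choice of bound, not something the proposal specifies.

```rust
/// Rust analogue of the proposed `map`. `F` is a generic parameter, so each closure
/// gets its own statically dispatched instantiation and is never boxed on the heap.
fn map<T, U, F>(sequence: Vec<T>, mut f: F) -> Vec<U>
where
    F: FnMut(T) -> U,
{
    let mut result = Vec::with_capacity(sequence.len());
    // Consuming the vector mirrors `sequence.iter_move` in the tsuki version.
    for element in sequence {
        result.push(f(element));
    }
    result
}
```

Calling it as map(vec![1, 2, 3], |x| x + 2) produces a vector holding 3, 4, 5, and the closure may borrow local variables as long as it does not outlive the call.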

Additionally, we need to define syntax for creating closures. I propose the following:

# No return type and no arguments:
var my_function = fun
  print("x")

# With an argument with an explicit type:
var my_function = fun (x: Float)
  print(x + 2)

# With an argument with an explicit parameter type _and_ return type:
var my_function = fun (x: Float): Float
  print(x + 2)
  x + 2

# With inferred parameter type, single-line version using `->`:
[1, 2, 3]
  .iter_move
  .for_each(fun (x) -> print(x))

# With inferred parameter and return type, again single-line version:
[1, 2, 3]
  .iter_move
  .map(fun (x) -> x + 2)

Moving values into the closure is done by adding the move keyword after fun, like fun move -> x (a closure without arguments, moving all captured variables, in this case x, into its body).
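
Rust has the same notion, except the keyword goes in front of the closure. A small sketch of why moving matters, assuming a hypothetical `make_counter` helper: the closure has to own `count`, because it outlives the function that created it.

```rust
/// Returns a closure that owns its captured state.
/// Without `move`, `count` would be borrowed and the closure could not be returned.
fn make_counter(start: u32) -> impl FnMut() -> u32 {
    let mut count = start;
    move || {
        count += 1;
        count
    }
}
```

Each call to the returned closure bumps its private copy of `count`, so a counter created with make_counter(5) yields 6 and then 7.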

Indentation-based syntax does not play well with this version of creating closures:

do_stuff(fun ()
  print("abc")
)

leaving an ugly trailing parenthesis at the end. Thus I also propose a syntactic sugar to implement later:

do_stuff() do ()
  print("abc")

Dubbed "infix do blocks", it moves the last closure argument out of the parentheses, into a separate block outside of the function, introduced through an infix operator. Alternatively, to not overload the do keyword with two meanings, fun could also be used.

do_stuff() fun ()
  print("abc")

Remove `SemLiterals`

The lowering done by SemLiterals is better done in SemTypes, which has all the context about integer types it needs. With the current implementation the following is invalid:

var x: Size = 1

because 1 is inferred to be of type Int32, which cannot be converted to an unsigned type like Size.
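
What typing the literal with context could look like is sketched below in Rust. The names `IntType` and `type_int_literal` are invented for this example and do not reflect the actual SemTypes code; Size is assumed to be 64 bits wide, and only two integer types are modeled.

```rust
/// A tiny stand-in for the compiler's integer types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IntType {
    Int32,
    Size,
}

impl IntType {
    /// Whether `value` is representable in this type (Size assumed to be 64-bit).
    fn fits(self, value: i128) -> bool {
        match self {
            IntType::Int32 => (i32::MIN as i128..=i32::MAX as i128).contains(&value),
            IntType::Size => (0..=u64::MAX as i128).contains(&value),
        }
    }
}

/// Types an integer literal. If the surrounding code expects a type, the literal takes
/// that type; otherwise it defaults to Int32. Returns None when the value does not fit.
fn type_int_literal(value: i128, expected: Option<IntType>) -> Option<IntType> {
    let ty = expected.unwrap_or(IntType::Int32);
    if ty.fits(value) {
        Some(ty)
    } else {
        None
    }
}
```

With this shape, the 1 in var x: Size = 1 is typed directly as Size instead of being lowered to Int32 first and then failing to convert.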

Roadmap for implementing language features

This issue outlines the tasks needed for a fully working tsuki compiler.

No new features are going to be added to the language until the first version of the compiler is finished. Once it is, existing features are going to be refined (along with the spec) or removed, but still no new features are going to be added.

Macros, which are currently TODO, are a 1.0 goal, but I need an overview of how the internal AST is going to look before I can implement them.

Do note that this roadmap does not include the standard library.

Pattern matching formalization

Right now pattern matching is not very clearly defined, so this issue attempts to resolve that.

Matching against existing variables, vs introducing new variables in patterns

One problem I have with Rust's pattern matching is that there's no pattern syntax for matching against an existing variable. Each identifier introduces a new variable, eg. in Some(x), x is a new variable, in Thing { x: y }, y is also a new variable. Thus in tsuki I want to have a proper syntax for introducing variables into scope, vs matching against existing values.

Since bringing values into scope is a more common use case than matching existing variables, a plain identifier introduces a new variable, and the syntax val x is used to match against an existing variable.

let outer = 1
match thing
  Some(val outer) -> print("it's the outer value!")
  Some(inner) -> print(inner)
  Nil -> print("nothin' to see here")
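
For reference, the closest thing in today's Rust is a match guard, which is what Some(val outer) would effectively abbreviate. The `describe` function below is a made-up example that returns the messages instead of printing them.

```rust
/// Rust version of the match above: comparing against an existing variable
/// takes a fresh binding plus an `if` guard.
fn describe(thing: Option<i32>, outer: i32) -> String {
    match thing {
        Some(inner) if inner == outer => String::from("it's the outer value!"),
        Some(inner) => inner.to_string(),
        None => String::from("nothin' to see here"),
    }
}
```

Guards work, but the comparison lives outside the pattern, which gets noisy once the value to compare is nested a few levels deep.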

The patterns

We should support a fairly limited, yet flexible set of patterns in the beginning.

- The wildcard pattern _
  - Matches anything and discards it.
- Literal patterns, such as 1, true, :my_atom, "Hello there."
  - Matches a value literally, using the equality operator ==.
- Range patterns, such as 1..5, 1..<5.
  - Matches a value between a specified range.
- Outer variable patterns, such as val abc
  - Matches a value and compares it to the one stored in the variable abc, using the equality operator ==.
- Variable binding patterns, such as abc, var abc
  - Matches a value and introduces it into scope under a user-specified name and optional mutability.
  - Moves the matched value into the new variable in the process.
- Tuple pattern, such as (x, var y).
  - Matches each tuple field.
- Union variant patterns, such as Some(x), MyUnion.MyVariant(var a, var b)
  - Matches each of the variant's fields.
- Object patterns, such as MyObject { some_field = x, another_field = var y }
  - Matches individual object fields.

Refutability

A pattern is irrefutable if it can be proved to always match, no matter the input. A pattern is refutable if it can be proved to not match sometimes, given a specific set of inputs. These are mutually exclusive, ie. a pattern that is not irrefutable is refutable, so only specifying one of these rules is enough.

A pattern is refutable if it contains any of the following patterns:

- Literal patterns,
- Outer variable patterns (val abc),
- Union variant patterns, but only if the matched value's type is not the union variant type itself.

Any pattern that does not contain any of the aforementioned patterns is irrefutable.
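
The rule is a simple recursive check, sketched here in Rust. The `Pattern` enum is a made-up model rather than the compiler's AST. Two assumptions are baked in: "variable patterns" in the list above is read as outer variable patterns (val abc), since plain bindings always match, and range patterns, which the list does not mention, are treated as refutable.

```rust
/// Hypothetical model of the pattern kinds listed in this issue.
#[derive(Debug)]
enum Pattern {
    Wildcard,
    Literal,
    Range,
    OuterVariable,
    Binding,
    Tuple(Vec<Pattern>),
    Object(Vec<Pattern>),
    Variant {
        /// True when the matched value's type is the variant type itself.
        scrutinee_is_variant_type: bool,
        fields: Vec<Pattern>,
    },
}

/// A pattern is refutable if it, or anything nested inside it, can fail to match.
fn is_refutable(pattern: &Pattern) -> bool {
    match pattern {
        Pattern::Wildcard | Pattern::Binding => false,
        Pattern::Literal | Pattern::OuterVariable => true,
        // Assumption: not in the original list, but a range cannot match every value.
        Pattern::Range => true,
        Pattern::Tuple(fields) | Pattern::Object(fields) => fields.iter().any(is_refutable),
        Pattern::Variant { scrutinee_is_variant_type, fields } => {
            !*scrutinee_is_variant_type || fields.iter().any(is_refutable)
        }
    }
}
```

So (x, _) is irrefutable, (x, 1) is refutable because of the nested literal, and Some(x) is refutable whenever the matched value could also be Nil.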

Unifying assignment

With patterns introduced, assignment should be unified such that it uses pattern matching instead of some arbitrary keywords. Having assignment return the old value is quite a nice feature to have, so the existing = operator is not going anywhere. A new let statement could be introduced for matching. It would obsolete the existing val and var statements in favor of taking a pattern to match against on its left hand side. This code:

val x = 1
var y = 2
val _ = x

would instead be written as:

let x = 1
let var y = 2
let _ = x

Although it looks a little verbose at first, it is a lot more flexible, as it allows for matching against patterns:

let this_is_surely_some = Some(1)
let Some(one) = this_is_surely_some
let (x, y) = (1, 2)

let also becomes part of the if statement and while loop.

if let Some(x) = maybe_nil
  print(x)
while let Some(x) = my_iterator.next
  print(x)

As before, it should be possible to specify multiple patterns, all of which must match.

if let Some(x) = maybe_nil, let Nil = surely_nil
  ()

`for` loops with patterns

for loops should also use this pattern syntax, always expecting an irrefutable pattern before the in keyword.

for x in [1, 2, 3]
  print(x)

As demonstrated before, more complex matches can be made.

for (x, y) in [(1, 2), (2, 3), (3, 4)]
  print((x, y))

When will 0.1.0 be released?

In my opinion, if you're able to build a JSON (de)serialization and pretty printing library in tsuki, without running into major bugs or missing features, that's the point at which version 0.1.0 should be released.

Such a library would test compiler support for the most critical components of the language:

- Basic control flow structures
- Functions
- Objects - you need to store the lexer state somehow, while the parser is chewing away at the tokens.
- Unions - speaking of tokens, you need to represent those too, somehow.
- Traits - the parser needs to notify the consumer of all the values it encounters along the way.
- Atoms - we need a way of discriminating error kinds inside the library.
- FFI - the easiest way of printing things to stdout (or a file) is by using libc, so the standard library could initially use that to accelerate development. For that we need FFI with C.
- Standard library
  - Support for primitive data types.
  - String manipulation.
  - Advanced functionality for primitive data types, such as parsing numbers.
  - Generic data structures such as Seq[T] and Table[K, V].

Module-level documentation

There needs to be a way of documenting modules. Right now documentation comments document whatever follows, but module-level documentation is a different case, as modules are defined by the file structure rather than syntax.

Therefore, I propose the following syntax for module-level docs:

#! This is a module-level documentation comment.
#! Hello, world!

Identifiers allow for multiple consecutive underscores

Related: #2

The lexer should prevent identifiers from having more than a single underscore consecutively, and also disallow trailing underscores. This would become illegal:

test__test
hello_world_
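
A check like this could run right after the lexer has scanned an identifier. The Rust sketch below uses a hypothetical `first_bad_underscore` helper that reports the byte offset of the offending underscore, which is handy for pointing an error message at the right column. It assumes the lone _ wildcard is lexed as its own token and never reaches this function.

```rust
/// Returns the byte offset of the first underscore that is either trailing or
/// followed by another underscore, or None if the identifier is fine.
fn first_bad_underscore(ident: &str) -> Option<usize> {
    // Scanning bytes is safe here: `_` is ASCII and never appears inside a
    // multi-byte UTF-8 sequence.
    let bytes = ident.as_bytes();
    for (i, &byte) in bytes.iter().enumerate() {
        if byte != b'_' {
            continue;
        }
        let is_last = i + 1 == bytes.len();
        let next_is_underscore = bytes.get(i + 1) == Some(&b'_');
        if is_last || next_is_underscore {
            return Some(i);
        }
    }
    None
}
```

For test__test this reports offset 4, for hello_world_ it reports offset 11, and hello_world passes.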

Implement `const T {V}`

This may be a non-trivial refactoring that influences the type system in a significant way, so it's better to do this sooner rather than later.

`type` should declare weakly distinct types instead of aliasing

In most programs it's more useful to declare a distinct type rather than aliasing an existing type. Say you're implementing an entity component system for a video game: you represent each entity with an ID:

type Entity = Size

Converting from a Size to an Entity should be explicit, as we're adding information by asserting that this Size is a valid entity. Converting from an Entity to a Size is less harmful, because although it loses information, any Entity is a valid Size, so no invariants are broken.

How the conversions should be performed - I'm still not sure. The simplest syntax for that would be Entity(1) - using a type like a function call. Converting from the weakly distinct type to the base type is implicit, so no syntax is needed, but just in case the compiler needs extra type information, we can use Size(an_entity) for consistency.

All implementations from the base type are available on the alias type, and since the existing operations on the base type know nothing of the alias, they still continue to operate and produce the base type. Referring back to the entity example, adding two entities using + is perfectly legal, but it produces a Size and not an Entity, so a conversion must be performed, like so: Entity(Entity(1) + Entity(1))

Converting from a base type to the weakly distinct type should only be possible within the package that declares the type, as we don't want external packages breaking invariants.
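
To get a feel for these rules, here is how they can be approximated with a Rust newtype. This is an emulation, not the proposed tsuki semantics: Rust has no implicit conversion, so the "free" direction is spelled with From, and the Add impl only stands in for "operators of the base type still produce the base type".

```rust
/// The field is private, so only the declaring module can build an Entity
/// from a raw number, similar to the package restriction described above.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Entity(usize);

/// Entity to usize is harmless: every Entity is a valid usize.
impl From<Entity> for usize {
    fn from(entity: Entity) -> usize {
        entity.0
    }
}

/// Adding two entities yields the base type, not an Entity.
impl std::ops::Add for Entity {
    type Output = usize;

    fn add(self, rhs: Entity) -> usize {
        self.0 + rhs.0
    }
}
```

Inside the declaring module, Entity(Entity(1) + Entity(1)) round-trips back into an Entity just like in the tsuki example, while code outside the module can only go from Entity to usize.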

More sensible numeric type naming

The names Uint and Float stem from Weird Things C Did(™) and so I think they should be renamed to more sensible names.

- Unsigned integers are better called naturals so Uint should be renamed to Nat.
  - The name is abbreviated to 3 letters because it's a common type, just like Int.
  - I've seen this name used in a few places in functional programming languages.
  - It looks much better with all types using PascalCase in the language.
  - Explicit sized versions are Nat8, Nat16, Nat32, Nat64.
- Floats are better called reals, so Float should be renamed to Real.
  - One opposing thought is that fixed-point numbers exist. Thus maybe only Float should be renamed to Real, as a "sensible default", while the sized types Float32 and Float64 keep their names.

The "unsigned integer" to "natural" change will also make unsigned integers have equal importance as signed integers. Signedness will no longer sound like an opt-out; after all, tsuki treats integers and naturals as completely different types (though implicit widening conversions from naturals to integers may be permitted in the future, as they're fully sound).
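
One detail worth keeping in mind: the conversion is only lossless when the integer type is strictly wider than the natural, so Nat32 to Int64 is fine but Nat32 to Int32 is not. Rust draws the same line, as the sketch below shows with u32 standing in for Nat32. The helper names `widen` and `narrow` are made up.

```rust
use std::convert::TryFrom;

/// u32 to i64 can never fail, so Rust offers it through the infallible `From`.
fn widen(natural: u32) -> i64 {
    i64::from(natural)
}

/// u32 to i32 can overflow, so only the fallible `TryFrom` exists for it.
fn narrow(natural: u32) -> Option<i32> {
    i32::try_from(natural).ok()
}
```

The largest u32 widens to 4294967295 without trouble, but has no i32 representation, so `narrow` returns None for it.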

Generate debug information

Named blocks

Because the language lacks goto, breaking out of a nested loop is quite unreadable:

var done = false
for y in 1..10
   for x in 1..10
      if my_array[x + y * 10] == 0
         done = true
         break
   if done
      break

Therefore, I think that named blocks should be introduced:

block loops
   for y in 1..10
      for x in 1..10
         if my_array[x + y * 10] == 0
            break @loops

Reasoning behind the syntax:

- Prefixed declarations are better than infixed, because they're easier to parse and don't require backtracking.
  - Thus, block <name> and break @<name> instead of <name>: and break <name>
- block should be an expression, thus one should be able to return a value out of a block.
  - I'm still not sure about the syntax here. break @loops -1 looks maybe just a little weird? Especially when combined with some more complex expression like break @loops x + y * 2
    - Maybe introduce a with keyword between the label and the result expression: break @loops with x + y * 2
    - On the other hand, the @label already separates the label name from the expression, especially if it's highlighted differently from normal expressions in editors.
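
As a point of comparison, Rust settled on the infix <name>: form with no separator before the value, and it reads fine in practice. The sketch below is a Rust rendition of the nested loop example that also returns a value out of the block. It needs Rust 1.65 or newer for labeled blocks; `find_first_zero` and its parameters are invented for this example.

```rust
/// Finds the (x, y) position of the first zero in a row-major grid.
fn find_first_zero(grid: &[i32], width: usize) -> Option<(usize, usize)> {
    if width == 0 {
        return None;
    }
    let height = grid.len() / width;
    // The labeled block is an expression: `break 'loops value` leaves both loops
    // at once and makes `value` the result of the whole block.
    'loops: {
        for y in 0..height {
            for x in 0..width {
                if grid[x + y * width] == 0 {
                    break 'loops Some((x, y));
                }
            }
        }
        None
    }
}
```

For a 3-wide grid holding 1, 2, 3, 0, 5, 6 this returns Some((0, 1)), with no done flag needed.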