p7g / c-bytecode-vm

A VM implementing a dynamically-typed imperative programming language from scratch.

Home Page: https://p7g.github.io/c-bytecode-vm

Makefile 0.47% C 99.42% Brainfuck 0.11%

interpreter programming-language

c-bytecode-vm's Issues

Struct and array destructuring in function parameters

Should be possible without too much trouble.

Rather than parsing all the parameters before appending the function prologue to the bytecode, the parameters can be parsed afterward (when the bytecode is in the body of the function). This means the code to destructure the arguments can be within the function at the top.

Failed assertion on compile error

Assertion failed: (id == agent.next_module_id - 1), function cb_agent_unreserve_modspec_id, file agent.c, line 182.

Tests for standard library

I've only written a few tests; there are still many more to go.

Exceptions + try/catch

Currently whenever something goes wrong, it's either unrecoverable, or uses something unergonomic like lib/result.

I think this could be implemented as follows:

When entering a try block, there is an ENTER_TRY opcode with the address of the catch block as argument. This opcode will store the address in the local state of cb_eval.
- to allow for nested try blocks, it should be stored in a stack.
When an exception is raised, it is placed on the stack.
If inside a try/catch, the instruction pointer is set to the address of the catch block. Otherwise, cb_eval returns 1.
The call stack is unwound until a try block is encountered, or the top-level is reached, in which case a stack trace is printed and the program exits.
At the end of the try block, an address is removed from the try stack and the instruction pointer jumps to after the catch block.
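
The bookkeeping described above could be sketched roughly as follows. This is a toy model rather than the VM's actual code: struct try_stack, enter_try, leave_try, raise_exception and NO_HANDLER are hypothetical names, and bytecode addresses are modelled as plain size_t values.

```c
#include <assert.h>
#include <stddef.h>

#define TRY_STACK_MAX 64
#define NO_HANDLER ((size_t) -1)

/* Hypothetical per-call state: a stack of catch-block addresses,
 * so that nested try blocks work. */
struct try_stack {
	size_t catch_addrs[TRY_STACK_MAX];
	size_t len;
};

/* ENTER_TRY <addr>: remember where the catch block starts. */
static void enter_try(struct try_stack *ts, size_t catch_addr)
{
	assert(ts->len < TRY_STACK_MAX);
	ts->catch_addrs[ts->len++] = catch_addr;
}

/* End of a try block that finished normally: discard its handler. */
static void leave_try(struct try_stack *ts)
{
	assert(ts->len > 0);
	ts->len--;
}

/* On raise: return the address to jump to, or NO_HANDLER if this call
 * has no enclosing try, in which case the caller would keep unwinding. */
static size_t raise_exception(struct try_stack *ts)
{
	if (ts->len == 0)
		return NO_HANDLER;
	return ts->catch_addrs[--ts->len];
}
```

In this model, a NO_HANDLER result corresponds to cb_eval returning 1 so that the calling frame can consult its own try stack.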

Extend modules from C

In a bunch of standard library modules there's stuff like this:

module Array;

export let new = array_new;

This is so that you can import the array lib and do Array.new(23). I think it would be better if the C functions could be added directly to the library.

Static global scope

All global variables are known at compile time, so there is no need to store them in a hashmap.

The reasoning for this decision initially was to allow for usage of a value before definition in the global scope. This, however, can be implemented by putting a placeholder in the bytecode, and updating it after the desired value is defined. Once compilation for a module is finished, we can just display an error if any positions have yet to be populated.

Remove non-module global scope

Thanks to #21, all code is run within a module. It should no longer be necessary to have a global scope for code running outside a module.

Move intrinsic functions into C modules

There's not really any reason to have a bunch of default global variables anymore. Perhaps the entire make_intrinsics function can be removed.

Only read files when importing if they have yet to be imported

It seems like a file is read every time it's imported, even if we've already imported it (and thus skip compiling it).

This seems inefficient. We could probably just keep a list or map of files that we've already seen and skip reading a file if it's in there.
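
A minimal sketch of such a "seen" list, assuming paths are already normalized before they are looked up (otherwise ./a and a would count as different files). The names struct seen_files, seen_contains, seen_add and seen_free are made up for illustration, and the linear scan is only a placeholder for whatever map the VM would really use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical set of already-imported paths, kept as a growable array. */
struct seen_files {
	char **paths;
	size_t len, cap;
};

static int seen_contains(const struct seen_files *s, const char *path)
{
	for (size_t i = 0; i < s->len; i++) {
		if (strcmp(s->paths[i], path) == 0)
			return 1;
	}
	return 0;
}

/* Returns 1 if the path was newly recorded (the caller should read and
 * compile the file), 0 if it was already seen, -1 on allocation failure. */
static int seen_add(struct seen_files *s, const char *path)
{
	if (seen_contains(s, path))
		return 0;
	if (s->len == s->cap) {
		size_t new_cap = s->cap ? s->cap * 2 : 8;
		char **p = realloc(s->paths, new_cap * sizeof(*p));
		if (!p)
			return -1;
		s->paths = p;
		s->cap = new_cap;
	}
	size_t n = strlen(path) + 1;
	char *copy = malloc(n);
	if (!copy)
		return -1;
	memcpy(copy, path, n);
	s->paths[s->len++] = copy;
	return 1;
}

static void seen_free(struct seen_files *s)
{
	for (size_t i = 0; i < s->len; i++)
		free(s->paths[i]);
	free(s->paths);
	s->paths = NULL;
	s->len = s->cap = 0;
}
```

The importer would call seen_add before opening the file and only read it when the result is 1.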

Nested modules

Like Python, it would be nice to support nested modules.

Some open questions:

Does there need to be a marker file in every directory that is also a module (like __init__.py)?
What does the syntax for a relative import look like?

Customizable object marking

Currently, when allocating an object that should be managed by the garbage collector, you can specify a function that will run to clean up before the object is freed. It is not, however, possible to customize how the object is marked.

To do this, you have to add a new branch to cb_value_mark, which won't work when extensions are available.

A new field could be added to cb_gc_header which is a function pointer to a function that will mark the object. The default (if the field is NULL) would be a function that does nothing. The downside is that every object will be one pointer larger (which adds up). Unfortunately, it's not really feasible to put this function anywhere else. In Python, it could live in the type of the object, but since we don't have extensible types here, that wouldn't help much.

One alternative would be to define a struct like this:

struct cb_gc_hooks {
	void (*deinit)(void *obj);
	void (*mark)(void *obj);
};

And then every object would still have just one pointer, to an instance of this struct. In most cases I think this should be statically allocated. If any new fields are needed in the future they could be added to the struct type.

The downside here would be an extra level of indirection when collecting and marking the object, but that is probably more acceptable than every object being 4 or 8 bytes larger.
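
To make the hooks idea concrete, here is a small sketch of how marking could dispatch through a statically allocated hooks struct. Only struct cb_gc_hooks comes from the proposal above; struct gc_header, struct pair, gc_mark, pair_mark and pair_hooks are hypothetical stand-ins, and the real cb_gc_header has a different layout.

```c
#include <assert.h>
#include <stddef.h>

struct cb_gc_hooks {
	void (*deinit)(void *obj);
	void (*mark)(void *obj);
};

/* Hypothetical, simplified header: one mark bit and one hooks pointer. */
struct gc_header {
	int marked;
	const struct cb_gc_hooks *hooks; /* NULL means no custom behaviour */
};

/* Example extension object; the header is the first member so a pointer
 * to the object is also a pointer to its header. */
struct pair {
	struct gc_header header;
	struct gc_header *first, *second;
};

/* Marks obj, then lets the object mark whatever it references. */
static void gc_mark(struct gc_header *obj)
{
	if (!obj || obj->marked)
		return;
	obj->marked = 1;
	if (obj->hooks && obj->hooks->mark)
		obj->hooks->mark(obj);
}

static void pair_mark(void *obj)
{
	struct pair *p = obj;
	gc_mark(p->first);
	gc_mark(p->second);
}

/* One shared, statically allocated instance for every pair object. */
static const struct cb_gc_hooks pair_hooks = { .deinit = NULL, .mark = pair_mark };
```

An extension would only need to point its objects at its own hooks instance; no new branch in cb_value_mark would be required.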

Documentation

Some documentation on the language and standard library would be nice. Maybe a library to parse source and generate documentation from it would be worth it 🤔

Add iter functions to modules exporting iterable things

array
arraylist
assoclist
hashmap
list
string

Filesystem write operations

Functions for:

managing filesystem permissions
creating files and directories
deleting files and directories
writing to files

Out of bounds access into agent.import_paths with empty CBCVM_PATH

When trying to import a module.

Remove recursion from garbage collector

Since the garbage collector traverses nested objects using recursion, there are cases where marking a deeply-nested object will result in a stack overflow.

Instead, it should be possible to mark all objects iteratively, using a heap-allocated queue of objects to be marked.

This issue only affects the mark phase of the GC.
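
A minimal sketch of non-recursive marking, using a heap-allocated worklist that is popped from the end (a stack rather than a strict queue; the visiting order does not matter for marking). struct obj, struct worklist, worklist_push and mark_from are hypothetical names, and a real object would find its children through its type rather than through an explicit array.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object: a mark bit plus outgoing references. */
struct obj {
	int marked;
	struct obj **children;
	size_t nchildren;
};

struct worklist {
	struct obj **items;
	size_t len, cap;
};

static int worklist_push(struct worklist *w, struct obj *o)
{
	if (w->len == w->cap) {
		size_t new_cap = w->cap ? w->cap * 2 : 16;
		struct obj **p = realloc(w->items, new_cap * sizeof(*p));
		if (!p)
			return -1;
		w->items = p;
		w->cap = new_cap;
	}
	w->items[w->len++] = o;
	return 0;
}

/* Marks everything reachable from root without recursion.
 * Returns 0 on success, -1 if the worklist could not grow. */
static int mark_from(struct obj *root)
{
	struct worklist w = { NULL, 0, 0 };
	int status = 0;

	if (root && !root->marked) {
		root->marked = 1;
		status = worklist_push(&w, root);
	}
	while (status == 0 && w.len > 0) {
		struct obj *o = w.items[--w.len];
		for (size_t i = 0; i < o->nchildren && status == 0; i++) {
			struct obj *c = o->children[i];
			if (c && !c->marked) {
				c->marked = 1; /* mark on push so each object is queued once */
				status = worklist_push(&w, c);
			}
		}
	}
	free(w.items);
	return status;
}
```

The C stack depth stays constant however deeply the objects are nested; the cost moves to the worklist, which can fail to grow under memory pressure and would need a fallback in a real collector.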

STORE ops should not leave the value on the stack

Rather than having the majority of store operations require a following OP_POP to remove the value from the stack, the store instruction should not leave a value on the stack. If we need the value to be left there (i.e. assignment expression), it should be implemented as a duplication of the top of the stack and then storing one of those values.

This should have a slight performance improvement, and will reduce the amount of bytecode needed in most cases.
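
The difference can be seen in a toy stack machine. This sketch is not the VM's instruction set: OP_CONST, OP_DUP, OP_STORE_LOCAL and OP_HALT, along with struct vm and run, are hypothetical and only model integer values.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_CONST, OP_DUP, OP_STORE_LOCAL, OP_HALT };

struct vm {
	int stack[16];
	size_t sp;
	int locals[4];
};

static void run(struct vm *vm, const int *code)
{
	for (size_t pc = 0;;) {
		switch (code[pc++]) {
		case OP_CONST: /* push the immediate operand */
			vm->stack[vm->sp++] = code[pc++];
			break;
		case OP_DUP: /* duplicate the top of the stack */
			vm->stack[vm->sp] = vm->stack[vm->sp - 1];
			vm->sp++;
			break;
		case OP_STORE_LOCAL: /* pop the value into a local; nothing is left behind */
			vm->locals[code[pc++]] = vm->stack[--vm->sp];
			break;
		case OP_HALT:
		default:
			return;
		}
	}
}

/* Statement `x = 5;` needs no trailing pop. */
static const int store_statement[] = { OP_CONST, 5, OP_STORE_LOCAL, 0, OP_HALT };

/* Expression `(y = 7)` keeps the value by duplicating it first. */
static const int store_expression[] = { OP_CONST, 7, OP_DUP, OP_STORE_LOCAL, 1, OP_HALT };
```

The common statement form is one instruction shorter than with a store that leaves its value behind, while the rarer expression form costs the same (a duplication instead of nothing, but no pop).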

Allow circular importing

It would be problematic if two modules depend on each other at import time, but it could be useful to allow for cases where the imports are only used within functions.

Destructuring assignment expressions

Now you can write let [a, b] = arr, { c, d } = struct_;. It would be nice if the same thing could be done in an expression.

Thankfully, there is no ambiguity between a left brace at the start of an expression and the start of a new block (like in JS), so there should be no need to write ({ a, b } = some_struct).

Parsing this for arrays will be a little trickier, since it's not possible to tell whether the current expression is an array literal or destructuring assignment until either:

we encounter an element expression that's not an identifier (we don't support nested destructuring at this time)
there is or isn't an equal after the array literal

It should be possible to make it work though.

Structs

This language sorely needs some efficient way to store structured data. Currently, there are arrays, which can be used as a sort of struct efficiently, but the ergonomics are terrible.

Here's an excerpt of what I've been doing (from lib/hashmap.rbcvm):

let CAP = 0;
let LOAD = 1;
let HASH_FN = 2;
let BUCKETS = 3;

export function with_capacity(capacity) {
  return [
    capacity,
    0,
    hash_string,
    Array.new(capacity),
  ];
}

export function set(self, key, value) {
  # ...
  let bucket = self[BUCKETS][hashed];
  if bucket == null {
    # ...
    self[LOAD] = self[LOAD] + 1;
    self[BUCKETS][hashed] = bucket;
  }
  # ...
  if self[LOAD] / self[CAP] < LOAD_FACTOR {
    grow(self);
  }
}

A language that faced a similar kind of problem is Erlang, which uses tuples extensively. The solution there was introducing records, which look like this:

-record(hashmap, { capacity, load = 0, hash_fn = hash_string, buckets }).

test() ->
    NewMap = #hashmap{ capacity = 0, buckets = {} },
    Buckets = NewMap#hashmap.buckets.

Basically, records allow naming fields of tuples at compile-time. To my understanding, they compile down to tuples.

A similar thing could be used here, though # can't be used, since that's for comments (unless we change that lol). Here's a possibility, reimplementing the above code:

struct HashMap {
    capacity,
    load = 0,
    hash_fn = hash_string,
    buckets,
}

export function with_capacity(capacity) {
  return :HashMap{
    capacity=capacity,
    Array.new(capacity),
  };
}

export function set(self, key, value) {
  # ...
  let bucket = self:HashMap.buckets[hashed];
  if bucket == null {
    # ...
    self:HashMap.load = self:HashMap.load + 1;
    self:HashMap.buckets[hashed] = bucket;
  }
  # ...
  if self:HashMap.load / self:HashMap.capacity < LOAD_FACTOR {
    grow(self);
  }
}

println(:HashMap{capacity=0, []});  # ["HashMap", 0, 0, function hash_string, []]

Some questions:

How should this work when a struct comes from another module?
Should you even be able to use structs from another module?
How should validation work when you try to use a value as a struct instance?

Function call revamp

Currently we determine the number of arguments to a function statically. This prevents cool things like variable numbers of arguments and, indirectly, multiple return values. Here's how we can improve it:

When we encounter an infix left paren (i.e. the start of a call expression), we'll add an opcode called OP_PREP_FOR_CALL, which will store the current stack position in the function state. This stack position will be the position after the function we'll be calling.

After that opcode will be the evaluation of all the arguments, and then the OP_CALL. This means we can just compare the stack position at call time with the stored position from prep for call to see how many arguments we actually have. This means that the number of arguments can be dynamic, without incurring much overhead. In fact, while we add an extra operation, we can remove the argument to OP_CALL so the size of a function call is smaller by 32 bits.

This means that splatting an array into the arguments of a function is just a matter of pushing all the elements of the array to the stack, and if calling a function leaves behind multiple values (multiple returns), that's fine too.
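
The argument counting can be sketched as follows. This models only stack heights, not values. One assumption goes beyond the text above: the saved positions are kept in a small stack (saved) rather than a single slot, so that nested calls such as f(g(x), y) work. struct call_state, prep_for_call, push_value and call_argc are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 16

struct call_state {
	size_t sp;               /* current height of the value stack */
	size_t saved[MAX_DEPTH]; /* heights recorded by OP_PREP_FOR_CALL */
	size_t nsaved;
};

/* Pushing any value: the callee, an argument, or each element of a
 * splatted array. */
static void push_value(struct call_state *s)
{
	s->sp++;
}

/* OP_PREP_FOR_CALL: the callee was just pushed; remember the height. */
static void prep_for_call(struct call_state *s)
{
	assert(s->nsaved < MAX_DEPTH);
	assert(s->sp > 0);
	s->saved[s->nsaved++] = s->sp;
}

/* OP_CALL: everything above the saved height is an argument.
 * Pops the arguments and the callee, and returns the argument count. */
static size_t call_argc(struct call_state *s)
{
	assert(s->nsaved > 0);
	size_t base = s->saved[--s->nsaved];
	assert(s->sp >= base);
	size_t argc = s->sp - base;
	s->sp = base - 1;
	return argc;
}
```

Splatting a three-element array after one ordinary argument simply results in four pushes, so call_argc reports 4 without the compiler having known that number.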

Garbage-collected struct specs

It makes sense to garbage-collect struct specs. In most cases this will make no difference (usually they are gonna be at the global scope), but this would allow dynamic creation of struct specs which would be cool.

In addition, right now this would not free any memory, even though the test struct spec becomes unreachable:

struct test {}
test = null;

Handle recursion in cb_value_to_string

Compound assignment

Having to do i = i + 1 makes me sad 😞

Generators

I recently refactored the interpreter such that calling cb_eval evaluates a function. This means that when a function returns, cb_eval also returns.

This design makes it easy (ish) to implement generators:

If a function contains a yield expression, calling it returns a generator object.
When this generator object is called, it is evaluated until it reaches a yield.
When a yield expression is encountered, cb_eval stores whatever state is needed to resume evaluation in the generator object, and then returns.
Any value provided to yield is returned from calling the generator object.
If an argument is passed when calling the generator object, it is put on the stack when the generator resumes.

Here is an example:

generator range(n) {
    for let i = 0; i < n; i = i + 1 {
        yield i
    }
}

The use of a generator keyword could help to implement this with a single-pass compiler, but might not be necessary. JavaScript does not have a keyword, but similarly requires a function to be declared as a generator with function*. Python, however, does not require anything: a generator is simply any function that contains yield.

The bytecode for a yield expression could be as follows:

; evaluate the expression to be yielded, if there is no expression, the value is null
CONST_NULL
; at this instruction, cb_eval would return
YIELD
; if the result of yield is not used:
POP

In terms of grammar, yield could be easily parsed as a prefix unary operator.

Some challenges might be dealing with arguments to the generator. These will need to be stored in the generator object, in all likelihood. We don't want to need separate opcodes to handle retrieving arguments within a generator, so they will need to be pushed onto the stack every time the generator is resumed.

An approach could be to add a CB_VALUE_USERDATA type, which simply holds a void *, and then generators could be implemented in terms of builtin functions like this:

let gen = range(10);
for let i = next(gen); !done(gen); i = next(gen) {
    println(i);
}

Since this requires a yield keyword, however, it might be worth adding it as a first-class feature of the language (i.e. CB_VALUE_GENERATOR type).

Separate "repr" and "display" string representations

Right now, printing an array is a "repr" type of display (not sure if there is any other alternative), but printing a string or char does not escape any characters, nor does it wrap the value in quotes. This can be annoying when printing an array of strings or characters.

Inspired by Python, it could be something like this:

If a value supports a "to string" operation, use that (this would be the "display" representation)
Otherwise, get the "repr" string for the value. This would look similar to a literal for the value in most cases.
If a value contains other values, the repr for the outermost value should use the repr for inner values.
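
As an illustration of the "repr" side for strings, here is a sketch that quotes and escapes a C string into a caller-provided buffer. The function name string_repr and the set of escapes handled (double quote, backslash, newline, tab) are assumptions; the language's own literal syntax would decide the real set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes a quoted, escaped form of src into out, which holds outsz bytes
 * including the terminating NUL. Returns 0 on success, -1 if out is too
 * small (its contents are then unspecified). */
static int string_repr(const char *src, char *out, size_t outsz)
{
	size_t n = 0;
/* Append one byte, always leaving room for the final NUL. */
#define PUT(ch) do { if (n + 1 >= outsz) return -1; out[n++] = (ch); } while (0)
	PUT('"');
	for (; *src; src++) {
		switch (*src) {
		case '"':  PUT('\\'); PUT('"');  break;
		case '\\': PUT('\\'); PUT('\\'); break;
		case '\n': PUT('\\'); PUT('n');  break;
		case '\t': PUT('\\'); PUT('t');  break;
		default:   PUT(*src);            break;
		}
	}
	PUT('"');
#undef PUT
	out[n] = '\0';
	return 0;
}
```

Printing a string directly would keep using the raw bytes (the "display" form), while printing an array would call something like string_repr on each string element.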

Avoid double pointer indirection for strings

Right now a string consists of 2 structs: One with a GC header, and one which holds a pointer to the actual string. Both are heap allocated, and the former holds a pointer to the latter. It seems wasteful to have both since an object can have arbitrary size anyway.
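
One way to collapse the two allocations is a C99 flexible array member, sketched below. struct gc_header here is a one-field stand-in for the real header, and struct inline_string and inline_string_new are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real GC header. */
struct gc_header {
	int marked;
};

/* Header, length and bytes all live in a single allocation. */
struct inline_string {
	struct gc_header header;
	size_t len;
	char bytes[]; /* C99 flexible array member */
};

/* Copies len bytes from src and NUL-terminates them.
 * Returns NULL on allocation failure; the caller frees with free(). */
static struct inline_string *inline_string_new(const char *src, size_t len)
{
	struct inline_string *s = malloc(sizeof(*s) + len + 1);
	if (!s)
		return NULL;
	s->header.marked = 0;
	s->len = len;
	memcpy(s->bytes, src, len);
	s->bytes[len] = '\0';
	return s;
}
```

The trade-off is that the bytes can no longer be grown in place or shared with another buffer, which may or may not matter depending on how strings are used in the VM.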

Function string representation

Currently it's something like function whatever. I think it should be something like <function whatever>.

REPL

Standard library documentation

I think a nice way to do this could be as follows:

Add a docs module to the standard library that looks like this:

export function document_function(modname, func, description) {}
export function document_var(modname, name, type, description) {}

export function module(modname) {
	return struct {
		document_function = function (func, description) {
			document_function(modname, func, description);
		},
		document_var = function (name, type, description) {
			document_var(modname, name, type, description);
		},
	};
}

export function html_generator() {}
export function markdown_generator() {}
export function generate(generator=html_generator()) {}

The documentation would be generated by running something like:

import array;
import arraylist;
# and so on, for all stdlib modules...

fs.write_file(argv()[0], docs.generate());

The benefits of this are:

No need to write a second parser for the language (in C we compile in one pass; there's no AST)
No need to add language features to make it happen (might need some new reflection though)

Some drawbacks:

Minor run-time performance effect when modules are imported
Documentation has to come after the thing it's documenting

An MVP might not use any reflection; just a name + description. The name could look like func(a, b, c) to show the parameters.

Import aliases

import hashmap as M

JIT compiler

One of my goals with this language is to keep as much functionality out of the host language as possible. Currently, the standard library is almost entirely written in whatever this language is called, even including things like hashing functions.

This makes it more difficult to make it fast, but it should also mean it's a prime candidate for a JIT compiler.

Default arguments

Special-case stdlib

Right now the standard library needs to be imported by path, which is not ideal. I think it should work something like import "array" and this would be understood to mean $CBCVM_ROOT/lib/array.rbcvm.

Maybe these should be registered in a hashmap somewhere, and when importing, if the name is in that hashmap, we use the special-case file instead of trying to resolve it relative to the working directory.

Alternatively, there could be an environment variable or option like CBCVM_PATH, which is structured like the normal Linux PATH variable. Each directory would be checked for a file matching the imported name, and the standard library would be in there. This is probably the better approach.
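
A sketch of the PATH-style lookup, assuming a colon-separated list and the .rbcvm extension mentioned above. The names resolve_import and fake_exists are hypothetical, and the existence check is passed in as a function pointer so the example does not touch the filesystem; a real implementation would probe the disk instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in existence check for the example: pretends exactly one file exists. */
static int fake_exists(const char *path)
{
	return strcmp(path, "/opt/cbcvm/lib/array.rbcvm") == 0;
}

/* Tries "<dir>/<name>.rbcvm" for each directory in search_path, in order.
 * On success writes the full path to out and returns 0; returns -1 if no
 * directory contains the module. Empty entries are skipped, so an empty
 * search_path simply yields -1. */
static int resolve_import(const char *search_path, const char *name,
		int (*exists)(const char *), char *out, size_t outsz)
{
	const char *p = search_path;

	while (p && *p) {
		const char *end = strchr(p, ':');
		size_t dirlen = end ? (size_t) (end - p) : strlen(p);
		if (dirlen > 0) {
			int n = snprintf(out, outsz, "%.*s/%s.rbcvm",
					(int) dirlen, p, name);
			/* Skip candidates that were truncated by snprintf. */
			if (n >= 0 && (size_t) n < outsz && exists(out))
				return 0;
		}
		p = end ? end + 1 : NULL;
	}
	return -1;
}
```

Because the search path is walked as a string instead of being indexed as an array, an empty CBCVM_PATH falls through to the "not found" result.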
