p7g / c-bytecode-vm

A VM implementing a dynamically-typed imperative programming language from scratch.

Home Page: https://p7g.github.io/c-bytecode-vm
Should be possible without too much trouble.
Rather than parsing all the parameters before appending the function prologue to the bytecode, the parameters can be parsed afterward (when the bytecode being emitted is in the body of the function). This way, the code that destructures the arguments can live at the top of the function body.
Assertion failed: (id == agent.next_module_id - 1), function cb_agent_unreserve_modspec_id, file agent.c, line 182.
I've only written a few tests, there are still many more to go.
Currently, whenever something goes wrong, it's either unrecoverable or requires something unergonomic like `lib/result`.
I think this could be implemented as follows:
When entering a `try` block, there is an `ENTER_TRY` opcode with the address of the catch block as its argument. This opcode will store the address in the local state of `cb_eval`.
On an error with no handler registered, `cb_eval` returns 1.

In a bunch of standard library modules there's stuff like this:
```
module Array;

export let new = array_new;
```
This is so that you can import the array lib and do `Array.new(23)`. I think it would be better if the C functions could be added directly to the library.
All global variables are known at compile time, so there is no need to store them in a hashmap.
The reasoning for this decision initially was to allow for usage of a value before definition in the global scope. This, however, can be implemented by putting a placeholder in the bytecode, and updating it after the desired value is defined. Once compilation for a module is finished, we can just display an error if any positions have yet to be populated.
Thanks to #21, all code is run within a module. It should no longer be necessary to have a global scope for code running outside a module.
There's not really any reason to have a bunch of default global variables anymore. Perhaps the entire `make_intrinsics` function can be removed.
It seems like a file is read every time it's imported, even if we've already imported it (and thus skip compiling it).
This seems inefficient; we could probably just keep a list or map of files we've already seen, and skip a file if it's in there.
Like Python, it would be nice to support nested modules.
Some open questions remain, for instance whether there should be an equivalent of Python's `__init__.py`.

Currently, when allocating an object that should be managed by the garbage collector, you can specify a function that will run to clean up before the object is freed. It is not, however, possible to customize how the object is marked.
To do this, you have to add a new branch to `cb_value_mark`, which won't work when extensions are available.

A new field could be added to `cb_gc_header` which is a function pointer to a function that will mark the object. The default (if the field is `NULL`) would be a function that does nothing. The downside is that every object will be one pointer larger (which adds up). Unfortunately, it's not really feasible to put this function anywhere else. In Python, it could live in the type of the object, but since we don't have extensible types here, that wouldn't help much.
One alternative would be to define a struct like this:
```c
struct cb_gc_hooks {
	void (*deinit)(void *obj);
	void (*mark)(void *obj);
};
```
And then every object would still have just one pointer, to an instance of this struct. In most cases I think this should be statically allocated. If any new fields are needed in the future, they could be added to the struct type.

The downside here would be an extra level of indirection when collecting and marking the object, but that is probably more acceptable than every object being 4 or 8 bytes larger.
Some documentation on the language and standard library would be nice. Maybe a library to parse source and generate documentation from it would be worth it 🤔
Functions for:
When trying to import a module.
Since the garbage collector traverses nested objects using recursion, there are cases where marking a deeply-nested object will result in a stack overflow.
Instead, it should be possible to mark all objects iteratively, using a heap-allocated queue of objects to be marked.
This issue only affects the mark phase of the GC.
Rather than having the majority of store operations require a following `OP_POP` to remove the value from the stack, the store instruction should not leave a value on the stack. If we need the value to be left there (i.e. in an assignment expression), it should be implemented as a duplication of the top of the stack and then storing one of those values.
This should have a slight performance improvement, and will reduce the amount of bytecode needed in most cases.
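The difference can be sketched in bytecode. `OP_POP` is from the text above; the other mnemonics (`LOAD_GLOBAL`, `CALL`, `STORE_LOCAL`, `DUP`) are hypothetical names in the style of the listings elsewhere in this document:

```
; current: store leaves the value on the stack
LOAD_GLOBAL f
CALL 0
STORE_LOCAL x   ; value still on the stack
POP             ; extra instruction on almost every store

; proposed: store consumes the value
LOAD_GLOBAL f
CALL 0
STORE_LOCAL x   ; value popped by the store itself

; proposed, when the value is needed (assignment expression):
LOAD_GLOBAL f
CALL 0
DUP             ; duplicate the top of the stack
STORE_LOCAL x   ; consumes one copy; the other remains
```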
It would be problematic if two modules depend on each other at import time, but it could be useful to allow for cases where the imports are only used within functions.
Now you can write `let [a, b] = arr, { c, d } = struct_;`. It would be nice if the same thing could be done in an expression.

Thankfully, there is no ambiguity between a left brace at the start of an expression and the start of a new block (like there is in JS), so there should be no need to write `({ a, b } = some_struct)`.
Parsing this for arrays will be a little trickier, since it's not possible to tell whether the current expression is an array literal or destructuring assignment until either:
It should be possible to make it work though.
This language sorely needs some efficient way to store structured data. Currently, there are arrays, which can be used as a sort of struct efficiently, but the ergonomics are terrible.
Here's an excerpt of what I've been doing (from `lib/hashmap.rbcvm`):
```
let CAP = 0;
let LOAD = 1;
let HASH_FN = 2;
let BUCKETS = 3;

export function with_capacity(capacity) {
  return [
    capacity,
    0,
    hash_string,
    Array.new(capacity),
  ];
}

export function set(self, key, value) {
  # ...
  let bucket = self[BUCKETS][hashed];
  if bucket == null {
    # ...
    self[LOAD] = self[LOAD] + 1;
    self[BUCKETS][hashed] = bucket;
  }
  # ...
  if self[LOAD] / self[CAP] > LOAD_FACTOR {
    grow(self);
  }
}
```
A language that faced a similar kind of problem is Erlang, which uses tuples extensively. The solution there was introducing records, which look like this:
```erlang
-record(hashmap, {capacity, load = 0, hash_fn = hash_string, buckets}).

test() ->
    NewMap = #hashmap{capacity = 0, buckets = {}},
    Buckets = NewMap#hashmap.buckets.
```
Basically, records allow naming fields of tuples at compile-time. To my understanding, they compile down to tuples.
A similar thing could be used here, though `#` can't be used, since that's for comments (unless we change that lol). Here's a possibility, reimplementing the above code:
```
struct HashMap {
  capacity,
  load = 0,
  hash_fn = hash_string,
  buckets,
}

export function with_capacity(capacity) {
  return :HashMap{
    capacity=capacity,
    Array.new(capacity),
  };
}

export function set(self, key, value) {
  # ...
  let bucket = self:HashMap.buckets[hashed];
  if bucket == null {
    # ...
    self:HashMap.load = self:HashMap.load + 1;
    self:HashMap.buckets[hashed] = bucket;
  }
  # ...
  if self:HashMap.load / self:HashMap.capacity > LOAD_FACTOR {
    grow(self);
  }
}

println(:HashMap{capacity=0, []}); # ["HashMap", 0, 0, function hash_string, []]
```
Some questions:
Currently we determine the number of arguments to a function statically. This prevents cool things like variable numbers of arguments and, indirectly, multiple return values. Here's how we can improve it:
When we encounter an infix left paren (i.e. the start of a call expression), we'll add an opcode called `OP_PREP_FOR_CALL`, which will store the current stack position in the function state. This stack position will be the position just after the function we'll be calling.

After that opcode comes the evaluation of all the arguments, and then the `OP_CALL`. At call time we can compare the current stack position with the one stored by prep-for-call to see how many arguments we actually have. This means the number of arguments can be dynamic without incurring much overhead. In fact, while we add an extra operation, we can remove the argument to `OP_CALL`, so the size of a function call is smaller by 32 bits.
This means that splatting an array into the arguments of a function is just a matter of pushing all the elements of the array to the stack, and if calling a function leaves behind multiple values (multiple returns), that's fine too.
It makes sense to garbage-collect struct specs. In most cases this will make no difference (usually they are gonna be at the global scope), but this would allow dynamic creation of struct specs which would be cool.
In addition, right now this would not free any memory, even though the `test` struct spec becomes unreachable:
```
struct test {}
test = null;
```
Having to do `i = i + 1` makes me sad 😔
I recently refactored the interpreter such that calling `cb_eval` evaluates a function. This means that when a function returns, `cb_eval` also returns.
This design makes it easy (ish) to implement generators:

- If a function contains a `yield` expression, calling it returns a generator object.
- When a `yield` expression is encountered, `cb_eval` stores whatever state is needed to resume evaluation in the generator object, and then returns.
- The yielded value is returned from calling the generator object.

Here is an example:
```
generator range(n) {
  for let i = 0; i < n; i = i + 1 {
    yield i
  }
}
```
The use of a `generator` keyword could help to implement this with a single-pass compiler, but might not be necessary. JavaScript does not have such a keyword, but similarly requires a function to be declared as a generator, with `function*`. Python, however, does not require anything: a generator is simply any function that contains `yield`.
The bytecode for a yield expression could be as follows:
```
; evaluate the expression to be yielded; if there is no expression, the value is null
CONST_NULL
; at this instruction, cb_eval would return
YIELD
; if the result of yield is not used:
POP
```
In terms of grammar, `yield` could be easily parsed as a prefix unary operator.
Some challenges might be dealing with arguments to the generator. These will need to be stored in the generator object, in all likelihood. We don't want to need separate opcodes to handle retrieving arguments within a generator, so they will need to be pushed onto the stack every time the generator is resumed.
An approach could be to add a `CB_VALUE_USERDATA` type, which simply holds a `void *`, and then generators could be implemented in terms of builtin functions like this:
```
let gen = range(10);
for let i = next(gen); !done(gen); i = next(gen) {
  println(i);
}
```
Since this still requires a `yield` keyword, however, it might be worth adding generators as a first-class feature of the language (i.e. a `CB_VALUE_GENERATOR` type).
Right now, printing an array is a "repr" type of display (not sure if there is any other alternative), but printing a string or char does not escape any characters, nor does it wrap the value in quotes. This can be annoying when printing an array of strings or characters.
Inspired by Python, it could be something like this:
Right now a string consists of two structs: one with a GC header, and one which holds a pointer to the actual string. Both are heap allocated, and the former holds a pointer to the latter. It seems wasteful to have both, since an object can have arbitrary size anyway.
Currently it's something like `function whatever`; I think it should be something like `<function whatever>`.
I think a nice way to do this could be as follows:
Add a `docs` module to the standard library that looks like this:
```
export function document_function(modname, func, description) {}
export function document_var(modname, name, type, description) {}

export function module(modname) {
  return struct {
    document_function = function (func, description) {
      document_function(modname, func, description);
    },
    document_var = function (name, type, description) {
      document_var(modname, name, type, description);
    },
  };
}

export function html_generator() {}
export function markdown_generator() {}

export function generate(generator=html_generator()) {}
```
The documentation would be generated by running something like:
```
import array;
import arraylist;
# and so on, for all stdlib modules...

fs.write_file(argv()[0], docs.generate());
```
The benefits of this are:
Some drawbacks:
An MVP might not use any reflection; just a name + description. The name could look like `func(a, b, c)` to show the parameters.
`import hashmap as M`
One of my goals with this language is to keep as much functionality out of the host language as possible. Currently, the standard library is almost entirely written in whatever this language is called, even including things like hashing functions.
This makes it more difficult to make it fast, but it should also mean it's a prime candidate for a JIT compiler.
Right now the standard library needs to be imported by path, which is not ideal. I think it should work something like `import "array"`, and this would be understood to mean `$CBCVM_ROOT/lib/array.rbcvm`.
Maybe these should be registered in a hashmap somewhere, and when importing, if the name is in that hashmap, we use the special-case file instead of trying to resolve it relatively to the pwd.
Alternatively, there could be an environment variable or option like `CBCVM_PATH`, structured like the normal Linux `PATH` variable. Each directory would be checked for a file matching the imported name, and the standard library would be in there. This is probably the better approach.