spacemeshos / poet

Spacemesh PoET service reference implementation

License: MIT License

Go 96.85% Dockerfile 0.17% Makefile 2.77% Shell 0.22%

golang go proof-of-concept proof-of-space proof-of-space-time proof-of-sequential-work

poet's Issues

Add retries to broadcast failure

When a round broadcast fails, the round stays in the "unbroadcasted" state, and only the recovery mechanism can initiate another broadcast attempt. This is not enough; we should introduce delayed retries after the initial failure.

Add support for multiple sha256() per Hx()

Add a server param - iter - the number of sequential iterations per Hx() when using sha256(), so we shift total cpu time of proof building to sha256() computations.

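func (h *sha256Hash) Hash(data ...[]byte) []byte {
	// Initial digest: sha256(x || data...).
	// hash.Hash.Write never returns an error, so the results are discarded.
	h.hash.Reset()
	_, _ = h.hash.Write(h.x)
	for _, d := range data {
		_, _ = h.hash.Write(d)
	}
	temp := h.hash.Sum([]byte{})

	// Re-hash the digest h.iters more times, prefixing x each time.
	for i := 0; i < h.iters; i++ {
		h.hash.Reset()
		_, _ = h.hash.Write(h.x)
		_, _ = h.hash.Write(temp)
		temp = h.hash.Sum(h.emptySlice)
	}

	return temp
}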

Storage discrepancies for large values of n

Need to figure out why actual dag storage size for n >= 30 is double the expected size T = 2^n

POET core

Clean the prover file when execution stops

The prover key-value storage file is cleaned up after execution finishes and the NIP is generated.
It also needs to be cleaned up when the execution is interrupted before it finishes.

Apply the configured hash function

https://github.com/spacemeshos/poet-ref/blob/develop/service/round.go#L64

Add logs for multiple poets

We need logs to validate that the poet behaves as expected.

Open round recovery policy

Background

The PoET service starts by opening round 1 (the opening duration is specified by the mandatory initialduration flag).
Once that duration has elapsed, round 1 starts executing and round 2 opens.
Round 2's start of execution (and round 3's opening) is defined by the round schedule (the optional duration flag). If it falls before round 1 finishes executing, two rounds execute in parallel. If it falls after, there is a period in which no round is executing. The current behaviour is to start round 2's execution when round 1 finishes if the scheduled time is later or not defined, so there is always at least one round executing.

Options for when attempting to recover a round in “open” state (after server crash or shutdown):

1. Execute it immediately and open a new round instead.
2. Keep it open until the previous round's end of execution, regardless of the opening schedule.
3. If a schedule is/was defined, keep it open for the remaining duration, including the duration in which the server was down. If the duration has already elapsed, execute it immediately.
4. If a schedule is/was defined, keep it open for the remaining duration, not including the duration in which the server was down. This might create a time drift in the overall scheduling.

Option 1 is currently implemented. This creates an issue of creating an additional parallel round execution for every server restart attempt, and thus should be changed.

The main consideration is whether we plan to use round scheduling or just start executing each round when the previous round finishes executing. If the former, options 2-4 are viable. If the latter, option 2 is the fallback and needs to be implemented anyway.

Please provide feedback for the desired behaviour.

Make PoET thread safe

One case where we're not thread safe is when closing a round -- we may be losing the last membership submissions if they're coming in the last moment.

A thorough review is in order.
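
As an illustration of the round-closing case only, the sketch below guards submission and closing with the same mutex, so a submission is either included in the closed round or rejected with an error, never silently lost. The round type and its methods are simplified assumptions, not the service's actual types.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errRoundClosed = errors.New("round is closed")

// round is a simplified, hypothetical model of an open round.
type round struct {
	mu         sync.Mutex
	closed     bool
	challenges [][]byte
}

// Submit adds a challenge, or reports that the round no longer accepts any.
func (r *round) Submit(ch []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closed {
		return errRoundClosed
	}
	r.challenges = append(r.challenges, ch)
	return nil
}

// Close stops accepting submissions and returns everything accepted so far.
func (r *round) Close() [][]byte {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.closed = true
	return r.challenges
}

func main() {
	r := &round{}
	accepted := make(chan struct{}, 100)
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if r.Submit([]byte{byte(i)}) == nil {
				accepted <- struct{}{}
			}
		}(i)
	}
	members := r.Close() // races with the submitters on purpose
	wg.Wait()
	// Every submitter that got a nil error is part of the closed round.
	fmt.Println(len(members) == len(accepted))
}
```

A rejected client can then resubmit to the next open round instead of assuming it was included.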

Service to advertise rounds duration (in ticks)

Client should submit challenges for rounds with an expected duration.

PoET core and service

Build the client/server protocol over the POET core implementation to serve multiple miners per one sequential work.

Add persistency to rounds state

At the moment everything is saved in memory, including the round's challenges list.

Benchmark a long running poet service [WIP]

Goal is to benchmark increasing values of n until we test and bench a ~7 days poet

Finalize the config validation

This takes place in the loadConfig function.

Log messages as JSON

In order to parse poet log messages by ES we need to have those logs as JSON.
Use logs infra from go-spacemesh to print the log messages as JSON when configuring --test-mode execution parameter.

Add gRPC and Json-http service integration tests

Packages refactoring

Extract verifier code from internal to a separated verifier package.
Rename internal package as prover, and extract shared code to shared/common package.

POET core optimizations / benchmarking

Complete basic service API impl

Both on the Service type and on the gRPC server.

PoET discovery

Overview / Motivation

We want to support multiple PoET services and allow a miner to discover them and select between them.

The Task

TODO: Clearly describe the issue requirements here...

Implementation Notes

TODO: Add links to relevant resources, specs, related issues, etc...

Contribution Guidelines

Important: Issues will be assigned to developers in the order of their application and according to their proficiency level relative to the task's complexity. We will not assign tasks to developers who haven't introduced themselves on our Gitter dev channel.

Introduce yourself on go-spacemesh dev chat channel - ask our team any question you may have about this task
Fork branch develop to your own repo and work in your repo
You must document all methods, enums and types with godoc comments
You must write go unit tests for all types and methods when submitting a component, and integration tests if you submit a feature
When ready for code review, submit a PR from your repo back to branch develop
Attach relevant issue to PR

Docker push workflow triggered on PR instead of on merge

With the introduction of dependabot a small issue popped up; PRs created with dependabot failed their CI builds (e.g. this one: #117)

The reason for that is that dependabot doesn't have access to the same set of secrets as PRs created from others have. So it cannot authenticate against Dockerhub and the build fails.

The question here is should we move the "Docker build & push" step out of the CI workflow and move it into one that is only triggered AFTER the merge of a PR, instead of on every commit? At the moment a new image is released every time the image build has been successful, even if tests failed.

@dshulyak @lrettig @evt

Integrate with Spacemesh p2p layer

Use the new merkle-tree implementation

The number of leaves to prove depends on the security parameter constant (T).
merkle-tree can be used to create a more efficient proof of multiple leaves, especially for the verifier, which currently uses a key-value cache to detect repeating nodes in the tree.

Verifier error handling

The verifier currently crashes (panics) if it is given incorrect values (an invalid proof). It needs to report these errors gracefully instead.

Apply new logger to POET core

Update to new API

Upgrade the poet server broadcaster to use the new API. A new service was added to the API for the poet in spacemeshos/api#118 and implemented in spacemeshos/go-spacemesh#2159.

failed creating poet client harness: error during killing process: exit status 123 | kill: you need to specify whom to kill

Alpine Linux (which we use to run our tests in docker) has a funny version of lsof installed that doesn't tell the truth:

/go/src/github.com/spacemeshos/go-spacemesh # lsof -i tcp:18550 | grep LISTEN | awk '{print $2}' | xargs kill -9
kill: you need to specify whom to kill
/go/src/github.com/spacemeshos/go-spacemesh # apk add busybox-extras
(1/1) Installing busybox-extras (1.31.1-r19)
Executing busybox-extras-1.31.1-r19.post-install
Executing busybox-1.31.1-r16.trigger
OK: 160 MiB in 44 packages
/go/src/github.com/spacemeshos/go-spacemesh # busybox-extras telnet localhost 18550
Connected to localhost
@hello
/go/src/github.com/spacemeshos/go-spacemesh # lsof -i tcp:18550
1       /bin/busybox    /dev/pts/0
1       /bin/busybox    /dev/pts/0
1       /bin/busybox    /dev/pts/0
1       /bin/busybox    /dev/tty
2941    /tmp/poet/poet  /dev/null
2941    /tmp/poet/poet  /dev/null
2941    /tmp/poet/poet  pipe:[865420]
2941    /tmp/poet/poet  /root/.poet/logs/poet.log
2941    /tmp/poet/poet  anon_inode:[eventpoll]
2941    /tmp/poet/poet  pipe:[862641]
2941    /tmp/poet/poet  pipe:[862641]
2941    /tmp/poet/poet  socket:[862642]
2941    /tmp/poet/poet  socket:[862646]
2941    /tmp/poet/poet  socket:[864392]
2941    /tmp/poet/poet  socket:[863516]
2941    /tmp/poet/poet  /tmp/poet/data/1680/challengesDb/LOCK
2941    /tmp/poet/poet  /tmp/poet/data/1680/challengesDb/LOG
2941    /tmp/poet/poet  /tmp/poet/data/1680/challengesDb/MANIFEST-000000
2941    /tmp/poet/poet  /tmp/poet/data/1680/challengesDb/000001.log
2941    /tmp/poet/poet  /tmp/poet/data/1679/layercache_0.bin (deleted)
/go/src/github.com/spacemeshos/go-spacemesh # apk add lsof
(1/1) Installing lsof (4.93.2-r0)
Executing busybox-1.31.1-r16.trigger
OK: 160 MiB in 45 packages
/go/src/github.com/spacemeshos/go-spacemesh # lsof -i tcp:18550
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
poet    2941 root   11u  IPv4 862642      0t0  TCP localhost:18550 (LISTEN)
poet    2941 root   13u  IPv4 864392      0t0  TCP localhost:50380->localhost:18550 (ESTABLISHED)
poet    2941 root   14u  IPv4 863516      0t0  TCP localhost:18550->localhost:50380 (ESTABLISHED)
/go/src/github.com/spacemeshos/go-spacemesh # lsof -i tcp:18550 | grep LISTEN | awk '{print $2}'
2941
/go/src/github.com/spacemeshos/go-spacemesh # lsof -i tcp:18550 | grep LISTEN | awk '{print $2}' | xargs kill -9
/go/src/github.com/spacemeshos/go-spacemesh # lsof -i tcp:18550
/go/src/github.com/spacemeshos/go-spacemesh #

This causes problems inside the harness:

poet/integration/harness.go

Line 132 in 79d0b52

 args := fmt.Sprintf("lsof -i tcp:%d | grep LISTEN | awk '{print $2}' | xargs kill -9", addr.Port) 

I see this error from several of the go-spacemesh tests in github.com/spacemeshos/go-spacemesh/cmd/node including this one:

/go/src/github.com/spacemeshos/go-spacemesh # go test -timeout 0 -p 1 -count 1 -v github.com/spacemeshos/go-spacemesh/cmd/node -run Test_PoETHarnessSanity
=== RUN   Test_PoETHarnessSanity
    Test_PoETHarnessSanity: app_test.go:72:
                Error Trace:    app_test.go:72
                Error:          Received unexpected error:
                                error during killing process: exit status 123 | kill: you need to specify whom to kill
                Test:           Test_PoETHarnessSanity
--- FAIL: Test_PoETHarnessSanity (1.16s)
FAIL
FAIL    github.com/spacemeshos/go-spacemesh/cmd/node    1.184s
FAIL

This code should make sure the output of the lsof pipeline is not empty before passing it to kill.
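
One possible shape for that check, sketched in Go: run the lookup part of the pipeline on its own, parse its output, and only then kill. The helper parsePIDs is a hypothetical name; it assumes the command prints one PID per line, as the awk step above does.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parsePIDs extracts process IDs from command output that holds one PID per
// line. It returns an error when the output is empty or is not numeric, so
// the caller never invokes kill without a target.
func parsePIDs(out string) ([]int, error) {
	var pids []int
	for _, field := range strings.Fields(out) {
		pid, err := strconv.Atoi(field)
		if err != nil {
			return nil, fmt.Errorf("unexpected PID %q: %w", field, err)
		}
		pids = append(pids, pid)
	}
	if len(pids) == 0 {
		return nil, errors.New("no listening process found")
	}
	return pids, nil
}

func main() {
	pids, err := parsePIDs("2941\n")
	fmt.Println(pids, err)

	// Empty output, as produced by the busybox lsof shown above.
	pids, err = parsePIDs("")
	fmt.Println(pids, err)
}
```

With the PIDs in hand, the harness could terminate each process from Go (for example via os.FindProcess and Process.Kill) and treat the "no listening process" case as nothing to clean up.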

Integration with node broken

Recent updates #66 and #67 broke integration with the go-spacemesh node. Please check the node and run tests before, or at least after, merging changes.

Those changes break all integration in a way that go-spacemesh cannot be tested right now.

Add timeout for the proofs long-polling requests

The client shouldn't send the requests too early.

Implement sha256 using Intel sha256 instructions

This is an epic - should be broken down to several tasks

Background

Intel's accelerated crypto instructions define sha256 extension instructions, but they have not been widely implemented on the last few generations of Intel CPUs
AMD ships both server and desktop CPUs with the sha extension instructions (EPYC and RYZEN)
Servers with EPYC CPUs are now available on AWS (see https://aws.amazon.com/about-aws/whats-new/2018/11/introducing_amazon_ec2_instances_featuring_amd_epyc_processors/) and are good candidates to run the Spacemesh production poet for the testnet timeframe
Benchmarks show a 3x-5x performance gain over optimized sha256 using AVX2 Intel instructions - there is no faster published sha256 hardware benchmark that I could find
There are already several open source libraries that use the SHA instructions to compute SHA256 for a small input buffer (add links)

Tasks

Integrate and test sha256() using Intel sha256 instructions using assembly code published by open source libraries
The hardware-optimized sha256 should be used automatically when the CPU supports SHA (cpuid=SHA)...
Security and correctness review of the implementation by our research team
Test and profile on both RYZEN (desktop) and EPYC (server) AMD systems

@iddo333 @moshababo @zalmen @tal-m

Every execution round should be followed by an idle period

The Poet period should consist of an execution part where the PoSW is executed and an idle period during which the poet will wait for the beginning of the next execution round. The purpose of the idle period is to allow the miners to generate a PoST proof and get ready for the next round of PoET (assuming that a miner will mostly work with the same PoET for many epochs)

Re-broadcast whenever new gateway nodes are set

@noamnelke wrote:

When all PoET server attempts to broadcast a proof fail we currently can’t re-broadcast a proof without restarting the PoET server. This can easily be improved by initiating a re-broadcast whenever new gateway nodes are set via the API.

Out of memory crash

Either a memory leak or an out-of-memory issue in the current code base. n=33, 32GB RAM. Crashes at 75% of DAG generation. Blocks further benchmarks. We need a better way to generate the DAG than the naive in-memory depth-first traversal that is currently implemented. @zalmen @moshababo @noamnelke

Apply new merkle-tree to POET core

This should include the multiple memberships proof & its verification.

poet submitted invalid proof after recovery

Sorry for the private link, but see the comments starting at https://spacemesh.slack.com/archives/C01BBB2U64U/p1664858910998709

Add versioning, tagging, docker push

We ran into an issue today where after #105 was merged, go-spacemesh system tests started failing because they don't pin the poet version (see spacemeshos/go-spacemesh#2189). In order to facilitate pinning, we should add versioning here and set up automatic dockerbuild/push when a new version is merged or a new release is cut.

Cleanup proof endpoints

Now that PoET sends membership and PoET proofs via gossip, we should clean up the old gRPC endpoints for obtaining proofs. This may require changing how some tests work.

`Service::errChan` is never read so failed rounds will block

When a round execution fails, it writes the error into errChan. However, nobody reads that channel anywhere. Because a write to a full channel blocks, it will lead to goroutines hanging indefinitely and leaking resources.

Implement efficient membership proof

At the moment for every membership proof request the merkle tree is being entirely re-constructed.

POET core code review

"completed broadcast successfully after" is rejected in automation

It is caused by a broken import of the log package.

Update gateway after service already started

Add support to update PoET service gateway nodes after it already started.
At the moment one must restart the server in order to do this.

Document usage

There's currently no documentation on how to run the POET server. Add at least a basic "# Running the server" section to the README. It should contain instructions on how to connect go-spacemesh to it.

Happy to work on this.

@moshababo, does anything special need to be done? When I run poet locally, I noticed that it's looking for a poet.conf. Can we add a sample of this to the repo? Do I need to supply any commandline args? Or should go-spacemesh connect to it (on the default port) with no special params/config required?

service.Info() accesses resources concurrently without protecting

service.go:Info() reads from openRound and executingRounds while the goroutine started in Start writes to these variables. Start even protects executingRounds with a mutex, but Info() reads them without any protection.
Note that Info() can be called at any time, since it is triggered by an API call.

cc: @moshababo
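
A sketch of the usual fix, using a sync.RWMutex so that readers such as Info() do not block each other. The service type here is a simplified assumption that models rounds as plain string IDs; only the field names openRound and executingRounds come from the issue.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// info is a hypothetical snapshot returned to API callers.
type info struct {
	OpenRoundID        string
	ExecutingRoundsIDs []string
}

type service struct {
	mu              sync.RWMutex
	openRound       string
	executingRounds map[string]struct{}
}

func newService(firstRound string) *service {
	return &service{openRound: firstRound, executingRounds: map[string]struct{}{}}
}

// advance moves the open round to executing and opens the next one.
func (s *service) advance(next string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.executingRounds[s.openRound] = struct{}{}
	s.openRound = next
}

// Info copies the state under a read lock, so callers never observe a
// half-updated view and never share the map with the writer.
func (s *service) Info() info {
	s.mu.RLock()
	defer s.mu.RUnlock()
	ids := make([]string, 0, len(s.executingRounds))
	for id := range s.executingRounds {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return info{OpenRoundID: s.openRound, ExecutingRoundsIDs: ids}
}

func main() {
	s := newService("1")
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for _, next := range []string{"2", "3"} {
			s.advance(next)
		}
	}()
	_ = s.Info() // safe to call while the writer goroutine is running
	wg.Wait()
	fmt.Printf("%+v\n", s.Info())
}
```

Running the real service's tests with the race detector (go test -race) is a cheap way to confirm that accesses like this one are covered.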

Add service identity

Persistent (?) key-pair construction
Signing over NIPs

Add config support to the test harness instance

Especially the service round duration, to be used from the test client.

Faulty optimization in PoET verification

In verifier.Verify, the optimization that marks the proof as valid if we already handled one of the siblings on the way to the root is wrong. The fact that we reached a known sibling says nothing about the validity of the path that we already calculated (i.e. from the leaf to the current position). A correct optimization requires storing sibling pairs in the cache (instead of only one of them); if we reach a known pair, we can indeed stop the verification and mark the proof as valid.

PoET CI should be executed on Windows and MacOS

Summary

PoET is used as a dependency of go-spacemesh. Since go-spacemesh is built for Linux, macOS and Windows, PoET should also build on all 3 platforms.

Acceptance criteria

Update CI to build and run tests on all 3 platforms
- Add standard set of linters & checkers in CI (staticcheck, golangci-lint, test-fmt, etc.)
- Report passing & failing tests using JUnit Report
Target Go 1.19 instead of 1.18
Fix existing tests that might fail on non-Linux platforms

Create prover data store file in datadir instead of execution folder

PoET should maintain an identity and sign proofs

Create and persist a key-pair
Make the PoET challenge hash(concat(membershipRoot, poetId))
Sign proof messages and include signature
