storj-archived / sips

Storj Improvement Proposals.

License: GNU General Public License v3.0

Topics: sips, storj-improvement-proposals, storj, docs, community

sips's Introduction

Storj Improvement Proposals

People wishing to submit SIPs should first propose their idea or document in the Community Slack. After discussion, they should submit the documented SIP as a pull request. After copy-editing and acceptance, it will be published here. Having a SIP here does not make it a formally accepted standard until its status becomes Active. For a SIP to become Active requires the mutual consent of the community.

Index

| Number | Title | Owner | Type | Status |
|--------|-------|-------|------|--------|
| 0001 | SIP Purpose and Guidelines | Shawn Wilkinson | Process | Active |
| 0002 | Bounding Sybil Attacks with Identity Cost | Shawn Wilkinson | Standard | Draft |
| 0003 | Remote Notifications and Triggers | Gordon Hall | Standard | Active |
| 0004 | Contract Transfers and Renewals | Gordon Hall | Standard | Active |
| 0005 | File Encryption and Erasure Encoding Standard | Braydon Fuller | Standard | Active |
| 0006 | Farmer Load Balancing Based on Reputation | Moby von Briesen | Standard | Active |
| 0007 | Storj Bridge Directory with an Ethereum Contract | Braydon Fuller | Standard | Draft |
| 0008 | Farmer Time-locked Storage Payouts | Braydon Fuller | Standard | Draft |
| 0009 | Bandwidth Reputation and Accounting | Braydon Fuller | Standard | Draft |
| 0032 | Hierarchically Deterministic Node IDs | Braydon Fuller | Standard | Active |

sips's People

Contributors

bookchin, braydonf, mobyvb, nergdron, super3

sips's Issues

Allow farmers to announce a downtime interval to the network

A farmer brought up the idea of allowing farmers to announce a downtime interval to the network, for example for updates or fixes. Announcing a downtime interval makes the network more efficient: the bridge knows in advance when a node will go offline and can adjust its mirroring strategies accordingly.
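
A minimal sketch of what such an announcement could look like; the message shape and the bridge-side handlers are hypothetical, not part of any existing protocol:

```typescript
// Hypothetical downtime announcement a farmer could send to the bridge.
// None of these names exist in the current protocol; this is only a sketch.
interface DowntimeAnnouncement {
  nodeId: string;   // the announcing farmer's node ID
  startsAt: number; // unix timestamp when the node goes offline
  endsAt: number;   // unix timestamp when it expects to be back
  reason?: string;  // e.g. "software update"
}

// On the bridge side, a scheduler could act on the announcement before
// the window starts instead of reacting to a surprise offline event.
function handleAnnouncement(a: DowntimeAnnouncement): void {
  const windowHours = (a.endsAt - a.startsAt) / 3600;
  if (windowHours > 24) {
    // long outages are treated like a normal node failure
    scheduleMirroring(a.nodeId);
  } else {
    // short outages: just avoid selecting this node for new contracts
    markTemporarilyUnavailable(a.nodeId, a.startsAt, a.endsAt);
  }
}

// Placeholders for bridge internals that are out of scope here.
declare function scheduleMirroring(nodeId: string): void;
declare function markTemporarilyUnavailable(nodeId: string, from: number, to: number): void;
```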

Decentralized bridges

Start thinking about and compiling a plan for how federated bridges would work and how they could be implemented.

The way I see this being implemented is that some of the Storj services will be offloaded to volunteers who run a package, much like a Bitcoin full node, on a server; they then get a reward for running and maintaining this package. The main goal is to prevent the Storj network from failing when the main Storj bridge goes down, or even during bridge updates. These "mini bridges" do not have to run all the normal Storj bridge services; for instance, I see no reason for federated bridges to run a billing system, which should be handled by the main bridge (a possible service split is sketched below). Setting up a federated bridge should be easy enough that tech-savvy people can start and maintain one. The whole idea is to make sure Storj does not become a second AWS, where if the main bridge goes down, so does the entire network.
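
To make the service split concrete, here is a hypothetical configuration for such a mini bridge; all service names and the endpoint are illustrative, not an existing config format:

```typescript
// Hypothetical configuration for a federated ("mini") bridge.
// The point is that billing stays on the main bridge while the
// storage-facing services are replicated by volunteers.
interface FederatedBridgeConfig {
  services: {
    contractNegotiation: boolean; // negotiate/renew farmer contracts
    shardAuditing: boolean;       // run audits against farmers
    mirroring: boolean;           // repair/mirror shards
    billing: boolean;             // payments: main bridge only
  };
  mainBridgeUrl: string;          // report back to the main bridge
}

const miniBridge: FederatedBridgeConfig = {
  services: {
    contractNegotiation: true,
    shardAuditing: true,
    mirroring: true,
    billing: false, // handled by the main bridge
  },
  mainBridgeUrl: "https://api.storj.io", // assumed main bridge endpoint
};
```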

S3 tools compatible interface

Dear Storj,

As you might have spotted, DigitalOcean recently launched Spaces storage. One of its main features is compatibility with existing AWS S3 tools. Obviously, the reason is to lure existing customers and projects.

Implementing the S3 protocol would unlock petabytes of data looking for better storage. Just imagine: all the existing tools would become reusable with Storj!
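
To make the benefit concrete: with an S3-compatible endpoint, an unmodified AWS SDK client would work against Storj. The gateway URL and credentials below are hypothetical; the SDK calls are the standard @aws-sdk/client-s3 API:

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Point the standard AWS SDK at a hypothetical Storj-backed S3 gateway.
// Only the endpoint and credentials change; the client code is untouched.
const s3 = new S3Client({
  region: "us-east-1",                 // required by the SDK, typically ignored by gateways
  endpoint: "https://gateway.example", // hypothetical S3-compatible Storj endpoint
  forcePathStyle: true,                // common requirement for non-AWS gateways
  credentials: { accessKeyId: "ACCESS_KEY", secretAccessKey: "SECRET_KEY" },
});

async function upload(): Promise<void> {
  await s3.send(new PutObjectCommand({
    Bucket: "my-bucket",
    Key: "backups/db.dump",
    Body: "hello from an unmodified S3 client",
  }));
}

upload().catch(console.error);
```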

Incorrect SIP Licensing

"Copyright/public domain -- Each SIP must either be explicitly labelled as placed in the public domain (see this SIP as an example) or licensed under the Open Publication License."

This was taken from the original document in error and does not reflect our free-software licensing values.

Exchange Report Document

Greetings Earthman,

Right now there are three developers, including myself, working on applications that communicate with the API outside of the existing Storj libraries. We use the CLI tool's debug output as a reference.

A new feature was recently added that involves Reports, specifically Exchange Reports. It appears these are sent after every shard transfer (both up and down). The meaning of the values being passed is unknown, so it would be helpful if there were a document briefly explaining where these values come from.

Specifically...

"dataHash": Assume Shard Hash?
"reporterId": ???
"farmerId": Assume Farmer Hash?
"clientId": ???
"exchangeStart": ??? Start of byte index? Or possibly this is a unix time stamp of some sort?
"exchangeEnd": ??? End of byte index? time stamp?
"exchangeResultCode": ??? Likely the Farmer response after transfer, but wasn't sure if I need to provide?
"exchangeResultMessage": ??? Same

This is not urgent, but I was told that mirrors rely on this feature, so we would obviously like to support mirroring of shards.

Thanks!

Distribute Shards via Geo-Loc-IP

Kevin:

Storj is distributed data. To prevent the data from being centralized in a data center that has the resources to run hundreds of thousands of nodes, we need a way to distribute data using a metric that doesn't rely solely on node-ID closeness to the shard hash mixed with response time.

We (LittleSkunk and I) feel the best solution would be to use the geo-location features of IP addresses to prevent shards from concentrating in one region or area. If a German data center decides to run 500,000 nodes, the bridge would still distribute data evenly to each region: Germany gets one shard, China gets one, Saudi Arabia gets one, Mexico gets one, and so on.

I realize this adds some complexity. However, doing nothing to prevent this will eventually cause the data to centralize and make Storj pointless (beyond the good feeling of supporting the little guy). It is better to prevent this now, as part of the architecture, so that centralization can't happen and the network stays distributed.

Plus, I think you are already considering regionalized IPs for renters who are required to keep their data in a specific region, so this could simply be an outgrowth of that. If you are going to deliberately aim shards at a region, you could also deliberately diversify shards among regions, to prevent all the data going to the place with the most nodes.

The goal here is to prevent data centralization. If you have a better way to do that, groovy. Let's do that then. But I think doing nothing is a mistake. Storj needs to stay distributed. That's all. Thanks!


Meije:

What Kevin describes above becomes especially important once farmers are selected based on performance metrics and/or geolocation. Say a renter lives in the Netherlands, is bound by national data storage laws to keep the data within the Netherlands, and needs gigabit farmers. The bridge would then try to select farmers within the Netherlands that meet the selection parameters. However, this renter has a data center only a few km away with thousands of nodes, so that data center would now get all or almost all of his data, meaning his data is now centralized. With the idea above, checking whether a shard from a specific file is already stored at a specific GeoIP location, then skipping that location and selecting another one within the country, ensures the data is always stored in a decentralized fashion and prevents data loss if the data center goes offline. Something like Geo-IP zoning, as used for "no fly zones", would be a good option. A possible selection rule is sketched below.
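
A minimal sketch of such a selection rule, assuming a GeoIP lookup is available; all names here are hypothetical:

```typescript
// Hypothetical region-aware farmer selection: spread the shards of one
// file across distinct GeoIP regions before ever reusing a region.
// geoRegionOf() stands in for a GeoIP lookup and is assumed to exist.
declare function geoRegionOf(farmerIp: string): string;

interface Farmer { nodeId: string; ip: string; }

function selectFarmers(candidates: Farmer[], shardCount: number): Farmer[] {
  const selected: Farmer[] = [];
  const usedRegions = new Set<string>();
  for (const farmer of candidates) {
    if (selected.length === shardCount) break;
    const region = geoRegionOf(farmer.ip);
    // Skip a region that already holds a shard of this file, so a
    // single data center (even with 500,000 nodes) gets at most one.
    if (usedRegions.has(region)) continue;
    usedRegions.add(region);
    selected.push(farmer);
  }
  // If there are fewer distinct regions than shards, a second pass
  // would have to relax the constraint; omitted here for brevity.
  return selected;
}
```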

There is one concern with this technique: many ISPs provide IPs that resolve to one specific hub, so IP-based geolocation of all nodes on that hub would point to a single geographical location.

Detect nodes that variably limit their bandwidth

Currently there are nodes with monthly bandwidth caps: they allow maximum Storj node throughput for the first few days of the month, and then, when they get close to the limit, they rate-limit the Storj Share transfer speed. This affects renters, especially when they pay for a specific performance level, as described in storj-archived/billing#71. These nodes should be detected and down-ranked, which implies running regular benchmarking tests throughout the month.

This is a potential problem in the bigger scheme of things: once farmers fill all their drives completely, they might limit upload traffic (e.g. to 1 Kb/s), so they still get paid for storage but the renter can never actually retrieve the file. The Storj network should detect this, expire all the shards on that node, stop the payments, and mirror the shards to another reliable node. A possible detection rule is sketched below.
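
A minimal sketch of such a detection rule, assuming the bridge keeps periodic benchmark samples per node; the threshold is an arbitrary placeholder:

```typescript
// Hypothetical down-ranking rule: benchmark each node several times a
// month and flag nodes whose late-month throughput collapses relative
// to their early-month throughput (the monthly-cap pattern).
interface BenchmarkSample { timestamp: number; mbps: number; }

function looksCapLimited(samples: BenchmarkSample[]): boolean {
  if (samples.length < 4) return false; // not enough data to judge
  const sorted = [...samples].sort((a, b) => a.timestamp - b.timestamp);
  const half = Math.floor(sorted.length / 2);
  const avg = (xs: BenchmarkSample[]) =>
    xs.reduce((sum, s) => sum + s.mbps, 0) / xs.length;
  const early = avg(sorted.slice(0, half));
  const late = avg(sorted.slice(half));
  // A node delivering under 10% of its earlier throughput at the end of
  // the month fits the rate-limited pattern; the 10% cutoff is arbitrary
  // and would need tuning against real network data.
  return late < early * 0.1;
}
```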

Specification for file system operations on the Bridge

The current Bridge model defines a two-level flat structure. At the first level we have a list of buckets. Each bucket can contain a list of files - the second level.

The FileZilla integration, which was recently introduced, attempts to emulate a full file system with a tree hierarchy. Buckets are displayed as directories, and in each bucket you can have subdirectories nested to an arbitrary level.

The actual implementation in the FileZilla integration does not really matter right now. What matters is that whatever the implementation is, it must be described in a specification, so all other client integrations can implement it the same way and we get consistent behavior across different clients.

It is also worth discussing whether this implementation is the responsibility of the clients at all, or whether it should be implemented in the Bridge itself.

When thinking about how to introduce file system operations, there are several possible approaches:

  1. Change the internal model of the Bridge from buckets-files to a tree hierarchy and expose the file system operations as an API.
  2. Keep the current buckets-files model in the Bridge, but emulate file system operations in the Bridge itself. Expose the operations as an API.
  3. Don't touch the Bridge. Emulate file system operations in libstorj and expose them as an API.
  4. Don't touch the Bridge and libstorj. Provide a specification for how clients should emulate file system operations on top of the libstorj API. Every client should follow this spec.
  5. Don't do anything. Leave clients to decide if they want to have file system operations and to do it in a consistent way.

We are currently at point 5. My hope is that we move as far up the above list as possible.

Implementing file system operations entirely on the client side (points 3-5) has the problem that some operations cannot be done atomically. For example, if directories are implemented by prefixing each filename with the directory name, then deleting a directory requires deleting each of its files separately. This may require hundreds or even thousands of API requests from the client to the Bridge. Not only is this inefficient, but the full sequence of API calls may not complete for various reasons, leaving the Bridge in an inconsistent state. Renaming and moving directories have the same issue. The sketch below illustrates the problem.
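
A sketch of the problem, with hypothetical client-side API wrappers standing in for the real bridge calls:

```typescript
// Client-side "directory delete" when directories are emulated by
// filename prefixes: one API call per file, with no atomicity.
// listFiles/deleteFile stand in for bridge API wrappers.
declare function listFiles(bucketId: string): Promise<{ id: string; name: string }[]>;
declare function deleteFile(bucketId: string, fileId: string): Promise<void>;

async function deleteDirectory(bucketId: string, dirPrefix: string): Promise<void> {
  const files = await listFiles(bucketId);
  const inDir = files.filter((f) => f.name.startsWith(dirPrefix + "/"));
  for (const f of inDir) {
    // Any failure here (network error, rate limit, client crash) leaves
    // the "directory" half-deleted: the inconsistency described above.
    await deleteFile(bucketId, f.id);
  }
}
```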

So it would be best to implement point 1 or 2. It would be much more efficient and consistent if the Bridge itself were responsible for the file system operations.

I am also curious why the buckets-files model was chosen for the Bridge in the first place. Perhaps this is where the discussion should start.

Farmer and renter retrievability indexes

We all agree that cloud storage prices will continue to drop to the point that just holding data on a farmer's drive will not be profitable, and that only uploading data back to renters will compensate for the low storage payout. Say I am farmer A: I get an average of $1-2/TB per month for storage, but since I upload about $10 a month worth of data, I will continue to be a farmer. Now I am farmer B: I fill all my space and get $1-2/TB but zero uploads, because my data comes from a client that only uses the space for backups. It is not profitable for me, so I will wipe the drive and start over, hoping to get data that a renter accesses frequently. How are we going to solve that problem?

Right now there is no way for a renter to specify a "retrievability index" (how many times the files will be retrieved) and no way for a farmer to set a min/max retrievability index. Adding such a parameter (or adjustable scale) for both farmers and renters would allow the network to specialize for certain use cases. Some farmers are very slow and would prefer not to upload data back at all if possible, which would mean attaching a low retrievability index to their contracts; the reverse holds for files that are accessed frequently. A sketch of how this could look in contract negotiation is below.
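
A minimal sketch of such a parameter in contract matching; all field names are hypothetical, not part of the existing contract format:

```typescript
// Hypothetical contract extension: the renter declares an expected
// retrievability index (downloads per month) and the farmer declares
// the range they are willing to serve.
interface RenterTerms { expectedRetrievalsPerMonth: number; }
interface FarmerTerms { minRetrievals: number; maxRetrievals: number; }

// A contract only forms when the renter's expectation falls inside
// the farmer's advertised range.
function retrievabilityMatch(r: RenterTerms, f: FarmerTerms): boolean {
  return (
    r.expectedRetrievalsPerMonth >= f.minRetrievals &&
    r.expectedRetrievalsPerMonth <= f.maxRetrievals
  );
}

// Example: a backup-only renter (~0 retrievals) matches a slow,
// storage-heavy farmer but not a bandwidth-optimized one.
retrievabilityMatch({ expectedRetrievalsPerMonth: 0 },
                    { minRetrievals: 0, maxRetrievals: 5 }); // true
retrievabilityMatch({ expectedRetrievalsPerMonth: 100 },
                    { minRetrievals: 0, maxRetrievals: 5 }); // false
```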
