The GitHub Archive Program & Arctic Code Vault
The new highlights section in the profile's left sidebar ("Arctic Code Vault Contributor") should link to a page with more info.
Here are some ideas/guidelines to evaluate books about technology (but not literature):
General principles instead of specific technologies
Technical instead of pop science
Comprehensive instead of partial
Unique instead of redundant
Optional, but recommended: include official standards when possible
Examples:
Note: We should still include books about each one of these topics; but the standards can be used as reference.
I also propose the following changes in the structure:
With that in mind, here's a shorter list of books for sections 1 to 7:
Optional:
Optional:
Optional:
Optional:
Before adding a book, ask yourself: is the book about general principles? Is it technical? Is it comprehensive? Does it provide unique information?
It would be nice to publish the tech tree reel at some point, if that is possible in any way. It is obviously too late now to change anything on it, but it is surely still an interesting resource.
The books about "Internet" should be moved to "Networking and connectivity":
Turkic languages are spoken by 250 million people. Turkish is the most commonly spoken Turkic language, with 90 million speakers. I am going to provide the translation myself in the near future.
Is the code used to generate the frames available somewhere? Is that a standardized project?
cc @craSH
The tech tree is mentioned in the guide and in the FAQ (https://archiveprogram.github.com/faq/), but nowhere is there an example of what it should look like or how it should be structured, nor any other guidance for repo owners who want to make their own tech tree. Is there any more info on this topic?
I'll be the first to admit I'm biased, but is there a reason why the list would favor Flask over Django, particularly when the latter is arguably more popular, robust and mature?
I see Ruby on Rails is favored over Sinatra, so size / scope doesn't seem to be the deciding factor.
Just wanted to find out whether this will be the only archive program or whether there will be another one. If there will be another one, when?
The tone should be descriptive and not opinionated. Statements such as the following do nothing but expose the personal views of the anonymous authors and should play no part.
Examples (emphasis mine)
a clumsy but straightforward language
a very cryptic, limited, but fast and powerful family
Hello! Is this project suffering some delays due to the coronavirus pandemic? Just curious. I hope everybody is fine. Thanks.
(Could repos tagged 'coronavirus', and related ones, be included after you started the snapshot?)
It's stated in the FAQ that:
The snapshot will consist of the HEAD of the default branch of each repository,
And the guide confirms:
However, in order to save space, this archive's repositories generally do not include git histories.
I understand the need to save space, but I want to point out that in some cases, the change history can be extremely important for understanding why code was written.
As an example, in the course of my work, I regularly need to read and understand the source code of the Linux kernel. A great deal of information about how the code works, and how and why it came to be designed and organized the way that it is, is found in the commit history.
In any large project that has matured over many years and seen many mistakes made, bugs fixed, edge cases covered, and lessons learned, all of that knowledge is in the commit logs. Many a developer who has worked on someone else's code has cried out to themselves, the universe, or their favourite deity, "What the hell is this line for? What on earth were they thinking?". And the first (and sometimes only) place to look for answers is the commit logs.
I know there are storage costs and practicality to consider, but I would strongly suggest that commit history be preserved, at least for some projects. Possible heuristics could be importance/popularity and complexity (lines of code?). I don't presume to know enough to tell you how to manage the archive, but as a developer, I can tell you that some codebases almost cannot be worked on without commit history.
Thank you for your participation. For the historical record, the matter concerns questionable practices such as "For the GitHub Arctic Code Vault, we are unable to remove data that has already been stored.", enforced without any prior knowledge of, or notification to, the end users, who are here customers.
This is a factual statement which is incorrect:
This is known as the closed source model, and, historically, was the early, crude approach to software development.
Code was all open before it was later closed and then often open again.
Source https://en.wikipedia.org/wiki/History_of_free_and_open-source_software
In the 1950s and 1960s, computer operating software and compilers were delivered as a part of hardware purchases without separate fees. At the time, source code, the human-readable form of software, was generally distributed with the software providing the ability to fix bugs or add new functions.[1] Universities were early adopters of computing technology. Many of the modifications developed by universities were openly shared, in keeping with the academic principles of sharing knowledge, and organizations sprung up to facilitate sharing.
For example, if I created a repository after 2019/11 and before 2020/02 and then deleted it recently (say, a month ago), was the repository archived?
Respected Sir,
I want to add more languages as separate guide md files in your repository. Can I make a pull request for this?
A few possible improvements:
Source code of open source software
Open source software is made available to any and all who want to use it, at no cost, so they can in turn improve it, or use it to build something new and better.
It might be better to explicitly mention that it is the source code that is made available to all, like so: Source code of Open source software is made available to any and all ...
Open source software project
An open source software project
Excuse me for the nitpick. We define this as open source software project and then refer to it as open source project in all other places. Maybe it's better to just define it that way.
GitHub not defined
GitHub seems to be mentioned in a dozen places but I don't see any clear definition as to what GitHub is in a way similar to how other things such as computer, Git etc. are defined. It would be nice to define it somewhere.
The archive is so large -- roughly 24 trillion bytes
The guide talks about bytes without having explained what a byte is. This makes it hard to understand how much data 24 trillion bytes is.
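For a sense of scale, here is a minimal Python sketch of the arithmetic. It assumes the usual conventions that one byte is eight bits and that a terabyte is 10^12 bytes; the function name `describe_size` is made up for this illustration and is not something the guide defines.

```python
def describe_size(num_bytes: int) -> dict:
    """Express a byte count in a few related units.

    Assumes 1 byte = 8 bits and decimal (SI) prefixes: 1 TB = 10**12 bytes.
    """
    return {
        "bytes": num_bytes,
        "bits": num_bytes * 8,
        "terabytes": num_bytes / 10**12,
    }


archive = describe_size(24 * 10**12)  # "roughly 24 trillion bytes"
print(archive["bits"])       # number of individual 0/1 values
print(archive["terabytes"])  # the same amount expressed in terabytes
```

A definition along these lines (a byte as a group of eight bits) would let a reader without prior knowledge relate the figure to something countable.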
Uncaught TypeError: Cannot set property 'innerHTML' of null
at showRemaining (main.js:247)
showRemaining @ main.js:247
setInterval (async)
bind @ main.js:262
init @ main.js:223
(anonymous) @ main.js:14
l @ jquery-3.3.1.min.js:2
c @ jquery-3.3.1.min.js:2
setTimeout (async)
(anonymous) @ jquery-3.3.1.min.js:2
u @ jquery-3.3.1.min.js:2
fireWith @ jquery-3.3.1.min.js:2
fire @ jquery-3.3.1.min.js:2
u @ jquery-3.3.1.min.js:2
fireWith @ jquery-3.3.1.min.js:2
ready @ jquery-3.3.1.min.js:2
_ @ jquery-3.3.1.min.js:2
document.getElementById("countdown").innerHTML = ""; // null
I think this guide should be a plain txt file that can be read and understood easily. GitHub would easily parse markdown to readable text. But what would happen when someone reaches the code vault in an age when there are no such parsers? 🤔
The guide says nothing about the concept of license. Since this is about open source, and most of the open source projects are released under one of the currently available open source licenses, I think it would be useful to show an overview of them (or at least of the very concept).
The people of the future will not know how to code JavaScript
I can't see in the Tech Tree any book regarding futurism, technological singularity, predictions, etc.[1]
This site[1] contains plenty of information about thousands of file formats; it's a wiki and is licensed CC-0.
We can ask for an XML dump of the whole site.
Instead of:
I recommend the following books:
Hello! I just read in your blog post[1] that you will add a snapshot of Wikipedia to the vault. It's great!
I created some time ago two repos when I learnt about the GitHub Vault, Wiktionary[2] and Wikispecies[3]. They include XML dumps of those wiki sites, only current versions for pages, and some metadata (date, username of last edit, etc).
Are you going to include in the wiki snapshot only Wikipedia? Only English? Full history or current versions only?
Other interesting projects which could be added, if there aren't issues (copyright or other), are: Project Gutenberg (60,000 books), Open Library (index of all known books) and Rosetta Project.
Thanks!
[1] https://github.blog/2020-02-03-the-arctic-code-vault-starts-production-and-your-open-source-projects-are-being-archived/
[2] https://github.com/emijrp/dictionaries-timecapsule
[3] https://github.com/emijrp/species-timecapsule
Nothing on shells, libraries, user space tools, nor even text editors?
While open source software is written by people from all over the world, "it is not guaranteed that the inheritors of this archive will know English," as mentioned by the guide.
The guide also serves as a great introduction to programming and data storage. Through translation, we can make it more accessible not only to the Vault's future handlers, but also to the general public interested in computer principles.
Language | GitHub user |
---|---|
Bulgarian | @sahwar |
German | @marcauberer |
Romanian | @vladfrangu |
Simplified Chinese | @xyx0826 |
Spanish | @erubio0 |
I don't know how comprehensive the documentation of the XZ format included on every reel is, but if it is just the C source code of xzdec or a similar tool, it would be prudent to include some book about data compression concepts (Huffman coding, arithmetic coding, LZ77, etc.). While it should be possible to port any program from any language to any other language without understanding why it works, this is certainly easier when you have some background on what the program is doing.
Compilers, assembler, and operating systems are three different topics.
Assembly should be part of Programming languages (see #73).
The others should be split into two topics:
And maybe:
There are no books about computation models and complexity theory.
If there is no space for many books I suggest these four:
All of them are self-contained with sections or appendices covering the basics. They could be added to the "Fundamentals of computing and the Internet" section or the "Algorithms and data structures" section.
When we talk about the year 2020 or other dates, we should clarify that we use the Gregorian calendar.[1]
Make a German translation for GUIDE.md
Another suggestion for the Tech Tree, the Stack Exchange data dump.
So, using the hypothetical CPython example, the item in this list with the ID 12345 might have a start frame of 054321, a start byte of 03210321, an end frame of 054545, and an end byte of 12321232.
This specifies the end frame as 54545, but in the next paragraph:
Decode all frames from the start frame, 54321, to the end frame, 54544
54544 is referred to as the end frame.
Is this because the value given for the end frame is exclusive, i.e. it is actually the number of the frame after the last frame in the range? Or is this just a mistake in the guide?
If it's the former, an explanation in the guide to clarify the inconsistency might be necessary.
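The two readings can be compared with a small Python sketch. It uses only the frame numbers quoted above; the helper `frames_to_decode` and its `end_exclusive` flag are hypothetical names for this illustration, and nothing here says which reading the guide actually intends.

```python
def frames_to_decode(start_frame: int, end_frame: int, end_exclusive: bool) -> range:
    """Return the frame numbers to decode under one of the two readings.

    end_exclusive=True  -> end_frame is the frame AFTER the last one to decode
    end_exclusive=False -> end_frame is itself the last frame to decode
    """
    stop = end_frame if end_exclusive else end_frame + 1
    return range(start_frame, stop)


# Index entry from the guide's example: start frame 054321, end frame 054545.
exclusive = frames_to_decode(54321, 54545, end_exclusive=True)
inclusive = frames_to_decode(54321, 54545, end_exclusive=False)

print(exclusive[-1], len(exclusive))  # last frame and frame count, exclusive reading
print(inclusive[-1], len(inclusive))  # last frame and frame count, inclusive reading
```

Under the exclusive reading the last decoded frame is 54544, which matches the guide's next paragraph; under the inclusive reading it would be 54545.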
Hello World programme without compression? So that the buyer can try to use a source of illumination and some kind of magnifier to read the code directly, kind of fun.
This is a great project and I can understand its significance. However, I would like to point out that several people may (unknowingly) have private and/or sensitive info recorded in their repositories which may be archived. While this information is likely publicly accessible and may have been scraped by others, it would still serve a user well to be able to delete said info when they so choose. Now that this information may be archived, users may suddenly want to double check and/or opt-out.
Of course, GitHub's Privacy Policy clearly notes that:
If you choose to store any Sensitive Personal Information on our servers, you are responsible for complying with any regulatory controls regarding that data.
but it is very likely that people have not read the Privacy Policy and it would thus be helpful to add a small disclaimer that sensitive information may be present in a user's repository and they may want to review it. Again, I understand that this is the user's responsibility and they should have dealt with it in the first place, but a small nudge in the right direction would be immensely helpful.
Can you add a French translation for GUIDE.md?
Ice Can Transfer Viruses Lots More Than Air
The current list of programming languages is quite... arbitrary.
I suggest the following languages based on their historical importance and current popularity:
(*) "Lisp is worth learning for the profound enlightenment experience you will have when you finally get it; that experience will make you a better programmer for the rest of your days, even if you never actually use Lisp itself a lot." (Eric S. Raymond)
Other languages that could be mentioned:
What follows, which we call the Tech Tree, is a selection of works intended to describe how the world makes and uses software today, as well as an overview of how computers work and the foundational technologies required to make and use computers
I don't think the current tech tree is nearly focused enough to archive "how computers work and the foundational technologies required to make and use computers".
Let's look at some examples.
Compilers, assembler, and operating systems
All three of the compiler books cover building a compiler for a low-level imperative language (essentially C). There is no compiler book about functional programming languages (SML/Haskell), declarative languages (Datalog), or high-level dynamically typed languages (Smalltalk). Furthermore, there is no book that talks about SAT/SMT solving or garbage collection, which are foundations for compilers and programming languages (C compilers use graph coloring, an NP-complete problem, for register allocation).
Programming languages
All of the books only talk about contemporary programming languages, with zero books about how to design a language. How does one design the type system to ensure safety without sacrificing too much expressiveness, while still allowing efficient type checking/inference? How do you manage different kinds of effects (e.g. references, concurrency, probability, nondeterminism, exceptions)? How do you design the language so that it contains a few simple yet universal constructs, instead of having lots of ad hoc constructs and quickly becoming an overly complex language (see Gedanken and Scheme)?
Scientific computing
Scientific computing workloads often consist of
0: lots of domain knowledge
1: run with numerical algorithms (e.g. finite element methods)
2: optimization for supercomputers
I didn't see any of them.
Machine learning
There are five books on deep learning, with no book on Bayesian methods, probabilistic graphical models, symbolic methods, or classical machine learning.
Even if this is just for deep learning, it is still not enough: there is no mention of how to rebuild a deep learning framework.
According to my calculations, RecursionError: maximum recursion depth exceeded
I don't know where the choice of programming languages came from (or how it compares to the stats within the archive), but it seems to differ from those in the latest State of the Octoverse:
https://octoverse.github.com/#top-languages
Shell being the most obvious omission (and key to compiling quite a bit of software; see also Make etc).
In many places in this document phrases are used which relate to time or the passage of time. These concepts are artificial constructs that may be meaningless in the future.
For example (emphasis mine)
Most modern languages include libraries of pre-written functions, and such libraries can be very voluminous and elaborate. Some of today's most popular programming languages include:
C, one of the oldest and fastest
I didn't create a repo or make a commit in the last year, but a few months ago I started working on some projects. Will I get the Arctic Code Vault Contributor badge, or have I just missed the train?
It is the 2D barcode format "boxing" that is used to encode the digital data, see https://github.com/piql/boxing
Each pixel has 4 grey levels, storing 2 bits/pixel.
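As a rough illustration of the "4 grey levels, 2 bits/pixel" arithmetic, here is a Python sketch that splits each byte into four 2-bit values and back. This only shows the counting; it is not the boxing format itself, whose real frame layout is defined in the repository linked above, and the function names and the most-significant-bits-first order are assumptions made for this example.

```python
from typing import List


def bytes_to_levels(data: bytes) -> List[int]:
    """Split each byte into four grey levels (0..3), most significant bits first."""
    levels = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            levels.append((byte >> shift) & 0b11)
    return levels


def levels_to_bytes(levels: List[int]) -> bytes:
    """Reassemble bytes from groups of four grey levels."""
    if len(levels) % 4 != 0:
        raise ValueError("need a multiple of 4 levels to rebuild whole bytes")
    out = bytearray()
    for i in range(0, len(levels), 4):
        byte = 0
        for level in levels[i:i + 4]:
            byte = (byte << 2) | level
        out.append(byte)
    return bytes(out)


pixels = bytes_to_levels(b"A")   # 0x41 is 01 00 00 01 in binary
print(pixels)
print(levels_to_bytes(pixels))
```

With four levels per pixel, one byte needs exactly four pixels, so a frame holding N data pixels would carry N / 4 bytes before any overhead.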
Is there a way to know where in the reels a given repo is? Something like a map or index.
Does 100 KB denote 100 kB (100000 bytes) or 100 KiB (102400 bytes)?
According to this Wikipedia article, I would assume KiB.
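The difference between the two readings is small but not zero, as this short Python sketch shows. The constant and function names are chosen for this example only, and the sketch does not settle which unit the archive documentation means.

```python
KB = 1000   # kilobyte, decimal (SI) prefix
KIB = 1024  # kibibyte, binary (IEC) prefix


def size_in_bytes(count: int, unit: int) -> int:
    """Convert a count of units (kB or KiB) into bytes."""
    return count * unit


decimal = size_in_bytes(100, KB)
binary = size_in_bytes(100, KIB)
print(decimal, binary, binary - decimal)
```

A file of, say, 101,000 bytes would fall under a 100 KiB limit but over a 100 kB one, which is why the distinction matters for any size threshold.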
Replace the following titles:
With a single title about data modeling and database design: