github / archive-program

The GitHub Archive Program & Arctic Code Vault

archive-program's Issues

Highlights section in profile "Arctic Code Vault Contributor"

The new highlights section in the profile's left sidebar ("Arctic Code Vault Contributor") should link to a page with more info.

Guidelines

Here are some ideas/guidelines to evaluate books about technology (but not literature):

General principles instead of specific technologies
- Example: "Relational Databases" instead of "MySQL" or "PostgreSQL"
- Exceptions: technologies that became standards with multiple implementations: C, UNIX, etc.
Technical instead of pop science
- Example: "The Elements of Computing Systems" instead of "But How Do It Know?" (which I assume is intended for a general audience)
Comprehensive instead of partial
- Example: "The Art of Computer Programming" instead of "Everyday Data Structures"
Unique instead of redundant
- We don't need several books about each topic.
- If you have to pick one, pick the most general, technical and comprehensive :)
Optional, but recommended: include official standards when possible
- Examples:
  - W3C and IETF standards (HTTP, HTML, XML, CSS, etc.)
  - ECMAScript (a.k.a. "Javascript")
  - C18
  - Scheme R6RS
  - Java Virtual Machine
  - etc.
- Note: We should still include books about each one of these topics; but the standards can be used as reference.

I also propose the following changes in the structure:

Split "Fundamentals of computing" and "the Internet":
- "Fundamentals of computing"
- Move "the internet" to "Networking"
Split "Compilers, Assembler and Operating systems":
- "Compilers and Interpreters"
- "Operating Systems"
- Move "Assembler" to "Programming Languages".

With that in mind, here's a shorter list of books for sections 1 to 7:

1. Fundamentals of computing

Code: The Hidden Language of Computer Hardware and Software by Charles Petzold (Pearson Education)
The Elements of Computing Systems: Building a Modern Computer from First Principles by Noam Nisan (MIT Press)

2. Algorithms and data structures

The Art of Computer Programming by Donald Knuth (Pearson)

Optional:

Introduction to Algorithms, by Thomas H. Cormen (MIT Press)

3. Compilers and Interpreters

Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman (Addison-Wesley)
Modern Compiler Implementation in C by Andrew W. Appel, Maia Ginsburg (Cambridge University Press)
Structure and Interpretation of Computer Programs, by Abelson, Sussman, and Sussman (MIT Press)

Optional:

Lex & Yacc by John R. Levine, Tony Mason, Doug Brown (O'Reilly)

4. Programming Languages

Assembly: Programming from the Ground Up by Jonathan Bartlett (GNU Free Documentation License)
Forth: A sometimes minimal FORTH compiler and tutorial for Linux / i386 systems, by Richard W.M. Jones (PUBLIC DOMAIN)
C: C Programming Language, by K&R (Pearson)
C++: C++ Programming Language, by Bjarne Stroustrup (Pearson)
Scheme: How to Design Programs, by Matthias Felleisen (MIT Press, Creative Commons CC BY-NC-ND)

Optional:

Javascript: JavaScript: The Definitive Guide, by David Flanagan (O'Reilly)
Python: ?
Java: ?

5. Operating systems

Modern Operating Systems by Andrew S. Tanenbaum (Pearson PLC)
Operating Systems: Design and Implementation by Andrew S. Tanenbaum (Prentice Hall)
Operating System Concepts, by Abraham Silberschatz (Wiley)
Advanced Programming in the UNIX Environment, by W. Richard Stevens (Addison-Wesley Professional)
The Art of Unix Programming by Eric S. Raymond (Addison-Wesley, but also available under a Creative Commons license)

Optional:

The Linux Programming Interface, by Michael Kerrisk (No Starch Press)

6. Databases

A Relational Model of Data for Large Shared Data Banks, by E.F.Codd (IBM Research Laboratory)
An Introduction to Database Systems, by C.J. Date (Pearson)
Fundamentals of Database Systems, by Ramez Elmasri (Pearson)

7. Networking

Cabling: The Complete Guide To Copper and Fiber-Optic Networking by Andrew Oliviero and Bill Woodward (Wiley)
Ethernet: The Definitive Guide by Charles E. Spurgeon and Joann Zimmerman (O'Reilly)
TCP/IP Illustrated (volumes 1-3), by Richard Stevens (Addison-Wesley Professional)
DNS and BIND by Cricket Liu and Paul Albitz (O'Reilly)
HTTP: The Definitive Guide by David Gourley, Brian Totty, Marjorie Sayer, Anshu Aggarwal, and Sailu Reddy (O'Reilly)
Computer Networks, by Andrew S. Tanenbaum (Pearson)

Before adding a book, ask yourself: is the book about general principles? Is it technical? Is it comprehensive? Does it provide unique information?

[Suggestion] Publish the tech tree

It would be nice to publish the tech tree reel at some point, if possible in any way. It obviously is too late now to change anything on it, but it surely is still an interesting resource.

Separate "Fundamentals of computing" and "Internet"

The books about "Internet" should be moved to "Networking and connectivity":

Fundamentals of computing

The Pattern On The Stone by W. Daniel Hillis (Basic Books)
But How Do It Know? by J. Clark Scott (John C Scott)
Code: The Hidden Language of Computer Hardware and Software by Charles Petzold (Pearson Education)
The Elements of Computing Systems: Building a Modern Computer from First Principles by Noam Nisan (MIT Press)

Networking and Internet

Cabling: The Complete Guide To Copper and Fiber-Optic Networking by Andrew Oliviero and Bill Woodward (Wiley)
Ethernet: The Definitive Guide by Charles E. Spurgeon and Joann Zimmerman (O'Reilly)
Understanding TCP/IP by Alena Kabelová and Libor Dostálek (Packt)
(...)
Tubes: A Journey To The Center Of The Internet by Andrew Blum (HarperCollins)
Introduction to Networking: How the Internet Works by Charles Severance, illustrated by Mauro Toselli and Aimee Andrion (Charles Severance)

Add Turkish translation

Turkic languages are spoken by 250 million people. Turkish is the most commonly spoken Turkic language, with 90 million speakers. Going to provide the translation myself in the near future.

2D Barcode Format

Is the code used to generate the frames available somewhere? Is that a standardized project?

cc @craSH

Insufficient Tech Tree explanation

The tech tree is mentioned in the guide and in the FAQ (https://archiveprogram.github.com/faq/), but there is no example of what it should look like, what its structure should be, or any other guidance for repo owners who want to make their own tech tree. Is there any more info on this topic?

Web development – choice of frameworks

I'll be the first to admit I'm biased, but is there a reason why the list would favor Flask over Django, particularly when the latter is arguably more popular, robust and mature?

I see Ruby on Rails is favored over Sinatra, so size / scope doesn't seem to be the deciding factor.

Another Archiving?

Just wanted to find out whether this will be the only archive program or whether there will be another one. If there will be another one, when?

Neutral language not opinionated

The tone should be descriptive and not opinionated. Other than exposing the personal views of the anonymous authors, statements such as the following should play no part.

Examples (emphasis mine)

a clumsy but straightforward language

a very cryptic, limited, but fast and powerful family

Delays?

Hello! Is this project suffering some delays due to coronavirus pandemic? Just curious. I hope everybody is fine. Thanks.

(Could repos with tag 'coronavirus', and related, be included after you started the snapshot?)

Commit/change history

It's stated in the FAQ that:

The snapshot will consist of the HEAD of the default branch of each repository,

And the guide confirms:

However, in order to save space, this archive's repositories generally do not include git histories.

I understand the need to save space, but I want to point out that in some cases, the change history can be extremely important for understanding why code was written.

As an example, in the course of my work, I regularly need to read and understand the source code of the Linux kernel. A great deal of information about how the code works, and how and why it came to be designed and organized the way that it is, is found in the commit history.

In any large project that has matured over many years and seen many mistakes made, bugs fixed, edge cases covered, and lessons learned, all of that knowledge is in the commit logs. Many a developer who has worked on someone else's code has cried out to themselves, the universe, or their favourite deity, "What the hell is this line for? What on earth were they thinking?". And the first (and sometimes only) place to look for answers is the commit logs.

I know there are storage costs and practicality to consider, but I would strongly suggest that commit history be preserved, at least for some projects. Possible heuristics could be importance/popularity and complexity (lines of code?). I don't presume to know enough to tell you how to manage the archive, but as a developer, I can tell you that some codebases almost cannot be worked on without commit history.

The matter has been escalated to private discussions.

Thank you for your participation. For the historical record, the matter concerns questionable practices such as "For the GitHub Arctic Code Vault, we are unable to remove data that has already been stored." being enforced without any prior knowledge of, or notification to, end users and customers.

Factual error

This is a factual statement which is incorrect

This is known as the closed source model, and, historically, was the early, crude approach to software development.

Code was all open before it was later closed and then often open again.

Source https://en.wikipedia.org/wiki/History_of_free_and_open-source_software

In the 1950s and 1960s, computer operating software and compilers were delivered as a part of hardware purchases without separate fees. At the time, source code, the human-readable form of software, was generally distributed with the software providing the ability to fix bugs or add new functions.[1] Universities were early adopters of computing technology. Many of the modifications developed by universities were openly shared, in keeping with the academic principles of sharing knowledge, and organizations sprung up to facilitate sharing.

Were deleted repositories archived?

For example, if I created a repository after 2019/11 and before 2020/02, and then deleted it recently (such as a month ago), was the repository archived?

Addition of More Languages in the Guide

Respected Sir,
I want to add more languages as separate guide md files in your repository. Can I make a pull request for this?

Few suggestions

A few possible improvements:

Source code of open source software

Open source software is made available to any and all who want to use it, at no cost, so they can in turn improve it, or use it to build something new and better.

It might be better to explicitly mention that it is the source code that is made available to all, like so: Source code of Open source software is made available to any and all ...
Open source software project

An open source software project

Excuse me for the nitpick. We define this as "open source software project" and then refer to it as "open source project" in all other places. Maybe it's better to just define it that way.
GitHub not defined
GitHub seems to be mentioned in a dozen places but I don't see any clear definition as to what GitHub is in a way similar to how other things such as computer, Git etc. are defined. It would be nice to define it somewhere.

Using the concept of bytes before explaining it

The archive is so large -- roughly 24 trillion bytes

The guide talks about bytes without having explained what they are. This makes it hard to understand how much data 24 trillion bytes is.
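
For a sense of scale, the quoted figure can be converted into larger units. The sketch below is only arithmetic on the "roughly 24 trillion bytes" number; the helper name `describeBytes` is hypothetical, and it assumes the usual convention that one byte is 8 bits.

```javascript
// Hypothetical helper: express a byte count in bits and in larger units.
function describeBytes(byteCount) {
  const TERABYTE = 1e12;    // decimal unit: 10^12 bytes
  const TEBIBYTE = 2 ** 40; // binary unit: 2^40 bytes
  return {
    bits: byteCount * 8,    // assumes 8 bits per byte
    terabytes: byteCount / TERABYTE,
    tebibytes: byteCount / TEBIBYTE,
  };
}

const archive = describeBytes(24e12); // "roughly 24 trillion bytes"
console.log(archive.terabytes);            // 24
console.log(archive.tebibytes.toFixed(2)); // a little under 22
```

The decimal and binary figures differ because storage sizes are quoted in both conventions, which is one more reason the guide would benefit from defining the unit before using it.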

Error: Cannot set property 'innerHTML' of null

Error in main.js:

Uncaught TypeError: Cannot set property 'innerHTML' of null
    at showRemaining (main.js:247)

Stack trace line:247

showRemaining @ main.js:247
setInterval (async)
bind @ main.js:262
init @ main.js:223
(anonymous) @ main.js:14
l @ jquery-3.3.1.min.js:2
c @ jquery-3.3.1.min.js:2
setTimeout (async)
(anonymous) @ jquery-3.3.1.min.js:2
u @ jquery-3.3.1.min.js:2
fireWith @ jquery-3.3.1.min.js:2
fire @ jquery-3.3.1.min.js:2
u @ jquery-3.3.1.min.js:2
fireWith @ jquery-3.3.1.min.js:2
ready @ jquery-3.3.1.min.js:2
_ @ jquery-3.3.1.min.js:2

Property:

document.getElementById("countdown").innerHTML = ""; // null
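
A minimal sketch of a defensive fix, assuming the cause is simply that no element with the id `countdown` exists on the page when the timer fires. The real `main.js` is not shown in this report, so the function body, its parameters, and its return value here are assumptions for illustration only.

```javascript
// Hypothetical rewrite of showRemaining: the document is passed in as a
// parameter so the function can be exercised outside a browser.
function showRemaining(doc, text) {
  const el = doc.getElementById("countdown");
  if (el === null) {
    // Element missing: skip the update instead of throwing a TypeError.
    return false;
  }
  el.innerHTML = text;
  return true;
}

// Stand-in for a page with no #countdown element (assumption for the demo).
const pageWithoutCountdown = { getElementById: () => null };
console.log(showRemaining(pageWithoutCountdown, "")); // false
```

In a browser the call would be `showRemaining(document, "")`. Guarding the lookup only hides the symptom; if the countdown is no longer meant to be shown, removing the timer altogether may be the cleaner fix.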

Countdown on archiveprogram.github.com is broken, or confusing

https://archiveprogram.github.com/

Maybe it should be removed, and the text updated?

Use plain text (.txt) file instead of markdown

I think this guide should be a plain txt file that can be read and understood easily. GitHub would easily parse markdown to readable text. But what would happen when someone reaches the code vault in an age when there are no such parsers? 🤔

License concept

The guide says nothing about the concept of license. Since this is about open source, and most of the open source projects are released under one of the currently available open source licenses, I think it would be useful to show an overview of them (or at least of the very concept).

What about putting programming language tutorials?

The people of the future will not know how to code JavaScript

Futurism books

I can't see in the Tech Tree any book regarding futurism, technological singularity, predictions, etc.[1]

[1] https://en.wikipedia.org/wiki/Ray_Kurzweil

File formats guide

This site[1] contains plenty of information about thousands of file formats, it's a wiki and CC-0.

We can ask for an XML dump of the whole site.

[1] http://fileformats.archiveteam.org

Replace Javascript titles

Instead of:

Learning JavaScript by Ethan Brown (O'Reilly)
Mastering JavaScript Functional Programming by Federico Kereki (Packt)

I recommend the following books:

JavaScript: The Definitive Guide by David Flanagan (O'Reilly)
JavaScript: The Good Parts by Douglas Crockford (O'Reilly)

Wikis and other knowledge projects

Hello! I just read in your blog post[1] that you will add a snapshot of Wikipedia to the vault. It's great!

I created some time ago two repos when I learnt about the GitHub Vault, Wiktionary[2] and Wikispecies[3]. They include XML dumps of those wiki sites, only current versions for pages, and some metadata (date, username of last edit, etc).

Are you going to include in the wiki snapshot only Wikipedia? Only English? Full history or current versions only?

Other interesting projects which could be added, if there aren't issues (copyright or other), are: Project Gutenberg (60,000 books), Open Library (index of all known books) and Rosetta Project.

Thanks!

[1] https://github.blog/2020-02-03-the-arctic-code-vault-starts-production-and-your-open-source-projects-are-being-archived/
[2] https://github.com/emijrp/dictionaries-timecapsule
[3] https://github.com/emijrp/species-timecapsule

What about programming tools?

Nothing on shells, libraries, user space tools, nor even text editors?

Localizing the Guide to different languages

While open source software is written by people from all over the world, "it is not guaranteed that the inheritors of this archive will know English," as mentioned by the guide.

The guide also serves as a great introduction to programming and data storage. Through translation, we can make it more accessible not only to the Vault's future handlers, but also to the general public interested in computer principles.

Volunteers

Language	GitHub user
Bulgarian	@sahwar
German	@marcauberer
Romanian	@vladfrangu
Simplified Chinese	@xyx0826
Spanish	@erubio0

Book about Data Compression?

I don't know how comprehensive the documentation of the XZ format included on every reel is, but if it is just the C source code of xzdec or a similar tool, it would be prudent to include a book about data compression concepts (Huffman coding, arithmetic coding, LZ77, etc.). While it should be possible to port any program from any language to any other language without understanding why it works, this is certainly easier when you have some background on what the program is doing.

Separate Compilers, assembler, and operating systems

Compilers, assembler, and operating systems are three different topics.

Assembly should be part of Programming languages (see #73).

The other should be split in two topics:

Compilers and Interpreters

Lex & Yacc by John R. Levine, Tony Mason, Doug Brown (O'Reilly)
Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman (Addison-Wesley)
Modern Compiler Implementation in C by Andrew W. Appel, Maia Ginsburg (Cambridge University Press)
Structure and Interpretation of Computer Programs, by Abelson, Sussman, and Sussman (MIT Press)

Operating Systems

Modern Operating Systems by Andrew S. Tanenbaum (Pearson PLC)
Operating Systems: Design and Implementation by Andrew S. Tanenbaum (Prentice Hall)

And maybe:

The Art of Unix Programming by Eric S. Raymond (Addison-Wesley, but also available under a Creative Commons license)

Theory of Computation and Complexity Theory

There are no books about computation models and complexity theory.

If there is no space for many books I suggest these four:

Introduction to the Theory of Computation by Sipser: an undergrad introduction to the Theory of Computation.
Computational Complexity: A Modern Approach by Arora and Barak: a more advanced (grad) book about Complexity Theory covering many theorems and topics.
The Annotated Turing: A Guided Tour Through Alan Turing's Historic Paper on Computability and the Turing Machine by Petzold: what the title says. It also gives some context about the paper and a little bit about Turing's life.
Computers and Intractability: A Guide to the Theory of NP-Completeness by Garey and Johnson: a catalog of NP-Complete problems.

All of them are self-contained with sections or appendices covering the basics. They could be added to the "Fundamentals of computing and the Internet" section or the "Algorithms and data structures" section.

Years and calendars

When we talk about year 2020 or other dates, we should clarify we use Gregorian calendar.[1]

[1] https://en.wikipedia.org/wiki/Gregorian_calendar

German translation

Make a German translation for GUIDE.md

Stack Exchange Data Dump

Another suggestion for the Tech Tree, the Stack Exchange data dump.

https://archive.org/details/stackexchange

End frame inconsistency

So, using the hypothetical CPython example, the item in this list with the ID 12345 might have a start frame of 054321, a start byte of 03210321, an end frame of 054545, and an end byte of 12321232.

This specifies the end frame as 54545, but in the next paragraph:

Decode all frames from the start frame, 54321, to the end frame, 54544

54544 is referred to as the end frame.

Is this because the value given for the end frame is exclusive, meaning it is actually the number of the frame after the last frame in the range? Or is this just a mistake in the guide?

If it's the former, an explanation in the guide to clarify the inconsistency might be necessary.
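
The two readings can be compared with a small sketch. This is not the guide's decoding procedure; the helper name `framesToDecode` and its `endIsExclusive` flag are hypothetical, and the frame numbers are the ones quoted above. The zero-padded values are parsed as strings with an explicit radix of 10 so that they are never mistaken for octal numbers.

```javascript
// Hypothetical helper: which frames to decode for an index entry.
// Frame numbers arrive as zero-padded strings, so parse with radix 10.
function framesToDecode(startFrame, endFrame, endIsExclusive) {
  const first = parseInt(startFrame, 10);
  const end = parseInt(endFrame, 10);
  const last = endIsExclusive ? end - 1 : end;
  return { first, last, count: last - first + 1 };
}

console.log(framesToDecode("054321", "054545", true));
// { first: 54321, last: 54544, count: 224 }
console.log(framesToDecode("054321", "054545", false));
// { first: 54321, last: 54545, count: 225 }
```

Only the exclusive reading reproduces the 54544 in the guide's second paragraph, which is why an explicit statement of the convention would remove the ambiguity.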

Can we buy one little slice frame of the copy of the Archive Reel?

How much does it cost to produce a frame of the reel?
Can we buy one slice of it as a souvenir? Is there a plan for this?
Could the souvenir record a short Hello World program without compression, so that the buyer can try to read the code directly with a light source and some kind of magnifier? That would be kind of fun.
Is the process patented? Can other factories replicate it?

Request for disclaimer about private/sensitive info

This is a great project and I can understand its significance. However, I would like to point out that several people may (unknowingly) have private and/or sensitive info recorded in their repositories which may be archived. While this information is likely publicly accessible and may have been scraped by others, it would still serve a user well to be able to delete said info when they so choose. Now that this information may be archived, users may suddenly want to double check and/or opt-out.

Of course, GitHub's Privacy Policy clearly notes that:

If you choose to store any Sensitive Personal Information on our servers, you are responsible for complying with any regulatory controls regarding that data.

but it is very likely that people have not read the Privacy Policy and it would thus be helpful to add a small disclaimer that sensitive information may be present in a user's repository and they may want to review it. Again, I understand that this is the user's responsibility and they should have dealt with it in the first place, but a small nudge in the right direction would be immensely helpful.

French translation

Can you add a French translation for GUIDE.md ?

Stop Project For Coronavirus?

Ice Can Transfer Viruses Lots More Than Air

Programming languages

The current list of programming languages is quite... arbitrary.

I suggest the following languages based on their historical importance and current popularity:

Python
Java
Javascript
C
LISP (*)

(*) "Lisp is worth learning for the profound enlightenment experience you will have when you finally get it; that experience will make you a better programmer for the rest of your days, even if you never actually use Lisp itself a lot." (Eric S. Raymond)

Other languages that could be mentioned:

C#
Swift
Kotlin
Go
Rust
Scala
Lua
Perl
Julia
Haskell

The tech tree does not leave enough information to rebuild civilization from collapse.

What follows, which we call the Tech Tree, is a selection of works intended to describe how the world makes and uses software today, as well as an overview of how computers work and the foundational technologies required to make and use computers

I don't think the current tech tree is nearly focused enough to achieve "how computers work and the foundational technologies required to make and use computers".

Let's look at some examples.

Compilers, assembler, and operating systems

All three compiler books cover building a compiler for a low-level imperative language (essentially C). There is no book on compiling a functional programming language (SML/Haskell), a declarative language (Datalog), or a high-level dynamically typed language (Smalltalk). Furthermore, there is no book that covers SAT/SMT solving or garbage collection, which are foundations for compilers and programming languages (C compilers use graph coloring, which is NP-complete, for register allocation).

Programming languages

All of the books cover only contemporary programming languages, with zero books about how to design a language. How does one design a type system to ensure safety without sacrificing too much expressiveness, while still allowing efficient type checking/inference? How do you manage different kinds of effects (e.g. references, concurrency, probability, nondeterminism, exceptions)? How do you design a language so that it contains a few simple yet universal constructs, instead of having lots of ad hoc constructs and quickly becoming an overly complex language (see Gedanken and Scheme)?

Scientific computing

Scientific computing workloads often consist of:
0: lots of domain knowledge
1: numerical algorithms (e.g. finite element methods)
2: optimization for supercomputers

I didn't see any of them.

Machine learning

There are five books on deep learning, but no book on Bayesian methods, probabilistic graphical models, symbolic methods, or classical machine learning.

Even if this is just for deep learning, it is still not enough: there is no mention of how to rebuild a deep learning framework.

Recursion Error

According to my calculations, RecursionError: maximum recursion depth exceeded

Programming Languages Summary - Choice of languages

I don't know where the choice of programming languages came from (or how it compares to the stats within the archive), but it seems to differ from those in the latest State of the Octoverse:
https://octoverse.github.com/#top-languages

Shell being the most obvious omission (and key to compiling quite a bit of software; see also Make etc).

Concepts of time

In many places in this document phrases are used which relate to time or the passage of time. These concepts are artificial constructs that may be meaningless in the future.

For example (emphasis mine)

Most modern languages include libraries of pre-written functions, and such libraries can be very voluminous and elaborate. Some of today's most popular programming languages include:

C, one of the oldest and fastest

Learning MySQL and MariaDB by Russell J. T. Dyer (O'Reilly)
PostgreSQL Development Essentials by Manpreet Kaur, Baji Shaik (Packt)

With a single title about data modeling and database design:

Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design, Michael J. Hernandez (Addison-Wesley Professional)
