lex-2008 / backup3

Backups using rsync, busybox and SQLite

License: MIT License

Languages: Shell 71.47%, HTML 7.29%, JavaScript 14.95%, Python 6.29%
Topics: backup, backups, bash, busybox, rsync, sqlite3

backup3's Issues

BACKUP_TIME defined before lock release

This causes back-in-time dates (where created > deleted) for frequently changing files (e.g. rsync.log).

Should be fixable by setting $BACKUP_TIME after, not before, calling acquire_lock.

The issue happens when several (>=3) instances of backup.sh run in parallel, like this:

  • instance 1 starts, sets its time to 12:00, acquires the lock and starts doing its work
  • instance 2 starts, sets its time to 12:01, and waits for the lock
  • instance 3 starts, sets its time to 12:05, and waits for the lock
  • instance 1 finishes and releases the lock
  • now, with 50% probability, instance 3 gets the lock before instance 2 does
  • instance 3 acquires the lock and marks some file as "created at 12:05"
  • instance 3 finishes and releases the lock
  • instance 2 acquires the lock and marks the same file as "deleted at 12:01"

And, behold: we get a file which is created at 12:05 and deleted at 12:01. This shouldn't happen.
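
A minimal sketch of the proposed ordering, assuming acquire_lock as named above and a matching release_lock helper; the timestamp format and run_backup are placeholders:

acquire_lock                                # wait for other instances first
BACKUP_TIME=$(date '+%Y-%m-%d %H:%M:%S')    # take the timestamp only once we hold the lock
run_backup                                  # placeholder for the actual work
release_lock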

Search by file name/date/path

  • filename search (a sketch of building the query follows this list):

    • exact (filename = "$1")
    • like (filename LIKE "$1") - user provides % as needed
    • approx (filename LIKE "%$1%") - adds % before and after automatically
  • directory

    • any
    • current
    • current with subdirs
    • any, containing words (dirname LIKE "%$2%")
  • created

    • before
    • after
  • deleted

    • before
    • after
  • existed

    • at exact datetime
    • between … and …
      • complete ("полностью", created before first date, deleted after second) - file must stay the same version during the whole period
      • partial ("частично", created before second date, deleted after first) - file version existed somewhere inside this period
  • case?

result: table
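
A minimal sketch of building the filename part of such a query, assuming hypothetical $mode and $name variables coming from the search form; the history table and its columns are the ones used elsewhere in this repo, and backup.db is a placeholder path:

name_esc=$(printf '%s' "$name" | sed "s/'/''/g")     # double single quotes for SQL
case "$mode" in
    exact)  where="filename = '$name_esc'";;
    like)   where="filename LIKE '$name_esc'";;      # user supplies % themselves
    approx) where="filename LIKE '%$name_esc%'";;
esac
sqlite3 backup.db "SELECT dirname, filename, created, deleted FROM history WHERE $where;"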

rename files to include "deleted" date

this will slow backup.sh down a little, but will make it possible to rebuild the database from the files

also edit the dedup and check scripts, because they change the deleted timestamp
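
A heavily hypothetical sketch - the naming scheme isn't decided in this issue; it assumes the stored file simply gets the deleted timestamp appended when it leaves "current":

mv "$DATA_DIR/$stored_name" "$DATA_DIR/${stored_name},deleted=${BACKUP_TIME}"    # $DATA_DIR and $stored_name are placeholders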

password-protect subdirs

Currently, only top-level dirs are password-protected.

Instead, there should be a list of dirs to be protected, in an administrator-managed file.
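
A minimal sketch of such a check, assuming a hypothetical protected_dirs file (one directory per line) and $dir holding the directory requested via the WebUI:

while IFS= read -r protected; do
    case "$dir" in
        "$protected"|"$protected"/*)
            ask_password "$protected"    # hypothetical helper that challenges the user
            break;;
    esac
done < protected_dirs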

Autoupdate

cd to bindir and run git pull, weekly or so.
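
A minimal sketch as a cron entry; the /opt/backup3 path and the schedule are placeholders:

# m h dom mon dow  command
0 4 * * 1  cd /opt/backup3 && git pull --ff-only --quiet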

Rework backup.sh pipelines

Currently, the output of comm gets separated into "new" and "deleted" files, which are fed into four "{operate,sql} on {old,new} files" loops. Instead, it should be like this:

output of comm is cleaned [*] and goes into two pipes: one for files and one for sql (a sketch follows the notes on sorting below)

  • the "for sql" pipe gets processed by sed and piped into sqlite
    • sed builds sql commands for both new and old files
    • sqlite can use a transaction here
  • the "for files" pipe:
    • either split it into two - "new files" and "old files" - and operate on them as before
    • or use sed expressions to build shell commands that operate on the files (note that this will likely require the $BACKUP_LIST file to contain file sizes, too, since we add them to the database - but maybe that's a good thing?)
  • [*] cleaning the output of comm: remove filenames containing ", remove inode numbers, maybe replace tabs with a simpler way of distinguishing new/deleted files (N/D as before)

Sorting output of comm has pros and cons:

  • good for grouping operations in one directory together instead of running around
  • bad since we need to wait for comm to finish

but this seems to have a small impact in real life anyway
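
Putting the list above together, a minimal sketch using a named pipe; clean_comm, operate_on_files, build_sql.sed and the list file names are hypothetical stand-ins for the cleaning step, the file operations and the sed expressions described above:

mkfifo files.pipe
operate_on_files < files.pipe &         # the "for files" pipe
comm -3 old_list.txt new_list.txt \
    | clean_comm \
    | tee files.pipe \
    | sed -f build_sql.sed \
    | sqlite3 backup.db                 # the "for sql" pipe (sed can emit BEGIN/COMMIT around it)
wait
rm files.pipe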

Move everything into run_rsync

Note that it will break the "simple" method.

Reason: if we sync some dir only hourly, there's no need to run find on it every time. Moreover, if run_rsync figures out that we can't connect to the target, we don't need to do it either.

The solution is to run all the find / comm / sed / sqlite stuff at the end of each run_rsync part.

Passwords

  • JS should check the return code from fetch and ask the user for a password if it's incorrect
  • shell should forbid access to files in root

par2create monthly

  • also check last month's checksums
     
  • run independently of "normal" backup procedure in order not to block it
  • right after the first backup of the month:
  • get all current items (they will be in the "monthly" backup) created during the last month (otherwise they already have a backup - this can be checked in the database)
  • if size is less than 300kb - just copy them to *.bak
  • otherwise run par2create -n1 $filename (see the sketch after this list)
  • optionally grep "Recovery block count: 100" from its output and save it in the database - not sure why it's needed
     
  • clean.sh should rm "$filename"* - will have to read directory anyway
     
  • show.sh into separate dir
  • make checksum file
  • "inject" them into backup

Add "importance" SQL expression

Add a user-defined "importance" SQL expression (default 1), so that based on other factors like dirname and created/deleted dates we can define the relative importance of files (an example expression is sketched below).

The issue here is that it will be harder to show in the WebUI - it's not clear what times are still valid. For example, what should the progress bar show in these cases:

  • if files in dir a/b/ are cleared twice as early as those in a/
  • if *.bak files are deleted twice as early, and we're in a directory which has both *.bak and "normal" files
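
A hypothetical example of what such an expression could look like (the weights and patterns are made up; it would presumably be substituted into clean.sh's ordering):

IMPORTANCE_SQL="CASE WHEN filename LIKE '%.bak' THEN 0.5
                     WHEN dirname LIKE './photos/%' THEN 4
                     ELSE 1 END"
sqlite3 backup.db "SELECT dirname, filename, ($IMPORTANCE_SQL) AS importance FROM history LIMIT 10;"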

Store `/` at the end of dirname

It will enable the "LIKE optimisation" and make building the old_files table faster.

It should probably also include the leading ./. So for files in the root dir, dirname will be ./; for others, ./dirname/ or ./path/to/
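
For example, with the leading ./ and trailing / stored, "everything under a directory" becomes a plain prefix match, which SQLite can serve from an index on dirname (given the usual LIKE-optimisation preconditions, e.g. case_sensitive_like):

sqlite3 backup.db "SELECT filename FROM history WHERE dirname LIKE './path/to/%';"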

Speed up changesCache calculation in shouldBeAdded

Currently it loops through the whole alltimes array, checking each element like this:

changesCache[time]=alltimes.filter(a=>a<=time).length;

Instead, it should use the fact that alltimes is ordered (TODO: it should be), and check only the elements between the previously checked item and the current one.

rebuild.sh breaks case of dir-file-dir entry change

If you back up a directory with the following states:

  • time 1: a/b is a file
  • time 2: a is a file
  • time 3: a/c is a file

then, after rebuild.sh, the database will have the following entries (among others):

  • directory a from time 1 until now
  • file a from time 2 until time 3

Note that they clearly overlap. check.sh --fix will "fix" it in an unclean way - likely the directory will exist only from time 1 until time 2, but not at time 3.

Exit if there's nothing to do

When there was no change in the backed-up files, files.txt.diff will look like this:

D 0 backup.db
N 0 backup.db

In this case, we should exit instead of processing it (processing only creates a duplicate database backup).
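
A minimal sketch of such a check, assuming files.txt.diff holds exactly the lines shown above:

if ! grep -qv ' backup\.db$' files.txt.diff; then    # every line is about backup.db itself
    echo "no changes since the last run - nothing to do"
    exit 0
fi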

migrate.sh creates a lot of directory entries

The issue is that during the migration process, each directory gets a new inode every time (on every step). Hence, it is considered deleted and re-created, and receives a new entry. Properly, this could be fixed by nullifying directory inode numbers in backup.sh, like this:

UPDATE fs SET inode=0 WHERE type='d';

but then this would have to be done on every backup operation, which sounds wrong.

Hence, a workaround: we'll just discard all directory records and rebuild them from scratch. Note that it will lose directory data - just like rebuild.sh does. Execute these commands in your SQLite console:

delete from history where type='d';
WITH RECURSIVE cte(org, parent, name, rest, pos, data1, data2) AS (
                SELECT dirname, '', '.', SUBSTR(dirname,3), 0, min(created), max(deleted) FROM history
                GROUP BY dirname
        UNION ALL
                SELECT org,
                SUBSTR(org,1,pos+length(name)+1) as parent,
                SUBSTR(rest,1,INSTR(rest, '/')-1) as name,
                SUBSTR(rest,INSTR(rest,'/')+1) as rest,
                pos+length(name)+1 as pos,
                data1, data2
        FROM cte
        WHERE rest <> ''
)
INSERT INTO history (inode, type, dirname, filename, created, deleted, freq)
SELECT 0, 'd', parent, name, min(data1), max(data2), 0
FROM cte
WHERE pos <> 0
GROUP BY parent, name;
analyze;
vacuum;

It uses a recursive CTE, as explained in https://stackoverflow.com/a/34665012

Add to backup only after rsync finished successfully

Currently, we run find and add to the backup whatever state the "current" directory is in after the rsync run finished - no matter whether it was successful or partial. It can lead to an issue like this:

If you have edited a few files which depend on each other (for example shell scripts which call each other), an rsync run interrupted due to a timeout might have copied only one of them. In this case, if you later restore the backup from this time, you end up with a broken system (old version of one file, new version of another).

The solution is to rsync to a temporary dir and ... that one to the "current" one. Note that inodes, including those of directories, must not change!

Cleanup readme

first mention features/benefits:

  • intelligent cleanup - the main differentiator (I haven't seen it anywhere else)
    • starts deleting old files only when you're running out of disk space
    • prioritises hourly, daily, weekly, monthly backups (a "daily" backup is the state of the backup dir at midnight, a "monthly" one - at midnight of the first day of the month, etc) - so you have the same number of daily backups as monthly ones
      • actually it's based not on the number of backups, but on their age - so before deleting a backup made 4 months ago, it will delete a backup made 5 days ago
  • all features of rsync (since it's most often used for file transfer)
    • incremental backup - transfers only changed files, keeping history of several versions
    • remote backup
      • from multiple remote systems to single backup server
      • minimal requirements for remote systems - the directory with valuable data should be made available to the backup server via any of the following means:
        • either via an rsync server which is configured and running
        • or as shared folder
        • or via SSH access
        • exotic methods are also possible (like using external hard drive to backup air-gapped remote systems)
      • "pull" remote sync - if a remote system is compromised - it can't wipe backups (it can fill your backups with garbage, however - but that's true for any backup system)
      • read-only access from backup server to remote systems - if backup server is compromised - it can't wipe other systems (it can wipe your backups, however - but that's true for any backup system which doesn't use write-once medium like CD-R)
    • transparent protection against network transfer errors (when backing up from rsync server) - if a file is damaged due to network fluke, it will be retransferred
    • filtering of files to backup based on their path/name/ext
  • lack of vendor lock-in - no proprietary file format
    • all files in the backup are stored as plain files
    • you can access them using any standard file manager
    • this will be more accurate after #36 is implemented, since currently files are renamed
  • easy to navigate WebUI
    • restore any file or whole dir
    • password-protect any dir
    • remotely trigger backups
  • parity files to protect against bit rot
  • scheduled backup (scheduled manually via cron)
  • encryption (configured manually via encfs)

then installation on all platforms

then configuration (simple/advanced usage), WebUI, and all other complications can be moved to a separate "extra features" file

have a look at security

  • all SQL requests should use single quotes and double any embedded ones (see the sketch below)
  • .. in paths passed to api.sh should be corrected/ignored
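
A minimal sketch of both points, assuming hypothetical $query_arg and $path variables received by api.sh; table and column names follow the history table used elsewhere in this repo:

# quote a user-supplied value before it reaches sqlite3
safe=$(printf '%s' "$query_arg" | sed "s/'/''/g")    # double any embedded single quotes
sqlite3 backup.db "SELECT created, deleted FROM history WHERE filename = '$safe';"

# reject .. in a requested path
case "$path" in
    *..*) echo "invalid path" >&2; exit 1;;
esac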

BTRFS compatibility

The main issue is that df output is unreliable on BTRFS, so clean.sh should be reworked (a sketch follows the list):

  • it should run separately from backup.sh, about once an hour

  • when less than 10% of disk space is left, it should start deleting files, keeping track of their sizes

  • when the sum of the deleted files' sizes exceeds 10% of disk space, stop and run sudo btrfs balance start -dusage=50

  • the main backup.sh should just refuse to do anything when there's less than 10% of disk space free
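
A minimal sketch of the proposed loop; /backup, free_space_percent and delete_least_important are hypothetical stand-ins for the real mount point and helpers, and the 10% figures are the ones from this issue:

total_kb=$(df -Pk /backup | awk 'NR==2 {print $2}')
deleted_kb=0
while [ "$(free_space_percent /backup)" -lt 10 ]; do
    deleted_kb=$((deleted_kb + $(delete_least_important)))    # helper deletes one file and prints its size in KB
    if [ $((deleted_kb * 100 / total_kb)) -ge 10 ]; then
        sudo btrfs balance start -dusage=50 /backup           # rebalance, then stop for this run
        break
    fi
done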

Better preserve directories

As of now, directories are not tracked in the "data" directory and exist only as records in the database. As a result, rebuild.sh will happily wipe empty directories. And they might be needed, for example, as mount points.

Cleanup rendering process

First, render dirs
Then, ask for current files (or history of all files?)
Check for the currently selected timestamp before showing files

ALSO when requesting a password:
First, clean everything
Then, request…
