lex-2008 / backup3

Backups using rsync, busybox and SQLite

License: MIT License

Languages: Shell 71.47%, HTML 7.29%, JavaScript 14.95%, Python 6.29%
Topics: backup, backups, bash, busybox, rsync, sqlite3

backup3's Issues

BACKUP_TIME defined before lock release

This causes back-in-time dates (where created > deleted) for frequently changing files (e.g. rsync.log).

Should be fixable by setting $BACKUP_TIME after, not before, calling acquire_lock.

The issue happens when several (>=3) instances of backup.sh run in parallel, like this:

  • instance 1 starts, sets its time to 12:00, acquires the lock and starts doing its work
  • instance 2 starts, sets its time to 12:01, and waits for the lock
  • instance 3 starts, sets its time to 12:05, and waits for the lock
  • instance 1 finishes and releases the lock
  • now, with 50% probability, instance 3 gets the lock before instance 2 does
  • instance 3 acquires the lock and marks some file as "created at 12:05"
  • instance 3 finishes and releases the lock
  • instance 2 acquires the lock and marks the same file as "deleted at 12:01"

And, behold: we get a file which is created at 12:05 and deleted at 12:01. This shouldn't happen.
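
A minimal sketch of the proposed ordering, assuming acquire_lock as named above and a matching release_lock helper; the timestamp format and run_backup are placeholders:

acquire_lock                                # wait for other instances first
BACKUP_TIME=$(date '+%Y-%m-%d %H:%M:%S')    # take the timestamp only once we hold the lock
run_backup                                  # placeholder for the actual work
release_lock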

Search by file name/date/path

  • filename search (a sketch of building the query follows this list):

    • exact (filename = "$1")
    • like (filename LIKE "$1") - user provides % as needed
    • approx (filename LIKE "%$1%") - adds % before and after automatically
  • directory

    • any
    • current
    • current with subdirs
    • any, containing words (dirname LIKE "%$2%")
  • created

    • before
    • after
  • deleted

    • before
    • after
  • existed

    • at exact datetime
    • between … and …
      • complete ("полностью", created before first date, deleted after second) - file must stay the same version during the whole period
      • partial ("частично", created before second date, deleted after first) - file version existed somewhere inside this period
  • case?

result: table
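
A minimal sketch of building the filename part of such a query, assuming hypothetical $mode and $name variables coming from the search form; the history table and its columns are the ones used elsewhere in this repo, and backup.db is a placeholder path:

name_esc=$(printf '%s' "$name" | sed "s/'/''/g")     # double single quotes for SQL
case "$mode" in
    exact)  where="filename = '$name_esc'";;
    like)   where="filename LIKE '$name_esc'";;      # user supplies % themselves
    approx) where="filename LIKE '%$name_esc%'";;
esac
sqlite3 backup.db "SELECT dirname, filename, created, deleted FROM history WHERE $where;"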

rename files to include "deleted" date

this will slow backup.sh down a little, but will make it possible to rebuild the database from the files

also edit the dedup and check scripts, because they change the deleted timestamp
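
A heavily hypothetical sketch - the naming scheme isn't decided in this issue; it assumes the stored file simply gets the deleted timestamp appended when it leaves "current":

mv "$DATA_DIR/$stored_name" "$DATA_DIR/${stored_name},deleted=${BACKUP_TIME}"    # $DATA_DIR and $stored_name are placeholders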

password-protect subdirs

Currently, only top-level dirs are password-protected.

Instead, there should be a list of dirs to be protected, in an administrator-managed file.
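
A minimal sketch of such a check, assuming a hypothetical protected_dirs file (one directory per line) and $dir holding the directory requested via the WebUI:

while IFS= read -r protected; do
    case "$dir" in
        "$protected"|"$protected"/*)
            ask_password "$protected"    # hypothetical helper that challenges the user
            break;;
    esac
done < protected_dirs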

Autoupdate

cd to bindir and run git pull, weekly or so.
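
A minimal sketch as a cron entry; the /opt/backup3 path and the schedule are placeholders:

# m h dom mon dow  command
0 4 * * 1  cd /opt/backup3 && git pull --ff-only --quiet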

Rework backup.sh pipelines

Currently, the output of comm gets separated into "new" and "deleted" files, which are fed into four "{operate,sql} on {old,new} files" loops. Instead, it should be like this:

output of comm is cleaned [*] and goes into two pipes: one for files and one for sql (a sketch follows the notes on sorting below)

  • the "for sql" pipe gets processed by sed and piped into sqlite
    • sed builds sql commands for both new and old files
    • sqlite can use a transaction here
  • the "for files" pipe:
    • either split it into two - "new files" and "old files" - and operate on them as before
    • or use sed expressions to build shell commands that operate on the files (note that this will likely require the $BACKUP_LIST file to contain file sizes, too, since we add them to the database - but maybe that's a good thing?)
  • [*] cleaning the output of comm: remove filenames containing ", remove inode numbers, maybe replace tabs with a simpler way of distinguishing new/deleted files (N/D as before)

Sorting output of comm has pros and cons:

  • good for grouping operations in one directory together instead of running around
  • bad since we need to wait for comm to finish

but this seems to have a small impact in real life anyway
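
Putting the list above together, a minimal sketch using a named pipe; clean_comm, operate_on_files, build_sql.sed and the list file names are hypothetical stand-ins for the cleaning step, the file operations and the sed expressions described above:

mkfifo files.pipe
operate_on_files < files.pipe &         # the "for files" pipe
comm -3 old_list.txt new_list.txt \
    | clean_comm \
    | tee files.pipe \
    | sed -f build_sql.sed \
    | sqlite3 backup.db                 # the "for sql" pipe (sed can emit BEGIN/COMMIT around it)
wait
rm files.pipe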

Move everything into run_rsync

Note that it will break the "simple" method.

Reason: if we sync some dir only hourly, there's no need to run find on it every time. Moreover, if run_rsync figures out that we can't connect to the target, we don't need to do it either.

The solution is to run all the find / comm / sed / sqlite stuff at the end of each run_rsync part.

Passwords

  • JS should check the return code from fetch and ask the user for a password if it's incorrect
  • shell should forbid access to files in root

par2create monthly

  • also check last month's checksums
     
  • run independently of "normal" backup procedure in order not to block it
  • right after the first backup of the month:
  • get all current items (they will be in the "monthly" backup) created during the last month (otherwise they already have a backup - this can be checked in the database)
  • if size is less than 300kb - just copy them to *.bak
  • otherwise run par2create -n1 $filename (see the sketch after this list)
  • optionally grep "Recovery block count: 100" from its output and save it in the database - not sure why it's needed
     
  • clean.sh should rm "$filename"* - will have to read directory anyway
     
  • show.sh into separate dir
  • make checksum file
  • "inject" them into backup

Add "importance" SQL expression

Add a user-defined "importance" SQL expression (default 1), so that based on other factors like dirname and created/deleted dates we can define the relative importance of files (an example expression is sketched below).

The issue here is that it will be harder to show in the WebUI - it's not clear what times are still valid. For example, what should the progress bar show in these cases:

  • if files in dir a/b/ are cleared twice as early as those in a/
  • if *.bak files are deleted twice as early, and we're in a directory which has both *.bak and "normal" files
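
A hypothetical example of what such an expression could look like (the weights and patterns are made up; it would presumably be substituted into clean.sh's ordering):

IMPORTANCE_SQL="CASE WHEN filename LIKE '%.bak' THEN 0.5
                     WHEN dirname LIKE './photos/%' THEN 4
                     ELSE 1 END"
sqlite3 backup.db "SELECT dirname, filename, ($IMPORTANCE_SQL) AS importance FROM history LIMIT 10;"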

Store `/` at the end of dirname

It will enable the "LIKE optimisation" and make building the old_files table faster.

It should probably also include the leading ./. So for files in the root dir, dirname will be ./; for others, ./dirname/ or ./path/to/
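
For example, with the leading ./ and trailing / stored, "everything under a directory" becomes a plain prefix match, which SQLite can serve from an index on dirname (given the usual LIKE-optimisation preconditions, e.g. case_sensitive_like):

sqlite3 backup.db "SELECT filename FROM history WHERE dirname LIKE './path/to/%';"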

Speed up changesCache calculation in shouldBeAdded

Currently it loops through the whole alltimes array, checking each element like this:

changesCache[time]=alltimes.filter(a=>a<=time).length;

Instead, it should use the fact that alltimes is ordered (TODO: it should be), and check only the elements between the previously checked item and the current one.

rebuild.sh breaks case of dir-file-dir entry change

If you back up a directory with the following states:

  • time 1: a/b is a file
  • time 2: a is a file
  • time 3: a/c is a file

then, after rebuild.sh, the database will have the following entries (among others):

  • directory a from time 1 until now
  • file a from time 2 until time 3

Note that they clearly overlap. check.sh --fix will "fix" it in an unclean way - likely the directory will exist only from time 1 until time 2, but not at time 3.

Exit if there's nothing to do

When there was no change in the backed-up files, files.txt.diff will look like this:

D 0 backup.db
N 0 backup.db

In this case, we should exit instead of processing it (processing only creates a duplicate database backup).
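
A minimal sketch of such a check, assuming files.txt.diff holds exactly the lines shown above:

if ! grep -qv ' backup\.db$' files.txt.diff; then    # every line is about backup.db itself
    echo "no changes since the last run - nothing to do"
    exit 0
fi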

migrate.sh creates a lot of directory entries

The issue is that during the migration process, each directory gets a new inode every time (on every step). Hence, it is considered deleted and re-created, and receives a new entry. Properly, this could be fixed by nullifying directory inode numbers in backup.sh, like this:

UPDATE fs SET inode=0 WHERE type='d';

but then this would have to be done on every backup operation, which sounds wrong.

Hence, a workaround: we'll just discard all directory records and rebuild them from scratch. Note that it will lose directory data - just like rebuild.sh does. Execute these commands in your SQLite console:

delete from history where type='d';
WITH RECURSIVE cte(org, parent, name, rest, pos, data1, data2) AS (
                SELECT dirname, '', '.', SUBSTR(dirname,3), 0, min(created), max(deleted) FROM history
                GROUP BY dirname
        UNION ALL
                SELECT org,
                SUBSTR(org,1,pos+length(name)+1) as parent,
                SUBSTR(rest,1,INSTR(rest, '/')-1) as name,
                SUBSTR(rest,INSTR(rest,'/')+1) as rest,
                pos+length(name)+1 as pos,
                data1, data2
        FROM cte
        WHERE rest <> ''
)
INSERT INTO history (inode, type, dirname, filename, created, deleted, freq)
SELECT 0, 'd', parent, name, min(data1), max(data2), 0
FROM cte
WHERE pos <> 0
GROUP BY parent, name;
analyze;
vacuum;

It uses a recursive CTE, as explained in https://stackoverflow.com/a/34665012

Add to backup only after rsync finished successfully

Currently, we run find and add to the backup whatever state the "current" directory is in after the rsync run finished - no matter whether it was successful or partial. It can lead to an issue like this:

If you have edited a few files which depend on each other (for example shell scripts which call each other), an rsync run interrupted due to a timeout might have copied only one of them. In this case, if you later restore the backup from this time, you end up with a broken system (old version of one file, new version of another).

The solution is to rsync to a temporary dir and ... that one to the "current" one. Note that inodes, including those of directories, must not change!

Cleanup readme

first mention features/benefits:

  • intelligent cleanup - the main differentiator (I haven't seen it anywhere else)
    • starts deleting old files only when you're running out of disk space
    • prioritises hourly, daily, weekly, monthly backups (a "daily" backup is the state of the backup dir at midnight, a "monthly" one - at midnight of the first day of the month, etc) - so you have the same number of daily backups as monthly ones
      • actually it's based not on the number of backups, but on their age - so before deleting a backup made 4 months ago, it will delete a backup made 5 days ago
  • all features of rsync (since it's most often used for file transfer)
    • incremental backup - transfers only changed files, keeping history of several versions
    • remote backup
      • from multiple remote systems to single backup server
      • minimal requirements for remote systems - the directory with valuable data should be made available to the backup server via any of the following means:
        • either via an rsync server which is configured and running
        • or as shared folder
        • or via SSH access
        • exotic methods are also possible (like using external hard drive to backup air-gapped remote systems)
      • "pull" remote sync - if a remote system is compromised - it can't wipe backups (it can fill your backups with garbage, however - but that's true for any backup system)
      • read-only access from backup server to remote systems - if backup server is compromised - it can't wipe other systems (it can wipe your backups, however - but that's true for any backup system which doesn't use write-once medium like CD-R)
    • transparent protection against network transfer errors (when backing up from rsync server) - if a file is damaged due to network fluke, it will be retransferred
    • filtering of files to backup based on their path/name/ext
  • lack of vendor lock-in - no proprietary file format
    • all files in the backup are stored as plain files
    • you can access them using any standard file manager
    • this will be more accurate after #36 is implemented, since currently files are renamed
  • easy to navigate WebUI
    • restore any file or whole dir
    • password-protect any dir
    • remotely trigger backups
  • parity files to protect against bit rot
  • scheduled backup (scheduled manually via cron)
  • encryption (configured manually via encfs)

then installation on all platforms

then configuration (simple/advanced usage), WebUI, and all other complications can be moved to a separate "extra features" file

have a look at security

  • all SQL requests should use single quotes and double any embedded ones (see the sketch below)
  • .. in paths passed to api.sh should be corrected/ignored
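
A minimal sketch of both points, assuming hypothetical $query_arg and $path variables received by api.sh; table and column names follow the history table used elsewhere in this repo:

# quote a user-supplied value before it reaches sqlite3
safe=$(printf '%s' "$query_arg" | sed "s/'/''/g")    # double any embedded single quotes
sqlite3 backup.db "SELECT created, deleted FROM history WHERE filename = '$safe';"

# reject .. in a requested path
case "$path" in
    *..*) echo "invalid path" >&2; exit 1;;
esac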

BTRFS compatibility

The main issue is that df output is unreliable on BTRFS, so clean.sh should be reworked (a sketch follows the list):

  • it should run separately from backup.sh, about once an hour

  • when less than 10% of disk space is left, it should start deleting files, keeping track of their sizes

  • when the sum of the deleted files' sizes exceeds 10% of disk space, stop and run sudo btrfs balance start -dusage=50

  • the main backup.sh should just refuse to do anything when there's less than 10% of disk space free
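
A minimal sketch of the proposed loop; /backup, free_space_percent and delete_least_important are hypothetical stand-ins for the real mount point and helpers, and the 10% figures are the ones from this issue:

total_kb=$(df -Pk /backup | awk 'NR==2 {print $2}')
deleted_kb=0
while [ "$(free_space_percent /backup)" -lt 10 ]; do
    deleted_kb=$((deleted_kb + $(delete_least_important)))    # helper deletes one file and prints its size in KB
    if [ $((deleted_kb * 100 / total_kb)) -ge 10 ]; then
        sudo btrfs balance start -dusage=50 /backup           # rebalance, then stop for this run
        break
    fi
done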

Better preserve directories

As of now, directories are not tracked in the "data" directory and exist only as records in the database. As a result, rebuild.sh will happily wipe empty directories. And they might be needed, for example, as mount points.

Cleanup rendering process

First, render dirs
Then, ask for current files (or history of all files?)
Check for the currently selected timestamp before showing files

ALSO when requesting a password:
First, clean everything
Then, request…
