Once files are added (or removed or moved), we need to send those changes to all the other git clones, at both the git level and the key/value level.

immediate action items

  • At startup, and possibly periodically, or when the network connection changes, or some heuristic suggests that a remote was disconnected from us for a while, queue remotes for processing by the TransferScanner.
  • Ensure that when a remote receives content, and updates its location log, it syncs that update back out. Prerequisite for:
  • After git sync, identify new content that we don't have that is now available on remotes, and transfer. (Needed when we have a uni-directional connection to a remote, so it won't be uploading content to us.) Note: Does not need to use the TransferScanner, if we get and check a list of the changed files.
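A minimal sketch of the last item, assuming the git sync hands us a list of changed files with their keys, and that the merged location log can be consulted as a simple mapping. All names here (`plan_downloads`, `locations`, `reachable`) are hypothetical, not git-annex APIs.

```python
from typing import Dict, Iterable, List, Set, Tuple


def plan_downloads(
    changed: Iterable[Tuple[str, str]],  # (filename, key) pairs from the synced changes
    present_locally: Set[str],           # keys whose content we already have
    locations: Dict[str, Set[str]],      # key -> remotes that claim to have it (assumed shape)
    reachable: Set[str],                 # remotes we can currently connect to
) -> List[Tuple[str, str, str]]:
    """Return (filename, key, remote) downloads for content we lack."""
    downloads = []
    for filename, key in changed:
        if key in present_locally:
            continue  # nothing to do, content is already here
        # Only consider remotes that both have the key and can be reached.
        sources = sorted(locations.get(key, set()) & reachable)
        if sources:
            downloads.append((filename, key, sources[0]))
    return downloads
```

Because only the changed files are checked, this avoids a full TransferScanner pass, which is the point of the note above.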

longer-term TODO

  • Test MountWatcher on LXDE.
  • git-annex needs a simple speed control knob, which can be plumbed through to, at least, rsync. A good job for an hour in an airport somewhere.
  • Find a way to probe available outgoing bandwidth, to throttle so we don't bufferbloat the network to death.
  • Investigate the XMPP approach that dvcs-autosync uses, or other ways of signaling a change out of band.
  • Add a hook, so when there's a change to sync, a program can be run and do its own signaling.
  • --debug often shows unnecessary work being done. Optimise.
  • This assumes the network is connected. It's often not, so the cloud needs to be used to bridge between LANs.
  • Configurability, including only enabling git syncing but not data transfer; only uploading new files but not downloading; and only downloading files in some directories and not others. See for use cases: Wishlist: options for syncing meta-data and data
  • speed up git syncing by using the cached ssh connection for it too (will need to use GIT_SSH, which needs to point to a command to run, not a shell command line)
  • Map the network of git repos, and use that map to calculate optimal transfers to keep the data in sync. Currently a naive flood fill is done instead.
  • Find a more efficient way for the TransferScanner to find the transfers that need to be done to sync with a remote. Currently it walks the git working copy and checks each file.

data syncing

There are two parts to data syncing: first, map the network; second, decide what to sync when.

Mapping the network can reuse code in git annex map. Once the map is built, we want to find paths through the network that reach all nodes eventually, with the least cost. This is a minimum spanning tree problem, except with a directed graph, so really an arborescence problem.
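To make the arborescence idea concrete, here is a small sketch of the Chu-Liu/Edmonds algorithm, which computes the total cost of a minimum spanning arborescence rooted at one node. It assumes repos are numbered 0..n-1 and that each directed edge carries a numeric transfer cost; the function name and the edge representation are illustrative only, and it returns just the cost, not the chosen edges.

```python
import math


def min_arborescence_cost(n, root, edges):
    """Cost of the cheapest set of directed edges reaching every node from root.

    edges is a list of (source, target, cost). Returns None if some node
    cannot be reached from root.
    """
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge for every node.
        in_cost = [math.inf] * n
        pred = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_cost[v]:
                in_cost[v] = w
                pred[v] = u
        if any(v != root and in_cost[v] == math.inf for v in range(n)):
            return None
        in_cost[root] = 0

        # Step 2: look for cycles among the chosen edges.
        comp = [-1] * n      # contracted component id per node
        visited = [-1] * n   # which walk last visited a node
        count = 0
        for v in range(n):
            total += in_cost[v]
            x = v
            while visited[x] != v and comp[x] == -1 and x != root:
                visited[x] = v
                x = pred[x]
            if x != root and comp[x] == -1:
                # x is on a new cycle; label every node of that cycle.
                y = pred[x]
                while y != x:
                    comp[y] = count
                    y = pred[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total  # no cycles: the chosen edges form the arborescence

        # Step 3: contract each cycle to one node, adjust costs, and repeat.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        edges = [(comp[u], comp[v], w - in_cost[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root = comp[root]
        n = count
```

For example, with edges `[(0, 1, 10), (0, 2, 2), (2, 1, 3), (1, 3, 4), (2, 3, 8), (3, 2, 1)]` and root 0, the cheapest way to reach all four nodes is 0→2, 2→1, 1→3. How the costs would be assigned to real remotes is left open here.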

With the map, we can determine which nodes to push new content to. Then we need to control those data transfers, sending to the cheapest nodes first, and with appropriate rate limiting and control facilities.

This probably will need lots of refinements to get working well.

first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every reachable remote. This is worth doing first, since it's the simplest way to get the basic functionality of the assistant to work. And we'll need this anyway.
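As a rough illustration of flood syncing, the sketch below queues an upload of every key to every reachable remote that is not already known to have it. The `has_key` callback stands in for a location log lookup; it and the function name are assumptions made for the example.

```python
from collections import deque


def flood_queue(keys, remotes, has_key):
    """Queue (key, remote) uploads for every key a reachable remote lacks.

    has_key(remote, key) is a hypothetical lookup against the location log.
    """
    queue = deque()
    for remote in remotes:
        for key in keys:
            if not has_key(remote, key):
                queue.append((key, remote))
    return queue
```

No costs or network topology are considered, which is what makes this the simplest approach, and also the most wasteful one.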

TransferScanner

The TransferScanner thread needs to find keys that need to be Uploaded to a remote, or Downloaded from it.

How to find the keys to transfer? I'd like to avoid potentially expensive traversals of the whole git working copy if I can. (Currently, the TransferScanner does do the naive and possibly expensive scan of the git working copy.)

One way would be to do a git diff between the (unmerged) git-annex branches of the git repo, and its remote. Parse that for lines that add a key to either, and queue transfers. That should work fairly efficiently when the remote is a git repository. Indeed, git-annex already does such a diff when it's doing a union merge of data into the git-annex branch. It might even be possible to have the union merge and scan use the same git diff data.

But that approach has several problems:

  1. The list of keys it would generate wouldn't have associated git filenames, so the UI couldn't show the user what files were being transferred.
  2. Worse, without filenames, any later features to exclude files/directories from being transferred wouldn't work.
  3. Looking at a git diff of the git-annex branches would find keys that were added to either side while the two repos were disconnected. But if the two repos' keys were not fully in sync before they disconnected (which is quite possible; transfers could be incomplete), the diff would not show those older out of sync keys.
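For reference, the git diff idea could look something like the sketch below. It assumes unified diff text as input, that per-key location logs live in subdirectories of the git-annex branch as `<key>.log`, and that each log line has the form `<timestamp> <1|0> <uuid>`. Those format details are assumptions of this example and should be checked against the real branch layout. As problems 1 and 2 above point out, the result carries keys only, with no filenames.

```python
import posixpath


def keys_added_in_diff(diff_text):
    """Return the set of (key, uuid) pairs for location log lines added in a diff."""
    added = set()
    current_key = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # New file header: work out which key (if any) this log belongs to.
            path = line[4:].strip()
            if path.startswith("b/"):
                path = path[2:]
            name = posixpath.basename(path)
            # Assumption: key logs sit in subdirectories; top-level logs are skipped.
            if "/" in path and name.endswith(".log"):
                current_key = name[:-len(".log")]
            else:
                current_key = None
        elif line.startswith("+") and current_key is not None:
            fields = line[1:].split()
            # Assumed line format: timestamp, presence flag, repository uuid.
            if len(fields) == 3 and fields[1] == "1":
                added.add((current_key, fields[2]))
    return added
```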

The remote could also be a special remote. In this case, I have to either traverse the git working copy, or perhaps traverse the whole git-annex branch (which would have the same problems with filenames not being available).

If a traversal is done, should check all remotes, not just one. Probably worth handling the case where a remote is connected while in the middle of such a scan, so part of the scan needs to be redone to check it.

done

  1. Can use git annex sync, which already handles bidirectional syncing. When a change is committed, launch the part of git annex sync that pushes out changes. done; changes are pushed out to all remotes in parallel
  2. Watch .git/refs/remotes/ for changes (which would be pushed in from another node via git annex sync), and run the part of git annex sync that merges in received changes, and follow it by the part that pushes out changes (sending them to any other remotes). [The watching can be done with the existing inotify code! This avoids needing any special mechanism to notify a remote that it's been synced to.]
    done
  3. Periodically retry pushes that failed. done (every half an hour)
  4. Also, detect if a push failed due to not being up-to-date, pull, and repush. done
  5. Use a git merge driver that adds both conflicting files, so conflicts never break a sync. done

  6. on-disk transfers in progress information files (read/write/enumerate) done

  7. locking for the files, so redundant transfer races can be detected, and failed transfers noticed done
  8. transfer info for git-annex-shell done
  9. update files as transfers proceed. See progressbars (updating for downloads is easy; for uploads is hard)
  10. add Transfer queue TChan done
  11. add TransferInfo Map to DaemonStatus for tracking transfers in progress. done
  12. Poll transfer in progress info files for changes (use inotify again! wow! hammer, meet nail..), and update the TransferInfo Map done
  13. enqueue Transfers (Uploads) as new files are added to the annex by Watcher. done
  14. enqueue Transfers (Downloads) as new dangling symlinks are noticed by Watcher. done (Note: Needs git-annex branch to be merged before the tree is merged, so it knows where to download from. Checked and this is the case.)
  15. Write basic Transfer handling thread. Multiple such threads need to be able to be run at once. Each will need its own independent copy of the Annex state monad. done
  16. Write transfer control thread, which decides when to launch transfers. done
  17. Transfer watching has a race on kqueue systems, which makes finished fast transfers not be noticed by the TransferWatcher. Which in turn prevents the transfer slot being freed and any further transfers from happening. So, this approach is too fragile to rely on for maintaining the TransferSlots. Instead, need assistant threaded runtime, which would allow running something for sure when a transfer thread finishes. done
  18. Test MountWatcher on KDE, and add whatever dbus events KDE emits when drives are mounted. done
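The point of item 17, freeing a transfer slot for sure when the transfer thread finishes rather than by watching files, can be sketched as follows. This uses Python threads purely for illustration; it is not how the assistant's Haskell threaded runtime is written, and `run_transfers` and `do_transfer` are made-up names.

```python
import queue
import threading


def run_transfers(transfers, do_transfer, slots=2):
    """Run each transfer in its own thread, with at most `slots` running at once.

    Returns a dict mapping each transfer to True (succeeded) or False (failed).
    """
    free_slots = threading.Semaphore(slots)
    finished = queue.Queue()
    threads = []

    def worker(transfer):
        try:
            do_transfer(transfer)
            finished.put((transfer, True))
        except Exception:
            finished.put((transfer, False))
        finally:
            # Runs however the transfer ends, so the slot is never leaked.
            free_slots.release()

    for transfer in transfers:
        free_slots.acquire()  # blocks until a slot is free
        thread = threading.Thread(target=worker, args=(transfer,))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

    results = {}
    while not finished.empty():
        transfer, ok = finished.get()
        results[transfer] = ok
    return results
```

With a single slot, a failed transfer still releases its slot, so later transfers are not blocked behind it.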

On "git syncing" point number 9, on OSX you could potentially do this on a semi-regular basis:

system_profiler SPNetworkVolumeDataType
Volumes:

    net:

      Type: autofs
      Mount Point: /net
      Mounted From: map -hosts
      Automounted: Yes

    home:

      Type: autofs
      Mount Point: /home
      Mounted From: map auto_home
      Automounted: Yes

and

x00:~ jtang$ system_profiler SPUSBDataType
USB:

    USB High-Speed Bus:

      Host Controller Location: Built-in USB
      Host Controller Driver: AppleUSBEHCI
      PCI Device ID: 0x0aa9 
      PCI Revision ID: 0x00b1 
      PCI Vendor ID: 0x10de 
      Bus Number: 0x26 

        Hub:

          Product ID: 0x2504
          Vendor ID: 0x0424  (SMSC)
          Version: 0.01
          Speed: Up to 480 Mb/sec
          Location ID: 0x26200000 / 3
          Current Available (mA): 500
          Current Required (mA): 2

            USB to ATA/ATAPI Bridge:

              Capacity: 750.16 GB (750,156,374,016 bytes)
              Removable Media: Yes
              Detachable Drive: Yes
              BSD Name: disk1
              Product ID: 0x2338
              Vendor ID: 0x152d  (JMicron Technology Corp.)
              Version: 1.00
              Serial Number: 313541813001
              Speed: Up to 480 Mb/sec
              Manufacturer: JMicron
              Location ID: 0x26240000 / 5
              Current Available (mA): 500
              Current Required (mA): 2
              Partition Map Type: MBR (Master Boot Record)
              S.M.A.R.T. status: Not Supported
              Volumes:
                Porta-Disk:
                  Capacity: 750.16 GB (750,156,341,760 bytes)
                  Available: 668.42 GB (668,424,208,384 bytes)
                  Writable: Yes
                  File System: ExFAT
....

I think it's possible to programmatically get this information either from the CLI (it dumps out XML output if required) or some development library. There is also DBUS in macports, but I have never had much interaction with it, so I don't know if it's good or bad on OSX.
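As a sketch of the CLI route, the XML output is a property list, which Python's `plistlib` can parse. The example below assumes the output nests entries under `_items` and labels them with `_name`; that layout is an assumption of this sketch and should be checked against real output.

```python
import plistlib


def names_in_profile(plist_bytes):
    """Collect every `_name` value found anywhere in a system_profiler plist."""
    data = plistlib.loads(plist_bytes)
    names = []

    def walk(node):
        if isinstance(node, dict):
            if "_name" in node:
                names.append(node["_name"])
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return names
```

On a Mac the input could come from something like `subprocess.run(["system_profiler", "-xml", "SPUSBDataType"], capture_output=True, check=True).stdout`. Comparing the names between two runs would then show newly attached drives.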

Why rely on the cloud when you can instead use XMPP and jingle to perform NAT traversal for you? AFAICT, it also means that traffic won't leave your router if the two endpoints are behind the same router.

How will the assistant know which files' data to distribute between the repos?

I'm using git-annex and its numcopies attribute to maintain a redundant archive spread over different computers and USB drives. Not all drives should get a copy of everything, e.g. the USB drive I take to work should not automatically get a copy of family pictures.

How about .gitattributes?

  • * annex.auto-sync-data = false # don't automatically sync the data
  • archive/ annex.auto-push-repos = NAS # everything added to archive/ in any repo goes automatically to the NAS remote.
  • work/ annex.auto-synced-repos = LAPTOP WORKUSB # everything added to work/ in LAPTOP or WORKUSB gets synced to WORKUSB and LAPTOP
  • work/ annex.auto-push-repos = LAPTOP WORKUSB # stuff added to work/ anywhere gets synced to LAPTOP and WORKUSB
  • important/ annex.auto-sync-data = true # push data to all repos
  • webserver_logs/ annex.remote.WEBSERVER.auto-push-repos = S3 # only the assistant running in WEBSERVER pushes webserver_logs/ to S3 remote