The git-annex assistant is being crowd funded on Kickstarter. I'll be blogging about my progress here on a semi-daily basis.
Spent yesterday and today making the WebApp handle adding removable drives.
While it needs more testing, I think that it's now possible to use the WebApp for a complete sneakernet usage scenario.
- Start up the webapp, let it make a local repo.
- Add some files, by clicking to open the file manager, and dragging them in.
- Plug in a drive, and tell the webapp to add it.
- Wait while files sync..
- Take the drive to another computer, and repeat the process there.
No command-line needed, and files will automatically be synced between two or more computers using the drive.
Sneakernet is only one usage scenario for the git-annex assistant, but I'm really happy to have one scenario 100% working!
Indeed, since the assistant and webapp can now actually do something
useful, I'll probably be merging them into `master` soon.
Details follow..
So, yesterday's part of this was building the configuration page to add a removable drive. That needs to be as simple as possible, and it currently consists of a list of things git-annex thinks might be mount points of removable drives, along with how much free space they have. Pick a drive, click the pretty button, and away it goes..
(I decided to make the page so simple it doesn't even ask where you want to put the directory on the removable drive. It always puts it in an "annex" directory. I might add an expert screen later, but experts can always set this up themselves at the command line too.)
I also fought with Yesod and Bootstrap rather a lot to make the form look good. Didn't entirely succeed, and had to file a bug on Yesod about its handling of check boxes. (Bootstrap also has a bug, IMHO; its drop down lists are not always sized wide enough for their contents.)
Ideally this configuration page would listen for mount events, and refresh its list. I may add that eventually; I didn't have a handy channel it could use to do that, so deferred it. Another idea is to have the mount event listener detect removable drives that don't have an annex on them yet, and pop up an alert with a link to this configuration page.
Making the form led to a somewhat interesting problem: How to tell if a mounted
filesystem is a removable drive, or some random thing like `/proc` or
a fuse filesystem. My answer, besides checking that the user can
write to it, was various heuristics, which seem to work ok, at least here..
[[!format haskell """
sane Mntent { mnt_dir = dir, mnt_fsname = dev }
    {- We want real disks like /dev/foo, not
     - dummy mount points like proc or tmpfs or
     - gvfs-fuse-daemon. -}
    | not ('/' `elem` dev) = False
    {- Just in case: These mount points are surely not
     - removable disks. -}
    | dir == "/" = False
    | dir == "/tmp" = False
    | dir == "/run/shm" = False
    | dir == "/run/lock" = False
    | otherwise = True
"""]]
Today I did all the gritty coding to make it create a git repository on the removable drive, and tell the Annex monad about it, and ensure it gets synced.
As part of that, it detects when the removable drive's filesystem doesn't support symlinks, and makes a bare repository in that case. Another expert level config option that's left out for now is to always make a bare repository, or even to make a directory special remote rather than a git repository at all. (But directory special remotes cannot support the sneakernet use case by themselves...)
Another somewhat interesting problem was what to call the git remotes that it sets up on the removable drive and the local repository. Again this could have an expert-level configuration, but the defaults I chose are to use the hostname as the remote name on the removable drive, and to use the basename of the mount point of the removable drive as the remote name in the local annex.
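For illustration, here's a minimal sketch of those defaults (my reconstruction, not the actual code; `getHostName` comes from the network package, and `mountpoint` is just whatever path the drive was mounted at):

[[!format haskell """
import Network.BSD (getHostName)
import System.FilePath (takeFileName, dropTrailingPathSeparator)

-- Returns (name for the remote on the drive, name for the local remote).
remoteNames :: FilePath -> IO (String, String)
remoteNames mountpoint = do
    host <- getHostName
    return (host, takeFileName (dropTrailingPathSeparator mountpoint))
"""]]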
Originally, I had thought of this as cloning the repository to the drive.
But, partly due to luck, I started out just doing a `git init` to make
the repository (I had a function lying around to do that..).
And as I worked on it some more, I realized this is not as simple as a clone. It's a bi-directional sync/merge, and indeed the removable drive may have all the data already in it, while the local repository may have just been created. Handling all the edge cases of that (like the local repository not yet having a "master" branch..) was fun!
Nothing flashy today; I was up all night trying to download photos taken by a robot lowered onto Mars by a skycrane.
Some work on alerts. Added an alert when a file transfer succeeds or fails. Improved the alert combining code so it handles those alerts, and simplified it a lot, and made it more efficient.
Also made the text of action alerts change from present to past tense when
the action finishes. To support that I wrote a fun data type, a `TenseString`
that can be rendered in either tense.
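Something along these lines (a reconstruction of the idea, not the exact type in git-annex):

[[!format haskell """
data Tense = Present | Past

data TenseChunk
    = Tensed String String -- present and past forms of a verb
    | UnTensed String

type TenseString = [TenseChunk]

renderTense :: Tense -> TenseString -> String
renderTense tense = unwords . map chunk
  where
    chunk (UnTensed s) = s
    chunk (Tensed present past) = case tense of
        Present -> present
        Past -> past
"""]]

So an alert built from `[Tensed "syncing" "synced", UnTensed "with usbdrive"]` renders as "syncing with usbdrive" while the action runs, and "synced with usbdrive" once it's done.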
Today I added a "Files" link in the navbar of the WebApp. It looks like a regular hyperlink, but clicking on it opens up your desktop's native file manager, to manage the files in the repository!
Quite fun to be able to do this kind of thing from a web page. :)
Made `git annex init` (and the WebApp) automatically generate a description
of the repo when none is provided.
Also worked on the configuration pages some. I don't want to get ahead of myself by diving into the full configuration stage yet, but I am at least going to add a configuration screen to clone the repo to a removable drive.
After that, the list of transfers on the dashboard needs some love. I'll probably start by adding UI to cancel running transfers, and then try to get drag and drop reordering of transfers working.
Now installing git-annex automatically generates a freedesktop.org .desktop
file, and installs it, either system-wide (root) or locally (user). So
Menu -> Internet -> Git Annex
will start up the web app.
(I don't entirely like putting it on the Internet menu, but the Accessories menu is not any better (and much more crowded here), and there's really no menu where it entirely fits.)
I generated that file by writing a generic library to deal with freedesktop.org desktop files and locations. Which seemed like overkill at the time, but then I found myself continuing to use that library. Funny how that happens.
So, there's also another .desktop file that's used to autostart the
`git annex assistant` daemon when the user logs into the desktop.
This even works when git-annex is installed to the ugly non-PATH location
`.cabal/bin/git-annex` by Cabal! To make that work, it records the path
the binary is at to a freedesktop.org data file, at install time.
That should all work in Gnome, KDE, XFCE, etc. Not Mac OSX I'm guessing...
Also today, I added a sidebar notification when the assistant notices new files. To make that work well, I implemented merging of related sidebar action notifications, so the effect is that there's one notification that collects a list of recently added files, and transient notifications that show up if a really big file is taking a while to checksum.
I'm pleased that the notification interface is at a point where I was able to implement all that, entirely in pure functional code.
Based on the results of yesterday's poll, the WebApp defaults to
`~/Desktop/annex` when run in the home directory. If there's no Desktop
directory, it uses just `~/annex`. And if run from some other place than
the home directory, it assumes you want to use the cwd. Of course, you can
change this default, but I think it's a good one for most use cases.
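The decision looks something like this (a sketch using the directory package, not the WebApp's exact logic):

[[!format haskell """
import System.Directory (getHomeDirectory, getCurrentDirectory, doesDirectoryExist)
import System.FilePath ((</>))

defaultRepositoryPath :: IO FilePath
defaultRepositoryPath = do
    home <- getHomeDirectory
    cwd <- getCurrentDirectory
    if home == cwd
        then do
            -- prefer a folder that shows up on the user's desktop
            desktop <- doesDirectoryExist (home </> "Desktop")
            return $ if desktop
                then home </> "Desktop" </> "annex"
                else home </> "annex"
        else return cwd
"""]]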
My work today has all been on making one second of the total lifetime of the WebApp work. It's the very tricky second in between clicking on "Make repository" and being redirected to a WebApp running in your new repository. The trickiness involves threads, and MVars, and multiple web servers, and I don't want to go into details here. I'd rather forget. ;-)
Anyway, it works; you can run "git annex webapp" and be walked right
through to having a usable repository! Now I need to see about adding
that to the desktop menus, and making "git annex webapp", when run a second
time, remember where your repository is. I'll use
`~/.config/git-annex/repository` for storing that.
Started work on the interface displayed when the webapp is started with no existing git-annex repository. All this needs to do is walk the user through setting up a repository, as simply as possible.
A tricky part of this is that most of git-annex runs in the Annex monad, which requires a git-annex repository. Luckily, much of the webapp does not run in Annex, and it was pretty easy to work around the parts that do. Dodged a bullet there.
There will, however, be a tricky transition from this first run webapp, to a normally fully running git-annex assistant and webapp. I think the first webapp will have to start up all the normal threads once it makes the repository, and then redirect the user's web browser to the full webapp.
Anyway, the UI I've made is very simple: A single prompt, for the directory where the repository should go. With, eventually, tab completion, sanity checking (putting the repository in "/" is not good, and making it all of "$HOME" is probably unwise).
Ideally most users will accept the default, which will be something
like `/home/username/Desktop/Annex`, and be through this step in seconds.
Suggestions for a good default directory name appreciated.. Putting it on a folder that will appear on the desktop seems like a good idea, when there's a Desktop directory. I'm unsure if I should name it something specific like "GitAnnex", or something generic like "Synced".
Time for the first of probably many polls!
What should the default directory name used by the git-annex assistant be?
[[!poll open=no 19 "Annex" 7 "GitAnnex" 10 "Synced" 0 "AutoSynced" 1 "Shared" 10 "something lowercase!" 1 "CowboyNeal" 1 "Annexbox"]]
(Note: This is a wiki. You can edit this page to add your own poll options!)
Lots of WebApp UI improvements, mostly around the behavior when displaying alert messages. Trying to make the alerts informative without being intrusively annoying; I think I've mostly succeeded now.
Also, added an intro display. Shown is the display with only one repo; if there are more repos it also lists them all.
Focus today was writing a notification broadcaster library. This is a way to send a notification to a set of clients, any of which can be blocked waiting for a new notification to arrive. A complication is that any number of clients may be dead, and we don't want stale notifications for those clients to pile up and leak memory.
It took me 3 tries to find the solution, which turns out to be head-smackingly simple: An array of SampleVars, one per client.
Using SampleVars means that clients only see the most recent notification, but when the notification is just "the assistant's state changed somehow; display a refreshed rendering of it", that's sufficient.
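Here's the core of the idea, sketched with a plain list where the real library uses an array indexed by notification id. Each client registers its own `SampleVar`; a write overwrites any unread value, so a dead client holds at most one stale notification:

[[!format haskell """
import Control.Concurrent.SampleVar
import Data.IORef

newtype Broadcaster = Broadcaster (IORef [SampleVar ()])

newBroadcaster :: IO Broadcaster
newBroadcaster = fmap Broadcaster (newIORef [])

-- Each client gets its own SampleVar to block on.
addClient :: Broadcaster -> IO (SampleVar ())
addClient (Broadcaster ref) = do
    sv <- newEmptySampleVar
    modifyIORef ref (sv :)
    return sv

-- Poke every client. writeSampleVar never blocks, no matter
-- how long dead clients have gone without reading.
sendNotification :: Broadcaster -> IO ()
sendNotification (Broadcaster ref) =
    mapM_ (flip writeSampleVar ()) =<< readIORef ref

-- Block until the next notification arrives.
waitNotification :: SampleVar () -> IO ()
waitNotification = readSampleVar
"""]]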
First use of that was to make the thread that woke up every 10 minutes and checkpointed the daemon status to disk also wait for a notification that it changed. So that'll be more current, and use less IO.
Second use, of course, was to make the WebApp block long polling clients until there is really a change since the last time the client polled.
To do that, I made one change to my Yesod routes:
[[!format diff """
-/status StatusR GET
+/status/#NotificationId StatusR GET
"""]]
Now I find another reason to love Yesod, because after doing that,
I hit "make".. and fixed the type error. And hit make.. and fixed
the type error. And then it just freaking worked! Html was generated with
all urls to /status including a `NotificationId`, and the handler for
that route got it and was able to use it:
[[!format haskell """
{- Block until there is an updated status to display. -}
b <- liftIO $ getNotificationBroadcaster webapp
liftIO $ waitNotification $ notificationHandleFromId b nid
"""]]
And now the WebApp is able to display transfers in realtime!
When I have both the WebApp and `git annex get` running on the same screen,
the WebApp displays files that git-annex is transferring about as fast
as the terminal updates.
The progressbars still need to be sorted out, but otherwise the WebApp is a nice live view of file transfers.
I also had some fun with Software Transactional Memory. Now when the assistant moves a transfer from its queue of transfers to do, to its map of transfers that are currently running, it does so in an atomic transaction. This will avoid the transfer seeming to go missing (or be listed twice) if the webapp refreshes at just the wrong point in time. I'm really starting to get into STM.
Next up, I will be making the WebApp maintain a list of notices, displayed on its sidebar, scrolling new notices into view, and removing ones the user closes, and ones that expire. This will be used for displaying errors, as well as other communication with the user (such as displaying a notice while a git sync is in progress with a remote, etc). Seems worth doing now, so the basic UI of the WebApp is complete with no placeholders.
Some days I spend 2 hours chasing red herrings (like "perhaps my JSON ajax calls aren't running asynchronously?") that turn out to be a simple one-word typo. This was one of them.
However, I did get the sidebar displaying alert messages, which can be easily sent to the user from any part of the assistant. This includes transient alerts of things it's doing, which disappear once the action finishes, and long-term alerts that are displayed until the user closes them. It even supports rendering arbitrary Yesod widgets as alerts, so they can also be used for asking questions, etc.
Time for a screencast!
The webapp now displays actual progress bars, for the actual transfers that the assistant is making! And it's seriously shiny.
Yes, I used Bootstrap. I can see why so many people are using it, that the common complaint is everything looks the same. I spent a few hours mocking up the transfer display part of the WebApp using Bootstrap, and arrived at something that doesn't entirely suck remarkably quickly.
The really sweet thing about Bootstrap is that when I resized my browser to the shape of a cell phone, it magically redrew the WebApp like so:
To update the display, the WebApp uses two techniques. On noscript browsers, it just uses a meta refresh, which is about the best I can do. I welcome feedback; it might be better to just have an "Update" button in this case.
With javascript enabled, it uses long polling, done over AJAX. There are some other options I considered, including websockets, and server-sent events. Websockets seem too new, and while there's a WAI module supporting server-sent events, and even an example of them in the Yesod book, the module is not packaged for Debian yet. Anyway, long polling is the most widely supported, so a good starting place. It seems to work fine too; I don't really anticipate needing the more sophisticated methods.
(Incidentally, this is the first time I've ever written code that uses AJAX.)
Currently the status display is rendered in html by the web server, and just updated into place by javascript. I like this approach since it keeps the javascript code to a minimum and the pure haskell code to a maximum. But who knows, I may have to switch to JSON that gets rendered by javascript, for some reason, later on.
I was very happy with Yesod when I managed to factor out a general purpose widget that adds long-polling and meta-refresh to any other widget. I was less happy with Yesod when I tried to include jquery on my static site and it kept serving up a truncated version of it. Eventually worked around what's seemingly a bug in the default WAI middleware, by disabling that middleware.
Also yesterday I realized there were about 30 comments stuck in moderation on this website. I thought I had a feed of those, but obviously I didn't. I've posted them all, and also read them all.
Next up is probably some cleanup of bugs and minor todos. Including
figuring out why `watch` has started to segfault on OSX when it was
working fine before.
After that, I need to build a way to block the long polling request until the DaemonStatus and/or TransferQueue change from the version previously displayed by the WebApp. An interesting concurrency problem..
Once I have that working, I can reduce the current 3 second delay between refreshes to a very short delay, and the WebApp will update in near-realtime as changes come in.
After an all-nighter, I have `git annex webapp` launching a WebApp!
It doesn't do anything useful yet, just uses Yesod to display a couple of hyperlinked pages and a favicon, securely.
The binary size grew rather alarmingly, BTW. :) Indeed, it's been growing for months..
    -rwxr-xr-x 1 root root 9.4M Jul 21 16:59 git-annex-no-assistant-stripped
    -rwxr-xr-x 1 joey joey 12M Jul 25 20:54 git-annex-no-webapp-stripped
    -rwxr-xr-x 1 joey joey 17M Jul 25 20:52 git-annex-with-webapp-stripped
Along the way, some Not Invented Here occurred:
I didn't use the yesod scaffolded site, because it's a lot of what seems mostly to be cruft in this use case. And because I don't like code generated from templates that people are then expected to edit. Ugh. That's my least favorite part of Yesod. This added some pain, since I had to do everything the hard way.
I didn't use wai-handler-launch because:

- It seems broken on IPv6 capable machines (it always opens
  `http://127.0.0.1:port/` even though it apparently doesn't always listen
  there.. I think it was listening on my machine's ipv6 address instead.
  I know, I know; I should file a bug about this..)
- It always uses port 4587, which is insane. What if you have two webapps?
- It requires javascript in the web browser, which is used to ping the
  server, and shut it down when the web browser closes (which behavior is
  wrong for git-annex anyway, since the daemon should stay running across
  browser closes).
- It opens the webapp on web server startup, which is wrong for git-annex;
  instead the command `git annex webapp` will open the webapp, after
  `git annex assistant` has started the web server.
Instead, I rolled my own WAI webapp launcher, that binds to any free port
on localhost. It does use `xdg-open` to launch the web browser,
like wai-handler-launch does (or just `open` on OS X).
Also, I wrote my own WAI logger, which logs using System.Log.Logger,
instead of to stdout, like `runDebug` does.
The webapp only listens for connections from localhost, but that's
not sufficient "security". Instead, I added a secret token to
every url in the webapp, that only `git annex webapp` knows about.
But, if that token is passed to `xdg-open` on its command line,
it will be briefly visible to local attackers in the parameters of
`xdg-open`.. And if the web browser's not already running, it'll run
with it as a parameter, and be very visible.
So instead, I used a nasty hack. On startup, the assistant
will create an html file, readable only by the user, that redirects
the user to the real site url. Then `git annex webapp` will run
xdg-open on that file.
Making Yesod check the `auth=` parameter (to verify that the secret token
is right) is when using Yesod started to pay off. Yesod has a simple
`isAuthorized` method that can be overridden to do your own authentication
like this.
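A sketch of what that looks like (a fragment, assuming the usual Yesod scaffolding; `WebApp` and `secretToken` stand in for the real foundation type and its token field):

[[!format haskell """
instance Yesod WebApp where
    -- Refuse to serve any route unless the auth= parameter carries
    -- the secret token. (The "auth" literal needs OverloadedStrings.)
    isAuthorized _route _iswrite = do
        webapp <- getYesod
        req <- getRequest
        return $ case lookup "auth" (reqGetParams req) of
            Just token | token == secretToken webapp -> Authorized
            _ -> AuthenticationRequired
"""]]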
But Yesod really started to shine when I went to add the `auth=` parameter
to every url in the webapp. There's a `joinPath` method that can be used
to override the default url builder. And every type-safe url in the
application goes through there, so it's perfect for this.
I just had to be careful to make it not add `auth=` to the url for the
favicon, which is included in the "Permission Denied" error page. That'd be
an amusing security hole..
Next up: Doing some AJAX to get a dynamic view of the state of the daemon, including currently running transfers, in the webapp. AKA stuff I've never done before, and that, unlike all this heavy Haskell Yesod, scares me. :)
Milestone: I can run `git annex assistant`, plug in a USB drive, and it
automatically transfers files to get the USB drive and current repo back in
automatically transfers files to get the USB drive and current repo back in
sync.
I decided to implement the naive scan, to find files needing to be
transferred. So it walks through `git ls-files` and checks each file
in turn. I've deferred less expensive, more sophisticated approaches to later.
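The whole naive scan amounts to little more than this sketch (with a hypothetical `queueTransfersFor` standing in for the assistant's real queueing code):

[[!format haskell """
import System.Process (readProcess)

type Remote = String -- stand-in for git-annex's real Remote type

-- Hypothetical: queues any transfers needed to sync this file.
queueTransfersFor :: Remote -> FilePath -> IO ()
queueTransfersFor _ _ = return ()

-- Walk everything git knows about, checking each file in turn.
naiveScan :: Remote -> IO ()
naiveScan remote = do
    out <- readProcess "git" ["ls-files"] ""
    mapM_ (queueTransfersFor remote) (lines out)
"""]]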
I did some work on the TransferQueue, which now keeps track of the length of the queue, and can block attempts to add Transfers to it if it gets too long. This was a nice use of STM, which let me implement that without using any locking.
[[!format haskell """
atomically $ do
    sz <- readTVar (queuesize q)
    if sz <= wantsz
        then enqueue schedule q t (stubInfo f remote)
        else retry -- blocks until queuesize changes
"""]]
Anyway, the point was that, as the scan finds Transfers to do, it doesn't build up a really long TransferQueue, but instead is blocked from running further until some of the files get transferred. The resulting interleaving of the scan thread with transfer threads means that transfers start fairly quickly upon a USB drive being plugged in, and kind of hides the inefficiencies of the scanner, which will most of the time be swamped out by the IO bound large data transfers.
At this point, the assistant should do a good job of keeping repositories in sync, as long as they're all interconnected, or on removable media like USB drives. There's lots more work to be done to handle use cases where repositories are not well-connected, but since the assistant's syncing now covers at least a couple of use cases, I'm ready to move on to the next phase. Webapp, here we come!
Made the MountWatcher update state for remotes located in a drive that gets mounted. This was tricky code. First I had to make remotes declare when they're located in a local directory. Then it has to rescan git configs of git remotes (because the git repo mounted at a mount point may change), and update all the state that a newly available remote can affect.
And it works: I plug in a drive containing one of my git remotes, and the assistant automatically notices it and syncs the git repositories.
But, data isn't transferred yet. When a disconnected remote becomes connected, keys should be transferred in both directions to get back into sync.
To that end, added Yet Another Thread; the TransferScanner thread will scan newly available remotes to find keys, and queue low priority transfers to get them fully in sync.
(Later, this will probably also be used for network remotes that become available when moving between networks. I think network-manager sends dbus events it could use..)
This new thread is missing a crucial piece: it doesn't yet have a way to find the keys that need to be transferred. Doing that efficiently (without scanning the whole git working copy) is Hard. I'm considering design possibilities..
Managed to find a minimal, 20 line test case for at least one of the ways git-annex was hanging with GHC's threaded runtime. Sent it off to haskell-cafe for analysis (see the thread).
Further managed to narrow the bug down to MissingH's use of logging code that git-annex doesn't use (see my bug report). So, I can at least get around this problem with a modified version of MissingH. Hopefully that was the only thing causing the hangs I was seeing!
Spent most of the day making file content transfers robust. There were lots of bugs, hopefully I've fixed most of them. It seems to work well now, even when I throw a lot of files at it.
One of the changes also sped up transfers; it no longer roundtrips to the
remote to verify it has a file. The idea here is that when the assistant is
running, repos should typically be fairly tightly synced to their remotes
by it, so some of the extra checks that the `move` command does are
unnecessary.
Also spent some time trying to use ghc's threaded runtime, but continue to be baffled by the random hangs when using it. This needs fixing eventually; all the assistant's threads can potentially be blocked when it's waiting on an external command it has run.
Also changed how transfer info files are locked. The lock file is now separate from the info file, which allows the TransferWatcher thread to notice when an info file is created, and thus actually track transfers initiated by remotes.
I'm fairly close now to merging the `assistant` branch into `master`.
The data syncing code is very brute-force, but it will work well enough
for a first cut.
Next I can either add some repository network mapping, and use graph analysis to reduce the number of data transfers, or I can move on to the webapp. Not sure yet which I'll do. It's likely that since DebConf begins tomorrow I'll put off either of those big things until after the conference.
I didn't plan to work on git-annex much while at DebConf, because the conference always prevents the kind of concentration I need. But I unexpectedly also had to deal with three dead drives and illness this week.
That said, I have been trying to debug a problem with git-annex and Haskell's threaded runtime all week. It just hangs, randomly. No luck so far isolating why, although I now have a branch that hangs fairly reliably, and in which I am trying to whittle the entire git-annex code base (all 18 thousand lines!) into a nice test case.
This threaded runtime problem doesn't affect the assistant yet, but if I want to use Yesod in developing the webapp, I'll need the threaded runtime, and using the threaded runtime in the assistant generally would make it more responsive and less hacky.
Since this is a task I can work on without much concentration, I'll probably keep beating on it until I return home. Then I need to spend some quality thinking time on where to go next in the assistant.
Starting to travel, so limited time today.
Yet Another Thread added to the assistant, all it does is watch for changes to transfer information files, and update the assistant's map of transfers currently in progress. Now the assistant will know if some other repository has connected to the local repo and is sending or receiving a file's content.
This seemed really simple to write, it's just 78 lines of code. It worked 100% correctly the first time. :) But it's only so easy because I've got this shiny new inotify hammer that I keep finding places to use in the assistant.
Also, the new thread does some things that caused a similar thread (the merger thread) to go into a MVar deadlock. Luckily, I spent much of day 19 investigating and fixing that deadlock, even though it was not a problem at the time.
So, good.. I'm doing things right and getting to a place where rather nontrivial features can be added easily.
--
Next up: Enough nonsense with tracking transfers... Time to start actually transferring content around!
Since my last blog post, I've been polishing the `git annex watch` command.
First, I fixed the double commits problem. There's still some extra
committing going on in the `git-annex` branch that I don't understand. It
seems like a shutdown event is somehow being triggered whenever
a git command is run by the commit thread.
I also made `git annex watch` run as a proper daemon, with locking to
prevent multiple copies running, and a pid file, and everything.
I made `git annex watch --stop` stop it.
Then I managed to greatly increase its startup speed. At startup, it generates "add" events for every symlink in the tree. This is necessary because it doesn't really know if a symlink is already added, or was manually added before it started, or indeed was added while it started up. Problem was that these events were causing a lot of work staging the symlinks -- most of which were already correctly staged.
You'd think it could just check if the same symlink was in the index. But it can't, because the index is in a constant state of flux. The symlinks might have just been deleted and re-added, or changed, and the index still have the old value.
Instead, I got creative. :) We can't trust what the index says about the symlink, but if the index happens to contain a symlink that looks right, we can trust that the SHA1 of its blob is the right SHA1, and reuse it when re-staging the symlink. Wham! Massive speedup!
Then I started running `git annex watch` on my own real git annex repos,
and noticed some problems.. Like it turns normal files already checked into
git into symlinks. And it leaks memory scanning a big tree. Oops..
I put together a quick screencast demoing `git annex watch`.
While making the screencast, I noticed that `git-annex watch` was spinning
in strace, which is bad news for powertop and battery usage. This seems to
be a GHC bug also affecting Xmonad. I
tried switching to GHC's threaded runtime, which solves that problem, but
causes git-annex to hang under heavy load. Tried to debug that for quite a
while, but didn't get far. Will need to investigate this further..
Am seeing indications that this problem only affects ghc 7.4.1; in
particular 7.4.2 does not seem to have the problem.
Really productive day today, now that I'm out of the threaded runtime tarpit!
First, brought back `--debug` logging, better than before! As part of that, I
wrote some 250 lines of code to provide an IMHO more pleasant interface to
`System.Process` (itself only 650 lines of code) that avoids all the
low-level setup, cleanup, and tuple unpacking. Now I can do things like
write to a pipe to a process, and ensure it exits successfully, this easily:

    withHandle StdinHandle createProcessSuccess (proc "git" ["hash-object", "--stdin"]) $ \h ->
        hPutStr h objectdata
My interface also makes it easy to run nasty background processes, reading their output lazily.
    lazystring <- withHandle StdoutHandle createBackgroundProcess (proc "find" ["/"]) hGetContents
Any true Haskellers are shuddering here, I really should be using conduits or pipes, or something. One day..
The assistant needs to detect when removable drives are attached, and sync with them. This is a reasonable thing to be working on at this point, because it'll make the currently incomplete data transfer code fully usable for the sneakernet use case, and firming that up will probably be a good step toward handling other use cases involving data transfer over the network, including cases where network remotes are transiently available.
So I've been playing with using dbus to detect mount events. There's a very nice Haskell library to use dbus.
This simple program will detect removable drives being mounted, and works on Xfce (as long as you have automounting enabled in its configuration), and should also work on Gnome, and, probably, KDE:
[[!format haskell """
{-# LANGUAGE OverloadedStrings #-}

import DBus
import DBus.Client
import Control.Monad

main :: IO ()
main = do
    client <- connectSession
    listen client mountadded $ \s ->
        putStrLn (show s)
    forever $ getLine -- let listener thread run forever
  where
    mountadded = matchAny
        { matchInterface = Just "org.gtk.Private.RemoteVolumeMonitor"
        , matchMember = Just "MountAdded"
        }
"""]]
(Yeah... "org.gtk.Private.RemoteVolumeMonitor". There are so
many things wrong with that string. What does gtk have to do with
mounting a drive? Why is it Private? Bleagh. Should I only match
the "MountAdded" member and not the interface? Seems everyone who does
this relies on google to find other people who have cargo-culted it,
or just runs `dbus-monitor` and picks out things.
There seems to be no canonical list of events. Bleagh.)
Spent a while shaving a yak of needing a `getmntent` interface in Haskell.
Found one in the hsshellscript library; since that library is not packaged
in Debian, and I don't really want to depend on it, I extracted just
the mtab and fstab parts of it into a little library in git-annex.
I've started putting together a MountWatcher thread. On systems without
dbus (do OSX or the BSDs have dbus?), or if dbus is not running, it polls
`/etc/mtab` every 10 seconds for new mounts. When dbus is available,
it doesn't need the polling, and should notice mounts more quickly.
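The polling side is simple enough; a sketch (with a hypothetical `Mntent` type and parser standing in for the little library mentioned above; real mtab parsing also has to unescape fields):

[[!format haskell """
import Control.Concurrent (threadDelay)
import Control.Monad (unless)
import Data.List ((\\))
import Data.Maybe (mapMaybe)

data Mntent = Mntent { mnt_fsname :: String, mnt_dir :: FilePath }
    deriving (Eq)

parseMtab :: String -> [Mntent]
parseMtab = mapMaybe parse . lines
  where
    parse l = case words l of
        (dev:dir:_) -> Just (Mntent dev dir)
        _ -> Nothing

-- Call the handler with any mount points that newly appear.
mtabPoller :: ([Mntent] -> IO ()) -> IO ()
mtabPoller handler = loop []
  where
    loop old = do
        new <- fmap parseMtab (readFile "/etc/mtab")
        let added = new \\ old
        unless (null added) $ handler added
        threadDelay 10000000 -- 10 seconds
        loop new
"""]]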
Open question: Should it still poll even when dbus is available? Some of us
like to mount our own drives, by hand and may have automounting disabled. It'd
be good if the assistant supported that. This might need an
`annex.no-dbus` setting, but I'd rather avoid needing such manual
configuration.
One idea is to do polling in addition to dbus, if `/etc/fstab` contains
mount points that seem to be removable drives, on which git remotes live.
Or it could always do polling in addition to dbus, which is just some extra
work. Or, it could try to introspect dbus to see if mount events will
be generated.
The MountWatcher so far only detects new mounts and prints out what happened. Next up: Do something in response to them.
This will involve manipulating the Annex state to belatedly add the Remote on the mount point.. tricky. And then, for Git Remotes, it should pull/push the Remote to sync git data. Finally, for all remotes, it will need to queue Transfers of file contents from/to the newly available Remote.
Only had a few hours to work today, but my current focus is speed, and I
have indeed sped up parts of `git annex watch`.
One thing folks don't realize about git is that despite a rep for being
fast, it can be rather slow in one area: Writing the index. You don't
notice it until you have a lot of files, and the index gets big. So I've
put a lot of effort into git-annex in the past to avoid writing the index
repeatedly, and queue up big index changes that can happen all at once. The
new `git annex watch` was not able to use that queue. Today I reworked the
queue machinery to support the types of direct index writes it needs, and
now repeated index writes are eliminated.
... Eliminated too far, it turns out, since it doesn't yet ever flush that queue until shutdown! So the next step here will be to have a worker thread that wakes up periodically, flushes the queue, and autocommits. (This will, in fact, be the start of the syncing phase of my roadmap!) There's lots of room here for smart behavior. Like, if a lot of changes are being made close together, wait for them to die down before committing. Or, if it's been idle and a single file appears, commit it immediately, since this is probably something the user wants synced out right away. I'll start with something stupid and then add the smarts.
(BTW, in all my years of programming, I have avoided threads like the nasty bug-prone plague they are. Here I already have three threads, and am going to add probably 4 or 5 more before I'm done with the git annex assistant. So far, it's working well -- I give credit to Haskell for making it easy to manage state in ways that make it possible to reason about how the threads will interact.)
What about the races I've been stressing over? Well, I have an ulterior
motive in speeding up `git annex watch`, and that's to also be able to
slow it down. Running in slow-mo makes it easy to try things that might
cause a race and watch how it reacts. I'll be using this technique when
I circle back around to dealing with the races.
Another tricky speed problem came up today that I also need to fix. On
startup, `git annex watch` scans the whole tree to find files that have
been added or moved etc while it was not running, and take care of them.
Currently, this scan involves re-staging every symlink in the tree. That's
slow! I need to find a way to avoid re-staging symlinks; I may use
`git cat-file` to check if the currently staged symlink is correct, or I may
come up with some better and faster solution. Sleeping on this problem.
Oh yeah, I also found one more race bug today. It only happens at startup and could only make it miss staging file deletions.
Kickstarter is over. Yay!
Today I worked on the bug where `git annex watch` turned regular files
that were already checked into git into symlinks. So I made it check
if a file is already in git before trying to add it to the annex.
The tricky part was doing this check quickly. Unless I want to write my
own git index parser (or use one from Hackage), this check requires running
`git ls-files`, once per file to be added. That won't fly if a huge
tree of files is being moved or unpacked into the watched directory.
Instead, I made it only do the check during `git annex watch`'s initial
scan of the tree. This should be OK, because once it's running, you
won't be adding new files to git anyway, since it'll automatically annex
new files. This is good enough for now, but there are at least two problems
with it:
- Someone might `git merge` in a branch that has some regular files, and it
  would add the merged-in files to the annex.
- Once `git annex watch` is running, if you modify a file that was checked
  into git as a regular file, the new version will be added to the annex.
I'll probably come back to this issue, and may well find myself directly querying git's index.
I've started work to fix the memory leak I see when running
`git annex watch` in a large repository (40 thousand files). As always with a Haskell
memory leak, I crack open Real World Haskell's chapter on profiling.
Eventually this yields a nice graph of the problem:
So, looks like a few minor memory leaks, and one huge leak. Stared at this for a while, tried a few things, and got a much better result:
I may come back later and try to improve this further, but it's not bad memory usage. But, it's still rather slow to start up in such a large repository, and its initial scan is still doing too much work. I need to optimize more..
Beating my head against the threaded runtime some more. I can reproduce
one of the hangs consistently by running 1000 `git annex add` commands
in a loop. It hangs around 1% of the time, reading from `git cat-file`.
Interestingly, `git cat-file` is not yet running at this point --
git-annex has forked a child process, but the child has not yet exec'd it.
Stracing the child git-annex, I see it stuck in a futex. Adding tracing,
I see the child never manages to run any code at all.
This really looks like the problem is once again in MissingH, which uses
`forkProcess`. Which happens to come with a big warning about being very
unsafe, in very subtle ways. Looking at the C code that the newer process
library uses when spawning a pipe to a process, it messes around with lots of
things: blocking signals, stopping a timer, etc. Hundreds of lines of C
code to safely start a child process, all doing things that MissingH omits.
That's the second time I've seemingly isolated a hang in the GHC threaded runtime to MissingH.
And so I've started converting git-annex to use the new `process` library,
for running all its external commands. John Goerzen had mentioned process
to me once before when I found a nasty bug in MissingH, as the cool new
thing that would probably eliminate the `System.Cmd.Utils` part of MissingH,
but I'd not otherwise heard much about it. (It also seems to have the
benefit of supporting Windows.)
This is a big change and it's early days, but each time I see a hang, I'm
converting the code to use `process`, and so far the hangs have just gone
away when I do that.

Hours later... I've converted all of git-annex to use `process`.
In the er, process, the `--debug` switch stopped printing all the commands
it runs. I may try to restore that later.
I've not tested everything, but the test suite passes, even when using the threaded runtime. MILESTONE
Looking forward to getting out of these weeds and back to useful work..
Hours later yet.... The `assistant` branch in git now uses the threaded
runtime. It works beautifully, using proper threads to run file transfers
in.
That should fix the problem I was seeing on OSX yesterday. Too tired to test it now.
--
Amazingly, all the assistant's own dozen or so threads and thread synch variables etc all work great under the threaded runtime. I had assumed I'd see yet more concurrency problems there when switching to it, but it all looks good. (Or whatever problems there are are subtle ones?)
I'm very relieved. The threaded logjam is broken! I had been getting increasingly worried that not having the threaded runtime available would make it very difficult to make the assistant perform really well, and cause problems with the webapp, perhaps preventing me from using Yesod.
Now it looks like smooth sailing ahead. Still some hard problems, but it feels like with inotify and kqueue and the threaded runtime all dealt with, the really hard infrastructure-level problems are behind me.
Pondering syncing today. I will be doing syncing of the git repository first, and working on syncing of file data later.
The former seems straightforward enough, since we just want to push all changes to everywhere. Indeed, git-annex already has a sync command that uses a smart technique to allow syncing between clones without a central bare repository. (Props to Joachim Breitner for that.)
But it's not all easy. Syncing should happen as fast as possible, so changes show up without delay. Eventually it'll need to support syncing between nodes that cannot directly contact one-another. Syncing needs to deal with nodes coming and going; one example of that is a USB drive being plugged in, which should immediately be synced, but network can also come and go, so it should periodically retry nodes it failed to sync with. To start with, I'll be focusing on fast syncing between directly connected nodes, but I have to keep this wider problem space in mind.
One problem with `git annex sync` is that it has to be run in both clones
in order for changes to fully propagate. This is because git doesn't allow
pushing changes into a non-bare repository; so instead it drops off a new
branch in `.git/refs/remotes/$foo/synced/master`. Then when it's run locally
it merges that new branch into `master`.
So, how to trigger a clone to run `git annex sync` when syncing to it?
Well, I just realized I have spent two weeks developing something that can
be repurposed to do that! Inotify can watch for changes to
`.git/refs/remotes`, and the instant a change is made, the local sync
process can be started. This avoids needing to make another ssh connection
to trigger the sync, so is faster and allows the data to be transferred
over another protocol than ssh, which may come in handy later.
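With hinotify, the trigger side is only a few lines; a sketch against hinotify's classic String-path API (the real thing would have to watch the refs directory recursively, since the pushed branches land in subdirectories):

[[!format haskell """
import System.INotify

-- Run the sync/merge action whenever a remote pushes a new ref to us.
watchIncomingPushes :: IO () -> IO ()
watchIncomingPushes runsync = do
    i <- initINotify
    _ <- addWatch i [Modify, Create, MoveIn] ".git/refs/remotes"
        (const runsync)
    return ()
"""]]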
So, in summary, here's what will happen when a new file is created:
1. inotify event causes the file to be added to the annex, and immediately committed.
2. new branch is pushed to remotes (probably in parallel)
3. remotes notice new sync branch and merge it
4. (data sync, TBD later)
5. file is fully synced and available
Steps 1, 2, and 3 should all be able to be accomplished in under a second.
The speed of `git push` making a ssh connection will be the main limit
to making it fast. (Perhaps I should also reuse git-annex's existing ssh
connection caching code?)
In a series of airport layovers all day. Since I woke up at 3:45 am, didn't feel up to doing serious new work, so instead I worked through some OSX support backlog.
git-annex will now use Haskell's SHA library if the `sha256sum`
command is not available. That library is slow, but it's guaranteed to be
available; git-annex already depended on it to calculate HMACs.
Then I decided to see if it makes sense to use the SHA library
when adding smaller files. At some point, its slower implementation should
win over needing to fork and parse the output of `sha256sum`. This was
the first time I tried out Haskell's
Criterion benchmarker,
and I built this simple benchmark in short order.
[[!format haskell """
import Data.Digest.Pure.SHA
import Data.ByteString.Lazy as L
import Criterion.Main
import Common

testfile :: FilePath
testfile = "/tmp/bar" -- on ram disk

main = defaultMain
    [ bgroup "sha256"
        [ bench "internal" $ whnfIO internal
        , bench "external" $ whnfIO external
        ]
    ]

internal :: IO String
internal = showDigest . sha256 <$> L.readFile testfile

external :: IO String
external = pOpen ReadFromPipe "sha256sum" [testfile] $ \h ->
    fst . separate (== ' ') <$> hGetLine h
"""]]
The nice thing about benchmarking in airports is that when you're running a benchmark locally, you don't want to do anything else with the computer, so you can alternate people watching, spacing out, and analyzing results.
100 kb file:
    benchmarking sha256/internal
    mean: 15.64729 ms, lb 15.29590 ms, ub 16.10119 ms, ci 0.950
    std dev: 2.032476 ms, lb 1.638016 ms, ub 2.527089 ms, ci 0.950
    benchmarking sha256/external
    mean: 8.217700 ms, lb 7.931324 ms, ub 8.568805 ms, ci 0.950
    std dev: 1.614786 ms, lb 1.357791 ms, ub 2.009682 ms, ci 0.950
75 kb file:
    benchmarking sha256/internal
    mean: 12.16099 ms, lb 11.89566 ms, ub 12.50317 ms, ci 0.950
    std dev: 1.531108 ms, lb 1.232353 ms, ub 1.929141 ms, ci 0.950
    benchmarking sha256/external
    mean: 8.818731 ms, lb 8.425744 ms, ub 9.269550 ms, ci 0.950
    std dev: 2.158530 ms, lb 1.916067 ms, ub 2.487242 ms, ci 0.950
50 kb file:
    benchmarking sha256/internal
    mean: 7.699274 ms, lb 7.560254 ms, ub 7.876605 ms, ci 0.950
    std dev: 801.5292 us, lb 655.3344 us, ub 990.4117 us, ci 0.950
    benchmarking sha256/external
    mean: 8.715779 ms, lb 8.330540 ms, ub 9.102232 ms, ci 0.950
    std dev: 1.988089 ms, lb 1.821582 ms, ub 2.181676 ms, ci 0.950
10 kb file:
    benchmarking sha256/internal
    mean: 1.586105 ms, lb 1.574512 ms, ub 1.604922 ms, ci 0.950
    std dev: 74.07235 us, lb 51.71688 us, ub 108.1348 us, ci 0.950
    benchmarking sha256/external
    mean: 6.873742 ms, lb 6.582765 ms, ub 7.252911 ms, ci 0.950
    std dev: 1.689662 ms, lb 1.346310 ms, ub 2.640399 ms, ci 0.950
It's possible to get nice graphical reports out of Criterion, but this is clear enough, so I stopped here. 50 kb seems a reasonable cutoff point.
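So the eventual dispatch can be as simple as this sketch (not git-annex's actual hashing code):

[[!format haskell """
import qualified Data.ByteString.Lazy as L
import Data.Digest.Pure.SHA (sha256, showDigest)
import System.Process (readProcess)

-- Below this size, the pure Haskell implementation beats
-- forking sha256sum and parsing its output.
cutoff :: Integer
cutoff = 50 * 1024

sha256File :: FilePath -> Integer -> IO String
sha256File file size
    | size < cutoff = fmap (showDigest . sha256) (L.readFile file)
    | otherwise = fmap (takeWhile (/= ' '))
        (readProcess "sha256sum" [file] "")
"""]]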
I also used this to benchmark the SHA256 in Haskell's Crypto package. Surprisingly, it's a lot slower than even the Pure.SHA code. On a 50 kb file:
    benchmarking sha256/Crypto
    collecting 100 samples, 1 iterations each, in estimated 6.073809 s
    mean: 69.89037 ms, lb 69.15831 ms, ub 70.71845 ms, ci 0.950
    std dev: 3.995397 ms, lb 3.435775 ms, ub 4.721952 ms, ci 0.950
There's another Haskell library, SHA2, which I should try some time.
Random improvements day..
Got the merge conflict resolution code working in `git annex assistant`.
Did some more fixes to the pushing and pulling code, covering some cases I missed earlier.
Git syncing seems to work well for me now; I've seen it recover from a variety of error conditions, including merge conflicts and repos that were temporarily unavailable.
There is definitely a MVar deadlock if the merger thread's inotify event handler tries to run code in the Annex monad. Luckily, it doesn't currently seem to need to do that, so I have put off debugging what's going on there.
Reworked how the inotify thread runs, to avoid the two inotify threads in the assistant now from both needing to wait for program termination, in a possibly conflicting manner.
Hmm, that seems to have fixed the MVar deadlock problem.
Been thinking about how to fix watcher commits unlocked files. Posted some thoughts there.
It's about time to move on to data syncing. While eventually that will need to build a map of the repo network to efficiently sync data over the fastest paths, I'm thinking that I'll first write a dumb version. So, two more threads:
1. Uploads new data to every configured remote. Triggered by the watcher
   thread when it adds content. Easy; just use a `TSet` of Keys to send.
2. Downloads new data from the cheapest remote that has it. Could be
   triggered by the merger thread, after it merges in a git sync. Rather
   hard; how does it work out what new keys are in the tree without
   scanning it all? Scan through the git history to find newly created
   files? Maybe the watcher triggers this thread instead, when it sees a
   new symlink, without data, appear.
Both threads will need to be able to be stopped, and restarted, as needed to control the data transfer. And a lot of other control smarts will eventually be needed, but my first pass will be to do a straightforward implementation. Once it's done, the git annex assistant will be basically usable.
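For the first thread, the `TSet` can be about as dumb as this sketch (just a `TChan` drained in batches; not necessarily what I'll end up with):

[[!format haskell """
import Control.Concurrent.STM

type TSet a = TChan a

newTSet :: IO (TSet a)
newTSet = atomically newTChan

putTSet :: TSet a -> a -> IO ()
putTSet s = atomically . writeTChan s

-- Block until at least one item is available, then grab all of them.
getTSet :: TSet a -> IO [a]
getTSet s = atomically $ do
    x <- readTChan s
    xs <- drain
    return (x : xs)
  where
    drain = do
        done <- isEmptyTChan s
        if done
            then return []
            else do
                x <- readTChan s
                xs <- drain
                return (x : xs)
"""]]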
Since last post, I've worked on speeding up `git annex watch`'s startup time
in a large repository.
The problem was that its initial scan was naively staging every symlink in
the repository, even though most of them are, presumably, staged correctly
already. This was done in case the user copied or moved some symlinks
around while `git annex watch` was not running -- we want to notice and
commit such changes at startup.
Since I already had the `stat` info for the symlink, it can look at the
`ctime` to see if the symlink was made recently, and only stage it if so.
This sped up startup in my big repo from longer than I cared to wait (10+
minutes, or half an hour while profiling) to a minute or so. Of course,
inotify events are already serviced during startup, so making it scan
quickly is really only important so people don't think it's a resource hog.
First impressions are important. :)
But what does "made recently" mean exactly? Well, my answer is possibly
over engineered, but most of it is really groundwork for things I'll need
later anyway. I added a new data structure for tracking the status of the
daemon, which is periodically written to disk by another thread (thread #6!)
to `.git/annex/daemon.status`. Currently it looks like this; I anticipate
adding lots more info as I move into the syncing stage:
    lastRunning:1339610482.47928s
    scanComplete:True
So, only symlinks made after the daemon was last running need to be expensively staged on startup. Although, as RichiH pointed out, this fails if the clock is changed. But I have been planning to have a cleanup thread anyway, that will handle this, and other potential problems, so I think that's ok.
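The check itself then boils down to comparing each symlink's ctime against the saved lastRunning timestamp; something like this standalone sketch (which re-stats, where the real code already has the stat info in hand):

[[!format haskell """
import System.Posix.Files (getSymbolicLinkStatus, statusChangeTime)
import Data.Time.Clock.POSIX (POSIXTime)

-- Was this symlink made after the daemon was last running?
madeRecently :: POSIXTime -> FilePath -> IO Bool
madeRecently lastrunning file = do
    s <- getSymbolicLinkStatus file
    return (realToFrac (statusChangeTime s) > lastrunning)
"""]]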
Stracing its startup scan, it's fairly tight now. There are some repeated
`getcwd` syscalls that could be optimised out for a minor speedup.
Added the sanity check thread. Thread #7! It currently only does one sanity check per day, but the sanity check is a fairly lightweight job, so I may make it run more frequently. OTOH, it may never ever find a problem, so once per day seems a good compromise.
Currently it's only checking that all files in the tree are properly staged
in git. I might make it `git annex fsck` later, but fscking the whole tree
once per day is a bit much. Perhaps it should only fsck a few files per
day? TBD
Currently any problems found in the sanity check are just fixed and logged. It would be good to do something about getting problems that might indicate bugs fed back to me, in a privacy-respecting way. TBD
I also refactored the code, which was getting far too large to all be in one module.
I have been thinking about renaming `git annex watch` to
`git annex assistant`, but I think I'll leave the command name as-is.
Some users might want a simple watcher and stager, without the assistant's
other features like syncing and the webapp. So the next stage of the
roadmap will be a different command that also runs `watch`.
At this point, I feel I'm done with the first phase of inotify. It has a couple known bugs, but it's ready for brave beta testers to try. I trust it enough to be running it on my live data.
Well, sometimes you just have to go for the hack. Trying to find a way
to add additional options to git-annex-shell without breaking backwards
compatibility, I noticed that it ignores all options after `--`, because
those tend to be random rsync options due to the way rsync runs it.
So, I've added a new class of options, that come in between, like this:

    -- opt=val opt=val ... --
The parser for these will not choke on unknown options, unlike normal getopt. So this let me add the additional info I needed to pass to git-annex-shell to make it record transfer information. And if I need to pass more info in the future, that's covered too.
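Such a parser is tiny; a sketch (not git-annex-shell's actual code) that pulls out the `opt=val` words between the two `--` markers, skipping anything it doesn't understand:

[[!format haskell """
import Data.Maybe (mapMaybe)

parseFields :: [String] -> [(String, String)]
parseFields args = mapMaybe parse between
  where
    -- everything between the first "--" and the next "--"
    between = takeWhile (/= "--") $ drop 1 $ dropWhile (/= "--") args
    parse w = case break (== '=') w of
        (opt@(_:_), '=':val) -> Just (opt, val)
        _ -> Nothing
"""]]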
It's ugly, but since only git-annex runs git-annex-shell, this is an ugliness only I (and now you, dear reader) have to put up with.
Note to self: Command-line programs are sometimes an API, particularly if designed to be called remotely, and so it makes sense to consider whether they are, and to design expandability into them from day 1.
Anyway, we now have full transfer tracking in git-annex! Both sides of a transfer know what's being transferred, and from where, and have the info necessary to interrupt the transfer.
Also did some basic groundwork, adding a queue of transfers to perform, and adding to the daemon's status information a map of currently running transfers.
Next up: The daemon will use inotify to notice new and deleted transfer info files, and update its status info.
After a few days otherwise engaged, back to work today.
My focus was on adding the committing thread mentioned in the day 4 "speed" post. I got rather further than expected!
First, I implemented a really dumb thread, that woke up once per second,
checked if any changes had been made, and committed them. Of course, this
rather sucked. In the middle of a large operation like untarring a tarball,
or `rm -r` of a large directory tree, it made lots of commits and made
things slow and ugly. This was not unexpected.
So next, I added some smarts to it. First, I wanted to stop it waking up every second when there was nothing to do, and instead blocking wait on a change occurring. Secondly, I wanted it to know when past changes happened, so it could detect batch mode scenarios, and avoid committing too frequently.
I played around with combinations of various Haskell thread communications
tools to get that information to the committer thread: `MVar`, `Chan`,
`QSem`, `QSemN`. Eventually, I realized all I needed was a simple channel
through which the timestamps of changes could be sent. However, `Chan`
wasn't quite suitable, and I had to add a dependency on
Software Transactional Memory, and use a `TChan`. Now I'm cooking with gas!
With that data channel available to the committer thread, it quickly got some very nice smart behavior. Playing around with it, I find it commits instantly when I'm making some random change that I'd want the git-annex assistant to sync out instantly; and that its batch job detection works pretty well too.
There's surely room for improvement, and I made this part of the code be an entirely pure function, so it's really easy to change the strategy. This part of the committer thread is so nice and clean, that here's the current code, for your viewing pleasure:
[[!format haskell """
{- Decide if now is a good time to make a commit.
 - Note that the list of change times has an undefined order.
 -
 - Current strategy: If there have been 10 changes within the past second,
 - a batch activity is taking place, so wait for later.
 -}
shouldCommit :: UTCTime -> [UTCTime] -> Bool
shouldCommit now changetimes
    | len == 0 = False
    | len > 4096 = True -- avoid bloating queue too much
    | length (filter thisSecond changetimes) < 10 = True
    | otherwise = False -- batch activity
  where
    len = length changetimes
    thisSecond t = now `diffUTCTime` t <= 1
"""]]
Still some polishing to do to eliminate minor inefficiencies and deal with more races, but this part of the git-annex assistant is now very usable, and will be going out to my beta testers soon!
A rather frustrating and long day coding went like this:
1-3 pm
Wrote a single function; all any Haskell programmer needs to know about it
is its type signature:

    Lsof.queryDir :: FilePath -> IO [(FilePath, LsofOpenMode, ProcessInfo)]
When I'm spending another hour or two taking a unix utility like lsof and parsing its output, which in this case is in a rather complicated machine-parsable output format, I often wish unix streams were strongly typed, which would avoid this bother.
3-9 pm
Six hours spent making it defer annexing files until the commit thread wakes up and is about to make a commit. Why did it take so horribly long? Well, there were a number of complications, and some really bad bugs involving races that were hard to reproduce reliably enough to deal with.
In other words, I was lost in the weeds for a lot of those hours...
At one point, something glorious happened, and it was always making exactly one commit for batch mode modifications of a lot of files (like untarring them). Unfortunately, I had to lose that gloriousness due to another potential race, which, while unlikely, would have made the program deadlock if it happened.
So, it's back to making 2 or 3 commits per batch mode change. I also have a buglet that causes sometimes a second empty commit after a file is added. I know why (the inotify event for the symlink gets in late, after the commit); will try to improve commit frequency later.
9-11 pm
Put the capstone on the day's work, by calling lsof on a directory full of hardlinks to the files that are about to be annexed, to check if any are still open for write.
This works great! Starting up `git annex watch` when processes have files
open is no longer a problem, and even if you're evil enough to try having
multiple processes open the same file, it will complain and not annex it
until all the writers close it.
(Well, someone really evil could turn the write bit back on after git annex
clears it, and open the file again, but then really evil people can do
that to files in .git/annex/objects
too, and they'll get their just
deserts when git annex fsck
runs. So, that's ok..)
Anyway, will beat on it more tomorrow, and if all is well, this will finally go out to the beta testers.
Back home and laptop is fixed.. back to work.
Warmup exercises:

- Went in to make it queue transfers when a broken symlink is received, only to find I'd already written code to do that, and forgotten about it. Heh. Did check that the git-annex branch is always sent first, which will ensure that code always knows where to transfer a key from. I had probably not considered this wrinkle when first writing the code; it worked by accident.
- Made the assistant check that a remote is known to have a key before queueing a download from it.
- Fixed a bad interaction between the git annex map command and the assistant.
Tried using a modified version of MissingH that doesn't use HSLogger, to make git-annex work with the threaded GHC runtime. Unfortunately, I am still seeing hangs in at least 3 separate code paths when running the test suite. I may have managed to fix one of the hangs, but have not grokked what's causing the others.
I now have access to a Mac OSX system, thanks to Kevin M. I've fixed some portability problems in git-annex with it before, but today I tested the assistant on it:
Found a problem with the kqueue code that prevents incoming pushes from being noticed.
The problem was that the newly added git ref file does not trigger an add event. The kqueue code saw a generic change event for the refs directory, but since the old file was being deleted and replaced by the new file, the kqueue code, which already had the old file in its cache, did not notice the file had been replaced.
I fixed that by making the kqueue code also track the inode of each file. Currently that adds the overhead of a stat of each file, which could be avoided if haskell exposed the inode returned by readdir. Room to optimise this later...

Also noticed that the kqueue code was not separating out file deletions from directory deletions. IIRC Jimmy had once mentioned a problem with file deletions not being noticed by the assistant, and this could be responsible for that, although the directory deletion code seems to handle them ok normally. It was certainly making the transfer watching thread not notice when any transfers finished. I fixed this oversight by looking in the cache to see whether there used to be a file or a directory, and running the appropriate hook.
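Back to the inode fix: it boils down to something like this (a minimal sketch, not the actual kqueue cache code):

[[!format haskell """
import System.Posix.Files (getFileStatus, fileID)
import System.Posix.Types (FileID)

data CacheEntry = CacheEntry
    { cachedName :: FilePath
    , cachedInode :: FileID
    } deriving (Show, Eq)

cacheEntry :: FilePath -> IO CacheEntry
cacheEntry f = do
    s <- getFileStatus f -- one stat per file; see optimisation note above
    return $ CacheEntry f (fileID s)

{- Same name but a different inode means the file was replaced,
 - even though it never generated an add event. -}
replaced :: CacheEntry -> CacheEntry -> Bool
replaced old new = cachedName old == cachedName new
    && cachedInode old /= cachedInode new
"""]]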
Even with these fixes, the assistant does not yet reliably transfer file contents on OSX. I think the problem is that with kqueue we're not guaranteed to get an add event and a deletion event for a transfer info file -- if it's created and quickly deleted, the code that synthesizes those events doesn't run in time to know it existed. Since the transfer code relies on deletion events to tell when transfers are complete, it stops sending files after the first transfer, if the transfer ran so quickly it doesn't get the expected events.
So, will need to work on OSX support some more...
Today is a planning day. I have only a few days left before I'm off to Nicaragua for DebConf, where I'll only have smaller chunks of time without interruptions. So it's important to get some well-defined smallish chunks designed that I can work on later. See bulleted action items below (now moved to the syncing page). Each should be around 1-2 hours, unless it turns out to be 8 hours... :)
First, worked on writing down a design, and some data types, for data transfer tracking (see syncing page). Found that writing down these simple data types before I started slinging code has clarified things a lot for me.
Most importantly, I realized that I will need to modify git-annex-shell to record on disk what transfers it's doing, so the assistant can get that information and use it to both avoid redundant transfers (potentially a big problem!), and later to allow the user to control them using the web app.
While eventually the user will be able to use the web app to prioritize transfers, stop and start, throttle, etc, it's important to get the default behavior right. So I'm thinking about things like how to prioritize uploads vs downloads, when it's appropriate to have multiple downloads running at once, etc.
Worked on automatic merge conflict resolution today. I had expected to be able to use git's merge driver interface for this, but that interface is not sufficient. There are two problems with it:
- The merge program is run when git is in the middle of an operation that locks the index. So it cannot delete or stage files. I need to do both as part of my conflict resolution strategy.
- The merge program is not run at all when the merge conflict is caused by one side deleting a file, and the other side modifying it. This is an important case to handle.
So, instead, git-annex will use a regular git merge, and if it fails, it will fix up the conflicts.

That presented its own difficulty: finding which files in the tree conflict. git ls-files --unmerged is the way to do that, but its output is in quite a raw form:
120000 3594e94c04db171e2767224db355f514b13715c5 1 foo
120000 35ec3b9d7586b46c0fd3450ba21e30ef666cfcd6 3 foo
100644 1eabec834c255a127e2e835dadc2d7733742ed9a 2 bar
100644 36902d4d842a114e8b8912c02d239b2d7059c02b 3 bar
I had to stare at the rather impenetrable documentation for hours and write a lot of parsing and processing code to get from that to these mostly self explanatory data types:
[[!format haskell """
data Conflicting v = Conflicting
    { valUs :: Maybe v
    , valThem :: Maybe v
    } deriving (Show)

data Unmerged = Unmerged
    { unmergedFile :: FilePath
    , unmergedBlobType :: Conflicting BlobType
    , unmergedSha :: Conflicting Sha
    } deriving (Show)
"""]]
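To give a flavor of that parsing, here's a hedged sketch of handling a single output line (the real code does more, such as pairing up stages 2 and 3 into the Conflicting values):

[[!format haskell """
{- Each line of git ls-files --unmerged has the form
 - "mode SP sha SP stage TAB file"; stage 2 is "us", stage 3 "them". -}
parseUnmergedLine :: String -> Maybe (String, String, Int, FilePath)
parseUnmergedLine l = case break (== '\t') l of
    (metadata, '\t':file) -> case words metadata of
        [mode, sha, stage] -> case reads stage of
            [(n, "")] -> Just (mode, sha, n, file)
            _ -> Nothing
        _ -> Nothing
    _ -> Nothing
"""]]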
Not the first time I've whined here about time spent parsing unix command output, is it? :)
From there, it was relatively easy to write the actual conflict cleanup code, and make git annex sync use it. Here's how it looks:
$ ls -1
foo.png
bar.png
$ git annex sync
commit
# On branch master
nothing to commit (working directory clean)
ok
merge synced/master
CONFLICT (modify/delete): bar.png deleted in refs/heads/synced/master and modified in HEAD. Version HEAD of bar.png left in tree.
Automatic merge failed; fix conflicts and then commit the result.
bar.png: needs merge
(Recording state in git...)
[master 0354a67] git-annex automatic merge conflict fix
ok
$ ls -1
foo.png
bar.variant-a1fe.png
bar.variant-93a1.png
There are very few options for ways for the conflict resolution code to name conflicting variants of files. The conflict resolver can only use data present in git to generate the names, because the same conflict needs to be resolved the same everywhere.
So I had to choose between using the full key name in the filenames produced when resolving a merge, and using a shorter checksum of the key, that would be more user-friendly, but could theoretically collide with another key. I chose the checksum, and weakened it horribly by only using 32 bits of it!
Surprisingly, I think this is a safe choice. The worst that can happen in case of such a collision is another conflict, and the conflict resolution code will work on conflicts produced by the conflict resolution code! In such a case, it does fall back to putting the whole key in the filename: "bar.variant-SHA256-s2550--2c09deac21fa93607be0844fefa870b2878a304a7714684c4cc8f800fda5e16b.png"
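For illustration only, deriving such a variant name might look like this; it assumes the pureMD5 package for the digest, and the real code may derive the checksum differently (the transcript above shows 4-digit variants, so the exact width is a detail):

[[!format haskell """
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Digest.Pure.MD5 (md5)
import System.FilePath (splitExtension)

{- A deterministic variant filename derived from the key, so every
 - clone resolves the same conflict to the same filenames. -}
variantFile :: FilePath -> String -> FilePath
variantFile file keyname = base ++ ".variant-" ++ checksum ++ ext
  where
    (base, ext) = splitExtension file
    -- 8 hex digits = 32 bits of the digest
    checksum = take 8 $ show $ md5 $ L.pack keyname
"""]]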
Still need to hook this code into git annex assistant.
Followed my plan from yesterday, and wrote a simple C library to interface to kqueue, and Haskell code to use that library. By now I think I understand kqueue fairly well -- there are some very tricky parts to the interface.
But... it still didn't work. After building all this, my code was failing the same way that the haskell kqueue library failed yesterday. I filed a bug report with a testcase.
Then I thought to ask on #haskell. Got sorted out in quick order! The problem turns out to be that haskell's runtime has a periodic SIGALRM that was interrupting my kevent call. It can be worked around with +RTS -V0, but I put in a fix to retry the kevent call when it's interrupted.
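The fix can be as small as wrapping the FFI call; this is a sketch, and the foreign import shown is illustrative rather than the real binding:

[[!format haskell """
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Error (throwErrnoIfMinus1Retry)
import Foreign.C.Types (CInt(..))

foreign import ccall safe "waitchange"
    c_waitchange :: CInt -> IO CInt

{- throwErrnoIfMinus1Retry re-runs the call whenever it fails with
 - EINTR, as happens when the runtime's SIGALRM interrupts kevent. -}
waitChange :: CInt -> IO CInt
waitChange kq = throwErrnoIfMinus1Retry "kevent" (c_waitchange kq)
"""]]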
And now git-annex watch can detect changes to directories on BSD and OSX!

Note: I said "detect", not "do something useful in response to". Getting from the limited kqueue events to actually staging changes in the git repo is going to be another day's work. Still, brave FreeBSD or OSX users might want to check out the watch branch from git and see if git annex watch will at least say it sees changes you make to your repository.
So as not to bury the lead: I've been hard at work on my first day in Nicaragua, and the git-annex assistant fully syncs files (including their contents) between remotes now!!
Details follow..
Made the committer thread queue Upload Transfers when new files are added to the annex. Currently it tries to transfer the new content to every remote; this inefficiency needs to be addressed later.
Made the watcher thread queue Download Transfers when new symlinks appear that point to content we don't have. Typically, that will happen after an automatic merge from a remote. This needs to be improved as it currently adds Transfers from every remote, not just those that have the content.
This was the second place that needed an ordered list of remotes to talk to. So I cached such a list in the DaemonStatus state info. This will also be handy later on, when the webapp is used to add new remotes, so the assistant can know about them immediately.
Added YAT (Yet Another Thread), number 15 or so, the transferrer thread that waits for transfers to be queued and runs them. Currently a naive implementation, it runs one transfer at a time, and does not do anything to recover when a transfer fails.
Actually transferring content requires YAT, so that the transfer action can run in a copy of the Annex monad, without blocking all the assistant's other threads from entering that monad while a transfer is running. This is also necessary to allow multiple concurrent transfers to run in the future.
This is a very tricky piece of code, because that thread will modify the git-annex branch, and its parent thread has to invalidate its cache in order to see any changes the child thread made. Hopefully that's the extent of the complication of doing this. The only reason this was possible at all is that git-annex already support multiple concurrent processes running and all making independent changes to the git-annex branch, etc.
After all my groundwork this week, file content transferring is now fully working!
My laptop's SSD died this morning. I had some work from yesterday committed to the git repo on it, but not pushed as it didn't build. Luckily I was able to get that off the SSD, which is now a read-only drive -- even mounting it fails with fsck write errors.
Wish I'd realized the SSD was dying before the day before my trip to Nicaragua.. Getting back to a useful laptop used most of my time and energy today.
I did manage to fix transfers to not block the rest of the assistant's threads. Problem was that, without Haskell's threaded runtime, waiting on something like a rsync command blocks all threads. To fix this, transfers now are run in separate processes.
Also added code to allow multiple transfers to run at once. Each transfer takes up a slot, with the number of free slots tracked by a QSemN. This allows the transfer starting thread to block until a slot frees up, and then run the transfer.
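Slot accounting with a QSemN can be sketched like this (a minimal sketch, not the assistant's actual transferrer code; numSlots and inTransferSlot are made-up names):

[[!format haskell """
import Control.Concurrent (ThreadId, forkIO)
import Control.Concurrent.QSemN
import Control.Exception (finally)

numSlots :: Int
numSlots = 2 -- made-up default

{- Created once, then shared by all transfer-starting code. -}
newTransferSlots :: IO QSemN
newTransferSlots = newQSemN numSlots

{- Blocks until a slot frees up, then runs the transfer in its own
 - thread. The slot is given back even if the transfer throws. -}
inTransferSlot :: QSemN -> IO () -> IO ThreadId
inTransferSlot sem transfer = do
    waitQSemN sem 1
    forkIO $ transfer `finally` signalQSemN sem 1
"""]]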
This needs to be extended to be aware of transfers initiated by remotes. The transfer watcher thread should detect those starting and stopping, and update the QSemN accordingly. It would also be nice if transfers initiated by remotes would be delayed when there are no free slots for them... but I have not thought of a good way to do that.
There's a bug somewhere in the new transfer code: when two transfers are queued close together, the second one is lost and doesn't happen. I would debug this, but I'm spent for the day.
I made the MountWatcher only use dbus if it sees a client connected to dbus that it knows will send mount events, or if it can start up such a client via dbus. (Fancy!) Otherwise it falls back to polling. This should be enough to support users who manually mount things -- if they have gvfs installed, it'll be used to detect their manual mounts, even when a desktop is not running, and if they don't have gvfs, they get polling.
Also, I got the MountWatcher to work with KDE. Found a dbus event that's emitted when KDE mounts a drive, and this is also used. If anyone with some other desktop environment wants me to add support for it, and it uses dbus, it should be easy: Run dbus-monitor, plug in a drive, get it mounted, and send me a transcript.
Of course, it'd also be nice to support anything similar on OSX that can provide mount event notifications. Not a priority though, since the polling code will work.
Some OS X fixes today..
- Jimmy pointed out that my getmntent code broke the build on OSX again. Sorry about that.. I keep thinking Unix portability nightmares are an 80's thing, not a 2010's thing. Anyway, I adapted a lot of hackish C code to emulate getmntent on BSD systems, and it seems to work. (I actually think the BSD interface to this is saner than Linux's, but I'd rather have either one than both, sigh..)
- Kqueue was blocking all the threads on OSX. This is fixed, and the assistant seems to be working on OSX again.
I put together a preliminary page thanking everyone who contributed to the git-annex Kickstarter (see the thanks page). The wall-o-names is scary crazy humbling.
Improved --debug mode for the assistant: now every thread says whenever it's doing anything interesting, and there are also timestamps.
Had been meaning to get on with syncing to drives when they're mounted, but got sidetracked with the above. Maybe tomorrow. I did think through it in some detail as I was waking up this morning, and think I have a pretty good handle on it.
Worked today on two action items from my last blog post:
- on-disk transfers in progress information files (read/write/enumerate)
- locking for the files, so redundant transfer races can be detected, and failed transfers noticed
That's all done, and used by the get, copy, and move subcommands.

Also, I made git-annex status use that information to display any file transfers that are currently in progress:
joey@gnu:~/lib/sound/misc>git annex status
[...]
transfers in progress:
downloading Vic-303.mp3 from leech
(Webapp, here we come!)
However... Files being sent or received by git-annex-shell don't yet have this transfer info recorded. The problem is that to do so, git-annex-shell will need to be run with a --remote= parameter. But old versions will of course fail when run with such an unknown parameter.

This is a problem I last faced in December 2011 when adding the --uuid= parameter. That time I punted, and required the remote git-annex-shell be updated to a new enough version to accept it. But as git-annex gets more widely used and packaged, that's becoming less of an option. I need to find a real solution to this problem.
Last night I got git annex watch to also handle deletion of files. This was not as tricky as feared; the key is using git rm --ignore-unmatch, which avoids most problematic situations (such as a just-deleted file being added back before git is run).
Also fixed some races when git annex watch is doing its startup scan of the tree, which might be changed as it's being traversed. Now only one thread performs actions at a time, so inotify events are queued up during the scan, and dealt with once it completes. It's worth noting that inotify can only buffer so many events... which might have been a problem, except for a very nice feature of Haskell's inotify interface: It has a thread that drains the limited inotify buffer and does its own buffering.
Right now, git annex watch is not as fast as it could be when doing something like adding a lot of files, or deleting a lot of files. For each file, it currently runs a git command that updates the index. I did some work toward coalescing these into one command (which git annex already does normally). It's not quite ready to be turned on yet, because of some races involving git add that become much worse if it's delayed by event coalescing.
And races were the theme of today. Spent most of the day really getting to grips with all the fun races that can occur between modifications happening to files and git annex watch. The inotify page now has a long list of known races: some benign, and several, all involving adding files, that are quite nasty.
I fixed one of those races this evening. The rest will probably involve moving away from using git add, which necessarily examines the file on disk, to directly shoving the symlink into git's index.
BTW, it turns out that dvcs-autosync has grappled with some of these same races: http://comments.gmane.org/gmane.comp.version-control.home-dir/665

I hope that git annex watch will be in a better place to deal with them, since it's only dealing with git, and with a restricted portion of it relevant to git-annex.
It's important that git annex watch be rock solid. It's the foundation of the git annex assistant. Users should not need to worry about races when using it. Most users won't know what race conditions are. If only I could be so lucky!
Not much available time today, only a few hours.
Main thing I did was fix up the failed push tracking to use a better data structure. No need for a queue of failed pushes; all it needs is a map of remotes that have an outstanding failed push, and a timestamp. Now it won't grow in memory use forever anymore. :)
Finding the right thread mutex type for this turned out to be a bit of a challenge. I ended up with a STM TMVar, which is left empty when there are no pushes to retry, so the thread using it blocks until there are some. And, it can be updated transactionally, without races.
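Roughly, the shape of it (an illustrative sketch, simplified from what the assistant actually does):

[[!format haskell """
import Control.Concurrent.STM
import qualified Data.Map as M
import Data.Maybe (fromMaybe)
import Data.Time.Clock (UTCTime)

{- The TMVar is left empty when there are no failed pushes, so the
 - retry thread blocks in takeTMVar until there is something to do. -}
type FailedPushMap remote = TMVar (M.Map remote UTCTime)

{- Blocks until there are failed pushes, and claims them all. -}
getFailedPushes :: FailedPushMap r -> IO (M.Map r UTCTime)
getFailedPushes = atomically . takeTMVar

{- Transactionally records a newly failed push. -}
pushFailed :: Ord r => FailedPushMap r -> r -> UTCTime -> IO ()
pushFailed v remote time = atomically $ do
    m <- fromMaybe M.empty `fmap` tryTakeTMVar v
    putTMVar v (M.insert remote time m)
"""]]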
I also fixed a bug outside the git-annex assistant code. It was possible to crash git-annex if a local git repository was configured as a remote, and the repository was not available on startup. git-annex now ignores such remotes. This does impact the assistant, since it is a long running process and git repositories will come and go. Now it ignores any that were not available when it started up. This will need to be dealt with when making it support removable drives.
Syncing works! I have two clones, and any file I create in the first is immediately visible in the second. Delete that file from the second, and it's immediately removed from the first.
Most of my work today felt like stitching existing limbs onto a pre-existing
monster. Took the committer thread, that waits for changes and commits them,
and refashioned it into a pusher thread, that waits for commits and pushes
them. Took the watcher thread, that watches for files being made,
and refashioned it into a merger thread, that watches for git refs being
updated. Pulled in bits of the git annex sync command to reanimate this.
It may be a shambling hulk, but it works.
Actually, it's not much of a shambling hulk; I refactored my code after copying it. ;)
I think I'm up to 11 threads now in the new git annex assistant command, each with its own job, and each needing to avoid stepping on the others' toes. I did see one MVar deadlock error
today, which I have not managed to reproduce after some changes. I think
the committer thread was triggering the merger thread, which probably
then waited on the Annex state MVar the committer thread had held.
Anyway, it even pushes to remotes in parallel, and keeps track of remotes it failed to push to, although as of yet it makes no attempt to periodically retry them.
One bug I need to deal with is that the push code assumes any change made to the remote has already been pushed back to it. When it hasn't, the push will fail due to not being a fast-forward. I need to make it detect this case and pull before pushing.
(I've pushed this work out in a new assistant branch.)
... I'm getting tired of kqueue.
But the end of the tunnel is in sight. Today I made git-annex handle files that are still open for write after a kqueue creation event is received. Unlike with inotify, which has a new event each time a file is closed, kqueue only gets one event when a file is first created, and so git-annex needs to retry adding files until there are no writers left.
Eventually I found an elegant way to do that. The committer thread already wakes up every second as long as there's a pending change to commit. So for adds that need to be retried, it can just push them back onto the change queue, and the committer thread will wait one second and retry the add. One second might be too frequent to check, but it will do for now.
This means that git annex watch should now be usable on OSX, FreeBSD, and NetBSD! (It'll also work on Debian kFreeBSD once lsof is ported to it.)

I've merged kqueue support to master.
I also think I've squashed the empty commits that were sometimes made.
Incidentally, I'm 50% through my first month, and finishing inotify was the first half of my roadmap for this month. Seem to be right on schedule.. Now I need to start thinking about syncing.
Good news! My beta testers report that the new kqueue code works on OSX. At least "works" as well as it does on Debian kFreeBSD. My crazy development strategy of developing on Debian kFreeBSD while targeting Mac OSX is vindicated. ;-)
So, I've been beating the kqueue code into shape for the last 12 hours, minus a few hours sleep.
First, I noticed it was seeming to starve the other threads. I'm using
Haskell's non-threaded runtime, which does cooperative multitasking between
threads, and my C code was never returning to let the other threads run.
Changed that around, so the C code runs until SIGALRMed, and then that thread calls yield before looping back into the C code. Wow, cooperative multitasking.. I last dealt with that when programming for Windows 3.1!
(Should try to use Haskell's -threaded runtime sometime, but git-annex
doesn't work under it, and I have not tried to figure out why not.)
Then I made a single commit, with no testing, in which I made the kqueue code maintain a cache of what it expects in the directory tree, and use that to determine what files changed how when a change is detected. Serious code. It worked on the first go. If you were wondering why I'm writing in Haskell ... yeah, that's why.
And I've continued to hammer on the kqueue code, making lots of little
fixes, and at this point it seems almost able to handle the changes I
throw at it. It does have one big remaining problem; kqueue doesn't tell me
when a writer closes a file, so it will sometimes miss adding files. To fix
this, I'm going to need to make it maintain a queue of new files, and periodically check them with lsof, to see when they're done being written to, and then add them to the annex. So while a file is being written to, git annex watch will have to wake up every second or so and run lsof... and it'll take at least 1 second to notice a file's complete. Not ideal, but the best that can be managed with kqueue.
I released a version of git-annex over the weekend that includes the git annex watch command. There's a minor issue installing it from cabal on OSX, which I've fixed in my tree. Nice timing: at least the watch command should be shipped in the next Debian release, which freezes at the end of the month.
Jimmy found out how kqueue blows up when there are too many directories to keep all open. I'm not surprised this happens, but it's nice to see exactly how. Odd that it happened to him at just 512 directories; I'd have guessed more. I have plans to fork watcher programs that each watch 512 directories (or whatever the ulimit is), to deal with this. What a pitiful interface is kqueue.. I have not thought yet about how the watcher programs would communicate back to the main program.
Back on the assistant front, I've worked today on making git syncing more robust. Now when a push fails, it tries a pull, and a merge, and repushes. That ensures that the push is, almost always, a fast-forward. Unless something else gets in a push first, anyway!
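Sketched with plain shell-outs (the assistant uses its own git plumbing rather than spawning git like this):

[[!format haskell """
import System.Exit (ExitCode(..))
import System.Process (rawSystem)

git :: [String] -> IO Bool
git args = fmap (== ExitSuccess) $ rawSystem "git" args

{- If the push is rejected, pull (fetch + merge) and push again,
 - which should then be a fast-forward... unless something else
 - gets a push in first. -}
robustPush :: String -> String -> IO Bool
robustPush remote branch = do
    ok <- push
    if ok
        then return True
        else do
            _ <- git ["pull", remote, branch]
            push
  where
    push = git ["push", remote, branch]
"""]]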
If a push still fails, there's Yet Another Thread, added today, that will wake up after 30 minutes and retry the push. It currently keeps retrying every 30 minutes until the push finally gets though. This will deal, to some degree, with those situations where a remote is only sometimes available.
I need to refine the code a bit, to avoid keeping an ever-growing queue of failed pushes if a remote is just dead, and to clear old failed pushes from the queue when a later push succeeds.
I also need to write a git merge driver that handles conflicts in the tree. If two conflicting versions of a file foo are saved, this would merge them, renaming them to foo.X and foo.Y. Probably X and Y are the git-annex keys for the content of the files; this way all clones will resolve the conflict in a way that leads to the same tree. It's also possible to get a conflict by one repo deleting a file, and another modifying it. In this case, renaming the deleted file to foo.Y may be the right approach, I am not sure.
I glanced through some Haskell dbus bindings today. I believe there are dbus events available to detect when drives are mounted, and on Linux this would let git-annex notice and sync to usb drives, etc.
I've been investigating how to make git annex watch work on FreeBSD, and by extension, OSX.
One option is kqueue, which works on both operating systems, and allows very basic monitoring of file changes. There's also an OSX specific hfsevents interface.
Kqueue is far from optimal for git annex watch, because it provides even less information than inotify (which didn't really provide everything I needed, thus the lsof hack). Kqueue doesn't have events for files being closed, only an event when a file is created. So it will be difficult for git annex watch to know when a file is done being written to and can be annexed. git annex will probably need to run lsof periodically to check when recently added files are complete. (hfsevents shares this limitation.)
Kqueue also doesn't provide specific events when a file or directory is moved. Indeed, it doesn't provide specific events about what changed at all. All you get with kqueue is a generic "oh hey, the directory you're watching changed in some way", and it's up to you to scan it to work out how. So git annex will probably need to run git ls-files --others to find changes in the directory tree. This could be expensive with large trees. (hfsevents has per-file events on current versions of OSX.)
Despite these warts, I want to try kqueue first, since it's more portable than hfsevents, and will surely be easier for me to develop support for, since I don't have direct access to OSX.
So I went to a handy Debian kFreeBSD porter box, and tried some kqueue stuff to get a feel for it. I got a python program that does basic directory monitoring with kqueue to work, so I know it's usable there.
Next step was getting kqueue working from Haskell. Should be easy, there's a Haskell library already. I spent a while trying to get it to work on Debian kFreeBSD, but ran into a problem that could be caused by the Debian kFreeBSD being different, or just a bug in the Haskell library. I didn't want to spend too long shaving this yak; I might install "real" FreeBSD on a spare laptop and try to get it working there instead.
But for now, I've dropped down to C instead, and have a simple C program that can monitor a directory with kqueue. Next I'll turn it into a simple library, which can easily be linked into my Haskell code. The Haskell code will pass it a set of open directory descriptors, and it'll return the one that it gets an event on. This is necessary because kqueue doesn't recurse into subdirectories on its own.
I've generally had good luck with this approach to adding stuff in Haskell; rather than writing a bit-banging and structure packing low level interface in Haskell, write it in C, with a simpler interface between C and Haskell.
First day of Kickstarter funded work!
Worked on inotify today. The watch branch in git now does a pretty good job of following changes made to the directory, annexing files as they're added and staging other changes into git. Here's a quick transcript of it in action:
joey@gnu:~/tmp>mkdir demo
joey@gnu:~/tmp>cd demo
joey@gnu:~/tmp/demo>git init
Initialized empty Git repository in /home/joey/tmp/demo/.git/
joey@gnu:~/tmp/demo>git annex init demo
init demo ok
(Recording state in git...)
joey@gnu:~/tmp/demo>git annex watch &
[1] 3284
watch . (scanning...) (started)
joey@gnu:~/tmp/demo>dd if=/dev/urandom of=bigfile bs=1M count=2
add ./bigfile 2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.835976 s, 2.5 MB/s
(checksum...) ok
(Recording state in git...)
joey@gnu:~/tmp/demo>ls -la bigfile
lrwxrwxrwx 1 joey joey 188 Jun 4 15:36 bigfile -> .git/annex/objects/Wx/KQ/SHA256-s2097152--e5ced5836a3f9be782e6da14446794a1d22d9694f5c85f3ad7220b035a4b82ee/SHA256-s2097152--e5ced5836a3f9be782e6da14446794a1d22d9694f5c85f3ad7220b035a4b82ee
joey@gnu:~/tmp/demo>git status -s
A bigfile
joey@gnu:~/tmp/demo>mkdir foo
joey@gnu:~/tmp/demo>mv bigfile foo
"del ./bigfile"
joey@gnu:~/tmp/demo>git status -s
AD bigfile
A foo/bigfile
Due to Linux's inotify interface, this is surely some of the most subtle, race-heavy code that I'll need to deal with while developing the git annex assistant. But I can't start wading, need to jump off the deep end to make progress!
The hardest problem today involved the case where a directory is moved outside of the tree that's being watched. Inotify will still send events for such directories, but it doesn't make sense to continue to handle them.
Ideally I'd stop inotify watching such directories, but a lot of state would need to be maintained to know which inotify handle to stop watching. (Seems like Haskell's inotify API makes this harder than it needs to be...)
Instead, I put in a hack that will make it detect inotify events from directories moved away, and ignore them. This is probably acceptable, since this is an unusual edge case.
The notable omission in the inotify code, which I'll work on next, is staging deleting of files. This is tricky because adding a file to the annex happens to cause a deletion event. I need to make sure there are no races where that deletion event causes data loss.
Today I worked on the race conditions, and fixed two of them. Both were fixed by avoiding using git add, which looks at the files currently on disk. Instead, git annex watch injects symlinks directly into git's index, using git update-index.
There is one bad race condition remaining. If multiple processes have a file open for write, one can close it, and it will be added to the annex. But then the other can still write to it.
Getting away from race conditions for a while, I made git annex watch not annex .gitignore and .gitattributes files.
And, I made it handle running out of inotify descriptors. By default, /proc/sys/fs/inotify/max_user_watches is 8192, and that's how many directories inotify can watch. Now when it needs more, it will print a nice message showing how to increase it with sysctl.
FWIW, DropBox also uses inotify and has the same limit. It seems to not tell the user how to fix it when it goes over. Here's what git annex watch will say:
Too many directories to watch! (Not watching ./dir4299)
Increase the limit by running:
echo fs.inotify.max_user_watches=81920 | sudo tee -a /etc/sysctl.conf; sudo sysctl -p
git merge watch_
My cursor has been mentally poised here all day, but I've been reluctant to merge watch into master. It seems solid, but is it correct? I was able to think up a lot of races it'd be subject to, and deal with them, but did I find them all?
Perhaps I need to do some automated fuzz testing to reassure myself. I looked into using genbackupdata to that end. It's not quite what I need, but could be moved in that direction. Or I could write my own fuzz tester, but it seems better to use someone else's, because a) laziness and b) they're less likely to have the same blind spots I do.
My reluctance to merge isn't helped by the known bugs with files that are either already open before git annex watch starts, or are opened by two processes at once, and confuse it into annexing the still-open file when one process closes it.
I've been thinking about just running lsof on every file as it's being annexed to check for that, but in the end, lsof is too slow. Since its check involves trawling through all of /proc, it takes it a good half a second to check a file, and adding 25 seconds to the time it takes to process 100 files is just not acceptable.
But an option that could work is to run lsof after a bunch of new files have been annexed. It can check a lot of files nearly as fast as a single one. In the rare case that an annexed file is indeed still open, it could be moved back out of the annex. Then when its remaining writer finally closes it, another inotify event would re-annex it.