Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ideas on speeding up no-op scenario #261

Open
KAction opened this issue Dec 10, 2023 · 5 comments
Open

Ideas on speeding up no-op scenario #261

KAction opened this issue Dec 10, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@KAction
Copy link

KAction commented Dec 10, 2023

This is output of treefmt run on clean checkout of my $dayjob repository:

>> treefmt

traversed 3563 files
matched 1102 files to formatters
left with 0 files after cache
of whom 0 files were re-formatted
all of this in 357ms

And I was wondering, why it takes so slow to do nothing with less than 4k
files. Instead of doing proper profiling, I just run strace treefmt and
counted the syscalls. And here what I found:

  1. For every formatter treefmt runs a bunch of statx(2) in attempt to
    detect if formatter changed. Such detection takes time and is inevitably
    prone to both false-positives and false-negatives.

    I suggest skipping it and leaving it to the user to bust the cache when he
    upgrades his toolchain. That would save N * M system calls, where N is
    amount of formatters and M is amount of entries in your $PATH.

    link to strace snippet

  2. treefmt executes statx(2) on each file that is matched by at least one
    rule. From content of eval-cache I infer that file is considered
    unmodified if it has the same mtime and size.

    This is known pain point of build systems, and there is no kernel interface
    to make it any more efficient. Luckily, there is userspace daemon
    watchman that can more efficiently
    answer the question "what files were changed".

    For example, in my case it takes watchman only 20ms to tell me what files
    were changed (I intentionally changed one file):

     $ time watchman since . c:1702188097:2344484:1:60 '*.nix'
     {
         "is_fresh_instance": false,
         "files": [
             {
                 "cclock": "c:1702188097:2344484:1:2",
                 "new": false,
                 "ctime": 1702188990,
                 "dev": 66306,
                 "mtime": 1702188990,
                 "uid": 1000,
                 "mode": 33188,
                 "size": 380,
                 "oclock": "c:1702188097:2344484:1:71",
                 "nlink": 1,
                 "ino": 27656322,
                 "gid": 100,
                 "exists": true,
                 "name": "shell.nix"
             }
         ],
         "version": "2023.08.14.00",
         "clock": "c:1702188097:2344484:1:85",
         "debug": {
             "cookie_files": [
                 "/home/kaction/devel/dayjob/backend/package/.watchman-cookie-despise-2344484-38"
             ]
         }
     }
    
     ________________________________________________________
     Executed in   20.58 millis    fish           external
        usr time    6.25 millis    0.00 micros    6.25 millis
        sys time    2.04 millis  792.00 micros    1.25 millis
    

    Not great, but still 15x improvement. Probably can be improved further by
    skipping watchman command line tool and talking Unix socket interface
    directly.

    There seems to be Rust watchman client library already

  3. This is just a gut feeling, not backed by counting of syscalls. treefmt
    uses toml format to store the cache. The toml format is A config file format for humans. Some suitable binary format, like sqlite3, gdbm or cdb
    should probably be more efficient.

Thoughts?

@KAction KAction added the enhancement New feature or request label Dec 10, 2023
@KAction KAction changed the title Idea on speeding up no-op scenario Ideas on speeding up no-op scenario Dec 10, 2023
@adrian-gierakowski
Copy link

I suggest skipping it and leaving it to the user to bust the cache when he
upgrades his toolchain.

This would be a major pain for people who frequently upgrade their nixpkgs. If anything, this should be opt-in

@zimbatm
Copy link
Member

zimbatm commented Dec 10, 2023

There are a lot of different points but overall I'm open to changing the algorithms. The current approach is designed as a local maxima between speed, accuracy and code complexity so that I could get something out. I fully expect to have to re-visit this for users with larger repos. The main thing that I care about is that we maintain the current user experience where the user doesn't have to think if they have to bust the cache or not. The whole premise of this project is to relieve the user from cognitive overhead.

Replacing the cache file format is a no-brainer. The only reason we are using this format is that we were already using the TOML dependency. I'm not opposed to using something more efficient. But it's also not a priority as it's not the main driver for slowdowns.

Regarding the change tracking, mtime isn't fully accurate, but humans don't change files at high frequency so it's generally OK. The accurate method would be to track the checksum of each file, and that's much slower. We could introduce a daemon, but that would increase the code complexity and add new failure scenarios. I know there has been a lot of research going into speeding up git status in large repos, so it would be good to look there first (actually it would be interesting to see the speed comparison with git status).

@zimbatm
Copy link
Member

zimbatm commented Dec 10, 2023

Git has a fsmonitor. That's a similar idea than what you mentioned.

Before going there, there are some internal optimisations to do. On nixpkgs:

$ git ls-files | wc -l
38435
$ time git status >/dev/null
real	0m0.233s
user	0m0.079s
sys	0m0.273s
$ cat treefmt.toml 
[formatter.echo]
command = "echo"
includes = [ "*.*" ]
$ time treefmt 
0 files changed in 1s (found 38389, matched 38389, cache misses 0)

real	0m1.318s
user	0m0.878s
sys	0m0.442s

@KAction
Copy link
Author

KAction commented Dec 17, 2023

I did some experiments with cache format, and results are not as straightforward as I expected. Loading msgpack instead of toml for my case is 2ms faster, but writing msgpack is 1.5ms slower.

zimbatm pushed a commit to KAction/treefmt that referenced this issue Mar 13, 2024
Anecodically, for project having 4k files and 7 different formatters,
this change can reduce time it takes for treefmt(1) to figure out that
it has nothing to do from ~350ms down to ~8m.

If "WATCHMAN_SOCK" environment variable is not set and "watchman(1)"
command are not available, "treefmt" transparently falls back on walking
the file system. Setting the environment variable is highly recommended,
since is avoids subprocess invocation.

Ref: numtide#261
@brianmcgee
Copy link
Member

V2 has a new concept of walkers, with default performance greatly improved:

# nixpkgs with no cache
traversed 42264 files
emitted 42264 files for processing
formatted 35462 files (623 changed) in 19.7s

# nixpkgs with hot cache
traversed 42264 files
emitted 0 files for processing
formatted 0 files (0 changed) in 283ms

With the new walker pattern it should be straightforward to have an implementation for watchman.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants