
Git LFS is very slow with lots of files. It is also space- and bandwidth-inefficient if you update big files: every change stores and transfers the whole object, not a delta.

The integration with Git is also very clunky since it's based on smudge/clean filters (that were really meant to update templates in your text files or change line endings, not download big files).
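
To make the mechanism concrete: on checkout, Git pipes the pointer file into the smudge filter's stdin and expects the real content on stdout. A minimal Python sketch of what an LFS-style smudge filter does, assuming a Git-LFS-style local store layout (.git/lfs/objects/aa/bb/<oid>); illustrative only, not the actual implementation:

    #!/usr/bin/env python3
    # LFS-style smudge filter sketch: pointer text in, file content out.
    import sys

    def pointer_oid(text):
        # Pointer files are tiny key-value files per the LFS spec:
        #   version https://git-lfs.github.com/spec/v1
        #   oid sha256:<64 hex chars>
        #   size <bytes>
        fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
        return fields["oid"].split(":", 1)[1]

    data = sys.stdin.buffer.read()
    try:
        oid = pointer_oid(data.decode("utf-8"))
    except Exception:
        sys.stdout.buffer.write(data)  # not a pointer: pass through untouched
        sys.exit(0)
    with open(f".git/lfs/objects/{oid[:2]}/{oid[2:4]}/{oid}", "rb") as f:
        sys.stdout.buffer.write(f.read())

Note that one filter process is spawned per file with this approach, which is part of why it is slow for lots of files (the long-running filter.<driver>.process protocol exists to amortize that).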

The server options are also limited and not easy to deploy. Not a problem if you use a central repository on the cloud, but a non-starter if you want direct sync between devices.



> The integration with Git is also very clunky since it's based on smudge/clean filters (that were really meant to update templates in your text files or change line endings, not download big files).

There is no better integration though, right? I'm implementing an alternative to Git LFS because I also don't like it - but my past research led me nowhere. Git just doesn't have good support for this, and all we can do is smudge or batch smudge.

iirc there's some nuance/difference in how Git Annex approaches this problem space, but fundamentally I think everyone basically does the same thing. No?

If there's a better way I'd love to know, since I'm literally replacing/reimplementing Git LFS for my own pipeline.
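
For reference, the clean side I'm building is basically the mirror image of smudge: hash what Git pipes in, stash it in a content-addressed store, emit a pointer. A rough sketch under the same assumptions (spec pointer format, hypothetical local store path):

    #!/usr/bin/env python3
    # LFS-style clean filter sketch: file content in, pointer text out.
    import hashlib, os, sys

    data = sys.stdin.buffer.read()
    oid = hashlib.sha256(data).hexdigest()
    path = f".git/lfs/objects/{oid[:2]}/{oid[2:4]}/{oid}"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)  # content-addressed, so writing once is enough
    sys.stdout.write(
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )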


> There is no better integration though, right?

Yeah, this is not specifically a critique of LFS; it's a critique of the whole thing. Some of those problems are definitely limitations of Git. I'm not trying to point fingers, but the result is bad.


Concretely, how does smudge not solve the problem adequately?

I dislike smudge in that it feels like an abuse of an API. But I'm not aware of any real problems with it... though I could easily be missing something. So I'd love opinions on the subject :)


Integration is poor. Some tooling is not aware of it and will diff or show the pointer file instead - Git hosting platforms, for example. It is also easy to end up in a situation where the pointer file hasn't been smudged; Git might not report a change, and it can be hard to find out what is and isn't smudged... at least that has been my experience with large repositories where I sometimes want to pull large files only in specific folders.
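
For reference, an unsmudged file is just the pointer text sitting in the working tree, so you can brute-force the check by looking for the spec's version line (git lfs ls-files covers the tracked side; this sketch is the blunt instrument):

    #!/usr/bin/env python3
    # Sketch: list working-tree files that still look like LFS pointers.
    import os

    MAGIC = b"version https://git-lfs.github.com/spec/v1"

    for root, dirs, files in os.walk("."):
        dirs[:] = [d for d in dirs if d != ".git"]  # skip Git's own metadata
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    if f.read(len(MAGIC)) == MAGIC:
                        print(path)
            except OSError:
                pass  # unreadable file, ignore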


I think you either get a super integrated but restricted ecosystem (e.g. you can use darke files, but then darke files is all you can use), or you build on top of an established protocol and get something maximally compatible (any Git server), but some software will ignore the extra bits.


Where is the inefficiency: the Git LFS spec, or some particular server? I think space is not a spec issue at all, an LFS server can use any deduplicating storage under the hood

The part where filters are used is not elegant; the porcelain for LFS could be better.


For me it's just that Git LFS defaults to a server. I'm rewriting it largely because I loathe Git LFS needing to hit some HTTP server. Having to run one locally, because the design defaulted to a GitHub-webserver mindset, frustrates me.


Personally, if files are large I prefer to store all of them and the entire history of their changes somewhere else, so a server makes sense to me...


Yea, I just don't want to be forced to. I.e., to me, storing them in another folder, a network filesystem, some SSH filesystem, a dumb KV store like S3, etc. are all viable storage locations for it.

So that's what I'm writing, because I'm just picky, I guess heh.
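
The storage interface I have in mind is tiny; a sketch (names are mine, nothing standard):

    # Anything that can put/get a blob by its hash is a valid backend:
    # a local folder, a network/SSH filesystem, S3, whatever.
    import os
    from typing import Protocol

    class ObjectStore(Protocol):
        def put(self, oid: str, data: bytes) -> None: ...
        def get(self, oid: str) -> bytes: ...

    class DirStore:
        """Dumb local-folder backend; an S3 backend would be the same
        interface with put_object/get_object calls instead."""
        def __init__(self, root: str):
            self.root = root

        def put(self, oid: str, data: bytes) -> None:
            path = os.path.join(self.root, oid[:2], oid)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)

        def get(self, oid: str) -> bytes:
            with open(os.path.join(self.root, oid[:2], oid), "rb") as f:
                return f.read()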


The pointer just references an object ID, and the spec's endpoint is just a URL, so it could probably be a file:... or s3:... URL if you have the right resolver logic
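
The resolver logic would just dispatch on the scheme; a hypothetical sketch, not anything the spec defines:

    # Fetch an object by OID from a configured base URL, picking a
    # transport from the URL scheme. file:// and http(s) shown; an s3
    # branch would slot in the same way.
    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def fetch(base_url: str, oid: str) -> bytes:
        parts = urlparse(base_url)
        if parts.scheme == "file":
            with open(os.path.join(parts.path, oid), "rb") as f:
                return f.read()
        if parts.scheme in ("http", "https"):
            with urlopen(f"{base_url}/{oid}") as resp:
                return resp.read()
        raise ValueError(f"no resolver for scheme: {parts.scheme}")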


> I think space is not a spec issue at all, an LFS server can use any deduplicating storage under the hood

The client doesn't deduplicate, so you have to download whole objects and store them. Of course you can clean up older versions of objects, but that will force a re-download if you change branches or go back in history.


How would you deduplicate on the client at all?

If I have multiple copies of the same file in different subdirectories, and I need all of them for work, there's no way around storing all of them as separate copies on my machine, right? Unless you do some symlinking or hardlinking magic, but that would likely break platform and software compatibility, so I wouldn't want that either...

Similarly with history: how can I have my cake (not have previous versions occupy space locally) and eat it too (have them readily available locally without download)?

Or do you mean something like chunking binary files and then deduping the chunks? Is that an LFS limitation, or simply something no Git LFS client/server implementation does yet? It's still fairly new, after all.
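
The chunking version would look something like this: split the file, address each chunk by hash, and identical chunks across copies and versions get stored once. Fixed-size chunks for brevity; tools like borg and restic use content-defined chunking so an insertion near the start of a file doesn't shift every later chunk boundary. Hypothetical; I'm not aware of an LFS implementation that does it:

    # Chunk-level dedup sketch: the "recipe" (ordered chunk hashes) is
    # enough to rebuild the file from the chunk store.
    import hashlib

    CHUNK = 4 * 1024 * 1024  # 4 MiB fixed-size chunks, for simplicity

    def chunk_file(path: str, store: dict) -> list[str]:
        recipe = []
        with open(path, "rb") as f:
            while data := f.read(CHUNK):
                oid = hashlib.sha256(data).hexdigest()
                store.setdefault(oid, data)  # identical chunks stored once
                recipe.append(oid)
        return recipe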



