[Question] Repo size not shrinking after using --invert-paths #573

Open
KoningLeon opened this issue Jun 26, 2024 · 6 comments

Comments

@KoningLeon

KoningLeon commented Jun 26, 2024

In the past my team stored reports, including their data, in our Azure DevOps Git repo, which grew it to 13.2 GB. Thankfully we saw the light and changed our ways last year, so the repo hasn't contained any reports with data for a while now. I wanted to use your tool to also remove the history of those files, for the sake of repo size and security. Unfortunately I haven't been able to reduce the size of the pack files so far. I must admit I am far from a Git guru, so assume my knowledge is very limited :)

What I've done:

  1. Clone the repo from Azure DevOps to a local machine using git clone --mirror.
  2. Run python $gfr --invert-paths --path-glob '*/cache.abf' and python $gfr --invert-paths --path-glob '*.pbix' (these are the file types that hold the data).
  3. Clone the bare repo locally to a new folder so I can inspect the results. When I browse the Git history, the commits with the mentioned files do seem to be gone, but the pack file is still 13.2 GB.

The logging from your tool gives me the impression that old and unneeded files are cleaned up before repacking, but maybe I've missed a flag or a git command I'm supposed to run.

image
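
To compare sizes before and after a run, it can help to measure the pack files directly in each repository directory (the mirror clone where the filter ran, and the clone made for inspection). The following is a minimal sketch; the function names are hypothetical, and it only assumes the standard layout where packs live under objects/pack inside the Git directory (for a bare or mirror clone that is the repository folder itself, for a normal clone it is the .git folder).

```python
from pathlib import Path


def pack_sizes(git_dir):
    """Return {pack file name: size in bytes} for each *.pack in objects/pack."""
    pack_dir = Path(git_dir) / "objects" / "pack"
    if not pack_dir.is_dir():
        # Not a Git directory, or no packs have been written yet.
        return {}
    return {p.name: p.stat().st_size for p in sorted(pack_dir.glob("*.pack"))}


def total_pack_size_mb(git_dir):
    """Sum of all pack file sizes, in mebibytes."""
    return sum(pack_sizes(git_dir).values()) / (1024 * 1024)
```

Calling total_pack_size_mb on both directories shows whether the packs differ between the filtered mirror and the clone used for inspection.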

@newren
Owner

newren commented Jul 2, 2024

It sounds like you took a guess at what was taking up space and removed some files, but a lot of your space is in other files. Run python $gfr --analyze from your project, and look at the files in the report directory it creates, $GIT_DIR/filter-repo/analysis. They should tell you what is large.

@KoningLeon
Author

KoningLeon commented Jul 3, 2024

I knew which files were the problem because they are the only ones that hold data. The repo only takes up 80 MB once the troublesome files are removed. Somehow, though, that isn't reflected in the size of the .pack file.

I did take your advice and ran the analyze command, and I might have found something that could explain why the pack isn't shrinking. Some large files still show up in path-all-sizes as < present > even though the files and folders are no longer part of the repo.
image

And the same goes for the directories-all-sizes. The marked folders are no longer part of the repo, yet they are still marked as < present >.
image
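
For comparing the report before and after a run, the large entries still marked as present can be pulled out with a few lines of Python. This is only a sketch under assumptions: it assumes each data line of the path-all-sizes report has four whitespace-separated columns (unpacked size, packed size, deletion date or <present> written as a single token, path name) and skips anything else as a header. The function names are hypothetical, so check the layout against your own report first.

```python
def parse_path_sizes(report_text):
    """Parse report lines of the ASSUMED form: unpacked packed deleted path."""
    rows = []
    for line in report_text.splitlines():
        # maxsplit=3 keeps spaces inside the path name intact.
        parts = line.split(None, 3)
        if len(parts) < 4 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # header, blank line, or a layout we do not recognise
        unpacked, packed, deleted, path = parts
        rows.append({
            "unpacked": int(unpacked),
            "packed": int(packed),
            "deleted": deleted,
            "path": path,
        })
    return rows


def largest_present(report_text, limit=10):
    """Return the largest rows (by packed size) still marked <present>."""
    rows = [r for r in parse_path_sizes(report_text) if r["deleted"] == "<present>"]
    return sorted(rows, key=lambda r: r["packed"], reverse=True)[:limit]
```

Running largest_present on the report text from before and after the filter run gives two short lists that are easy to diff.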

@KoningLeon
Author

Managed to get the desired result by doing:

  1. git clone --depth 2000
  2. Delete the entire Repo/PowerBI folder
  3. Run git-filter-repo as per my original post
  4. Put the Repo/PowerBI folder back

This took our repo from 13.2 GB to 150 MB. It means losing the entire history for that specific folder, but that is a sacrifice we are willing to make.

@newren
Owner

newren commented Jul 3, 2024

Any chance you were using CMD to run your commands? If so, the problem may be that you used single quotes (') instead of double quotes ("). If you changed your command from:

python $gfr --invert-paths --path-glob '*/cache.abf' --path-glob '*.pbix' 

to

python $gfr --invert-paths --path-glob "*/cache.abf" --path-glob "*.pbix" 

that might have fixed things for you. Apparently (as I learned in #435), the former will cause CMD to tell git-filter-repo that you want to remove files matching '*/cache.abf' and '*.pbix', which you obviously don't have any of, while the latter correctly tells git-filter-repo that you want to remove files matching */cache.abf and *.pbix.

To my knowledge, this is unique to CMD; single quotes work fine in any other shell and don't do this crazy weirdness.
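
The effect described above can be illustrated in Python. This is a sketch, not an emulation of CMD: shlex in POSIX mode stands in for a shell that strips the single quotes, shlex in non-POSIX mode stands in for one that passes them through literally, and fnmatchcase is used as an illustrative matcher rather than as git-filter-repo's own code. The helper names and example paths are hypothetical.

```python
import shlex
from fnmatch import fnmatchcase

COMMAND = "--invert-paths --path-glob '*/cache.abf' --path-glob '*.pbix'"


def globs_seen(argv):
    """Collect the value that follows each --path-glob argument."""
    return [argv[i + 1] for i, arg in enumerate(argv[:-1]) if arg == "--path-glob"]


def matches_any(path, globs):
    """True if the path matches at least one glob pattern."""
    return any(fnmatchcase(path, g) for g in globs)


# Quotes stripped, as a POSIX-style shell would do.
stripped = globs_seen(shlex.split(COMMAND, posix=True))
# Quotes kept as part of the argument (stand-in for the behaviour described above).
literal = globs_seen(shlex.split(COMMAND, posix=False))

for path in ["Repo/PowerBI/sales.pbix", "Repo/PowerBI/.pbi/cache.abf"]:
    print(path, matches_any(path, stripped), matches_any(path, literal))
```

With the quotes kept, each pattern begins with a literal quote character, so no ordinary path can match it and nothing would be removed.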

@KoningLeon
Author

No, I was using the PowerShell terminal from within VS Code.

@newren
Owner

newren commented Jul 3, 2024

Well, in that case, I'd suggest adding a --debug flag to your command so we can see what git-filter-repo actually saw; I have no idea whether VS Code did some weird interpretation too. It'd also be nice to see the large paths from the --analyze report both before and after you run git-filter-repo with the --debug flag.

That said, it sounds like you did find a solution, so if you don't want to debug further that's fine. But if you'd like to know what happened, the --debug output is the next piece of output I'd need.
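
Independently of --debug, a quick way to see what a given terminal hands to a Python program is to echo the arguments it receives. The sketch below is a hypothetical helper (call it show_args.py), not part of git-filter-repo; running it with the same arguments from the same terminal shows whether the quotes arrive as part of the pattern.

```python
import sys


def describe_args(argv):
    """Return one line per argument, using repr() so quotes stay visible."""
    return ["argv[%d] = %r" % (i, arg) for i, arg in enumerate(argv)]


if __name__ == "__main__":
    # Example: python show_args.py --invert-paths --path-glob '*.pbix'
    print("\n".join(describe_args(sys.argv[1:])))
```

If the last line of output shows the pattern wrapped in an extra pair of quotes, the terminal passed the single quotes through literally.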
