Removing things from a Git repository

Why remove things from Git?

Sometimes, bad things can be accidentally committed to Git; things like unnecessarily large files or plain text passwords. Removing them can be a great challenge, especially if the repository's history is large.

Fortunately there are ways to remove this information, but beware: doing so means tampering with repository history.

Depending on your situation, the impact of this could be minimal or catastrophic. A repository maintained and used by a single developer wouldn't have any issues, but a repository maintained and consumed by an in-house team will need to have some effective communication.

For a publicly accessible repository, rewriting history is unlikely to be an option. As Scott Chacon and Ben Straub put it, "…you should avoid pushing your work until you’re happy with it and ready to share it with the rest of the world" (242).

With that covered, this guide will describe how to:

remove specific parts from a file (i.e. sensitive information); and
remove a file from the entire repository.

Removing specific parts from a file

To remove sensitive information, Git's rebase is the best choice. It allows specific commits to be edited so that those changes can flow through to all of the commits that follow.

Using rebase to remove specific parts

Imagine the following scenario: there is a Git repository with a history of three commits, and the second contains some secret information that should not be there.

Creating a commit that removes the password won't remove the password from the repository; anyone can still view the diffs between commits to see the password being added or removed. Instead, the second commit needs to be edited directly.

Doing this will modify the commit's ID and diverge the branch from its remote counterparts, i.e. tamper with the history.

To edit a file via rebase, get the ID of the commit containing the file to edit and run git rebase -i <id>^.

This will run Git's interactive rebase mode. The caret appended to the ID refers to the commit's parent, which tells Git to include the commit itself in the rebase; without it the rebase would start directly after the specified ID and skip over the commit that should be edited.

Something along the lines of the following will be displayed, listing every commit from the supplied ID to the tip of the current branch:

pick 2d3e784 Commit the password
pick 1fe3f99 Commit some other things

Each line represents a commit in the following format:

<action> <id> <commit-message>

The item of interest is <action>. To edit the second commit, the pick action should be changed to edit.

Edit the list to read like so:

edit 2d3e784 Commit the password
pick 1fe3f99 Commit some other things

Saving and closing the file will begin the rebase.

With the rebase underway, the file can be modified to remove the password. Make the changes, stage them, commit them with git commit --amend and run git rebase --continue to proceed.

Amending the commit opens its commit message in an editor; update it if necessary. The rebase will then continue as normal until all commits have been handled.

Once the rebase operation is complete, running git show <id> (using the ID of the commit with the password) will show that the password no longer exists.
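The steps above can be replayed end to end in a throwaway repository. What follows is a minimal sketch rather than the only way to do it: the file name config.txt, the password hunter2, and the use of GIT_SEQUENCE_EDITOR with sed to switch pick to edit without opening an editor are all assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway repository with three commits; the second one adds a password.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "name=app" > config.txt
git add config.txt
git commit -q -m "Initial commit"

# Hypothetical secret, committed alongside a legitimate change.
echo "host=localhost" >> config.txt
echo "password=hunter2" >> config.txt
git commit -q -am "Commit the password"

echo "notes" > notes.txt
git add notes.txt
git commit -q -m "Commit some other things"

# ID of the commit that introduced the password.
bad=$(git log --format=%H -S"hunter2" -- config.txt | tail -n 1)

# Start the interactive rebase and change the first "pick" to "edit"
# without opening an editor (assumption: sed is available).
GIT_SEQUENCE_EDITOR="sed -i.bak '1s/^pick/edit/'" git rebase -i "$bad^"

# The rebase is now paused on the offending commit: fix the file and amend.
grep -v '^password=' config.txt > config.tmp
mv config.tmp config.txt
git add config.txt
git commit -q --amend --no-edit
GIT_EDITOR=true git rebase --continue

# Count the commits in the rewritten history that add or remove the password.
leaks=$(git log --format=%H -S"hunter2" | wc -l | tr -d ' ')
echo "commits touching the password: $leaks"
```

If the rebase went through cleanly, the final line should report zero commits. The old commits still exist locally until the reflog is expired and the repository is garbage collected.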

Congratulations, the password has been removed and you no longer have a glaringly obvious security risk.

Removing a file from the entire repository

Git repositories are generally small in size, rarely surpassing a few megabytes at most (of course, this depends on the project).

In some cases, binary files can be, either intentionally or accidentally, committed to a repository. When multiple binary files exist, the size of the repository can skyrocket.

These files need to be removed if the repository is to ever become reasonably quick to clone again.

When an entire file needs to be removed, there are several methods to choose from:

rebase works best for when one or two files need to be removed and there is only one branch;
filter-branch is better suited when multiple files need to be removed and there are multiple branches; and
a third-party tool such as BFG Repo-Cleaner may help for very large repositories, or for when files need to be removed based on a search query.

Using rebase to remove a file

Imagine the following scenario: a developer decided it would be amusing to commit a hefty JPG image of Crash Bandicoot, likely as some kind of sick joke.

While innocent in nature, if this Crash Bandicoot image is several thousand kilobytes of data, the lives of innocent developers are ruined if they ever try to clone the repository on a subpar Internet connection.

Deleting a large file in a commit will not fully remove it from the repository, and the total size of the repository will not decrease either.

The total size of a repository can be checked by using git count-objects -vH, where -v will print verbose data and -H will display the file sizes in a human readable format.

Running git count-objects -vH on the repository with the hypothetical Crash Bandicoot JPG (which is approximately 1.6MB in file size) gives the following output:

count: 0
size: 0 bytes
in-pack: 30
packs: 1
size-pack: 1.59 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

Depending on whether the objects are stored loose or in a packfile (as they are in a fresh clone), the 1.6MB will be listed under either size or size-pack. Definitions for all of the above fields can be accessed via git count-objects --help.

After making a commit that removes the JPG, rerunning the command gives a similar result:

count: 0
size: 0 bytes
in-pack: 31
packs: 1
size-pack: 1.59 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

As expected, the new commit doesn't remove the bloat. (in-pack has increased though, since there is now a new commit.)
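The same behaviour can be checked directly by looking for the image's blob rather than comparing sizes. This is a small sketch in a throwaway repository; the random 200 kB file standing in for the image and the commit messages are assumptions for illustration.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "readme" > README
git add README
git commit -q -m "Initial commit"

# Hypothetical stand-in for the large image: 200 kB of random bytes.
head -c 200000 /dev/urandom > crash-bandicoot.jpg
git add crash-bandicoot.jpg
git commit -q -m "Added Crash Bandicoot for fun."

# Remember the object ID of the image's contents.
blob=$(git rev-parse HEAD:crash-bandicoot.jpg)

# Delete the image in a new commit, then prune as hard as possible.
git rm -q crash-bandicoot.jpg
git commit -q -m "Remove Crash Bandicoot"
git reflog expire --expire=now --all
git gc -q --prune=now

# The blob survives because the earlier commit still references it.
if git cat-file -e "$blob"; then
    echo "image blob is still stored in the repository"
fi
```

Even after an aggressive prune the blob remains, because it is still reachable through the commit that added it.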

To truly remove the file, the original commit of the image needs to be edited. Get the ID of the commit that adds the image and run the following:

$ git rebase -i <id>^

When presented with the list of commits that will be rebased, change pick to edit for the commit with the image:

edit ec79600 Added Crash Bandicoot for fun.
pick 60bee69 Even more changes

Once the rebase has begun the file can be removed. When removing a file from Git, run git rm <path> rather than rm <path>. In this case, it would be git rm crash-bandicoot.jpg.

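Then amend the commit via git commit --amend and complete the rebase with git rebase --continue. If the image was the only change in that commit, the amended commit would be empty and Git will refuse to create it unless --allow-empty is passed.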

Once the rebase is complete, running git count-objects -vH will reveal…

count: 1
size: 4.00 KiB
in-pack: 31
packs: 1
size-pack: 1.59 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

…that it did nothing? Worry not, it has done something, and anyone who clones this repository won't be hefting around the unnecessary data of a digital Crash Bandicoot.

The size is still reported as over 1MB because the old objects are still stored locally, kept alive by the reflog. They can be removed by running the following command, where --all applies the expiry to the reflog of every reference:

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

I have written a separate post that explains this command. It should be noted that while this command shouldn't modify commit history, it is still a good idea to back up the local repository to be safe.

Running a count on the objects again will now report:

count: 0
size: 0 bytes
in-pack: 13
packs: 1
size-pack: 2.59 KiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

Congratulations, the large image has been removed and you now have a strong case for firing the person who did this.
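For anyone who wants to confirm the whole sequence without risking a real repository, here is a hedged sketch that rewrites the commit and then checks whether the image's blob still exists before and after the cleanup. The random 200 kB file, the extra credits.txt file (added so the amended commit is not empty) and the sed-based GIT_SEQUENCE_EDITOR are assumptions for illustration.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "readme" > README
git add README
git commit -q -m "Initial commit"

# Hypothetical stand-in for the image, committed with one other file.
head -c 200000 /dev/urandom > crash-bandicoot.jpg
echo "Artwork: unknown" > credits.txt
git add crash-bandicoot.jpg credits.txt
git commit -q -m "Added Crash Bandicoot for fun."
blob=$(git rev-parse HEAD:crash-bandicoot.jpg)

echo "more" > more.txt
git add more.txt
git commit -q -m "Even more changes"

# Edit the commit that added the image and drop the file from it.
img=$(git log --format=%H --diff-filter=A -- crash-bandicoot.jpg)
GIT_SEQUENCE_EDITOR="sed -i.bak '1s/^pick/edit/'" git rebase -i "$img^"
git rm -q crash-bandicoot.jpg
git commit -q --amend --no-edit
GIT_EDITOR=true git rebase --continue

# The rewritten history no longer references the blob, but it is still on disk.
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Expire every reflog entry, then prune the unreachable objects.
git reflog expire --expire=now --all
git gc -q --prune=now --aggressive

if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before, after cleanup: $after"
```

The blob should be reported as present before the cleanup and gone afterwards, which matches the drop in size-pack shown above.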

Using filter-branch to remove a file

Rebasing can be unwieldy on larger repositories with multiple branches. While rebase is fine for single branch scenarios, removing a file on a repository with various branches is too much work.

This is where Git's filter-branch command comes into play. The command can rewrite every branch in a single run, greatly simplifying the process.

A tutorial on using filter-branch for file removal can be found in Pro Git (Chacon and Straub 462), which has also been made freely available.

To use filter-branch:

locate the original commit of the file to remove;
run filter-branch and use git rm to remove the file; and
clean up the local repository's reflog and loose objects.

Displaying all commits associated with a file is done via git log --oneline --branches <path>.

The last record in the log is the commit that introduces the file. Take the ID and, as explained in the tutorial, run:

$ git filter-branch --index-filter 'git rm --ignore-unmatch --cached <path>' -- <id>^..

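Where <path> is the file to remove and <id> is the ID of the commit located using the earlier log command. The range <id>^.. limits the rewrite to the commits between <id> and the tip of the current branch; passing --all in its place rewrites every branch instead.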
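As a rough illustration of the first two steps, the sketch below builds a throwaway repository, locates the commit that introduced a file and runs filter-branch over the range that follows it. The file name big.bin and the commit messages are hypothetical, and FILTER_BRANCH_SQUELCH_WARNING is set only to silence the deprecation notice that recent Git versions print before filter-branch starts.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "readme" > README
git add README
git commit -q -m "Initial commit"

# Hypothetical large file: 200 kB of random bytes.
head -c 200000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Add a large file"

echo "more" > more.txt
git add more.txt
git commit -q -m "Even more changes"

path=big.bin

# The last record in the log is the commit that introduced the file.
id=$(git log --format=%H --branches -- "$path" | tail -n 1)

# Remove the file from the index of every commit in the range.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
    "git rm --ignore-unmatch --cached $path" -- "$id^.." >/dev/null

# Count the commits on the rewritten branch that still touch the file.
remaining=$(git log --format=%H -- "$path" | wc -l | tr -d ' ')
echo "commits still touching $path: $remaining"
```

The sketch stops before the cleanup step. filter-branch keeps a backup of the original history under refs/original, so that backup has to be deleted as well as the reflog before the old objects can be pruned.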

The downside to filter-branch is that it can become very slow on large repositories, to the point of taking at least five minutes. I experienced this first hand, which led me to use a third-party tool.

Using BFG Repo-Cleaner to remove a file

The BFG Repo-Cleaner is an open source project by Roberto Tyley that aims to significantly speed up file and information removal in Git.

The project's documentation has clear usage instructions, and I strongly recommend checking it out if your goal is to remove very large or specific files.

Unfortunately, BFG offers no way of removing files by path; files can only be deleted based on their name. I encountered a situation involving two large files with the same name, only one of which was being used. I had to remove these myself via filter-branch.

That being said, I was able to remove at least 90% of the problematic files in a repository in a fraction of the time that filter-branch would have taken, so I was pretty chuffed.

Updating a remote repository

When the branches on a local and remote have diverged, the local branches must be pushed in a way that overrides the remote. This is done with git push --force.

Any existing clones will need to be removed and cloned again to receive the updates.

In a team environment, it may not be this simple if people have been actively working on the repository and adding commits. Anyone with a clone of the old history can bring it in line with the remote while preserving their work by:

stashing any changes via git stash;
fetching the rewritten history via git fetch; and
checking out the branch to be updated and running git reset --hard origin/<branch>.

This will replace the local branch with the updated remote while leaving all other branches intact. Repeat this process for every remote branch as required.

If working on a feature branch that has branched from a remote that has been updated, synchronise the feature branch like so:

$ git stash
$ git checkout <feature-branch>
$ git rebase <base-branch>
$ git stash apply

A real-world scenario would look like…

$ git stash
$ git checkout develop
$ git fetch origin
$ git reset --hard origin/develop
$ git checkout feature/some-work
$ git rebase develop
$ git stash apply

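…where the end result will be feature/some-work branching off of the new develop branch. Note that a plain rebase replays every commit on the feature branch that is not in the new develop, which can include the old copies of rewritten commits. If that happens, git rebase --onto develop <old-develop> feature/some-work restricts the replay to the feature's own commits, where <old-develop> is the pre-rewrite commit the feature branched from.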
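The force push and the clean-up on a teammate's machine can be rehearsed locally with a bare repository standing in for the remote. This is only a sketch under assumptions: the names alice, bob and remote.git, the file contents, and the use of a single amended commit as the "rewritten history" are all made up for illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git

# Alice publishes two commits on develop.
git clone -q remote.git alice 2>/dev/null
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
git checkout -q -b develop
echo "one" > a.txt
git add a.txt
git commit -q -m "First"
echo "secret" > b.txt
git add b.txt
git commit -q -m "Second"
git push -q origin develop

# Bob clones the repository and starts some uncommitted work.
cd "$work"
git clone -q -b develop remote.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "wip" >> a.txt

# Alice rewrites the last commit and overrides the remote branch.
cd "$work/alice"
echo "redacted" > b.txt
git commit -q -a --amend --no-edit
git push -q --force origin develop

# Bob stashes his work, fetches, resets to the rewritten branch and
# restores his changes on top of it.
cd "$work/bob"
git stash -q
git fetch -q origin
git checkout -q develop
git reset -q --hard origin/develop
git stash apply -q

echo "bob's develop: $(git rev-parse --short develop)"
echo "remote develop: $(git rev-parse --short origin/develop)"
```

After the reset, Bob's develop and origin/develop should point at the same commit, and his uncommitted change to a.txt is back in the working tree.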

Works cited

Chacon, Scott, and Straub, Ben. "7.6 Git Tools - Rewriting History." Pro Git. 2nd ed., Apress, 2014, p. 242.

Chacon, Scott, and Straub, Ben. "10.7 Git Internals - Maintenance and Data Recovery." Pro Git. 2nd ed., Apress, 2014, p. 462.

Tyley, Roberto. "BFG Repo-Cleaner." https://github.com/rtyley/bfg-repo-cleaner/. Accessed 25 February 2018.