Too Many Unclear Terms
The Staging Index Example

Table of Contents

Introduction
The Three Official Terms for Staging Index (+1 Unofficial Term)
Index
Cache
Staging Area
Staging Index
Why is "Staging Index" Needed?
The Three Trees
Purpose of the Staging Index / Staging Area
How Git Documentation Defines "Staging Index"
The Man Pages
Pro Git
Index and Staging Area
Cache
Glossary
Additional Comment About Staging Area vs Staging Index
Historical Context
Conclusion
My Personal Opinions About Fixing the Staging Index Term

Introduction

Git seems unable to properly define or correctly explain its terms. There are so many examples to choose from that I can't possibly cover them all, and nearly all Git terms seem to suffer from similar issues. In this section, I've decided to use the "staging index" as an example and fully explain why this term is unclear. I want to re-emphasize: I'm picking on a single term, but most terms have plenty of issues.

The other sections I've written (Is a Staging Area Part of a Repository?, Demystifying Remotes, What is "Tracking"?) focus on different issues, but you can see that those problems are also heavily influenced by bad terminology. ("Remote" and "Tracking" are such unclear terms that they needed their own sections.)

In my opinion, the only way to fix Git documentation is to start by fixing the official Git glossary. To fix the Git glossary, one must understand the problem. In this section, "Too Many Unclear Terms", I will take the one example of "staging index" and dissect it so that the problem of Git terminology can be understood in its entirety.

The Three Official Terms for Staging Index (+1 Unofficial Term)

While doing research, I found three official terms that are supposedly the same: index, cache, and staging area. That's two terms too many. In good technical documentation, a term should have no synonyms.

An interesting fourth term came from outside the official Git documentation: staging index. This supposedly also means the same as "index" / "cache" / "staging area".

And, occasionally, people say "stage the updated files" which is sort of like a fifth term.

If what I found is true, then these phrases should mean the exact same thing:

Just by looking at these phrases, we should be able to mostly figure out what is done with updated files, but it's actually tough to figure out.

For instance, if we update files "to the index" or "to the staging area", are all updated user files stored in a single Git file somewhere in the Git directory? (Hint: No, they are not.)

Is this "index" a list of pointers or references that point to updated files in the user directory? (Hint: No, it is not and does not.)
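One way to probe both questions is to look at what the index actually records. The sketch below is only an illustration: it uses a throwaway repository in a temporary directory, and the file name notes.txt is made up. It asks Git to list the index entries with git ls-files --stage. Each entry pairs a path with the ID of a blob object kept inside the Git directory, so the index neither holds the file contents itself nor points back into the user directory.

```shell
# Throwaway repository; every name here is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "first draft" > notes.txt
git add notes.txt

# List the index entries: <mode> <blob id> <stage number> <path>
entry=$(git ls-files --stage)
echo "$entry"

# The blob id in the entry names an object in Git's object database.
blob=$(echo "$entry" | awk '{print $2}')
git cat-file -t "$blob"      # prints the object type: blob
git cat-file -p "$blob"      # prints the stored contents: first draft
```

The entry contains an object ID rather than the text of notes.txt, which is why neither "one big file of user files" nor "pointers into the user directory" describes it well.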

To be clear: I'm not stating that these five terms are synonyms. I'm saying the documentation is very unclear. How I came to the conclusion that these terms might be synonyms is complex, and the sources often contradict one another. I will link to Git documentation that supports everything I've said, but for now I'd like to start with the surface reasons why I don't like the official terms used by Git documentation.

Index

I don't like the way Git uses the term "index" because it differs from the usual definition of index. A good (and typical) definition of "index" is as follows:

A list of entries. Each entry points (or links) to another entry, another index, a file, an object, or a variety of other things.

The above definition is one I made up, but it's similar to the one found at techterms.com and on many other IT-oriented sites and in many books. The term "index" is very broad and covers a lot of uses, from hashes to database indexes. In Git, commit objects and annotated tags contain indexes but are themselves not indexes.

Git has defined "index" to mean... well, actually, the Git documentation may define the word in its glossary, but that definition isn't consistent with how the documentation uses "index". For now, I'll simply say it has something to do with the staging process. I'll get to the details of the Git documentation and how it uses the term shortly. (As I said a moment ago, I don't want to list the Git documentation that backs up this statement yet because it gets lengthy. For now, I'm just listing why I don't like the official terms used by Git documentation.)

Out of the five possible terms I listed, "index" seems to be the term preferred by the official Git documentation.

To reduce confusion, I'll use the term "staging index" in my documentation where possible. That way, the reader won't confuse the generic term "index" (found outside of Git) with the more specific idea of a "staging index" (found within Git).

Cache

I also don't like the term "cache" because it's too ambiguous. Yes, Merriam-Webster says a "cache" is something that is short-lived or frequently used, and a staging index in Git is often short-lived, but cache actually has a very specific meaning in the IT world. The Techopedia definition of cache is better:

A cache, in computing, is a data storing technique that provides the ability to access data or files at a higher speed... Caching serves as an intermediary component between the primary storage appliance and the recipient hardware or software device to reduce the latency in data access.

The purpose of the staging index is not to reduce the latency in data access, but to gather changes for the next commit. Git's usage of the term "cache" is poor.

According to the Git glossary, the term "cache" is actually outdated and isn't found much in the Git documentation, but the term "cache" is used quite often on the command line when doing things to the "index". (In other words, the term "cache" has been mostly phased out of the documentation, but not on the command line.)

Staging Area

Finally, I don't like the term "staging area" to represent the same thing as an "index". The English word area refers to a region, range, or section reserved for a specific function -- which does not describe the general idea of an index or a pointer.

In a moment, I'll show a difference between the Git "staging index" and the Git "staging area", despite Git defining them to be the same. (I'll provide many links with a long explanation in a moment that back up my claims. Stay with me.)

Although it doesn't appear to be the preferred term like "index", the term "staging area" seems to be used often in official Git documentation.

Staging Index

Instead of using the term "index", a lot of the internet seems to have settled on the term "staging index". (Atlassian, w3docs.com, and GeeksforGeeks are but a few.)

I like this term, and I think this should be the official term -- so long as "staging index" and "staging area" are not the same thing.

Note: As mentioned above, I will follow internet convention and write "staging index" instead of just "index" in this documentation, where possible, for clarity's sake.

Why is "Staging Index" Needed?

The Three Trees

In my glossary, the definition of the term three trees explains in a little more detail the three stages a file goes through, but basically, a user file:

  1. Is worked on by the user in the working tree (which is where the operating system and IDEs have direct access to the user files)
  2. Is then copied into the staging index / staging area; this is where changes are gathered before being committed to the repository
  3. Is stored in the commit history as a commit

The "three trees" are officially defined in Pro Git, Chapter 7.7, but oddly are first described as the "three states" in Pro Git, Chapter 1.3.
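The three steps listed above can be followed from the command line. This is a minimal sketch under assumptions: the repository is a throwaway one in a temporary directory, and the file name greeting.txt and the committer identity are invented for the example. After each step, git status --porcelain reports where the file currently stands.

```shell
# Throwaway repository; every name here is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# 1. Working tree: the file exists only as an ordinary file.
echo "hello" > greeting.txt
step1=$(git status --porcelain)      # "?? greeting.txt" (untracked)

# 2. Staging index: git add records the file for the next commit.
git add greeting.txt
step2=$(git status --porcelain)      # "A  greeting.txt" (staged addition)

# 3. Commit history: the staged snapshot becomes a commit.
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add greeting"
step3=$(git status --porcelain)      # empty: nothing left to stage or commit

printf '%s\n' "$step1" "$step2" "[$step3]"
```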

Purpose of the Staging Index / Staging Area

The staging index / staging area represents the delta between the last commit in the local repository and the new commit that the Git user wants to create. This includes not just new files or changed files, but deleted files too.
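As a rough illustration of that delta, the sketch below builds a throwaway repository (the file names and committer identity are invented for this example), stages one modified file, one new file, and one deleted file, and then asks Git to compare the index with the last commit. Strictly speaking, the index keeps an entry for every tracked file rather than only the changed ones; the delta is what falls out of that comparison.

```shell
# Throwaway repository; every name here is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Baseline commit with two files.
echo "keep" > keep.txt
echo "old"  > remove.txt
git add keep.txt remove.txt
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Baseline"

# Stage one change of each kind.
echo "changed" >> keep.txt && git add keep.txt   # modified file
echo "new" > added.txt     && git add added.txt  # new file
git rm -q remove.txt                             # deleted file

# Compare the index against the last commit.
delta=$(git diff --cached --name-status)
echo "$delta"
```

The status letters in the output are A (added), M (modified), and D (deleted), so a staged deletion shows up alongside the other two kinds of change.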

I don't know the exact philosophy behind making the staging index / staging area part of the Git flow, but I'm sure it has to do with making sure the delta between two commits is exact.

How Git Documentation Defines "Staging Index"

The Man Pages

Of course, the purpose of the official Git man pages isn't to explicitly define the term "index" or its synonyms, but how often the man pages use "index" shows that it is their term of choice.

Unfortunately, there is something very confusing lurking between the Git application and the man pages. For instance, the man page for git rm explicitly uses the term index, but the command itself uses the term "cached" to refer to the staging index.

And while the documentation talks only about "index" and the command uses "cached", there is nothing in this documentation that says they are synonyms.

Now that is confusing.
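To make the mismatch concrete, here is a small sketch in a throwaway repository (the setup is invented for illustration; README is the same file name Pro Git uses). The option is spelled --cached, yet the only thing that changes is the list of paths in the index, which git ls-files prints.

```shell
# Throwaway repository; the setup is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "docs" > README
git add README
before=$(git ls-files)          # paths currently in the index

# The option says "cached", but what it modifies is the index.
git rm -q --cached README
after=$(git ls-files)           # README is no longer in the index

ls README                       # the file is still in the working tree
echo "index before: [$before], index after: [$after]"
```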

Pro Git

Index and Staging Area

The Pro Git book uses both terms "index" and "staging area", and it seems to be well documented that "index" and "staging area" are the same thing.

In Chapter 1.3, it states:

[The] technical name in Git parlance is the “index”, but the phrase “staging area” works just as well.

Chapter 7.7 also states:

...you can try changes out before committing them to your staging area (index)...

However, when we compare these two chapters, we find these two terms are used very differently:

One chapter uses "index" more and the other chapter uses "staging area" more because they are actually talking about two different things.

Chapter 1.3 focuses on a basic overview of committing user files to Git. Chapter 7.7 focuses on the indexes that point to user files. This shows that "staging area" and "staging index" should be considered two different things and that they are not synonyms.

Cache

Pro Git, Chapter 2.2 does show the word "cached" being used as a parameter for staging purposes. Specifically, it almost states that "staging area" and "cache" are one and the same when talking about the git rm command:

Another useful thing you may want to do is to keep the file in your working tree but remove it from your staging area. In other words, you may want to keep the file on your hard drive but not have Git track it anymore... To do this, use the --cached option:

$ git rm --cached README

Elsewhere in Chapter 2.2, it states that the parameters --staged and --cached of the git diff command are synonyms:

...git diff --cached to see what you’ve staged so far (--staged and --cached are synonyms):
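That claim is easy to check. The sketch below (a throwaway repository with an invented file name and committer identity) stages a change and captures the output of both spellings of the option so they can be compared.

```shell
# Throwaway repository; every name here is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "one" > file.txt
git add file.txt
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Baseline"

# Stage a change so there is something to diff.
echo "two" >> file.txt
git add file.txt

cached=$(git diff --cached)
staged=$(git diff --staged)

if [ "$cached" = "$staged" ]; then
    echo "identical output"
fi
```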

Glossary

As mentioned before, the Git glossary says cache is an outdated term for index. This is the only place I've found in the Git glossary that talks about "cache" with any meaning.

Frustratingly, the glossary is not only missing the term staging area, but makes no mention of it at all.

Additional Comment About Staging Area vs Staging Index

When a person encounters the terms "staging area" and "staging index" for the first time, the words alone sound like they are two different things. As pointed out above, the Pro Git documentation says they are synonyms, but then goes on to describe different things.

There is something else that bothers me greatly.

If we look at Pro Git, Chapter 1.3, it says not only that the "staging area" and the staging index are the same, but also that they are supposedly stored in a single file:

The staging area is a file, generally contained in your Git directory, that stores information about what will go into your next commit. Its technical name in Git parlance is the “index”, but the phrase “staging area” works just as well.

Adding to this confusion, Figure 6 in Pro Git Chapter 1.3 makes it look like staged files are never copied to the Git folder. (Hint: They are copied, but the process is much more complicated than a straightforward copy.)

This is one of many examples where it became clear to me this surface information provided in the Git documentation is incorrect and will only confuse people who don't yet understand Git.
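The hint about copying can be checked directly. In this sketch (a throwaway repository; the file name example.txt is made up), git hash-object computes the ID the file's contents would get as a Git object, and git cat-file -e tests whether an object with that ID exists in the Git directory before and after git add.

```shell
# Throwaway repository; every name here is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "staged content" > example.txt

# Compute the object ID for this content without storing anything.
id=$(git hash-object example.txt)

# Does an object with that ID exist yet?
git cat-file -e "$id" 2>/dev/null && before="present" || before="absent"

git add example.txt

git cat-file -e "$id" 2>/dev/null && after="present" || after="absent"

echo "before git add: $before, after git add: $after"
git cat-file -p "$id"      # prints the stored copy: staged content
```

So staging does place a copy of the contents inside the Git directory as an object, while the index itself only records the path and the object's ID.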

There is another, worse definition in the About Staging Area section of Pro Git:

Unlike the other systems, Git has something called the "staging area" or "index". This is an intermediate area where commits can be formatted and reviewed before completing the commit... This allows you to stage only portions of a modified file. Gone are the days of making two logically unrelated modifications to a file before you realized that you forgot to commit one of them. Now you can just stage the change you need for the current commit and stage the other change for the next commit.

So, this seems confusing. Is the staging area an actual index? Or is it a reference / pointer? Or is it an area where "portions of a modified file" are located? Is the staging area found inside a Git repository, or is it found somewhere outside of the repository that is accessible through the operating system's file system? Official definitions and explanations leave all these questions unanswered.

This is the reason why I came up with my glossary. My glossary can't fix all these problems with Git terminology, but it sheds a lot of light on answering these types of questions.

One question I could never get my glossary to answer was whether a staging area exists inside or outside of the repository. I look at this specific problem in Is a Staging Area Part of a Repository?

Historical Context

At about the same time I published my glossary and this article, Felipe Contreras released a blog article that goes into some of the history of the term "staging area" and how he tried to submit patches for Git that specifically targeted the staging area problem. It's a long but interesting read.

Conclusion

I've talked in detail about staging indexes / staging areas, but this is merely one of many examples of poor Git terminology. Nearly all the vocabulary in Git is poorly defined and suffers from poor explanations like this. It would require too much time to go through every example. Nevertheless, the rest of this page is dedicated to showing how the vocabulary can be improved.

Fixing the problem won't be easy. The first thing that needs to happen is to come up with a clear, reliable, and comprehensive Git glossary for the everyday user. The current Git glossary does not meet my standards. My glossary could never be adopted as the official glossary because its purpose is to clarify and explain official Git terminology (i.e., it's an addendum, not a replacement).

Without a reliable and consistent glossary, fixing the official documentation can't happen. A lack of a good glossary is one of the major reasons why so many unofficial websites and the official Git documentation wrongly describe what happens in Git.

I recognize that my glossary is probably incomplete and can probably be improved upon by people better than myself. I invite others to use my glossary as a starting point.

A Personal Note: If the Git glossary is going to be revamped, the definitions must be simple for new people to grasp. Perhaps, two definitions could be given for terms that are more complex: one definition for the porcelain and one for the plumbing.

My Personal Opinions About Fixing the Staging Index Term

I have some opinions I'd like to share about how to fix the problem of "staging index". I acknowledge that others may come up with better ideas.