
How Is Text Compressed?

 



🎯 What Does “Text Compression” Mean?

Text compression means reducing the number of bits needed to store or transmit text without losing information (or, in rare cases, with an acceptable amount of loss).

Imagine you have a book with 1 million characters. Instead of storing each character in a fixed 8 bits (1 byte), you can find patterns to represent the same content with fewer bits.


🌱 Two Main Types of Text Compression

  1. Lossless Compression

    • No information is lost.

    • You can always reconstruct the exact original text.

    • Used for text files, code, documents.

    • Examples: ZIP and GZIP for files, PNG for images.

  2. Lossy Compression (rare for plain text)

    • Some information is discarded.

    • Not typically used for text because even a small change may change meaning.

    • More common in audio/video compression.


⚙️ How Does Lossless Text Compression Work?

There are many methods, but the main idea is to find and remove redundancy: repeated patterns, common letters, or recurring sequences.

Here are three popular techniques:


1️⃣ Statistical Coding (Entropy Coding)

  • Some letters or words appear more often (like "e", "the").

  • Assign shorter codes to frequent symbols, longer codes to rare ones.

  • Example: Huffman Coding

    • It builds a binary tree from the symbol frequencies; each symbol's code is the path from the root to its leaf.

    • More frequent symbols are closer to the root (short codes).

    • Less frequent are further away (long codes).

    • So "e" could be 2 bits, "q" could be 6 bits.

  • Analogy: Like giving nicknames to people you mention often.
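
The Huffman idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the function names (huffman_codes, huffman_encode, huffman_decode) are made up for this example, and the bits are kept as a string of "0" and "1" characters for readability rather than packed into bytes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Edge case: a single distinct symbol still needs a 1-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols move one level deeper.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def huffman_encode(text, codes):
    return "".join(codes[ch] for ch in text)

def huffman_decode(bits, codes):
    # No code is a prefix of another, so the first match is always correct.
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_codes(text)
bits = huffman_encode(text, codes)
print(codes)
print(len(bits), "bits instead of", 8 * len(text))
```

For "abracadabra" the frequent "a" gets a 1-bit code and the other letters get 3-bit codes, so the message takes 23 bits instead of 88. A real file would also need to store the code table so the decoder can rebuild it.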


2️⃣ Dictionary-Based Compression

  • Replace repeated strings with references.

  • Example: Lempel-Ziv (LZ77, LZ78, LZW)

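    • Scan through the text and keep track of phrases already seen: LZ77 looks back through a sliding window of recent text, while LZ78 and LZW build an explicit dictionary.

    • When a phrase repeats, store a short reference (a back-pointer or a dictionary index) instead of the phrase itself.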

  • For instance:

    • Original: ABRACADABRA ABRACADABRA

    • Compressed: ABRACADABRA <pointer to earlier occurrence>

  • ZIP (via DEFLATE, which combines LZ77 with Huffman coding) and GIF (via LZW) use this approach.
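
To make the "pointer to an earlier occurrence" idea concrete, here is a rough LZ77-style sketch in Python. It is a simplified teaching version under stated assumptions: the function names and the (offset, length, next_char) token layout are choices made for this example, the search is a slow brute-force scan, and real formats pack tokens into bits instead of keeping them as tuples.

```python
def lz77_compress(data, window=255):
    """Greedy LZ77-style tokenizer: emits (offset, length, next_char) triples."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Try every start position in the window of already-seen text.
        for j in range(max(0, i - window), i):
            length = 0
            # Extend the match, always leaving one character for next_char.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens

def lz77_decompress(tokens):
    out = []
    for offset, length, ch in tokens:
        # Copy one character at a time so overlapping matches work.
        for _ in range(length):
            out.append(out[-offset])
        out.append(ch)
    return "".join(out)

text = "ABRACADABRA ABRACADABRA"
tokens = lz77_compress(text)
print(tokens)
assert lz77_decompress(tokens) == text
```

On this input the second ABRACADABRA collapses into a single token that says "go back 12 characters, copy 10, then emit A", which is the pointer from the example above.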


3️⃣ Run-Length Encoding (RLE)

  • When a character repeats many times in a row, store it as:

    • (character + count)

  • Example:

    • AAAAAA becomes A6

  • Very efficient for data with long runs of one character (for example, padding spaces); ordinary prose rarely has such runs, so RLE alone can make it larger.


✏️ Example of Simple Compression

Imagine this text:

BBBBCCCCDDDDDDDD

Run-Length Encoding:

  • B4 C4 D8
    (meaning 4 B’s, 4 C’s, 8 D’s)
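
The same run-length encoding can be written as a short Python sketch. The function names here are illustrative, and the encoder returns (character, count) pairs rather than a string like "B4", since a plain string becomes ambiguous once the text itself contains digits.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original text."""
    return "".join(ch * count for ch, count in pairs)

def rle_format(pairs):
    """Render pairs in the human-readable style used above."""
    return " ".join(f"{ch}{count}" for ch, count in pairs)

pairs = rle_encode("BBBBCCCCDDDDDDDD")
print(rle_format(pairs))  # B4 C4 D8
assert rle_decode(pairs) == "BBBBCCCCDDDDDDDD"
```

Note that a string with no runs, such as "ABC", encodes to three pairs with a count of 1 each, which is longer than the input.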


🧠 Why Does Compression Work?

Because:

  • Language is redundant (we repeat patterns).

  • Some letters/words are far more common.

  • Text often contains repeated phrases or structures.
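
One way to put a number on this redundancy is Shannon entropy, which estimates the average bits per character a symbol-by-symbol coder (such as Huffman) needs when it only looks at character frequencies. The sketch below is an illustration with a made-up function name; it ignores repeated phrases, so dictionary methods can do better than this figure suggests.

```python
import math
from collections import Counter

def entropy_bits_per_char(text):
    """Order-0 Shannon entropy: average bits per character from frequencies alone."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "the rain in spain stays mainly in the plain"
print(round(entropy_bits_per_char(sample), 2), "bits per character vs. 8 in plain bytes")
```

A text made of one repeated character scores 0 bits per character, while four equally likely characters score exactly 2. The more uneven the frequencies, the lower the score and the more there is to gain.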


🛠️ What Happens When You Open a Compressed File?

  • The decompression program (the decoder) reads the compressed data.

  • It reconstructs the exact original text by reversing the coding, using either a code table stored in the file or a dictionary it rebuilds as it decodes.

  • This is called decompression.
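
A quick way to see the full round trip is Python's standard zlib module, which implements DEFLATE (LZ77 plus Huffman coding). The sample text is invented for this example and is deliberately repetitive so the size difference is easy to see.

```python
import zlib

# Repetitive sample text, chosen only for illustration.
original = ("the quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original.
assert restored == original
print(len(original), "bytes ->", len(compressed), "bytes")
```

The exact compressed size depends on the zlib version and compression level, but the restored bytes always match the input.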


In a nutshell:
Text compression is about spotting patterns and representing them more efficiently.