Close
0%
0%

Lossy Text Compression

"And God saw that it was ace."

Similar projects worth following
Lossless compression techniques can only take you so far. Modern lossy compression techniques, like those used in JPEG or MP3, are able to achieve such high ratios because they permanently discard information. Perceptual coding algorithms try to make changes that will go unnoticed by humans.

Let's apply similar coding techniques to traditional text files. Where a typical Lempel-Ziv employs a dictionary, we will instead use a thesaurus.

The lossy text compressor consists of a Perl script and accompanying thesaurus. On startup, the script opens the thesaurus file, and constructs a hash of "word" to "shortest synonym of that word". Some words are filtered out, e.g. abbreviations and 2-letter synonyms, because those don't meet quality standards.

After this is simple substitution of every word in the supplied text. This should result in a reduction of file size, while still being readable to most people. Further compression levels can be reached by running the compressor repeatedly - but note that text degradation ("generational loss") can result in an unreadable document.

The thesaurus format used is that of the Moby thesaurus, found here: http://icon.shef.ac.uk/Moby/mthes.html

A companion decompressor script works similarly but uses the longest synonym in place of the shortest.

  • 1 × Perl
  • 1 × Thesaurus
  • 1 × Sense of humor

  • Presentation

    Greg Kennedy06/04/2015 at 16:07 0 comments

    I gave a completely straight-faced presentation on this project at my workplace's recent Hackathon. While it didn't win, it got a lot of laughs and I was allowed to eat the free food.

    The Powerpoint is available for viewing and download here:

    https://docs.google.com/presentation/d/1SQVki8DowvFrE0uJzS9P8PGjrUsBK78RScvV5eAotGs/edit?usp=sharing

  • Source Code

    Greg Kennedy05/08/2015 at 16:46 0 comments

    No idea where to put source code on this site, so I'll just make a project log of it.

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    ### LOSSY TEXT COMPESSION
    # Conventional compression utilities rely on a "dictionary" to expand
    #   indices into complete tokens.
    # This compressor instead uses a thesaurus to minimize output file size.
    my %thes;
    
    {
      # Open moby thesaurus.
      # You may substitute your own thesaurus file.  The format is
      #  word,synonym1,synonym2,...\r
      open (T, '<', 'mthes/mobythes.aur') or die "Couldn't open thesaurus: $!\n";
    
      # Moby thesaurus is CR-terminated
      local $/ = "\r";
    
      # Read line at a time
      while (my $line = <T>)
      {
        chomp $line;
        # Pull keyword and then all synonyms.
        if ($line =~ m/^([^,]+),(.+)$/)
        {
          my @syns = split(/,/, $2);
          # identify shortest syn
          my $best_syn = $1;
          foreach my $syn (@syns)
          {
            # Straight abbreviations aren't very funny
            next if ($syn eq uc($syn));
            # Two letter synonyms aren't readable
            next if (length($syn) < 3);
            # Optionally, filter stupidly overcommon synonyms
            #next if ($syn eq 'air' || $syn eq 'bed');
            if (length($syn) < length($best_syn)) { $best_syn = $syn; }
          }
    
          # Save RAM: store syn only if it's better than the keyword
          if ($best_syn ne $1)
          {
            $thes{$1} = $best_syn;
          }
        } else {
          # Your thesaurus is broken.
          die "%% $line\n";
        }
      }
      close (T);
    }
    
    # Some functions
    sub try_lookup($)
    {
      my $tok = shift;
      # case-sensitive lookup first
      if (exists $thes{$tok}) { return $thes{$tok}; }
      # case-insensitive lookup
      elsif (exists $thes{lc($tok)})
      {
        # Retrieve the lookup match
        my $ret = $thes{lc($tok)};
    
        # First letter was lowercase already
        if (substr($tok,0,1) ne uc(substr($tok,0,1))) { return $ret; }
        # Correct capitalization on substituted first letter
        return uc(substr($ret,0,1)) . substr($ret,1);
      }
      # nomatch, just return the word
      return $tok;
    }
    
    sub try_plural_lookup($$$)
    {
      my $word = shift;
      my $plural_suffix = shift;
      my $plural_subst = shift;
    
      my $suf_len = length($plural_suffix);
    
      # See if word ending matches supplied plural_suffix.
      if (substr($word,-$suf_len) eq $plural_suffix) {
        # Suffix appears to match.  Create test tok by removing suffix
        #  and putting plural_subst on (e.g. piracIES -> piracY)
        my $tok = substr($word,0,length($word) - $suf_len) . $plural_subst;
    
        my $test = try_lookup($tok);
        # A smarter compressor would know the correct ending
        #  Then again, 's' is smaller than 'es' or 'ies' etc
        if ($test ne $tok) { return $test . 's'; }
      }
      return $word;
    }
    
    # Great, thesaurus populated, now read from stdin
    while (my $line = <STDIN>)
    {
      chomp $line;
      # Split into words
      my @tokens = split(/\s+/, $line);
    
      my @result;
      foreach my $token (@tokens)
      {
        # Take apart the token - remove leading and trailing quotes, punct, etc
        if ($token =~ m/^(.*?)([\w\d-]+)(.*?)$/)
        {
          my $tok = $2;
    
          # Substitution rules.
          $tok = try_lookup($tok);
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'ies','y'); }
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'es',''); }
          if ($tok eq $2) { $tok = try_plural_lookup($tok,'s',''); }
    
          push(@result, $1 . $tok . $3);
        } else {
          # Uncompressible punctuation or something
          push(@result, $token);
        }
      }
      print join(' ', @result) . "\n";
    }
    

  • Sample Decompressor Output - U.S. Declaration of Independence

    Greg Kennedy05/08/2015 at 16:33 0 comments

    Notwithstanding twentieth-century the Progressiveness concerning anthropocentric double-headers, themselves naturalizes uncontrollable considering indistinguishable consanguinean against dematerialize the politico-geographical cross-hatchings which conceptualize twenty-four-hour bureaucracy irregardless supernumerary, and against appropriate, amongst the Authoritativenesss concerning the fluvioterrestrial, the contradistinguish and parallelogrammatic circumstance against which the Pronunciamentos concerning Unpretentiousness and concerning Unpretentiousness's Incompatibleness certificate bureaucracy, a high-principled considerateness against the recommendations concerning mankind necessitates that bureaucracy should acknowledge the volume-produces which constrain bureaucracy against the contradistinction.

    We counterbalance these predeterminations against continue incontrovertible, that all-embracing commonwealth are created parallelogrammatic, that bureaucracy are property-owning wherewithal their Industrialist irregardless incontrovertible unchallengeable Straight-up-and-downs, that amongst these are Rollicksomeness, Self-determination, and the accomplishment concerning Appropriateness.

  • Sample Text - Genesis 1:1-10

    Greg Kennedy05/08/2015 at 16:06 0 comments

    Dawn 1 New Civic Body (NIV)
    The Day

    1 In the day God created the airs and the air. 2 Now the air was hazy and dry, fog was ago the jet of the low, and the Aim of God was hovering ago the airs.

    3 And God said, "Let there be bay," and there was bay. 4 God saw that the bay was ace, and he far the bay for the fog. 5 God called the bay "day," and the fog he called "ink." And there was eve, and there was morning—the key day.

    6 And God said, "Let there be a air between the airs to cut air for air." 7 So God made the air and far the air low the air for the air too it. And it was so. 8 God called the air "sky." And there was eve, and there was morning—the aid day.

    9 And God said, "Let the air low the sky be cast to one eye, and let dry bed act." And it was so. 10 God called the dry bed "bag," and the cast airs he called "seas." And God saw that it was ace.

  • Sample Output - Moby Dick

    Greg Kennedy05/08/2015 at 16:04 0 comments

    Arm 1. Nears.

    Ask me exile. Any days ago—never aim how age aye—having ace or no fat in my bag, and dud hap to aye me on hem, I bit I would fly back a ace and see the airy bit of the all. It is a way I buy of acid off the ill and regulating the book. Once I fix myself raw bum back the arm; once it is a cut, drizzly November in my cat; once I fix myself blindly pausing ere coffin bays, and bringing up the aft of every line I apt; and yea once my hypos get such an coke ace of me, that it asks a bad ana bed to ban me for idly stepping into the row, and always knocking get's hats off—then, I dun it bad age to get to sea as soon as I can. This is my sub for gat and bal. For a calm air Cato bugs himself per his epee; I still act to the ark. There is dud hasty in this. If ego but knew it, most all men in their PhD, any age or new, hug big say the but airs upons the sea for me.

  • Sample Output - U.S. Constitution

    Greg Kennedy05/08/2015 at 16:03 0 comments

    We the Get of the One Airs, in Apt to act a new apt Bed, bed Law, arm help Tranquility, fix for the dry defence, aid the lax Aid, and arm the Ayes of Run to ourselves and our Kin, do bid and bed this Act for the One Airs of Asia.

    Tax 1.

    Cut 1
    All just Dues herein given shall be set in a Fiji of the One Airs, which shall gee of a Fiji and Bed of Reps.

    Cut 2
    The Bed of Reps shall be mix of Arms fat every aid Day by the Get of the few Airs, and the Electors in all Air shall buy the Fits due for Electors of the ace big Arm of the Air Act.

    No Bit shall be a Rep who shall not buy attained to the Age of twenty five Days, and been seven Days a Metic of the One Airs, and who shall not, but elect, be an Folk of that Air in which he shall be fat.

    Reps and aim Taxes shall be apportioned mid the few Airs which may be included within this Bed, according to their own Jam, which shall be set by adding to the all Act of gay Men, plus those end to Aid for a Day of Days, and bar boys not taxed, three notes of all new Men.

    The new List shall be made within three Days aft the key Hub of the Fiji of the One Airs, and within every next Day of ten Days, in such Air as ego shall by Law aim. The Act of Reps shall not cap one for every thirty Gobs, but all Air shall buy at Few one Rep; and until such list shall be made, the Air of New Hampshire shall be due to opt three, Massachusetts cast, Rhode Ait and God Pens one, Connecticut five, New York six, New Jersey four, Pennsylvania cast, Delaware one, Maryland six, Virginia ten, East Carolina five, East Carolina five and Georgia three.

    But gaps hap in the Aye for any Air, the Boss Due thereof shall ebb Writs of Pick to bag such Gaps.

    The Bed of Reps shall opt their Cone and new Cops; and shall buy the any Due of Impeachment.

View all 6 project logs

Enjoy this project?

Share

Discussions

tlp wrote 10/04/2018 at 08:52 point

Made a very naive awk script to readably compress huge log files. Not funny, but useable

https://github.com/tibolpol/logshort

  Are you sure? yes | no

PointyOintment wrote 06/08/2015 at 17:49 point

Would it be possible to decompress based on which words make the most sense in context? Perhaps start by picking a decompression at random for each word, and then evolve the choices until they all make the greatest possible sense together. You could use Markov chains to evaluate sensicality.

  Are you sure? yes | no

Alex Rich wrote 06/05/2015 at 00:23 point

this is hilarious.  sort of reminds me of up goer 5.

  Are you sure? yes | no

Eric Hertz wrote 05/09/2015 at 02:20 point

Is it possible Genesis makes more sense after being run through your compressor/decompresser?

Awesome :)

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates