Twitter text compression challenge

Rules

  • Your program must have two modes: encoding and decoding.
  • When encoding:

  • Your program must take as input some human readable Latin1 text, presumably English.
  • It doesn't matter if you ignore punctuation marks.
  • You only need to worry about actual English words, not L337.
  • Any accented letters may be converted to simple ASCII.
  • You may choose how you want to deal with numbers.
  • 123
  • 1 2 3
  • one two three
  • one hundred twenty three
  • Your program must output a message which can be represented in 140 code points in the range U+0000 to U+10FFFF, excluding non-characters:
  • U+FFFE
  • U+FFFF
  • U+nFFFE, U+nFFFF where n is 1 to 10 hexadecimal
  • U+FDD0 to U+FDEF
  • U+D800 to U+DFFF (surrogate code points).
  • It may be output in any reasonable encoding of your choice; any encoding supported by GNU iconv will be considered reasonable, and your platform native encoding or locale encoding would likely be a good choice.

  • When decoding:

  • Your program should take as input the output of your encoding mode.
  • The text output should be an approximation of the input text.
  • The closer you can get to the original text, the better.
  • Doesn't need to have any punctuation.
  • The output text should be readable by a human, again presumably English.

  • Can be L337, or lol.
  • The decoding process may not have access to any output of the encoding process other than the output specified above; that is, you can't upload the text somewhere and output the URL for the decoding process to download, or anything silly like that.
  • For the sake of consistency in user interface, your program must behave as follows:
  • Your program must be a script that can be set to executable on a platform with the appropriate interpreter, or a program that can be compiled into an executable.
  • Your program must take as its first argument either encode or decode to set the mode.
  • Your program must take input in at least one of the following ways:
  • Take input from standard in and produce output on standard out.
  • my-program encode <input.txt >output.utf
  • my-program decode <output.utf >output.txt
  • Take input from a file named in the second argument, and produce output in the file named in the third.
  • my-program encode input.txt output.utf
  • my-program decode output.utf output.txt
  • For your solution, please post:
  • Your code, in full, and/or a link to it hosted elsewhere (if it's very long, or requires many files to compile, or something).
  • An explanation of how it works, if it's not immediately obvious from the code or if the code is long and people will be interested in a summary.
  • An example text, with the original text, the text it compresses down to, and the decoded text.
  • If you are building on an idea that someone else had, please attribute them. It's OK to try to do a refinement of someone else's idea, but you must attribute them.
  • The rules are a variation on the rules for the Twitter image encoding challenge.
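As a rough capacity check on the output rule above, the number of permitted code points can be counted directly from the listed exclusions. This is only arithmetic on the ranges in the rules; the variable names are made up for illustration.

```python
import math

TOTAL = 0x10FFFF + 1               # every code point U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # U+D800 .. U+DFFF
FDD0_BLOCK = 0xFDEF - 0xFDD0 + 1   # U+FDD0 .. U+FDEF
PLANE_ENDS = 2 * 17                # U+FFFE, U+FFFF and U+nFFFE, U+nFFFF for n = 1 .. 0x10

usable = TOTAL - SURROGATES - FDD0_BLOCK - PLANE_ENDS
bits = 140 * math.log2(usable)     # information capacity of a 140 code point message
whole_bytes = int(bits // 8)

print(usable, round(bits, 1), whole_bytes)   # 1111998 2811.9 351
```

So each code point carries a little over 20 bits, and a full message holds about 351 bytes of arbitrary data.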


    Not sure if I'll have the time/energy to follow this up with actual code, but here's my idea:

  • Any arbitrary Latin-1 string under a certain length could be simply encoded (not even compressed) with no loss into 140 characters. The naive estimate is 280 characters (two 8-bit characters per code point), although with the code point restrictions in the contest rules, it's probably a little shorter than that.
  • Strings slightly longer than the above length (let's guesstimate between 280 and 500 characters) can most likely be shrunk, using standard compression techniques, into a string short enough to allow the above encoding.
  • Anything longer than that, and we're starting to lose information in the text. So execute the minimum number of the following steps to reduce the string to a length that can then be compressed/encoded using the above methods. Also, don't perform these replacements on the entire string if just performing them on a substring will make it short enough (I would probably walk through the string backwards).

  • Replace all LATIN 1 characters above 127 (primarily accented letters and funky symbols) with their closest equivalent in non-accented alphabetic characters, or possibly with a generic symbol replacement like "#"
  • Replace all uppercase letters with their equivalent lowercase form
  • Replace all non-alphanumerics (any remaining symbols or punctuation marks) with a space
  • Replace all numbers with 0
  • Ok, so now we've eliminated as many excess characters as we can reasonably get rid of. Now we're going to do some more dramatic reductions:

  • Replace all double-letters (balloon) with a single letter (balon). Will look weird, but still hopefully decipherable by the reader.
  • Replace other common letter combinations with shorter equivalents (CK with K, WR with R, etc)
  • Ok, that's about as far as we can go and have the text be readable. Beyond this, let's see if we can come up with a method so that the text will resemble the original, even if it isn't ultimately decipherable (again, perform this one character at a time from the end of the string, and stop when it is short enough):

  • Replace all vowels (aeiouy) with a
  • Replace all "tall" letters (bdfhklt) with l
  • Replace all "short" letters (cmnrsvwxz) with n
  • Replace all "hanging" letters (gjpq) with p
  • This should leave us with a string consisting of exactly 5 possible values (a, l, n, p, and space), which should allow us to encode pretty lengthy strings.
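The staged reductions above can be sketched roughly as follows. This is a minimal illustration, not a tested entry: the function names, the `fits` callback (whatever "short enough" means for the chosen encoder), the choice of NFKD for accent stripping, and collapsing each run of digits to a single 0 are all assumptions. Only CK and WR are handled in the letter-combination stage, since those are the only pairs named.

```python
import re
import unicodedata

def to_ascii(s):
    """Replace characters above 127 with an unaccented equivalent, or '#'."""
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)
        else:
            plain = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode()
            out.append(plain or "#")
    return "".join(out)

# vowels -> a, tall letters -> l, short letters -> n, hanging letters -> p
SHAPES = str.maketrans("aeiouy" + "bdfhklt" + "cmnrsvwxz" + "gjpq",
                       "a" * 6 + "l" * 7 + "n" * 9 + "p" * 4)

STAGES = [
    to_ascii,
    str.lower,
    lambda s: re.sub(r"[^A-Za-z0-9]", " ", s),    # symbols and punctuation -> space
    lambda s: re.sub(r"[0-9]+", "0", s),          # assumption: a run of digits -> one 0
    lambda s: re.sub(r"([a-z])\1+", r"\1", s),    # balloon -> balon
    lambda s: s.replace("ck", "k").replace("wr", "r"),
    lambda s: s.translate(SHAPES),                # keep only the letter "shape"
]

def reduce_text(text, fits):
    """Apply as few stages as possible, each on as short a suffix as possible."""
    for stage in STAGES:
        if fits(text):
            return text
        # Walk backwards: rewrite a growing suffix until the result fits.
        for start in range(len(text) - 1, -1, -1):
            candidate = text[:start] + stage(text[start:])
            if fits(candidate):
                return candidate
        text = stage(text)   # the whole string needed this stage; try the next one
    while text and not fits(text):
        text = text[:-1]     # last resort: truncate
    return text

print(reduce_text("Hello, World 123", lambda s: len(s) <= 14))
```

Note that with the digit stage as written, a `0` can survive into the final stage, so the last alphabet may have six symbols rather than five unless digits are mapped to something else.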

    Beyond that, we'd simply have to truncate.
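To decide when a string is "short enough", the lossless packing from the first two steps has to be pinned down. One possible sketch: treat the bytes as one big integer and write it in base N, where N is the number of code points the rules allow. The one-byte flag (1 for raw, 2 for zlib), the use of zlib as the "standard compression technique", and the names `pack` and `unpack` are assumptions for illustration.

```python
import zlib
from bisect import bisect_left

def is_allowed(cp):
    """True if the contest rules permit this code point."""
    if 0xD800 <= cp <= 0xDFFF:        # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:        # non-character block
        return False
    if (cp & 0xFFFE) == 0xFFFE:       # U+nFFFE and U+nFFFF on every plane
        return False
    return True

ALPHABET = [cp for cp in range(0x110000) if is_allowed(cp)]   # sorted ascending
BASE = len(ALPHABET)

def pack(data: bytes) -> str:
    """Losslessly turn bytes into a string of allowed code points."""
    squeezed = zlib.compress(data, 9)
    # The leading flag byte also protects leading zero bytes in the data.
    payload = b"\x02" + squeezed if len(squeezed) < len(data) else b"\x01" + data
    n = int.from_bytes(payload, "big")
    digits = []
    while n:
        n, d = divmod(n, BASE)
        digits.append(chr(ALPHABET[d]))
    return "".join(reversed(digits))

def unpack(text: str) -> bytes:
    """Inverse of pack()."""
    n = 0
    for ch in text:
        n = n * BASE + bisect_left(ALPHABET, ord(ch))
    payload = n.to_bytes((n.bit_length() + 7) // 8, "big")
    body = payload[1:]
    return zlib.decompress(body) if payload[0] == 2 else body

message = "Naïve café, déjà vu".encode("latin-1")
assert unpack(pack(message)) == message
```

A candidate text fits when `len(pack(text.encode("latin-1"))) <= 140`. Under these assumptions, roughly 350 bytes of incompressible input still fit, and anything zlib can shrink below that fits too.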

    Only other technique I can think of would be to do dictionary-based encoding, for common words or groups of letters. This might give us some benefit for proper sentences, but probably not for arbitrary strings.


    Here is my variant for actual English.

    Each code point has something like 1,100,000 possible states. Well, that's a lot of space.

    So, we stem all the original text and get WordNet synsets from it. Numbers are cast into English names ("forty two"). The 1.1M states will allow us to hold the synset id (which can be between 0 and 82114), the position inside the synset (~10 variants, I suppose) and the synset type (which is one of four: noun, verb, adjective, adverb). We may even have enough space to store the original form of the word (like a verb tense id).

    The decoder just feeds the synsets to WordNet and retrieves the corresponding words.
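Packing several small fields into one code point is a mixed-radix encoding, and a quick arithmetic check shows how much room there is. The sketch below uses only the figures quoted above (82,115 synset ids, about 10 positions, 4 types); the helper names are hypothetical and no WordNet API is called. Treated as independent fields, id times position fits in one code point, but multiplying in the 4 types as well does not, so the type would have to be implied by the id or stored elsewhere.

```python
import math

# Code points allowed by the rules: all of U+0000..U+10FFFF minus
# surrogates (2048), U+FDD0..U+FDEF (32) and the 34 plane-end non-characters.
CODE_POINT_STATES = 0x110000 - 2048 - 32 - 34

def pack_fields(values, radices):
    """Combine fields into one integer; field i must be in range(radices[i])."""
    n = 0
    for value, radix in zip(values, radices):
        if not 0 <= value < radix:
            raise ValueError(f"{value} does not fit radix {radix}")
        n = n * radix + value
    return n

def unpack_fields(n, radices):
    """Inverse of pack_fields()."""
    values = []
    for radix in reversed(radices):
        n, value = divmod(n, radix)
        values.append(value)
    return tuple(reversed(values))

ID_AND_POSITION = (82115, 10)          # synset id, position inside the synset
WITH_TYPE = (82115, 10, 4)             # ... plus noun/verb/adjective/adverb

print(math.prod(ID_AND_POSITION) <= CODE_POINT_STATES)   # True
print(math.prod(WITH_TYPE) <= CODE_POINT_STATES)         # False

index = pack_fields((82114, 9), ID_AND_POSITION)
assert unpack_fields(index, ID_AND_POSITION) == (82114, 9)
```

The resulting index would then be mapped onto the allowed code points, skipping the excluded ranges.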

    Source text:

    A white dwarf is a small star composed mostly of electron-degenerate matter. Because a
    white dwarf's mass is comparable to that of the Sun and its volume is comparable to that 
    of the Earth, it is very dense.
    

    Becomes:

    A white dwarf be small star composed mostly electron degenerate matter because white
    dwarf mass be comparable sun IT volume be comparable earth IT be very dense
    

    (tested with Online WordNet). This "code" should take 27 code points. Of course, all "gibberish" like 'lol' and 'L33T' will be lost forever.


    PAQ8O10T << FTW
