ROT13 and Caesar Ciphers

THE QUICK BROWN FOX SAID I LOVE LUCY

GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL

ROT13 (short for “rotate 13 places”) is an obfuscation technique familiar to nerds, geeks, and computer programmers. It’s commonly used in online forums as a means of hiding or obscuring spoilers, punchlines, hints, and (sometimes) offensive content.

I’m hesitant to call it encryption because it’s so weak. It is a simple substitution cipher; in fact, it's weaker than a general one because, as described below, it's a Caesar Cipher, in which every character is shifted by the same fixed offset!

To execute ROT13 you take a letter and shift it along 13 places (and if you go over the end past ‘Z’, you wrap around again to ‘A’). It jumbles the letters up sufficiently such that, at first glance, you can’t read the message, and that, these days, is its only real purpose. If you attempt to use it to store/encrypt passwords or sensitive information you deserve to have your programming license revoked!

It’s such a popular technique that some text editors and news readers have ROT13 functionality built in!

As an example of how this works, the word “HELLO” gets converted by ROT13 into “URYYB”.

Traditionally, ROT13 is only applied to the letters ‘A-Z’ and ‘a-z’ so that case, numbers, and other punctuation are preserved.
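
The rule above is short enough to write out directly. Here is a minimal Python sketch (the function name `rot13` is my own choice for illustration): letters are rotated 13 places within their own case, and everything else passes through unchanged.

```python
import string


def rot13(text: str) -> str:
    """Rotate A-Z and a-z by 13 places; leave every other character alone."""
    result = []
    for ch in text:
        if ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + 13) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + 13) % 26 + ord("a")))
        else:
            result.append(ch)  # digits, spaces, and punctuation are preserved
    return "".join(result)


print(rot13("HELLO"))  # URYYB
```

The `% 26` is what performs the wrap-around from ‘Z’ back to ‘A’.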

When you were a kid (perhaps you still are one), you might have had fun using some kind of Spy code/de-code wheel. On these devices, the letters ‘A-Z’ are written on concentric disks which can be rotated to offset the alphabet.

Two ‘secret agents’ agree on the offset beforehand then, to encode a message, the desired letter is selected on the inner wheel, and the coded letter read on the outside wheel. The process is then inverted by the decoder to read the message. To use a decoding wheel for ROT13, simply rotate the wheel 13 places.

Why ROT13?

Because the English alphabet has 26 characters, ROT13 has the interesting property that it is self-inverting: applying it twice to a piece of text returns the original text. It is for this reason that ROT13 became so popular.

ROT13('HELLO') = 'URYYB'

ROT13('URYYB') = 'HELLO'

ROT13(ROT13('HELLO')) = 'HELLO'

If you are familiar with Boolean logic, this property is similar to that of the XOR operator: XORing a value twice with the same argument returns the original value.

To encode/de-code in ROT13 you only need one command, and you can't get it the wrong way round either!
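
Both properties are easy to check. The sketch below leans on Python's built-in `rot_13` text codec rather than a hand-written rotation; the helper names `rot13` and `xor_twice` are mine, purely for illustration.

```python
import codecs


def rot13(text: str) -> str:
    """Apply ROT13 using the codec that ships with Python."""
    return codecs.encode(text, "rot_13")


def xor_twice(value: int, key: int) -> int:
    """XOR with the same key twice; the second pass undoes the first."""
    return (value ^ key) ^ key


once = rot13("HELLO")
twice = rot13(once)
print(once, twice)

print(xor_twice(0b10110010, 0b01101100) == 0b10110010)
```

The same function encodes and decodes, which is exactly why there is no "wrong way round".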

ROT5

Similar to ROT13, which applies to letters, it’s possible to obfuscate numbers with a similar self-inverting rotation of five places.

43,252,003,274,489,856,000

98,707,558,729,934,301,555

There is a hybrid system which encodes text using ROT13, numbers using ROT5, and leaves all other characters unaffected.

ROT5 is subtle; the encoded numbers still look like ordinary numbers!
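
Here is a small sketch of ROT5 and of the hybrid scheme, assuming the hybrid simply applies ROT13 to letters and ROT5 to digits in one pass. The names `rot5` and `rot13_rot5` are hypothetical. ROT5 is self-inverting for the same reason ROT13 is: five is half of the ten digits.

```python
import codecs

DIGITS = "0123456789"


def rot5(text: str) -> str:
    """Rotate each digit 0-9 by five places; other characters are unchanged."""
    return "".join(
        DIGITS[(DIGITS.index(ch) + 5) % 10] if ch in DIGITS else ch
        for ch in text
    )


def rot13_rot5(text: str) -> str:
    """Hybrid: ROT13 on letters, ROT5 on digits, everything else untouched."""
    return rot5(codecs.encode(text, "rot_13"))


print(rot5("43,252,003,274,489,856,000"))  # 98,707,558,729,934,301,555
```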

ROT47

Another (less-popular) variant is ROT47 which shifts the 94 characters from ASCII 33 (which is the “!” directly after the space) to ASCII 126 “~”. This obfuscates letters, numbers, and punctuation characters but still keeps the output in 7-bit ‘safe’ printable ASCII.

Call the number (425)-555-1212, and ask for "Princess"

r2== E96 ?F>36C WcadX\ddd\`a`a[ 2?5 2D< 7@C Q!C:?46DDQ

ROT47 is far from subtle; it's pretty clear that the message above has been encoded.
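
ROT47 follows the same pattern over a bigger alphabet. In this sketch (the function name is my own), the 94 printable characters from ASCII 33 to 126 are rotated by 47 places, which is again exactly half the alphabet, so the function undoes itself.

```python
def rot47(text: str) -> str:
    """Rotate ASCII 33..126 by 47 places; spaces and other characters stay put."""
    out = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            out.append(chr(33 + (code - 33 + 47) % 94))
        else:
            out.append(ch)
    return "".join(out)


message = 'Call the number (425)-555-1212, and ask for "Princess"'
print(rot47(message))
print(rot47(rot47(message)) == message)
```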

Caesar Ciphers

THE QUICK BROWN FOX SAID I LOVE LUCY

GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL

Using ROT13 as anything more than an obfuscation technique has more security holes than a piece of Swiss cheese. A simple rotation cipher is called a Caesar Cipher, after Julius Caesar, who is documented as using this technique to 'protect' messages to his troops (he is documented as using ROT3, whilst his great-nephew Augustus used ROT1).

Once you know the technique used, it's fairly trivial (even by brute force, if you don't know the offset) to enumerate all possible versions and reveal the source message!

ROTn  Cipher
0     THEQUICKBROWNFOXSAIDILOVELUCY
+1    UIFRVJDLCSPXOGPYTBJEJMPWFMVDZ
+2    VJGSWKEMDTQYPHQZUCKFKNQXGNWEA
+3    WKHTXLFNEURZQIRAVDLGLORYHOXFB
+4    XLIUYMGOFVSARJSBWEMHMPSZIPYGC
+5    YMJVZNHPGWTBSKTCXFNINQTAJQZHD
+6    ZNKWAOIQHXUCTLUDYGOJORUBKRAIE
+7    AOLXBPJRIYVDUMVEZHPKPSVCLSBJF
+8    BPMYCQKSJZWEVNWFAIQLQTWDMTCKG
+9    CQNZDRLTKAXFWOXGBJRMRUXENUDLH
+10   DROAESMULBYGXPYHCKSNSVYFOVEMI
+11   ESPBFTNVMCZHYQZIDLTOTWZGPWFNJ
+12   FTQCGUOWNDAIZRAJEMUPUXAHQXGOK
+13   GURDHVPXOEBJASBKFNVQVYBIRYHPL

There are only 25 rotations to try by brute force!
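
To make the point concrete, here is a rough sketch of a general Caesar shift and the brute-force attack on it. The function names `caesar` and `brute_force` are my own. Decoding is just encoding with the negative offset, like turning the wheel back the other way.

```python
from string import ascii_uppercase as ALPHABET


def caesar(text: str, shift: int) -> str:
    """Shift A-Z by `shift` places (a negative shift decodes)."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )


def brute_force(ciphertext: str) -> dict:
    """Return every candidate plaintext, keyed by the shift that was undone."""
    return {shift: caesar(ciphertext, -shift) for shift in range(1, 26)}


for shift, candidate in brute_force("GURDHVPXOEBJASBKFNVQVYBIRYHPL").items():
    print(f"{shift:2d}  {candidate}")
```

A human (or a dictionary check) only has to scan 25 lines to spot the one that reads as English.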

Substitution Ciphers

Closely related to Caesar Ciphers are Substitution Ciphers. These still map 1:1 between each character in the source text and cipher text, but adjacent characters in the source do not have to map to adjacent ones in the destination.
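
As a sketch of the difference, the code below builds a general substitution key by shuffling the alphabet, so neighbouring source letters no longer map to neighbouring cipher letters. All the names here (`make_key`, `substitute`, `invert`) are mine, and the seeded shuffle is only an assumption to make the example repeatable.

```python
import math
import random
from string import ascii_uppercase as ALPHABET


def make_key(seed=None) -> dict:
    """Build a random 1:1 mapping from A-Z onto a shuffled alphabet."""
    rng = random.Random(seed)
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))


def substitute(text: str, key: dict) -> str:
    """Replace each letter using the key; other characters pass through."""
    return "".join(key.get(ch, ch) for ch in text.upper())


def invert(key: dict) -> dict:
    """Swap the direction of the mapping so it decodes instead of encodes."""
    return {cipher: plain for plain, cipher in key.items()}


key = make_key(seed=2015)
secret = substitute("THE QUICK BROWN FOX", key)
print(secret)
print(substitute(secret, invert(key)))
print(math.factorial(26))  # number of possible keys
```

Unlike a Caesar Cipher, the number of possible keys is far too large to enumerate, which is why the attacks described next rely on the structure of the language rather than on trying every key.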

If spaces are preserved in the encoding, it's easy to see where the word breaks are, and thus you can guess at what you think are the more popular words. As each character is always converted over to the same replacement character, common words (and commonly occurring groupings and patterns of letters) start to jump out of the page very quickly (especially if the message is quite long). Removing the white space between words adds a trivial level of complexity.

Substitution ciphers don't have to just use other letters. Symbols can be used. Two of the most well known examples of this are "The Dancing Men", from the famous Sherlock Holmes story, and the "PigPen Cipher" which uses fragments of grids and dots to represent the alphabet.

Solving Substitution Ciphers

With or without spaces, substitution ciphers are fairly easy to crack. There's a 1:1 mapping between characters so, once you know one conversion, you know all other occurrences of that same character (and you also know that this letter can't be used again).

These ciphers are so tractable that solving them is a hobby (like solving crossword puzzles, word searches, or Sudoku). These puzzles are called Cryptograms.

An example is shown on the left for the quote:

"Style and structure are the essence of a book; great ideas are hogwash." - Vladimir Nabokov

The strategy for solving cryptograms is a combination of brute force, heuristics, and letter/word frequency.

Not all letters in the English language are used equally. Some, like the letters 'E', 'T' and 'A', are used very frequently. Unless the message we are trying to decode is very obscure, we'd expect the distribution of symbols in the solution to follow a similar profile. This alone could give a first pass for decoding a message; we simply match the frequency of each symbol in the secret message against the frequency we expect for each letter.

From Wikipedia, here is the ordering of letters in the English language, from most to least frequent (taken from a corpus of many hundreds and thousands of documents):

ETAOINSHRDLUCMFWYPVBGKQJXZ

(You might also like an article I wrote a few years ago about the game of Hangman and letter distribution).

If our secret message is a representative sample of the entire English language, we'd expect the symbol representing "E" to be the most frequently occurring in our message, followed by "T", then "A" …
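
That first pass can be sketched in a few lines: rank the cipher symbols by how often they occur and pair them, rank for rank, with the ordering quoted above. The function name `first_pass_key` is hypothetical, and the result is only a starting guess, not a solution.

```python
from collections import Counter

# Frequency ordering of English letters, as quoted in the article.
ENGLISH_ORDER = "ETAOINSHRDLUCMFWYPVBGKQJXZ"


def first_pass_key(ciphertext: str) -> dict:
    """Guess a cipher->plain mapping by pairing up frequency ranks."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))


guess = first_pass_key("GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL")
print(guess)
```

On a message this short, many letters tie on count, so most of the guessed pairings will be wrong; the idea only starts to pay off on longer texts.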

This is far from perfect; the chances are our message is short, and so the letters might not follow this distribution perfectly (or have enough granularity, or even use all the letters of the alphabet). We can narrow down solutions using brute force and Chi-squared tests that compare the observed letter frequencies against the expected ones, but there is so much more we can do very easily.

If white spaces are present, we can apply knowledge of the words in the English language. We know that there are only a limited number of two, three and four letter words, and these words are common. Did you know that one third of all printed English material is made up of the 25 most common words? (The most popular 100 words make up approximately half of all printed English!)
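
Here is one way the Chi-squared idea could look when applied to the simplest case, a Caesar Cipher: try all 26 shifts and keep the one whose letter counts sit closest to English. The percentages in `ENGLISH_FREQ` are approximate, commonly quoted values that I am assuming for illustration; they are not the figures measured later in this article. All function names are my own.

```python
from string import ascii_uppercase as ALPHABET

# Approximate English letter percentages (assumed, commonly quoted values).
ENGLISH_FREQ = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3,
    "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4,
    "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0, "P": 1.9, "B": 1.5, "V": 1.0,
    "K": 0.8, "J": 0.15, "X": 0.15, "Q": 0.1, "Z": 0.07,
}
TOTAL = sum(ENGLISH_FREQ.values())


def shift_text(text: str, shift: int) -> str:
    """Caesar-shift A-Z by `shift` places; other characters pass through."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )


def chi_squared(text: str) -> float:
    """Sum of (observed - expected)^2 / expected over the 26 letters."""
    letters = [ch for ch in text.upper() if ch in ENGLISH_FREQ]
    if not letters:
        return float("inf")
    score = 0.0
    for letter, pct in ENGLISH_FREQ.items():
        expected = len(letters) * pct / TOTAL
        observed = letters.count(letter)
        score += (observed - expected) ** 2 / expected
    return score


def crack_caesar(ciphertext: str):
    """Return (shift, plaintext) for the shift that looks most like English."""
    best = min(range(26), key=lambda s: chi_squared(shift_text(ciphertext, -s)))
    return best, shift_text(ciphertext, -best)


print(crack_caesar("GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL"))
```

The lower the score, the closer the candidate is to the expected profile. As the paragraph above warns, a short message may not contain enough letters for the statistics to be reliable.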

the, of, and, a, to, in, is, you, that, it, he, was, for, on, are, as, with, his, they, I, at, be, this, have, from

Guessing which word could be which and corroborating this with what these symbols/letters would be like in the other words could be a great help.
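
One way to mechanise this guessing is to compare the repetition pattern of a cipher word with the patterns of common words, and to discard candidates that contradict letters already locked in. This is only a sketch using the 25 words listed above; the names `pattern` and `candidates` are hypothetical.

```python
COMMON_WORDS = [
    "THE", "OF", "AND", "A", "TO", "IN", "IS", "YOU", "THAT", "IT", "HE",
    "WAS", "FOR", "ON", "ARE", "AS", "WITH", "HIS", "THEY", "I", "AT", "BE",
    "THIS", "HAVE", "FROM",
]


def pattern(word: str) -> tuple:
    """Describe which positions repeat, e.g. 'THAT' -> (0, 1, 2, 0)."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word.upper())


def candidates(cipher_word: str, known=None) -> list:
    """Common words that fit the cipher word's shape and any known letters.

    `known` maps cipher letters to plaintext letters already worked out.
    """
    known = known or {}
    cipher_word = cipher_word.upper()
    matches = []
    for word in COMMON_WORDS:
        if pattern(word) != pattern(cipher_word):
            continue  # different length or different repetition structure
        if all(known.get(c, p) == p for c, p in zip(cipher_word, word)):
            matches.append(word)
    return matches


print(candidates("GUNG"))
print(candidates("GUR", known={"G": "T"}))
```

This sketch does not check that two different cipher letters are never assigned the same plaintext letter; a fuller solver would enforce that too.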

If there is no white space to give word breaks, we can still apply statistical techniques. Certain combinations of letters often occur together. It's very common to have "TH" next to each other and "ER" and "RE". Certain letters often occur in double form, like "OO", "EE" and "LL". Conversely, have you ever seen a word containing "JJ"*?

We're taught at an early age that it's very common for "Q" to be followed by "U". Although "Q" is not a popular letter, if we do identify one, there's a very good chance the letter after it is a "U". There are similar rules with other letters.

Most (not all) words contain at least one vowel (AEIOU), and if you include "Y" as a vowel you cover practically all words. The more letters you lock in, the easier it gets to solve the rest (both because you have partial words to complete, and the unused letter pool is smaller).

*I can only think of: HAJJ, HAJJES, HAJJI, HAJJIS

Let's take a look

I was curious about letter distribution, so I downloaded a dozen books in plain text from Project Gutenberg. This site has 50,000 free books available!

If you are interested, the books I selected (randomly from the fiction collection) were:

  1. 20,000 Leagues under the Sea, Jules Verne

  2. A Tale of Two Cities, Charles Dickens

  3. Around the World in 80 Days, Jules Verne

  4. Little Women, Louisa May Alcott

  5. Alice's Adventures in Wonderland, Lewis Carroll

  6. Anna Karenina, Leo Tolstoy

  7. The Arabian Nights, Sir Richard Burton

  8. The Canterbury Tales, Geoffrey Chaucer

  9. The Journey to the Center of the Earth, Jules Verne

  10. The Wonderful Wizard of Oz, L. Frank Baum

  11. War and Peace, Leo Tolstoy

  12. Ben-Hur, Lew Wallace

Obviously, the more books you sample the more refined your distribution will become for generic solving. Alternatively, if you have some idea of the context of your secret message, you might elect to sample a more specific set of books to more accurately represent the sample you have.

Single Letter Frequency

Based on the books above, here is the single letter frequency distribution. Each percentage is that letter's share of all the letters in these books (approximately 9 million letters).
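
A count like this takes only a few lines. The sketch below is not the code used for the article; `letter_percentages` and `corpus_percentages` are hypothetical helpers, and the folder of plain-text books is an assumption.

```python
from collections import Counter
from pathlib import Path


def letter_percentages(text: str) -> dict:
    """Percentage share of each letter A-Z, most frequent first."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    total = sum(counts.values())
    return {letter: 100 * n / total for letter, n in counts.most_common()}


def corpus_percentages(folder: str) -> dict:
    """Pool every *.txt file in `folder` (a hypothetical directory of books)."""
    text = "".join(
        path.read_text(encoding="utf-8", errors="ignore")
        for path in Path(folder).glob("*.txt")
    )
    return letter_percentages(text)


for letter, pct in letter_percentages("Style and structure are the essence of a book").items():
    print(f"{letter}  {pct:.2f}%")
```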

Here is the same data plotted in sorted order. The ordering is slightly different to the answer given by Wikipedia, but that's because we're using different samples.

Bigrams

Next I looked at the frequency of all bigrams (also called digrams, couplets, or adjacent pairs of letters).

To generate this list I ignored any white space and punctuation characters. So, in addition to all the pairs of letters that occur next to each other inside words, the list also contains pairs formed by the last letter of one word and the first letter of the next. This will help if your secret message does not contain the white space that would let you determine where the word breaks are.
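
The counting rule described above can be sketched as follows: strip everything except letters, then slide a window of length n along what is left. The function name `ngram_counts` is my own, and the same function works for the longer n-grams further down.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count n-letter sequences after removing spaces and punctuation."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


counts = ngram_counts("The cat sat on the mat.", 2)
for bigram, count in counts.most_common(5):
    print(bigram, count)
```

Because the spaces are removed first, pairs that straddle a word boundary (such as "EC" from "the cat") are counted alongside pairs inside words.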

Here are the most frequently occurring top 50 adjacent pairs of letters:

Interestingly, even though there are 26 characters, the total number of bigrams in my sample is not 26 × 26 (=676). Instead there are 643 distinct items. Not every possible pairing of characters occurs (for instance the pairing "QZ" or "ZX" never occurred in the books I sampled).

As expected, the frequencies of "TH" and "HE" dominate. These bigrams appear inside many words, including the most common ones.

Note - This is great data to use if you have no information about either of the characters. However, if you know one of the characters in the pair, you can use this information to find a conditional probability. For instance, if you know a pair is "Q?", then there is a 99.9% chance that the unknown character is a "U".

Here are the top 200 bigrams in tabular form:
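
A conditional probability of this kind can be estimated straight from the text: look only at the pairs that start with the known letter and see what follows. This is a minimal sketch with a hypothetical function name, and the tiny sample string is made up for illustration, so its numbers say nothing about real English.

```python
from collections import Counter


def next_letter_probabilities(text: str, first: str) -> dict:
    """Estimate P(second letter | first letter) from adjacent pairs in `text`."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    followers = Counter(b for a, b in zip(letters, letters[1:]) if a == first)
    total = sum(followers.values())
    return {b: n / total for b, n in followers.most_common()}


sample = "The quick queen quietly quit the quiz."
print(next_letter_probabilities(sample, "Q"))
print(next_letter_probabilities(sample, "T"))
```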

#     2-gram  %
#1    TH      3.322%
#2    HE      3.108%
#3    AN      1.838%
#4    ER      1.820%
#5    IN      1.801%
#6    ND      1.434%
#7    RE      1.373%
#8    ED      1.244%
#9    ES      1.220%
#10   HA      1.163%
#11   TO      1.101%
#12   EN      1.096%
#13   EA      1.071%
#14   AT      1.062%
#15   HI      1.046%
#16   ON      1.042%
#17   ST      1.011%
#18   OU      1.000%
#19   NT      0.988%
#20   NG      0.949%
#21   AS      0.909%
#22   IT      0.899%
#23   IS      0.881%
#24   ET      0.844%
#25   OR      0.832%
#26   TE      0.797%
#27   SE      0.767%
#28   OF      0.746%
#29   AR      0.735%
#30   TI      0.719%
#31   LE      0.706%
#32   SA      0.690%
#33   VE      0.637%
#34   NE      0.636%
#35   AL      0.629%
#36   ME      0.625%
#37   RO      0.608%
#38   NO      0.598%
#39   SH      0.592%
#40   OT      0.589%
#41   DE      0.588%
#42   EL      0.578%
#43   TA      0.564%
#44   LL      0.561%
#45   TT      0.560%
#46   SO      0.546%
#47   RI      0.543%
#48   DT      0.538%
#49   HO      0.536%
#50   WA      0.531%
#51   SS      0.506%
#52   RA      0.500%
#53   EW      0.496%
#54   EE      0.492%
#55   WH      0.490%
#56   SI      0.478%
#57   OM      0.477%
#58   DI      0.473%
#59   BE      0.467%
#60   DA      0.464%
#61   AD      0.461%
#62   MA      0.453%
#63   EC      0.450%
#64   EM      0.446%
#65   WI      0.441%
#66   CH      0.440%
#67   CO      0.438%
#68   CE      0.437%
#69   UT      0.436%
#70   OW      0.435%
#71   RT      0.432%
#72   LI      0.431%
#73   NA      0.415%
#74   LA      0.401%
#75   FO      0.400%
#76   RS      0.397%
#77   EI      0.389%
#78   AI      0.382%
#79   UR      0.379%
#80   LO      0.379%
#81   WE      0.378%
#82   DO      0.371%
#83   LY      0.369%
#84   IM      0.366%
#85   IL      0.365%
#86   US      0.362%
#87   GH      0.357%
#88   EH      0.353%
#89   ID      0.350%
#90   NS      0.349%
#91   FT      0.346%
#92   OO      0.344%
#93   IC      0.342%
#94   TS      0.331%
#95   UN      0.330%
#96   EF      0.322%
#97   EO      0.321%
#98   HT      0.321%
#99   YO      0.320%
#100  EP      0.316%
#101  DS      0.311%
#102  PE      0.310%
#103  NI      0.307%
#104  NC      0.303%
#105  OS      0.303%
#106  AC      0.301%
#107  LD      0.296%
#108  CA      0.286%
#109  MO      0.284%
#110  UL      0.279%
#111  OL      0.275%
#112  DH      0.273%
#113  IO      0.269%
#114  KE      0.268%
#115  TR      0.265%
#116  IE      0.264%
#117  IR      0.263%
#118  EV      0.263%
#119  AM      0.252%
#120  TW      0.250%
#121  FA      0.249%
#122  GE      0.248%
#123  AY      0.244%
#124  GA      0.243%
#125  PR      0.239%
#126  EY      0.233%
#127  WO      0.232%
#128  SW      0.229%
#129  PA      0.229%
#130  MI      0.228%
#131  RY      0.220%
#132  GO      0.218%
#133  EB      0.215%
#134  FI      0.210%
#135  YA      0.207%
#136  AV      0.206%
#137  BU      0.206%
#138  RD      0.205%
#139  YT      0.204%
#140  PO      0.203%
#141  SP      0.200%
#142  IG      0.200%
#143  OV      0.200%
#144  FE      0.198%
#145  FR      0.190%
#146  DW      0.188%
#147  SU      0.186%
#148  EG      0.183%
#149  AP      0.182%
#150  NH      0.181%
#151  DR      0.177%
#152  DB      0.177%
#153  AB      0.176%
#154  YS      0.176%
#155  OD      0.174%
#156  TU      0.173%
#157  VI      0.173%
#158  GT      0.171%
#159  TL      0.170%
#160  SC      0.165%
#161  PL      0.164%
#162  LT      0.164%
#163  IF      0.163%
#164  TY      0.159%
#165  AG      0.159%
#166  RR      0.159%
#167  YE      0.159%
#168  MY      0.156%
#169  BO      0.156%
#170  KI      0.149%
#171  BL      0.148%
#172  CT      0.147%
#173  OP      0.146%
#174  GI      0.146%
#175  DN      0.146%
#176  UG      0.145%
#177  OH      0.142%
#178  GR      0.142%
#179  RM      0.141%
#180  UP      0.140%
#181  RN      0.140%
#182  OK      0.139%
#183  IV      0.137%
#184  SM      0.136%
#185  IA      0.135%
#186  OA      0.132%
#187  RH      0.130%
#188  DD      0.126%
#189  PI      0.125%
#190  OI      0.124%
#191  AW      0.123%
#192  SL      0.123%
#193  EX      0.121%
#194  SB      0.119%
#195  MP      0.119%
#196  NW      0.119%
#197  DM      0.118%
#198  BA      0.118%
#199  AK      0.118%
#200  SF      0.116%

Trigrams

The next logical expansion is to look at trigrams (sequences of three letters).

There are 9,671 distinct trigrams in my sample (cf. 26 × 26 × 26 = 17,576 possible).

Again, seeing "THE" at the top is no surprise, and neither is "AND". These are both popular words in their own right, and sub-strings of other words. "ING" comes next as the suffix for many verbs, followed by many other triplets you can find inside common words.

Here are the top 200 trigrams in tabular form:

#     3-gram  %
#1    THE     2.049%
#2    AND     1.097%
#3    ING     0.758%
#4    HER     0.615%
#5    THA     0.449%
#6    ERE     0.420%
#7    HIS     0.416%
#8    HAT     0.412%
#9    ETH     0.348%
#10   DTH     0.341%
#11   ENT     0.326%
#12   NTH     0.323%
#13   THI     0.306%
#14   FOR     0.304%
#15   OTH     0.303%
#16   ITH     0.302%
#17   WAS     0.300%
#18   HES     0.297%
#19   SHE     0.285%
#20   WIT     0.271%
#21   TTH     0.270%
#22   INT     0.249%
#23   EAN     0.246%
#24   FTH     0.243%
#25   ALL     0.241%
#26   TER     0.240%
#27   OFT     0.240%
#28   VER     0.237%
#29   NOT     0.232%
#30   EDT     0.232%
#31   YOU     0.227%
#32   EST     0.223%
#33   ERS     0.216%
#34   GHT     0.215%
#35   ION     0.212%
#36   STH     0.204%
#37   REA     0.202%
#38   HIM     0.199%
#39   ESS     0.199%
#40   SAN     0.197%
#41   NDT     0.192%
#42   HAD     0.191%
#43   EAR     0.189%
#44   RTH     0.184%
#45   RES     0.183%
#46   HEM     0.182%
#47   ONE     0.180%
#48   HEN     0.180%
#49   EDA     0.179%
#50   HEW     0.179%
#51   NCE     0.178%
#52   HOU     0.177%
#53   EVE     0.175%
#54   AST     0.174%
#55   ATT     0.172%
#56   OME     0.172%
#57   ONT     0.171%
#58   OUT     0.171%
#59   HIN     0.170%
#60   MAN     0.170%
#61   TIN     0.170%
#62   NGT     0.168%
#63   HEA     0.167%
#64   STO     0.167%
#65   HEC     0.165%
#66   ATI     0.164%
#67   THO     0.162%
#68   BUT     0.161%
#69   ESA     0.161%
#70   ATH     0.160%
#71   TAN     0.160%
#72   HAN     0.156%
#73   DIN     0.155%
#74   TIO     0.154%
#75   HED     0.153%
#76   ERA     0.152%
#77   AVE     0.152%
#78   EOF     0.152%
#79   NDS     0.151%
#80   TOT     0.151%
#81   RIN     0.151%
#82   DTO     0.150%
#83   OUL     0.150%
#84   ERT     0.148%
#85   TED     0.146%
#86   RED     0.146%
#87   NDE     0.145%
#88   OUN     0.143%
#89   IGH     0.143%
#90   RAN     0.142%
#91   WHI     0.142%
#92   ORE     0.142%
#93   OUR     0.141%
#94   EWA     0.141%
#95   ORT     0.141%
#96   ETO     0.140%
#97   ILL     0.140%
#98   DAN     0.140%
#99   NTO     0.138%
#100  EDI     0.137%
#101  ANT     0.136%
#102  WER     0.136%
#103  ULD     0.135%
#104  ATE     0.134%
#105  AID     0.134%
#106  YTH     0.133%
#107  SOF     0.133%
#108  ICH     0.132%
#109  STA     0.130%
#110  ECO     0.130%
#111  WHE     0.128%
#112  HEH     0.128%
#113  ARE     0.127%
#114  AIN     0.126%
#115  UGH     0.125%
#116  EIN     0.125%
#117  EAS     0.124%
#118  SAI     0.124%
#119  ONS     0.123%
#120  IST     0.122%
#121  OVE     0.122%
#122  EHA     0.120%
#123  OUS     0.120%
#124  NDI     0.119%
#125  SIN     0.119%
#126  ERI     0.117%
#127  CON     0.117%
#128  STE     0.116%
#129  MEN     0.116%
#130  UND     0.116%
#131  DER     0.116%
#132  NIN     0.116%
#133  SHA     0.115%
#134  NDA     0.115%
#135  NGA     0.115%
#136  EAT     0.115%
#137  HEL     0.115%
#138  RET     0.114%
#139  ASS     0.114%
#140  ISH     0.113%
#141  TOF     0.113%
#142  COM     0.113%
#143  EEN     0.112%
#144  HEP     0.112%
#145  HTH     0.112%
#146  HET     0.111%
#147  NOW     0.108%
#148  HEY     0.108%
#149  EDH     0.107%
#150  ROM     0.107%
#151  FRO     0.107%
#152  EHE     0.107%
#153  ESE     0.106%
#154  DHE     0.106%
#155  ELL     0.106%
#156  EFO     0.106%
#157  NED     0.105%
#158  GTH     0.105%
#159  LEA     0.105%
#160  HAV     0.104%
#161  KIN     0.104%
#162  WHO     0.104%
#163  COU     0.104%
#164  ART     0.103%
#165  NTE     0.102%
#166  HEI     0.102%
#167  ENE     0.101%
#168  HEF     0.101%
#169  ESO     0.101%
#170  SEL     0.100%
#171  DNO     0.100%
#172  OUG     0.100%
#173  IVE     0.099%
#174  EDO     0.099%
#175  WHA     0.099%
#176  AME     0.098%
#177  HEE     0.098%
#178  HIC     0.098%
#179  STI     0.096%
#180  INE     0.096%
#181  EAD     0.096%
#182  EME     0.096%
#183  ERO     0.096%
#184  DHI     0.095%
#185  EMA     0.095%
#186  STR     0.094%
#187  NDH     0.094%
#188  SSI     0.094%
#189  ERY     0.094%
#190  BLE     0.093%
#191  CHA     0.093%
#192  OOK     0.093%
#193  INA     0.092%
#194  SHO     0.092%
#195  TOH     0.091%
#196  NAN     0.091%
#197  IDE     0.091%
#198  OSE     0.090%
#199  DRE     0.089%
#200  IND     0.089%

Quadgrams (or is it Tetragrams?)

After three, comes four. I'm not sure if it's correct to call them quadgrams or tetragrams (Latin or Greek?), so instead we'll just call them n-grams or 4-grams.

There were 87,526 4-grams in my book samples (cf. 26 × 26 × 26 × 26 = 456,976 possible; less than 20% of the theoretical possible combinations).

Here are the top 200:

#     4-gram  %
#1    THER    0.325%
#2    THAT    0.302%
#3    WITH    0.256%
#4    DTHE    0.253%
#5    NTHE    0.250%
#6    OTHE    0.219%
#7    OFTH    0.217%
#8    FTHE    0.206%
#9    THES    0.203%
#10   TTHE    0.192%
#11   HERE    0.189%
#12   EAND    0.183%
#13   ETHE    0.177%
#14   ANDT    0.164%
#15   THEM    0.162%
#16   SAND    0.161%
#17   TION    0.151%
#18   INGT    0.144%
#19   NDTH    0.143%
#20   THIS    0.139%
#21   OULD    0.134%
#22   INTH    0.132%
#23   THEC    0.132%
#24   STHE    0.130%
#25   TOTH    0.129%
#26   ANDS    0.129%
#27   EDTH    0.129%
#28   IGHT    0.122%
#29   THIN    0.118%
#30   SAID    0.118%
#31   EVER    0.114%
#32   ATTH    0.111%
#33   RTHE    0.110%
#34   THOU    0.110%
#35   WERE    0.109%
#36   THEY    0.106%
#37   HING    0.106%
#38   DAND    0.105%
#39   NGTH    0.103%
#40   TAND    0.103%
#41   THEP    0.101%
#42   INGA    0.099%
#43   OUGH    0.095%
#44   EDTO    0.095%
#45   THEW    0.094%
#46   THEN    0.094%
#47   EWAS    0.094%
#48   ONTH    0.093%
#49   HICH    0.092%
#50   FROM    0.092%
#51   WHIC    0.092%
#52   HAVE    0.090%
#53   WHAT    0.090%
#54   ANDA    0.090%
#55   EFOR    0.086%
#56   THEF    0.084%
#57   HTHE    0.084%
#58   UGHT    0.083%
#59   TING    0.083%
#60   KING    0.082%
#61   ATHE    0.081%
#62   ANDW    0.081%
#63   ERTH    0.081%
#64   THEI    0.080%
#65   ANDH    0.080%
#66   HEWA    0.078%
#67   DNOT    0.078%
#68   RAND    0.077%
#69   VERY    0.077%
#70   THEE    0.075%
#71   THET    0.075%
#72   FORT    0.075%
#73   ANDI    0.075%
#74   GTHE    0.075%
#75   THED    0.075%
#76   HEHA    0.074%
#77   THEL    0.074%
#78   YTHE    0.073%
#79   HAND    0.072%
#80   HESA    0.071%
#81   HECO    0.071%
#82   YAND    0.071%
#83   EHAD    0.071%
#84   ORTH    0.071%
#85   INGH    0.070%
#86   SELF    0.070%
#87   WHEN    0.069%
#88   ERED    0.069%
#89   THEB    0.069%
#90   THEH    0.067%
#91   MENT    0.067%
#92   NAND    0.067%
#93   EDAN    0.066%
#94   OUND    0.066%
#95   SOME    0.065%
#96   NDER    0.065%
#97   NING    0.065%
#98   HERS    0.064%
#99   HATH    0.063%
#100  TWAS    0.063%
#101  ATIO    0.063%
#102  RING    0.063%
#103  INGS    0.062%
#104  INGO    0.061%
#105  OVER    0.061%
#106  HATT    0.060%
#107  ETHA    0.059%
#108  WOUL    0.059%
#109  ENTH    0.059%
#110  THAN    0.058%
#111  ERAN    0.058%
#112  EDHI    0.058%
#113  LOOK    0.058%
#114  THTH    0.056%
#115  DWIT    0.056%
#116  HATI    0.056%
#117  HEAR    0.056%
#118  ITHA    0.055%
#119  EOFT    0.055%
#120  THEA    0.055%
#121  THEG    0.055%
#122  NGTO    0.055%
#123  INCE    0.054%
#124  ASTH    0.054%
#125  HEIR    0.054%
#126  WILL    0.054%
#127  BEEN    0.053%
#128  FORE    0.053%
#129  MTHE    0.053%
#130  INGI    0.053%
#131  NOTH    0.052%
#132  LING    0.052%
#133  MAND    0.052%
#134  INTO    0.051%
#135  STAN    0.051%
#136  THEO    0.051%
#137  LLTH    0.051%
#138  RETH    0.051%
#139  EDIN    0.051%
#140  HESE    0.051%
#141  HERA    0.051%
#142  DING    0.050%
#143  HOUG    0.050%
#144  ETHI    0.050%
#145  ANDR    0.050%
#146  TOHI    0.049%
#147  DTHA    0.049%
#148  TTER    0.049%
#149  ANCE    0.049%
#150  KNOW    0.049%
#151  TIME    0.049%
#152  REAT    0.048%
#153  SWER    0.048%
#154  COUL    0.048%
#155  UNDE    0.048%
#156  LIKE    0.048%
#157  HEMA    0.047%
#158  SOFT    0.047%
#159  YOUR    0.047%
#160  ITHT    0.047%
#161  PRIN    0.047%
#162  NESS    0.047%
#163  EREA    0.047%
#164  LTHE    0.047%
#165  RINC    0.046%
#166  NHIS    0.046%
#167  WASA    0.046%
#168  DHIS    0.046%
#169  RESS    0.046%
#170  IONS    0.045%
#171  DHER    0.045%
#172  LAND    0.045%
#173  NDIN    0.045%
#174  DHIM    0.044%
#175  MORE    0.044%
#176  ERIN    0.044%
#177  ABLE    0.044%
#178  ESAI    0.044%
#179  ERES    0.044%
#180  ENCE    0.044%
#181  ESAN    0.044%
#182  OUNT    0.043%
#183  TTLE    0.043%
#184  HATS    0.043%
#185  COME    0.043%
#186  HEST    0.043%
#187  LONG    0.042%
#188  PRES    0.042%
#189  UTTH    0.042%
#190  EYOU    0.042%
#191  WHER    0.042%
#192  TOBE    0.042%
#193  ABOU    0.041%
#194  METH    0.041%
#195  EWIT    0.041%
#196  HERO    0.041%
#197  HIMS    0.041%
#198  NDRE    0.041%
#199  NDHE    0.041%
#200  OMTH    0.041%

Things get a little more complicated as we move to four characters. Top of the list is "THER", some of which could be from the word "THE" followed by a word starting with "R", but most of the frequency of "THER" comes from its being part of words like "THERE" and "OTHER" (and all those other words that contain this sub-string).

Looking through the list, it is easy to see distinct, popular four-character words in their own right as well as the sub-strings.

5-grams

There were 434,396 5-grams (cf. 26 × 26 × 26 × 26 × 26 = 11,881,376 possible; less than 4% of the theoretical possible combinations).

Here are the top 200:

#     5-gram  %
#1    OFTHE   0.190%
#2    ANDTH   0.122%
#3    TOTHE   0.116%
#4    INTHE   0.112%
#5    THERE   0.108%
#6    NDTHE   0.106%
#7    EDTHE   0.097%
#8    WHICH   0.092%
#9    ATTHE   0.090%
#10   OTHER   0.090%
#11   INGTH   0.085%
#12   THING   0.081%
#13   ONTHE   0.075%
#14   NGTHE   0.074%
#15   OUGHT   0.064%
#16   ATION   0.063%
#17   WOULD   0.059%
#18   EDAND   0.056%
#19   THECO   0.055%
#20   DWITH   0.055%
#21   THEIR   0.053%
#22   HEHAD   0.053%
#23   INGTO   0.052%
#24   EOFTH   0.052%
#25   HEWAS   0.051%
#26   FORTH   0.051%
#27   ERTHE   0.051%
#28   THOUG   0.050%
#29   HOUGH   0.049%
#30   HATTH   0.049%
#31   COULD   0.048%
#32   THATT   0.048%
#33   EVERY   0.048%
#34   ERAND   0.047%
#35   THTHE   0.047%
#36   WITHA   0.046%
#37   DTHAT   0.046%
#38   WITHT   0.046%
#39   THESE   0.046%
#40   ETHAT   0.043%
#41   PRINC   0.043%
#42   ITHTH   0.043%
#43   THATH   0.043%
#44   ORTHE   0.043%
#45   ESAID   0.042%
#46   THEMA   0.042%
#47   THATI   0.042%
#48   ENTHE   0.042%
#49   RINCE   0.042%
#50   EFORE   0.041%
#51   ABOUT   0.040%
#52   ESAND   0.040%
#53   ATHER   0.040%
#54   SOFTH   0.039%
#55   ITWAS   0.039%
#56   ASTHE   0.039%
#57   AFTER   0.038%
#58   SWERE   0.038%
#59   UNDER   0.038%
#60   EWITH   0.038%
#61   WHERE   0.037%
#62   WITHH   0.037%
#63   FROMT   0.037%
#64   ALLTH   0.036%
#65   ETHER   0.036%
#66   LLTHE   0.036%
#67   INGAN   0.036%
#68   ANDHE   0.036%
#69   OMTHE   0.035%
#70   ROMTH   0.035%
#71   THEMO   0.035%
#72   HATHE   0.035%
#73   EDWIT   0.035%
#74   AGAIN   0.034%
#75   NEVER   0.034%
#76   INGHI   0.034%
#77   CEAND   0.034%
#78   BEFOR   0.034%
#79   THEPR   0.034%
#80   TTHAT   0.033%
#81   NGAND   0.033%
#82   THATS   0.033%
#83   OULDN   0.033%
#84   RETHE   0.033%
#85   TOFTH   0.033%
#86   SAIDT   0.032%
#87   THERS   0.032%
#88   COUNT   0.032%
#89   TIONS   0.032%
#90   STAND   0.032%
#91   EDHIM   0.032%
#92   UTTHE   0.032%
#93   HIMSE   0.032%
#94   OFHIS   0.032%
#95   MSELF   0.031%
#96   NTOTH   0.031%
#97   THEWA   0.031%
#98   STHAT   0.031%
#99   ITTLE   0.031%
#100  IMSEL   0.031%
#101  NTHES   0.031%
#102  LITTL   0.031%
#103  INGIN   0.031%
#104  HECOU   0.030%
#105  ROUGH   0.030%
#106  THESA   0.030%
#107  ANDIN   0.030%
#108  BYTHE   0.030%
#109  RIGHT   0.029%
#110  ANDRE   0.029%
#111  HESAI   0.029%
#112  THECA   0.029%
#113  THEST   0.029%
#114  THERO   0.029%
#115  LIGHT   0.029%
#116  TOHIM   0.028%
#117  CTION   0.028%
#118  THERA   0.028%
#119  WITHO   0.028%
#120  EANDT   0.028%
#121  IONOF   0.028%
#122  IDNOT   0.028%
#123  HADBE   0.027%
#124  HEREW   0.027%
#125  INGOF   0.027%
#126  ITHOU   0.027%
#127  SANDT   0.027%
#128  ETHIN   0.027%
#129  ULDNO   0.027%
#130  GREAT   0.027%
#131  ROUND   0.027%
#132  DIDNO   0.027%
#133  HOULD   0.027%
#134  SHOUL   0.027%
#135  EDHER   0.026%
#136  LDNOT   0.026%
#137  OTHIN   0.026%
#138  THOUT   0.026%
#139  THEWO   0.026%
#140  HEREA   0.026%
#141  EDHIS   0.025%
#142  DTHES   0.025%
#143  INHIS   0.025%
#144  INGHE   0.025%
#145  SWITH   0.025%
#146  DTHEM   0.025%
#147  NTHAT   0.025%
#148  ANDWH   0.025%
#149  YTHIN   0.024%
#150  THELA   0.024%
#151  HENTH   0.024%
#152  THATW   0.024%
#153  DBEEN   0.024%
#154  NOTHE   0.024%
#155  SOMET   0.024%
#156  FIRST   0.024%
#157  TWITH   0.024%
#158  NSWER   0.024%
#159  ADBEE   0.023%
#160  THEFI   0.023%
#161  ITHHI   0.023%
#162  PRESS   0.023%
#163  FTHES   0.023%
#164  WASTH   0.023%
#165  HERAN   0.023%
#166  STILL   0.023%
#167  AKING   0.023%
#168  LOOKE   0.023%
#169  BUTTH   0.023%
#170  ASKED   0.023%
#171  TIONO   0.023%
#172  OOKED   0.023%
#173  SHEHA   0.023%
#174  TOHER   0.022%
#175  ANDSO   0.022%
#176  ESTHE   0.022%
#177  URNED   0.022%
#178  THEHA   0.022%
#179  ANDSA   0.022%
#180  OMETH   0.022%
#181  ERING   0.022%
#182  IERRE   0.022%
#183  PIERR   0.022%
#184  THINK   0.022%
#185  DTHER   0.022%
#186  INTOT   0.022%
#187  ANSWE   0.022%
#188  SHEWA   0.022%
#189  PLACE   0.022%
#190  NOTHI   0.022%
#191  HIMAN   0.022%
#192  NCEAN   0.021%
#193  TTING   0.021%
#194  ONAND   0.021%
#195  THEYW   0.021%
#196  TTHES   0.021%
#197  WHILE   0.021%
#198  TURNE   0.021%
#199  SSION   0.021%
#200  WASNO   0.021%

Things get even more interesting here. "OFTHE", "ANDTH", "TOTHE" and "INTHE" at the top are all obvious concatenations of two words. "THERE" comes next.

As the n-grams become longer, it's possible to start seeing more distinct (and specific) words. As I was testing the code out with smaller books, pausing to view the intermediate results, it was possible to identify sub-strings of the names of title characters and of specific nouns in the books.

This shows us that, unless we're aiming to decode a message with a defined dictionary of possible words, going too deep into n-gram analysis will start to hurt us. Up to about 4-grams, we're mapping the characteristics of the English language. Above 4-grams, it's looking like we are starting to map more to words than distributions of groupings of letters.

6-grams

The error of going too deep into n-grams is confirmed by looking at this list. It doesn't take long to see specific words that obviously belong to one specific book.

There were 1,239,584 6-grams (cf. 26 × 26 × 26 × 26 × 26 × 26 = 308,915,776 possible; about 0.4% of the theoretical possible combinations).
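
The shrinking coverage is simple arithmetic on the counts quoted in each section: distinct n-grams seen, divided by the 26^n that are theoretically possible. A quick sketch (the dictionary and function names are mine):

```python
# Distinct n-grams observed in the sample, as quoted in the sections above.
DISTINCT = {2: 643, 3: 9_671, 4: 87_526, 5: 434_396, 6: 1_239_584}


def coverage(n: int) -> float:
    """Fraction of the 26**n possible n-grams that actually occurred."""
    return DISTINCT[n] / 26 ** n


for n in DISTINCT:
    print(f"{n}-grams: {DISTINCT[n]:>9,} of {26 ** n:>11,} = {coverage(n):.3%}")
```

Each extra letter multiplies the space of possibilities by 26, but the number of combinations that real text actually uses grows far more slowly.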

Here are the top 200:

#     6-gram  %
#1    ANDTHE  0.090%
#2    INGTHE  0.063%
#3    THOUGH  0.049%
#4    EOFTHE  0.044%
#5    WITHTH  0.042%
#6    THATTH  0.042%
#7    PRINCE  0.042%
#8    HATTHE  0.039%
#9    ITHTHE  0.037%
#10   FROMTH  0.035%
#11   FORTHE  0.035%
#12   SOFTHE  0.035%
#13   EDWITH  0.034%
#14   HOUGHT  0.034%
#15   BEFORE  0.033%
#16   ROMTHE  0.032%
#17   HIMSEL  0.031%
#18   IMSELF  0.031%
#19   LITTLE  0.031%
#20   INGAND  0.029%
#21   TOFTHE  0.029%
#22   NTOTHE  0.029%
#23   HESAID  0.028%
#24   THATHE  0.028%
#25   OULDNO  0.027%
#26   SHOULD  0.027%
#27   DIDNOT  0.026%
#28   ULDNOT  0.026%
#29   WITHOU  0.025%
#30   ALLTHE  0.024%
#31   ITHOUT  0.024%
#32   THEREW  0.024%
#33   YTHING  0.023%
#34   ADBEEN  0.023%
#35   HADBEE  0.023%
#36   ETHING  0.023%
#37   WITHHI  0.023%
#38   LOOKED  0.022%
#39   PIERRE  0.022%
#40   OTHING  0.022%
#41   HENTHE  0.022%
#42   OFTHES  0.022%
#43   ANSWER  0.022%
#44   NCEAND  0.021%
#45   COULDN  0.021%
#46   EANDTH  0.021%
#47   INTOTH  0.021%
#48   NOTHIN  0.021%
#49   TURNED  0.021%
#50   HIMAND  0.020%
#51   SANDTH  0.020%
#52   ROUGHT  0.020%
#53   INGHIS  0.020%
#54   TIONOF  0.020%
#55   NOTHER  0.020%
#56   SHEHAD  0.020%
#57   SAIDTH  0.019%
#58   OTHERS  0.019%
#59   SHEWAS  0.019%
#60   PEOPLE  0.019%
#61   ECOULD  0.019%
#62   NDTHAT  0.019%
#63   THECOU  0.019%
#64   EWOULD  0.019%
#65   OFTHEM  0.019%
#66   HERAND  0.019%
#67   EDTHAT  0.018%
#68   DTOTHE  0.018%
#69   OULDBE  0.018%
#70   ANOTHE  0.018%
#71   THESAM  0.018%
#72   OUGHTH  0.017%
#73   METHIN  0.017%
#74   WHICHH  0.017%
#75   WASTHE  0.017%
#76   RINCES  0.017%
#77   HESAME  0.017%
#78   OMETHI  0.017%
#79   SOMETH  0.017%
#80   WHENTH  0.017%
#81   THROUG  0.017%
#82   HROUGH  0.017%
#83   FATHER  0.017%
#84   AIDTHE  0.017%
#85   SEEMED  0.017%
#86   MOTHER  0.017%
#87   DINTHE  0.017%
#88   EINTHE  0.017%
#89   ANDTHA  0.017%
#90   UNDERS  0.017%
#91   PRESEN  0.017%
#92   ECAUSE  0.016%
#93   THEPRI  0.016%
#94   OFTHEC  0.016%
#95   NDERST  0.016%
#96   INCESS  0.016%
#97   BUTTHE  0.016%
#98   INGHER  0.016%
#99   HEREWA  0.016%
#100  DTHERE  0.016%
#101  EREWAS  0.016%
#102  LOOKIN  0.016%
#103  OOKING  0.016%
#104  EOTHER  0.015%
#105  OULDHA  0.015%
#106  THEFIR  0.015%
#107  THINGS  0.015%
#108  THEWOR  0.015%
#109  MOMENT  0.015%
#110  THEYWE  0.015%
#111  FRIEND  0.015%
#112  NWHICH  0.015%
#113  RETURN  0.015%
#114  THEMAN  0.015%
#115  FRENCH  0.015%
#116  ITHHIS  0.015%
#117  NOFTHE  0.015%
#118  ATIONS  0.015%
#119  NSWERE  0.015%
#120  ETHOUG  0.015%
#121  EANDRE  0.014%
#122  INTHES  0.014%
#123  EPRINC  0.014%
#124  UGHTHE  0.014%
#125  ALWAYS  0.014%
#126  NGWITH  0.014%
#127  LDHAVE  0.014%
#128  ETHERE  0.014%
#129  ULDHAV  0.014%
#130  WASNOT  0.014%
#131  EDTOTH  0.014%
#132  TTHERE  0.014%
#133  ERETHE  0.014%
#134  NGTHAT  0.014%
#135  ESSION  0.014%
#136  HECOUL  0.014%
#137  ECOUNT  0.014%
#138  ATASHA  0.014%
#139  ABOUTT  0.014%
#140  SINTHE  0.014%
#141  NATASH  0.014%
#142  ROFTHE  0.014%
#143  VERTHE  0.014%
#144  INGHIM  0.014%
#145  EVERYT  0.014%
#146  APPEAR  0.013%
#147  ETOTHE  0.013%
#148  EYWERE  0.013%
#149  ROTHER  0.013%
#150  SWERED  0.013%
#151  WHICHT  0.013%
#152  UPONTH  0.013%
#153  RESENT  0.013%
#154  HEYWER  0.013%
#155  TINTHE  0.013%
#156  EFIRST  0.013%
#157  INGTHA  0.013%
#158  HATSHE  0.013%
#159  HEWOUL  0.013%
#160  POSSIB  0.013%
#161  BECAUS  0.013%
#162  INGWIT  0.013%
#163  ANDREW  0.013%
#164  EDTHEM  0.013%
#165  RINCEA  0.013%
#166  OUTTHE  0.013%
#167  IONAND  0.013%
#168  ESTION  0.013%
#169  NDWITH  0.013%
#170  HAVING  0.013%
#171  PRESSI  0.013%
#172  NDTHEN  0.013%
#173  TEDTHE  0.013%
#174  THEREA  0.013%
#175  INCEAN  0.013%
#176  TOTHES  0.013%
#177  ERSELF  0.013%
#178  ECTION  0.013%
#179  THERES  0.013%
#180  ANDHIS  0.013%
#181  HERSEL  0.013%
#182  OFTHEP  0.013%
#183  THEREI  0.013%
#184  SHESAI  0.013%
#185  PONTHE  0.013%
#186  CEANDR  0.013%
#187  RSTAND  0.013%
#188  VERYTH  0.013%
#189  QUESTI  0.013%
#190  WHENHE  0.013%
#191  THECON  0.013%
#192  HEOTHE  0.012%
#193  NEOFTH  0.012%
#194  THEOTH  0.012%
#195  UESTIO  0.012%
#196  INGFOR  0.012%
#197  EELING  0.012%
#198  HECOUN  0.012%
#199  EXPRES  0.012%
#200  STHERE  0.012%

Final return to ROT13

As I was messing with ROT13, I wondered if it was possible to apply ROT13 to a word and make an entirely different (valid) word. A few lines of SQL later revealed that there are quite a few. The longest found in my dictionary file was NOWHERE ↔ ABJURER

NA↔AN   NAAN↔ANNA   NAG↔ANT   NAN↔ANA   NAVY↔ANIL   NE↔AR   NIB↔AVO   NO↔AB   NOB↔ABO   NOON↔ABBA   NOWHERE↔ABJURER   NU↔AH   NUN↔AHA   OHO↔BUB   ON↔BA   ONE↔BAR   ONES↔BARF   ONYX↔BALK   OR↔BE   ORA↔BEN   ORRA↔BEEN   ORT↔BEG   OVA↔BIN   PENNY↔CRAAL   PENT↔CRAG   PERRY↔CREEL   PRY↔CEL   PUNG↔CHAT   PURS↔CHEF   RAIL↔ENVY   RAT↔ENG   RE↔ER   REAR↔ERNE   REE↔ERR   REEF↔ERRS   REF↔ERS   RET↔ERG   ROOF↔EBBS   SEL↔FRY   SENT↔FRAG   SERER↔FRERE   SHA↔FUN   SHE↔FUR   SYNC↔FLAP   TANG↔GNAT   TERRA↔GREEN   THY↔GUL   TRY↔GEL   TUNG↔GHAT   UN↔HA   UREA↔HERN   VEX↔IRK   WHA↔JUN   WHEN↔JURA  
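
The search above was done in SQL; the same idea can be sketched in Python as a simple set lookup. The function name `rot13_pairs` is hypothetical, the short word list is only a stand-in for a real dictionary file, and the built-in `rot_13` codec does the rotation.

```python
import codecs


def rot13_pairs(words) -> list:
    """Find words whose ROT13 image is another word in the same list."""
    vocab = {word.upper() for word in words}
    pairs = set()
    for word in vocab:
        partner = codecs.encode(word, "rot_13")
        if partner in vocab and partner != word:
            pairs.add(tuple(sorted((word, partner))))  # store each pair once
    # Longest pairs first, then alphabetical.
    return sorted(pairs, key=lambda pair: (-len(pair[0]), pair))


sample_words = ["nowhere", "abjurer", "green", "terra", "vex", "irk", "hello"]
for left, right in rot13_pairs(sample_words):
    print(f"{left} <-> {right}")
```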

Encryption Humour

This web page is encrypted with ROT26.

 

© 2009-2015 DataGenetics