
ROT13 and Caesar Ciphers

THE QUICK BROWN FOX SAID I LOVE LUCY

GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL

ROT13 (short for “rotate 13 places”) is an obfuscation technique familiar to nerds, geeks, and computer programmers. It’s commonly used in online forums as a means of hiding or obscuring spoilers, punchlines, hints, and (sometimes) offensive content.

I’m hesitant to call it encryption because it’s so weak. It is a simple substitution cipher (in fact, it's weaker than most because, as described below, it's a Caesar Cipher: every character is shifted by the same fixed offset!)

To execute ROT13 you take a letter and shift it along 13 places (and if you go over the end past ‘Z’, you wrap around again to ‘A’). It jumbles the letters up sufficiently such that, at first glance, you can’t read the message, and that, these days, is its only real purpose. If you attempt to use it to store/encrypt passwords or sensitive information you deserve to have your programming license revoked!

It’s such a popular technique that some text editors and news readers have ROT13 functionality built in!

As an example of how this works the word “HELLO” gets converted by ROT13 into “URYYB”

Traditionally, ROT13 is only applied to the letters ‘A-Z’ and ‘a-z’ so that case, numbers, and other punctuation are preserved.
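
The rule described above (shift letters 13 places, wrap past ‘Z’, leave everything else alone) fits in a few lines. Here is a minimal Python sketch; the function name rot13 is my own choice for illustration, and this is just one of several ways to write it.

```python
import string

def rot13(text: str) -> str:
    """Rotate A-Z and a-z by 13 places; leave every other character alone."""
    upper = string.ascii_uppercase
    lower = string.ascii_lowercase
    # Map each alphabet onto itself rotated by 13, keeping case separate.
    table = str.maketrans(
        upper + lower,
        upper[13:] + upper[:13] + lower[13:] + lower[:13],
    )
    return text.translate(table)

print(rot13("HELLO"))              # URYYB
print(rot13("Hello, World! 123"))  # Uryyb, Jbeyq! 123
```

Because the translation table only mentions letters, digits and punctuation pass through untouched.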

When you were a kid (perhaps you still are one), you might have had fun using some kind of Spy code/de-code wheel. On these devices, the letters ‘A-Z’ are written on concentric disks which can be rotated to offset the alphabet.

Two ‘secret agents’ agree on the offset beforehand then, to encode a message, the desired letter is selected on the inner wheel, and the coded letter read on the outside wheel. The process is then inverted by the decoder to read the message. To use a decoding wheel for ROT13, simply rotate the wheel 13 places.

Why ROT13?

Because the English alphabet has 26 characters, ROT13 has the interesting property that it is self-inverting. Applying it twice to a piece of text returns the original text. It is for this reason that ROT13 became so popular.

ROT13('HELLO') = 'URYYB'

ROT13('URYYB') = 'HELLO'

ROT13(ROT13('HELLO')) = 'HELLO'

If you are familiar with Boolean logic, this is a property similar to the XOR operator. If performed twice with the same argument, XOR returns the input to the same value.
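
For anyone who wants to see the XOR comparison in action, here is a tiny Python illustration. The two byte values are arbitrary ones I picked for the example; nothing about them is special.

```python
value = 0b01001000  # 72, the ASCII code for 'H'
key = 0b00101010    # 42, an arbitrary key chosen for illustration

once = value ^ key   # first XOR scrambles the value
twice = once ^ key   # second XOR with the same key undoes it

print(once)   # 98
print(twice)  # 72
```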

To encode/de-code in ROT13 you only need one command, and you can't get it the wrong way round either!

ROT5

Similar to ROT13, which applies to letters, it’s possible to obfuscate numbers with a similar self-inverting rotation of five places.

43,252,003,274,489,856,000 ↔ 98,707,558,729,934,301,555

There is a hybrid system which encodes text using ROT13, numbers using ROT5, and leaves all other characters unaffected.

ROT5 is subtle; the numbers still just look like numbers!
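
Here is a rough Python sketch of ROT5 and of the hybrid scheme. The function names rot5 and rot13_rot5 are my own labels for illustration.

```python
import string

DIGITS = string.digits
UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase

# Digits rotate by 5; letters (in the hybrid table) rotate by 13.
ROT5_TABLE = str.maketrans(DIGITS, DIGITS[5:] + DIGITS[:5])
HYBRID_TABLE = str.maketrans(
    DIGITS + UPPER + LOWER,
    DIGITS[5:] + DIGITS[:5]
    + UPPER[13:] + UPPER[:13]
    + LOWER[13:] + LOWER[:13],
)

def rot5(text: str) -> str:
    """Rotate digits only; commas, letters and spaces are left alone."""
    return text.translate(ROT5_TABLE)

def rot13_rot5(text: str) -> str:
    """Hybrid: ROT13 for letters, ROT5 for digits, everything else untouched."""
    return text.translate(HYBRID_TABLE)

print(rot5("43,252,003,274,489,856,000"))  # 98,707,558,729,934,301,555
print(rot13_rot5("HELLO 2024"))            # URYYB 7579
```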

ROT47

Another (less-popular) variant is ROT47 which shifts the 94 characters from ASCII 33 (which is the “!” directly after the space) to ASCII 126 “~”. This obfuscates letters, numbers, and punctuation characters but still keeps the output in 7-bit ‘safe’ printable ASCII.

Call the number (425)-555-1212, and ask for "Princess"

r2== E96 ?F>36C WcadX\ddd\`a`a[ 2?5 2D< 7@C Q!C:?46DDQ

ROT47 is far from subtle; it's pretty clear that the message above has been encoded.
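
A minimal Python sketch of ROT47, assuming the 94-character range described above (ASCII 33 to 126); the function name is mine.

```python
def rot47(text: str) -> str:
    """Rotate printable ASCII 33..126 by 47 places; leave other characters alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            # Shift within the 94-character window and wrap around.
            out.append(chr(33 + (code - 33 + 47) % 94))
        else:
            out.append(ch)  # spaces, newlines, non-ASCII stay as they are
    return "".join(out)

print(rot47("Call the number"))  # r2== E96 ?F>36C
```

Since 47 is half of 94, running the function a second time restores the original, just as with ROT13.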

Caesar Ciphers

THE QUICK BROWN FOX SAID I LOVE LUCY

GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL

Using ROT13 as anything more than an obfuscation technique has more security holes than a piece of Swiss cheese. A simple rotation cipher is called a Caesar Cipher, after Julius Caesar, as it is documented that he used this technique to 'protect' messages to his troops (he is documented as using ROT3, whilst his great-nephew Augustus used ROT1).

Once you know the technique used, it's fairly trivial (even using brute force if you don't know the offset), to enumerate all possible versions to reveal the source message!

ROTn	Cipher
0	THEQUICKBROWNFOXSAIDILOVELUCY
+1	UIFRVJDLCSPXOGPYTBJEJMPWFMVDZ
+2	VJGSWKEMDTQYPHQZUCKFKNQXGNWEA
+3	WKHTXLFNEURZQIRAVDLGLORYHOXFB
+4	XLIUYMGOFVSARJSBWEMHMPSZIPYGC
+5	YMJVZNHPGWTBSKTCXFNINQTAJQZHD
+6	ZNKWAOIQHXUCTLUDYGOJORUBKRAIE
+7	AOLXBPJRIYVDUMVEZHPKPSVCLSBJF
+8	BPMYCQKSJZWEVNWFAIQLQTWDMTCKG
+9	CQNZDRLTKAXFWOXGBJRMRUXENUDLH
+10	DROAESMULBYGXPYHCKSNSVYFOVEMI
+11	ESPBFTNVMCZHYQZIDLTOTWZGPWFNJ
+12	FTQCGUOWNDAIZRAJEMUPUXAHQXGOK
+13	GURDHVPXOEBJASBKFNVQVYBIRYHPL
…	…

There are only 25 rotations to try by brute force!
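
Enumerating them takes almost no code. This is a small Python sketch (the helper names caesar and brute_force are my own) that undoes every possible offset and prints the candidates so you can spot the readable one by eye.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar(text: str, shift: int) -> str:
    """Shift A-Z by `shift` places (negative shifts go backwards)."""
    k = shift % 26
    rotated = ALPHABET[k:] + ALPHABET[:k]
    return text.upper().translate(str.maketrans(ALPHABET, rotated))

def brute_force(ciphertext: str) -> list[tuple[int, str]]:
    """Return (offset, candidate plaintext) for each of the 25 non-trivial offsets."""
    return [(shift, caesar(ciphertext, -shift)) for shift in range(1, 26)]

for shift, candidate in brute_force("GURDHVPXOEBJASBKFNVQVYBIRYHPL"):
    print(f"-{shift:<2} {candidate}")
```

The line for offset 13 reads THEQUICKBROWNFOXSAIDILOVELUCY; every other line is gibberish.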

Substitution Ciphers

Closely related to Caesar Ciphers are Substitution Ciphers. These still map 1:1 between each character in the source text and cipher text, but adjacent characters in the source do not have to map to adjacent ones in the destination.
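
To make the difference from a Caesar cipher concrete, here is a minimal Python sketch of a general substitution cipher. The key is just a shuffled alphabet; the seed value and the function names are arbitrary choices of mine for illustration.

```python
import random
import string

ALPHABET = string.ascii_uppercase

def make_key(seed: int) -> str:
    """Build a key by shuffling the alphabet (seeded so the example is repeatable)."""
    letters = list(ALPHABET)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

def encode(text: str, key: str) -> str:
    """Replace each letter with the key letter in the same position."""
    return text.upper().translate(str.maketrans(ALPHABET, key))

def decode(text: str, key: str) -> str:
    """Invert the mapping: key letter back to plain letter."""
    return text.translate(str.maketrans(key, ALPHABET))

key = make_key(2024)  # hypothetical seed
secret = encode("THE QUICK BROWN FOX", key)
print(key)
print(secret)
print(decode(secret, key))  # THE QUICK BROWN FOX
```

A Caesar cipher is the special case where the key is the alphabet rotated rather than shuffled, which is why it has only 25 useful keys instead of 26! possible orderings.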

If spaces are preserved in the encoding, it's easy to see where the word breaks are, and thus you can guess at what you think are the more popular words. As each character is always converted over to the same replacement character, common words (and commonly occurring groupings and patterns of letters) start to jump out of the page very quickly (especially if the message is quite long). Removing the white space between words adds a trivial level of complexity.

Substitution ciphers don't have to just use other letters. Symbols can be used. Two of the most well known examples of this are "The Dancing Men", from the famous Sherlock Holmes story, and the "PigPen Cipher" which uses fragments of grids and dots to represent the alphabet.

Solving Substitution Ciphers

With or without spaces, substitution ciphers are fairly easy to solve. There's a 1:1 mapping between each character so, once you know one conversion, you know all other occurrences of that same character (and you also know that this letter can't be used again).

The solution space is so small that the solving of these is a hobby (like solving Crossword puzzles, word searches, or Sudoku). These puzzles are called Cryptograms.

An example is shown on the left for the quote:

"Style and structure are the essence of a book; great ideas are hogwash." - Vladimir Nabokov

The strategy for solving cryptograms is a combination of brute force, heuristics, and letter/word frequency.

Not all letters in the English language are used equally. Some, like the letters 'E', 'T' and 'A' are used very frequently. Unless the message we are trying to decode is very obscure, we'd expect the distribution of symbols in the solution to follow a similar profile. This alone could give a first pass for decoding a message; we simply match the frequency of symbols in the secret message against the frequency we expect for each letter.

From Wikipedia, here is the ordering of letters in the English language (taken from a corpus of many hundreds and thousands of documents):

ETAOINSHRDLUCMFWYPVBGKQJXZ

(You might also like an article I wrote a few years ago about the game of Hangman and letter distribution).

If our secret message is a representative sample of the entire English language, we'd expect the symbol representing "E" to be the most frequently occurring in our message, followed by "T", then "A" …

This is far from perfect; the chances are our message is short, so the letters might not follow this distribution closely (there may not be enough letters to give fine-grained frequencies, and the message may not even use every letter of the alphabet). We can narrow down solutions using brute force and Chi-squared tests that compare the observed letter frequencies against the expected ones, but there is so much more we can do very easily.
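
As a rough illustration of the Chi-squared idea, here is a Python sketch for the simplest case, a Caesar shift, where brute force only has 26 candidates to score. The frequency table holds approximate, commonly quoted values for English; it is an assumption for the example and not the data measured later in this article. The function names are mine.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase

# Approximate English letter frequencies in percent (commonly quoted figures;
# an assumption for illustration, not measured from the books in this article).
EXPECTED = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3,
    "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4,
    "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0, "P": 1.9, "B": 1.5, "V": 1.0,
    "K": 0.8, "J": 0.15, "X": 0.15, "Q": 0.1, "Z": 0.07,
}

def chi_squared(text: str) -> float:
    """Lower score = letter counts look more like the expected English profile."""
    letters = [c for c in text.upper() if c in EXPECTED]
    total = len(letters)
    if total == 0:
        return 0.0
    counts = Counter(letters)
    score = 0.0
    for letter, pct in EXPECTED.items():
        expected = total * pct / 100
        score += (counts[letter] - expected) ** 2 / expected
    return score

def shift_back(text: str, shift: int) -> str:
    """Undo a Caesar shift of `shift` places on A-Z."""
    k = -shift % 26
    rotated = ALPHABET[k:] + ALPHABET[:k]
    return text.upper().translate(str.maketrans(ALPHABET, rotated))

def best_shift(ciphertext: str) -> int:
    """Try all 26 shifts and keep the one whose decoding scores lowest."""
    return min(range(26), key=lambda s: chi_squared(shift_back(ciphertext, s)))

cipher = "GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL"
guess = best_shift(cipher)
print(guess, shift_back(cipher, guess))
```

With a message this short the lowest-scoring shift is not guaranteed to be the right one, which is exactly the caveat above; longer messages give the statistics more to work with.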

If white spaces are present, we can apply knowledge of the words in the English language. We know that there are only a limited number of two, three and four letter words, and these words are common. Did you know that one third of all printed English materials are made up of the top 25 occurring words? (The most popular 100 words make up approximately half of all printed English!)

the, of, and, a, to, in, is, you, that, it, he, was, for, on, are, as, with, his, they, I, at, be, this, have, from

Guessing which word could be which and corroborating this with what these symbols/letters would be like in the other words could be a great help.

If there is no white space to give word breaks, we can still apply statistical techniques. Certain combinations of letters often occur together. It's very common to have "TH" next to each other and "ER" and "RE". Certain letters often occur in double form, like "OO", "EE" and "LL". Conversely, have you ever seen a word containing "JJ"*?

We're taught at an early age that it's very common for "Q" to be followed by "U". Although "Q" is not a popular letter, if we do identify one, there's a very good chance the letter after it is a "U". There are similar rules with other letters.

Most (not all) words contain at least one vowel (AEIOU), and if you include "Y" as a vowel you cover practically all words. The more letters you lock in, the easier it gets to solve the rest (both because you have partial words to complete, and because the unused letter pool is smaller).

*I can only think of: HAJJ, HAJJES, HAJJI, HAJJIS

Let's take a look

I was curious about letter distribution, so I downloaded a dozen books in plain text from Project Gutenberg. This site has 50,000 free books available!

If you are interested, the books I selected (randomly from the fiction collection) were:

20,000 Leagues under the Sea, Jules Verne
A Tale of Two Cities, Charles Dickens
Around the World in 80 Days, Jules Verne
Little Women, Louisa May Alcott
Alice's Adventures in Wonderland, Lewis Carroll
Anna Karenina, Leo Tolstoy
The Arabian Nights, Sir Richard Burton
The Canterbury Tales, Geoffrey Chaucer
The Journey to the Center of the Earth, Jules Verne
The Wonderful Wizard of Oz, L. Frank Baum
War and Peace, Leo Tolstoy
Ben-Hur, Lew Wallace

Obviously, the more books you sample the more refined your distribution will become for generic solving. Alternatively, if you have some idea of the context of your secret message, you might elect to sample a more specific set of books to more accurately represent the sample you have.

Single Letter Frequency

Based on the books above, here is the single letter frequency distribution. Each percentage is that letter's share of all the letters in these books (approximately 9 million letters).

Here is the same data plotted in sorted order. The ordering is slightly different to the answer given by Wikipedia, but that's because we're using different samples.
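
Counting single letters is straightforward. Here is a minimal Python sketch, using a short sample string in place of the downloaded books; the function name letter_frequencies is my own.

```python
from collections import Counter

def letter_frequencies(text: str) -> list[tuple[str, float]]:
    """Return (letter, percent of all letters), most frequent first."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return [(letter, 100 * n / total) for letter, n in counts.most_common()]

sample = "THE QUICK BROWN FOX SAID I LOVE LUCY"
for letter, pct in letter_frequencies(sample):
    print(f"{letter}  {pct:5.2f}%")
```

To reproduce the real thing you would feed in the full text of each book instead of the sample string.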

Bigrams

Next I looked at the frequency of all bigrams (also called letter couplets, adjacent pairs, or sometimes digrams).

To generate this list I ignored any white space and punctuation characters. So, in addition to all the pairs of letters that occur next to each other inside words, this list also contains pairs formed by the last letter of one word and the first letter of the next. This will help if your secret message does not contain the white space that would let you determine where the word breaks are.

Here are the 50 most frequently occurring adjacent pairs of letters:

Interestingly, even though there are 26 characters, the total number of bigrams in my sample is not 26 × 26 (=676). Instead there are 643 distinct items. Not every possible pairing of characters occurs (for instance the pairing "QZ" or "ZX" never occurred in the books I sampled).
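
The counting approach described above (strip everything that isn't a letter, then slide a window along the text) might look something like this in Python. It is a sketch with a toy sentence, and ngram_counts is a name I made up; the same function works for any n.

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count overlapping n-grams after dropping spaces and punctuation."""
    letters = "".join(c for c in text.upper() if "A" <= c <= "Z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

counts = ngram_counts("The cat sat on the mat.", 2)
print(counts.most_common(3))  # [('AT', 3), ('TH', 2), ('HE', 2)]
print(len(counts))            # 12 distinct bigrams
print(counts["TS"])           # 1, from "caT Sat" spanning a word break
```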

As expected the frequency of "TH" and "HE" dominates. These bigrams appear inside many words, including the most common ones.

Note - This is great data to use if you have no information about either of the characters. However, if you know one of the characters in the pair, you can use this information to find a conditional probability. For instance, if you know a pair is "Q?", then there is a 99.9% chance that the unknown character is a "U".
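
Here is a small Python sketch of that conditional calculation: given the first letter of a pair, what is the distribution of the second? The sample sentence is a toy one of mine, so it gives a clean 100% rather than the 99.9% from the real books.

```python
from collections import Counter

def next_letter_probabilities(text: str, first: str) -> dict[str, float]:
    """P(second letter | first letter), estimated from adjacent letter pairs."""
    letters = "".join(c for c in text.upper() if "A" <= c <= "Z")
    followers = Counter(b for a, b in zip(letters, letters[1:]) if a == first)
    total = sum(followers.values())
    return {letter: n / total for letter, n in followers.most_common()}

sample = "The queen quietly questioned the quality of the quiz."
print(next_letter_probabilities(sample, "Q"))  # {'U': 1.0}
```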

Here are the top 200 bigrams in tabular form:

#	2-gram	%
#1	TH	3.322%
#2	HE	3.108%
#3	AN	1.838%
#4	ER	1.820%
#5	IN	1.801%
#6	ND	1.434%
#7	RE	1.373%
#8	ED	1.244%
#9	ES	1.220%
#10	HA	1.163%
#11	TO	1.101%
#12	EN	1.096%
#13	EA	1.071%
#14	AT	1.062%
#15	HI	1.046%
#16	ON	1.042%
#17	ST	1.011%
#18	OU	1.000%
#19	NT	0.988%
#20	NG	0.949%
#21	AS	0.909%
#22	IT	0.899%
#23	IS	0.881%
#24	ET	0.844%
#25	OR	0.832%
#26	TE	0.797%
#27	SE	0.767%
#28	OF	0.746%
#29	AR	0.735%
#30	TI	0.719%
#31	LE	0.706%
#32	SA	0.690%
#33	VE	0.637%
#34	NE	0.636%
#35	AL	0.629%
#36	ME	0.625%
#37	RO	0.608%
#38	NO	0.598%
#39	SH	0.592%
#40	OT	0.589%
#41	DE	0.588%
#42	EL	0.578%
#43	TA	0.564%
#44	LL	0.561%
#45	TT	0.560%
#46	SO	0.546%
#47	RI	0.543%
#48	DT	0.538%
#49	HO	0.536%
#50	WA	0.531%

#	2-gram	%
#51	SS	0.506%
#52	RA	0.500%
#53	EW	0.496%
#54	EE	0.492%
#55	WH	0.490%
#56	SI	0.478%
#57	OM	0.477%
#58	DI	0.473%
#59	BE	0.467%
#60	DA	0.464%
#61	AD	0.461%
#62	MA	0.453%
#63	EC	0.450%
#64	EM	0.446%
#65	WI	0.441%
#66	CH	0.440%
#67	CO	0.438%
#68	CE	0.437%
#69	UT	0.436%
#70	OW	0.435%
#71	RT	0.432%
#72	LI	0.431%
#73	NA	0.415%
#74	LA	0.401%
#75	FO	0.400%
#76	RS	0.397%
#77	EI	0.389%
#78	AI	0.382%
#79	UR	0.379%
#80	LO	0.379%
#81	WE	0.378%
#82	DO	0.371%
#83	LY	0.369%
#84	IM	0.366%
#85	IL	0.365%
#86	US	0.362%
#87	GH	0.357%
#88	EH	0.353%
#89	ID	0.350%
#90	NS	0.349%
#91	FT	0.346%
#92	OO	0.344%
#93	IC	0.342%
#94	TS	0.331%
#95	UN	0.330%
#96	EF	0.322%
#97	EO	0.321%
#98	HT	0.321%
#99	YO	0.320%
#100	EP	0.316%

#	2-gram	%
#101	DS	0.311%
#102	PE	0.310%
#103	NI	0.307%
#104	NC	0.303%
#105	OS	0.303%
#106	AC	0.301%
#107	LD	0.296%
#108	CA	0.286%
#109	MO	0.284%
#110	UL	0.279%
#111	OL	0.275%
#112	DH	0.273%
#113	IO	0.269%
#114	KE	0.268%
#115	TR	0.265%
#116	IE	0.264%
#117	IR	0.263%
#118	EV	0.263%
#119	AM	0.252%
#120	TW	0.250%
#121	FA	0.249%
#122	GE	0.248%
#123	AY	0.244%
#124	GA	0.243%
#125	PR	0.239%
#126	EY	0.233%
#127	WO	0.232%
#128	SW	0.229%
#129	PA	0.229%
#130	MI	0.228%
#131	RY	0.220%
#132	GO	0.218%
#133	EB	0.215%
#134	FI	0.210%
#135	YA	0.207%
#136	AV	0.206%
#137	BU	0.206%
#138	RD	0.205%
#139	YT	0.204%
#140	PO	0.203%
#141	SP	0.200%
#142	IG	0.200%
#143	OV	0.200%
#144	FE	0.198%
#145	FR	0.190%
#146	DW	0.188%
#147	SU	0.186%
#148	EG	0.183%
#149	AP	0.182%
#150	NH	0.181%

#	2-gram	%
#151	DR	0.177%
#152	DB	0.177%
#153	AB	0.176%
#154	YS	0.176%
#155	OD	0.174%
#156	TU	0.173%
#157	VI	0.173%
#158	GT	0.171%
#159	TL	0.170%
#160	SC	0.165%
#161	PL	0.164%
#162	LT	0.164%
#163	IF	0.163%
#164	TY	0.159%
#165	AG	0.159%
#166	RR	0.159%
#167	YE	0.159%
#168	MY	0.156%
#169	BO	0.156%
#170	KI	0.149%
#171	BL	0.148%
#172	CT	0.147%
#173	OP	0.146%
#174	GI	0.146%
#175	DN	0.146%
#176	UG	0.145%
#177	OH	0.142%
#178	GR	0.142%
#179	RM	0.141%
#180	UP	0.140%
#181	RN	0.140%
#182	OK	0.139%
#183	IV	0.137%
#184	SM	0.136%
#185	IA	0.135%
#186	OA	0.132%
#187	RH	0.130%
#188	DD	0.126%
#189	PI	0.125%
#190	OI	0.124%
#191	AW	0.123%
#192	SL	0.123%
#193	EX	0.121%
#194	SB	0.119%
#195	MP	0.119%
#196	NW	0.119%
#197	DM	0.118%
#198	BA	0.118%
#199	AK	0.118%
#200	SF	0.116%

Trigrams

The next logical expansion is to look at trigrams (sequences of three letters).

There are 9,671 distinct trigrams in my sample (cf. 26 × 26 × 26 = 17,576 possible).

Again seeing "THE" at the top is no surprise, neither is "AND". These are both popular words in their own right, and sub-strings of other words. "ING" comes next as the suffix for many verbs, followed by many other triplets you can find inside common words.

Here are the top 200 trigrams in tabular form:

#	3-gram	%
#1	THE	2.049%
#2	AND	1.097%
#3	ING	0.758%
#4	HER	0.615%
#5	THA	0.449%
#6	ERE	0.420%
#7	HIS	0.416%
#8	HAT	0.412%
#9	ETH	0.348%
#10	DTH	0.341%
#11	ENT	0.326%
#12	NTH	0.323%
#13	THI	0.306%
#14	FOR	0.304%
#15	OTH	0.303%
#16	ITH	0.302%
#17	WAS	0.300%
#18	HES	0.297%
#19	SHE	0.285%
#20	WIT	0.271%
#21	TTH	0.270%
#22	INT	0.249%
#23	EAN	0.246%
#24	FTH	0.243%
#25	ALL	0.241%
#26	TER	0.240%
#27	OFT	0.240%
#28	VER	0.237%
#29	NOT	0.232%
#30	EDT	0.232%
#31	YOU	0.227%
#32	EST	0.223%
#33	ERS	0.216%
#34	GHT	0.215%
#35	ION	0.212%
#36	STH	0.204%
#37	REA	0.202%
#38	HIM	0.199%
#39	ESS	0.199%
#40	SAN	0.197%
#41	NDT	0.192%
#42	HAD	0.191%
#43	EAR	0.189%
#44	RTH	0.184%
#45	RES	0.183%
#46	HEM	0.182%
#47	ONE	0.180%
#48	HEN	0.180%
#49	EDA	0.179%
#50	HEW	0.179%

#	3-gram	%
#51	NCE	0.178%
#52	HOU	0.177%
#53	EVE	0.175%
#54	AST	0.174%
#55	ATT	0.172%
#56	OME	0.172%
#57	ONT	0.171%
#58	OUT	0.171%
#59	HIN	0.170%
#60	MAN	0.170%
#61	TIN	0.170%
#62	NGT	0.168%
#63	HEA	0.167%
#64	STO	0.167%
#65	HEC	0.165%
#66	ATI	0.164%
#67	THO	0.162%
#68	BUT	0.161%
#69	ESA	0.161%
#70	ATH	0.160%
#71	TAN	0.160%
#72	HAN	0.156%
#73	DIN	0.155%
#74	TIO	0.154%
#75	HED	0.153%
#76	ERA	0.152%
#77	AVE	0.152%
#78	EOF	0.152%
#79	NDS	0.151%
#80	TOT	0.151%
#81	RIN	0.151%
#82	DTO	0.150%
#83	OUL	0.150%
#84	ERT	0.148%
#85	TED	0.146%
#86	RED	0.146%
#87	NDE	0.145%
#88	OUN	0.143%
#89	IGH	0.143%
#90	RAN	0.142%
#91	WHI	0.142%
#92	ORE	0.142%
#93	OUR	0.141%
#94	EWA	0.141%
#95	ORT	0.141%
#96	ETO	0.140%
#97	ILL	0.140%
#98	DAN	0.140%
#99	NTO	0.138%
#100	EDI	0.137%

#	3-gram	%
#101	ANT	0.136%
#102	WER	0.136%
#103	ULD	0.135%
#104	ATE	0.134%
#105	AID	0.134%
#106	YTH	0.133%
#107	SOF	0.133%
#108	ICH	0.132%
#109	STA	0.130%
#110	ECO	0.130%
#111	WHE	0.128%
#112	HEH	0.128%
#113	ARE	0.127%
#114	AIN	0.126%
#115	UGH	0.125%
#116	EIN	0.125%
#117	EAS	0.124%
#118	SAI	0.124%
#119	ONS	0.123%
#120	IST	0.122%
#121	OVE	0.122%
#122	EHA	0.120%
#123	OUS	0.120%
#124	NDI	0.119%
#125	SIN	0.119%
#126	ERI	0.117%
#127	CON	0.117%
#128	STE	0.116%
#129	MEN	0.116%
#130	UND	0.116%
#131	DER	0.116%
#132	NIN	0.116%
#133	SHA	0.115%
#134	NDA	0.115%
#135	NGA	0.115%
#136	EAT	0.115%
#137	HEL	0.115%
#138	RET	0.114%
#139	ASS	0.114%
#140	ISH	0.113%
#141	TOF	0.113%
#142	COM	0.113%
#143	EEN	0.112%
#144	HEP	0.112%
#145	HTH	0.112%
#146	HET	0.111%
#147	NOW	0.108%
#148	HEY	0.108%
#149	EDH	0.107%
#150	ROM	0.107%

#	3-gram	%
#151	FRO	0.107%
#152	EHE	0.107%
#153	ESE	0.106%
#154	DHE	0.106%
#155	ELL	0.106%
#156	EFO	0.106%
#157	NED	0.105%
#158	GTH	0.105%
#159	LEA	0.105%
#160	HAV	0.104%
#161	KIN	0.104%
#162	WHO	0.104%
#163	COU	0.104%
#164	ART	0.103%
#165	NTE	0.102%
#166	HEI	0.102%
#167	ENE	0.101%
#168	HEF	0.101%
#169	ESO	0.101%
#170	SEL	0.100%
#171	DNO	0.100%
#172	OUG	0.100%
#173	IVE	0.099%
#174	EDO	0.099%
#175	WHA	0.099%
#176	AME	0.098%
#177	HEE	0.098%
#178	HIC	0.098%
#179	STI	0.096%
#180	INE	0.096%
#181	EAD	0.096%
#182	EME	0.096%
#183	ERO	0.096%
#184	DHI	0.095%
#185	EMA	0.095%
#186	STR	0.094%
#187	NDH	0.094%
#188	SSI	0.094%
#189	ERY	0.094%
#190	BLE	0.093%
#191	CHA	0.093%
#192	OOK	0.093%
#193	INA	0.092%
#194	SHO	0.092%
#195	TOH	0.091%
#196	NAN	0.091%
#197	IDE	0.091%
#198	OSE	0.090%
#199	DRE	0.089%
#200	IND	0.089%

Quadgrams (or is it Tetragrams?)

After three, comes four. I'm not sure if it's correct to call them quadgrams or tetragrams (Latin or Greek?), so instead we'll just call them n-grams or 4-grams.

There were 87,526 4-grams in my book samples (cf. 26 × 26 × 26 × 26 = 456,976 possible; less than 20% of the theoretical possible combinations).

Here are the top 200:

#	4-gram	%
#1	THER	0.325%
#2	THAT	0.302%
#3	WITH	0.256%
#4	DTHE	0.253%
#5	NTHE	0.250%
#6	OTHE	0.219%
#7	OFTH	0.217%
#8	FTHE	0.206%
#9	THES	0.203%
#10	TTHE	0.192%
#11	HERE	0.189%
#12	EAND	0.183%
#13	ETHE	0.177%
#14	ANDT	0.164%
#15	THEM	0.162%
#16	SAND	0.161%
#17	TION	0.151%
#18	INGT	0.144%
#19	NDTH	0.143%
#20	THIS	0.139%
#21	OULD	0.134%
#22	INTH	0.132%
#23	THEC	0.132%
#24	STHE	0.130%
#25	TOTH	0.129%
#26	ANDS	0.129%
#27	EDTH	0.129%
#28	IGHT	0.122%
#29	THIN	0.118%
#30	SAID	0.118%
#31	EVER	0.114%
#32	ATTH	0.111%
#33	RTHE	0.110%
#34	THOU	0.110%
#35	WERE	0.109%
#36	THEY	0.106%
#37	HING	0.106%
#38	DAND	0.105%
#39	NGTH	0.103%
#40	TAND	0.103%
#41	THEP	0.101%
#42	INGA	0.099%
#43	OUGH	0.095%
#44	EDTO	0.095%
#45	THEW	0.094%
#46	THEN	0.094%
#47	EWAS	0.094%
#48	ONTH	0.093%
#49	HICH	0.092%
#50	FROM	0.092%

#	4-gram	%
#51	WHIC	0.092%
#52	HAVE	0.090%
#53	WHAT	0.090%
#54	ANDA	0.090%
#55	EFOR	0.086%
#56	THEF	0.084%
#57	HTHE	0.084%
#58	UGHT	0.083%
#59	TING	0.083%
#60	KING	0.082%
#61	ATHE	0.081%
#62	ANDW	0.081%
#63	ERTH	0.081%
#64	THEI	0.080%
#65	ANDH	0.080%
#66	HEWA	0.078%
#67	DNOT	0.078%
#68	RAND	0.077%
#69	VERY	0.077%
#70	THEE	0.075%
#71	THET	0.075%
#72	FORT	0.075%
#73	ANDI	0.075%
#74	GTHE	0.075%
#75	THED	0.075%
#76	HEHA	0.074%
#77	THEL	0.074%
#78	YTHE	0.073%
#79	HAND	0.072%
#80	HESA	0.071%
#81	HECO	0.071%
#82	YAND	0.071%
#83	EHAD	0.071%
#84	ORTH	0.071%
#85	INGH	0.070%
#86	SELF	0.070%
#87	WHEN	0.069%
#88	ERED	0.069%
#89	THEB	0.069%
#90	THEH	0.067%
#91	MENT	0.067%
#92	NAND	0.067%
#93	EDAN	0.066%
#94	OUND	0.066%
#95	SOME	0.065%
#96	NDER	0.065%
#97	NING	0.065%
#98	HERS	0.064%
#99	HATH	0.063%
#100	TWAS	0.063%

#	4-gram	%
#101	ATIO	0.063%
#102	RING	0.063%
#103	INGS	0.062%
#104	INGO	0.061%
#105	OVER	0.061%
#106	HATT	0.060%
#107	ETHA	0.059%
#108	WOUL	0.059%
#109	ENTH	0.059%
#110	THAN	0.058%
#111	ERAN	0.058%
#112	EDHI	0.058%
#113	LOOK	0.058%
#114	THTH	0.056%
#115	DWIT	0.056%
#116	HATI	0.056%
#117	HEAR	0.056%
#118	ITHA	0.055%
#119	EOFT	0.055%
#120	THEA	0.055%
#121	THEG	0.055%
#122	NGTO	0.055%
#123	INCE	0.054%
#124	ASTH	0.054%
#125	HEIR	0.054%
#126	WILL	0.054%
#127	BEEN	0.053%
#128	FORE	0.053%
#129	MTHE	0.053%
#130	INGI	0.053%
#131	NOTH	0.052%
#132	LING	0.052%
#133	MAND	0.052%
#134	INTO	0.051%
#135	STAN	0.051%
#136	THEO	0.051%
#137	LLTH	0.051%
#138	RETH	0.051%
#139	EDIN	0.051%
#140	HESE	0.051%
#141	HERA	0.051%
#142	DING	0.050%
#143	HOUG	0.050%
#144	ETHI	0.050%
#145	ANDR	0.050%
#146	TOHI	0.049%
#147	DTHA	0.049%
#148	TTER	0.049%
#149	ANCE	0.049%
#150	KNOW	0.049%

#	4-gram	%
#151	TIME	0.049%
#152	REAT	0.048%
#153	SWER	0.048%
#154	COUL	0.048%
#155	UNDE	0.048%
#156	LIKE	0.048%
#157	HEMA	0.047%
#158	SOFT	0.047%
#159	YOUR	0.047%
#160	ITHT	0.047%
#161	PRIN	0.047%
#162	NESS	0.047%
#163	EREA	0.047%
#164	LTHE	0.047%
#165	RINC	0.046%
#166	NHIS	0.046%
#167	WASA	0.046%
#168	DHIS	0.046%
#169	RESS	0.046%
#170	IONS	0.045%
#171	DHER	0.045%
#172	LAND	0.045%
#173	NDIN	0.045%
#174	DHIM	0.044%
#175	MORE	0.044%
#176	ERIN	0.044%
#177	ABLE	0.044%
#178	ESAI	0.044%
#179	ERES	0.044%
#180	ENCE	0.044%
#181	ESAN	0.044%
#182	OUNT	0.043%
#183	TTLE	0.043%
#184	HATS	0.043%
#185	COME	0.043%
#186	HEST	0.043%
#187	LONG	0.042%
#188	PRES	0.042%
#189	UTTH	0.042%
#190	EYOU	0.042%
#191	WHER	0.042%
#192	TOBE	0.042%
#193	ABOU	0.041%
#194	METH	0.041%
#195	EWIT	0.041%
#196	HERO	0.041%
#197	HIMS	0.041%
#198	NDRE	0.041%
#199	NDHE	0.041%
#200	OMTH	0.041%

Things get a little more complicated as we move to four characters. Top of the list is "THER", some of which could be from the word "THE" followed by a word starting with "R", but most of the frequency of "THER" comes from it being part of words like "THERE" and "OTHER" (and all those other words that contain this sub-string).

Looking through the list it is easy to see entries that are popular four-character words in their own right, as well as the sub-strings.

5-grams

There were 434,396 5-grams (cf. 26 × 26 × 26 × 26 × 26 = 11,881,376 possible; less than 4% of the theoretical possible combinations).

Here are the top 200:

#	5-gram	%
#1	OFTHE	0.190%
#2	ANDTH	0.122%
#3	TOTHE	0.116%
#4	INTHE	0.112%
#5	THERE	0.108%
#6	NDTHE	0.106%
#7	EDTHE	0.097%
#8	WHICH	0.092%
#9	ATTHE	0.090%
#10	OTHER	0.090%
#11	INGTH	0.085%
#12	THING	0.081%
#13	ONTHE	0.075%
#14	NGTHE	0.074%
#15	OUGHT	0.064%
#16	ATION	0.063%
#17	WOULD	0.059%
#18	EDAND	0.056%
#19	THECO	0.055%
#20	DWITH	0.055%
#21	THEIR	0.053%
#22	HEHAD	0.053%
#23	INGTO	0.052%
#24	EOFTH	0.052%
#25	HEWAS	0.051%
#26	FORTH	0.051%
#27	ERTHE	0.051%
#28	THOUG	0.050%
#29	HOUGH	0.049%
#30	HATTH	0.049%
#31	COULD	0.048%
#32	THATT	0.048%
#33	EVERY	0.048%
#34	ERAND	0.047%
#35	THTHE	0.047%
#36	WITHA	0.046%
#37	DTHAT	0.046%
#38	WITHT	0.046%
#39	THESE	0.046%
#40	ETHAT	0.043%
#41	PRINC	0.043%
#42	ITHTH	0.043%
#43	THATH	0.043%
#44	ORTHE	0.043%
#45	ESAID	0.042%
#46	THEMA	0.042%
#47	THATI	0.042%
#48	ENTHE	0.042%
#49	RINCE	0.042%
#50	EFORE	0.041%

#	5-gram	%
#51	ABOUT	0.040%
#52	ESAND	0.040%
#53	ATHER	0.040%
#54	SOFTH	0.039%
#55	ITWAS	0.039%
#56	ASTHE	0.039%
#57	AFTER	0.038%
#58	SWERE	0.038%
#59	UNDER	0.038%
#60	EWITH	0.038%
#61	WHERE	0.037%
#62	WITHH	0.037%
#63	FROMT	0.037%
#64	ALLTH	0.036%
#65	ETHER	0.036%
#66	LLTHE	0.036%
#67	INGAN	0.036%
#68	ANDHE	0.036%
#69	OMTHE	0.035%
#70	ROMTH	0.035%
#71	THEMO	0.035%
#72	HATHE	0.035%
#73	EDWIT	0.035%
#74	AGAIN	0.034%
#75	NEVER	0.034%
#76	INGHI	0.034%
#77	CEAND	0.034%
#78	BEFOR	0.034%
#79	THEPR	0.034%
#80	TTHAT	0.033%
#81	NGAND	0.033%
#82	THATS	0.033%
#83	OULDN	0.033%
#84	RETHE	0.033%
#85	TOFTH	0.033%
#86	SAIDT	0.032%
#87	THERS	0.032%
#88	COUNT	0.032%
#89	TIONS	0.032%
#90	STAND	0.032%
#91	EDHIM	0.032%
#92	UTTHE	0.032%
#93	HIMSE	0.032%
#94	OFHIS	0.032%
#95	MSELF	0.031%
#96	NTOTH	0.031%
#97	THEWA	0.031%
#98	STHAT	0.031%
#99	ITTLE	0.031%
#100	IMSEL	0.031%

#	5-gram	%
#101	NTHES	0.031%
#102	LITTL	0.031%
#103	INGIN	0.031%
#104	HECOU	0.030%
#105	ROUGH	0.030%
#106	THESA	0.030%
#107	ANDIN	0.030%
#108	BYTHE	0.030%
#109	RIGHT	0.029%
#110	ANDRE	0.029%
#111	HESAI	0.029%
#112	THECA	0.029%
#113	THEST	0.029%
#114	THERO	0.029%
#115	LIGHT	0.029%
#116	TOHIM	0.028%
#117	CTION	0.028%
#118	THERA	0.028%
#119	WITHO	0.028%
#120	EANDT	0.028%
#121	IONOF	0.028%
#122	IDNOT	0.028%
#123	HADBE	0.027%
#124	HEREW	0.027%
#125	INGOF	0.027%
#126	ITHOU	0.027%
#127	SANDT	0.027%
#128	ETHIN	0.027%
#129	ULDNO	0.027%
#130	GREAT	0.027%
#131	ROUND	0.027%
#132	DIDNO	0.027%
#133	HOULD	0.027%
#134	SHOUL	0.027%
#135	EDHER	0.026%
#136	LDNOT	0.026%
#137	OTHIN	0.026%
#138	THOUT	0.026%
#139	THEWO	0.026%
#140	HEREA	0.026%
#141	EDHIS	0.025%
#142	DTHES	0.025%
#143	INHIS	0.025%
#144	INGHE	0.025%
#145	SWITH	0.025%
#146	DTHEM	0.025%
#147	NTHAT	0.025%
#148	ANDWH	0.025%
#149	YTHIN	0.024%
#150	THELA	0.024%

#	5-gram	%
#151	HENTH	0.024%
#152	THATW	0.024%
#153	DBEEN	0.024%
#154	NOTHE	0.024%
#155	SOMET	0.024%
#156	FIRST	0.024%
#157	TWITH	0.024%
#158	NSWER	0.024%
#159	ADBEE	0.023%
#160	THEFI	0.023%
#161	ITHHI	0.023%
#162	PRESS	0.023%
#163	FTHES	0.023%
#164	WASTH	0.023%
#165	HERAN	0.023%
#166	STILL	0.023%
#167	AKING	0.023%
#168	LOOKE	0.023%
#169	BUTTH	0.023%
#170	ASKED	0.023%
#171	TIONO	0.023%
#172	OOKED	0.023%
#173	SHEHA	0.023%
#174	TOHER	0.022%
#175	ANDSO	0.022%
#176	ESTHE	0.022%
#177	URNED	0.022%
#178	THEHA	0.022%
#179	ANDSA	0.022%
#180	OMETH	0.022%
#181	ERING	0.022%
#182	IERRE	0.022%
#183	PIERR	0.022%
#184	THINK	0.022%
#185	DTHER	0.022%
#186	INTOT	0.022%
#187	ANSWE	0.022%
#188	SHEWA	0.022%
#189	PLACE	0.022%
#190	NOTHI	0.022%
#191	HIMAN	0.022%
#192	NCEAN	0.021%
#193	TTING	0.021%
#194	ONAND	0.021%
#195	THEYW	0.021%
#196	TTHES	0.021%
#197	WHILE	0.021%
#198	TURNE	0.021%
#199	SSION	0.021%
#200	WASNO	0.021%

Things get even more interesting here. "OFTHE", "ANDTH", "TOTHE" and "INTHE" at the top are all obvious concatenations of two words. "THERE" comes next.

As the n-grams become longer it's possible to start seeing more distinct (and specific) words. As I was testing the code out with smaller books, after pausing and viewing the intermediate results, it was possible to identify sub-strings of titles, characters, and specific nouns in the books.

This shows us that, unless we're aiming to decode a message with a defined dictionary of possible words, going too deep into n-gram analysis will start to hurt us. Up to about 4-grams, we're mapping the characteristics of the English language. Above 4-grams, it's looking like we are starting to map more to words than distributions of groupings of letters.

6-grams

The error of going too deep into n-grams is confirmed by looking at this list. It doesn't take too long to see specific words that obviously belong to one specific book.

There were 1,239,584 6-grams (cf. 26 × 26 × 26 × 26 × 26 × 26 = 308,915,776 possible; roughly 0.4% of the theoretical possible combinations).
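
To see how quickly the coverage collapses as n grows, here is a short Python calculation using the distinct n-gram counts quoted in this article (the helper name coverage is mine).

```python
# Distinct n-grams found in the sampled books, as quoted in the text above.
observed = {2: 643, 3: 9_671, 4: 87_526, 5: 434_396, 6: 1_239_584}

def coverage(n: int, distinct: int) -> float:
    """Fraction of the 26**n possible n-grams that actually appeared."""
    return distinct / 26 ** n

for n, distinct in observed.items():
    print(f"{n}-grams: {distinct:>9,} of {26 ** n:>11,} possible "
          f"({100 * coverage(n, distinct):.2f}%)")
```

Almost every possible bigram shows up, about half of the possible trigrams do, and by 6-grams only a sliver of the theoretical space is ever used.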

Here are the top 200:

#	6-gram	%
#1	ANDTHE	0.090%
#2	INGTHE	0.063%
#3	THOUGH	0.049%
#4	EOFTHE	0.044%
#5	WITHTH	0.042%
#6	THATTH	0.042%
#7	PRINCE	0.042%
#8	HATTHE	0.039%
#9	ITHTHE	0.037%
#10	FROMTH	0.035%
#11	FORTHE	0.035%
#12	SOFTHE	0.035%
#13	EDWITH	0.034%
#14	HOUGHT	0.034%
#15	BEFORE	0.033%
#16	ROMTHE	0.032%
#17	HIMSEL	0.031%
#18	IMSELF	0.031%
#19	LITTLE	0.031%
#20	INGAND	0.029%
#21	TOFTHE	0.029%
#22	NTOTHE	0.029%
#23	HESAID	0.028%
#24	THATHE	0.028%
#25	OULDNO	0.027%
#26	SHOULD	0.027%
#27	DIDNOT	0.026%
#28	ULDNOT	0.026%
#29	WITHOU	0.025%
#30	ALLTHE	0.024%
#31	ITHOUT	0.024%
#32	THEREW	0.024%
#33	YTHING	0.023%
#34	ADBEEN	0.023%
#35	HADBEE	0.023%
#36	ETHING	0.023%
#37	WITHHI	0.023%
#38	LOOKED	0.022%
#39	PIERRE	0.022%
#40	OTHING	0.022%
#41	HENTHE	0.022%
#42	OFTHES	0.022%
#43	ANSWER	0.022%
#44	NCEAND	0.021%
#45	COULDN	0.021%
#46	EANDTH	0.021%
#47	INTOTH	0.021%
#48	NOTHIN	0.021%
#49	TURNED	0.021%
#50	HIMAND	0.020%

#	6-gram	%
#51	SANDTH	0.020%
#52	ROUGHT	0.020%
#53	INGHIS	0.020%
#54	TIONOF	0.020%
#55	NOTHER	0.020%
#56	SHEHAD	0.020%
#57	SAIDTH	0.019%
#58	OTHERS	0.019%
#59	SHEWAS	0.019%
#60	PEOPLE	0.019%
#61	ECOULD	0.019%
#62	NDTHAT	0.019%
#63	THECOU	0.019%
#64	EWOULD	0.019%
#65	OFTHEM	0.019%
#66	HERAND	0.019%
#67	EDTHAT	0.018%
#68	DTOTHE	0.018%
#69	OULDBE	0.018%
#70	ANOTHE	0.018%
#71	THESAM	0.018%
#72	OUGHTH	0.017%
#73	METHIN	0.017%
#74	WHICHH	0.017%
#75	WASTHE	0.017%
#76	RINCES	0.017%
#77	HESAME	0.017%
#78	OMETHI	0.017%
#79	SOMETH	0.017%
#80	WHENTH	0.017%
#81	THROUG	0.017%
#82	HROUGH	0.017%
#83	FATHER	0.017%
#84	AIDTHE	0.017%
#85	SEEMED	0.017%
#86	MOTHER	0.017%
#87	DINTHE	0.017%
#88	EINTHE	0.017%
#89	ANDTHA	0.017%
#90	UNDERS	0.017%
#91	PRESEN	0.017%
#92	ECAUSE	0.016%
#93	THEPRI	0.016%
#94	OFTHEC	0.016%
#95	NDERST	0.016%
#96	INCESS	0.016%
#97	BUTTHE	0.016%
#98	INGHER	0.016%
#99	HEREWA	0.016%
#100	DTHERE	0.016%

#	6-gram	%
#101	EREWAS	0.016%
#102	LOOKIN	0.016%
#103	OOKING	0.016%
#104	EOTHER	0.015%
#105	OULDHA	0.015%
#106	THEFIR	0.015%
#107	THINGS	0.015%
#108	THEWOR	0.015%
#109	MOMENT	0.015%
#110	THEYWE	0.015%
#111	FRIEND	0.015%
#112	NWHICH	0.015%
#113	RETURN	0.015%
#114	THEMAN	0.015%
#115	FRENCH	0.015%
#116	ITHHIS	0.015%
#117	NOFTHE	0.015%
#118	ATIONS	0.015%
#119	NSWERE	0.015%
#120	ETHOUG	0.015%
#121	EANDRE	0.014%
#122	INTHES	0.014%
#123	EPRINC	0.014%
#124	UGHTHE	0.014%
#125	ALWAYS	0.014%
#126	NGWITH	0.014%
#127	LDHAVE	0.014%
#128	ETHERE	0.014%
#129	ULDHAV	0.014%
#130	WASNOT	0.014%
#131	EDTOTH	0.014%
#132	TTHERE	0.014%
#133	ERETHE	0.014%
#134	NGTHAT	0.014%
#135	ESSION	0.014%
#136	HECOUL	0.014%
#137	ECOUNT	0.014%
#138	ATASHA	0.014%
#139	ABOUTT	0.014%
#140	SINTHE	0.014%
#141	NATASH	0.014%
#142	ROFTHE	0.014%
#143	VERTHE	0.014%
#144	INGHIM	0.014%
#145	EVERYT	0.014%
#146	APPEAR	0.013%
#147	ETOTHE	0.013%
#148	EYWERE	0.013%
#149	ROTHER	0.013%
#150	SWERED	0.013%

#	6-gram	%
#151	WHICHT	0.013%
#152	UPONTH	0.013%
#153	RESENT	0.013%
#154	HEYWER	0.013%
#155	TINTHE	0.013%
#156	EFIRST	0.013%
#157	INGTHA	0.013%
#158	HATSHE	0.013%
#159	HEWOUL	0.013%
#160	POSSIB	0.013%
#161	BECAUS	0.013%
#162	INGWIT	0.013%
#163	ANDREW	0.013%
#164	EDTHEM	0.013%
#165	RINCEA	0.013%
#166	OUTTHE	0.013%
#167	IONAND	0.013%
#168	ESTION	0.013%
#169	NDWITH	0.013%
#170	HAVING	0.013%
#171	PRESSI	0.013%
#172	NDTHEN	0.013%
#173	TEDTHE	0.013%
#174	THEREA	0.013%
#175	INCEAN	0.013%
#176	TOTHES	0.013%
#177	ERSELF	0.013%
#178	ECTION	0.013%
#179	THERES	0.013%
#180	ANDHIS	0.013%
#181	HERSEL	0.013%
#182	OFTHEP	0.013%
#183	THEREI	0.013%
#184	SHESAI	0.013%
#185	PONTHE	0.013%
#186	CEANDR	0.013%
#187	RSTAND	0.013%
#188	VERYTH	0.013%
#189	QUESTI	0.013%
#190	WHENHE	0.013%
#191	THECON	0.013%
#192	HEOTHE	0.012%
#193	NEOFTH	0.012%
#194	THEOTH	0.012%
#195	UESTIO	0.012%
#196	INGFOR	0.012%
#197	EELING	0.012%
#198	HECOUN	0.012%
#199	EXPRES	0.012%
#200	STHERE	0.012%

Final return to ROT13

As I was messing with ROT13, I wondered if it was possible to apply ROT13 to a word and get an entirely different (valid) word. A few lines of SQL later revealed there are quite a few. The longest found in my dictionary file was NOWHERE ↔ ABJURER

NA↔AN NAAN↔ANNA NAG↔ANT NAN↔ANA NAVY↔ANIL NE↔AR NIB↔AVO NO↔AB NOB↔ABO NOON↔ABBA NOWHERE↔ABJURER NU↔AH NUN↔AHA OHO↔BUB ON↔BA ONE↔BAR ONES↔BARF ONYX↔BALK OR↔BE ORA↔BEN ORRA↔BEEN ORT↔BEG OVA↔BIN PENNY↔CRAAL PENT↔CRAG PERRY↔CREEL PRY↔CEL PUNG↔CHAT PURS↔CHEF RAIL↔ENVY RAT↔ENG RE↔ER REAR↔ERNE REE↔ERR REEF↔ERRS REF↔ERS RET↔ERG ROOF↔EBBS SEL↔FRY SENT↔FRAG SERER↔FRERE SHA↔FUN SHE↔FUR SYNC↔FLAP TANG↔GNAT TERRA↔GREEN THY↔GUL TRY↔GEL TUNG↔GHAT UN↔HA UREA↔HERN VEX↔IRK WHA↔JUN WHEN↔JURA
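
I used SQL for this, but the same search is easy to sketch in Python. The version below is an illustration rather than my original query: the five-word list stands in for a real dictionary file, and the function name is made up. It leans on Python's built-in rot_13 codec.

```python
import codecs

def rot13_pairs(words: list[str]) -> list[tuple[str, str]]:
    """Find words whose ROT13 is also in the list; each pair is reported once."""
    vocabulary = {w.upper() for w in words}
    pairs = []
    for word in sorted(vocabulary):
        partner = codecs.encode(word, "rot_13")
        # word < partner keeps only one ordering of each pair.
        if partner in vocabulary and word < partner:
            pairs.append((word, partner))
    return pairs

sample_words = ["nowhere", "abjurer", "one", "bar", "hello"]  # stand-in dictionary
print(rot13_pairs(sample_words))  # [('ABJURER', 'NOWHERE'), ('BAR', 'ONE')]
```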

Encryption Humour

This web page is encrypted with ROT26.
