There’s not a lot of science involved in today’s short post. It’s more an excuse to write some SQL queries, plot some graphs, and create a few random pieces of trivia.

### Substitution Cypher

There’s a popular substitution cypher used by kids which replaces each letter with its numerical position in the English alphabet.

A=1 B=2 C=3 … Z=26

So, for instance, using the cypher, the word KISS changes to 11-9-19-19.

For today’s post we’re going to take a dictionary of English words, replace each letter with its number, then sum these values together to produce a single number for each word.

For example, using the word CAUSE we get a score of 3+1+21+19+5 = 49.
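As a minimal sketch, the scoring rule might look like this in Python. The function name `bad_hash` is hypothetical, and the sketch assumes the input contains only the letters A to Z.

```python
import string

def bad_hash(word):
    """Sum the alphabet positions (A=1 ... Z=26) of the letters in word."""
    # index() is zero-based, so add 1 to get A=1, B=2, ..., Z=26.
    return sum(string.ascii_uppercase.index(ch) + 1 for ch in word.upper())

print(bad_hash("CAUSE"))  # 3 + 1 + 21 + 19 + 5 = 49
print(bad_hash("KISS"))   # 11 + 9 + 19 + 19 = 58
```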

This score is a (very poor) hash function for our words. Incredibly poor. I hope you can see why (you could probably write an entire book on why this is a poor choice for a hash function). Here, in no particular order, is a selection of the things that this function does poorly.

• No account is made of the position of the letters, just their value. Words that contain the same combination of letters in a different order (anagrams) have the same hash value, e.g. SAUCE has the same hash value as CAUSE.

• You can add an arbitrary value to one letter (e.g. +1 to C to make D), then subtract the same value from another letter (e.g. -1 from U to make T). Rearranging the resulting letters gives the brand new word DATES, which has the same score.

• You can do this arithmetic in any combination to make words that contain none of the same letters, e.g. GIDDY.

• You don't have to use just five letters. You can make words as short as three letters, such as FRY and HUT, or as long as nine with ABDICATED.

• Using the dictionary file on my server I was able to create 613 words with a bad hash score of 49.
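The collisions listed above are easy to check. Here is a rough sketch that groups words by score; `bad_hash` and the tiny `words` list are illustrative stand-ins, not the dictionary file or queries used for this post.

```python
from collections import defaultdict
import string

def bad_hash(word):
    """Sum the alphabet positions (A=1 ... Z=26) of the letters in word."""
    return sum(string.ascii_uppercase.index(ch) + 1 for ch in word.upper())

# Hypothetical sample; a real run would read every word from a dictionary file.
words = ["CAUSE", "SAUCE", "DATES", "GIDDY", "FRY", "HUT", "ABDICATED", "KISS"]

# Bucket the words by their score, exactly as a hash table would.
buckets = defaultdict(list)
for w in words:
    buckets[bad_hash(w)].append(w)

print(buckets[49])  # every example word from the list above lands here
print(buckets[58])  # KISS sits on its own
```

A good hash function would spread these words across many buckets; this one piles seven of the eight into the same slot.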

### Distribution of Bad Hash values

Here is a breakdown of the distribution of bad hash values using the dictionary:

• The smallest hash value is 1 and this, of course, is the score for the word A.

• The highest hash value in my dictionary was 319 for the word REINSTITUTIONALIZATIONS.

• The most common bad hash score was 93: there were 1,963 distinct words with this score, ranging in length from 5 to 13 letters.
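Figures like these can be pulled out with a simple tally. The sketch below uses Python's `Counter` on a small made-up word list, so its numbers will not match the ones quoted above; the names `bad_hash`, `words` and `scores` are assumptions for illustration.

```python
from collections import Counter
import string

def bad_hash(word):
    """Sum the alphabet positions (A=1 ... Z=26) of the letters in word."""
    return sum(string.ascii_uppercase.index(ch) + 1 for ch in word.upper())

# Hypothetical stand-in for a real dictionary file.
words = ["A", "CAUSE", "SAUCE", "DATES", "FRY", "KISS"]

scores = {w: bad_hash(w) for w in words}
counts = Counter(scores.values())  # how many words share each score

lowest = min(scores, key=scores.get)
highest = max(scores, key=scores.get)
most_common_score, how_many = counts.most_common(1)[0]

print(lowest, scores[lowest])        # A 1
print(highest, scores[highest])      # KISS 58
print(most_common_score, how_many)   # 49 4
```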

### By length

Pivoting this data by word length, here is a chart showing the range of bad hash values by the length of the word:

• When the words are short, there is a small range of hash values.

• As the length increases, there are more opportunities to introduce different letters, and the range of values widens.

• The peak is for 13-letter words. Then, as the length continues to increase, even though there is greater flexibility, there are fewer valid words, and the range decreases again.
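In principle a word of n letters can score anywhere from n (all As) to 26n (all Zs); the observed range is narrower because only real words count. A hedged sketch of the per-length pivot, again with a hypothetical `bad_hash` helper and a toy word list rather than the real dictionary:

```python
import string

def bad_hash(word):
    """Sum the alphabet positions (A=1 ... Z=26) of the letters in word."""
    return sum(string.ascii_uppercase.index(ch) + 1 for ch in word.upper())

# Hypothetical sample words; substitute a full dictionary for real results.
words = ["CAB", "FRY", "ZOO", "KISS", "CAUSE", "FUZZY"]

# Track the (minimum, maximum) score seen for each word length.
ranges = {}
for w in words:
    score = bad_hash(w)
    lo, hi = ranges.get(len(w), (score, score))
    ranges[len(w)] = (min(lo, score), max(hi, score))

for length in sorted(ranges):
    lo, hi = ranges[length]
    print(length, lo, hi, hi - lo)  # length, min, max, spread
```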

### ASCII

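| Binary | Oct | Dec | Hex | Char |
|---|---|---|---|---|
| 100 0001 | 101 | 65 | 41 | A |
| 100 0010 | 102 | 66 | 42 | B |
| 100 0011 | 103 | 67 | 43 | C |
| 100 0100 | 104 | 68 | 44 | D |
| 100 0101 | 105 | 69 | 45 | E |
| 100 0110 | 106 | 70 | 46 | F |
| 100 0111 | 107 | 71 | 47 | G |
| 100 1000 | 110 | 72 | 48 | H |
| 100 1001 | 111 | 73 | 49 | I |
| 100 1010 | 112 | 74 | 4A | J |
| 100 1011 | 113 | 75 | 4B | K |
| 100 1100 | 114 | 76 | 4C | L |
| 100 1101 | 115 | 77 | 4D | M |
| 100 1110 | 116 | 78 | 4E | N |
| 100 1111 | 117 | 79 | 4F | O |
| 101 0000 | 120 | 80 | 50 | P |
| 101 0001 | 121 | 81 | 51 | Q |
| 101 0010 | 122 | 82 | 52 | R |
| 101 0011 | 123 | 83 | 53 | S |
| 101 0100 | 124 | 84 | 54 | T |
| 101 0101 | 125 | 85 | 55 | U |
| 101 0110 | 126 | 86 | 56 | V |
| 101 0111 | 127 | 87 | 57 | W |
| 101 1000 | 130 | 88 | 58 | X |
| 101 1001 | 131 | 89 | 59 | Y |
| 101 1010 | 132 | 90 | 5A | Z |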

Most Western computers encode letters for storage in something called ASCII (American Standard Code for Information Interchange). This standard, dating back to the 1960s, defines the numerical values used to represent not just letters, but digits, punctuation characters, a smattering of accents and the occasional math and currency symbols. It even has a bell-chime, and distinct carriage return and line-feed control characters for when information was sent to Teletype printing devices.

(It's since been superseded by Unicode, which offers significantly more space for international characters, but the vanilla alphabet remains in the same place.)

In ASCII, upper-case letters start with the value A=65, B=66 …

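Why these, seemingly arbitrary, values? Well, it relates to binary and, as you can see from the table above, each letter is encoded as 64 plus its position in the alphabet. Since 64 is 100 0000 in binary, the low five bits of each code hold the letter's position directly.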
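A quick way to see the relationship is to mask off everything but the low five bits of a letter's code. This is only an illustrative sketch, and `alphabet_position` is a made-up helper name that assumes an upper-case letter A to Z.

```python
def alphabet_position(letter):
    """Return 1 for A ... 26 for Z by keeping the low five bits of the ASCII code."""
    # 0b11111 is 31; masking strips the leading '10' bits (the 64) from the code.
    return ord(letter) & 0b11111

for letter in "ABZ":
    code = ord(letter)
    # letter, decimal code, 7-bit binary, position in the alphabet
    print(letter, code, format(code, "07b"), alphabet_position(letter))
```

For upper-case letters the mask gives the same answer as subtracting 64, since bit 6 is the only higher bit that is set.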

(A lot of thought and inherited history went into the creation of the ASCII table: numeric digits were placed to allow easier conversion to BCD [Binary Coded Decimal], and other punctuation marks like !@# were kept in the shifted positions they occupied on original typewriter keys.)

### Numerologists

Numerologists (but probably nobody else) will take pleasure from the fact that if you use the ASCII values of letters (A=65, B=66 …) instead of the ordinal values (A=1, B=2 …), the bad hash for the word ANTIPAPAL is 666 …
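The claim is simple to verify. In this sketch `ascii_hash` and `bad_hash` are hypothetical helper names; because every upper-case code is 64 plus the letter's position, the ASCII total is just 64 times the word length plus the ordinal score.

```python
import string

def bad_hash(word):
    """Sum the alphabet positions (A=1 ... Z=26) of the letters in word."""
    return sum(string.ascii_uppercase.index(ch) + 1 for ch in word.upper())

def ascii_hash(word):
    """Sum the ASCII codes (A=65 ... Z=90) of the upper-cased letters in word."""
    return sum(ord(ch) for ch in word.upper())

word = "ANTIPAPAL"
print(ascii_hash(word))                 # 666
print(64 * len(word) + bad_hash(word))  # 64 * 9 + 90 = 666
```

So a nine-letter word hits 666 exactly when its ordinal score is 90.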

Non-numerologists might be more excited to learn that there are 341 other words with the same claim to fame: