There’s not a lot of science involved in today’s short post. It’s more an excuse to write some SQL queries, plot some graphs, and create a few random pieces of trivia.
There’s a popular substitution cypher used by kids which replaces each letter with its numerical position in the English alphabet.
A=1 B=2 C=3 … Z=26
So, for instance, using the cypher, the word KISS changes to 11-9-19-19.
For today’s post we’re going to take a dictionary of English words, replace each letter with its number, then sum these values together to produce a single number for each word.
For example, using the word CAUSE we get a score of 3+1+21+19+5 = 49.
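The scoring rule is simple enough to sketch in a few lines of Python. This is only an illustration of the rule described above, not the SQL used for this post; the function name `bad_hash` is my own invention, and anything that is not a plain letter A-Z is assumed to score nothing.

```python
def bad_hash(word: str) -> int:
    """Sum the alphabet positions (A=1 ... Z=26) of the letters in a word."""
    total = 0
    for ch in word.upper():
        if "A" <= ch <= "Z":  # assumption: ignore anything that is not A-Z
            total += ord(ch) - ord("A") + 1
    return total


print(bad_hash("CAUSE"))  # 3 + 1 + 21 + 19 + 5 = 49
```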
This score is a (very poor) hash function for our words. Incredibly poor. I hope you can see why (you could probably write an entire book on why this is a poor choice for a hash function). Here, in no particular order, is a selection of the things that this function does poorly.
No account is made of the position of the letters, just their value. Words that contain the same letters in a different order (anagrams) have the same hash value, e.g. SAUCE has the same hash value as CAUSE.
You can add an arbitrary value to one letter (e.g. +1 to C to make D), then subtract the same value from another letter (e.g. -1 from U to make T). Rearranging these letters, you can make the brand new word DATES, which has the same score.
You can do this arithmetic in any combination to make words that contain none of the same letters, e.g. GIDDY.
You don't have to use just five letters; you can make words as short as three letters, such as FRY and HUT, or as long as nine, such as ABDICATED.
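The collisions listed above are easy to check. Here is a small sketch (the helper `bad_hash` is a hypothetical name, and it assumes its input contains only the letters A-Z) that scores each of the example words:

```python
def bad_hash(word: str) -> int:
    """Sum of alphabet positions, A=1 ... Z=26 (assumes letters A-Z only)."""
    return sum(ord(ch) - ord("A") + 1 for ch in word.upper())


examples = ["CAUSE", "SAUCE", "DATES", "GIDDY", "FRY", "HUT", "ABDICATED"]
scores = {word: bad_hash(word) for word in examples}

for word, score in scores.items():
    print(f"{word:<10}{score}")
```

Every word in the list comes out as 49, regardless of its length or which letters it uses.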
Using the dictionary file on my server I was able to create 613 words with a bad hash score of 49.
Here is a breakdown of the distribution of bad hash values using the dictionary:
The smallest hash value is 1 and this, of course, is the score for the word A.
The highest hash value in my dictionary was 319 for the word REINSTITUTIONALIZATIONS.
The most common bad hash score was 93: there were 1,963 distinct words with this score, ranging in length from 5 to 13 letters.
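For anyone who wants to reproduce this kind of breakdown, here is a rough sketch of the grouping step in Python rather than SQL. The names `bad_hash` and `group_by_score` are made up for illustration, and the tiny `sample` list stands in for a real dictionary, so the numbers it produces are not the ones quoted above (apart from the scores of A and REINSTITUTIONALIZATIONS).

```python
from collections import defaultdict


def bad_hash(word: str) -> int:
    """Sum of alphabet positions, A=1 ... Z=26 (assumes letters A-Z only)."""
    return sum(ord(ch) - ord("A") + 1 for ch in word.upper())


def group_by_score(words):
    """Map each score to the set of distinct words that produce it."""
    groups = defaultdict(set)
    for raw in words:
        word = raw.strip().upper()
        if word.isascii() and word.isalpha():  # skip blanks, hyphens, accents
            groups[bad_hash(word)].add(word)
    return groups


# Tiny stand-in for a real dictionary. With a word-list file you could pass
# the open file object instead (one word per line is an assumption).
sample = ["a", "cause", "sauce", "dates", "giddy", "fry", "hut", "kiss",
          "reinstitutionalizations"]
groups = group_by_score(sample)

lowest, highest = min(groups), max(groups)
most_common = max(groups, key=lambda score: len(groups[score]))
print(lowest, highest, most_common, sorted(groups[most_common]))
```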
Pivoting this data by word length, here is a chart showing the range of bad hash values by the length of the word:
When the words are short, there is a small range of hash values.
As the length increases, so do the opportunities to introduce more letters, and the range of possible values widens.
The peak is for 13-letter words; then, as the length continues to increase, even though there is greater flexibility, there are fewer valid words, and the range decreases again.
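In principle, a word of n letters can score anywhere from n (all As) to 26n (all Zs); the dictionary only ever fills part of that window. A sketch of the pivot by word length might look like the following, where `score_range_by_length` is a hypothetical helper and the `sample` list is a toy stand-in for the real dictionary:

```python
def bad_hash(word: str) -> int:
    """Sum of alphabet positions, A=1 ... Z=26 (assumes letters A-Z only)."""
    return sum(ord(ch) - ord("A") + 1 for ch in word.upper())


def score_range_by_length(words):
    """Return {word length: (lowest score, highest score)}."""
    ranges = {}
    for raw in words:
        word = raw.strip().upper()
        score = bad_hash(word)
        low, high = ranges.get(len(word), (score, score))
        ranges[len(word)] = (min(low, score), max(high, score))
    return ranges


sample = ["A", "CAB", "FRY", "KISS", "CAUSE", "FUZZY"]  # toy data only
ranges = score_range_by_length(sample)

for length in sorted(ranges):
    low, high = ranges[length]
    # The theoretical window for this length is length .. 26 * length.
    print(length, low, high, (length, 26 * length))
```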
Most Western computers encode letters for storage in something called ASCII (American Standard Code for Information Interchange). This standard, dating back to the 1960s, defines the numerical values used to represent not just letters, but digits, punctuation characters, a smattering of accents and the occasional math and currency symbols. It even has a bell-chime, and distinct carriage return and line-feed control characters for when information was sent to Teletype printing devices. (It's since been superseded by Unicode, which offers significantly more space for international characters, but the vanilla alphabet remains in the same place.)
In ASCII, upper-case letters start with the value A=65, B=66 …
Why these seemingly arbitrary values? Well, it relates to binary and, as you can see from the table on the left, each letter is encoded as 64 plus its position in the alphabet.
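A quick way to see the binary connection is to print a few codes. Since 64 is a single bit (1000000 in binary), the low five bits of an upper-case letter's code are simply its position in the alphabet. The sketch below uses only Python's built-in `ord` and `format`; the `positions` dictionary is just an illustrative name.

```python
import string

for letter in "ABZ":
    code = ord(letter)
    # letter, ASCII code, 7-bit binary form, position in the alphabet
    print(letter, code, format(code, "07b"), code - 64)

# Masking off everything but the low five bits recovers the position.
positions = {letter: ord(letter) & 0b0011111 for letter in string.ascii_uppercase}
print(positions["A"], positions["Z"])
```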
(A lot of thought and inherited history went into the creation of the ASCII table: numeric digits were placed to allow easier conversion to BCD [Binary Coded Decimal], and punctuation marks like !@# were kept in the same shifted positions they occupied on original typewriter keys.)
Numerologists (but probably nobody else) will take pleasure from the fact that if you use the ASCII values of letters (A=65, B=66 …) instead of the ordinal values (A=1, B=2 …), the bad hash for the word ANTIPAPAL is 666 …
Non-numerologists might be more excited to learn that there are 341 other words with the same claim to fame:
ABOLISHES ABSENTEES ACCEPTERS ACCOUTRED ACETYLENE ACHIEVERS ACTINIANS ADVOCATES AESTHETIC AGUEWEEDS AIRFRAMES ALCHEMIST ALGERINES ALLOGRAPH AMBROSIAL AMNESTIED AMPLIFIES ANALITIES ANALOGIZE ANIMALIZE ANTINODAL ANTIPAPAL APPRAISEE ARAGONITE ARCHAIZES ARCHDUKES ARMCHAIRS ARTICHOKE ARTIFICES ASCERTAIN ASPIRATAE AUGMENTED BACKDROPT BANDEROLS BANEBERRY BANKBOOKS BANTERING BARBEQUES BATHROBES BECOWARDS BENTHONIC BESCREENS BESLIMING BESMILING BESWARMED BEWRAPPED BICKERERS BIGNONIAS BILLETING BIOETHICS BIPINNATE BIRDFARMS BLACKBOYS BLANDNESS BLINKARDS BLOTCHING BLOVIATED BLUEBELLS BOOGERMAN BOOKCASES BOOMERANG BRIDEWELL BRISANCES BROOMBALL BUBALISES BUFFETING CABEZONES CACHEPOTS CADASTERS CADASTRES CALISAYAS CALLALOOS CALVARIES CAMOMILES CAMPFIRES CAMSHAFTS CANALIZES CANVASING CAPONATAS CARROCHES CASEBOOKS CATALEXES CATCHPOLL CATENOIDS CATFISHES CATHOLICS CAVALIERS CAVALRIES CAVATINAS CELLMATES CELLULASE CEMENTING CENTIGRAM CHAMBRAYS CHANDLERY CHAPBOOKS CHARLOCKS CHASUBLES CHEAPNESS CHECKLIST CHEEKFULS CHIVAREES CHLAMYDES CHOPPERED CHROMATIC COALHOLES COCKLEBUR COLLIMATE COLORIFIC COMBATIVE COMMENCES COMMENDER CONCHOIDS CONDUCING CONFEREES CONFESSED CONFIDENT CONSIGNED CRANIATES CREMATING CREOLISED CRICETIDS DARNEDEST DAYDREAMS DEADWOODS DEATHBLOW DECEIVERS DECORATES DEFERMENT DEFLATERS DEHYDRATE DEMENTIAS DENSIFIES DIALOGERS DIATHESES DIETARIES DIFFICULT DIGASTRIC DIGITALIS DIGRESSED DISCLOSED DISPLEASE DIVIDENDS DRAGGIEST DRAMATISE DREADFULS DROPHEADS DUMBBELLS DYSPHAGIA ECOLOGIES EDUCATIVE EMANATIVE EMBODIERS EMBOWERED EMBRACERY EMITTANCE ENLIVENED ENTRAINED EPHEMERAS EXCLAIMER EXPLAINED FACEDOWNS FAIRYLAND FANCINESS FANTASIAS FARADISMS FARRAGOES FEOFFMENT FERMENTED FEUDALISM FIGEATERS FILIGREES FILMLANDS FIREBIRDS FIREHALLS FLAVANONE FLORIATED FORBODING FOREHANDS FORJUDGED FRAUGHTED GADABOUTS GALBANUMS GALVANISE GANGRENES GERIATRIC GIDDINESS GINGERING GLANDULAR GLOBALISM GOLCONDAS GRADELESS GRAVESIDE GREATCOAT GRILLAGES HAEMATINS HAGBUSHES HAMARTIAS HAPPENING HARDWIRED 
HAVOCKING HEADFIRST HECTOGRAM HICCUPING HOMESTEAD HONORABLE HYDATHODE IDEALIZES IMAGISTIC IMPLEDGES IMPRECATE IMPUDENCE INDIGOIDS INFANTILE INFOLDING INORGANIC KAFFIYEHS KEELBOATS KERFUFFLE LABRADORS LAICIZING LANDAULET LANDLINES LARBOARDS LAURELLED LEADSCREW LEGISLATE LIGNIFIES LOGICIZED MANICALLY MARSHLAND MASSAGING MASTHEADS MEANWHILE MEDIATION MEDIEVALS MEMBRANES MIDRANGES MINIBIKER MONADNOCK MONOGAMIC MORTGAGED MUCILAGES MUFFLERED NAUSEATED NEURALGIC NICCOLITE NICKERING NIGHTLIFE OBLIGATES ODDSMAKER OVERCOACH PACIFYING PAGANDOMS PALATABLY PANELLING PANORAMIC PARAFFINS PARANOEAS PATINATED PEBBLIEST PEDOPHILE PESTICIDE PICADORES PILCHARDS PIPELINED PLASMODIA PLAYBACKS PLAYFIELD PRECEDENT PRECENTED PRECREASE PREJUDGED PREVIABLE PROBABLES QUEBRACHO RACKETIER RAINMAKER RANSACKER REACCEPTS REANNEXED RECLOTHED RECOMMEND REDOUNDED REFOLDING REFUTABLE REINFLATE REJECTEES REJOICING RELEASING RELICENSE REMAINING RENIGGING RESEALING RHACHISES RIBBONING RUTABAGAS SAXIFRAGE SCALEPANS SCREWBEAN SCROOCHED SEBACEOUS SECONDING SECTARIAN SERENADES SHARPENED SHASHLICK SHEEPFOLD SHELLACKS SIDEKICKS SIDETRACK SIGNALMAN SKINHEADS SLEIGHING SLUGABEDS SONICATED SPACEWARD SPECIFIER STEAMERED TABLETING TAILPLANE TARIFFING TEAZELLED TENANCIES THATCHING THINCLADS TICTOCKED TOLERABLE TRACKSIDE TRAMELLED TREADLING UNBENDING UNDELUDED UNDRAINED VALENCIES VENENATED VICEREINE VIDEODISC VOCALISED WASHABLES WEEKENDER WEIGELIAS
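Checking any of these is a one-liner. In the sketch below, `ascii_hash` is a hypothetical helper that assumes its input contains only the letters A-Z. Because every letter's ASCII value is 64 more than its ordinal value, a nine-letter word totals 666 exactly when its ordinal score is 666 - 9 × 64 = 90.

```python
def ascii_hash(word: str) -> int:
    """Sum of ASCII codes of the upper-cased letters (assumes A-Z only)."""
    return sum(ord(ch) for ch in word.upper())


def bad_hash(word: str) -> int:
    """Sum of alphabet positions, A=1 ... Z=26 (assumes letters A-Z only)."""
    return sum(ord(ch) - 64 for ch in word.upper())


for word in ["ANTIPAPAL", "ABOLISHES", "BOOMERANG", "DIFFICULT"]:
    # ascii_hash is always bad_hash plus 64 for each letter
    print(word, ascii_hash(word), bad_hash(word) + 64 * len(word))
```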