The Small World of English

154 points by michaeld123 2 days ago

hftf 2 days ago

I really enjoyed the article, reading it more from the perspective of what 21st-century lexicography could be, less as a customer of a word game however thoughtfully designed. As a Wiktionary editor (and Android user who's also grown out of bare word-relationship puzzle games) though, it's sad that there seems to be no way to just use the end-product network as a reference, which I would love to do, but I suppose they did spend a million bucks on it.

I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) and yet there are only like 80 people editing on any given day or whatever. In some languages, it's even the best or most updated dictionary available. The barriers to entry and bureaucracy are really not high for HN audience types.

mmooss 2 days ago

> it's sad that there seems to be no way to just use the end-product network as a reference, which I would love to do, but I suppose they did spend a million bucks on it.
From the OP: "This research and computational scale was made possible by $295k NSF SBIR seed funding (#2329817) and $150k Microsoft Azure compute resources." Does that NSF funding mean it's open source? Also, I'm not 100% sure that the quote applies to all the research rather than just one component of it.
> I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) ...
I support open source, contribute to it, and love the spirit of Wiktionary, but I don't understand the practical reality of applying 'wisdom of the crowds' to a dictionary, especially the English edition, for two reasons:
Definitions are highly accurate (complete, correct, consistent), highly precise things - otherwise, what is their value? Assuming Wiktionary is descriptive - reporting the words' actual usage - it takes quite a bit of scholarship, skill, and editorial resources not to mislead people. I can't just write what I think it means - the meaning to me might not match the meaning to the person at the next desk. It takes quite a bit of research, using powerful (and sometimes expensive) tools, and understanding of lexicography to be complete and also precisely correct, including usages in places and times that are mostly unknown to any particular author. Also, writing definitions is tricky: You are using words - which have those aforementioned problems with meaning - to define words. Also, any writing anywhere can be easily misinterpreted - skill and editors are needed to avoid misunderstanding. How is the accuracy and precision problem solved?
Also, in English there are already many authoritative sources, many with a century of professional lexicography behind them by the best in the business. Some are free. There are also meta-lookup engines such as Wordnik and OneLook. Why use Wiktionary? The few times I've compared definitions or etymologies, the authoritative sources almost always exceed or equal Wiktionary (though online copies of older print editions suffer from the minimalism caused by the constraint of printing costs). Arguably, there is nothing else both unabridged and free: Oxford unabridged costs $, so does Merriam-Webster (the free edition is abridged); American Heritage is free, but has the minimalism issue I mentioned above.
- bloak 2 days ago
  
  "Why use Wiktionary?"
  I can answer that one. I have free access to the Oxford English Dictionary (OED), which is brilliant and generally more detailed and reliable than Wiktionary when it has the word I'm looking for, but their login page is so awful that I sometimes use en.wiktionary.org instead just to save my time and temper. Also, en.wiktionary.org has proper nouns, other languages, and occasionally it has some recent or technical English word that OED does not have. So if I'm doing some serious amateur research: OED. But if I'm doing a crossword and want to check that a word exists and is spelt how I think it is: Wiktionary.
  
  mmooss 2 days ago
  
  > their login page is so awful
  I've used the OED login page: username, pw, [] keep me logged in. What is so awful?
- genewitch 2 days ago
  
  I'm one of those people who says, unironically, "words have meanings." I readily argue with people who present "language is living and evolves" - sure, but in order to communicate we have to agree on a decent subset of overall definitions.
  I enjoy etymology, maybe too much. It's like magic, finding out what a barrow was, or how filibuster has a direct lineage to pirates (freebooters... In Dutch.)
  I can't afford, really, the nicer Old English, Scandi, Frisian, Norse, etc. etymology dictionaries. I have incomplete scans that were printed and bound of some of them. I still have 6 etymology dictionaries, so I can be about as quick getting a dictionary as getting on the computer and going to !eo.
  
  PaulDavisThe1st 2 days ago
  
  > in order to communicate we have to agree on a decent subset of overall definitions.
  Sociologically speaking, however, it is precisely that agreement that evolves, alongside changes in spelling, pronunciation (and occasionally "new" words).
  
  protocolture a day ago
  
  >I'm one of those people who says, unironically, "words have meanings." I readily argue with people who present "language is living and evolves" - sure, but in order to communicate we have to agree on a decent subset of overall definitions.
  A few things.
  >we have to agree on a decent subset of overall definitions.
  Yes but we should fairly obviously understand that a word can have multiple, often competing meanings, and make an effort to learn the new ones as they become available.
  As language shifts, and it's shifted rapidly in my own lifetime, you can either make an effort to keep up, or be a sourpuss and refuse to understand changes in language.
  It seems to me there's usually a political dimension to people who refuse to understand what people mean, because it's easier to denigrate people if they cling to definitions that aren't intended by their political opponents' use of a word.
  I see this shit constantly, mind. Gender. Liberty. Capitalism. Communism. People get stuck fighting useless battles over the right to define a word instead of just learning and embracing their opponents' intention.
  
  genewitch 20 hours ago
  
  > It seems to me there's usually a political dimension to people who refuse to understand what people mean, because it's easier to denigrate people if they cling to definitions that aren't intended by their political opponents' use of a word.
  and to an extent, the rest of your comment - the solution, according to my PhD friend, is to establish the framing of the argument before you actually have the argument. It's more fun to not establish framing, but it's more effective to establish framing first. I wonder if I have the publication (thesis?) he made on my NAS.
  
  protocolture 18 hours ago
  
  Yeah absolutely. I tend to just use definitions when I want someone to get my meaning rather than hotly contested words.
- hftf 2 days ago
  
  I don't think definitions "are" highly accurate precise things. Sometimes yes. The same scholarship, skill, and need to not mislead also applies for so many other things: encyclopedic articles, taxonomies, news, maps, operating systems. Do people still question the value of Wikipedia, OpenStreetMap? Yeah, there are problems with them, and with peer review. Using fuzzy words (or fuzzy phonetic symbols, fuzzy categories, fuzzy semantic links…) to define words is a problem (if at all) of literally any dictionary. I don't see any of these as particularly unique obstacles for Wiktionary.
  Unabridged dictionaries take decades to release new editions and are still navigating transition into the exploding digital age. They are so expansive in scope, while often so limited in resources, and barely accept any crowd contributions. Such deliberately slow-going is often a good thing, but words also change quite quickly and these sources are now playing a very long game of catch-up. (Yesterday I tried to verify the latter English senses of "fandango" on Wiktionary with other dictionaries; OED's entry has not been touched for 131 years! What am I going to do with that, I need to use / understand the word now!)
  Wiktionary is the big web-native word-resource (and is not cluttered with commercial junk) – allowing links, expandable quotes, images, diagrams, etc. that print's minimalism suffers from as you mention. When someone in 2025 wants information on a word, they'll likely use a search engine and click a link to Wiktionary (where Google blurbs steal some data from). Maybe they are a student wanting to confirm their nonstandard pronunciation with the IPA (still rarely used in mainstream English dictionaries) or if it's recognized in their own dialect (mainstream dictionaries rarely provide more than UK and US pronunciations) – if enough people have the same question, Wiktionary seems like the best place to put the answer – or see an accessible etymology tree. While you probably know this, it's also worth reminding that English Wiktionary isn't just for English words, it is a dictionary of all languages' words, which is written in English. It has metadata and links connecting languages' words that you can't find elsewhere.
  Yes, I indeed do want people to just write what they think a word means – as a starting point in a collaborative refining process. I believe the number of word-users in the world with valuable potential contributions is a lot closer to a billion than the thousand gatekeepers working hard on classical dictionaries. The barrier to entry is really low, but the tooling could still be much better. This is one reason I'm putting my appeal under this article - because I think (professional) lexicography can stand to evolve more in the 21st century. (And are people today really buying enough dictionaries to sustain a professional version of Wiktionary, or even a professional dictionary offered in structured data form?) If we don't contribute to a crowdsourced dictionary, then we won't have any such thing.
  (Meta-lookup sites are link/search engines, not dictionaries and IME really don't do a good job synthesizing their information or conventions.)
  
  mmooss 2 days ago
  
  Wiktionary can be of great value without denigrating others.
  > Unabridged dictionaries take decades to release new editions and are still navigating transition into the exploding digital age.
  OED is now a 100% online service - a website - that releases updates every quarter, like much software. I don't see them 'still navigating' at all.
  > barely accept any crowd contributions.
  OED is famous for being arguably the first crowd-sourced research project. James Murray, the first great editor and driving force behind the first edition, solicited contributions from the public of usages of words and had a massive filing system of slips with all the contributions.
  "Dictionary work relied on so much correspondence that a post box was installed right outside Murray’s Oxford home ...". "His children (eventually there were eleven) were paid pocket money to sort the dictionary slips into alphabetical order upon arrival." [0]
  Today OED still solicits contributions, including specific appeals to the public. Every entry in the OED has a 'Contribute' button.
  https://www.oed.com/information/using-the-oed/contributing-t...
  > (Yesterday I tried to verify the latter English senses of "fandango" on Wiktionary with other dictionaries; OED's entry has not been touched for 131 years! What am I going to do with that, I need to use / understand the word now!)
  You are misunderstanding what 'revise' means to the OED (which is unnecessarily confusing); they still update entries without a full revision. If you look at the entry history:
  fandango, n. was first published in 1894; not yet revised.
  fandango, n. was last modified in March 2025.
  > I don't think definitions "are" highly accurate precise things. Sometimes yes. The same scholarship, skill, and need to not mislead also applies for so many other things: encyclopedic articles, taxonomies, news, maps, operating systems. Do people still question the value of Wikipedia, OpenStreetMap?
  I think there's a difference between requirements - or expectations - for a dictionary and Wikipedia:
  My guess is that people don't question Wikipedia because they have different expectations for it: They don't expect accuracy, as defined by the Three Cs: Completeness, Correctness, Consistency. Wikipedia is more the accumulation of information generally believed about a topic (with some standards, imperfectly followed, for secondary source support - but secondary sources reflect general, consensus belief). It's not expected to be Complete; no encyclopedia can completely cover any topic - the point is to be a starting place, a summary - and anyway Wikipedia is a sort of work in progress. It's not expected to be Correct; it's what people generally believe. And Consistency is tough with so many authors. It's really a product of the post-truth era; that's what people want - just try questioning it.
  People's expectation for dictionaries - or my expectation at least :) - is not a starting point but the final word. Almost always I already have an idea of what the word means - from partial knowledge, from experience, from context, from its components. I'm expecting the Three Cs from the dictionary, to put a fine point on my understanding and use of the word, to fill in my blind spots - including knowledge of how others have been understanding and using the word.
  Maybe Wiktionary just isn't for me. But I worry that people do assume it's CCC - many people believe anything they read is accurate, especially something from an authoritative-looking source - and are confused by it.
  [0] https://www.oed.com/information/about-the-oed/history-of-the...
0cf8612b2e1e 2 days ago

Could I make a plea to make a Wiktionary export easier to find/use? Assuming I can even find the magical page which hosts them, Wikipedia dumps are terribly documented and seem to incorporate shorthand which I do not recognize.
- michaeld123 2 days ago
  
  And they are full of wiki markup, templates, and inconsistent formatting. A human brain can easily understand it, but automated parsing is impossible (pre LLM).
  
  wahnfrieden 2 days ago
  
  https://kaikki.org/index.html
Suppafly 2 days ago

>I'll also use this post to wish that more people would edit Wiktionary.
If it's anything like Wikipedia, there is probably a reason more people aren't working on it, and it's because the existing people discourage it.
- hftf 2 days ago
  
  I get the impulse to assume they'd be alike, but I've found that Wiktionary really isn't much like Wikipedia.
  
  card_zero 2 days ago
  
  The Wiktionary equivalent of citing sources confuses me.
  https://en.wiktionary.org/wiki/Wiktionary:Criteria_for_inclu...
  Which words should be attested? Presumably only uncommon ones? And how is it done, is the "quotes" section the attestation? Is there vandalism to clean up, like people adding their own names to define themselves as awesome? Wiktionary seems to "just work", and I don't really understand what holds it together.
  
  AStonesThrow a day ago
  
  I have a feeling that LLM model collapse will be accelerated as humans lose control of smaller Wiki projects like Wiktionary.
  They’ll be unable to effectively patrol or prevent generative updates to the project, and for all intensive porpoises, humans will be unwilling to step foot into disputes, and AI will have free reign to redefine all human knowledge.
michaeld123 2 days ago

I second that! I have edited a few Wiktionary pages myself, and find it's a better overall environment than Wikipedia, if you can find something meaningful to add.

dhashe 2 days ago

This is very cool. In puzzlehunts, we often use tools to assist with solving and writing puzzles (the classic example is https://nutrimatic.org ).

Years ago, I wrote a puzzlehunt puzzle that involved navigating through words where an edge existed if the two words formed a common 2-gram (that is, they often appeared one after another in a text dump of Wikipedia).

For example, a fragment of the graph from the puzzle is: mit -> press -> office <- post <- blog.

This work is obviously much more advanced, and it's very cool to see that they managed to make it work with semantic connections. I was able to get away with a much simpler approach since I only cared about 2-grams over a set of about 1000 words (I literally used a grep command over the entire text of the English Wikipedia; it took about a day to run).
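
A minimal sketch of the 2-gram edge idea described above, assuming a plain-text corpus and a small vocabulary. The function name, the `min_count` threshold, and the sample text are illustrative assumptions; the original used a grep over a Wikipedia dump, not this code.

```python
import re
from collections import Counter

def build_bigram_graph(text, vocab, min_count=2):
    """Return {word: set(next_words)} for adjacent vocab pairs seen >= min_count times.

    Sentence boundaries are ignored, so a pair can span a full stop.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a in vocab and b in vocab
    )
    graph = {}
    for (a, b), n in counts.items():
        if n >= min_count:
            graph.setdefault(a, set()).add(b)
    return graph

# Made-up sample text, chosen so the fragment from the puzzle appears.
sample = (
    "The MIT Press office is near the post office. MIT Press moved; "
    "the post office stayed. A blog post about the press office. Another blog post."
)
vocab = {"mit", "press", "office", "post", "blog"}
graph = build_bigram_graph(sample, vocab)
```

The threshold is what keeps one-off adjacencies (here "office" followed by "MIT" across a sentence break) out of the graph; on a real corpus it would need tuning.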

But the core idea is shared: 1) wanting to build a graph representation of word connections for a puzzle, 2) it being way too much work to do that manually, 3) you would miss a bunch of edges if you did do it manually, so 4) use programming tools to construct a dataset, and then 5) the end result is surprisingly fun for the user because the dataset is comprehensive and it feels really natural.

If anyone is curious, the puzzlehunt puzzle is here: https://dhashe.com/files/puzzles/word-wide-web.pdf

And the solution is here: https://dhashe.com/files/puzzles/word-wide-web-sol.pdf

And a fair warning to anyone unfamiliar with puzzlehunt puzzles: they do not come with instructions and it is very common to get stuck when solving them, especially when solving them alone. You have not completely solved a puzzlehunt puzzle until you extract an answer word or phrase from the puzzle. This one has an extra layer after filling in the words in the graph. Peeking at the solution is encouraged if you get stuck.

michaeld123 2 days ago

That's an interesting puzzle. I hope more types of word puzzles continue to be created.

jcmeyrignac 2 days ago

Nice work! Here is a similar idea: https://wordassociations.net/en

In French, there is a game for building relations between words (they provide a word, and you have to type the most related words): https://www.jeuxdemots.org They reached 677 million relations in 2024!

slantaclaus 2 days ago

I remember in college I got all stoned in the library and determined that you could find a semantic pathway using synonyms to relate completely opposite terms with only a few nodes. Completely blew my mind and I still think about it sometimes.

permo-w a day ago

do you use the concept in any useful way?
- slantaclaus a day ago
  
  It reminds me I guess that language can be deconstructed and reconstructed at will and that words are tools, not necessarily determiners of truth. And that the problem of philosophy is the problem of language. So no, not really
michaeld123 2 days ago

It's true!

cadamsdotcom 2 days ago

Such an amazing data set with the amount of curation you’ve done and the care with which it’s been put together.

It’d be highly valuable as a thesaurus API.

michaeld123 2 days ago

Thanks!.... Does anyone pay for thesaurus APIs anymore?
- vessenes 2 days ago
  
  My first thought was the small but highly nerdy group of crossword makers. It would be awesome when making a puzzle to get one and two hop words for your current words. And in cluing. This is not a large market, to be clear. But it is a market that pays for tooling.

totaldude87 2 days ago

I was looking for a similar app for my upcoming book! At times it’s very hard to get the word that we are looking for and hope this solves it!

I know this is not related to the app but still wanted to appreciate the thought

us-merul 2 days ago

I really liked this article and these types of analyses always capture me. I just had to try out the game then. I nailed the link to "moon" from "rise" on my first try. Then I was a bit let-down for my first real task to get to "chill" starting from "chain." I went first to conglomerate, then corporation, then management... thinking I would at some point encounter "cold," and then "chill". Unfortunately not. Then I tried from chain to something like (my memory is imperfect here), necklace, jewelry, brilliance, glow, tranquil, calm-- and on a couple of other tries, appease, mollify, relax-- but could never get to "chill." I was able to win eventually by appealing to temperature which led me to chill.

Is there anything the user could do to modify the next steps, other than picking a word? Perhaps selecting some sort of valence related to metaphor or meaning? "I want to pick 'pacify', but in the sense of calming down, not to utterly destroy."

michaeld123 2 days ago

Thanks for reporting on your experience! Those are good questions, and I will think about your valence idea for the future.
On a shorter horizon, I can tune the probability that on-path terms appear in the cloud. We store a larger pool of words than are displayed, and calculate lookaheads (and lookbacks from the target).
- mmooss 2 days ago
  
  Maybe the user could type in their own words, and the app could approve/disapprove based on the 40 word list.
  But maybe that adds an entirely new normalization function - user types 'runs' or 'ran', the app has to normalize to 'run'.
  The app could just have a 'more words' button, loading the next 17.
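
A rough sketch of the normalization step suggested above, assuming the app already stores a table of morphological variants per headword. The `VARIANTS` and `ALLOWED` tables and both function names are hypothetical, not the app's actual data or API.

```python
def normalize(typed, variants):
    """Lower-case the input and map a known variant to its headword."""
    word = typed.strip().lower()
    return variants.get(word, word)

def accept(typed, allowed, variants):
    """Return the matching headword if it is one of the allowed words, else None."""
    word = normalize(typed, variants)
    return word if word in allowed else None

# Hypothetical data: variant -> headword, and the words valid at this step.
VARIANTS = {"runs": "run", "ran": "run", "running": "run"}
ALLOWED = {"run", "sprint", "jog"}
```

A lookup table handles irregular forms such as "ran" that suffix stripping alone would miss, at the cost of having to store every variant.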
- us-merul 2 days ago
  
  Thanks for your response. Getting feedback like "hot or cold" in the algorithm's mind is exactly what I'm thinking of. It's a tricky issue and reminds me a lot of this: https://www.datcreativity.com/
  I had tried hard to pick a set of fairly simple words, thinking I had an intricately unique association in my head, only to find out that the reported connections were nothing more than average. My partner obviously landed in an extremely high percentile by instantly picking the first words that came to her without much thought.
  
  michaeld123 2 days ago
  
  For good or bad, Semantle is able to report hot/cold because it's vector-based. We tried a few types of vectors, but I thought they were consistently unintuitive. So the best (and most relevant) proxy is remaining shortest distance-to-target, but often the player is only two hops away (spanning 17^2=289 options), and when they go astray and are much further, it's computationally too slow to look out more than 5 hops with brute force.
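
To make the cost concrete: with 17 options per hop, two hops span 17^2 = 289 words, and an unpruned search out to 5 hops can touch up to 17^5 = 1,419,857 paths. Below is a minimal sketch of a depth-limited breadth-first search for "remaining distance to target". The toy graph and the function name are made up for illustration; this is not the game's actual lookahead code.

```python
from collections import deque

def hops_to_target(graph, start, target, max_hops=5):
    """Breadth-first search; return the hop count, or None if the target is beyond max_hops."""
    if start == target:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        word, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(word, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

# Hypothetical toy graph; the real one shows 17 neighbours per word.
toy = {
    "chain": ["necklace", "link"],
    "necklace": ["jewelry"],
    "link": ["connect"],
    "jewelry": ["ice"],
    "ice": ["cold"],
    "cold": ["chill"],
}
```

The `seen` set stops the same word from being expanded twice, which is the main saving over naive path enumeration; searching from both ends (the lookbacks from the target mentioned earlier in the thread) would cut the frontier further.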
  
  vessenes 2 days ago
  
  To chime in on Semantle - those vectors are often infuriating. High dimensional space truly doesn’t work the same as R3 or R2. I think a more human-oriented word vector database would make games like that more fun. I wonder if there’s a way to turn your codified data into an embedding that has more semantic value.
  
  michaeld123 2 days ago
  
  Thanks for that datcreativity.com link. My score was 94.11, higher than 99.88% of the people who have completed this task! I should hope so after working on relations for years. ;)
  My words were: apple, shotgun, stardust, anger, hygiene, etymology, proctology, slant, dictator, and displacement.

Jordan-117 2 days ago

How is a largely text-based app 3.47 GB? Is the dictionary/semantic DB just that large or is there other stuff going on?

michaeld123 2 days ago

I wish it were not so! Here's the breakdown: 1.5M headwords × ~2KB average per entry >= 3GB.
Each entry contains:
* 40 associations in the core graph
* Multiple senses (up to 8) × 17 associations each = up to 136 more
* Stems and morphological variants
* In-game clue definitions
* Longer definition entries with several types of related word lists
The only good news is it works offline.
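
The arithmetic in that breakdown can be checked in a few lines. This assumes "~2KB" means 2,000 bytes (decimal kilobytes); the constant names are just labels for the figures quoted above.

```python
HEADWORDS = 1_500_000
BYTES_PER_ENTRY = 2_000  # "~2KB average", taken as decimal kilobytes (assumption)

total_bytes = HEADWORDS * BYTES_PER_ENTRY
total_gb = total_bytes / 1e9  # decimal gigabytes

CORE_ASSOCIATIONS = 40
MAX_SENSES = 8
ASSOCIATIONS_PER_SENSE = 17
max_associations = CORE_ASSOCIATIONS + MAX_SENSES * ASSOCIATIONS_PER_SENSE

print(f"{total_gb:.1f} GB total, up to {max_associations} associations per entry")
# prints "3.0 GB total, up to 176 associations per entry"
```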
- senkora 2 days ago
  
  I do think that as a practical matter it might be better to store the word graph in the cloud and query it from the client.
  You could either store the word graph as a partitioned set of S3 buckets, or have a back-end that serves individual words and does rate-limiting. I guess that the back-end might be better to avoid surprise egress charges from anyone trying to download the entire dataset.
  I want to try out the game but I'm discouraged by the download size.
  
  kevin_thibedeau 2 days ago
  
  There would be significant data reduction if it was stored as a prefix trie flattened into an array.
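
A minimal sketch of what "a prefix trie flattened into an array" could look like, assuming only the headword strings are being stored. The layout (four parallel lists in breadth-first order) is one possible choice for illustration, not the app's format, and it does nothing for the per-entry payload (associations, definitions), which is likely most of the 3GB.

```python
from collections import deque

def build_flat_trie(words):
    """Flatten a prefix trie into parallel lists, laid out breadth-first.

    Node i has label chars[i], a terminal flag ends[i], and its children
    stored contiguously at indices first[i] .. first[i] + count[i] - 1.
    """
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-word marker

    chars, ends, first, count = [""], ["" in root], [0], [0]
    queue = deque([(root, 0)])
    while queue:
        node, idx = queue.popleft()
        kids = sorted(k for k in node if k)  # skip the "" marker
        first[idx], count[idx] = len(chars), len(kids)
        for ch in kids:
            child = node[ch]
            queue.append((child, len(chars)))
            chars.append(ch)
            ends.append("" in child)
            first.append(0)
            count.append(0)
    return chars, ends, first, count

def contains(trie, word):
    """Walk the flat arrays one character at a time."""
    chars, ends, first, count = trie
    idx = 0
    for ch in word:
        for j in range(first[idx], first[idx] + count[idx]):
            if chars[j] == ch:
                idx = j
                break
        else:
            return False
    return ends[idx]

trie = build_flat_trie(["chill", "chain", "chair", "cold"])
```

Here four words totalling 19 characters share prefixes and need only 12 labelled nodes plus the root; how much that saves on 1.5M real headwords is an open question.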
hx8 2 days ago

The first really large text data I ever encountered was Google Ngram[0], the total size of which is about 3TB. I would have guessed it was closer to 3GB before I started downloading it.
[0] https://storage.googleapis.com/books/ngrams/books/datasetsv3...
- michaeld123 2 days ago
  
  Yes. I love Google Ngrams.
  We use the top Google Ngrams in 2 ways: (a) we share them in the reference mode of our app, i.e. common words before or after; (b) we use longer N-grams, where possible, like a 4-gram, to choose literary examples that also show a common pattern.

suddenlybananas 2 days ago

I don't find many of these transitions very appealing. Sweet to harmony? Seems a stretch. Nightjar to chirring to bombylious? Might as well be gobbledygook.

droopyEyelids 2 days ago

It said the path between “double lock” and “dislodge” was a tortured 10 word chain, but it seems like you could get there much faster:
“Double lock” > “clasp” > “grab” > “dislodge”
It’s just a quick example, but I think it follows their “rough synonym“ style connections, and it’s not less reasonable than the examples.
To me, it feels like this project is kind of hampered by not having a rigorous definition of what is allowable, and then mixing in the sort of random effects of an LLM
- michaeld123 2 days ago
  
  Good points. The main limiter is often what words happen to surface as the top 17 connections, or in those random examples when there's a plural or conjugation.
  Since this is getting eyeballs here, I will look for some less tortured long-paths to add as examples.
- dleeftink 2 days ago
  
  A smaller trainable set would be a dictionary, and only linking the terms as expressed in the definition, possibly with substitutions. You'd miss more abstract jumps, but the initial walks would be tractable.
  (It is a game best played with a grandparent's pre-war dictionary before tea-time)
michaeld123 2 days ago

You are right. I replaced it with new paths that are +1 longer. These are actual paths from our game now.

rafram 2 days ago

I wanted to try out your app, but I cancelled the download after noticing that it's 3.5 gigabytes. How?! That's by far the biggest iOS app I've ever seen.

michaeld123 2 days ago

Sorry! The problem is ~2KB of data for each of 1.5M headwords. We already use indices and brotli compression internally. I doubt we could smush below 3GB.
- rafram 2 days ago
  
  Could you... have fewer headwords? That's like 5x the number of headwords actually used in modern English. Or at least download some of the data on demand?
- neuroelectron 2 days ago
  
  Wow only 3gb? Finally i have something to use the 128gb this iThing came with besides Firefox and Kindle.

permo-w a day ago

my pedantry made me write this, and this is by no means a criticism of the overall art of the article, but of the examples given in the first animation, "Batman" does not need 4 jumps to "inspect" (e.g. Batman -> detective -> inspect), and "emerald" doesn't really connect with "foliage" enough for me: I'd suggest it needing a "green" in between for it to really make sense

trizoza 2 days ago

Any plans on launching on Android, or simply just browser based web version?

michaeld123 2 days ago

Thanks for your interest! The data is used in our iOS game and visualizer.
- komali2 2 days ago
  
  I might be miscomprehending but that doesn't feel like an answer...
  I had the same question, but more generally - is this an American company thing? I can't imagine not wanting to tap the vastly larger Android market. Especially for an app like this which could be marketed as a fun English learning game.

o11c 2 days ago

Something about this website makes scrolling lag even with Javascript disabled. Firefox 128 on Linux.

Very interesting topic though.

michaeld123 2 days ago

thanks. Let me see if I can diagnose the scrolling lag.
- michaeld123 2 days ago
  
  Removed an ill-conceived backdrop-filter: blur(10px), downsampled the images, and made some other little fixes.

6stringmerc 2 days ago

Dissociating English terms from their context and focusing on the ease of relationship is a hilariously bad habit that people actively are trained AWAY from using. The nuance of English is absolutely going to break AI because even the example of “strong” relationships are suspect in utility.

Seriously, when is the last time a casual speaker, writer, or translator used “domicile” in place of “house” in your world? It’s an archaic term appropriated into legal jargon. Flattening out language and drawing lines between terms is funny to me.

The only issue is normalizing “Thesaurus bashing” type mentalities - like this - to degrade the value of coherent, purposeful, meaningful use of English. It’s an amalgamation language with extremely difficult fluency. It’s rife with idioms and contradictory emotional context.

Oh well, I can grasp that I tend to yell at clouds when it comes to this sort of thing. It doesn’t change my opinion this is a harmful exercise and probably should not exist. There are few instances where playing a game will actually make one more stupid, but here we are.

genewitch 2 days ago

I used domicile about 45 minutes ago in casual conversation about fire ants in my abode. Habitation. Flat.

michaeld123 2 days ago

We built a 1.5M word semantic network where any two words connect in ~6.43 hops (76% connect in ≤7). The hard part wasn't the graph theory; it was getting rich, non-obvious associations. GPT-4's associations were painfully generic: "coffee → beverage, caffeine, morning." But we discovered LLMs excel at validation, not generation.
Our solution: Mine Library of Congress classifications (648k of them, representing 125 years of human categorization). "Coffee" appears in 2,542 different book classifications, from "Coffee trade--Labor--Guatemala" to "Coffee rust disease--Hawaii." Each classification became a focused prompt for generating domain-specific associations.
Then we inverted the index: which classifications contain both "algorithm" and "fractals"? Turns out: "Mathematics in art" and "Algorithmic composition." This revealed connections like algorithm→Fibonacci→golden ratio that pure co-occurrence or word vectors miss.
The "Montreal Effect" nearly tanked the project: geographic contamination where "bagels" spuriously linked to "Expo 67" because Montreal is famous for bagels. We used LLMs to filter true semantic relationships from geographic coincidence.
Technical details: 80M API calls, superconnector deprecation (inverse document frequency variant), morphological deduplication. Built for a word game but the dataset has broader applications.
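
A small sketch of the "invert the index" step and an IDF-style hub penalty, under stated assumptions. The two art/music headings are the ones named above, but the term sets attached to them are invented for illustration. The post does not spell out its IDF variant, so the textbook `log(N / df)` stands in for it here.

```python
import math
from collections import defaultdict

def invert(classifications):
    """Map each term to the set of classification headings that mention it."""
    index = defaultdict(set)
    for heading, terms in classifications.items():
        for term in terms:
            index[term].add(heading)
    return dict(index)

def shared_headings(index, a, b):
    """Headings that contain both terms."""
    return index.get(a, set()) & index.get(b, set())

def idf(index, term, n_headings):
    """Textbook inverse document frequency: rarer terms score higher."""
    return math.log(n_headings / len(index[term]))

# Hypothetical term sets; only the heading names come from the post.
classifications = {
    "Mathematics in art": {"algorithm", "fractals", "golden ratio"},
    "Algorithmic composition": {"algorithm", "fractals", "music"},
    "Coffee rust disease--Hawaii": {"coffee", "fungus"},
}
index = invert(classifications)
common = shared_headings(index, "algorithm", "fractals")
```

A term that appears under many headings gets a low score, which is one way a generic hub word could be down-weighted before it dominates every path.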

gagzilla 2 days ago

Very cool and fascinating. I wonder if there are other insights that can be drawn from what you've built. Like which two words (or such pairs) have the longest sequence of hops to connect? Or what are the top "superconnectors"? Or if there is a plausible correlation between how well a word is connected to how old it is?
- michaeld123 2 days ago
  
  Longest paths: The tail maxes out at 15 hops. These extreme paths are disappointingly mechanical—not the poetic distances you'd hope for:
  * Technical jargon → unrelated obscurities: gryllacridid (cricket family) → microclots
  * Proper nouns → common words: Trish Stratus (wrestler) → federating
  * Numbers → anything: 9451 → shoulds
  Mostly hyper-specific terms with few inbound connections, obscure conjugations, or rare idioms.
  Superconnectors: We systematically removed generic hubs, but your question prompted us to analyze which words still act as natural bridges. Added it to the article with an interactive explorer! Top survivors:
  * polish (0.18% of paths) - verb/nationality homograph
  * symbiosis (0.14%) - biology → cooperation bridge
  * treaty (0.13%) - conflict → resolution bridge
  Thanks for the curiosity; it led to an interesting addition.
  Age correlation: No hard data, but I suspect you're right. Older words have had centuries to accumulate meanings and develop polysemous bridges.
marviel 2 days ago

Thanks for sharing!
Which embedding types did you try? I'm surprised that embeddings weren't able to take you further with this.
- michaeld123 2 days ago
  
  Early on I tried older embeddings (Word2Vec and GloVe), and later newer ones (OpenAI's ada and the text-embedding-3-x models).

akudha 2 days ago

What other word games do people enjoy? My favorites on iOS

Alpha Omega

Sticky Terms (I struggle with this)

Typeshift

Blackbar (old, not maintained, but we can still play. Not a game in strict sense, very enjoyable)

michaeld123 2 days ago

And where and how do people discover new word games?
- akudha 2 days ago
  
  I found the above from iOS search. I also ask around, but not many people I know are interested in word games unfortunately.
  I suppose other languages have way less word games than English?
