cesine: June 2011

Over a year ago a friend sent me a link on Facebook; knowing that I was a "programming linguist", he rightly figured that I would be interested. Since then the comments have grown to include many interesting windows into the public's understanding of linguistics and machine learning... which I shall paste here for preservation and perennial viewing pleasure.

Automated Language Deciphering By Computer AI

Posted by samzenpus on Wednesday June 30 2010, @11:42PM
from the what-about-dwarvish? dept.

eldavojohn writes "Ugaritic has been deciphered by an unaided computer program that relied only on four basic assumptions present in many languages. The paper (PDF) may aid researchers in deciphering eight undecipherable languages (Ugaritic has already been deciphered and proved their system worked) as well as increase the number of languages automated translation sites offer. The researchers claim 'orders of magnitude' speedups in deciphering languages with their new system."

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

Sweet (Score:2)
by The MAZZTer (911996) <megazzt@gma i l . com> on Wednesday June 30 2010, @11:49PM (#32753044) Homepage
Universal translator, here we come!
- Re:Sweet (Score:4, Funny)
  by Fluffeh (1273756) on Wednesday June 30 2010, @11:51PM (#32753054)
  But will it go into your ear, or will it be injected via a syringe and live in your gut is the question?
  - Re:Sweet (Score:5, Funny)
    by Anonymous Coward on Wednesday June 30 2010, @11:58PM (#32753088)
    Good news, it's a suppository.
  - Re:Sweet (Score:2)
    by sortius_nod (1080919) on Thursday July 01 2010, @12:24AM (#32753226) Homepage
    You forgot the third option... but we need a TARDIS for that.
- Re:Sweet (Score:3, Informative)
  by doishmere (1587181) on Wednesday June 30 2010, @11:53PM (#32753074)
  Their method relies heavily on the unknown language being related to a known language by some degree. At the heart of their technique is Bayesian statistics applied to lexical and frequency analysis; for this approach to work, there must be some basis for comparison.
- Re:Sweet (Score:5, Funny)
  by grcumb (781340) on Thursday July 01 2010, @12:18AM (#32753200) Homepage Journal
  Universal translator, here we come!
  Cool! Can I bring it into my next marketing meeting?
  - Re:Sweet (Score:5, Funny)
    by Walt Dismal (534799) on Thursday July 01 2010, @12:32AM (#32753272)
    Only if the gross gains in closing juncture exceed the long-term sustainability goals of the viability imperative for all mass interoperability. We at Mega Industries believe this will move us forward to our cloud-based monetization of the human-media dynamic which is strategically important in an ever-evolving mobile continuum. We have directed our customer experience champions to ensure consumers realize this when they call in with emphatic expressions of dissatisfaction.
    - Re:Sweet (Score:2)
      by L4t3r4lu5 (1216702) on Thursday July 01 2010, @05:07AM (#32754274)
      Only if the gross gains in closing juncture exceed the long-term sustainability goals of the viability imperative for all mass interoperability.
      Only if we can update the UI for version 2 and sell it a second time to the same saps.
      We at Mega Industries believe this will move us forward to our cloud-based monetization of the human-media dynamic which is strategically important in an ever-evolving mobile continuum.
      When everyone has it, we can turn it into a subscription-based cash cow.
      We have directed our customer experience champions to ensure consumers realize this when they call in with emphatic expressions of dissatisfaction.
      Tell the whining losers that premium support is only available with the Platinum Care package, and transfer them to "Gord-on" in the Mumbai sales office.
    - Re:Sweet (Score:2)
      by roman_mir (125474) on Thursday July 01 2010, @06:34AM (#32754740) Homepage Journal
      Pffft, please, your plan is to have emphatically expressed dissatisfied consumers realize that your gross gains within the closing juncture exceed your long-term sustainability goals for all viability imperatives, which will allow the move to cloud-based monetization of the human-media dynamic? It is but a futile attempt, you may as well give up right now, no matter how much time your customer experience champions waste on a single call.
      Here, at GOD Industries, we know better than to rely on such clearly misguided attempts of human-human interactions.
      We simply induce meditative sublimation of continuous exasperation through excitation of vernacular instinctual continuum within the subject's predisposition to acceptance of the delirium through faith chakra. There is no possible manner in which the subject can abjugate oneself from the excited forces of undifferentiated love, and that is what we, at GOD Industries are specializing in: Love.
      If you think your business plan can compete with ours, it is only because we haven't descended our super-existential love upon your person just yet.
    - Re:Sweet (Score:2)
      by Dr. Eggman (932300) on Thursday July 01 2010, @08:55AM (#32755718)
      I tried running your statement through the deciphering-AI, but the process killed itself before completion. I checked the debug logs, but they weren't very helpful. Just a bunch of 'e's, 'y's, 'a's, and some 'r's and 'g's strung together.
      
      I...I think it was screaming...
  - Re:Sweet (Score:2)
    by oiron (697563) on Thursday July 01 2010, @03:08AM (#32753888)
    Read it again: It depends on similarity to a known language...
  - Re:Sweet (Score:2)
    by Posting=!Working (197779) on Thursday July 01 2010, @09:25AM (#32756010)
    He said universal translator, as in it only works on languages of this universe. Marketing speak is from the anti-matter dominant universe, as evidenced by the fact that the more it is spoken, the less is actually communicated.
Answers to all TFA questions (Score:5, Informative)
by cappp (1822388) on Wednesday June 30 2010, @11:53PM (#32753068)
Just so we can keep the “didn’t read TFA” comments to a minimum: The four assumptions as laid out in the article are:
- The language being deciphered is closely related to some other language: In the case of Ugaritic, the researchers chose Hebrew.

- There’s a systematic way to map the alphabet of one language on to the alphabet of the other, and that correlated symbols will occur with similar frequencies in the two languages.

- The system makes a similar assumption at the level of the word: The languages should have at least some cognates, or words with shared roots, like main and mano in French and Spanish, or homme and hombre.

- The system assumes a similar mapping for parts of words. A word like “overloading,” for instance, has both a prefix — “over” — and a suffix — “ing.” The system would anticipate that other words in the language will feature the prefix “over” or the suffix “ing” or both, and that a cognate of “overloading” in another language — say, “surchargeant” in French — would have a similar three-part structure.
The article also notes the success rates, where it states that
Ugaritic has already been deciphered: Otherwise, the researchers would have had no way to gauge their system’s performance. The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew counterparts. Roughly one-third of the words in Ugaritic have Hebrew cognates, and of those, the system correctly identified 60 percent. “Of those that are incorrect, often they’re incorrect only by a single letter, so they’re often very good guesses,” Snyder says.
Critics noted that
The researchers’ approach, he says, presupposes that the language to be deciphered has an alphabet that can be mapped onto the alphabet of a known language — “which is almost certainly not the case with any of the important remaining undeciphered scripts.” It also assumes, he argues, that it’s clear where one character or word ends and another begins, which is not the case with many deciphered and undeciphered scripts. The decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic.
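The second assumption listed above, that matching symbols occur with similar frequencies in both languages, can be illustrated with a toy sketch. It is not the researchers' system; it only pairs symbols by frequency rank, and the function names and sample strings are invented for the illustration.

```python
from collections import Counter


def rank_by_frequency(text):
    """Return the distinct letters of `text`, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(ch for ch in text if ch.isalpha())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, _ in ordered]


def naive_alphabet_mapping(unknown_text, known_text):
    """Pair symbols of the two texts purely by frequency rank (hypothetical helper)."""
    return dict(zip(rank_by_frequency(unknown_text), rank_by_frequency(known_text)))


# Toy data: the "unknown" text is the known text with its letters relabelled.
known = "aaaabbbccd"
unknown = "xxxxyyyzzw"
mapping = naive_alphabet_mapping(unknown, known)
print(mapping)  # {'x': 'a', 'y': 'b', 'z': 'c', 'w': 'd'}
```

On real texts a pure rank pairing is likely to go wrong wherever two letters have similar frequencies, which is presumably one reason the actual system also leans on cognates and word structure.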
- Re:Answers to all TFA questions (Score:5, Insightful)
  by MichaelSmith (789609) on Wednesday June 30 2010, @11:59PM (#32753092) Homepage Journal
  The decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic.
  Maybe I should go around and write "computer" in English on all my computers, as a service to future language researchers.
  - Pfft, why? (Score:5, Funny)
    by mdenham (747985) on Thursday July 01 2010, @12:01AM (#32753108)
    Label at least one computer "ham sandwich" to confuse future language researchers.
    Alternatively, label each computer with a character's name from (insert show of your choice here).
    - Re:Pfft, why? (Score:2)
      by steelfood (895457) on Thursday July 01 2010, @01:40AM (#32753606)
      In all of the computer labs I've been to, the name of the computer is visibly displayed in front somewhere. The names of all the computers in the lab usually revolve around a common theme, e.g. periodic table of elements, Simpsons characters, HHGTTG characters, etc.
      You better hope English never becomes extinct, because an important period in human history would be forever lost.
      - Re:Pfft, why? (Score:3, Insightful)
        by L4t3r4lu5 (1216702) on Thursday July 01 2010, @05:11AM (#32754284)
        How idiotic. Name servers that way if you must, but workstations should be named by geographic location, building, room, station number. Nicknames don't count, but for sanity's sake name your equipment logically.
          - Re:Pfft, why? (Score:2)
            by ultranova (717540) on Thursday July 01 2010, @07:31AM (#32755054)
            The word you are looking for is "systematically", not "logically". And unless you're talking about a whole building's worth of computers, it's simply not worth it to indicate a location in the name; "Huey" is a lot easier to remember than "B2R22S15".
            - Re:Pfft, why? (Score:2)
              by L4t3r4lu5 (1216702) on Thursday July 01 2010, @07:50AM (#32755208)
              This guy was talking about a computer lab. I get the impression that Huey, Duey, Louie, Barney, Smarmey, Charley, Blarney, Indigo Montarney etc will get particularly bothersome to keep tabs on as a convention. Why not B2R22Cad4? CAD machine 4 in lab 22, building 2. Not easy to remember, but memory isn't required. You have all of the information you need without having to learn anything but a naming convention.
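The point about "B2R22Cad4" is that a systematic name can be decoded mechanically. A minimal sketch, assuming the layout is building, room, a short machine-type tag, then a station number (the regular expression and the field names are guesses based on the two examples in this sub-thread, not a documented scheme):

```python
import re

# Assumed layout: B<building>R<room><type tag><station number>, e.g. B2R22Cad4.
HOSTNAME = re.compile(r"^B(\d+)R(\d+)([A-Za-z]+)(\d+)$")


def parse_hostname(name):
    """Split a systematic workstation name into its parts, or return None."""
    match = HOSTNAME.match(name)
    if match is None:
        return None
    building, room, kind, number = match.groups()
    return {
        "building": int(building),
        "room": int(room),
        "kind": kind,
        "number": int(number),
    }


print(parse_hostname("B2R22Cad4"))
print(parse_hostname("Huey"))  # a nickname carries no location, so this is None
```

A nickname such as "Huey" needs a separate lookup table to recover the same information, which is the trade-off the two commenters are arguing about.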
  - Re:Answers to all TFA questions (Score:3, Interesting)
    by vlueboy (1799360) on Thursday July 01 2010, @02:29AM (#32753760)
    The decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic.
    Maybe I should go around and write "computer" in English on all my computers, as a service to future language researchers.
    Extinct language researchers examining English would fail at this same task 3000 years from now. English has no nouns --it has brand names: today's "computers" have big "Dell" logos but not "Computer."
    Also, how would researchers realize that [Apple Mac Glyph] isn't an integral part of our "ancient moon runes" if seen from their era? :)
    - Re:Answers to all TFA questions (Score:2)
      by MichaelSmith (789609) on Thursday July 01 2010, @02:43AM (#32753804) Homepage Journal
      Going further OT: In Harry Harrison's Stainless Steel Rat books people from the distant future wondered why their ancestors had named their planet "dirt".
    - Re:Answers to all TFA questions (Score:5, Funny)
      by mrsurb (1484303) on Thursday July 01 2010, @03:46AM (#32753998)
      Also, how would researchers realize that [Apple Mac Glyph] isn't an integral part of our "ancient moon runes" if seen from their era? :)
      They'd probably see it as having some sort of religious significance. And they'd be correct.
- Re:Answers to all TFA questions (Score:3, Interesting)
  by DurendalMac (736637) on Thursday July 01 2010, @12:01AM (#32753104)
  Darn. So the Voynich Manuscript is probably not a prime candidate.
  - Re:Answers to all TFA questions (Score:2)
    by oljanx (1318801) on Thursday July 01 2010, @12:53AM (#32753386)
    I wouldn't worry too much about that. The Voynich Manuscript is likely the work of a madman, who used a very inconsistent cipher to encode plain text from a language he was not fluent in. Then he added several hundred little tiny pictures of naked women, and a bunch of plants he saw on some sort of "vision quest".
    - Re:Answers to all TFA questions (Score:3, Funny)
      by L4t3r4lu5 (1216702) on Thursday July 01 2010, @05:14AM (#32754302)
      So you're telling me he was at Woodstock '69?
      
      For those who don't know what it was like, clicky [youtube.com]
  - Re:Answers to all TFA questions (Score:3, Insightful)
    by jd (1658) <imipak.yahoo@com> on Thursday July 01 2010, @01:43AM (#32753624) Homepage Journal
    Neither is my great great grandmother's cookbook. Which really is a shame, as I strongly suspect the recipes make something more edible than what's served at the local coffee shop.
- Re:Answers to all TFA questions (Score:2)
  by OnePumpChump (1560417) on Thursday July 01 2010, @12:59AM (#32753416)
  So this probably isn't going to help with Rongorongo, then.
  - Re:Answers to all TFA questions (Score:2)
    by grouchomarxist (127479) on Thursday July 01 2010, @01:33AM (#32753570)
    In the case of Rongorongo, if it is a written language, then it is probably a written form of Rapa Nui, the language of Easter Island. In any case, since Rapa Nui is a Polynesian language we'd be able to compare it to other Polynesian languages. However, this has already been done with no success.
    Part of the problem with Rongorongo and with other undeciphered scripts is that we don't know what counts as a distinct character, the character vs. glyph problem. It is not clear from the article if this system helps with that problem. The article doesn't have enough detail, but it seems that their system makes a lot of assumptions that you can't make when trying to work with an undeciphered script.
Linear A Implications (Score:5, Interesting)
by DowdyGoat (1830958) on Thursday July 01 2010, @12:03AM (#32753124)
This is very cool for us undeciphered language fans.
In the article, the language author Andrew Robinson correctly points out that this computer program won't work for languages that don't have a known language that is close to them, say like for Linear A found on Crete, which is definitely not Greek like Linear B turned out to be. There is a lot of speculation that Linear A is a native Minoan (Cretan) script, largely unrelated to any other known script.
However, parallel with Linear A on Crete was a Cretan pictographic script, which may, or may not be related to Egyptian hieroglyphics. The Minoans had known trading ties to Egypt, which had written language long before them. If a relationship could be found (via this computer program) between the Minoan pictographic script and Egyptian hieroglyphs, then that might give insights into how the Linear A script was set up (which is a syllabary script).
The only difficulty is that there may not be enough of the pictographic script to work--I'd imagine you'd need a fair number of examples to really allow the computer to compare and contrast.
- Re:Linear A Implications (Score:3, Informative)
  by KritonK (949258) on Thursday July 01 2010, @01:59AM (#32753686)
  Actually, the program might be able to help: From what I understand, the Linear A alphabet is related to the Linear B alphabet, which has been deciphered, even though the languages may be different. We know a bit about context (what we have are mostly inventories), and we even know the meaning of one word: the one next to the total of the amounts in the inventory probably means "total". Furthermore, that word, ku-ro, is similar to a form of a Greek word for "total" ("houlon"), so it is very likely that the language is at least Indo-European in origin. One could try using various Indo-European languages as candidates for the related language, until the program comes up with something meaningful.
  Now, if only we had a larger sample of the language of the disk of Phaestos...
- Re:Linear A Implications (Score:2)
  by jd (1658) <imipak.yahoo@com> on Thursday July 01 2010, @02:01AM (#32753696) Homepage Journal
  Well, a more obvious implication is that if you fed in some percentage of Linear A texts and Cretan pictographic texts, you'd get virtually the same results as feeding in a different set of texts (ie: symbols should always equate to the same opposite number) if they are truly related.
  This would at least let you identify if the texts are indeed of the same language, even if you can't read it, which is further along than we are now.
- Re:Linear A Implications (Score:2)
  by Hognoxious (631665) on Thursday July 01 2010, @09:02AM (#32755800) Homepage Journal
  What's the difference between linear A and perl?
  One day we might be able to read linear A ... drrrtish!
Next step: (Score:2, Insightful)
by BoppreH (1520463) on Thursday July 01 2010, @12:08AM (#32753140)
Voynich manuscript! [wikipedia.org]

If only we could find a language that is similar enough...
- Re:Next step: (Score:2)
  by MichaelSmith (789609) on Thursday July 01 2010, @12:22AM (#32753216) Homepage Journal
  That's amazing. I will have to set aside some time to go through it. My guess is that the document is an attempt to create a written script for an Asian language which is only spoken. Cantonese comes to mind because speakers of that language currently borrow Mandarin and Chinese writing when they want to write stuff down.
  - Re:Next step: (Score:2)
    by iserlohn (49556) on Thursday July 01 2010, @05:34AM (#32754394) Homepage
    Cantonese is a dialect of Chinese (as is Mandarin). In fact it is more akin to Middle Chinese than modern Mandarin. It is commonly accepted that Tang dynasty poetry sounds better in Cantonese due to the more similar tonal structure. Basically, it is believed that Cantonese has gone through less changes over the (1500) years from Middle Chinese than Mandarin.
    It is similar to how it is now believed that Elizabethan English sounds more like American English than British English / Received Pronunciation. When colonists leave a mother country to settle a new area, which is sufficiently cut off from the rest of its culture, it tends to preserve more features of the original spoken language.
    For Cantonese, the areas of Guangdung Province (ie. Canton Province) were settled around the time of the Han and Tang Dynasties, displacing the native (most likely) Polynesian tribes that lived there before. You can write Cantonese in Chinese, but some characters that are used are specific to Cantonese to denote Cantonese words.
    Another thing to note is that Chinese was not invented to write Mandarin, but in fact was the script used to write Classical Chinese (a standard form of written Chinese grammar and lexicon from well over 2000 years ago). All Chinese dialects subsequently adapted the script to write vernacular words.
- Re:Next step: (Score:2)
  by lakeland (218447) <lakeland@acm.org> on Thursday July 01 2010, @01:06AM (#32753444) Homepage
  That's interesting, I have not come across this before.
  I last worked in computational linguistics over five years ago, but when I left there was a good supply of techniques for automatically extracting meaning from an unknown text.
  My own research was able to build up both a dendrogram and word vectors from any sufficiently large corpus, and a quick Google search turned up http://www.springerlink.com/content/fp17278783422256/ [springerlink.com] which shows that the field is continuing to develop. I would expect that by now it would be pretty easy to feed a text like this in and get word associations out. From your word associations, building up a basic dictionary will still need you to bootstrap associated concepts but at least the task is much smaller and there's a lot of support for checking.
  I don't recall much successful research into automatic parsing of unknown languages, but since I left the field it could've progressed. Shallow parsing would be a good place to start. Since the language's stemming is unknown you're going to be hard-pressed to parse it anyway but POS tagging should be doable.
  I have not done any work with cyphered texts, so I'm assuming that approaches to natural languages will apply. No doubt there is research in this area, I'm just not familiar with it.
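The "word vectors" mentioned in this comment can be sketched, very roughly, as co-occurrence counts: each word is described by the words that appear next to it, and words used in similar contexts end up with similar vectors. The commenter's own method is not described, so the window size, the function names and the toy corpus below are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus: "cat" and "dog" occur in the same contexts, "ran" does not.
tokens = "the cat sat the dog sat the cat ran".split()
vectors = cooccurrence_vectors(tokens)
print(cosine(vectors["cat"], vectors["dog"]))  # shared neighbours, so well above 0
print(cosine(vectors["cat"], vectors["ran"]))  # no shared neighbours, so 0.0
```

Pairwise similarities like these are the usual input to hierarchical clustering, which is one way a dendrogram of word associations could be built without knowing what any word means.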
- Re:Next step: (Score:2)
  by Trepidity (597) <[gro.hsikcah] [ta] [todhsals-muiriled]> on Thursday July 01 2010, @01:31AM (#32753566)
  The problem is that one of their four assumptions is that the script for the undeciphered language maps characters 1-to-1 onto an existing language's script in a way such that letter frequencies are similar, which is something people have already looked for and which appears not to be the case with the Voynich manuscript.
Sigh. (Score:2)
by slasho81 (455509) on Thursday July 01 2010, @12:26AM (#32753240)
Unaided computer program != computer AI. Not even if you use Bayesian statistics. Leave the hyperbolic headlines to the common newspapers. After all, This Is Slashdot.
Voynich ? (Score:2)
by mbone (558574) on Thursday July 01 2010, @01:33AM (#32753574)
So, when are they going to apply this to the Voynich manuscript [wikipedia.org] ?
Google is missing out (Score:2)
by WindBourne (631190) on Thursday July 01 2010, @01:59AM (#32753684) Journal
They should put on-line a DB of documents that have been translated and then allow others to build a translator. In fact, if smart, they would do this as a competition in which the winner could create a new company based on it, with a large investment by Google.
Screw the article.... (Score:3, Informative)
by djupedal (584558) on Thursday July 01 2010, @02:02AM (#32753698)
IBM, as one example, has been on this hard since 2002 ( http://news.cnet.com/2100-1008-998264.html [cnet.com] ) when the prize was first announced....stop going all lady gaga over stuff that is so old it can't even be recycled properly.
You want to impress me... (Score:3, Funny)
by ngc5194 (847747) on Thursday July 01 2010, @02:44AM (#32753806)
... see if it can decipher some of the perl code I've had to take over.
undecipherable languages? (Score:2)
by ArcadeNut (85398) on Thursday July 01 2010, @06:16AM (#32754656) Homepage
If they are undecipherable languages, how do they verify the results are accurate?

Monday, June 13, 2011

Under the hood of Angry Birds: Java

Web gaming technologies: Angry Birds’ cross-compiled Java versus native JavaScript

Cross-compiled Java

Native JavaScript

Related reading

Comments on Automated Language Deciphering By Computer AI


The consequences of archiving open source repos