[I do not] carry such information in my mind since it is readily available in books. ...The value of a college education is not the learning of many facts but the training of the mind to think.—Albert Einstein
CENDARI IS A RESEARCH INFRASTRUCTURE PROJECT AIMED AT INTEGRATING DIGITAL ARCHIVES FOR MEDIEVAL AND MODERN EUROPEAN HISTORY
67, 69, 78, 68, 65, 82, 73, 32, 73, 83, 32, 65, 32, 82, 69, 83, 69, 65, 82, 67, 72, 32, 73, 78, 70, 82, 65, 83, 84, 82, 85, 67, 84, 85, 82, 69, 32, 80, 82, 79, 74, 69, 67, 84, 32, 65, 73, 77, 69, 68, 32, 65, 84, 32, 73, 78, 84, 69, 71, 82, 65, 84, 73, 78, 71, 32, 68, 73, 71, 73, 84, 65, 76, 32, 65, 82, 67, 72, 73, 86, 69, 83, 32, 70, 79, 82, 32, 77, 69, 68, 73, 69, 86, 65, 76, 32, 65, 78, 68, 32, 77, 79, 68, 69, 82, 78, 32, 69, 85, 82, 79, 80, 69, 65, 78, 32, 72, 73, 83, 84, 79, 82, 89
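The number list above is the same sentence written as a sequence of ASCII code points. A minimal sketch of decoding it (only the first ten codes shown here):

```python
# Decode a list of ASCII code points back into text.
codes = [67, 69, 78, 68, 65, 82, 73, 32, 73, 83]  # first ten codes of the slide
text = "".join(chr(c) for c in codes)
print(text)  # CENDARI IS
```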
'Erected to the memory of Mrs. Dermot O'Brien'
O commemorate me where there is water,
Canal water, preferably, so stilly
Greeny at the heart of summer. Brother
Commemorate me thus beautifully
Where by a lock niagarously roars
The falls for those who sit in the tremendous silence
Of mid-July. No one will speak in prose
Who finds his way to these Parnassian islands.
A swan goes by head low with many apologies,
Fantastic light looks through the eyes of bridges -
And look! a barge comes bringing from Athy
And other far-flung towns mythologies.
O commemorate me with no hero-courageous
Tomb - just a canal-bank seat for the passer-by.
-Patrick Kavanagh
Copyright © Estate of Katherine Kavanagh
A, And, And, Athy, Brother, Canal, Commemorate, Dermot, Erected, Fantastic, Greeny, Mrs, No, O, O, OBrien, Of, Parnassian, The, Tomb, Where, Who, a, a, a, apologies, at, barge, beautifully, bridges, bringing, by, by, canal-bank, comes, commemorate, commemorate, eyes, falls, far-flung, finds, for, for, from, goes, head, heart, hero-courageous, his, in, in, is, islands, just, light, lock, look, looks, low, many, me, me, me, memory, mid-July, mythologies, niagarously, no, of, of, of, one, other, passer-by, preferably, prose, roars, seat, silence, sit, so, speak, stilly, summer, swan, the, the, the, the, the, there, these, those, through, thus, to, to, towns, tremendous, water, water, way, where, who, will, with, with
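The sorted word list above is the output of a simple tokenisation step. A sketch of the idea on the poem's opening lines (a crude regex split that keeps internal hyphens; real tokenisers handle apostrophes and punctuation more carefully):

```python
import re

text = "O commemorate me where there is water, Canal water, preferably, so stilly"

# Split on anything that is not a word character or hyphen, drop empty strings,
# and sort; note that ASCII sorting puts capitalised tokens first, as above.
tokens = [t for t in re.split(r"[^\w-]+", text) if t]
print(sorted(tokens))
```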
Apache Lucene default stopwords:
a, an, and, are, as, at, be, but, by,
for, if, in, into, is, it,
no, not, of, on, or, such,
that, the, their, then, there, these,
they, this, to, was, will, with
More aggressive lists can be extremely powerful (e.g. for nineteenth-century fiction).
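Stopword filtering itself is a one-line operation; a sketch using the Lucene default list above (`remove_stopwords` is an illustrative helper, not a Lucene API):

```python
# The Apache Lucene default stopword list, as shown above.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
}

def remove_stopwords(tokens):
    """Drop stopwords from a token list, matching case-insensitively."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["O", "commemorate", "me", "where", "there", "is", "water"]))
# ['O', 'commemorate', 'me', 'where', 'water']
```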
Stemmers and lemmatisers are used to group together words that differ only for grammatical reasons (plurals, verb forms, and derived nouns, adjectives and adverbs).
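A toy suffix stripper conveys the flavour of stemming (this is NOT the Porter/Snowball algorithm, which applies ordered rule phases with measure conditions; the suffix list here is invented for illustration):

```python
# Strip the first matching suffix, but only if a stem of at least
# three characters remains.
SUFFIXES = ["ational", "ions", "ion", "ing", "ness", "es", "ed", "ly", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["connected", "connecting", "connections", "connection"]])
# ['connect', 'connect', 'connect', 'connect']
```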
Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.—Martin Porter
The World Wide Web is a bare-bones Hypertext system.
Imagine a web surfer who clicks at random, eventually getting bored.
Markov chains are systems that move through a finite set of states, with transition probabilities that depend only on the current state. They can be used to simulate sentences, markets, and users, among countless other examples.
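The bored random surfer can be sketched as a tiny Markov chain; finding its stationary distribution by power iteration is the idea behind PageRank. The three-page link structure below is invented for illustration:

```python
# Row-stochastic transition matrix: P[i][j] = probability of surfing i -> j.
P = [
    [0.0, 0.5, 0.5],   # page A links to B and C
    [1.0, 0.0, 0.0],   # page B links back to A
    [0.5, 0.5, 0.0],   # page C links to A and B
]

def stationary(P, steps=1000):
    """Power iteration: repeatedly apply p <- p P from a uniform start."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(steps):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p

print([round(x, 3) for x in stationary(P)])  # [0.444, 0.333, 0.222]
```

The result says the surfer spends 4/9 of the time on page A, 1/3 on B, and 2/9 on C.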
Lucene, especially Solr
Elasticsearch
A good discussion on the merits of each (and some others)
There are many others, for example if you want to do algorithm research or fine-tune the system. It's also worth noting that many systems already embed Lucene/Solr, or allow it to be plugged in.
This material is based on slides and a tutorial by Dr. Declan Groves. I would like to gratefully acknowledge his generosity in permitting me to use and adapt them.
Star Trek is copyright © CBS Television Studios
Poetry is what gets lost in translation.—Robert Frost.
Translate from source language to target language
Suppose we already know (from a sentence-aligned bilingual corpus) that:
Even though we have never seen "I have a dog" before, statistical machine translation induces information about unseen input based on previously known translations
(Primarily co-occurrence statistics; takes contextual information into account)
All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation.
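A toy sketch of the idea, with a hypothetical English-to-French phrase table (real systems score millions of weighted phrase pairs rather than doing an unweighted greedy lookup):

```python
# Hypothetical phrase table learned from a sentence-aligned bilingual corpus.
PHRASES = {
    ("i", "have"): "j'ai",
    ("a", "dog"): "un chien",
    ("a", "cat"): "un chat",
}

def translate(sentence):
    """Greedy left-to-right translation, preferring the longest matching phrase."""
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        for span in range(len(words) - i, 0, -1):   # longest match first
            phrase = tuple(words[i : i + span])
            if phrase in PHRASES:
                out.append(PHRASES[phrase])
                i += span
                break
        else:
            out.append(words[i])   # unknown word: pass through untranslated
            i += 1
    return " ".join(out)

# "I have a dog" was never seen whole, but its pieces were.
print(translate("I have a dog"))  # j'ai un chien
```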
In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages.
Upwards of 2M sentence pairs on average for large-scale systems.
Statistics calculated to represent:
Human evaluation is expensive, time-consuming, and not 100% objective (two evaluators may not agree on their judgement of the same MT output)
Automatic evaluation is cheap, fast and consistent
Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
MT output: "in two weeks Iraq’s weapons will give army"
Possible metric components:
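One such component is word overlap between output and reference. A crude sketch of unigram precision and recall on the pair above (metrics such as BLEU refine this with clipped n-gram counts and a brevity penalty):

```python
# Lowercased, whitespace-tokenised reference and MT output from the slide.
reference = "the iraqi weapons are to be handed over to the army within two weeks".split()
output = "in two weeks iraq's weapons will give army".split()

matches = sum(1 for w in output if w in reference)   # two, weeks, weapons, army
precision = matches / len(output)      # fraction of output words found in the reference
recall = matches / len(reference)      # fraction of the reference that was covered
print(round(precision, 2), round(recall, 2))  # 0.5 0.29
```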
How do we build a distributed, bare-bones, decentralised, global pool of machine-readable knowledge?
Or,
How do we move from a web of virtual documents (human-consumable text) to a web of knowledge (machine-consumable facts)?
This is the most commonly used approach to encoding semantic data
The schema can be either RDFS- or OWL-based (SKOS is also an option)
<Subject, Predicate, Object>
The Triples are then built into a Graph
ISBN | Title | Author | PublisherID | Pages
---|---|---|---|---
0596000480 | Javascript | D. Flanagan | 3556 | 936
0596002637 | Practical RDF | S. Powers | 7642 | 350
... | ... | ... | ... | ...
aBook is our instance
aBook hasTitle "Javascript"^^XSD:String.
aBook hasIsbn "0596000480"^^XSD:Integer.
aBook hasPublisher aPublisher.
aBook is-a Bib:Book.
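A minimal sketch of storing and querying such triples in memory (the pattern-matching `query` with `None` as a wildcard is an illustration of the triple model, not a real triple-store API):

```python
# Triples as (subject, predicate, object) tuples, following the aBook example.
triples = [
    ("aBook", "hasTitle", "Javascript"),
    ("aBook", "hasIsbn", "0596000480"),
    ("aBook", "hasPublisher", "aPublisher"),
    ("aBook", "is-a", "Bib:Book"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(p="hasTitle"))         # [('aBook', 'hasTitle', 'Javascript')]
print(query(s="aBook", p="is-a"))  # [('aBook', 'is-a', 'Bib:Book')]
```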
"Provide a common API for data on the Web which is more convenient than many separately and differently designed APIs published by individual data suppliers." (From the Linked Data Glossary.)

Tim Berners-Lee, the inventor of the Web and initiator of the Linked Data project, proposed the following principles upon which Linked Data is based:
- Use URIs to name things;
- Use HTTP URIs so that things can be referred to and looked up ("dereferenced") by people and user agents;
- When someone looks up a URI, provide useful information, using the open Web standards such as RDF, SPARQL;
- Include links to other related things using their URIs when publishing on the Web.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

ASK WHERE {
  <http://dbpedia.org/resource/Mount_Everest> dbpedia2:elevationM ?everestA .
  <http://dbpedia.org/resource/K2> dbpedia2:elevationM ?k2A .
  FILTER(?everestA > ?k2A) .
}
See also: Europeana SPARQL Endpoint
This research is supported by the Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Next Generation Localisation at Trinity College Dublin.
Built with impress.js, d3.js and github.io