Text as Networks

Dr. Alexander O'Connor
ADAPT Centre, Dublin City University

@uberalex

Volume, Variety, Velocity

Depending on who you listen to, 'unstructured data' makes up as much as 80% of the stored knowledge of mankind. To a very great extent, we transform data into knowledge by adding context: we group things that are alike, count them, and examine how they change over time, over space, or in relation to each other. In this talk I will try to address some of the kinds of things that we can use computers to count and aggregate, especially from text. When we see a block of text, what are we really looking at?

Source: Intel

0x1F600 (😀)
0x0041 (A)
0x0061 (a)
0x0040 (@)
0x00C5 (Å)
0x212B (Å, the Ångström sign)

A computer just sees a series of byte codes. Two different codes can appear the same to a human, or, depending on the system, two different glyphs can be displayed for the same code.
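
As a minimal sketch of this point, Python's standard unicodedata module can show two code points from the slide that usually render with the same glyph. The choice of NFC normalization and of the Latin-1 comparison is my assumption for illustration, not something the slide prescribes.

```python
import unicodedata

a_ring = "\u00C5"    # 0x00C5 from the slide
angstrom = "\u212B"  # 0x212B from the slide, usually drawn with the same glyph

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(a_ring))
print(describe(angstrom))

# The two strings look alike to a human but compare as different...
same_raw = a_ring == angstrom

# ...until both are normalized to the same canonical form (NFC here).
same_nfc = (unicodedata.normalize("NFC", a_ring)
            == unicodedata.normalize("NFC", angstrom))

# The same character also occupies a different number of bytes
# depending on the encoding chosen.
utf8_len = len(a_ring.encode("utf-8"))
latin1_len = len(a_ring.encode("latin-1"))
print(same_raw, same_nfc, utf8_len, latin1_len)
```

Normalizing text before counting anything is one way to stop look-alike characters from being counted as different things.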

At three o’clock precisely I was at Baker Street, but Holmes had not yet returned. The landlady informed me that he had left the house shortly after eight o’clock in the morning.

At|IN three|CD o’clock|JJ precisely|RB I|PRP was|VBD at|IN Baker|NNP Street|NNP ,|, but|CC Holmes|NNP had|VBD not|RB yet|RB returned|VBN .|.

People don't think in terms of byte codes, or even really in terms of letters. They tend to think in terms of concepts. How do we get from a stream of byte codes to concepts people can use?
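
One small step in that direction is to read the tagger output shown above. The sketch below parses the word|TAG format and groups consecutive proper nouns (NNP) into candidate names. The helper names are hypothetical, and grouping adjacent NNP tokens is a rough heuristic I am assuming here, not a full named-entity recognizer.

```python
# The tagged sentence from the slide (with a plain ASCII apostrophe).
tagged = ("At|IN three|CD o'clock|JJ precisely|RB I|PRP was|VBD at|IN "
          "Baker|NNP Street|NNP ,|, but|CC Holmes|NNP had|VBD not|RB "
          "yet|RB returned|VBN .|.")

def parse_tagged(text):
    """Split 'word|TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last '|', so the tag is always the suffix.
        word, _, tag = token.rpartition("|")
        pairs.append((word, tag))
    return pairs

def proper_noun_runs(pairs):
    """Join runs of consecutive NNP tokens into candidate names."""
    runs, current = [], []
    for word, tag in pairs:
        if tag == "NNP":
            current.append(word)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs

pairs = parse_tagged(tagged)
names = proper_noun_runs(pairs)
print(names)
```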

At, Baker, Holmes, I, Indeed, So, Street, The, a, accustomed, after, already, and, apart, associated, at, awaiting, be., beside, but, by, case, ceased, character, client, crimes, deeply, disentangled, down, eight, enter, exalted, failing, features, fire, follow, for, friend, from, gave, grasp, grim, had, hand, have, he, head., him, his, house, however, however, in, incisive, inextricable, informed, inquiry, intention, interested, into, invariable, investigation, it, its, keen, landlady, left, long, made, masterly, me, methods, might, morning., most, my, mysteries., nature, none, not, o'clock, of, on, own., pleasure, possibility, precisely, quick, reasoning, recorded, returned., sat, shortly, situation, something, station, still, strange, study, subtle, success, surrounded, system, that, the, there, though, three, to, two, very, was, were, which, with, work, yet

We can look at it from the word level. This is how your classic search engine works. The document is jumbled up into a bag of words, and we count the words that match each other.
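
A bag of words can be sketched in a few lines with the standard library. The tokenizing rule below (lowercase, keep letters and apostrophes) is an assumption made for illustration; real systems make more careful choices.

```python
import re
from collections import Counter

text = ("At three o'clock precisely I was at Baker Street, but Holmes had not "
        "yet returned. The landlady informed me that he had left the house "
        "shortly after eight o'clock in the morning.")

def bag_of_words(document):
    """Lowercase the text, pull out runs of letters, and count each word."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)

bag = bag_of_words(text)

# Word order is gone; only the counts remain.
print(bag.most_common(3))
```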

When you pop in a query keyword, we use the index (a network which links the keywords to the documents) to find our path.
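
A minimal sketch of such an index, assuming a hypothetical three-document corpus and a simple rule that a document matches only if it contains every query keyword:

```python
import re
from collections import defaultdict

# Hypothetical corpus, for illustration only.
documents = {
    "doc1": "Holmes had not yet returned to Baker Street",
    "doc2": "The landlady informed me that he had left the house",
    "doc3": "Holmes left Baker Street shortly after eight",
}

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every query keyword."""
    keywords = re.findall(r"[a-z']+", query.lower())
    if not keywords:
        return set()
    results = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        results &= index.get(word, set())
    return results

index = build_index(documents)
print(sorted(search(index, "Holmes")))
print(sorted(search(index, "left Baker")))
```

The index is the network: each keyword is a node with edges to the documents it appears in, and a query is a walk along those edges.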

  1. Extract Text
  2. Clean Text
  3. Divide the Corpus
  4. Fit a Topic Model (LSI, LDA, NMF)
  5. Interpret the Results
  6. ...
  7. Profit!
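
Steps 2 and 3 of the list above can be sketched with the standard library alone. The result is a document-term count matrix, which is the kind of input that LSI, LDA or NMF would then factorize. The stop-word list, the chunk size and the function names are all assumptions for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "the", "and", "of", "in", "at", "to", "was", "had",
              "he", "i", "me", "that", "but", "not"}

def clean(text):
    """Step 2: lowercase, tokenize, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def divide(words, chunk_size):
    """Step 3: split the word list into consecutive pseudo-documents."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def document_term_matrix(chunks):
    """Count every vocabulary word in every chunk (rows are chunks)."""
    vocabulary = sorted({w for chunk in chunks for w in chunk})
    rows = []
    for chunk in chunks:
        counts = Counter(chunk)
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows

raw = ("At three o'clock precisely I was at Baker Street, but Holmes had not "
       "yet returned. The landlady informed me that he had left the house "
       "shortly after eight o'clock in the morning.")

words = clean(raw)
chunks = divide(words, chunk_size=5)
vocabulary, matrix = document_term_matrix(chunks)
print(len(chunks), "chunks,", len(vocabulary), "distinct words")
```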

We can take this concept of word networks one level further: we can begin to look at words and their neighbours.
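
A minimal sketch of counting neighbours, assuming a made-up toy sentence and a fixed window of positions on either side of each word:

```python
import re
from collections import Counter, defaultdict

def cooccurrence(text, window=2):
    """For each word, count the words appearing within `window` positions."""
    words = re.findall(r"[a-z']+", text.lower())
    neighbours = defaultdict(Counter)
    for i, word in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbours[word][words[j]] += 1
    return neighbours

# Hypothetical toy text, for illustration only.
toy = "the cat sat on the mat and the dog sat on the rug"
neighbours = cooccurrence(toy, window=1)
print(neighbours["sat"].most_common())
```

Words that share many neighbours (here "cat" and "dog" both sit next to "the" and "sat") start to look alike, which is the intuition behind learning meaning from context.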

King + Woman - Man = Queen

Obama + Russia - USA = Putin

Breakfast, Cereal, Lunch, Dinner

Bad is to Worse as Good is to Better

Try it here: http://rare-technologies.com/word2vec-tutorial/

From topic models we can move, given very large amounts of data, to word embeddings: we can learn the meaning of words from the company they keep.
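
The arithmetic on the slides can be sketched with hand-made toy vectors. These vectors are invented for illustration (axes: royalty, gender, food); real embeddings such as word2vec are learned from large corpora and have hundreds of dimensions, so this only shows the vector arithmetic, not the learning.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
vectors = {
    "king":   (1.0,  1.0, 0.0),
    "queen":  (1.0, -1.0, 0.0),
    "man":    (0.0,  1.0, 0.0),
    "woman":  (0.0, -1.0, 0.0),
    "cereal": (0.0,  0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, and return the
    nearest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves, as is common practice.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# King + Woman - Man
result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```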

Source: Emma Clarke

This movie was actually neither that funny, nor super witty.

Great for a romantic evening, but over-priced.

People like David Cameron are happy.

Oh Great! Word crashed again

This day is just getting better and better.

Entities (people, places, times, events, things) represent another semantic level. Now we're associating several labels with one concept.
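
A minimal sketch of associating several labels with one concept, assuming a hypothetical hand-written entity catalogue. It also shows the reverse problem: one label can point at more than one entity, and something else (context, dates, co-occurring names) has to break the tie.

```python
from collections import defaultdict

# Hypothetical catalogue: each entity id lists the labels it is known by.
ENTITIES = {
    "sherlock_holmes": ["Holmes", "Sherlock Holmes", "Mr. Holmes"],
    "bill_clinton": ["Clinton", "Bill Clinton", "President Clinton"],
    "hillary_clinton": ["Clinton", "Hillary Clinton", "Secretary Clinton"],
}

def build_label_index(entities):
    """Invert the catalogue: map each lowercased label to its entity ids."""
    index = defaultdict(set)
    for entity_id, labels in entities.items():
        for label in labels:
            index[label.lower()].add(entity_id)
    return index

def candidates(index, label):
    """Return the sorted entity ids that a label could refer to."""
    return sorted(index.get(label.lower(), set()))

label_index = build_label_index(ENTITIES)
print(candidates(label_index, "Mr. Holmes"))  # one entity, several labels
print(candidates(label_index, "Clinton"))     # one label, several entities
```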

Image Source: http://www.jfsowa.com/ontology/ontometa.htm


President(s) Clinton?

We can also take advantage of the document network: literally in the case of hypertext, but also through references, citations and other networks.
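
One well-known way to exploit a document network is a PageRank-style score, where a document matters more if important documents link to it. The talk does not name a specific method, so this is my choice of illustration. The sketch below uses a hypothetical four-document link graph and assumes every document links to at least one other.

```python
# Hypothetical link graph: each document lists the documents it cites.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    Assumes every node has at least one outgoing link and that every
    link target is itself a key of `graph`.
    """
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share...
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # ...plus an equal split of the rank of each node linking to it.
        for source, targets in graph.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```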

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

Clown:
Come, sir, I will make did behold your worship.

VIOLA:
I'll drink it.

Don't underestimate the power of character-level data.
Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
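
The sample above comes from a recurrent neural network. As a much simpler stand-in (not the RNN from the cited post), the sketch below builds a character-level Markov chain: it records which character follows each three-character context and then samples from those records. The order, seed text and fixed random seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def train(text, order=3):
    """Record which character follows each `order`-character context."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time by sampling a recorded follower
    of the most recent context; stop early if the context was never seen."""
    order = len(seed)
    output = seed
    for _ in range(length):
        followers = model.get(output[-order:])
        if not followers:
            break
        output += rng.choice(followers)
    return output

corpus = ("At three o'clock precisely I was at Baker Street, but Holmes had "
          "not yet returned. The landlady informed me that he had left the "
          "house shortly after eight o'clock in the morning.")

model = train(corpus, order=3)
sample = generate(model, seed="At ", length=60, rng=random.Random(0))
print(sample)
```

With such a small corpus the output mostly replays the source, but with more text even this crude model produces plausible-looking words it has never seen whole.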

Empiricism & Scepticism