Commit 2fbf04bd authored by ana

adding shortened captions EN/FR

parent 9dc5488d
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Algolit Catalog</title>
<link href="captions.css" rel="stylesheet">
</head>
<body>
<section class="language en">
<section class="asciiheaderwrapper groupheader center" id="algoliterary-works"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary works</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="group"><section class="lemma i-could-have-written-that"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% i-could-have-written-that</pre></section><table>
<tr>
<td> Type: </td>
<td> Algoliterary work
</td></tr>
<tr>
<td> Datasets: </td>
<td> custom textual sources, modality.py, Twitter API, DuckDuckGo API, Wikipedia API
</td></tr>
<tr>
<td> Technique: </td>
<td> rule-based learning, supervised learning, unsupervised learning, <a class="mw-redirect" href="http://www.algolit.net/index.php/Bag-of-words" title="Bag-of-words">bag-of-words</a>, cosine_similarity
</td></tr>
<tr>
<td> Developed by: </td>
<td> Tom De Smedt/Pattern, teams of SciKit Learn, Python, Nltk, Jinja2 &amp; Manetta Berends and kindly supported by <a class="external text" href="https://www.cbkrotterdam.nl/" rel="nofollow">CBK Rotterdam</a>
</td></tr></table><p><i>i-could-have-written-that*</i> is a practice-based research project about text-based machine learning, questioning the readerly nature of the techniques and proposing to represent them as writing machines. The project includes three writing-systems: <i>writing from Myth (-1.00) to Power (+1.00)</i>, <i>Supervised writing</i> &amp; <i>Cosine Similarity morphs</i>. These translate technical elements from machine learning into graphical user interfaces in the browser.
</p><p>The interfaces enable users to explore the techniques and do a series of test-runs themselves with a textual data source of their choice. After processing the chosen source, the writing systems offer the option to export their outputs to a PDF document.
</p><p><small>* The title <i>i-could-have-written-that</i> is derived from the paper <a class="external text" href="https://www.csee.umbc.edu/courses/331/papers/eliza.html" rel="nofollow">ELIZA--A Computer Program For the Study of Natural Language Communication Between Man and Machine</a>, written by Joseph Weizenbaum and published in 1966. </small>
</p><h2 id="rule-based-writing"><span class="mw-headline" id="Rule-based_writing">Rule-based writing</span></h2><p><a class="image" href="http://www.algolit.net/index.php/File:Screenshot-rule-based-modality.py_result.png"><img alt="Screenshot-rule-based-modality.py result.png" height="427" src="http://www.algolit.net/images/thumb/b/b4/Screenshot-rule-based-modality.py_result.png/300px-Screenshot-rule-based-modality.py_result.png" srcset="/images/thumb/b/b4/Screenshot-rule-based-modality.py_result.png/450px-Screenshot-rule-based-modality.py_result.png 1.5x, /images/thumb/b/b4/Screenshot-rule-based-modality.py_result.png/600px-Screenshot-rule-based-modality.py_result.png 2x" width="300"/></a>
</p><p>The writing-system <i>writing from Myth (-1.00) to Power (+1.00)</i> is based on a script that is included in the text-mining software package <a class="external text" href="https://www.clips.uantwerpen.be/pattern" rel="nofollow">Pattern</a> (University of Antwerp), called modality.py. It is a rule-based program, one of the older types of text-mining techniques. The series of calculations in a rule-based program is determined by a set of rules, written after linguistic research on a specific subject.</p>
<p>A rule-based program is very precise and effective, but also very static and specific, which makes it an expensive type of text-mining technique in terms of time, labour, and the difficulty of re-using a program on different types of text.
This rule-based script calculates the degree of certainty of a sentence, expressed as a value between -1.00 and +1.00. The interface is a reading tool for this script that highlights the effect of the rules written by the scientists at the University of Antwerp. The interface also offers the option to change the rules and create a custom reading-rule-set that is applied to a text of choice. It is a poetic translation exercise, born from an interest in a numerical perception of human language, while bending strict categories.</p>
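<p>The principle can be illustrated with a minimal sketch in Python (not the modality.py implementation itself, which is considerably more elaborate): each cue word carries a value between -1.00 and +1.00, and the certainty of a sentence is taken here as the average of the cue words it contains.
</p><pre>
# Minimal illustration of rule-based certainty scoring.
# The word values are a small excerpt in the spirit of modality.py;
# the averaging rule is a simplification of the actual script.
CERTAINTY = {
    "fact": 1.00, "truth": 1.00, "evidence": 0.75, "data": 0.75,
    "theory": 0.50, "hypothesis": 0.25, "opinion": 0.00,
    "belief": -0.25, "doubt": -0.50, "fiction": -1.00, "myth": -1.00,
}

def certainty(sentence):
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    values = [CERTAINTY[w] for w in words if w in CERTAINTY]
    return sum(values) / len(values) if values else 0.00

print(certainty("The evidence points to a fact."))   # 0.875
print(certainty("It is a myth, a mere belief."))     # -0.625
</pre><p>Changing the values in such a dictionary is, in miniature, what the interface allows: a custom reading-rule-set applied to a text of choice.
</p>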
<p>The script modality.py comes with pre-defined values. The words fact (+1.00), evidence (+0.75) and (even) data (+0.75) indicate a high level of certainty, as opposed to words like fiction (-1.00) and belief (-0.25).
</p><p>In the script, the concept of being certain is divided into 9 categories:
</p><pre>
-1.00 = NEGATIVE
-0.75 = NEGATIVE, with slight doubts
-0.50 = NEGATIVE, with doubts
-0.25 = NEUTRAL, slightly negative
+0.00 = NEUTRAL
+0.25 = NEUTRAL, slightly positive
+0.50 = POSITIVE, with doubts
+0.75 = POSITIVE, with slight doubts
+1.00 = POSITIVE
</pre><p>after which a set of words is connected to each category, for example this set of nouns:
</p><pre>
-1.00: d("fantasy", "fiction", "lie", "myth", "nonsense"),
-0.75: d("controversy"),
-0.50: d("criticism", "debate", "doubt"),
-0.25: d("belief", "chance", "faith", "luck", "perception", "speculation"),
0.00: d("challenge", "guess", "feeling", "hunch", "opinion", "possibility", "question"),
+0.25: d("assumption", "expectation", "hypothesis", "notion", "others", "team"),
+0.50: d("example", "proces", "theory"),
+0.75: d("conclusion", "data", "evidence", "majority", "proof", "symptom", "symptoms"),
+1.00: d("fact", "truth", "power")
</pre><h2 id="supervised-writing"><span class="mw-headline" id="Supervised_writing">Supervised writing</span></h2><p><a class="image" href="http://www.algolit.net/index.php/File:Screenshot-supervised-writing-pdf_v2.png"><img alt="Screenshot-supervised-writing-pdf v2.png" height="424" src="http://www.algolit.net/images/thumb/b/b1/Screenshot-supervised-writing-pdf_v2.png/300px-Screenshot-supervised-writing-pdf_v2.png" srcset="/images/thumb/b/b1/Screenshot-supervised-writing-pdf_v2.png/450px-Screenshot-supervised-writing-pdf_v2.png 1.5x, /images/b/b1/Screenshot-supervised-writing-pdf_v2.png 2x" width="300"/></a>
</p><p>The writing system <i>Supervised writing</i> is built with a set of techniques that are often used in a supervised machine learning project. In a series of steps, the user is guided through a language processing system to create a custom counted vocabulary writing exercise. On the way, the user meets the <i><a class="external text" href="http://www.algolit.net/index.php/A-Bag-of-Words" rel="nofollow">bag-of-words</a></i> counting principle by exploring its numerical view on human language. With the option to work with text material from three external input sources, Twitter, DuckDuckGo or Wikipedia, this writing system offers an alternative, numerical view on well-known sources of textual data.
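</p><p>As an illustration of that counting principle (a generic sketch, not the project's own code), a bag-of-words representation reduces each text to the number of times each word occurs, discarding word order:
</p><pre>
# Generic bag-of-words sketch: count word occurrences, ignore word order.
from collections import Counter

texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

for text in texts:
    print(dict(Counter(text.split())))
# {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
# {'the': 2, 'dog': 1, 'sat': 1, 'on': 1, 'log': 1}
</pre><p>Assembled over a shared vocabulary, these counts become the document-term matrix on which the writing exercise is built.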
</p><h2 id="cosine-similarity-morphs"><span class="mw-headline" id="Cosine_Similarity_morphs">Cosine Similarity morphs</span></h2><p><a class="image" href="http://www.algolit.net/index.php/File:Screenshot_from_2017-10-07_00-53-56.png"><img alt="Screenshot from 2017-10-07 00-53-56.png" height="415" src="http://www.algolit.net/images/thumb/1/1b/Screenshot_from_2017-10-07_00-53-56.png/300px-Screenshot_from_2017-10-07_00-53-56.png" srcset="/images/thumb/1/1b/Screenshot_from_2017-10-07_00-53-56.png/450px-Screenshot_from_2017-10-07_00-53-56.png 1.5x, /images/1/1b/Screenshot_from_2017-10-07_00-53-56.png 2x" width="300"/></a>
</p><p>The writing-system <i>Cosine Similarity morphs</i> works with unsupervised similarity measurements at sentence level. The textual source of choice is first transformed into a corpus and a vector matrix, after which the cosine similarity function from SciKit Learn is applied. The <a class="external text" href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="nofollow">cosine similarity</a> function is often used in unsupervised machine learning practices to extract 'hidden' semantic information from text. Since the textual data is shown to the computer without any label, this technique is often referred to as 'unsupervised' learning.
</p><p>The interface allows the user to select from a set of possible counting methods, also called features, to create a spectrum of the four most similar sentences. While creating multiplicity as a result, the interface includes numerical information on the similarity calculations that have been made. The user, the cosine similarity function, the author of the chosen text, and the maker of this writing system collectively create a quartet of sentences that morph between a linguistic and a numerical understanding of similarity.
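</p><p>A minimal sketch of the underlying steps, using scikit-learn (the interface adds feature selection and the PDF export on top of this):
</p><pre>
# Sketch: turn sentences into count vectors and measure pairwise cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the creature opened its eyes",
    "the monster opened its dull yellow eye",
    "rain pattered dismally against the panes",
]

matrix = CountVectorizer().fit_transform(sentences)   # one count vector per sentence
similarities = cosine_similarity(matrix)              # pairwise similarity scores

print(similarities.round(2))
</pre><p>The sentences with the highest scores are the candidates for the quartet of most similar sentences shown in the interface.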
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-works"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary works</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma the-weekly-address-a-model-for-a-politician"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% The Weekly Address, A model for a politician</pre></section><table>
<tr>
<td> Type: </td>
<td> Algoliterary Work
</td></tr>
<tr>
<td> Datasets: </td>
<td> The Weekly Address, videos on <a class="external text" href="https://www.youtube.com/channel/UCDGknzyQfNiThyt4vg4MlTQ/search?query=weekly+address" rel="nofollow">youtube</a>
</td></tr>
<tr>
<td> Technique: </td>
<td> Markov Chain, PocketSphinx
</td></tr>
<tr>
<td> Developed by: </td>
<td> Gijs de Heij
</td></tr></table><p><br/>
<i>The Weekly Address, a Model for a Politician</i> researches the role of language and image profiles in politics and their ability to influence our judgment.
</p><p>The installation employs speech recognition and machine learning to analyse patterns in a politician's way of speaking. While machine learning recognises patterns and produces reliable and repeatable results based on a data set, politicians construct patterns through rhetoric, often repeating their message to convey their own truth.
</p><p>The installation is based on a database which was generated by a speech recognition algorithm listening to Obama's Weekly Address. The recognised words were stored as text and analysed using a Markov Chain. There are two interfaces to this database, one showing repeated sentences or word combinations and the other allowing the generation of new speeches.
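</p><p>A word-level Markov chain of this kind can be sketched in a few lines of Python (a generic illustration, not the installation's own code): for every word in the recognised transcripts, record which words follow it, then generate text by repeatedly picking a successor of the current word.
</p><pre>
# Generic Markov chain sketch: the next word depends only on the current one.
import random
from collections import defaultdict

transcript = "good morning everybody this week I spoke about jobs this week we acted on jobs"

chain = defaultdict(list)
words = transcript.split()
for current, following in zip(words, words[1:]):
    chain[current].append(following)

word = random.choice(words)
generated = [word]
for _ in range(12):
    if word not in chain:
        break
    word = random.choice(chain[word])
    generated.append(word)

print(" ".join(generated))
</pre><p>Because frequent word combinations are stored more often, the generated speeches tend to reproduce the politician's most repeated phrasings.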
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-works"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary works</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma in-the-company-of-cluebotng"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% In the company of CluebotNG</pre></section><table>
<tr>
<td> Type: </td>
<td> Algoliterary work
</td></tr>
<tr>
<td> Datasets: </td>
<td> Wikipedia edits
</td></tr>
<tr>
<td> Technique: </td>
<td> Supervised machine learning, Naive Bayesian classifiers
</td></tr>
<tr>
<td> Developed by: </td>
<td> User:Cobi, User:Crispy1989, Cristina Cochior
</td></tr></table><p>Wikipedia relies on machine assistance when it comes to maintenance. One of its most active applications is <a class="external text" href="https://en.wikipedia.org/wiki/User:ClueBot_NG" rel="nofollow">CluebotNG</a>, an anti-vandalism bot operating on the English Wikipedia since December 2010.
</p><p>CluebotNG uses a series of different Bayesian classifiers, which measure word weights to attribute a likelihood score for an edit to be considered vandalism. The results of this are fed to an artificial neural network which further allocates a number between 0 and 1 to the edits, where 1 represents a 100% chance that an edit is ill-intended.
</p>
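<p>The word-weight idea behind those classifiers can be sketched as follows (a generic naive Bayes illustration with made-up probabilities, not ClueBot NG's actual classifier chain):
</p><pre>
# Generic sketch of a word-weight Bayesian vandalism score.
# The word probabilities below are hypothetical values, as if estimated
# from a set of labelled edits.
import math

p_vandal = {"stupid": 0.08, "lol": 0.05, "citation": 0.001}   # P(word | vandalism)
p_good   = {"stupid": 0.001, "lol": 0.002, "citation": 0.03}  # P(word | constructive)

def vandalism_score(edit_words, prior_vandal=0.05):
    # Combine the word weights with Bayes' rule in log space.
    log_v = math.log(prior_vandal)
    log_g = math.log(1 - prior_vandal)
    for w in edit_words:
        if w in p_vandal and w in p_good:
            log_v += math.log(p_vandal[w])
            log_g += math.log(p_good[w])
    # P(vandalism | edit), a number between 0 and 1.
    return 1 / (1 + math.exp(log_g - log_v))

print(vandalism_score(["lol", "stupid"]))      # close to 1
print(vandalism_score(["citation", "added"]))  # close to 0
</pre><p>In ClueBot NG such scores are only an intermediate result; the neural network that follows takes them as input and produces the final number between 0 and 1.
</p>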
<p>Through a reenactment bot that goes through each of the edits of CluebotNG and displays them on a monitor, the sequential replication of its edits weaves together a narrative of the nonhuman voices that usually pass unnoticed on media platforms. Each micro-interaction of the bot is intrinsically performed in connection to a human editor, whom the algorithm is policing. As the taxidermic program runs, a sense of body emerges through the time span it would take to get to the end of the bot's edits.
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-works"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary works</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma oulipo-recipes"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% Oulipo recipes</pre></section>
<table>
<tr>
<td> Type: </td>
<td> Algoliterary Work
</td></tr>
<tr>
<td> Datasets: </td>
<td> Human inspiration, Wordnet, 1984 by George Orwell, objects of a handbag
</td></tr>
<tr>
<td> Technique: </td>
<td> Quicksort, Markov Chain
</td></tr>
<tr>
<td> Developed by: </td>
<td> Oulipo, Marcel Bénabou, Tony Hoare, Allen Downey, Andrey Markov, Consonni, Algolit
</td></tr></table><p><a class="external text" href="https://gitlab.constantvzw.org/algolit/algolit/tree/master/algoliterary_encounter/oulipo" rel="nofollow"><b>Download the scripts</b></a>
</p><h2 id="labcdaire-a-game"><span class="mw-headline" id="L.27Ab.C3.A9c.C3.A9daire.2C_a_game">L'Abécédaire, a game</span></h2><p><i><a class="external text" href="http://oulipo.net/fr/contraintes/abecedaire" rel="nofollow">L'Abécédaire</a></i> is a text of which the first letters of each word follow the alphabetical order. The Quicksort-algorithm is a fruitful algoritme to play <i>l’abécédaire</i> as a game, inside or on the street.
</p><p><b>Quicksort</b> was invented in 1960 by Tony Hoare, a visiting student from Oxford at Moscow State University. He developed Quicksort to alphabetically order Russian words as part of a translation machine. Nowadays Quicksort is part of standard programming environments such as Unix, C and C++.
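</p><p>A minimal Quicksort sketch for ordering words alphabetically (a sketch of the sorting steps that the game enacts by hand):
</p><pre>
# Quicksort sketch: pick a pivot word, split the rest into words that come
# before it and words that come after it, then sort both halves the same way.
def quicksort(words):
    if not words:
        return []
    pivot, rest = words[0], words[1:]
    before = [w for w in rest if w.lower() &lt; pivot.lower()]
    after = [w for w in rest if w.lower() >= pivot.lower()]
    return quicksort(before) + [pivot] + quicksort(after)

print(quicksort("zazie utilise xylophones et tambours".split()))
# ['et', 'tambours', 'utilise', 'xylophones', 'zazie']
</pre><p>Reading the sorted list from left to right gives a sequence whose first letters follow the alphabet, the constraint of <i>l'abécédaire</i>.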
</p><p>This Hungarian dance company executes the Quicksort as a performance:
<a class="external text" href="https://www.youtube.com/embed/ywWBy6J5gz8" rel="nofollow">Quicksort Dance</a>
</p><p>Play <i>l'Abécédaire</i> as a game, developed by Algolit: <a href="http://www.algolit.net/index.php/Abecedaire_rules" title="Abecedaire rules">Abecedaire rules</a>
</p>
<h2 id="markov-chain-a-game"><span class="mw-headline" id="Markov_Chain.2C_a_game">Markov Chain, a game</span></h2><p><b>Markov Chain</b> was developed in 1906 by Andrey Markov, a Russian mathematician who died in 1992. This algorithm is part of many spam generating softwares. It is applied in systems that describe respective dependent events. What happens, only depends of the output of the previous step. That is why Markov Chains are also called ‘memory less’.
</p><p>This game was developed in two versions, one using sentences and a writing card system (in collaboration with Brendan Howell, Catherine Lenoble and Désert Numérique, 2014); and a version using objects (in collaboration with Consonni, Bilbao: Itziar Olaizola, Emanuel Cantero, Pablo Mendez, Ariadna Chezran, Iñigo Benito, Itziar Markiegi, Josefina Rocco, Andrea Estankona, Mawa Tres (Juan Pablo Orduñez), Maria Ptqk, 2015).
</p></section></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="group"><section class="lemma charnn-text-generator"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% CHARNN text generator</pre></section><table>
<tr>
<td> Type: </td>
<td> Algoliterary exploration
</td></tr>
<tr>
<td> Dataset(s): </td>
<td> Complete Works by Shakespeare &amp; Jules Verne, Enron Email Archive
</td></tr>
<tr>
<td> Technique: </td>
<td> Torch, Cuda, Recurrent Neural Network, LSTM
</td></tr>
<tr>
<td> Developed by: </td>
<td> Justin Johnson (original version: Andrej Karpathy)
</td></tr></table><p><i>The CharRNN text generator</i> produces text using the CharRNN model. This is a recurrent neural network that reads a text character by character. In the training phase the model analyses which characters occur after each other and learns the probability of the next character based on the characters it has seen so far. The model has a memory that varies in size. In the learning process it can forget certain information it has seen, as the network is constructed using Long Short-Term Memory (LSTM) modules.
</p><p>One of the first things the model learns is that words are separated by spaces and sentences are separated by a period, a space and an uppercase letter. Although it might seem the model has learned that a text is constructed using multiple words and sentences, it has actually learned that after a small number of characters the chances are high that a space will occur, and that after several series of characters and spaces there will be a period, a space and an uppercase character.
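</p><p>That principle can be shown with a much simpler model than the LSTM network itself: a sketch that only counts which character follows which (the CharRNN conditions on a much longer history through its memory modules).
</p><pre>
# Sketch of the underlying idea: estimate how likely each character is
# to follow the current one by counting character pairs in a text.
from collections import Counter, defaultdict

text = "to be or not to be. To see or not to see."

pairs = defaultdict(Counter)
for current, following in zip(text, text[1:]):
    pairs[current][following] += 1

# After a space, which characters are most likely to follow?
total = sum(pairs[" "].values())
for char, count in pairs[" "].most_common(3):
    print(repr(char), round(count / total, 2))
</pre><p>Already in this toy version the statistics reflect the structure of the text: spaces are followed by letters that start words, and periods by spaces.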
</p><p>The generator interface is trained on various datasets and can be explored.
The model is based on a <a class="external text" href="https://github.com/jcjohnson/torch-rnn" rel="nofollow">script by Justin Johnson</a>.
This script is an improved version of the original script by <a class="external text" href="https://github.com/karpathy/char-rnn" rel="nofollow">Andrej Karpathy</a>.
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma you-shall-know-a-word-by-the-company-it-keeps"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% You shall know a word by the company it keeps</pre></section><table>
<tr>
<td> Type: </td>
<td> Algoliterary exploration
</td></tr>
<tr>
<td> Datasets: </td>
<td> <a href="http://www.algolit.net/index.php/Frankenstein" title="Frankenstein">Frankenstein</a>, <a href="http://www.algolit.net/index.php/AstroBlackness" title="AstroBlackness">AstroBlackness</a>, <a href="http://www.algolit.net/index.php/WikiHarass" title="WikiHarass">WikiHarass</a>, <a href="http://www.algolit.net/index.php/Learning_from_Deep_Learning" title="Learning from Deep Learning">Learning from Deep Learning</a>, <a href="http://www.algolit.net/index.php/NearbySaussure" title="NearbySaussure">nearbySaussure</a>
</td></tr>
<tr>
<td> Technique: </td>
<td> word embeddings
</td></tr>
<tr>
<td> Developed by: </td>
<td> Google Tensorflow's word2vec, Algolit
</td></tr></table><p><i>You shall know a word by the company it keeps</i> is a series of 5 landscapes that are based on different datasets. Each landscape includes the words 'human', 'learning' and 'system' in the company of different semantic clusters. The belief that distances in the graph are connected to the semantic similarity of words is one of the basic ideas behind word2vec.
</p><p>The graphs are the result of a code study based on an existing word-embedding tutorial script, <a href="http://www.algolit.net/index.php/Word2vec_basic.py" title="Word2vec basic.py">word2vec_basic.py</a>. In a machine learning practice, graphs like these function as one of the validation tools to see if a model starts to make sense. It is interesting how this validation process is fuelled by an individual semantic understanding of the clusters and the words.
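</p><p>The 'nearest to' lists below come from this kind of measurement. A generic numpy sketch (not word2vec_basic.py itself) of how nearest neighbours are read off an embedding matrix:
</p><pre>
# Sketch: given a matrix of word embeddings, the nearest words to a query
# are the rows with the highest cosine similarity to the query's row.
import numpy as np

vocabulary = ["human", "creature", "system", "learning", "stone"]
embeddings = np.random.randn(len(vocabulary), 50)   # stands in for trained vectors

def nearest(word, k=3):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ normed[vocabulary.index(word)]
    order = np.argsort(-scores)
    return [vocabulary[i] for i in order if vocabulary[i] != word][:k]

print(nearest("human"))
</pre><p>In the landscapes, the same high-dimensional distances are projected down to two dimensions, which is what makes them readable as graphs.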
</p><p>How can we use these semantic landscapes as reading tools?
</p><h2 id="settings"><span class="mw-headline" id="settings">settings</span></h2><ul><li> Vocabulary size: 5000</li>
<li> Algorithm: <a class="external text" href="https://arxiv.org/abs/1412.6980" rel="nofollow">Adam Optimizer</a></li>
<li> Learning rate: 0.01</li></ul><h2 id="graph-1-frankenstein"><span class="mw-headline" id="graph_1:_Frankenstein">graph 1: Frankenstein</span></h2><p>Includes the book <a class="external text" href="http://www.algolit.net/index.php/Frankenstein" rel="nofollow">Frankenstein or, The Modern Prometheus by Mary Shelley</a>.
</p><pre>
loss value: 4.45983128536
Nearest to human: fair, active, crevice, sympathizing, pretence, fellow, nightingale, productions, deaths, medicine,
Nearest to learning: steeple, clump, electricity, security, foretaste, fluctuating, finding, gazes, pour, decides,
Nearest to system: philosophy, coincidences, threatening, selfcontrol, distinctly, babe, stream, chimney, recess, accounts,
</pre><p><a class="image" href="http://www.algolit.net/index.php/File:Detail-frankenstein.png"><img alt="Detail-frankenstein.png" height="509" src="http://www.algolit.net/images/a/a3/Detail-frankenstein.png" width="790"/></a>
<a class="image" href="http://www.algolit.net/index.php/File:5_graphs_frankenstein_gutenberg_tf.png"><img alt="5 graphs frankenstein gutenberg tf.png" height="2000" src="http://www.algolit.net/images/0/07/5_graphs_frankenstein_gutenberg_tf.png" width="2000"/></a>
</p><h2 id="graph-2-astroblackness"><span class="mw-headline" id="graph_2:_AstroBlackness">graph 2: AstroBlackness</span></h2><p>A selection of texts from an afrofuturist perspective.
</p><pre>
loss value: 5.8195698024
Nearest to human: black, difference, white, gender, otherwise, 3, 7, ignorance, contemporary, greater,
Nearest to learning: superior, truth, function, lens, start, dying, existence, changing, symbol, place,
Nearest to system: attempts, adapt, programmed, varieties, limit, realization, color, promise, population, voice,
</pre><p><a class="image" href="http://www.algolit.net/index.php/File:Detail-astroBlackness.png"><img alt="Detail-astroBlackness.png" height="420" src="http://www.algolit.net/images/5/55/Detail-astroBlackness.png" width="735"/></a>
<a class="image" href="http://www.algolit.net/index.php/File:5_graphs_astroBlackness.png"><img alt="5 graphs astroBlackness.png" height="2000" src="http://www.algolit.net/images/0/0c/5_graphs_astroBlackness.png" width="2000"/></a>
</p><h2 id="graph-3-nearbysaussure"><span class="mw-headline" id="graph_3:_nearbySaussure">graph 3: nearbySaussure</span></h2><p>Includes three secondary books about Saussure's work in structuralist linguistics.
</p><pre>
loss value: 5.78265964687
Nearest to human: cultural, 181, psychic, Human, rational, physical, story, chance, domain, furthermore,
Nearest to system: structure, content, community, System, term, center, study, plurality, form, value,
The word 'learning' did not appear in the list of 5000 most common words.
</pre><p><a class="image" href="http://www.algolit.net/index.php/File:Detail-nearbySaussure.png"><img alt="Detail-nearbySaussure.png" height="452" src="http://www.algolit.net/images/8/86/Detail-nearbySaussure.png" width="800"/></a>
<a class="image" href="http://www.algolit.net/index.php/File:5_graphs_nearbySaussure.png"><img alt="5 graphs nearbySaussure.png" height="2000" src="http://www.algolit.net/images/3/34/5_graphs_nearbySaussure.png" width="2000"/></a>
</p><h2 id="graph-4-learning-from-deep-learning"><span class="mw-headline" id="graph_4:_Learning_from_Deep_Learning">graph 4: Learning from Deep Learning</span></h2><p>Includes seven text books on the topic of deep learning.
</p><pre>
loss value: 6.65393904257
Nearest to human: healthy, given, modeling, poorly, inspired, criterion, specifically, Accuracy, surface, predicting,
Nearest to learning: Learning, pretrained, sparse, neat, 21, inference, tuning, adagrad, tested, Use,
Nearest to system: UNK, roi, dataframe, code, win, page, approach, diagonal, cae, letter,
</pre><p><a class="image" href="http://www.algolit.net/index.php/File:Detail-learning-deep-learning.png"><img alt="Detail-learning-deep-learning.png" height="480" src="http://www.algolit.net/images/7/71/Detail-learning-deep-learning.png" width="850"/></a>
<a class="image" href="http://www.algolit.net/index.php/File:5_graphs_deep-learning-trainingset.png"><img alt="5 graphs deep-learning-trainingset.png" height="2000" src="http://www.algolit.net/images/7/78/5_graphs_deep-learning-trainingset.png" width="2000"/></a>
</p><h2 id="graph-5-wikiharass"><span class="mw-headline" id="graph_5:_WikiHarass">graph 5: WikiHarass</span></h2><p>Includes examples of harassment on Talk page comments from Wikipedia.
</p><pre>
loss value: 3.93717244664
Nearest to human: jacob, Persianyes, phrase, track, star, attack, puts, jews, helps, plastic,
Nearest to learning: sound, people, getting, writing, thinking, talking, thoughts, modify, less, prince,
Nearest to system: armenian, UNK, georgia, george, n, developed, its, each, daniele, claim,
</pre><p><a class="image" href="http://www.algolit.net/index.php/File:Detail-WikiHarass.png"><img alt="Detail-WikiHarass.png" height="434" src="http://www.algolit.net/images/c/c1/Detail-WikiHarass.png" width="798"/></a>
<a class="image" href="http://www.algolit.net/index.php/File:5_graphs_Talk_page_comments_from_Wikipedia_stripped.png"><img alt="5 graphs Talk page comments from Wikipedia stripped.png" height="2000" src="http://www.algolit.net/images/d/d1/5_graphs_Talk_page_comments_from_Wikipedia_stripped.png" width="2000"/></a>
</p></section></section>
<h5 id="public-datasets"><span class="mw-headline" id="Public_datasets">Public datasets</span></h5>
<p>The most commonly used public datasets are gathered at <a class="external text" href="https://aws.amazon.com/public-datasets/" rel="nofollow">Amazon</a>.
We looked closely at the following two:
</p>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="group"><section class="lemma common-crawl"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% Common Crawl</pre></section><table>
<tr>
<td> Type: </td>
<td> Dataset
</td></tr>
<tr>
<td> Technique: </td>
<td> scraping
</td></tr>
<tr>
<td> Developed by: </td>
<td> The Common Crawl Foundation, California, US
</td></tr></table><p><a class="external text" href="http://commoncrawl.org" rel="nofollow">Common Crawl</a> is a registered non-profit organisation founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.
</p><p><i>Common Crawl</i> completes four crawls a year. Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012. The crawl of September 2017 contains 3.01 billion web pages and over 250 TiB of uncompressed content, or about 75% of the Internet.
</p><p>The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
</p><p><i>Common Crawl</i> datasets are used to create pretrained word embeddings datasets, like GloVe (see <a class="external text" href="http://www.algolit.net/index.php/The_GloVe_Reader" rel="nofollow">The GloVe Reader</a>). word2vec is another widely used pretrained word embeddings dataset; it is based on Google News texts.
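</p><p>Such pretrained embeddings are distributed as plain text files, one word and its vector per line; a sketch of how they are typically loaded (the file name is one of the standard GloVe downloads):
</p><pre>
# Sketch: read a pretrained GloVe text file into a dictionary of vectors.
import numpy as np

embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.array(parts[1:], dtype=float)

print(embeddings["language"][:5])
</pre><p>A project that loads such a file also inherits the crawled texts the vectors were trained on, which is what makes the composition of Common Crawl relevant for algoliterary work.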
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<h4 id="datasets"><span class="mw-headline" id="Datasets">Datasets</span></h4>
<section class="lemma wikiharass"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% WikiHarass</pre></section><table>
<tr>
<td> Type: </td>
<td> Dataset
</td></tr>
<tr>
<td>Number of words: </td>
<td> 1.039.789
</td></tr>
<tr>
<td>Unique words: </td>
<td> 64.136
</td></tr>
<tr>
<td> Source: </td>
<td> English Wikipedia
</td></tr>
<tr>
<td> Developed by: </td>
<td> Wikimedia Foundation
</td></tr></table><p>The <a class="external text" href="https://meta.wikimedia.org/wiki/Research:Detox" rel="nofollow">Detox dataset</a> is a project by Wikimedia and <a href="http://www.algolit.net/index.php/Crowd_Embeddings" title="Crowd Embeddings"> Perspective API</a> to train a neural network that would detect the level of toxicity of a comment.
</p><p>The <a class="external text" href="https://figshare.com/projects/Wikipedia_Talk/16731" rel="nofollow">original dataset</a> consists of:
</p><ul><li>A corpus of all 95 million user and article talk diffs made between 2001 and 2015 scored by the personal attack model.</li>
<li>A human annotated dataset of 1m crowd-sourced annotations that cover 100k talk page diffs (with 10 judgements per diff).</li></ul><p>For Algolit, a smaller section of the Detox dataset was used, taken from <a class="external text" href="https://conversationai.github.io/wikidetox/testdata/tox-sorted/Wikipedia%20Toxicity%20Sorted%20%28Toxicity%405%5BAlpha%5D%29.html" rel="nofollow">Jigsaw's Github</a>, which contains both constructive and vandalist edits.
</p></section></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="group"><section class="lemma the-data-espeaks"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% The data (e)speaks</pre></section><table>
<tr>
<td> Type: </td>
<td> Algoliterary exploration
</td></tr>
<tr>
<td> Datasets: </td>
<td> <a href="http://www.algolit.net/index.php/Frankenstein" title="Frankenstein">Frankenstein</a>, <a href="http://www.algolit.net/index.php/Learning_from_Deep_Learning" title="Learning from Deep Learning">Learning from Deep Learning</a>, <a href="http://www.algolit.net/index.php/NearbySaussure" title="NearbySaussure">nearbySaussure</a>, <a href="http://www.algolit.net/index.php/AstroBlackness" title="AstroBlackness">astroBlackness</a>
</td></tr>
<tr>
<td> Technique: </td>
<td> espeak
</td></tr>
<tr>
<td> Developed by: </td>
<td> &amp; Algolit
</td></tr></table><p>In the process of making the Algolit datasets, careful consideration was given to the selection of the source texts. Our attempt was to have a variety of tones of voice that highlights the heterogeneity of all of them combined.
</p><p>The texts were gathered from aaaaarg.fail, gen.lib.rus.ec, archive.org and gutenberg.org, processed with terminal commands such as <a class="external text" href="https://en.wikipedia.org/wiki/Pdftotext" rel="nofollow">pdftotext</a> in order to generate .txt files and stripped of punctuation marks with the help of <a class="external text" href="https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/algoliterary-toolkit/text-punctuation-clean-up.py" rel="nofollow">a Python code snippet</a>.
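</p><p>A sketch of such a clean-up step (not the linked snippet itself; the file names are placeholders):
</p><pre>
# Sketch of the clean-up step: read a converted .txt file and strip punctuation marks.
import string

with open("book.txt", encoding="utf-8") as source:
    text = source.read()

cleaned = text.translate(str.maketrans("", "", string.punctuation))

with open("book-cleaned.txt", "w", encoding="utf-8") as target:
    target.write(cleaned)
</pre><p>The stripped .txt files are what the datasets listed below consist of.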
</p><p>The ensuing datasets are:
</p><ul><li> <a href="http://www.algolit.net/index.php/Frankenstein" title="Frankenstein">Frankenstein</a></li>
<li> <a href="http://www.algolit.net/index.php/Learning_from_Deep_Learning" title="Learning from Deep Learning">Learning from Deep Learning</a></li>
<li> <a href="http://www.algolit.net/index.php/NearbySaussure" title="NearbySaussure">nearbySaussure</a></li>
<li> <a href="http://www.algolit.net/index.php/AstroBlackness" title="AstroBlackness">astroBlackness</a></li></ul><p><i>The data (e)speaks</i> is an audio installation that reads out specific sentences from their text body.
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma frankenstein"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% Frankenstein</pre></section><table>
<tr>
<td> Type: </td>
<td> Dataset
</td></tr>
<tr>
<td>Number of words: </td>
<td> 75.092
</td></tr>
<tr>
<td>Unique words: </td>
<td> 7.205
</td></tr>
<tr>
<td> Source(s): </td>
<td> Gutenberg.org
</td></tr>
<tr>
<td> Developed by: </td>
<td> Mary Shelley (1797-1851)
</td></tr></table><p><i>Frankenstein; or, The Modern Prometheus</i> (or simply <i>Frankenstein</i>) is a novel written by English author Mary Shelley (1797-1851) that tells the story of Victor Frankenstein, a young scientist who creates a grotesque but sapient creature in an unorthodox scientific experiment. Shelley started writing the story when she was 18, and the first edition of the novel was published anonymously in London in 1818, when she was 20. Her name first appeared on the second edition, published in France in 1823.
</p><p>The Gutenberg text version of <i>Frankenstein</i> was used for an <a class="external text" href="http://constantvzw.org/site/Frankenstein-Chatbot-Parade.html?lang=en" rel="nofollow">Algolit residency and creation</a> in the framework of the Mad Scientist Festival in 2016. The text has stayed on as a dummy text for quickly trying out scripts in a literary context.
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma learning-from-deep-learning"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% Learning from Deep Learning</pre></section><table>
<tr>
<td> Type: </td>
<td> Dataset
</td></tr>
<tr>
<td>Number of words: </td>
<td> 835.867
</td></tr>
<tr>
<td>Unique words: </td>
<td> 38.587
</td></tr>
<tr>
<td> Source: </td>
<td> <a class="external text" href="https://archive.org/details/DataScienceBookV3" rel="nofollow">An Introduction to Data Science, J Stanton</a>, <a class="external text" href="https://deeplearning4j.org/neuralnet-overview.html" rel="nofollow">Deep Learning: A Practitioner's Approach, O'Reilly media</a>, <a class="external text" href="http://www.deeplearningbook.org/" rel="nofollow">Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville</a>, <a class="external text" href="http://neuralnetworksanddeeplearning.com/index.html" rel="nofollow">Neural Networks and Deep Learning, Michael Nielsen</a>, <a class="external text" href="http://www.heatonresearch.com/book/aifh-vol3-deep-neural.html" rel="nofollow">Artificial Intelligence for Humans - Volume 3: Deep Learning and Neural Networks, Jeff Heaton</a>, <a class="external text" href="http://www.apress.com/us/book/9781484228449" rel="nofollow">MatLab Deep Learning with Machine Learning - Neural Networks and Artificial Intelligence-Apress, Phil Kim</a>, <a class="external text" href="http://www.springer.com/gp/book/9783319429984" rel="nofollow">Advances in Computer Vision and Pattern Recognition, Le Lu, Yefeng Zheng, Gustavo Carneiro, Lin Yang (eds.)</a>
</td></tr></table><p>The <i>Learning from Deep Learning</i> dataset is an accumulation of 7 textbooks that give a technical explanation of deep learning. The books were all published in the last two years. This dataset was created to explore the effect of a technical, practical language on the word2vec graphs.
</p></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma nearbysaussure"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% nearbySaussure</pre></section><table>
<tr>
<td> Type: </td>
<td> Dataset
</td></tr>
<tr>
<td>Number of words: </td>
<td> 424.811
</td></tr>
<tr>
<td>Unique words: </td>
<td> 24.651
</td></tr>
<tr>
<td> Source(s): </td>
<td> <a class="external text" href="http://aaaaarg.fail" rel="nofollow">aaaaarg.fail</a>
</td></tr>
<tr>
<td> Developed by: </td>
<td> Ferdinand de Saussure, Carol Sanders (editor), Beata Stawarska, Robert M. Strozier, Algolit
</td></tr></table><p><i>nearbySaussure</i> is a compiled dataset that arose out of an interest in structuralist linguistics and the work of the Swiss linguist Ferdinand de Saussure (1857-1913). Saussure's interest manifested itself in what he called semiology: “a science which would study the life of signs within society”. Most of his thoughts were published in the book <a class="external text" href="https://archive.org/details/courseingenerall00saus" rel="nofollow">Course in General Linguistics</a> in 1916.
</p><p>The choice for this dataset was prompted by a great text by Johanna Drucker on <a class="external text" href="http://www.digitalhumanities.org/dhq/vol/7/1/000143/000143.html" rel="nofollow">performative interfaces</a>, where she emphasizes in various ways how important it is to see reading as an active, interpretative and creative act. When she refers to Saussure with these intentions, she points out that <i>Classic structuralism, as exemplified by Saussurean linguistics, de-essentialized and systematized the understanding of meaning as value, and performative materiality builds on that basic shift into the post-structuralist engagement with readerly production of texts, and beyond, to a probabilistic perspective that synthesizes these critical traditions with those of user experience.</i>
</p><p>The <i>performative reader</i>, <i>readerly production of text</i>, and <i>probabilistic user experiences</i> seem to connect in an interesting way to the numerical and statistical techniques that are applied to natural language in a machine learning practice.
</p><p>The dataset consists of the following three books that are responses to his thinking:
</p><ul><li> The Cambridge Companion to Saussure, by Carol Sanders (editor), Anna Morpurgo Davies, Rudolf Engler, John E. Joseph, W. Terrence Gordon, Claudine Normand, Julia S. Falk, Christian Puech, Stephen C. Hutchings, Steven Ungar, Peter Wunderli, Geoffrey Bennington, Simon Bouquet, Christopher Norris, Paul Bouissac </li>
<li> Saussure's Philosophy of Language as Phenomenology: Undoing the Doctrine of the Course in General Linguistics, by Beata Stawarska </li>
<li> Saussure, Derrida, and the Metaphysics of Subjectivity, by Robert M. Strozier</li></ul></section>
<section class="asciiheaderwrapper groupheader center" id="algoliterary-explorations"><pre class="ascii">
%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre><div class="asciiname">Algoliterary explorations</div><pre class="ascii">%%% %%%
%%% %%%
%%% %%%
%%% %%%</pre></section>
<section class="lemma astroblackness"><section class="asciiheaderwrapper lemmaheader"><pre class="ascii">%
%
% astroBlackness</pre></section><table>
<tr>