Commit a0395031 authored by manetta

deleting all the heavy print pdf's from the git

parent 18a4a94e
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body><p><br>
Start of the Algoliterary Encounters catalog.
</p>
<h2><span class="mw-headline" id="Introduction">Introduction</span></h2>
<ul><li> <a href="http://www.algolit.net/index.php/Algolit%27s_Algoliterary_Journey" title="Algolit's Algoliterary Journey">Algolit</a></li>
<li> <a href="http://www.algolit.net/index.php/Program" title="Program">Program</a></li></ul>
<h2><span class="mw-headline" id="Algoliterary_works">Algoliterary works</span></h2>
<ul><li> <a href="http://www.algolit.net/index.php/Oulipo_recipes" title="Oulipo recipes">Oulipo recipes</a></li>
<li> <a href="http://www.algolit.net/index.php/I-could-have-written-that" title="I-could-have-written-that">i-could-have-written-that</a></li>
<li> Obama, model for a politician</li>
<li> <a href="http://www.algolit.net/index.php/In_the_company_of_CluebotNG" title="In the company of CluebotNG">In the company of CluebotNG</a></li></ul>
<h2><span class="mw-headline" id="Algoliterary_explorations">Algoliterary explorations</span></h2>
<h3><span class="mw-headline" id="What_the_Machine_Writes:_a_closer_look_at_the_output">What the Machine Writes: a closer look at the output</span></h3>
<ul><li> <a href="http://www.algolit.net/index.php/CHARNN_text_generator" title="CHARNN text generator">CHARNN text generator</a></li>
<li> <a href="http://www.algolit.net/index.php/You_shall_know_a_word_by_the_company_it_keeps" title="You shall know a word by the company it keeps">You shall know a word by the company it keeps</a></li></ul>
<h3><span class="mw-headline" id="How_the_Machine_Reads:_Dissecting_Neural_Networks">How the Machine Reads: Dissecting Neural Networks</span></h3>
<h4><span class="mw-headline" id="Datasets">Datasets</span></h4>
<ul><li> <a href="http://www.algolit.net/index.php/Many_many_words" title="Many many words">Many many words</a> </li>
<li> <a href="http://www.algolit.net/index.php/The_data_%28e%29speaks" title="The data (e)speaks">The data (e)speaks</a></li></ul>
<h5><span class="mw-headline" id="Common_public_datasets">Common public datasets</span></h5>
<ul><li> <a href="http://www.algolit.net/index.php/Common_Crawl" title="Common Crawl">Common Crawl</a> </li>
<li> <a href="http://www.algolit.net/index.php/WikiHarass" title="WikiHarass">WikiHarass</a></li></ul>
<h5><span class="mw-headline" id="Algoliterary_datasets">Algoliterary datasets</span></h5>
<ul><li> <a href="http://www.algolit.net/index.php/Frankenstein" title="Frankenstein">Frankenstein</a></li>
<li> <a href="http://www.algolit.net/index.php/Learning_from_Deep_Learning" title="Learning from Deep Learning">Learning from Deep Learning</a> </li>
<li> <a href="http://www.algolit.net/index.php/AnarchFem" title="AnarchFem">AnarchFem</a> </li>
<li> <a href="http://www.algolit.net/index.php/Tristes_Tropiques" title="Tristes Tropiques">Tristes Tropiques</a></li></ul>
<h4><span class="mw-headline" id="From_words_to_numbers">From words to numbers</span></h4>
<ul><li> <a href="http://www.algolit.net/index.php/A_Bag_of_Words" title="A Bag of Words">A Bag of Words</a></li>
<li> <a href="http://www.algolit.net/index.php/A_One_Hot_Vector" title="A One Hot Vector">A One Hot Vector</a></li></ul>
<h4><span class="mw-headline" id="Special_Focus:_Word_Embeddings">Special Focus: Word Embeddings</span></h4>
<ul><li> <a href="http://www.algolit.net/index.php/About_Word_embeddings" title="About Word embeddings">About Word embeddings</a></li>
<li> <a href="http://www.algolit.net/index.php/Crowd_Embeddings" title="Crowd Embeddings">Crowd Embeddings</a> </li></ul>
<h5><span class="mw-headline" id="Different_portraits_of_word_embeddings">Different portraits of word embeddings</span></h5>
<ul><li> <a href="http://www.algolit.net/index.php/Word_embedding_Projector" title="Word embedding Projector">Word embedding Projector</a></li>
<li> <a href="http://www.algolit.net/index.php/5_dimensions_32_graphs" title="5 dimensions 32 graphs">5 dimensions 32 graphs</a></li>
<li> <a href="http://www.algolit.net/index.php/The_GloVe_Reader" title="The GloVe Reader">The GloVe Reader</a></li></ul>
<h5><span class="mw-headline" id="Inspecting_the_technique">Inspecting the technique</span></h5>
<ul><li> <a href="http://www.algolit.net/index.php/Word2vec_basic.py" title="Word2vec basic.py">word2vec_basic.py</a></li>
<li> <a href="http://www.algolit.net/index.php/Softmax_annotated" title="Softmax annotated">softmax annotated</a></li>
<li> <a href="http://www.algolit.net/index.php/Reverse_Algebra" title="Reverse Algebra">Reverse Algebra</a></li></ul>
<h3><span class="mw-headline" id="How_a_Machine_Might_Speak">How a Machine Might Speak</span></h3>
<ul><li> <a href="http://www.algolit.net/index.php/We_Are_A_Sentiment_Thermometer" title="We Are A Sentiment Thermometer">We Are A Sentiment Thermometer</a></li></ul>
<h2><span class="mw-headline" id="Sources">Sources</span></h2>
<ul><li> <a rel="nofollow" class="external text" href="https://gitlab.constantvzw.org/algolit/algolit/tree/master/algoliterary_encounter">Algoliterary Toolkit</a></li>
<li> <a href="http://www.algolit.net/index.php/Algoliterary_Bibliography" title="Algoliterary Bibliography">Algoliterary Bibliography</a></li></ul>
</body>
</html>
\ No newline at end of file
import nltk
sentences = [
    "I like deep learning",
    "I like NLP",
@@ -41,6 +43,6 @@ for word1 in words:
    matrix.push(row)
"""
-print "{: >10}".format('') + ' ' + ''.join(["{: <10}".format(word) for word in words])
+print("{: >10}".format('') + ' ' + ''.join(["{: <10}".format(word) for word in words]))
for k, word in enumerate(words):
-    print "{: >10}".format(word) + ' ' + ''.join(["{: <10}".format(c) for c in matrix[k]])
\ No newline at end of file
+    print("{: >10}".format(word) + ' ' + ''.join(["{: <10}".format(c) for c in matrix[k]]))
\ No newline at end of file
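The diff above only shows fragments of the co-occurrence script. As a point of reference, here is a minimal, self-contained sketch of the same idea: counting which words appear together in the example sentences and printing the matrix with the same column formatting. The sentence list is truncated in the diff, and the "same sentence" counting and str.split tokenizer are assumptions, not the repository's exact code (which imports nltk).

# Sketch of a word co-occurrence matrix over the example sentences.
# Two words "co-occur" here when they appear in the same sentence;
# the actual script may use a different window or tokenizer (assumption).
sentences = [
    "I like deep learning",
    "I like NLP",
]

tokenized = [s.split() for s in sentences]
words = sorted(set(w for sent in tokenized for w in sent))

matrix = []
for word1 in words:
    row = []
    for word2 in words:
        # count sentences that contain both words (diagonal left at zero)
        count = sum(1 for sent in tokenized
                    if word1 in sent and word2 in sent and word1 != word2)
        row.append(count)
    matrix.append(row)

# print with the same column formatting as the original script
print("{: >10}".format('') + ' ' + ''.join(["{: <10}".format(word) for word in words]))
for k, word in enumerate(words):
    print("{: >10}".format(word) + ' ' + ''.join(["{: <10}".format(c) for c in matrix[k]]))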
ORIGINAL TEXT
Begone, or let us try our strength in a fight, in which one must fall.
Yet even thus I loved them to adoration; and to save them, I resolved to dedicate myself to my most abhorred task.
Almost spent, as I was, by fatigue and the dreadful suspense I endured for several hours, this sudden certainty of life rushed like a flood of warm joy to my heart, and tears gushed from my eyes.
But even human sympathies were not sufficient to satisfy his eager mind.
And why should I describe a sorrow which all have felt, and must feel?
LITTERATURE DEFINITIONELLE
Begone , or let us try our < the property of being physically or mentally strong > in a < a hostile meeting of opposing military forces in the course of a war > , in which one must fall . Yet even thus I loved them to < a feeling of profound love and admiration > ; and to save them , I resolved to dedicate myself to my most abhorred < any piece of work that is undertaken or attempted > . Almost spent , as I was , by < temporary loss of strength and energy resulting from hard physical or mental work > and the dreadful < apprehension about what is going to happen > I endured for several hours , this sudden < the state of being certain > of < a characteristic state or mode of living > rushed like a < the rising of a body of water and its overflowing onto normally dry land > of warm < the emotion of great happiness > to my < the locus of feelings and intuitions > , and tears gushed from my eyes . But even human sympathies were not sufficient to satisfy his eager < that which is responsible for one's thoughts and feelings; the seat of the faculty of reason > . And why should I describe a < an emotion of great sadness associated with loss or bereavement > which all have felt , and must feel ?
\ No newline at end of file
ORIGINAL TEXT
Begone, or let us try our strength in a fight, in which one must fall.
Yet even thus I loved them to adoration; and to save them, I resolved to dedicate myself to my most abhorred task.
Almost spent, as I was, by fatigue and the dreadful suspense I endured for several hours, this sudden certainty of life rushed like a flood of warm joy to my heart, and tears gushed from my eyes.
But even human sympathies were not sufficient to satisfy his eager mind.
And why should I describe a sorrow which all have felt, and must feel?
Such were my thoughts when the door of my apartment was opened and Mr. Kirwin entered.
My creator, make me happy; let me feel gratitude towards you for one benefit!
Explanation!
I spoke; I told them to retire and consider of what had been said, that I would not lead them farther north if they strenuously desired the contrary, but that I hoped that, with reflection, their courage would return.
I have traversed a vast portion of the earth and have endured all the hardships which travellers in deserts and barbarous countries are wont to meet.
@@ -13,4 +13,4 @@ And why should I describe a sorrow which all have felt, and must feel?
LITTERATURE DEFINITIONELLE
Begone , or let us try our < the property of being physically or mentally strong > in a < a hostile meeting of opposing military forces in the course of a war > , in which one must fall . Yet even thus I loved them to < a feeling of profound love and admiration > ; and to save them , I resolved to dedicate myself to my most abhorred < any piece of work that is undertaken or attempted > . Almost spent , as I was , by < temporary loss of strength and energy resulting from hard physical or mental work > and the dreadful < apprehension about what is going to happen > I endured for several hours , this sudden < the state of being certain > of < a characteristic state or mode of living > rushed like a < the rising of a body of water and its overflowing onto normally dry land > of warm < the emotion of great happiness > to my < the locus of feelings and intuitions > , and tears gushed from my eyes . But even human sympathies were not sufficient to satisfy his eager < that which is responsible for one's thoughts and feelings; the seat of the faculty of reason > . And why should I describe a < an emotion of great sadness associated with loss or bereavement > which all have felt , and must feel ?
\ No newline at end of file
Such were my thoughts when the < a swinging or sliding barrier that will close the entrance to a room or building or vehicle > of my < a suite of rooms usually on one floor of an apartment house > was opened and Mr. Kirwin entered . My < terms referring to the Judeo-Christian God > , make me happy ; let me feel < a feeling of thankfulness and appreciation > towards you for one < financial assistance in time of need > ! < a statement that makes something comprehensible by describing the relevant structure or operation or circumstances etc. > ! I spoke ; I told them to retire and consider of what had been said , that I would not lead them farther north if they strenuously desired the contrary , but that I hoped that , with < a calm, lengthy, intent consideration > , their < a quality of spirit that enables you to face danger or pain without showing fear > would return . I have traversed a vast < something determined in relation to something that includes it > of the earth and have endured all the hardships which travellers in deserts and barbarous countries are wont to meet .
\ No newline at end of file
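The "LITTERATURE DEFINITIONELLE" passages above replace nouns in the Frankenstein sentences with dictionary-style glosses set between angle brackets. Below is a minimal sketch of how such a substitution can be produced with NLTK and WordNet; picking the first noun synset and using nltk's default POS tagger are assumptions, and the actual Algolit script may select definitions differently.

# Sketch of an Oulipo-style "definitional literature" substitution:
# every noun is replaced by the gloss of one of its WordNet synsets.
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('wordnet')
import nltk
from nltk.corpus import wordnet as wn

def definitional(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    out = []
    for word, tag in tagged:
        synsets = wn.synsets(word, pos=wn.NOUN) if tag.startswith('NN') else []
        if synsets:
            # angle brackets mark the substituted definition, as in the texts above
            out.append('< ' + synsets[0].definition() + ' >')
        else:
            out.append(word)
    return ' '.join(out)

print(definitional("Begone, or let us try our strength in a fight, in which one must fall."))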
Tools for computing distributed representation of words
------------------------------------------------------
We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.
Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
Bag-of-Words or the Skip-Gram neural network architectures. The user should specify the following:
- desired vector dimensionality
- the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
- training algorithm: hierarchical softmax and / or negative sampling
- threshold for downsampling the frequent words
- number of threads to use
- the format of the output word vector file (text or binary)
Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.
The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training
is finished, the user can interactively explore the similarity of the words.
More information about the scripts is provided at https://code.google.com/p/word2vec/
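To make the "interactively explore the similarity of the words" step concrete, here is a minimal sketch that loads a text-format vector file (one produced with -binary 0) and prints the nearest neighbours of a query word by cosine similarity. The file name vectors.txt and the use of numpy are assumptions; the demo scripts in this repository use the compiled ./distance tool instead.

# Sketch: load word2vec vectors saved in text format (-binary 0) and list
# the words closest to a query by cosine similarity.
import numpy as np

def load_vectors(path):
    with open(path, encoding='utf-8', errors='ignore') as f:
        n_words, size = map(int, f.readline().split())
        vocab, vecs = [], np.zeros((n_words, size), dtype=np.float32)
        for i in range(n_words):
            parts = f.readline().rstrip().split(' ')
            vocab.append(parts[0])
            vecs[i] = np.array(parts[1:1 + size], dtype=np.float32)
    # normalise so that a dot product equals cosine similarity
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return vocab, vecs

vocab, vecs = load_vectors('vectors.txt')   # hypothetical output of ./word2vec ... -binary 0
query = 'paris'
if query in vocab:
    sims = vecs @ vecs[vocab.index(query)]
    for i in sims.argsort()[::-1][1:11]:    # skip the query word itself
        print(vocab[i], round(float(sims[i]), 3))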
// Copyright 2013 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <malloc.h>
#include <ctype.h>
const long long max_size = 2000; // max length of strings
const long long N = 1; // number of closest words
const long long max_w = 50; // max length of vocabulary entries
int main(int argc, char **argv)
{
  FILE *f;
  char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
  float dist, len, bestd[N], vec[max_size];
  long long words, size, a, b, c, d, b1, b2, b3, threshold = 0;
  float *M;
  char *vocab;
  int TCN, CCN = 0, TACN = 0, CACN = 0, SECN = 0, SYCN = 0, SEAC = 0, SYAC = 0, QID = 0, TQ = 0, TQS = 0;
  if (argc < 2) {
    printf("Usage: ./compute-accuracy <FILE> <threshold>\nwhere FILE contains word projections, and threshold is used to reduce vocabulary of the model for fast approximate evaluation (0 = off, otherwise typical value is 30000)\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  if (argc > 2) threshold = atoi(argv[2]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  fscanf(f, "%lld", &words);
  if (threshold) if (words > threshold) words = threshold;
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc(words * max_w * sizeof(char));
  M = (float *)malloc(words * size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB\n", words * size * sizeof(float) / 1048576);
    return -1;
  }
  // Read the vocabulary and its vectors; normalise every vector to unit length
  for (b = 0; b < words; b++) {
    a = 0;
    while (1) {
      vocab[b * max_w + a] = fgetc(f);
      if (feof(f) || (vocab[b * max_w + a] == ' ')) break;
      if ((a < max_w) && (vocab[b * max_w + a] != '\n')) a++;
    }
    vocab[b * max_w + a] = 0;
    for (a = 0; a < max_w; a++) vocab[b * max_w + a] = toupper(vocab[b * max_w + a]);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  TCN = 0;
  // Read analogy questions (st1 st2 st3 st4) from stdin; lines starting with ":" begin a new section
  while (1) {
    for (a = 0; a < N; a++) bestd[a] = 0;
    for (a = 0; a < N; a++) bestw[a][0] = 0;
    scanf("%s", st1);
    for (a = 0; a < strlen(st1); a++) st1[a] = toupper(st1[a]);
    if ((!strcmp(st1, ":")) || (!strcmp(st1, "EXIT")) || feof(stdin)) {
      if (TCN == 0) TCN = 1;
      if (QID != 0) {
        printf("ACCURACY TOP1: %.2f %% (%d / %d)\n", CCN / (float)TCN * 100, CCN, TCN);
        printf("Total accuracy: %.2f %% Semantic accuracy: %.2f %% Syntactic accuracy: %.2f %% \n", CACN / (float)TACN * 100, SEAC / (float)SECN * 100, SYAC / (float)SYCN * 100);
      }
      QID++;
      scanf("%s", st1);
      if (feof(stdin)) break;
      printf("%s:\n", st1);
      TCN = 0;
      CCN = 0;
      continue;
    }
    if (!strcmp(st1, "EXIT")) break;
    scanf("%s", st2);
    for (a = 0; a < strlen(st2); a++) st2[a] = toupper(st2[a]);
    scanf("%s", st3);
    for (a = 0; a < strlen(st3); a++) st3[a] = toupper(st3[a]);
    scanf("%s", st4);
    for (a = 0; a < strlen(st4); a++) st4[a] = toupper(st4[a]);
    for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st1)) break;
    b1 = b;
    for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st2)) break;
    b2 = b;
    for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st3)) break;
    b3 = b;
    for (a = 0; a < N; a++) bestd[a] = 0;
    for (a = 0; a < N; a++) bestw[a][0] = 0;
    TQ++;
    if (b1 == words) continue;
    if (b2 == words) continue;
    if (b3 == words) continue;
    for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st4)) break;
    if (b == words) continue;
    // The analogy vector: st2 - st1 + st3
    for (a = 0; a < size; a++) vec[a] = (M[a + b2 * size] - M[a + b1 * size]) + M[a + b3 * size];
    TQS++;
    // Find the closest word, excluding the three question words
    for (c = 0; c < words; c++) {
      if (c == b1) continue;
      if (c == b2) continue;
      if (c == b3) continue;
      dist = 0;
      for (a = 0; a < size; a++) dist += vec[a] * M[a + c * size];
      for (a = 0; a < N; a++) {
        if (dist > bestd[a]) {
          for (d = N - 1; d > a; d--) {
            bestd[d] = bestd[d - 1];
            strcpy(bestw[d], bestw[d - 1]);
          }
          bestd[a] = dist;
          strcpy(bestw[a], &vocab[c * max_w]);
          break;
        }
      }
    }
    // Count the question as correct when the top match equals st4;
    // the first 5 sections of questions-words.txt are semantic, the rest syntactic
    if (!strcmp(st4, bestw[0])) {
      CCN++;
      CACN++;
      if (QID <= 5) SEAC++; else SYAC++;
    }
    if (QID <= 5) SECN++; else SYCN++;
    TCN++;
    TACN++;
  }
  printf("Questions seen / total: %d %d %.2f %% \n", TQS, TQ, TQS/(float)TQ*100);
  return 0;
}
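The core of compute-accuracy.c is the analogy test: for a question "a is to b as c is to ?", it forms the vector b - a + c and checks whether the nearest remaining word is the expected answer. The following is a hedged numpy sketch of that inner step, assuming unit-normalised vectors and a small in-memory vocabulary; the function name and arguments are illustrative, not part of the word2vec tools.

# Sketch of the analogy check performed by compute-accuracy.c, assuming
# `vecs` is an (n_words, size) array of unit-normalised vectors and
# `vocab` the matching list of (upper-cased) words.
import numpy as np

def analogy_is_correct(vocab, vecs, a, b, c, expected):
    idx = {w: i for i, w in enumerate(vocab)}
    if any(w not in idx for w in (a, b, c, expected)):
        return False                      # skipped question, as in the C code
    target = vecs[idx[b]] - vecs[idx[a]] + vecs[idx[c]]
    sims = vecs @ target                  # dot product == cosine on unit vectors
    for w in (a, b, c):
        sims[idx[w]] = -np.inf            # exclude the question words themselves
    return vocab[int(sims.argmax())] == expected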
make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
echo ---------------------------------------------------------------------------------------------------
echo Note that for the word analogy to perform well, the model should be trained on a much larger data set
echo Example input: paris france berlin
echo ---------------------------------------------------------------------------------------------------
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./word-analogy vectors.bin
make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
sort classes.txt -k 2 -n > classes.sorted.txt
echo The word classes were saved to file classes.sorted.txt
make
if [ ! -e news.2012.en.shuffled ]; then
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz
gzip -d news.2012.en.shuffled.gz -f
fi
sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0
time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2
time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2
tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1
time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
./compute-accuracy vectors-phrase.bin < questions-phrases.txt
make
if [ ! -e news.2012.en.shuffled ]; then
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz
gzip -d news.2012.en.shuffled.gz -f
fi
sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0
time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2
time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2
tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1
time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
./distance vectors-phrase.bin
###############################################################################################
#
# Script for training good word and phrase vector model using public corpora, version 1.0.
# The training time will be from several hours to about a day.
#
# Downloads about 8 billion words, makes phrases using two runs of word2phrase, trains
# a 500-dimensional vector model and evaluates it on word and phrase analogy tasks.
#
###############################################################################################
# This function will convert text to lowercase and remove special characters
normalize_text() {
awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
-e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
-e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
-e 's/«/ /g' | tr 0-9 " "
}
mkdir word2vec
cd word2vec
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2012.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2012.en.shuffled > data.txt
normalize_text < news.2013.en.shuffled >> data.txt
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar -xvf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
for i in `ls 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled`; do
normalize_text < 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/$i >> data.txt
done
wget http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus
tar -zxvf umbc_webbase_corpus.tar.gz webbase_all/*.txt
for i in `ls webbase_all`; do
normalize_text < webbase_all/$i >> data.txt
done
wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
bzip2 -c -d enwiki-latest-pages-articles.xml.bz2 | awk '{print tolower($0);}' | perl -e '
# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive)...
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.
# *** Modified to not spell digits or throw away non-ASCII characters ***
# Written by Matt Mahoney, June 10, 2006. This program is released to the public domain.
$/=">"; # input record separator
while (<>) {
  if (/<text /) {$text=1;} # remove all but between <text> ... </text>
  if (/#redirect/i) {$text=0;} # remove #REDIRECT
  if ($text) {
    # Remove any text not normally visible
    if (/<\/text>/) {$text=0;}
    s/<.*>//; # remove xml tags
    s/&amp;/&/g; # decode URL encoded chars
    s/&lt;/</g;
    s/&gt;/>/g;
    s/<ref[^<]*<\/ref>//g; # remove references <ref...> ... </ref>
    s/<[^>]*>//g; # remove xhtml tags
    s/\[http:[^] ]*/[/g; # remove normal url, preserve visible text
    s/\|thumb//ig; # remove images links, preserve caption
    s/\|left//ig;
    s/\|right//ig;
    s/\|\d+px//ig;
    s/\[\[image:[^\[\]]*\|//ig;
    s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig; # show categories without markup
    s/\[\[[a-z\-]*:[^\]]*\]\]//g; # remove links to other languages
    s/\[\[[^\|\]]*\|/[[/g; # remove wiki url, preserve visible text
    s/{{[^}]*}}//g; # remove {{icons}} and {tables}
    s/{[^}]*}//g;
    s/\[//g; # remove [ and ]
    s/\]//g;
    s/&[^;]*;/ /g; # remove URL encoded chars
    $_=" $_ ";
    chop;
    print $_;
  }
}
' | normalize_text | awk '{if (NF>1) print;}' >> data.txt
wget http://word2vec.googlecode.com/svn/trunk/word2vec.c
wget http://word2vec.googlecode.com/svn/trunk/word2phrase.c
wget http://word2vec.googlecode.com/svn/trunk/compute-accuracy.c
wget http://word2vec.googlecode.com/svn/trunk/questions-words.txt
wget http://word2vec.googlecode.com/svn/trunk/questions-phrases.txt
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -funroll-loops
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -funroll-loops
./word2phrase -train data.txt -output data-phrase.txt -threshold 200 -debug 2
./word2phrase -train data-phrase.txt -output data-phrase2.txt -threshold 100 -debug 2
./word2vec -train data-phrase2.txt -output vectors.bin -cbow 1 -size 500 -window 10 -negative 10 -hs 0 -sample 1e-5 -threads 40 -binary 1 -iter 3 -min-count 10
./compute-accuracy vectors.bin 400000 < questions-words.txt # should get to almost 78% accuracy on 99.7% of questions
./compute-accuracy vectors.bin 1000000 < questions-phrases.txt # about 78% accuracy with 77% coverage
make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./compute-accuracy vectors.bin 30000 < questions-words.txt
# to compute accuracy with the full vocabulary, use: ./compute-accuracy vectors.bin < questions-words.txt
make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin
// Copyright 2013 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <malloc.h>
const long long max_size = 2000; // max length of strings
const long long N = 40; // number of closest words that will be shown
const long long max_w = 50; // max length of vocabulary entries
int main(int argc, char **argv) {
  FILE *f;
  char st1[max_size];
  char *bestw[N];
  char file_name[max_size], st[100][max_size];
  float dist, len, bestd[N], vec[max_size];
  long long words, size, a, b, c, d, cn, bi[100];
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  for (a = 0; a < N; a++) bestw[a] = (char *)malloc(max_size * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    a = 0;
    while (1) {