Commit 43d7addb authored by ana

adding data folder, scripts for scoring text files per sentence/word

parent ea143895
Part I
Chapter I
It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him. The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. The flat was seven flights up, and Winston, who was thirty-nine and had a varicose ulcer above his right ankle, went slowly, resting several times on the way. On each landing, opposite the lift shaft, the poster with the enormous face gazed from the wall. It was one of those pictures which are so contrived that the eyes follow you about when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
Inside the flat a fruity voice was reading out a list of figures which had something to do with the production of pig-iron. The voice came from an oblong metal plaque like a dulled mirror which formed part of the surface of the right-hand wall. Winston turned a switch and the voice sank somewhat, though the words were still distinguishable. The instrument (the telescreen, it was called) could be dimmed, but there was no way of shutting it off completely. He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the Party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.
Outside, even through the shut window-pane, the world looked cold. Down in the street little eddies of wind were whirling dust and torn paper into spirals, and though the sun was shining and the sky a harsh blue, there seemed to be no colour in anything, except the posters that were plastered everywhere. The black-moustachio’d face gazed down from every commanding corner. There was one on the house-front immediately opposite. BIG BROTHER IS WATCHING YOU, the caption said, while the dark eyes looked deep into Winston’s own. Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight. It was the police patrol, snooping into people's windows. The patrols did not matter, however. Only the Thought Police mattered. Behind Winston's back the voice from the telescreen was still babbling away about pig-iron and the overfulfilment of the Ninth Three-Year Plan. The telescreen received and transmitted simultaneously. Any sound that Winston made, above the level of a very low whisper, would be picked up by it; moreover, so long as he remained within the field of vision which the metal plaque commanded, he could be seen as well as heard. There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate they could plug in your wire whenever they wanted to. You had to live - did live, from habit that became instinct - in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinised.
Winston kept his back turned to the telescreen. It was safer; though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This, he thought with a sort of vague distaste - this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania. He tried to squeeze out some childhood memory that should tell him whether London had always been quite like this. Were there always these vistas of rotting nineteenth-century houses, their sides shored up with baulks of timber, their windows patched with cardboard and their roofs with corrugated iron, their crazy garden walls sagging in all directions? And the bombed sites where the plaster dust swirled in the air and the willowherb straggled over the heaps of rubble; and the places where the bombs had cleared a larger patch and there had sprung up sordid colonies of wooden dwellings like chicken-houses? But it was no use, he could not remember: nothing remained of his childhood except a series of bright-lit tableaux, occurring against no background and mostly unintelligible.
I was guiltless, but I had indeed drawn down a horrible curse upon my head, as mortal as that of crime.
We sat late.
said the old man.
The agonies of remorse poison the luxury there is otherwise sometimes found in indulging the excess of grief.
Blasted as thou wert, my agony was still superior to thine, for the bitter sting of remorse will not cease to rankle in my wounds until death shall close them forever.
Man, you shall repent of the injuries you inflict.
In one corner, near a small fire, sat an old man, leaning his head on his hands in a disconsolate attitude.
I saw him on the point of repeating his blow, when, overcome by pain and anguish, I quitted the cottage, and in the general tumult escaped unperceived to my hovel.
Her mild eyes seemed incapable of any severity or guile, and yet she has committed a murder.
Never did I behold a vision so horrible as his face, of such loathsome yet appalling hideousness.
But since the murderer has been discovered--The murderer discovered!
I may die, but first you, my tyrant and tormentor, shall curse the sun that gazes on your misery.
More miserable than man ever was before, why did I not sink into forgetfulness and rest?
Yet I am certainly unjust.
This sound disturbed an old woman who was sleeping in a chair beside me.
Chapter 12 I lay on my straw, but I could not sleep.
Saying this, he suddenly quitted me, fearful, perhaps, of any change in my sentiments.
The poor victim, who on the morrow was to pass the awful boundary between life and death, felt not, as I did, such deep and bitter agony.
I paused.
Was there no injustice in this?
Yet mine shall not be the submission of abject slavery.
Why did you form a monster so hideous that even YOU turned from me in disgust?
A sister or a brother can never, unless indeed such symptoms have been shown early, suspect the other of fraud or false dealing, when another friend, however strongly he may be attached, may, in spite of himself, be contemplated with suspicion.
My swelling heart involuntarily pours itself out thus.
I am the assassin of those most innocent victims; they died by my machinations.
I shall no longer see the sun or stars or feel the winds play on my cheeks.
In a thousand spots the traces of the winter avalanche may be perceived, where trees lie broken and strewed on the ground, some entirely destroyed, others bent, leaning upon the jutting rocks of the mountain or transversely upon other trees.
Or whither does your senseless curiosity lead you?
The poor that stopped at their door were never driven away.
Mans yesterday may neer be like his morrow; Nought may endure but mutability!
Devil, cease; and do not poison the air with these sounds of malice.
Elizabeth had caught the scarlet fever; her illness was severe, and she was in the greatest danger.
I writhed under his words, yet dared not exhibit the pain I felt.
I passed the night wretchedly.
His jaws opened, and he muttered some inarticulate sounds, while a grin wrinkled his cheeks.
Felix trembled violently as he said this.
I remained silent.
I was a poor, helpless, miserable wretch; I knew, and could distinguish, nothing; but feeling pain invade me on all sides, I sat down and wept.
He became the victim of its weakness.
Fear overcame me; I dared no advance, dreading a thousand nameless evils that made me tremble, although I was unable to define them.
Fiend that thou art!
Let the cursed and hellish monster drink deep of agony; let him feel the despair that now torments me.
The wet wood which I had placed near the heat dried and itself became inflamed.
A frightful selfishness hurried me on, while my heart was poisoned with remorse.
I was alone; none were near me to dissipate the gloom and relieve me from the sickening oppression of the most terrible reveries.
I sat down, and a silence ensued.
As my sickness quitted me, I was absorbed by a gloomy and black melancholy that nothing could dissipate.
I then paused, and a cold shivering came over me.
I do not fear to die, she said; that pang is past.
Tears, unrestrained, fell from my brothers eyes; a sense of mortal agony crept over my frame.
I shall die.
His limbs were nearly frozen, and his body dreadfully emaciated by fatigue and suffering.
Cursed, cursed be the fiend that brought misery on his grey hairs and doomed him to waste in wretchedness!
What do these sounds portend?
Geneva, March 18, 17--.
Again do I vow vengeance; again do I devote thee, miserable fiend, to torture and death.
Had my eyes deceived me?
They spurn and hate me.
Then, overcome by fatigue, I lay down among some straw and fell asleep.
And do not you fear the fierce vengeance of my arm wreaked on your miserable head?
Justine shook her head mournfully.
I paused; at length he spoke, in broken accents: Unhappy man!
How often did I imprecate curses on the cause of my being!
But now crime has degraded me beneath the meanest animal.
During the whole of this wretched mockery of justice I suffered living torture.
Wherefore not?
Shall I not then hate them who abhor me?
I never beheld anything so utterly destroyed.
My own agitation and anguish was extreme during the whole trial.
I remained motionless.
He threatened excommunication and hell fire in my last moments if I continued obdurate.
I, a miserable wretch, haunted by a curse that shut up every avenue to enjoyment.
The monster continued to utter wild and incoherent self-reproaches.
There he lies, white and cold in death.
Miserable himself that he may render no other wretched, he ought to die.
Oh, Frankenstein!
Why did I not die?
I could have torn him limb from limb, as the lion rends the antelope.
Why did I not then expire!
Leave me; I am inexorable.
I never saw a man in so wretched a condition.
I pitied Frankenstein; my pity amounted to horror; I abhorred myself.
I do refuse it, I replied; and no torture shall ever extort a consent from me.
Poor little fellow!
Or rather, stay, that I may trample you to dust!
I have endured incalculable fatigue, and cold, and hunger; do you dare destroy my hopes?Begone!
Miserable, unhappy wretch!
Ugly wretch!
Chapter 16 Cursed, cursed creator!
I did confess, but I confessed a lie.
Thus I might proclaim myself a madman, but not revoke the sentence passed upon my wretched victim.
Oh, not abhorred!
No guilt, no mischief, no malignity, no misery, can be found comparable to mine.
Poor, poor girl, is she the accused?
What a miserable night I passed!
I did not yet entirely know the fatal effects of this miserable deformity.
Geneva, May 12th, 17--.
I never could survive so horrible a misfortune.
Nay, then I was not miserable.
No; I am not so selfish.
I am interrupted.
Poor William!
I trembled.
Oh, no!
Suddenly a heavy storm of rain descended.
Poor girl!
You may hate, but beware!
I knocked.
I thought (foolish wretch!)
He struggled violently.
I am malicious because I am miserable.
Abhorred monster!
Poor Clerval!
Unfeeling, heartless creator!
Do not despair.
I trembled violently, apprehending some dreadful misfortune.
Do not fear.
Scoffing devil!
Begone, vile insect!
Wretched devil!
Hypocritical fiend!
Hideous monster!
@@ -3,8 +3,8 @@
This script applies a trained model to textfiles.
It splits the text into sentences and predicts a sentiment score for each of the sentences.
The score & sentence are saved in a file, ordered from lowest to highest score
It splits the text into sentences and predicts a sentiment score for each sentence or for each word
The score & sentence/word are saved in a file, ordered from lowest to highest score
clean your text using
@@ -13,15 +13,10 @@ clean your text using
import numpy as np
import pandas as pd
import re
import time
import random
import os, sys
import nltk
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.externals import joblib
@@ -30,16 +25,6 @@ from sklearn.externals import joblib
### ------------
def write(sentence):
    words = sentence.split(" ")
    for word in words:
        for char in word:
            sys.stdout.write('%s' % char)
        sys.stdout.write(" ")
def archive(sentence, filename):
    with open(filename, "a") as destination:
@@ -71,47 +56,6 @@ def load_embeddings(filename):
    return pd.DataFrame(arr, index=labels, dtype='f')
### Load Lexicon of POSITIVE and NEGATIVE words
### -------------------------------------------
def load_lexicon(filename):
    """
    Load a file from Bing Liu's sentiment lexicon, containing
    English words in Latin-1 encoding.

    One file contains a list of positive words, and the other contains
    a list of negative words. The files contain comment lines starting
    with ';' and blank lines, which should be skipped.
    """
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon
### See the sentiment that this classifier predicts for particular words
### --------------------------------------------------------------------
def vecs_to_sentiment(vecs):
    # predict_log_proba gives the log probability for each class
    predictions = model.predict_log_proba(vecs)
    # To see an overall positive vs. negative classification in one number,
    # we take the log probability of positive sentiment minus the log
    # probability of negative sentiment.
    return predictions[:, 1] - predictions[:, 0]
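The subtraction in `vecs_to_sentiment` yields a log-odds score: positive when the positive class is more likely, negative otherwise, and zero when both are equally likely. The following is a minimal sketch of that arithmetic in plain Python, assuming each row is ordered `[negative, positive]` (which matches targets of -1 and 1 sorted by scikit-learn). The function name and the example probabilities are hypothetical and not part of the script.

```python
import math

def log_odds_from_log_probs(log_probs):
    """Return log P(positive) - log P(negative) for each [negative, positive] row."""
    return [row[1] - row[0] for row in log_probs]

# Hypothetical (negative, positive) class probabilities for three words.
example_probs = [(0.5, 0.5), (0.1, 0.9), (0.8, 0.2)]
example_log_probs = [(math.log(neg), math.log(pos)) for neg, pos in example_probs]

# One score per word: zero, positive, negative (in that order).
scores = log_odds_from_log_probs(example_log_probs)
```

A score of `log(9)` for the second row means the positive class is nine times as likely as the negative one.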
### Use sentiment function to see some examples of its predictions on the test data
### --------------------------------------------------------------------------------
def words_to_sentiment(words):
    vecs = embeddings.loc[words].dropna()
    log_odds = vecs_to_sentiment(vecs)
    return pd.DataFrame({'sentiment': log_odds}, index=vecs.index)
### Combine sentiments for word vectors into an overall sentiment score by averaging them
### --------------------------------------------------------------------------------------
TOKEN_RE = re.compile(r"\w.*?\b")
@@ -123,20 +67,6 @@ def text_to_sentiment(text):
    sentiments = words_to_sentiment(tokens)
    return sentiments['sentiment'].mean()
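The averaging step can be illustrated without embeddings or a trained model. This sketch reuses the `TOKEN_RE` pattern from the script, but replaces the classifier with a small hypothetical lookup table (`TOY_SCORES`). Lower-casing the tokens and skipping unknown words are assumptions that stand in for the elided tokenisation line and for `.dropna()`.

```python
import re

TOKEN_RE = re.compile(r"\w.*?\b")

# Hypothetical per-word scores standing in for the classifier's log-odds.
TOY_SCORES = {"bright": 2.0, "cold": -1.0, "vile": -3.0}

def text_to_sentiment_sketch(text, scores=TOY_SCORES):
    """Average the scores of the known words in `text`; None if no word is known."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    known = [scores[token] for token in tokens if token in scores]
    if not known:
        return None
    return sum(known) / len(known)

sentence_score = text_to_sentiment_sketch("It was a bright cold day")
```

Because the score is a plain mean, one strongly negative word can be cancelled out by a few mildly positive ones. That is a limitation of the averaging approach, not of this sketch alone.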
### Use Pandas to make a table of names, their predominant ethnic background, and the predicted sentiment score
### ------------------------------------------------------------------------------------------------------------
def name_sentiment_table():
    frames = []
    for group, name_list in sorted(NAMES_BY_ETHNICITY.items()):
        lower_names = [name.lower() for name in name_list]
        sentiments = words_to_sentiment(lower_names)
        sentiments['group'] = group
        frames.append(sentiments)
    # Put together the data we got from each ethnic group into one big table
    return pd.concat(frames)
### ------------------------------------------
@@ -144,165 +74,17 @@ def name_sentiment_table():
embeddings = load_embeddings('data/glove.42B.300d.txt')
embeddings = load_embeddings('data/glove.840B.300d.txt')
#embeddings = load_embeddings('data/glove.42B.300d.txt')
#embeddings = load_embeddings('data/glovesample.txt')
filename = 'data/1984_all_stripped.txt'
#filename = 'data/1984_fragment.txt'
#filename = 'data/frankenstein_for_machines.txt'
pos_output = filename.replace('.txt','_pos.txt')
neg_output = filename.replace('.txt','_neg.txt')
### Welcome & choice
### ----------------
# rows = embeddings.shape[0]
# columns = embeddings.shape[1]
# pos_words = load_lexicon('data/positive-words.txt')
# neg_words = load_lexicon('data/negative-words.txt')
# ### CLEAN UP positive and negative words
# ### ------------------------------------
# #the data points here are the embeddings of these positive and negative words.
# #We use the Pandas .loc[] operation to look up the embeddings of all the words.
# pos_vectors = embeddings.loc[pos_words]
# neg_vectors = embeddings.loc[neg_words]
# #Some of these words are not in the GloVe vocabulary, particularly the misspellings such as "fancinating".
# #Those words end up with rows full of NaN to indicate their missing embeddings, so we use .dropna() to remove them.
# pos_vectors = embeddings.loc[pos_words].dropna()
# neg_vectors = embeddings.loc[neg_words].dropna()
# print("\t\tTidied up, you see that each word is represented by exactly 300 points in the vector landscape: \n", pos_vectors[:5], "\n")
# #time.sleep(10)
# len_pos = len(pos_vectors)
# len_neg = len(neg_vectors)
# '''
# Now we make arrays of the desired inputs and outputs.
# The inputs are the embeddings, and the outputs are 1 for positive words and -1 for negative words.
# We also make sure to keep track of the words they're labeled with, so we can interpret the results.
# '''
# vectors = pd.concat([pos_vectors, neg_vectors])
# targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
# labels = list(pos_vectors.index) + list(neg_vectors.index)
# ### ___________________
# '''
# Using the scikit-learn train_test_split function, we simultaneously separate the input vectors,
# output values, and labels into training and test data, with 10% of the data used for testing.
# '''
# train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
# train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)
# '''
# Now we make our classifier, and train it by running the training vectors through it for 100 iterations.
# We use a logistic function as the loss, so that the resulting classifier can output the probability
# that a word is positive or negative.
# '''
# model = SGDClassifier(loss='log', random_state=0, n_iter=100)
# model.fit(train_vectors, train_targets)
# '''
# ### EVALUATION - Finetuning the scoring
# ### ____________________________________
# We evaluate the classifier on the test vectors.
# It predicts the correct sentiment for sentiment words outside of its training data 95% of the time.
# Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances:
# -> When it predicts yes, how often is it correct?
# Recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over
# the total amount of relevant instances: of all the truly positive words, how many did the classifier find?
# Confusion Matrix: True Positives | False Negatives
# False Positives | True Negatives
# '''
# # scikit-learn's signature is confusion_matrix(y_true, y_pred). With labels=[1, -1],
# # rows are the actual classes and columns the predicted classes, positive first,
# # which matches the layout sketched above.
# cm = confusion_matrix(test_targets, model.predict(test_vectors), labels=[1, -1])
# #print("confusion matrix", cm)
# print("\t\tLet's ", blue("evaluate our findings!\n"))
# #time.sleep(4)
# print("\t\tFor each of 10% of test words, we predict their overall sentiment.\n")
# #time.sleep(4)
# print("\t\tWe compare our results to the given labels.\n")
# #time.sleep(4)
# print("\t\tFor this test we scored the following:\n")
# #time.sleep(4)
# TP = cm[0][0]
# FN = cm[0][1]
# FP = cm[1][0]
# TN = cm[1][1]
# print("\t\tWe matched ", green(str(TP))," items correctly as positive words.\n")
# #time.sleep(4)
# print("\t\tThese are also called ", red("True Positives."))
# #time.sleep(4)
# print("\n")
# print("\t\tWe mismatched ", green(str(FP))," items, we labeled them incorrectly as positive words.\n")
# #time.sleep(4)
# print("\t\tThese are also called ", red("False Positives."))
# #time.sleep(4)
# print("\n")
# print("\t\tWe matched ", green(str(TN))," items, we labeled them correctly as negative words.\n")
# #time.sleep(4)
# print("\t\tThese are also called ", red("True Negatives."))
# #time.sleep(4)
# print("\n")
# print("\t\tWe mismatched ", green(str(FN))," items, we labeled them incorrectly as negative words.\n")
# #time.sleep(4)
# print("\t\tThese are also called ", red("False Negatives."))
# #time.sleep(4)
# print("\n")
# ### QUESTION::: How to map weights features/outcome numbers back to original words??]
# ### QUESTION: how to show examples of TP/FP/TN/FN???
# #print("Weights assigned to features: ", model.coef_)
# accuracy_score = (accuracy_score(model.predict(test_vectors), test_targets))
# print("\t\tOur accuracy score is ", accuracy_score)
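The evaluation terms used in the commented-out block (true/false positives and negatives, precision, recall, accuracy) can be checked by hand on a tiny example. The sketch below counts the four cells directly instead of calling scikit-learn. The function name and the label lists are hypothetical; 1 marks a positive word and -1 a negative one, as in the targets above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary labelling."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical gold labels and predictions for five test words.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, 1, 1]

counts = confusion_counts(y_true, y_pred)
# Precision: of the words predicted positive, how many really are positive.
precision = counts["TP"] / (counts["TP"] + counts["FP"])
# Recall: of the truly positive words, how many were predicted positive.
recall = counts["TP"] / (counts["TP"] + counts["FN"])
# Accuracy: share of all words labelled correctly.
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision and recall are both 2/3 while accuracy is 3/5.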
# '''
# ### Predicted sentiment for Particular Word
# ### ________________________________________
# Let's use the function vecs_to_sentiment(vecs) and words_to_sentiment(words) above to see the sentiment that this classifier predicts for particular words,
# to see some examples of its predictions on the test data.
# '''
# # Show 20 examples from the test set
# samples = words_to_sentiment(test_labels).ix[:20]
# print("\t\tHere are a few samples to get an idea: \n", samples)
# #time.sleep(4)
# print("\n")
# '''
# There are many ways to combine sentiments for word vectors into an overall sentiment score.
# Again, because we're following the path of least resistance, we're just going to average them.
# '''
### -----------------
# # using joblib
# joblib.dump(model, 'sentiment_thermometer_glove.pkl')
scored_words = filename.replace('.txt','_scored_words.txt')
@@ -322,10 +104,10 @@ with open(filename, "r") as source:
print("line", line)
# this returns a list with 1 element containing the entire text, sentences separated by \n
sentences = '\n'.join(finding_sentences.tokenize(line.strip()))
print("sentences", sentences)
#print("sentences", sentences)
# transform string into list of sentences
sentences_list = sentences.split("\n")
print('sentences_list', sentences_list)
#print('sentences_list', sentences_list)
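The overall flow described in the script's docstring (split a text into sentences, score each one, save them ordered from lowest to highest score) can be sketched as follows. This is an illustration under stated assumptions, not the script's own code: a simple regular expression stands in for the NLTK tokenizer held in `finding_sentences`, and `score` is any hypothetical callable that maps a sentence to a number.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def rank_sentences(text, score):
    """Return (score, sentence) pairs ordered from lowest to highest score."""
    scored = [(score(sentence), sentence) for sentence in split_sentences(text)]
    return sorted(scored, key=lambda pair: pair[0])

# Sentence length is used as a stand-in score, purely for demonstration.
ranked = rank_sentences(
    "I paused. Was there no injustice in this? I shall die.", len
)
ordered_sentences = [sentence for _, sentence in ranked]
```

Sorting on the score alone keeps the original order for sentences with equal scores, since Python's sort is stable.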