rostlab / js16_projectd_group4 Goto Github PK

Joffrey Baratheon is one of the most loathed characters in TV history. As a matter of fact people were celebrating his TV death on Twitter. We are interested to learn more on how people feel about different characters by analyzing tweets mentioning GoT characters. In this project you will be analyzing Twitter feeds across a timeline, you will look for the name of GoT characters in that feed and try to identify whether the tweet is positive or negative. You can then generate a metric that evaluates what is the accumulated sentiment expressed on Twitter for that given character at a given point in time, and what is the trend (positive, negative). It will be interesting to intersect the sentiments for characters following the airing of a certain episode (you can easily get the airing date for an episode from the database constructed in Project A).
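The metric described above could be sketched roughly as follows. This is a minimal illustration only: the tweet shape (`time` in milliseconds, numeric `score`) and the function names are assumptions, not the project's actual schema or API.

```javascript
// Hypothetical tweet shape: { time: <ms since epoch>, score: <number> }.

// Sum of all sentiment scores up to and including `until`.
function accumulatedSentiment(tweets, until) {
    return tweets
        .filter(t => t.time <= until)
        .reduce((sum, t) => sum + t.score, 0);
}

// Trend: compare the summed score in the window ending at `until`
// with the window of the same length directly before it.
function trend(tweets, until, windowMs) {
    const sumIn = (from, to) => tweets
        .filter(t => t.time > from && t.time <= to)
        .reduce((sum, t) => sum + t.score, 0);
    const recent = sumIn(until - windowMs, until);
    const previous = sumIn(until - 2 * windowMs, until - windowMs);
    if (recent > previous) return 'positive';
    if (recent < previous) return 'negative';
    return 'flat';
}
```

Choosing `until` as a fixed offset after an episode's airing date would give the per-episode comparison mentioned above.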

License: GNU General Public License v3.0

JavaScript 96.08% HTML 1.80% CSS 1.94% Shell 0.18%

Forkers

emiliyana

js16_projectd_group4's Issues

Number of tweets analyzed

Fellas, as part of the media blitz we're planning there will be a press release that will throw some big numbers at the readers. Can you provide some impressive statistics about the data your tools processed, e.g. "our crawler fetched 2M tweets and processed 10M sentiment keywords"? Anything that you think might be interesting IS interesting.

Improve sentiment module training

The percentage of tweets with a usable Sentiment score seems rather low.
Someone should analyze some of the tweets with score=0 and look for words which we can manually add to the wordlist.

retext().use(sentiment, {
    'cat': -3,
    'dog': 3
});
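To find candidates for the wordlist, one option is to count which words occur most often in tweets that scored 0. The sketch below assumes tweets are available as objects with `text` and `score` fields; that shape and the function name are hypothetical.

```javascript
// Hypothetical input: array of { text: string, score: number }.
// Returns the `limit` most frequent words among tweets with score 0,
// as [word, count] pairs sorted by descending count.
function frequentUnscoredWords(tweets, limit) {
    const counts = new Map();
    for (const tweet of tweets) {
        if (tweet.score !== 0) continue;
        const words = tweet.text.toLowerCase().match(/[a-z']+/g) || [];
        for (const word of words) {
            counts.set(word, (counts.get(word) || 0) + 1);
        }
    }
    return Array.from(counts.entries())
        .sort((a, b) => b[1] - a[1])
        .slice(0, limit);
}
```

The top of that list will mostly be stop words, so it still needs a manual pass before anything goes into the wordlist.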

Share schema definition with project A and use same database

In an effort to bring together the pieces, we came up with the following requirements:

The schema of what you store about the sentiments of a character needs to be merged with the database of project A.

This can be done in the following fashion:

{
  name: "Some character",
  ...,
  D5: {
     // D5's schema here
  },
  D4: {
     // YOUR TWITTER SCHEMA HERE
  }
}

Where I mean this by your twitter schema.

You will write directly on the database that A uses that should be somewhere on mLab. If @kordianbruck is using his own server to store the data, please come up with some idea.. I understand that you are not keen on having an API over mutable data, but until 1 day ago you could get DDosed.. so.. you know 💃

@AlexMoroz I assigned the issue to you because E should supervise this, but defer commitment to one of your teammates.

MouseDrag in chart

When dragging the mouse while there are two charts on one page, the last initialized chart is changed
and the background lines of the first chart move.

Coding style

Please, please your code should be very, very well structured (indentation!!) and understandable!

This is not: https://github.com/Rostlab/JS16_ProjectD_Group4/blob/develop/crawler/twitter.js

commenting out console.log isn't nice. Delete them
indentation, indentation, indentation!
your comments are not really helping me understand the code (maybe it's indentation, maybe I'm just stupid , which is perfectly possible ahahahahah )

@julienschmidt you approved that pull request. Approving pull requests is not only seeing it and merging it, a quick look at the code and some comments in case they are needed are also very accepted! :)

Visualization Fixes

Main issues
~~- Fix Dates to align with Eastern Time~~ Changed timezone to UTC
~~- Fix Touch Interaction with Buttons~~ Done!
~~- Zooming buttons (for single touchpads)~~ can't manage in time for the deadline
~~- Limit domain to available data if less than a few months~~ Done!

~~Minor issues~~ Skipped due to missing time
~~- Add clicking functionality to scrollbar, make it align with mouse / touch gesture~~

~~Rename trendline (in ________________ )~~

~~- Redesign Score / Day (Hour) so it doesn't look like a button~~ Done!

Get Tweets

get last 7 days for each character (or more, if somehow possible)
get new tweets with streaming API

Keeping @sacdallago busy

Remember to upload the new version @sacdallago 😉

[meta] Feature Freeze

~~Define DB Schema for Project A #20~~
~~Export Crawler init functions~~
Export Crawler update functions
Define CSV API
Export Function to render chart for one Character
~~¿ List of most popular characters (or are they just using the DB) ? #18~~
JS / CSS files which have to be included || Function to set path so that we get them right in our Views

See https://github.com/Rostlab/JS16_ProjectD_Group4/milestones/Feature%20freeze

[meta] status report

@marcusnovotny @santanumohanta what are you working on? I haven't seen any code from you in over a week. Our deadlines are approaching fast 😉

Status of my work: My mobile-site crawler should now work fine. The JS API should now also be more or less production-ready. I'm currently working on getting the aggregation (analyzing the tweets and write the CSV files) working. Expect a PR later. This will also be the last thing I work on.

Revert master to initial state

Hey Julien, I just checked master and saw it was not in the cb4f4d0 state. Could you revert your commit?

Thx

Data Aggregation

We should make a plan for how we aggregate the data.

Our input is a database with tons of tweets and our output should be CSV data per character and week / day / hour (/ 10/5/1 minutes?)

My idea would be local "buckets" (basically just a multi-dimensional array) where we can easily append data from streamed tweets etc. Any other ideas?
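A rough sketch of the bucket idea, assuming each tweet carries a timestamp and a sentiment score. It uses a `Map` keyed by the truncated UTC timestamp instead of a multi-dimensional array; the field names and CSV columns are illustrative assumptions, not the project's CSV format.

```javascript
// Truncate a timestamp to the start of its UTC hour or day.
function bucketKey(date, resolution) {
    const iso = new Date(date).toISOString(); // e.g. 2016-03-30T00:01:50.000Z
    return resolution === 'hour' ? iso.slice(0, 13) : iso.slice(0, 10);
}

// Append one tweet ({ time, score }) to its bucket; works for streamed tweets.
function addToBuckets(buckets, tweet, resolution) {
    const key = bucketKey(tweet.time, resolution);
    const bucket = buckets.get(key) || { count: 0, positive: 0, negative: 0 };
    bucket.count += 1;
    if (tweet.score > 0) bucket.positive += tweet.score;
    if (tweet.score < 0) bucket.negative += -tweet.score;
    buckets.set(key, bucket);
    return buckets;
}

// Serialize all buckets in chronological order.
function toCsv(buckets) {
    const keys = Array.from(buckets.keys()).sort();
    const rows = keys.map(k => {
        const b = buckets.get(k);
        return [k, b.count, b.positive, b.negative].join(',');
    });
    return ['date,count,positive,negative'].concat(rows).join('\n');
}
```

Because each tweet only touches one bucket, the same functions can be fed from a database cursor or from the streaming API.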

Suspicious requests

Hey guys. Looking at the server logs (of the api) I see a bunch of:

[Wed Mar 30 2016 00:01:50] [LOG]   [Worker 4] Request incoming: /sentiment/find
[Wed Mar 30 2016 00:01:50] [LOG]   [Worker 4] 401 - no token sent

Between 21:47 and 1:50. I don't know if this was you, the other twitter group, or someone else. I just wanted to report, in case you were having some issues or spontaneous exceptions in your code, this might be the reason!

Make seasons visible on full zoom out

Hey @sacdallago @yashha

Can you please pull the latest changes from pull request #127 ? I noticed the got.show antagonist feature doesn't show the season labels when zooming out even though there's enough space so I adjusted them to show for widths > 400. It's just a 1 line change

Crawling blacklist

Here is a very incomplete list of names which are too generic / short / have a different meaning and therefore produce unusable results or just bloat the database (e.g. Will). I'd therefore like to blacklist them from the crawling by a simple attribute.
Please let me know if any of those persons is important.

Will
Will Humble
Willam
Yna
Zei
Quill
Rafe
Randa
Rigney
Rob
Robin
Roggo
Rolder
Rolfe
Rorge
Rosey
Rowan
Rudge
Rugen
Rus
Ryk
Sallor
Satin
Serra
Shella
Skinner
Squint
Squirrel
Tarle
Terro
Timon
Timeon
Timoth
Timett
Utt
Val
Violet
Wallen
Walton
Wate
Watt
Addam
Aggo
Albett
Alia
Alyn
Amabel
Armen
Arneld
Arnell
Arron
Bandy
Bannen
Barre
Barra
Bass
Barth
Becca
Beck
Bella
Ben
Benedict
Bennis
Biter
Boy
Brea
Brella
Brenett
Briar
Bryen
Bump
Buu
Byron
Cass
Carrot
Cerrick
Chiggen
Clement
Conn
Dake
Dalla
Dan
Del
Desmond
Dirk
Dobber
Dolf
Easy
Eggon
Elza
Fern
Ferret
Frenya
Gage
Galt
Gared
Garizon
Gariss
Gascoyne
Gavin
Gerren
Ghael
Gillam
Gilly
Grunt
Gulian
Hake
Hali
Halder
Haldon
Harra
Harwin
Helly
Henk
Hod
Hobb
Hoke
Holger
Holly
Hugh
Husband
Iggo
Jacks
Jayde
Kegs
Kurz
Kyle
Lanna
Leathers
Lem
Lenn
Lester
Lorren
Lothar
Lum
Maddy
Maggy
Mago
Malcolm
Maris
Matrice
Matt
Meg
Mela
Mord
Moro
Mullin
Mushroom
Myles
Nail
Nan
Nella
Nolla
Notch
Ogo
Ossy
Owen
Penny
Pia
Pudding
Qos
Quaro

No more Yellow Dick

Yellow Dick is apparently still lingering in the got show database. We'll have to manually remove him with db.charactersentiments.remove({"name": "Yellow Dick"}) so he doesn't show up in the statistics 😆

Report

5 pages
write up mostly about topic (references)
basically translating slides into text

Last page about project:

discussion about challenges that we faced
who did what?
include final slides
references pages

Deadline: 2016-04-15

Partly missing tweet history

For most of our main characters, the tweet history goes back all the way to 2010-2011. However, I noticed that for some of them we only have data from mid / end 2014 on (Daenerys Targaryen, Sansa Stark) or even later (Tyrion Lannister starts in August 2015), even though those characters were already important before. The history also looks cut off unnaturally. What could prevent the crawler from getting those older tweets?

https vs http

Hey guys, I noticed this

Can't you add a layer of abstraction giving the possibility to the user implementing the package to decide whether he/she wants to use https or http?

I say this because:

The package needs to be as reusable as possible, and enforcing the most restrictive of two alternatives is already meh
I will probably run the whole thing unencrypted because who has time to generate an SSL key anyway 💃
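One way such an abstraction could look, as a sketch only: the option name `protocol` is hypothetical and not part of the package's actual config.

```javascript
var http = require('http');
var https = require('https');

// `config.protocol` is a hypothetical option name.
// Defaults to https unless the integrator explicitly asks for http.
function pickTransport(config) {
    return config && config.protocol === 'http' ? http : https;
}
```

Both Node modules expose the same `request` / `get` / `createServer` functions, so the rest of the code can stay unchanged.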

Visualization

What's the progress on the visualization part?
At least some graphical part (with mock data) would be nice for the presentation.

CC @marcusnovotny

[deployment] Where to store which data?

I'm still not really clear where we are supposed to store which data and how we provide data to the other Projects.

As far as I understood, we are supposed to use the same DB as Project A to save our data on characters (#20). So we should update the DB directly with our metadata on characters?

@sacdallago told us that we may use "our own" (i.e. configurable local DB?) for the Tweets. This database might get huge... up to several gigabytes. Do we leave that up to the integration group and just provide a JS function to initialize the DB configured in the config?
Should we provide a dump of our existing crawled data?

How will our chart be integrated? We basically just need one client-side JS file and one CSS file to be included. In addition, the init function has to be called on one SVG node.

The JS Script then requests CSV data. For that we need an HTTP handler available. Do we also provide just the controller function?

Anything else I forgot, @marcusnovotny @santanumohanta ?

API: Most popular characters

Controller to return the most popular characters

CSV files too large for the server

@julienschmidt @marcusnovotny guys, we have an EMERGENCY. We really need your HELP!

The site keeps getting down and the servers are 100% full. Is there a way for storing the files somewhere else?

SyntaxError: Unexpected token ...

after executing var d4 = require('gotsentimental');

/home/yasar/Workspace/web/jsseminar/JS16_ProjectF/node_modules/gotsentimental/core/debug.js:10
        console.log("[\x1b[36m"+prefix+"\x1b[0m]", ...args);
                                                   ^^^

SyntaxError: Unexpected token ...
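This error most likely means the installed Node.js version predates support for spread syntax in function calls. A version of the logging call that avoids spread could look like the sketch below; the function name `log` and the returned array are illustrative assumptions, not the actual contents of `core/debug.js`.

```javascript
// Same output as console.log(prefixTag, ...args), without spread syntax.
function log(prefix) {
    // Collect everything after `prefix` without rest parameters.
    var args = Array.prototype.slice.call(arguments, 1);
    var line = ['[\x1b[36m' + prefix + '\x1b[0m]'].concat(args);
    console.log.apply(console, line);
    return line; // returned only to make the sketch easy to check
}
```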

Replace Grunt with Gulp

Guys, if you do not have any sincere feelings towards Grunt 😜 can I replace it with Gulp? 😈

Gulp: Lint not enforced

Currently warnings by jshint are not treated as errors and therefore don't let the test fail.

We're strict here, please change the behavior so that warnings are treated as errors 😈

Update D4 latest changes

@sacdallago @yashha

Final changes applied to our project. Please update your copy of our Twitter Sentiment Tool with the following files.

Download link

/csv

Replaces your /example/app/csv folder

/gotsentimental

Dump of our database, created with mongodump. Use mongorestore

chart.js

Replaces your /public/chart.js

defaults.json

Replaces your defaults.json

mobile.js

Replaces your /crawler/mobile.js

Please get back to me if there are any problems. Looking forward to the finished thing!

Twitter Streaming API

What is the current status of the Streaming API @santanumohanta ?
How does it have to be activated? Once or in Intervals?

Database for tweets

Hello Guys,

this is to start the discussion about the tweet database that stores all tweets gathered. I still think it is a good idea to use the same database for both our groups to increase sample size.

Now we need to figure out what database to use and what information we want to store. I took a look at your schema you used for mongodb and we have nothing to add, it seems to cover all the information needed. (We can drop some information though)

Happy to hear your thoughts on this.

Make a log-free or > /dev/null version of the package

I'm trying to debug Project F and although all your logging is nice, it's also sort of mega spamming the console :D
Can you have a look at if it's possible to easily re-route the log to something else? A file?
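A minimal sketch of what a re-routable log could look like, assuming all log output goes through one function. The names `setLogSink` and `debugLog` are hypothetical and not part of the package.

```javascript
// Default sink prints to the console; setLogSink swaps it out.
var sink = function (line) { console.log(line); };

// Pass a function to redirect log lines, or null to silence them.
function setLogSink(fn) {
    sink = typeof fn === 'function' ? fn : function () {};
}

function debugLog(prefix, message) {
    sink('[' + prefix + '] ' + message);
}
```

With this, `setLogSink(null)` gives the log-free variant, and a sink that calls `fs.appendFileSync` with a file path would route everything to a file.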

Thx

Adding new items to our crawler list

Hey there!

A question for @julienschmidt - can we just add new items to the list of characters we crawl without disturbing the other projects? "House Stark", "House Lannister" and "George RR Martin" would be really cool to have for the landing pages

move API docs to the wiki

keep install, usage etc in readme.

Scraping Twitter mobile site

1. Make search

Go to https://mobile.twitter.com/search?q={character+name}&s=typd&x=13&y=16, e.g. https://mobile.twitter.com/search?q=tyrion+lannister&s=typd&x=13&y=16 with JavaScript disabled (otherwise a non-static site is served) or just wget / cURL it.

2. Extract content

We get a list of tweets. For each tweet there is an entry like the following:

<table class="tweet  " href="/PrinceSalad1/status/710145884009467904?p=v">
    <tr class="tweet-header ">
        <td class="avatar" rowspan="3">
            <a href="/PrinceSalad1?p=i">
                <img alt="Be The Change!" src="https://pbs.twimg.com/profile_images/693937036697419776/K0f92PDI_normal.jpg" />
            </a>
        </td>
        <td class="user-info">
            <a href="/PrinceSalad1?p=s">
                <strong class="fullname">Be The Change!</strong>
                <div class="username">
                    <span>@</span>PrinceSalad1
                </div>
            </a>
        </td>
        <td class="timestamp">
            <a name="tweet_710145884009467904" href="/PrinceSalad1/status/710145884009467904?p=p">2h</a>
        </td>
    </tr>
    <tr class="tweet-container">
        <td colspan="2" class="tweet-content">
            <div class="tweet-text" data-id="710145884009467904">
                <div class="dir-ltr" dir="ltr"> <span class="twitter-hit-highlight">Tyrion</span> <span class="twitter-hit-highlight">Lannister</span> is the smartest man I have ever seen on TV 😭
                </div>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2" class="meta-and-actions">
            <span class="metadata">
        <a href="/PrinceSalad1/status/710145884009467904?p=v">View details</a>
        <span class="middot">&middot;</span>
            </span>
            <span class="tweet-actions">
        <a href="/PrinceSalad1/reply/710145884009467904" class="first">
            <img alt="Reply"   src="https://ma.twimg.com/twitter-mobile/3d969fa99f67efe5d80be77d08a716718f92bfea/images/sprites/tweet_reply.gif">
        </a>
          <a href="/statuses/710145884009467904/retweet">
                <img alt="Retweet"   src="https://ma.twimg.com/twitter-mobile/3d969fa99f67efe5d80be77d08a716718f92bfea/images/sprites/tweet_rt.gif">
          </a>
        <a href="/statuses/710145884009467904/favorite?authenticity_token=386bbbe1ca2cdd2af9e4967bade88dd1" class="favorite">
              <img alt="Like"   src="https://ma.twimg.com/twitter-mobile/3d969fa99f67efe5d80be77d08a716718f92bfea/images/sprites/tweet_heart.gif">
        </a>
        <a href="/PrinceSalad1/status/710145884009467904/actions" class="last"></a>
      </span>
        </td>
    </tr>
</table>

Of specific interest is

            <div class="tweet-text" data-id="710145884009467904">
                <div class="dir-ltr" dir="ltr"> <span class="twitter-hit-highlight">Tyrion</span> <span class="twitter-hit-highlight">Lannister</span> is the smartest man I have ever seen on TV 😭
                </div>
            </div>

It contains:
Unique Tweet ID: 710145884009467904
Tweet Text: "Tyrion Lannister is the smartest man I have ever seen on TV 😭"

3. Iterate

At the end of the page is a link to older tweets:

        <a href="/search?q=tyrion%20lannister&amp;s=typd&amp;next_cursor=TWEET-706813915489898497-710173609021599744-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAgAQAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAA"> Load older Tweets </a>

These links seem to be valid forever. We can use them to iterate.
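Pulling that link out of a fetched page could be sketched like this. It assumes the markup stays exactly as shown above; the function name is hypothetical.

```javascript
// Returns the path of the "Load older Tweets" link, or null on the last page.
function nextPagePath(html) {
    var pattern = /<a href="(\/search\?[^"]*next_cursor=[^"]*)">\s*Load older Tweets\s*<\/a>/;
    var match = pattern.exec(html);
    // The href is HTML-escaped, so turn "&amp;" back into "&".
    return match ? match[1].replace(/&amp;/g, '&') : null;
}
```

The returned path is relative, so it has to be appended to `https://mobile.twitter.com` before the next request.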

Things to figure out

~~Is there request limiting?~~ Nope!
~~How to easily parse HTML?~~ Parse?! RegEx!
How to clear tweet text from HTML tags?
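For the extraction step and the open question about HTML tags, a regex-based sketch could look like the following. It assumes the `tweet-text` markup shown above and only decodes a handful of entities; it is brittle by design and all function names are hypothetical.

```javascript
// Decode the handful of HTML entities this sketch cares about.
function decodeEntities(text) {
    return text
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'")
        .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

// Remove tags, decode entities and collapse whitespace.
function stripTags(html) {
    return decodeEntities(html.replace(/<[^>]*>/g, ''))
        .replace(/\s+/g, ' ')
        .trim();
}

// Returns [{ id, text }] for every tweet-text block in the page.
// The id stays a string: it exceeds Number.MAX_SAFE_INTEGER.
function extractTweets(html) {
    var pattern = /<div class="tweet-text" data-id="(\d+)">\s*<div class="dir-ltr" dir="ltr">([\s\S]*?)<\/div>/g;
    var tweets = [];
    var match;
    while ((match = pattern.exec(html)) !== null) {
        tweets.push({ id: match[1], text: stripTags(match[2]) });
    }
    return tweets;
}
```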

Need for recrawl after host replacement?

Just tried replacing the host in our config files since the bruck db has now migrated to "api.got.show". After starting the crawler, it gave me a MongoError E11000 for duplicate keys, referring to the character slug. It works after switching to a new db and I'm currently recrawling - Just wanted to ask if that's necessary or if there's a workaround which allows us to keep the old db.

Special characters in character slug

Stumbled upon this guy in our dataset and noticed the route does not work with special characters like ' at the moment.

http://localhost:1337/Jaqen_H'ghar
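The cause is not confirmed here, but note that `encodeURIComponent` leaves `'` untouched, so the apostrophe reaches the router as-is. If the route pattern is what rejects it, percent-encoding the slug more strictly when building links may help. A sketch, with a hypothetical helper name:

```javascript
// Percent-encode a slug, including the characters ! ' ( ) *
// that encodeURIComponent leaves as-is.
function encodeSlug(slug) {
    return encodeURIComponent(slug).replace(/[!'()*]/g, function (c) {
        return '%' + c.charCodeAt(0).toString(16).toUpperCase();
    });
}
```

On the server side, `decodeURIComponent` turns the encoded slug back into the original name.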

Need for a menu / legend?

Will we need a user menu for our application? My guess right now is no, because looking at the current website, our tool will only be shown on the character pages, so we don't even need navigation.

A legend explaining the chart elements definitely seems necessary though. Could be integrated as a pop up box or just plain Text below our Tool on the Character page (definitely easiest).

Any thoughts on this? How is Group 5 handling this @kajo404 @jonny3576 @Logarythms ?

[crawler] random hangs

The crawler currently hangs at seemingly random times after running for a long while, without throwing any exceptions. To be debugged...

[crawler/mobile] crawl sometimes fails due to server errors

Sometimes the Twitter Servers return 503 (Service Unavailable) errors.
The Mobile Crawler then fails:
[crawler][ERROR] FAILED MCRWL Tommen Baratheon { status: 503, data: '' }
I'm not sure whether the Mobile Search site or the REST API returns these errors. The Crawler should retry after a few seconds in such a case.

b676261#diff-726c97f0fff6e3231157be02c25032f2R66 in #45 already contains some code which should handle errors returned by the API but the crawler still fails sometimes.
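A retry wrapper along these lines might cover the remaining cases. This is a sketch under assumptions: `request` stands for any function returning a Promise that resolves to `{ status, data }`, which may not match how the crawler's HTTP client actually reports errors.

```javascript
// Calls `request` and retries on 503, doubling the delay each time.
// Resolves with the last response once it is not a 503
// or once `retries` is used up.
function retryOn503(request, retries, delayMs) {
    return request().then(function (res) {
        if (res.status !== 503 || retries <= 0) return res;
        return new Promise(function (resolve) {
            setTimeout(resolve, delayMs);
        }).then(function () {
            return retryOn503(request, retries - 1, delayMs * 2);
        });
    });
}
```

If the client rejects the Promise on a 503 instead of resolving, the same check would have to move into a rejection handler.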

Visualization

TODO

Tool / Results explanation on About page, or: More sophisticated Legend with foreignObject

Error when starting

@CavidSalahov
I still get this error, I think it comes from your plugin. The config file is fully filled.
[main][ERROR] { status: 503, data: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n<p>Additionally, a 503 Service Unavailable\nerror was encountered while trying to use an ErrorDocument to handle the request.</p>\n</body></html>\n' }
[main][ERROR] { status: 503, data: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n<p>Additionally, a 503 Service Unavailable\nerror was encountered while trying to use an ErrorDocument to handle the request.</p>\n</body></html>\n' }
After that message:
[core/db][INFO] connected.

[visualization] change data on zoom

The CSV "API" provides data for 3 zoom levels: year (contains data per day), month (per hour), day (per minute).

When reaching a certain zoom level the more specific data should be loaded and the graph replaced.
Maybe it is also necessary to load 2 files and append them when e.g. parts of 2 months are currently visible.
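The selection logic could be sketched as below. The thresholds are illustrative guesses, and the function names and the `YYYY-MM` key format are assumptions rather than part of the CSV "API".

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Pick the CSV zoom level for the currently visible time span.
// The thresholds are illustrative guesses, not values from the project.
function zoomLevel(fromMs, toMs) {
    const span = toMs - fromMs;
    if (span <= 2 * DAY_MS) return 'day';     // per-minute data
    if (span <= 62 * DAY_MS) return 'month';  // per-hour data
    return 'year';                            // per-day data
}

// List the UTC months (YYYY-MM) touched by the visible range, so that two
// monthly files can be loaded and appended when the view straddles months.
function monthsInRange(fromMs, toMs) {
    const months = [];
    const cursor = new Date(fromMs);
    cursor.setUTCDate(1);
    cursor.setUTCHours(0, 0, 0, 0);
    while (cursor.getTime() <= toMs) {
        months.push(cursor.toISOString().slice(0, 7));
        cursor.setUTCMonth(cursor.getUTCMonth() + 1);
    }
    return months;
}
```

Calling both from the chart's zoom handler would tell it when to switch levels and which files to request.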

Jon Snow not entirely crawled

In the newest implementation, the mobile js crawler only crawled our most popular character Jon Snow post October 2015. Restarting didn't change anything. Did anybody experience the same?

Stock up db for main characters

Seems like the first full crawl of csv.got.show is completed and key characters are still missing. We have about 1.5 weeks of data for:

http://csv.got.show/Daenerys_Targaryen
http://csv.got.show/Tywin_Lannister
http://csv.got.show/Tyrion_Lannister
http://csv.got.show/Jon_Snow

I don't know where this is coming from suddenly, but I'm having the same issue on my machine at home. Those characters used to top my Most Discussed list during the first crawl and now we barely get data on them.

Is there anything we can do to squeeze the remaining tweets out of Twitter? Is it possible they stop us from accessing them because we crawled a lot of data from these terms? It feels super strange to me.

Remove bloat

With PR #14 we have the following (heavy) libraries:

D3.js
jQuery
jQuery UI
Modernizr (date form input)

It would be nice if we could remove bloat.

[Twitter] Classify remaining characters

Currently the crawler shows a list of "unclassified" characters at every start:

[crawler/blacklist][WARN] filtered unclassified: Willam
[crawler/blacklist][WARN] filtered unclassified: Willit
[crawler/blacklist][WARN] filtered unclassified: Willum
[crawler/blacklist][WARN] filtered unclassified: Wolmer
[crawler/blacklist][WARN] filtered unclassified: Wulfe
[crawler/blacklist][WARN] filtered unclassified: Wylla
...

Those are characters with names that are rather short and could possibly be "unsearchable" on Twitter.

Those remaining characters have to be classified into the whitelist and the blacklist (config.json[.sample]) like this:

Go to https://twitter.com/search?f=tweets&vertical=default&q="{charactername}"&src=sprv
If there are either not many tweets at all or at least half of the posts are about the GoT character, add it to the whitelist; otherwise add it to the blacklist.

Out of memory

After #81 we have a new problem: for some characters (e.g. Jon Snow, ~800,000 tweets) not all tweets fit into memory, which is currently necessary for the aggregation / generation of the CSV files:

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
[1]    34838 abort      node app

Any idea except working in chunks (which would complicate the aggregation a lot)?

@sacdallago @gyachdav
