
dnflow's People

Contributors

dchud, edsu


dnflow's Issues

START: RunFlow

It would be useful if a dataset were marked "complete", or something a bit more understandable than "START: RunFlow", when the collection is finished and the report is available.

FetchMedia error

I'm not sure how this happened, but it looks like get_block_size is throwing an error when n is zero or negative (outside math.log10's domain)?

Runtime error:
Traceback (most recent call last):
  File "/opt/docnow/lib/python3.5/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "/opt/docnow/lib/python3.5/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/docnow/dnflow/summarize.py", line 355, in run
    update_block_size = get_block_size(count, 5)
  File "/home/docnow/dnflow/summarize.py", line 72, in get_block_size
    return int(n / (math.ceil(math.log10(n)) * d))
ValueError: math domain error
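math.log10 raises this ValueError whenever the count is zero or negative, and a count of exactly 1 would instead divide by zero, since ceil(log10(1)) is 0. A minimal guard, sketched against the get_block_size shown in the traceback (the fallback value of 1 is an assumption, not dnflow's actual fix):

```python
import math

def get_block_size(n, d):
    # math.log10(n) is undefined for n <= 0, and ceil(log10(1)) == 0
    # would cause a ZeroDivisionError, so fall back to a block size of 1.
    if n <= 1:
        return 1
    return int(n / (math.ceil(math.log10(n)) * d))
```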

(screenshot: 2016-08-19, 4:21:05 PM)

image number zero

The image numbers in the similarity report are all zero. This was working fairly recently, so there must have been a breaking change.

(screenshot: 2016-08-19, 4:44:52 PM)

require login for viewing

Given the sensitive nature of some searches, and the fact that they can contain tweets from protected users you follow, the report should only show up when you are logged in.

hardcoded config var

In summarize.py: PopulateRedis.run() has a hard-coded reference to localhost when instantiating a Redis instance.
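One straightforward fix is to read the connection settings from the environment with localhost only as a default. This is a sketch — the variable names here are illustrative, not dnflow's actual config:

```python
import os

def redis_conn_kwargs():
    # Read connection settings from the environment instead of
    # hard-coding localhost; REDIS_HOST/REDIS_PORT are made-up names.
    return {
        'host': os.environ.get('REDIS_HOST', 'localhost'),
        'port': int(os.environ.get('REDIS_PORT', 6379)),
    }

# PopulateRedis.run() could then do something like:
#   r = redis.StrictRedis(**redis_conn_kwargs())
```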

id file output

It would be useful if the luigi workflow generated a tweet ID file and made it available in the report. You know, in case there was an app somewhere for, uh, hydrating them again? Just to get people thinking ya know?

(full disclosure: this was @bergisjules' idea)

normalize /api/ urls

Right now there are a variety of URLs for getting at data related to a search. I propose we normalize these under /api/. I know this is just a prototype, but it would be nice to have a bit more clarity.
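To make the proposal concrete, one shape the normalized URLs could take is a single builder that every view and template uses. The resource names here are made up, not dnflow's actual routes:

```python
def api_url(search_id, resource):
    """Build a normalized /api/ URL for data about a search.
    Resource names ('tweets', 'images', ...) are illustrative."""
    return '/api/searches/%s/%s' % (search_id, resource)
```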

delete search

It would be useful to be able to delete searches that you have created.

filter search list

Bergis and a few other people expected to only see their own searches. We were thinking this should be the default behavior. But perhaps it could be a configuration option, depending on the preference of the person running it?

login

It could be useful to demonstrate people logging in and doing data collection as themselves. If they log in via Twitter then their keys can be used to do the data collection, and the homepage can indicate who is doing what. I think it could be useful to get people to consider the optics of what they are doing, and some of the ethical dimensions of the work.

This isn't essential for this prototype's purpose in the STL meeting, but if it's not too difficult to do it might be worth it.

Two possible contenders for adding the functionality:

/job insecure

Right now anyone can POST to /job and update the database. If we are going to continue along this path of having Luigi tasks update the app database using the webapp then we'll need to figure out a way to make it secure.
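Since both the luigi tasks and the webapp are under our control, one lightweight option would be a shared-secret HMAC over the POST body, checked by the /job view. This is a sketch, not tied to dnflow's actual code; the secret and header name are placeholders:

```python
import hashlib
import hmac

SECRET = b'change-me'  # shared between the luigi tasks and the webapp

def sign(body: bytes) -> str:
    # The worker would send this, e.g. in an X-DNFLOW-SIGNATURE header.
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature: str) -> bool:
    # The /job view rejects the POST unless the signatures match;
    # compare_digest avoids timing side channels.
    return hmac.compare_digest(sign(body), signature)
```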

link webrecorder

Link URLs to Webrecorder to get people thinking about the web archiving function.

memory usage

I was attempting to collect 50k tweets and my ec2 micro instance fell over. When I was able to log in I noticed that redis had died because it didn't have enough memory to write its snapshot. This then caused the rq workers to fail because they couldn't talk to redis.
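Capping Redis's memory use and not letting a failed background save take down every writer would at least keep the rq workers alive. A redis.conf sketch — the values are guesses for a micro instance, not tested settings:

```
# redis.conf — illustrative values for a small instance
maxmemory 256mb
maxmemory-policy allkeys-lru
# don't refuse all writes just because a background save failed
stop-writes-on-bgsave-error no
```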

warcs for images

This is a placeholder for a discussion I had with @ikreymer at our first meeting in St Louis. Ilya asked how media files are downloaded from the web, and I told him we came up with our own way of storing the downloaded images on the local filesystem, using the URL that Twitter had assigned to the uploaded file. Ilya asked if we had considered storing the images in a WARC file, which would preserve the data along with where the data came from.

Currently the images can only come from one place, http://pbs.twimg.com, because we're only looking at images that are uploaded to Twitter. But if we start pulling images from other places such as Instagram, Flickr, etc., it might be useful to think about how recording the data in WARCs could help. I think it will be particularly useful when transferring data out of DocNow and into something else.
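The core of Ilya's point is that a WARC record carries the payload together with its source URL and retrieval date. The helper below hand-writes a deliberately minimal record just to illustrate that framing — it is not fully spec-compliant (a real record needs more headers, like WARC-Record-ID), and a real implementation would use a library such as warcio:

```python
from datetime import datetime, timezone

def minimal_warc_record(url, payload: bytes) -> bytes:
    """Build a stripped-down WARC 1.0 resource record: the payload
    plus where and when it came from. Illustration only."""
    headers = [
        'WARC/1.0',
        'WARC-Type: resource',
        'WARC-Target-URI: %s' % url,
        'WARC-Date: %s' % datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
        'Content-Length: %d' % len(payload),
    ]
    # Header block, blank line, payload, then the record separator.
    return '\r\n'.join(headers).encode() + b'\r\n\r\n' + payload + b'\r\n\r\n'
```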
