
dnflow's People

Contributors

dchud, edsu


dnflow's Issues

START: RunFlow

It would be useful if a dataset were marked "complete", or something a bit more understandable than "START: RunFlow", when the collection is finished and the report is available.

FetchMedia error

I'm not sure how this happened, but it looks like get_block_size is throwing an error when n is zero or negative (outside math.log10's domain)?

Runtime error:
Traceback (most recent call last):
  File "/opt/docnow/lib/python3.5/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "/opt/docnow/lib/python3.5/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/docnow/dnflow/summarize.py", line 355, in run
    update_block_size = get_block_size(count, 5)
  File "/home/docnow/dnflow/summarize.py", line 72, in get_block_size
    return int(n / (math.ceil(math.log10(n)) * d))
ValueError: math domain error
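math.log10 raises this ValueError whenever the count is zero or negative, and a count of exactly 1 would instead divide by zero, since ceil(log10(1)) is 0. A minimal guard, sketched against the get_block_size shown in the traceback (the fallback value of 1 is an assumption, not dnflow's actual fix):

```python
import math

def get_block_size(n, d):
    # math.log10(n) is undefined for n <= 0, and ceil(log10(1)) == 0
    # would cause a ZeroDivisionError, so fall back to a block size of 1.
    if n <= 1:
        return 1
    return int(n / (math.ceil(math.log10(n)) * d))
```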

(screenshot: 2016-08-19, 4:21:05 PM)

image number zero

The image numbers in the similarity report are all zero. This was working fairly recently, so there must have been a breaking change.

(screenshot: 2016-08-19, 4:44:52 PM)

require login for viewing

Given the sensitive nature of some searches, and the fact that they can contain tweets from protected users you follow, the report should only show up when you are logged in.

hardcoded config var

In summarize.py: PopulateRedis.run() has a hard-coded reference to localhost when instantiating a Redis instance.
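One straightforward fix is to read the connection settings from the environment with localhost only as a default. This is a sketch — the variable names here are illustrative, not dnflow's actual config:

```python
import os

def redis_conn_kwargs():
    # Read connection settings from the environment instead of
    # hard-coding localhost; REDIS_HOST/REDIS_PORT are made-up names.
    return {
        'host': os.environ.get('REDIS_HOST', 'localhost'),
        'port': int(os.environ.get('REDIS_PORT', 6379)),
    }

# PopulateRedis.run() could then do something like:
#   r = redis.StrictRedis(**redis_conn_kwargs())
```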

id file output

It would be useful if the luigi workflow generated a tweet ID file and made it available in the report. You know, in case there was an app somewhere for, uh, hydrating them again? Just to get people thinking ya know?

(full disclosure: this was @bergisjules' idea)

normalize /api/ urls

Right now there are a variety of URLs for getting at data related to a search. I propose we normalize these under /api/. I know this is just a prototype, but it would be nice to have a bit more clarity.
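To make the proposal concrete, one shape the normalized URLs could take is a single builder that every view and template uses. The resource names here are made up, not dnflow's actual routes:

```python
def api_url(search_id, resource):
    """Build a normalized /api/ URL for data about a search.
    Resource names ('tweets', 'images', ...) are illustrative."""
    return '/api/searches/%s/%s' % (search_id, resource)
```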

delete search

It would be useful to be able to delete searches that you have created.

filter search list

Bergis and a few other people expected to only see their own searches. We were thinking this should be the default behavior. But perhaps it could be a configuration option, depending on the preference of the person running it?

login

It could be useful to demonstrate people logging in and doing data collection as themselves. If they log in via Twitter then their keys can be used to do the data collection, and the homepage can indicate who is doing what. I think it could be useful to get people to consider the optics of what they are doing, and some of the ethical dimensions of the work.

This isn't essential for this prototype's purpose in the STL meeting, but if it's not too difficult to do it might be worth it.

Two possible contenders for adding the functionality:

/job insecure

Right now anyone can POST to /job and update the database. If we are going to continue along this path of having Luigi tasks update the app database using the webapp then we'll need to figure out a way to make it secure.
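Since both the luigi tasks and the webapp are under our control, one lightweight option would be a shared-secret HMAC over the POST body, checked by the /job view. This is a sketch, not tied to dnflow's actual code; the secret and header name are placeholders:

```python
import hashlib
import hmac

SECRET = b'change-me'  # shared between the luigi tasks and the webapp

def sign(body: bytes) -> str:
    # The worker would send this, e.g. in an X-DNFLOW-SIGNATURE header.
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature: str) -> bool:
    # The /job view rejects the POST unless the signatures match;
    # compare_digest avoids timing side channels.
    return hmac.compare_digest(sign(body), signature)
```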

link webrecorder

Link URLs to Webrecorder to get people thinking about the web archiving function.

memory usage

I was attempting to collect 50k tweets and my ec2 micro instance fell over. When I was able to log in I noticed that redis had died because it didn't have enough memory to write its snapshot. This then caused the rq workers to fail because they couldn't talk to redis.
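Capping Redis's memory use and not letting a failed background save take down every writer would at least keep the rq workers alive. A redis.conf sketch — the values are guesses for a micro instance, not tested settings:

```
# redis.conf — illustrative values for a small instance
maxmemory 256mb
maxmemory-policy allkeys-lru
# don't refuse all writes just because a background save failed
stop-writes-on-bgsave-error no
```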

warcs for images

This is a placeholder for a discussion I had with @ikreymer at our first meeting in St Louis. Ilya asked how media files are downloaded from the web, and I told him we came up with our own way of storing the downloaded images on the local filesystem, using the URL that Twitter had assigned to the uploaded file. Ilya asked if we had considered storing the images in a WARC file, which would preserve the data along with where the data came from.

Currently the images can only come from one place, http://pbs.twimg.com, because we're only looking at images that are uploaded to Twitter. But if we start pulling images from other places such as Instagram, Flickr, etc., it might be useful to think about how recording the data in WARCs could help. I think it will be particularly useful when transferring data out of DocNow and into something else.
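The core of Ilya's point is that a WARC record carries the payload together with its source URL and retrieval date. The helper below hand-writes a deliberately minimal record just to illustrate that framing — it is not fully spec-compliant (a real record needs more headers, like WARC-Record-ID), and a real implementation would use a library such as warcio:

```python
from datetime import datetime, timezone

def minimal_warc_record(url, payload: bytes) -> bytes:
    """Build a stripped-down WARC 1.0 resource record: the payload
    plus where and when it came from. Illustration only."""
    headers = [
        'WARC/1.0',
        'WARC-Type: resource',
        'WARC-Target-URI: %s' % url,
        'WARC-Date: %s' % datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
        'Content-Length: %d' % len(payload),
    ]
    # Header block, blank line, payload, then the record separator.
    return '\r\n'.join(headers).encode() + b'\r\n\r\n' + payload + b'\r\n\r\n'
```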
