docnow / dnflow
A design prototype for DocNow to learn with
License: MIT License
It would be useful if a dataset was marked "complete", or something a bit more understandable than (START: RunFlow), when the collection is finished and the report is available.
It seems like quite a few URLs are broken in the top URLs report. For example, the top URL in this report is http://on.rare.us/2byp2gq, which is a 404. But looking at the underlying Twitter data, the URL seems to be a little different: http://on.rare.us/2byp2Gq -- perhaps there's some downcasing going on that is breaking case-sensitive URLs?
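Shortener paths like the `2byp2Gq` above are case sensitive, so lowercasing the whole URL can turn a working link into a 404. A minimal sketch of a safer normalization that lowercases only the scheme and host (the case-insensitive parts) and leaves the path alone -- `normalize_url` is a hypothetical helper, not a function dnflow defines:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Only the scheme and host of a URL are case-insensitive; the path
    # of a shortener link like http://on.rare.us/2byp2Gq is not, so
    # leave the path, query, and fragment untouched.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))
```

For example, `normalize_url("HTTP://ON.RARE.US/2byp2Gq")` yields `http://on.rare.us/2byp2Gq`, keeping the shortener code intact.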
I'm not sure how this happened, but it looks like get_block_size is throwing an error sometimes when n is zero or negative?
Runtime error:

Traceback (most recent call last):
  File "/opt/docnow/lib/python3.5/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "/opt/docnow/lib/python3.5/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/docnow/dnflow/summarize.py", line 355, in run
    update_block_size = get_block_size(count, 5)
  File "/home/docnow/dnflow/summarize.py", line 72, in get_block_size
    return int(n / (math.ceil(math.log10(n)) * d))
ValueError: math domain error
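math.log10 raises `ValueError: math domain error` for n <= 0, and log10(1) == 0 would make the divisor zero. One defensive rewrite -- a sketch, not necessarily the fix the maintainers would choose -- falls back to an update interval of 1 for empty or tiny counts:

```python
import math

def get_block_size(n, d):
    """Return an update interval for reporting progress over n items.

    Guard against n <= 0 (math.log10 raises ValueError there) and
    n == 1 (log10(1) == 0 would divide by zero).
    """
    if n <= 1:
        return 1
    return int(n / (math.ceil(math.log10(n)) * d))
```

With this guard, `get_block_size(100, 5)` still returns 10 as before, while `get_block_size(0, 5)` returns 1 instead of raising.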
Given the sensitive nature of some searches, and the fact that they can contain tweets from protected users you follow, the report should only show up when you are logged in.
In summarize.py, PopulateRedis.run() has a hard-coded reference to localhost when instantiating a Redis instance.
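One way to lift the hard-coded host out of the code is to read it from the environment with localhost as the fallback. A sketch -- `REDIS_HOST`/`REDIS_PORT` are assumed variable names here, not settings dnflow currently reads:

```python
import os

def redis_settings():
    # Hypothetical helper: pull the Redis location from the environment
    # so deployments can override it without editing summarize.py.
    return {
        'host': os.environ.get('REDIS_HOST', 'localhost'),
        'port': int(os.environ.get('REDIS_PORT', '6379')),
    }
```

PopulateRedis.run() could then do `redis.StrictRedis(**redis_settings())` instead of naming localhost directly.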
It would be useful if the luigi workflow generated a tweet ID file and made it available in the report. You know, in case there was an app somewhere for, uh, hydrating them again? Just to get people thinking ya know?
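A minimal sketch of generating that file, assuming the collected tweets are stored as line-oriented JSON with an `id_str` field -- `write_tweet_ids` is a hypothetical helper, not an existing dnflow task:

```python
import json

def write_tweet_ids(tweets_path, ids_path):
    # Derive a one-ID-per-line file from line-oriented tweet JSON,
    # suitable for later hydration with a tool like twarc.
    with open(tweets_path) as infile, open(ids_path, 'w') as outfile:
        for line in infile:
            tweet = json.loads(line)
            outfile.write(tweet['id_str'] + '\n')
```

A Luigi task wrapping this would just declare the tweets file as its input and the ID file as its output target.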
(full disclosure: this was @bergisjules idea)
Right now there are a variety of URLs for getting at data related to a search. I propose we normalize these under /api/. I know this is just a prototype, but it would be nice to have a bit more clarity.
Per @bergisjules' request, like with the user intent popups.
It would be useful to be able to see the number of tweets in the search on the home page.
and the media graph, and checksums
It would be useful to be able to delete searches that you have created.
Question from Silvia Gutiérrez on Twitter:
https://twitter.com/espejolento/status/771007720941957120
It looks like a search for #ReuniónPeñaTrump is only turning up 21 tweets whereas TAGS was able to get ~500:
Bergis and a few other people expected to only see their own searches. We were thinking this should be the default behavior. But perhaps it could be a configuration option, depending on the preference of the person running it?
It could be useful to demonstrate people logging in and doing data collection as themselves. If they login via Twitter then their keys can be used to do the data collection, and the homepage can indicate who is doing what. I think it could be useful to get people to consider the optics of what they are doing, and some of the ethical dimensions to the work.
This isn't essential for this prototype's purpose in the STL meeting, but if it's not too difficult to do, it might be worth it.
Two possible contenders for adding the functionality:
Right now anyone can POST to /job and update the database. If we are going to continue along this path of having Luigi tasks update the app database using the webapp then we'll need to figure out a way to make it secure.
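One simple option is a shared secret that the Luigi tasks send with each POST and the web app checks before touching the database. A sketch under assumed names (`DNFLOW_JOB_TOKEN` is not a setting dnflow defines, and the framework wiring is omitted):

```python
import hmac
import os

def job_post_allowed(request_token):
    # Compare the token sent with a POST to /job against a shared
    # secret configured for the web app. hmac.compare_digest avoids
    # leaking the secret through timing differences.
    expected = os.environ.get('DNFLOW_JOB_TOKEN', '')
    return bool(expected) and hmac.compare_digest(expected, request_token or '')
```

The /job view would call this with a token taken from a request header and return 403 when it fails; if no secret is configured, everything is rejected rather than silently allowed.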
Link URLs to Webrecorder to get people thinking about the web archiving function.
I was attempting to collect 50k tweets and my ec2 micro instance fell over. When I was able to log in I noticed that redis had died because it didn't have enough memory to write its snapshot. This then caused the rq workers to fail because they couldn't talk to redis.
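One mitigation to consider is capping Redis memory below what the instance can actually give it, so a big collection fails loudly instead of taking the box down. Illustrative redis.conf settings -- the values are guesses for a micro instance, not ones dnflow ships with:

```
# Cap Redis well below instance RAM so the bgsave fork has headroom
maxmemory 256mb
# Fail writes at the cap rather than evicting collected data
maxmemory-policy noeviction
# Surface background-save failures to clients immediately
stop-writes-on-bgsave-error yes
```

The rq workers would still see errors at the cap, but they would be explicit Redis errors rather than a dead server.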
This is a placeholder for a discussion I had with @ikreymer at our first meeting in St Louis. Ilya asked how media files are downloaded from the web, and I told him we came up with our own way of storing the downloaded images on the local filesystem, using the URL that Twitter had assigned to the uploaded file. Ilya asked if we had considered storing the images in a WARC file, which would preserve the data as well as where it came from.
Currently the images can only come from one place, http://pbs.twimg.com, because we're only looking at images that are uploaded to Twitter. But if we start pulling images from other places such as Instagram, Flickr, etc., it might be worth thinking about how recording the data in WARCs could help. I think it will be particularly useful when transferring data out of DocNow and into something else.