old-pyvideo-data's Introduction

pyvideo

Join the chat at https://gitter.im/pyvideo/pyvideo

https://pyvideo.org is simply an index of Python-related media records. The raw data being used here comes out of the pyvideo/data repo.

Before opening a PR, please check out our Development Philosophy.

Development setup

Setting up a development environment is as simple as four easy steps.

  1. Clone repo (recursively; it contains submodules)
  2. Install dependencies
  3. Build reST files from JSON files
  4. Build HTML files from reST files

All of these steps are explained in detail below.

First, pull down this repo's code:

$ git clone --recursive https://github.com/pyvideo/pyvideo.git

Then install the dependencies for building this site. We recommend installing all the requirements inside a virtualenv and using virtualenvwrapper to manage virtualenvs. Building pyvideo.org requires Python 3.5.

First, create a virtual environment to install the dependencies into, either using virtualenvwrapper:

$ mkvirtualenv -p python3 pyvideo

... or using pyvenv:

$ pyvenv .env && source .env/bin/activate

From the root of the repo, run the following command:

$ pip install -r requirements/dev.in

Finally, you'll be able to generate the HTML site. From the root of the repo, run the following command:

$ make html

To view the site, run the following command:

$ make serve

This starts a development server on port 8000. Open http://localhost:8000 in your browser to view your local version of pyvideo.org!

Debugging

If you're trying to debug unexpected build results, you can pass one of two variables to the make process to influence the logging level:

# Show Pelican warnings
$ make VERBOSE=1 html

# Show even more output
$ make DEBUG=1 html

Accessibility tests

There are automated tests to ensure that none of the pages have significant accessibility problems; to run them:

  1. Download chromedriver and add it to your PATH environment variable (copy to /usr/local/bin, etc.)
  2. Run make test

Want to help?

We'd love the help! All feature ideas and bugs-to-be-fixed are listed in the issues associated with this repo. Please check there for ideas on how to contribute. Thanks!

If you want to contribute new media, please check the pyvideo/data repo and its contribution docs.

Found an issue?

If you've found an issue with the site or something that could be done better, please open an issue on GitHub.

Want our Google Analytics info?

PyVideo tries to be as open source as possible. We share the data that Google Analytics collects on request. Please feel free to send an email to [email protected] with the header "Google Analytics Access Request" if you would like access to this data. Please note that it may take a few weeks to be granted this access.

old-pyvideo-data's People

Contributors

codersquid, dhoffman34, logston, markush, redapple, willkg, zerok

old-pyvideo-data's Issues

validate reST in summary and description

The validator should be enhanced so that it can verify that the summary and description are valid restructured text.

@codersquid pointed out that invalid restructured text should be a WARNING and not an ERROR--it should be something we can make better over time, but should never block a PR from landing.

I think it's probably safe to assume that "raises an error when parsed as restructured text" is a sufficient proxy for "invalid restructured text" for now.

At some point in the future, we might also want to traverse the resulting tree of nodes for nodes we're not allowing. As a contrived example, maybe we have a rule against footnotes, so any footnote blocks would raise a warning, too. I think we should push any thinking about this off into the distant future when we decide the kinds of restructured text bits we've got are an issue or there's some additional facets we need to validate.

For now, let's just go with "does it kick up errors when being parsed"? If it does kick up errors, this is a WARNING and we should print out the error details.
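A minimal sketch of that policy, assuming a hypothetical `parse` callable that raises an exception when the text is not valid restructured text (in practice it could wrap whatever reST parser the validator ends up using). The name `check_rest_fields` and the warning format are illustrative and not part of the existing validator.

```python
def check_rest_fields(video, parse, fields=("summary", "description")):
    """Return a list of WARNING strings; never raise, so a PR is never blocked."""
    warnings = []
    for field in fields:
        text = video.get(field)
        if not text:
            continue  # missing or empty fields are not a reST problem
        try:
            parse(text)
        except Exception as exc:  # any parse failure counts as "invalid reST"
            warnings.append(f"WARNING: {field}: {exc}")
    return warnings
```

Because the function only collects messages, the caller decides how to print them and can keep the exit status at zero.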

markdown or reST

Currently summary and description data is formatted with "markdown". We're using very basic elements that should work across the various markdown flavors, but it's probably the case that we'll hit one that's icky at some point.

When we converted from HTML to markdown in pyvideo a few years ago, we did so because more people were familiar and enthusiastic about editing in markdown than in reST. We've been using markdown ever since.

This issue covers figuring out whether to switch to reST for summaries and descriptions.

If we decide to switch, we should create two new issues: one to enhance the validator to validate reST in summary and description and one to convert from markdown to reST.

If we decide not to switch, we should figure out how to validate markdown in summary and description.

Current summary (March 21st, 2016):

markdown

pros:

  • easy to use for our use case (lists, headers and occasionally links)
  • generally people are familiar with markdown or at least are less fearful of markdown given its ubiquity
  • we're using it now and have been for some time and haven't had any issues

cons:

  • there are multiple markdown flavors, though I'm not sure this is an issue for our data set since we're primarily using lists, headers and occasionally links
  • poorer tooling because of multiple markdown flavors?

restructured text

pros:

  • easy to use for our use case (lists, headers and occasionally links)
  • the tools are great and we can output to multiple formats (we've never had a need for this so far)
  • better integration with the kinds of static site generators that might be used for the next pyvideo.org?
  • very extensible (we've never needed to extend the markup)
  • everyone who's commented in the markdown vs. reST issue is pro-reST

cons:

  • seems like there's more fear about restructured text vs. markdown which might lead to fewer people helping (I think this would be hard to measure and I'm not sure we'd notice if we're affected by it)

convert summary and description from markdown to reST

Per issue #22, we decided to switch from Markdown to Restructured Text.

This issue covers going through all the existing videos and converting the summary and description fields.

When we converted from HTML to Markdown a while back, we used pandoc. It's probably sufficient here, too. I don't think we need to worry about the fractured landscape of Markdown standards because we're predominantly using lists, bold, italics, headers and links.

For summaries and descriptions that pandoc throws an error on, we can just ignore those for now. We'll re-discover which ones are problematic later when we write a validator.

I think the next step is to write a script that:

  1. takes a conference directory as an argument
  2. opens all the JSON files in that directory (use the clive library for this)
  3. then for each JSON file, converts the summary and description
  4. saves all the JSON files back to disk (use the clive library for this)
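The steps above could be sketched roughly as follows. This is an illustration under assumptions: it reads and writes files with the standard `json` module instead of clive (whose API is not shown here), it assumes every JSON file in the directory is a video file, and the function names are hypothetical. Fields that pandoc fails on are left untouched, as described above.

```python
import json
import subprocess
from pathlib import Path

FIELDS = ("summary", "description")


def pandoc_md_to_rst(text):
    """Convert Markdown to reST by piping the text through the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from=markdown", "--to=rst"],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def convert_video(data, convert=pandoc_md_to_rst):
    """Return a copy of one video dict with FIELDS converted; skip failures."""
    converted = dict(data)
    for field in FIELDS:
        if converted.get(field):
            try:
                converted[field] = convert(converted[field])
            except subprocess.CalledProcessError:
                pass  # ignore for now; a later validator will flag it
    return converted


def convert_directory(conference_dir, convert=pandoc_md_to_rst):
    """Rewrite every JSON file in a conference directory in place."""
    for path in sorted(Path(conference_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        text = json.dumps(convert_video(data, convert), indent=2, ensure_ascii=False)
        path.write_text(text + "\n", encoding="utf-8")
```

Passing the converter in as an argument makes it possible to try the script on one directory with a stub before running pandoc over the whole data set.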

write validation script

We need a validation script that goes through all the files and validates the contents. Validation involves:

  1. verifying data types and shapes
  2. validating urls
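As a rough sketch of those two checks, using only the standard library: the field names and types in `VIDEO_SHAPE` are assumptions made for illustration, not the project's actual schema, and "valid URL" here only means "has an http(s) scheme and a host".

```python
from urllib.parse import urlparse

# Hypothetical required fields and their types; the real schema lives elsewhere.
VIDEO_SHAPE = {"title": str, "speakers": list, "source_url": str}


def is_valid_url(value):
    """Accept a URL if it has an http(s) scheme and a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_video(data):
    """Return a list of error strings for one video dict (empty if clean)."""
    errors = []
    for key, expected in VIDEO_SHAPE.items():
        if key not in data:
            errors.append(f"{key}: missing")
        elif not isinstance(data[key], expected):
            found = type(data[key]).__name__
            errors.append(f"{key}: expected {expected.__name__}, got {found}")
    # Check every string field whose name ends in _url, known or not.
    for key, value in data.items():
        if key.endswith("_url") and isinstance(value, str) and not is_valid_url(value):
            errors.append(f"{key}: not a valid URL: {value!r}")
    return errors
```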

@codersquid pointed out we could lift this code from steve. Maybe if there are other things we want to lift from steve we should just rework steve to work specifically on pyvideo-data and use that.

recover related_urls data

In pyvideo.org we have related_urls in the video scheme, but not in the data here. related_urls were added manually, by me, and there aren't that many of them. Typically I'd add a link to slides, ipython notebooks, repos. I don't know the utility of that field, so maybe we can drop it? Close this issue if that's the disposition.

If it is not, someone could write a script based on existing scripts that goes through and extracts the related_urls from the existing pyvideo.org site to make a PR. If this is done, I'd like someone to do a trial run, then have a review of the script. And then, we can decide whether it makes sense to have a big PR and merge everything all at once, or whether it is better to do some PRs with manual oversight.

write up documentation

We want a README that states what this repo is about.

We want a CONTRIBUTING document that specifies how to add new data, how to edit data, conventions.

We also want a LICENSE file. Though I have no idea if I can establish copyright ownership of the data in order to license it. I think I might just do it with some explanation and put it under a CC license.

solidify and document JSON schemas

We have two kinds of things we're tracking now: conferences (category.json) and videos (slug.json).

There are tentative schemas for both of these that are tied into the validation code here:

https://github.com/pyvideo/pyvideo-data/blob/8b8e19e70300c25e742a931d9190f76d10daca17/src/clive/validate.py#L155

This issue covers:

  1. discussing those schemas on the mailing list and establishing a version 1 of those schemas
  2. figuring out whether we want to do schema versioning at this time
  3. figuring out how to document the schemas because clearly using the code as documentation is pretty meh

document how to make a reference to this data

If this data is ever used in a research project, I think the researcher needs a way to refer to a specific version of this data.

This issue covers talking to someone who does that work and establishing whether there are conventions for this sort of thing already and/or what we need to facilitate proper references.

pycon-apac-2015 has no videos

There are no videos in the pycon-apac-2015 directory. I'm not sure why offhand.

This issue covers getting the video data for that conference.

version schemas

We have a schema right now. We need to add the infrastructure to allow for specifying multiple versions of a schema and having the validator correctly handle versions.

That's not super hard. I've got it in my head. I'll keep this in my queue until I either implement it or write down enough of what I'm thinking that I can pass it off to someone else.

[clive] save_json_data doesn't differentiate between category and video files

save_json_data throws things in an OrderedDict so as to maintain a stable ordering in json files. However, this has three problems:

  1. it doesn't differentiate between category and video files and thus does the wrong thing with category files and drops data
  2. it doesn't correctly handle nested containers
  3. it drops any data it doesn't know about

We should fix save_json_data so that it uses the schema for ordering of things.
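One possible shape for that fix, as a sketch: the key orders in `KEY_ORDER` are made up for illustration (the real ones would be derived from the schemas), and `order_keys` and `dump_json_data` are hypothetical names, not clive functions. Known keys are written in schema order, unknown keys are kept and appended in sorted order, and nested lists and dicts are passed through untouched.

```python
import json
from collections import OrderedDict

# Hypothetical key orders; the real ones would come from the schemas.
KEY_ORDER = {
    "category": ["title", "description", "url", "start_date"],
    "video": ["title", "speakers", "recorded", "summary", "description", "videos"],
}


def order_keys(data, key_order):
    """Known keys first in schema order, then unknown keys sorted; nothing is dropped."""
    known = [key for key in key_order if key in data]
    unknown = sorted(key for key in data if key not in key_order)
    return OrderedDict((key, data[key]) for key in known + unknown)


def dump_json_data(data, kind):
    """Serialize a 'category' or 'video' dict with a stable key order."""
    ordered = order_keys(data, KEY_ORDER[kind])
    return json.dumps(ordered, indent=2, ensure_ascii=False)
```

Selecting the order by `kind` addresses the category/video mix-up, and appending unknown keys instead of discarding them means a schema that lags behind the data cannot cause data loss.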

improve validator to catch duplicate slugs

I think we want slugs unique across the data set. I'm not sure whether slugs by themselves need to be unique or whether we should do category + slug.

This issue covers:

  1. figuring out whether slug or category + slug needs to be unique
  2. improving the validator to check for this
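Either choice is cheap to check. The sketch below assumes the validator can produce `(category, slug)` pairs for all videos; `find_duplicate_slugs` is a hypothetical helper, and the `per_category` flag switches between the two uniqueness rules under discussion.

```python
from collections import Counter


def find_duplicate_slugs(videos, per_category=True):
    """Return the sorted keys that occur more than once.

    videos: iterable of (category, slug) pairs.
    per_category=True checks category + slug; False checks the slug alone.
    """
    keys = [(category, slug) if per_category else slug for category, slug in videos]
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)
```

For example, two conferences that each have a `keynote` slug pass the category + slug rule but fail the slug-only rule.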

[clive] should it stay or should it go?

clive is the software that we're using to validate pyvideo data json files. I figured I'd throw it in the same repository for now since the two will grow at the same time and it's a lot less work if they're in the same repo.

At some point p, it'll probably be better if they were in separate repositories.

This issue covers figuring out how to figure out whether we're at that point p and then figure out what to do.

add all the existing data

@codersquid sent me her script to pull data from pyvideo. I'll write a script based on that to pull all the data and store it on disk. Then we'll check it all in.

video data should be in a videos list

pyvideo stored video urls in a denormalized way using fields like video_mp4_url, video_mp4_length, video_mp4_download_only or some silly thing like that. That's not wildly helpful and it doesn't allow for multiple mp4 links.

This issue covers changing that structure so instead we have a videos key which has a list of dicts of something like url, length and type. For now, we can use the format as the type. I don't know what length is (is it a file size? is it a duration of time? what?) -- maybe we can drop that. url is self-explanatory.
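A sketch of that restructuring, assuming the old fields follow the pattern `video_<format>_url`, `video_<format>_length` and `video_<format>_download_only` (the exact field names are not confirmed here). It drops the length and download-only fields, which is one possible reading of the discussion above, and `normalize_videos` is a hypothetical name.

```python
import re

_URL_FIELD = re.compile(r"^video_(?P<fmt>[a-z0-9]+)_url$")
_DROPPED_FIELD = re.compile(r"^video_[a-z0-9]+_(length|download_only)$")


def normalize_videos(data):
    """Move video_<format>_url fields into a 'videos' list of dicts."""
    result = {}
    videos = []
    for key, value in data.items():
        match = _URL_FIELD.match(key)
        if match:
            if value:  # skip empty URLs
                videos.append({"type": match.group("fmt"), "url": value})
        elif _DROPPED_FIELD.match(key):
            continue  # assumption: these fields are dropped
        else:
            result[key] = value
    result["videos"] = videos
    return result
```

Because `videos` is a list, a second mp4 link is just another entry with the same `type`.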

document SLA for data

We need to document our "service level agreement" for the data. Things like:

  1. we will never force-push to the repository
  2. all changes to the schema are announced at xyz
  3. we will produce reports on the state of the data
  4. data is licensed under CC0

Or something to that effect.

This issue covers coming up with the first SLA and documenting it in a publicly available place. Probably have an SLA.rst in the repo root and also have that exposed in the docs site.

docs

We need documentation on the following things:

  1. what is pyvideo-data?
  2. what infrastructure exists and how does it work?
  3. where does this community hang out?
  4. how does someone assemble data for a conference? tips, tricks, tools, ...

figure out language values

We're currently doing the language name in English as the value for the language field. That's not great. We should probably switch to a better standard. Maybe iso-639-1?
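If ISO 639-1 is chosen, the migration could be a simple lookup. The sketch below covers only a handful of languages as an illustration; which language names actually occur in the data is not known here, and `to_iso_639_1` is a hypothetical helper.

```python
# English language name -> ISO 639-1 two-letter code (illustrative subset).
LANGUAGE_CODES = {
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "portuguese": "pt",
    "italian": "it",
    "japanese": "ja",
    "chinese": "zh",
    "korean": "ko",
    "polish": "pl",
    "czech": "cs",
}


def to_iso_639_1(name):
    """Return the two-letter code, or None so unknown values can be reviewed by hand."""
    return LANGUAGE_CODES.get(name.strip().lower())
```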

close down pyvideo/pyvideo-data

The https://github.com/pytube/pyvideo-data fork is becoming the official fork and we're passing everything off to @logston.

This issue covers all the things we need to do to transition properly.

(Copying the todo items here.)

  • clear the PR queue (@willkg)
  • deprecate this repository
  • figure out how to reparent pytube/data repo in GitHub so that pull requests to the repo there don't automatically get made to here. It's a GitHub thing that will confuse people. (@logston)
  • write a blog post and link it on pyvideo.org (needs discussion) (@willkg)
  • in progress: transition the pyvideo.org domain to @logston or the PSF (@willkg, @codersquid)
  • figure out what to do with clive and any other bits that I half-did (needs a list of bits and discussion about what to do about them)
  • figure out what to do with the pyvideo donated rackspace account
  • in progress: move videos from rackspace CDN to archive.org (@logston)
  • announce the end of pyvideoorg twitter account (@willkg)

fix honza's name

Honza Král's name is spelled in two ways across our data:

  1. The correct way.
  2. The all-ascii way: Honza Kral.

We should fix his name in the places it's wrong.
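A small sketch of such a fix, assuming each video dict carries a `speakers` list of names (an assumption about the data layout); `NAME_FIXES` and `fix_speaker_names` are hypothetical.

```python
# Misspelling -> correct spelling.
NAME_FIXES = {"Honza Kral": "Honza Král"}


def fix_speaker_names(video, fixes=NAME_FIXES):
    """Return (video, changed); video is a new dict only if a name was fixed."""
    speakers = video.get("speakers", [])
    fixed = [fixes.get(name, name) for name in speakers]
    if fixed == speakers:
        return video, False
    return dict(video, speakers=fixed), True
```

The `changed` flag lets a wrapper script rewrite only the files that actually need it, which keeps the resulting diff small.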

mailing list

We need a mailing list in order to coordinate data curation and usage.

data quality

We're still bootstrapping this repository and focusing on things like can we validate data? what's the workflow for fixing small issues? what's the workflow for adding new data and fixing large issues? how do we do review? what's the licensing? how do we onboard new people? what's our "service level agreement" for this data in regards to what we will and won't change and how we change it? ...

That's great. I think that constitutes "phase 1".

Phase 2 is the sorts of things we want to do long term. Long term, we want the data to improve. In order to know what data needs fixing and how good it is now, we need to figure out what factors into data quality for our project and then probably build some kind of metrics/reporting system so that we can track that over time and also surface issues that need fixing.

This issue covers that at a really high level with the expectation that this issue will spawn a bunch of smaller work-product type issues.

DOC: Video content status.

I am sorry pyvideo is leaving. Thank you for all your hard work. I loved it with a passion because all our user conferences' videos went up so quickly.

Where is all the content going? If I'd like to watch PyGotham from 2015 where should I look after the site dies?
thank you!
Anne

what to do with lightning talks?

A long time ago, I wanted to surface the different topics in the lightning talk videos so it was possible to search for specific lightning talks and also click on a link which jumps to that specific talk in the video.

The "click on the link" thing worked partially at one point, but I'm pretty sure it broke a few times and I'm not sure it ever worked well. So that's a bummer.

Anyhow, the problem this created is that I encoded the information entirely in formatting markup. There's no way to extract that information programmatically plus it's kind of a mess.

This issue covers figuring out how to persist that information in a better way.

Should we track chapters in videos where a "chapter" consists of a timecode (h:m:s) and a topic? Maybe tags? What's the minimal amount of work we can do here that's maximally helpful?
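To make the "chapters" idea concrete, here is a sketch in which each chapter is stored as a timecode plus a topic. The entry layout (`timecode` and `topic` keys) and the names `Chapter`, `parse_timecode` and `parse_chapters` are assumptions for illustration only.

```python
from typing import NamedTuple


class Chapter(NamedTuple):
    seconds: int  # offset from the start of the video
    topic: str


def parse_timecode(timecode):
    """Convert 'h:m:s' (or 'm:s') into a number of seconds."""
    seconds = 0
    for part in timecode.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_chapters(entries):
    """entries: list of {'timecode': 'h:m:s', 'topic': str} dicts, in any order."""
    chapters = [Chapter(parse_timecode(e["timecode"]), e["topic"]) for e in entries]
    return sorted(chapters)
```

Storing the offset in a structured field makes the topics searchable and lets a site generator build a jump-to-time link without parsing markup.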

figure out a helpful convention for commit messages

The amount of data we're tracking here will increase over time and it's likely we'll be doing a lot of data fixes.

This issue covers figuring out a convention for commit messages to surface the important information regarding those fixes to the git history.

keys in JSON files should be sorted

The keys in the JSON files should be sorted so that we can keep them stable over time which will reduce review work.

We could sort them alphabetically, but it's easier to skim the raw data if the data follows a logical progression.

Instead, we should use an OrderedDict and then add the keyvals in a specified order and then json.dump the result. That'll order them in an order we like.
