pyvideo / old-pyvideo-data

DEPRECATED: Video data for Python related conferences

License: Other

Python 91.44% Shell 4.34% Makefile 4.22%

old-pyvideo-data's Introduction

pyvideo

https://pyvideo.org is simply an index of Python-related media records. The raw data being used here comes out of the pyvideo/data repo.

Before opening a PR, please check out our Development Philosophy.

Development setup

Setting up a development environment is as simple as four easy steps.

Clone repo (recursively; it contains submodules)
Install dependencies
Build reST files from JSON files
Build HTML files from reST files

All of these steps are explained in detail below.

First, pull down this repo's code:

$ git clone --recursive https://github.com/pyvideo/pyvideo.git

Then install the dependencies for building this site. We recommend installing all the requirements inside a virtualenv and using virtualenvwrapper to manage your virtualenvs. Building pyvideo.org requires Python 3.5.

First, create a virtual environment to install the dependencies into, either using virtualenvwrapper:

$ mkvirtualenv -p python3 pyvideo

... or using pyvenv:

$ pyvenv .env && source .env/bin/activate

From the root of the repo, run the following command:

$ pip install -r requirements/dev.in

Finally, you'll be able to generate the HTML site. From the root of the repo, run the following command:

$ make html

To view the site, run the following command:

$ make serve

This will start a development server on port 8000. Open http://localhost:8000 in your browser to view your local version of pyvideo.org!

Debugging

If you're trying to debug unexpected build results, you can pass one of two variables to the make process to change the logging level:

# Show Pelican warnings
$ make VERBOSE=1 html

# Show even more output
$ make DEBUG=1 html

Accessibility tests

There are automated tests to ensure that none of the pages have significant accessibility problems; to run them:

Download chromedriver and add it to your PATH environment variable (copy to /usr/local/bin, etc.)
Run make test

Want to help?

We'd love the help! All feature ideas and bugs-to-be-fixed are listed in the issues associated with this repo. Please check there for ideas on how to contribute. Thanks!

If you want to contribute new media, please check the pyvideo/data repo and its contribution docs.

Found an issue?

If you've found an issue with the site or something that could be done better, please open an issue on GitHub.

Want our Google Analytics info?

PyVideo tries to be as open source as possible. We share the data that Google Analytics collects on request. Please feel free to send an email to [email protected] with the header "Google Analytics Access Request" if you would like access to this data. Please note that it may take a few weeks to be granted this access.

old-pyvideo-data's Issues

validate reST in summary and description

The validator should be enhanced so that it can verify that the summary and description are valid restructured text.

@codersquid pointed out that invalid restructured text should be a WARNING and not an ERROR--it should be something we can make better over time, but should never block a PR from landing.

I think it's probably safe to assume that "raises an error when parsed as restructured text" is a sufficient proxy for "invalid restructured text" for now.

At some point in the future, we might also want to traverse the resulting tree of nodes for nodes we're not allowing. As a contrived example, maybe we have a rule against footnotes, so any footnote blocks would raise a warning, too. I think we should push any thinking about this off into the distant future when we decide the kinds of restructured text bits we've got are an issue or there's some additional facets we need to validate.

For now, let's just go with "does it kick up errors when being parsed"? If it does kick up errors, this is a WARNING and we should print out the error details.
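A minimal sketch of the "does it kick up errors when being parsed" check, assuming the third-party docutils package is installed. The function names `rest_problems` and `validate_field` are made up for illustration and are not part of clive.

```python
import io


def rest_problems(text):
    """Parse text as reStructuredText and return the messages docutils reports."""
    # docutils is a third-party dependency; imported here to make that explicit.
    from docutils.core import publish_doctree

    stream = io.StringIO()
    publish_doctree(
        text,
        settings_overrides={
            "warning_stream": stream,  # capture messages instead of writing to stderr
            "report_level": 2,         # report WARNING and above
            "halt_level": 5,           # never raise; we only collect messages
        },
    )
    return [line for line in stream.getvalue().splitlines() if line.strip()]


def validate_field(name, text):
    """Print a WARNING (never an error) for each reST problem found in a field."""
    problems = rest_problems(text)
    for problem in problems:
        print(f"WARNING: {name}: {problem}")
    return problems
```

Because the function only prints and returns the messages, a caller can report them without failing the run, which matches the "WARNING, not ERROR" rule above.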

in the md to reST transition, we dropped "url" and "start_date" from category information

I was going through the PRs for the markdown to restructured text changeover and it looks like in most/all of them we dropped "url" and "start_date" from the category.json files.

Seems like a pain in the ass to fix each individual PR. Seems easier to fix this in a separate pass as a single PR after all the other PRs have landed.

markdown or reST

Currently summary and description data is formatted with "markdown". We're using very basic elements that should work across the various markdown flavors, but it's probably the case that we'll hit one that's icky at some point.

When we converted from HTML to markdown in pyvideo a few years ago, we did so because more people were familiar and enthusiastic about editing in markdown than in reST. We've been using markdown ever since.

This issue covers figuring out whether to switch to reST for summaries and descriptions.

If we decide to switch, we should create two new issues: one to enhance the validator to validate reST in summary and description and one to convert from markdown to reST.

If we decide not to switch, we should figure out how to validate markdown in summary and description.

Current summary (March 21st, 2016):

markdown

pros:

easy to use for our use case (lists, headers and occasionally links)
generally people are familiar with markdown or at least are less fearful of markdown given its ubiquity
we're using it now and have been for some time and haven't had any issues

cons:

there are multiple markdown flavors, though I'm not sure this is an issue for our data set since we're primarily using lists, headers and occasionally links
poorer tooling because of multiple markdown flavors?

restructured text

pros:

easy to use for our use case (lists, headers and occasionally links)
the tools are great and we can output to multiple formats (we've never had a need for this so far)
better integration with the kinds of static site generators that might be used for the next pyvideo.org?
very extensible (we've never needed to extend the markup)
everyone who's commented in the markdown vs. reST issue is pro-reST

cons:

seems like there's more fear about restructured text vs. markdown which might lead to fewer people helping (I think this would be hard to measure and I'm not sure we'd notice if we're affected by it)

convert summary and description from markdown to reST

Per issue #22, we decided to switch from Markdown to Restructured Text.

This issue covers going through all the existing videos and converting the summary and description fields.

When we converted from HTML to Markdown a while back, we used pandoc. It's probably sufficient here, too. I don't think we need to worry about the fractured landscape of Markdown standards because we're predominantly using lists, bold, italics, headers and links.

For summaries and descriptions that pandoc throws an error on, we can just ignore those for now. We'll re-discover which ones are problematic later when we write a validator.

I think the next step is to write a script that:

takes a conference directory as an argument
opens all the JSON files in that directory (use the clive library for this)
then for each JSON file, converts the summary and description
saves all the JSON files back to disk (use the clive library for this)
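The steps above could be sketched roughly like this. It assumes the pandoc CLI is on PATH and uses the plain json module instead of clive, since clive's API isn't shown here. The names `pandoc_md_to_rst` and `convert_conference` are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def pandoc_md_to_rst(text):
    """Convert Markdown to reST by piping it through the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from=markdown", "--to=rst"],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def convert_conference(conf_dir, convert=pandoc_md_to_rst):
    """Convert summary and description in every JSON file under conf_dir.

    Returns the paths that were skipped because the converter failed.
    """
    skipped = []
    for path in sorted(Path(conf_dir).rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            continue
        try:
            for field in ("summary", "description"):
                if data.get(field):
                    data[field] = convert(data[field])
        except subprocess.CalledProcessError:
            # Ignore for now; the validator will rediscover these later.
            skipped.append(path)
            continue
        path.write_text(
            json.dumps(data, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
    return skipped
```

Calling `convert_conference("data/some-conference")` would rewrite the files in place. The `convert` argument is there so the directory walk can be tried out with a stand-in function before pandoc is involved.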

write validation script

We need a validation script that goes through all the files and validates the contents. Validation involves:

verifying data types and shapes
validating urls

@codersquid pointed out we could lift this code from steve. Maybe if there are other things we want to lift from steve we should just rework steve to work specifically on pyvideo-data and use that.
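As a starting point for discussion, here is a small sketch of the two checks using only the standard library. The field list in `VIDEO_FIELDS` and `URL_FIELDS` is an assumption for illustration, not the actual schema, and this is not the code from steve.

```python
from urllib.parse import urlparse

# Hypothetical schema: field name -> expected Python type.
VIDEO_FIELDS = {
    "title": str,
    "summary": str,
    "description": str,
    "speakers": list,
    "source_url": str,
}
URL_FIELDS = ("source_url",)


def is_valid_url(value):
    """Accept only absolute http(s) URLs with a host part."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_video(data):
    """Return a list of human-readable problems; empty means the record passed."""
    errors = []
    for field, expected in VIDEO_FIELDS.items():
        if field not in data:
            errors.append(f"{field}: missing")
        elif not isinstance(data[field], expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    for field in URL_FIELDS:
        value = data.get(field)
        if isinstance(value, str) and not is_valid_url(value):
            errors.append(f"{field}: not a valid http(s) URL: {value!r}")
    return errors
```

This only checks that a URL is well formed; whether it should also check that the URL still resolves is a separate question.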

recover related_urls data

In pyvideo.org we have related_urls in the video scheme, but not in the data here. related_urls were added manually, by me, and there aren't that many of them. Typically I'd add a link to slides, ipython notebooks, repos. I don't know the utility of that field, so maybe we can drop it? Close this issue if that's the disposition.

If it is not, someone could write a script based on existing scripts that goes through and extracts the related_urls from the existing pyvideo.org site to make a PR. If this is done, I'd like someone to do a trial run and then have the script reviewed. Then we can decide whether it makes sense to have one big PR and merge everything at once, or whether it is better to do several PRs with manual oversight.

write up documentation

We want a README that states what this repo is about.

We want a CONTRIBUTING document that specifies how to add new data, how to edit data, conventions.

We also want a LICENSE file. Though I have no idea if I can establish copyright ownership of the data in order to license it. I think I might just do it with some explanation and put it under a CC license.

solidify and document JSON schemas

We have two kinds of things we're tracking now: conferences (category.json) and videos (slug.json).

There are tentative schemas for both of these that are tied into the validation code here:

https://github.com/pyvideo/pyvideo-data/blob/8b8e19e70300c25e742a931d9190f76d10daca17/src/clive/validate.py#L155

This issue covers:

discussing those schemas on the mailing list and establishing a version 1 of those schemas
figuring out whether we want to do schema versioning at this time
figuring out how to document the schemas because clearly using the code as documentation is pretty meh

document how to make a reference to this data

If this data is ever used in a research project, I think the researcher needs a way to refer to a specific version of this data.

This issue covers talking to someone who does that work and establishing whether there are already conventions for this sort of thing and/or what we need to do to facilitate proper references.

pycon-apac-2015 has no videos

There are no videos in the pycon-apac-2015 directory. I'm not sure why offhand.

This issue covers getting the video data for that conference.

version schemas

We have a schema right now. We need to add the infrastructure to allow for specifying multiple versions of a schema and having the validator correctly handle versions.
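One possible shape for this, written down as a sketch only: a registry keyed by version number, with the validator picking the schema from a field in the data. The `schema_version` key, the default of 1 and the fields in `SCHEMAS` are all assumptions, not decisions.

```python
# Hypothetical registry: version number -> {field name: expected type}.
SCHEMAS = {
    1: {"title": str, "speakers": list},
    2: {"title": str, "speakers": list, "language": str},
}
DEFAULT_VERSION = 1  # assumed for files that don't declare a version


def schema_for(data):
    """Return (version, schema) for a record, or raise ValueError if unknown."""
    version = data.get("schema_version", DEFAULT_VERSION)
    if version not in SCHEMAS:
        raise ValueError(f"unknown schema version: {version!r}")
    return version, SCHEMAS[version]


def validate(data):
    """Check a record against the schema version it declares."""
    version, schema = schema_for(data)
    return [
        f"v{version}: {field}: expected {expected.__name__}"
        for field, expected in schema.items()
        if not isinstance(data.get(field), expected)
    ]
```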

That's not super hard. I've got it in my head. I'll keep this in my queue until I either implement it or write down enough of what I'm thinking that I can pass it off to someone else.

fix description and speakers mentioned for pycon-ca-2012

Fix description and speakers mentioned by @redapple in #46

[clive] save_json_data doesn't differentiate between category and video files

save_json_data throws things in an OrderedDict so as to maintain a stable ordering in json files. However, this has three problems:

it doesn't differentiate between category and video files and thus does the wrong thing with category files and drops data
it doesn't correctly handle nested containers
it drops any data it doesn't know about

We should fix save_json_data so that it uses the schema for ordering of things.
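A rough sketch of what schema-driven ordering could look like. The key orders below are placeholders, and `order_keys` and `write_ordered_json` are hypothetical names, not clive's current functions. Keys the schema doesn't know about are kept (sorted, at the end) rather than dropped, and nested values are passed through untouched.

```python
import json
from collections import OrderedDict

# Hypothetical key orders; the real ones would come from the schemas.
VIDEO_KEY_ORDER = ["title", "speakers", "summary", "description", "videos"]
CATEGORY_KEY_ORDER = ["title", "url", "start_date", "description"]


def order_keys(data, key_order):
    """Return data with known keys in schema order, then unknown keys sorted."""
    ordered = OrderedDict()
    for key in key_order:
        if key in data:
            ordered[key] = data[key]
    for key in sorted(data):
        if key not in ordered:
            ordered[key] = data[key]  # keep data we don't know about
    return ordered


def write_ordered_json(path, data, key_order):
    """Write data to path as JSON using the given key order."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(order_keys(data, key_order), fp, indent=2, ensure_ascii=False)
        fp.write("\n")
```

The caller chooses `CATEGORY_KEY_ORDER` or `VIDEO_KEY_ORDER` depending on the file, which is one way to stop category files from being treated like video files.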

improve validator to catch duplicate slugs

I think we want slugs unique across the data set. I'm not sure whether slugs by themselves need to be unique or whether we should do category + slug.

This issue covers:

figuring out whether slug or category + slug needs to be unique
improving the validator to check for this
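Either answer leads to much the same check, so here is a sketch that supports both. It assumes the validator can hand over `(category, slug, path)` tuples; `find_duplicate_slugs` is a made-up name.

```python
from collections import defaultdict


def find_duplicate_slugs(videos, per_category=False):
    """Group file paths by slug (or by category + slug) and return the clashes.

    videos is an iterable of (category, slug, path) tuples.
    """
    seen = defaultdict(list)
    for category, slug, path in videos:
        key = (category, slug) if per_category else slug
        seen[key].append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}
```

Running it both ways over the existing data would also show how many clashes each rule produces, which might help with the first question.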

[clive] should it stay or should it go?

clive is the software that we're using to validate pyvideo data json files. I figured I'd throw it in the same repository for now since the two will grow at the same time and it's a lot less work if they're in the same repo.

At some point p, it'll probably be better if they were in separate repositories.

This issue covers figuring out how to figure out whether we're at that point p and then figure out what to do.

add docs to readthedocs

Set up the docs on readthedocs once we have docs.

add all the existing data

@codersquid sent me her script to pull data from pyvideo. I'll write a script based on that to pull all the data and store it on disk. Then we'll check it all in.

video data should be in a videos list

pyvideo stored video urls in a denormalized way using fields like video_mp4_url, video_mp4_length, video_mp4_download_only or some silly thing like that. That's not wildly helpful and it doesn't allow for multiple mp4 links.

This issue covers changing that structure so that we instead have a videos key holding a list of dicts with something like url, length and type. For now, we can use the format as the type. I don't know what length is (is it a file size? a duration of time?), so maybe we can drop that. url is self-explanatory.
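A sketch of the conversion, assuming the old keys follow a `video_<format>_url` / `video_<format>_length` pattern (the exact field names aren't confirmed above). It drops length, skips empty urls and leaves every other key alone. `normalize_videos` is a hypothetical name.

```python
import re

# Assumed naming pattern for the old denormalized fields.
OLD_VIDEO_KEY = re.compile(r"^video_([a-z0-9]+)_(url|length)$")


def normalize_videos(record):
    """Return a copy of record with video_<format>_url fields folded into a list."""
    result = {}
    videos = []
    for key, value in record.items():
        match = OLD_VIDEO_KEY.match(key)
        if match is None:
            result[key] = value  # unrelated field: keep as-is
        elif match.group(2) == "url" and value:
            videos.append({"type": match.group(1), "url": value})
        # length fields (and empty urls) are dropped
    result["videos"] = videos
    return result
```

Since `videos` is a list, two mp4 links for the same talk would just be two entries with the same type.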

document SLA for data

We need to document our "service level agreement" for the data. Things like:

we will never force-push to the repository
all changes to the schema are announced at xyz
we will produce reports on the state of the data
data is licensed under CC0

Or something to that effect.

This issue covers coming up with the first SLA and documenting it in a publicly available place. Probably have an SLA.rst in the repo root and also have that exposed in the docs site.

fix descriptions and speakers for djangocon-2009

Speakers and descriptions need to be cleaned up as mentioned by @redapple in #45.

djangocon 2014

Videos are here:

https://www.youtube.com/playlist?list=PLE7tQUdRKcybbNiuhLcc3h6WzmZGVBMr3

Conference site is here:

https://2014.djangocon.us/

docs

We need documentation on the following things:

what is pyvideo-data?
what infrastructure exists and how does it work?
where does this community hang out?
how does someone assemble data for a conference? tips, tricks, tools, ...

figure out language values

We're currently doing the language name in English as the value for the language field. That's not great. We should probably switch to a better standard. Maybe iso-639-1?
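If we did go with ISO 639-1, the migration could be a simple lookup. The table below is a small illustrative subset, not a survey of the values actually present in the data, and `to_iso_639_1` is a made-up name.

```python
# English language name (lowercased) -> ISO 639-1 code. Illustrative subset only.
LANGUAGE_CODES = {
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "portuguese": "pt",
    "italian": "it",
    "japanese": "ja",
    "chinese": "zh",
    "russian": "ru",
    "polish": "pl",
    "czech": "cs",
    "dutch": "nl",
    "korean": "ko",
}


def to_iso_639_1(name):
    """Map an English language name to its code; return None if it isn't known."""
    key = name.strip().lower()
    if key in LANGUAGE_CODES.values():
        return key  # already a code from the table
    return LANGUAGE_CODES.get(key)
```

Returning None for unknown names lets a migration script list the values that need a human to look at them instead of guessing.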

djangocon 2013

Videos are here:

https://www.youtube.com/playlist?list=PLtqtTJ4wP09YOFqm_lBCoQtmS6S0omW3J

Site doesn't seem to be working for reasons I'm not sure about, but seems like a bug on their side.

https://2013.djangocon.us/

close down pyvideo/pyvideo-data

The https://github.com/pytube/pyvideo-data fork is becoming the official fork and we're passing everything off to @logston.

This issue covers all the things we need to do to transition properly.

(Copying the todo items here.)

clear the PR queue (@willkg)
deprecate this repository
figure out how to reparent pytube/data repo in github so that pull requests to the repo there don't automatically get made to here. It's a github thing that will confuse people. (@logston)
write a blog post and link it on pyvideo.org (needs discussion) (@willkg)
in progress: transition the pyvideo.org domain to @logston or the PSF (@willkg, @codersquid)
figure out what to do with clive and any other bits that I half-did (needs a list of bits and discussion about what to do about them)
figure out what to do with the pyvideo donated rackspace account
in progress: move videos from rackspace CDN to archive.org (@logston)
announce the end of pyvideoorg twitter account (@willkg)

fix honza's name

Honza Král's name is spelled in two ways across our data:

The correct way.
The all-ascii way: Honza Kral.

We should fix his name in the places it's wrong.

add section to CONTRIBUTING.rst about contributing new data

Need some instructions in CONTRIBUTING.rst about how to contribute new conference data and edit existing conference data.

Maybe it should just be a link to docs that talk about tools to use?

[djangocon-2015] Extract speakers from title

See 40d9896 for motivation

DjangoCon EU 2010: fix titles, summaries, descriptions and speakers

Data in data/djangocon-eu-2010/videos/ is virtually empty: it has only video links, and the titles are just the video file names.

Talks information for the conference can be found:

mailing list

We need a mailing list in order to coordinate data curation and usage.

data quality

We're still bootstrapping this repository and focusing on things like can we validate data? what's the workflow for fixing small issues? what's the workflow for adding new data and fixing large issues? how do we do review? what's the licensing? how do we onboard new people? what's our "service level agreement" for this data in regards to what we will and won't change and how we change it? ...

That's great. I think that constitutes "phase 1".

Phase 2 is the sorts of things we want to do long term. Long term, we want the data to improve. In order to know what data needs fixing and how good it is now, we need to figure out what factors into data quality for our project and then probably build some kind of metrics/reporting system so that we can track that over time and also surface issues that need fixing.

This issue covers that at a really high level with the expectation that this issue will spawn a bunch of smaller work-product type issues.

DOC: Video content status.

I am sorry pyvideo is leaving. Thank you for all your hard work. I loved it with a passion because all our user conferences' videos went up so quickly.

Where is all the content going? If I'd like to watch PyGotham from 2015 where should I look after the site dies?
thank you!
Anne

what to do with lightning talks?

A long time ago, I wanted to surface the different topics in the lightning talk videos so it was possible to search for specific lightning talks and also click on a link which jumps to that specific talk in the video.

The "click on the link" thing worked partially at one point, but I'm pretty sure it broke a few times and I'm not sure it ever worked well. So that's a bummer.

Anyhow, the problem this created is that I encoded the information entirely in formatting markup. There's no way to extract that information programmatically plus it's kind of a mess.

This issue covers figuring out how to persist that information in a better way.

Should we track chapters in videos where a "chapter" consists of a timecode (h:m:s) and a topic? Maybe tags? What's the minimal amount of work we can do here that's maximally helpful?
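To make the "chapters" idea concrete, here is a sketch of one possible representation: a list of dicts with a start offset in seconds and a topic. The `start` and `topic` key names and both function names are assumptions for discussion.

```python
def parse_timecode(timecode):
    """Convert "h:m:s", "m:s" or "s" into a number of seconds."""
    parts = [int(part) for part in timecode.split(":")]
    if not 1 <= len(parts) <= 3 or any(part < 0 for part in parts):
        raise ValueError(f"bad timecode: {timecode!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def make_chapter(timecode, topic):
    """Build one chapter entry for a video's chapters list."""
    return {"start": parse_timecode(timecode), "topic": topic}
```

Storing the offset as seconds keeps it easy to sort and compare, and the h:m:s form can be regenerated for display.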

Instead, we should use an OrderedDict and then add the keyvals in a specified order and then json.dump the result. That'll order them in an order we like.